linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH v4 00/19] Core scheduling v4
@ 2019-10-30 18:33 Vineeth Remanan Pillai
  2019-10-30 18:33 ` [RFC PATCH v4 01/19] stop_machine: Fix stop_cpus_in_progress ordering Vineeth Remanan Pillai
                   ` (21 more replies)
  0 siblings, 22 replies; 84+ messages in thread
From: Vineeth Remanan Pillai @ 2019-10-30 18:33 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: Vineeth Remanan Pillai, linux-kernel, Dario Faggioli, fweisbec,
	keescook, kerrnel, Phil Auld, Aaron Lu, Aubrey Li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

Fourth iteration of the Core-Scheduling feature.

This version aims mostly at addressing the vruntime comparison
issues seen in v3. The main issue in v3 was starvation of
interactive tasks when competing with CPU-intensive tasks; this is
now mitigated to a large extent.

We have tested and verified that incompatible processes are not
selected during scheduling. In terms of performance, the impact
depends on the workload:
- on CPU-intensive applications that use all the logical CPUs with
  SMT enabled, core scheduling performs better than nosmt.
- on mixed workloads with considerable I/O compared to CPU usage,
  nosmt seems to perform better than core scheduling.

v4 is rebased on top of 5.3.5 (dc073f193b70):
https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.3.5

Changes in v4
-------------
- Implement a core-wide min_vruntime for vruntime comparison of tasks
  across CPUs in a core (a rough sketch of the idea follows below).
- Fix a typo bug in setting the forced_idle cpu.
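
For readers new to the series, the idea behind the core-wide comparison is
roughly as sketched below (types and names here are made up for illustration;
the real code is in "sched/fair: core wide vruntime comparison"): raw vruntimes
are only meaningful relative to a baseline, so once the siblings' cfs runqueues
track a shared, core-wide min_vruntime, tasks on different SMT siblings can be
compared by their offset from that common baseline.

  /*
   * Rough sketch only; hypothetical types, not kernel code. Assumes
   * min_vruntime is maintained core-wide, i.e. common to all siblings.
   */
  struct toy_cfs_rq {
          unsigned long long min_vruntime;        /* core-wide baseline */
  };

  struct toy_task {
          unsigned long long vruntime;
          struct toy_cfs_rq *cfs_rq;
  };

  /* Return non-zero if 'a' should be preferred over 'b'. */
  static int toy_core_prio_less(const struct toy_task *a,
                                const struct toy_task *b)
  {
          long long va = (long long)(a->vruntime - a->cfs_rq->min_vruntime);
          long long vb = (long long)(b->vruntime - b->cfs_rq->min_vruntime);

          return va < vb;         /* smaller normalized vruntime runs first */
  }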

Changes in v3
-------------
- Fixes the issue of a sibling picking up an incompatible task
  - Aaron Lu
  - Vineeth Pillai
  - Julien Desfossez
- Fixes the issue of starving threads due to forced idle
  - Peter Zijlstra
- Fixes the refcounting issue when deleting a cgroup with tag
  - Julien Desfossez
- Fixes a crash during cpu offline/online with coresched enabled
  - Vineeth Pillai
- Fixes a comparison logic issue in sched_core_find
  - Aaron Lu

Changes in v2
-------------
- Fixes for a couple of NULL pointer dereference crashes
  - Subhra Mazumdar
  - Tim Chen
- Improves priority comparison logic for processes on different CPUs
  - Peter Zijlstra
  - Aaron Lu
- Fixes a hard lockup in rq locking
  - Vineeth Pillai
  - Julien Desfossez
- Fixes a performance issue seen on IO heavy workloads
  - Vineeth Pillai
  - Julien Desfossez
- Fix for 32bit build
  - Aubrey Li

TODO
----
- Decide on the API for exposing the feature to userland
- Investigate the source of the overhead even when no tasks are tagged:
  https://lkml.org/lkml/2019/10/29/242
- Investigate the performance scaling issue when we have a high number of
  tagged threads: https://lkml.org/lkml/2019/10/29/248
- Try to optimize the performance for IO-demanding applications:
  https://lkml.org/lkml/2019/10/29/261

---

Aaron Lu (3):
  sched/fair: wrapper for cfs_rq->min_vruntime
  sched/fair: core wide vruntime comparison
  sched/fair: Wake up forced idle siblings if needed

Peter Zijlstra (16):
  stop_machine: Fix stop_cpus_in_progress ordering
  sched: Fix kerneldoc comment for ia64_set_curr_task
  sched: Wrap rq::lock access
  sched/{rt,deadline}: Fix set_next_task vs pick_next_task
  sched: Add task_struct pointer to sched_class::set_curr_task
  sched/fair: Export newidle_balance()
  sched: Allow put_prev_task() to drop rq->lock
  sched: Rework pick_next_task() slow-path
  sched: Introduce sched_class::pick_task()
  sched: Core-wide rq->lock
  sched: Basic tracking of matching tasks
  sched: A quick and dirty cgroup tagging interface
  sched: Add core wide task selection and scheduling.
  sched/fair: Add a few assertions
  sched: Trivial forced-newidle balancer
  sched: Debug bits...

 include/linux/sched.h    |   9 +-
 kernel/Kconfig.preempt   |   6 +
 kernel/sched/core.c      | 847 +++++++++++++++++++++++++++++++++++++--
 kernel/sched/cpuacct.c   |  12 +-
 kernel/sched/deadline.c  |  99 +++--
 kernel/sched/debug.c     |   4 +-
 kernel/sched/fair.c      | 346 +++++++++++-----
 kernel/sched/idle.c      |  42 +-
 kernel/sched/pelt.h      |   2 +-
 kernel/sched/rt.c        |  96 ++---
 kernel/sched/sched.h     | 246 +++++++++---
 kernel/sched/stop_task.c |  35 +-
 kernel/sched/topology.c  |   4 +-
 kernel/stop_machine.c    |   2 +
 14 files changed, 1399 insertions(+), 351 deletions(-)

-- 
2.17.1


* [RFC PATCH v4 01/19] stop_machine: Fix stop_cpus_in_progress ordering
  2019-10-30 18:33 [RFC PATCH v4 00/19] Core scheduling v4 Vineeth Remanan Pillai
@ 2019-10-30 18:33 ` Vineeth Remanan Pillai
  2019-10-30 18:33 ` [RFC PATCH v4 02/19] sched: Fix kerneldoc comment for ia64_set_curr_task Vineeth Remanan Pillai
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 84+ messages in thread
From: Vineeth Remanan Pillai @ 2019-10-30 18:33 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, Dario Faggioli, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

From: Peter Zijlstra <peterz@infradead.org>

Make sure the entire for loop has stop_cpus_in_progress set.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/stop_machine.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index b4f83f7bdf86..c7031a22aa7b 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -383,6 +383,7 @@ static bool queue_stop_cpus_work(const struct cpumask *cpumask,
 	 */
 	preempt_disable();
 	stop_cpus_in_progress = true;
+	barrier();
 	for_each_cpu(cpu, cpumask) {
 		work = &per_cpu(cpu_stopper.stop_work, cpu);
 		work->fn = fn;
@@ -391,6 +392,7 @@ static bool queue_stop_cpus_work(const struct cpumask *cpumask,
 		if (cpu_stop_queue_work(cpu, work))
 			queued = true;
 	}
+	barrier();
 	stop_cpus_in_progress = false;
 	preempt_enable();
 
-- 
2.17.1


* [RFC PATCH v4 02/19] sched: Fix kerneldoc comment for ia64_set_curr_task
  2019-10-30 18:33 [RFC PATCH v4 00/19] Core scheduling v4 Vineeth Remanan Pillai
  2019-10-30 18:33 ` [RFC PATCH v4 01/19] stop_machine: Fix stop_cpus_in_progress ordering Vineeth Remanan Pillai
@ 2019-10-30 18:33 ` Vineeth Remanan Pillai
  2019-10-30 18:33 ` [RFC PATCH v4 03/19] sched: Wrap rq::lock access Vineeth Remanan Pillai
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 84+ messages in thread
From: Vineeth Remanan Pillai @ 2019-10-30 18:33 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, Dario Faggioli, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

From: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index df9f1fe5689b..8225ddba344e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6705,7 +6705,7 @@ struct task_struct *curr_task(int cpu)
 
 #ifdef CONFIG_IA64
 /**
- * set_curr_task - set the current task for a given CPU.
+ * ia64_set_curr_task - set the current task for a given CPU.
  * @cpu: the processor in question.
  * @p: the task pointer to set.
  *
-- 
2.17.1


* [RFC PATCH v4 03/19] sched: Wrap rq::lock access
  2019-10-30 18:33 [RFC PATCH v4 00/19] Core scheduling v4 Vineeth Remanan Pillai
  2019-10-30 18:33 ` [RFC PATCH v4 01/19] stop_machine: Fix stop_cpus_in_progress ordering Vineeth Remanan Pillai
  2019-10-30 18:33 ` [RFC PATCH v4 02/19] sched: Fix kerneldoc comment for ia64_set_curr_task Vineeth Remanan Pillai
@ 2019-10-30 18:33 ` Vineeth Remanan Pillai
  2019-10-30 18:33 ` [RFC PATCH v4 04/19] sched/{rt,deadline}: Fix set_next_task vs pick_next_task Vineeth Remanan Pillai
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 84+ messages in thread
From: Vineeth Remanan Pillai @ 2019-10-30 18:33 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, Dario Faggioli, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Vineeth Remanan Pillai

From: Peter Zijlstra <peterz@infradead.org>

In preparation for playing games with rq->lock, abstract the thing
using an accessor.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
---
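Note: rq_lockp() as introduced below simply returns the rq's own lock; the
value of the indirection is that a later patch in this series ("sched:
Core-wide rq->lock") can make it return a lock shared by all SMT siblings of
a core. A rough sketch of that direction, with hypothetical field names:

  /* Illustrative sketch only; the field names here are made up. */
  struct toy_rq {
          raw_spinlock_t   __lock;        /* per-cpu lock, as in this patch */
          struct toy_rq   *core;          /* leader rq of this SMT core     */
          unsigned int     core_enabled;  /* is core scheduling active?     */
  };

  static inline raw_spinlock_t *toy_rq_lockp(struct toy_rq *rq)
  {
          /* With core scheduling enabled, all siblings share one lock. */
          if (rq->core_enabled)
                  return &rq->core->__lock;

          return &rq->__lock;
  }
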
 kernel/sched/core.c     |  46 ++++++++--------
 kernel/sched/cpuacct.c  |  12 ++---
 kernel/sched/deadline.c |  18 +++----
 kernel/sched/debug.c    |   4 +-
 kernel/sched/fair.c     |  42 +++++++--------
 kernel/sched/idle.c     |   4 +-
 kernel/sched/pelt.h     |   2 +-
 kernel/sched/rt.c       |   8 +--
 kernel/sched/sched.h    | 113 +++++++++++++++++++++-------------------
 kernel/sched/topology.c |   4 +-
 10 files changed, 128 insertions(+), 125 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8225ddba344e..a9c93cf71f5f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -84,12 +84,12 @@ struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 
 	for (;;) {
 		rq = task_rq(p);
-		raw_spin_lock(&rq->lock);
+		raw_spin_lock(rq_lockp(rq));
 		if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
 			rq_pin_lock(rq, rf);
 			return rq;
 		}
-		raw_spin_unlock(&rq->lock);
+		raw_spin_unlock(rq_lockp(rq));
 
 		while (unlikely(task_on_rq_migrating(p)))
 			cpu_relax();
@@ -108,7 +108,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 	for (;;) {
 		raw_spin_lock_irqsave(&p->pi_lock, rf->flags);
 		rq = task_rq(p);
-		raw_spin_lock(&rq->lock);
+		raw_spin_lock(rq_lockp(rq));
 		/*
 		 *	move_queued_task()		task_rq_lock()
 		 *
@@ -130,7 +130,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 			rq_pin_lock(rq, rf);
 			return rq;
 		}
-		raw_spin_unlock(&rq->lock);
+		raw_spin_unlock(rq_lockp(rq));
 		raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
 
 		while (unlikely(task_on_rq_migrating(p)))
@@ -200,7 +200,7 @@ void update_rq_clock(struct rq *rq)
 {
 	s64 delta;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	if (rq->clock_update_flags & RQCF_ACT_SKIP)
 		return;
@@ -509,7 +509,7 @@ void resched_curr(struct rq *rq)
 	struct task_struct *curr = rq->curr;
 	int cpu;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	if (test_tsk_need_resched(curr))
 		return;
@@ -533,10 +533,10 @@ void resched_cpu(int cpu)
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	raw_spin_lock_irqsave(rq_lockp(rq), flags);
 	if (cpu_online(cpu) || cpu == smp_processor_id())
 		resched_curr(rq);
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 }
 
 #ifdef CONFIG_SMP
@@ -910,7 +910,7 @@ static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
 	struct uclamp_se *uc_se = &p->uclamp[clamp_id];
 	struct uclamp_bucket *bucket;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	/* Update task effective clamp */
 	p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
@@ -950,7 +950,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
 	unsigned int bkt_clamp;
 	unsigned int rq_clamp;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	bucket = &uc_rq->bucket[uc_se->bucket_id];
 	SCHED_WARN_ON(!bucket->tasks);
@@ -1369,7 +1369,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
 				   struct task_struct *p, int new_cpu)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
 	dequeue_task(rq, p, DEQUEUE_NOCLOCK);
@@ -1483,7 +1483,7 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
 		 * Because __kthread_bind() calls this on blocked tasks without
 		 * holding rq->lock.
 		 */
-		lockdep_assert_held(&rq->lock);
+		lockdep_assert_held(rq_lockp(rq));
 		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
 	}
 	if (running)
@@ -1615,7 +1615,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	 * task_rq_lock().
 	 */
 	WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
-				      lockdep_is_held(&task_rq(p)->lock)));
+				      lockdep_is_held(rq_lockp(task_rq(p)))));
 #endif
 	/*
 	 * Clearly, migrating tasks to offline CPUs is a fairly daft thing.
@@ -2132,7 +2132,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 {
 	int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 #ifdef CONFIG_SMP
 	if (p->sched_contributes_to_load)
@@ -2986,10 +2986,10 @@ prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf
 	 * do an early lockdep release here:
 	 */
 	rq_unpin_lock(rq, rf);
-	spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
+	spin_release(&rq_lockp(rq)->dep_map, 1, _THIS_IP_);
 #ifdef CONFIG_DEBUG_SPINLOCK
 	/* this is a valid case when another task releases the spinlock */
-	rq->lock.owner = next;
+	rq_lockp(rq)->owner = next;
 #endif
 }
 
@@ -3000,8 +3000,8 @@ static inline void finish_lock_switch(struct rq *rq)
 	 * fix up the runqueue lock - which gets 'carried over' from
 	 * prev into current:
 	 */
-	spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
-	raw_spin_unlock_irq(&rq->lock);
+	spin_acquire(&rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
+	raw_spin_unlock_irq(rq_lockp(rq));
 }
 
 /*
@@ -3151,7 +3151,7 @@ static void __balance_callback(struct rq *rq)
 	void (*func)(struct rq *rq);
 	unsigned long flags;
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	raw_spin_lock_irqsave(rq_lockp(rq), flags);
 	head = rq->balance_callback;
 	rq->balance_callback = NULL;
 	while (head) {
@@ -3162,7 +3162,7 @@ static void __balance_callback(struct rq *rq)
 
 		func(rq);
 	}
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 }
 
 static inline void balance_callback(struct rq *rq)
@@ -5834,7 +5834,7 @@ void init_idle(struct task_struct *idle, int cpu)
 	unsigned long flags;
 
 	raw_spin_lock_irqsave(&idle->pi_lock, flags);
-	raw_spin_lock(&rq->lock);
+	raw_spin_lock(rq_lockp(rq));
 
 	__sched_fork(0, idle);
 	idle->state = TASK_RUNNING;
@@ -5871,7 +5871,7 @@ void init_idle(struct task_struct *idle, int cpu)
 #ifdef CONFIG_SMP
 	idle->on_cpu = 1;
 #endif
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 	raw_spin_unlock_irqrestore(&idle->pi_lock, flags);
 
 	/* Set the preempt count _outside_ the spinlocks! */
@@ -6439,7 +6439,7 @@ void __init sched_init(void)
 		struct rq *rq;
 
 		rq = cpu_rq(i);
-		raw_spin_lock_init(&rq->lock);
+		raw_spin_lock_init(&rq->__lock);
 		rq->nr_running = 0;
 		rq->calc_load_active = 0;
 		rq->calc_load_update = jiffies + LOAD_FREQ;
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 9fbb10383434..78de28ebc45d 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -111,7 +111,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 	/*
 	 * Take rq->lock to make 64-bit read safe on 32-bit platforms.
 	 */
-	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
 	if (index == CPUACCT_STAT_NSTATS) {
@@ -125,7 +125,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 	}
 
 #ifndef CONFIG_64BIT
-	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
 	return data;
@@ -140,14 +140,14 @@ static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
 	/*
 	 * Take rq->lock to make 64-bit write safe on 32-bit platforms.
 	 */
-	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
 	for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
 		cpuusage->usages[i] = val;
 
 #ifndef CONFIG_64BIT
-	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 }
 
@@ -252,13 +252,13 @@ static int cpuacct_all_seq_show(struct seq_file *m, void *V)
 			 * Take rq->lock to make 64-bit read safe on 32-bit
 			 * platforms.
 			 */
-			raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+			raw_spin_lock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 
 			seq_printf(m, " %llu", cpuusage->usages[index]);
 
 #ifndef CONFIG_64BIT
-			raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+			raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu)));
 #endif
 		}
 		seq_puts(m, "\n");
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 46122edd8552..899e988d4811 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -80,7 +80,7 @@ void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
 {
 	u64 old = dl_rq->running_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq)));
 	dl_rq->running_bw += dl_bw;
 	SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
 	SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
@@ -93,7 +93,7 @@ void __sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
 {
 	u64 old = dl_rq->running_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq)));
 	dl_rq->running_bw -= dl_bw;
 	SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
 	if (dl_rq->running_bw > old)
@@ -107,7 +107,7 @@ void __add_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
 {
 	u64 old = dl_rq->this_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq)));
 	dl_rq->this_bw += dl_bw;
 	SCHED_WARN_ON(dl_rq->this_bw < old); /* overflow */
 }
@@ -117,7 +117,7 @@ void __sub_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
 {
 	u64 old = dl_rq->this_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_held(rq_lockp(rq_of_dl_rq(dl_rq)));
 	dl_rq->this_bw -= dl_bw;
 	SCHED_WARN_ON(dl_rq->this_bw > old); /* underflow */
 	if (dl_rq->this_bw > old)
@@ -892,7 +892,7 @@ static int start_dl_timer(struct task_struct *p)
 	ktime_t now, act;
 	s64 delta;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	/*
 	 * We want the timer to fire at the deadline, but considering
@@ -1002,9 +1002,9 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 		 * If the runqueue is no longer available, migrate the
 		 * task elsewhere. This necessarily changes rq.
 		 */
-		lockdep_unpin_lock(&rq->lock, rf.cookie);
+		lockdep_unpin_lock(rq_lockp(rq), rf.cookie);
 		rq = dl_task_offline_migration(rq, p);
-		rf.cookie = lockdep_pin_lock(&rq->lock);
+		rf.cookie = lockdep_pin_lock(rq_lockp(rq));
 		update_rq_clock(rq);
 
 		/*
@@ -1619,7 +1619,7 @@ static void migrate_task_rq_dl(struct task_struct *p, int new_cpu __maybe_unused
 	 * from try_to_wake_up(). Hence, p->pi_lock is locked, but
 	 * rq->lock is not... So, lock it
 	 */
-	raw_spin_lock(&rq->lock);
+	raw_spin_lock(rq_lockp(rq));
 	if (p->dl.dl_non_contending) {
 		sub_running_bw(&p->dl, &rq->dl);
 		p->dl.dl_non_contending = 0;
@@ -1634,7 +1634,7 @@ static void migrate_task_rq_dl(struct task_struct *p, int new_cpu __maybe_unused
 			put_task_struct(p);
 	}
 	sub_rq_bw(&p->dl, &rq->dl);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 }
 
 static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index f7e4579e746c..7e5f2237c7e4 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -498,7 +498,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "exec_clock",
 			SPLIT_NS(cfs_rq->exec_clock));
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	raw_spin_lock_irqsave(rq_lockp(rq), flags);
 	if (rb_first_cached(&cfs_rq->tasks_timeline))
 		MIN_vruntime = (__pick_first_entity(cfs_rq))->vruntime;
 	last = __pick_last_entity(cfs_rq);
@@ -506,7 +506,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 		max_vruntime = last->vruntime;
 	min_vruntime = cfs_rq->min_vruntime;
 	rq0_min_vruntime = cpu_rq(0)->cfs.min_vruntime;
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "MIN_vruntime",
 			SPLIT_NS(MIN_vruntime));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "min_vruntime",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 500f5db0de0b..a1e8b811ce1f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1093,7 +1093,7 @@ struct numa_group {
 static struct numa_group *deref_task_numa_group(struct task_struct *p)
 {
 	return rcu_dereference_check(p->numa_group, p == current ||
-		(lockdep_is_held(&task_rq(p)->lock) && !READ_ONCE(p->on_cpu)));
+		(lockdep_is_held(rq_lockp(task_rq(p))) && !READ_ONCE(p->on_cpu)));
 }
 
 static struct numa_group *deref_curr_numa_group(struct task_struct *p)
@@ -5083,7 +5083,7 @@ static void __maybe_unused update_runtime_enabled(struct rq *rq)
 {
 	struct task_group *tg;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(tg, &task_groups, list) {
@@ -5102,7 +5102,7 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
 {
 	struct task_group *tg;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(tg, &task_groups, list) {
@@ -6593,7 +6593,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 		 * In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old'
 		 * rq->lock and can modify state directly.
 		 */
-		lockdep_assert_held(&task_rq(p)->lock);
+		lockdep_assert_held(rq_lockp(task_rq(p)));
 		detach_entity_cfs_rq(&p->se);
 
 	} else {
@@ -7167,7 +7167,7 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 {
 	s64 delta;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	if (p->sched_class != &fair_sched_class)
 		return 0;
@@ -7261,7 +7261,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 {
 	int tsk_cache_hot;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	/*
 	 * We do not migrate tasks that are:
@@ -7339,7 +7339,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
  */
 static void detach_task(struct task_struct *p, struct lb_env *env)
 {
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
 	set_task_cpu(p, env->dst_cpu);
@@ -7355,7 +7355,7 @@ static struct task_struct *detach_one_task(struct lb_env *env)
 {
 	struct task_struct *p;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	list_for_each_entry_reverse(p,
 			&env->src_rq->cfs_tasks, se.group_node) {
@@ -7391,7 +7391,7 @@ static int detach_tasks(struct lb_env *env)
 	unsigned long load;
 	int detached = 0;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_held(rq_lockp(env->src_rq));
 
 	if (env->imbalance <= 0)
 		return 0;
@@ -7472,7 +7472,7 @@ static int detach_tasks(struct lb_env *env)
  */
 static void attach_task(struct rq *rq, struct task_struct *p)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	BUG_ON(task_rq(p) != rq);
 	activate_task(rq, p, ENQUEUE_NOCLOCK);
@@ -8995,7 +8995,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		if (need_active_balance(&env)) {
 			unsigned long flags;
 
-			raw_spin_lock_irqsave(&busiest->lock, flags);
+			raw_spin_lock_irqsave(rq_lockp(busiest), flags);
 
 			/*
 			 * Don't kick the active_load_balance_cpu_stop,
@@ -9003,7 +9003,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 			 * moved to this_cpu:
 			 */
 			if (!cpumask_test_cpu(this_cpu, busiest->curr->cpus_ptr)) {
-				raw_spin_unlock_irqrestore(&busiest->lock,
+				raw_spin_unlock_irqrestore(rq_lockp(busiest),
 							    flags);
 				env.flags |= LBF_ALL_PINNED;
 				goto out_one_pinned;
@@ -9019,7 +9019,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 				busiest->push_cpu = this_cpu;
 				active_balance = 1;
 			}
-			raw_spin_unlock_irqrestore(&busiest->lock, flags);
+			raw_spin_unlock_irqrestore(rq_lockp(busiest), flags);
 
 			if (active_balance) {
 				stop_one_cpu_nowait(cpu_of(busiest),
@@ -9761,7 +9761,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
 	    time_before(jiffies, READ_ONCE(nohz.next_blocked)))
 		return;
 
-	raw_spin_unlock(&this_rq->lock);
+	raw_spin_unlock(rq_lockp(this_rq));
 	/*
 	 * This CPU is going to be idle and blocked load of idle CPUs
 	 * need to be updated. Run the ilb locally as it is a good
@@ -9770,7 +9770,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
 	 */
 	if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
 		kick_ilb(NOHZ_STATS_KICK);
-	raw_spin_lock(&this_rq->lock);
+	raw_spin_lock(rq_lockp(this_rq));
 }
 
 #else /* !CONFIG_NO_HZ_COMMON */
@@ -9830,7 +9830,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
 		goto out;
 	}
 
-	raw_spin_unlock(&this_rq->lock);
+	raw_spin_unlock(rq_lockp(this_rq));
 
 	update_blocked_averages(this_cpu);
 	rcu_read_lock();
@@ -9871,7 +9871,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
 	}
 	rcu_read_unlock();
 
-	raw_spin_lock(&this_rq->lock);
+	raw_spin_lock(rq_lockp(this_rq));
 
 	if (curr_cost > this_rq->max_idle_balance_cost)
 		this_rq->max_idle_balance_cost = curr_cost;
@@ -10307,11 +10307,11 @@ void online_fair_sched_group(struct task_group *tg)
 		rq = cpu_rq(i);
 		se = tg->se[i];
 
-		raw_spin_lock_irq(&rq->lock);
+		raw_spin_lock_irq(rq_lockp(rq));
 		update_rq_clock(rq);
 		attach_entity_cfs_rq(se);
 		sync_throttle(tg, i);
-		raw_spin_unlock_irq(&rq->lock);
+		raw_spin_unlock_irq(rq_lockp(rq));
 	}
 }
 
@@ -10334,9 +10334,9 @@ void unregister_fair_sched_group(struct task_group *tg)
 
 		rq = cpu_rq(cpu);
 
-		raw_spin_lock_irqsave(&rq->lock, flags);
+		raw_spin_lock_irqsave(rq_lockp(rq), flags);
 		list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
-		raw_spin_unlock_irqrestore(&rq->lock, flags);
+		raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 	}
 }
 
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 80940939b733..0d2f83899c83 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -391,10 +391,10 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 static void
 dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags)
 {
-	raw_spin_unlock_irq(&rq->lock);
+	raw_spin_unlock_irq(rq_lockp(rq));
 	printk(KERN_ERR "bad: scheduling from the idle thread!\n");
 	dump_stack();
-	raw_spin_lock_irq(&rq->lock);
+	raw_spin_lock_irq(rq_lockp(rq));
 }
 
 static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index afff644da065..6649cb63e32a 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -116,7 +116,7 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
 
 static inline u64 rq_clock_pelt(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	assert_clock_updated(rq);
 
 	return rq->clock_pelt - rq->lost_idle_time;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index a532558a5176..5651d9e65897 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -845,7 +845,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
 		if (skip)
 			continue;
 
-		raw_spin_lock(&rq->lock);
+		raw_spin_lock(rq_lockp(rq));
 		update_rq_clock(rq);
 
 		if (rt_rq->rt_time) {
@@ -883,7 +883,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
 
 		if (enqueue)
 			sched_rt_rq_enqueue(rt_rq);
-		raw_spin_unlock(&rq->lock);
+		raw_spin_unlock(rq_lockp(rq));
 	}
 
 	if (!throttled && (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF))
@@ -2034,9 +2034,9 @@ void rto_push_irq_work_func(struct irq_work *work)
 	 * When it gets updated, a check is made if a push is possible.
 	 */
 	if (has_pushable_tasks(rq)) {
-		raw_spin_lock(&rq->lock);
+		raw_spin_lock(rq_lockp(rq));
 		push_rt_tasks(rq);
-		raw_spin_unlock(&rq->lock);
+		raw_spin_unlock(rq_lockp(rq));
 	}
 
 	raw_spin_lock(&rd->rto_lock);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 802b1f3405f2..af2a9972149a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -842,7 +842,7 @@ struct uclamp_rq {
  */
 struct rq {
 	/* runqueue lock: */
-	raw_spinlock_t		lock;
+	raw_spinlock_t		__lock;
 
 	/*
 	 * nr_running and cpu_load should be in the same cacheline because
@@ -1018,6 +1018,10 @@ static inline int cpu_of(struct rq *rq)
 #endif
 }
 
+static inline raw_spinlock_t *rq_lockp(struct rq *rq)
+{
+	return &rq->__lock;
+}
 
 #ifdef CONFIG_SCHED_SMT
 extern void __update_idle_core(struct rq *rq);
@@ -1085,7 +1089,7 @@ static inline void assert_clock_updated(struct rq *rq)
 
 static inline u64 rq_clock(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	assert_clock_updated(rq);
 
 	return rq->clock;
@@ -1093,7 +1097,7 @@ static inline u64 rq_clock(struct rq *rq)
 
 static inline u64 rq_clock_task(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	assert_clock_updated(rq);
 
 	return rq->clock_task;
@@ -1101,7 +1105,7 @@ static inline u64 rq_clock_task(struct rq *rq)
 
 static inline void rq_clock_skip_update(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	rq->clock_update_flags |= RQCF_REQ_SKIP;
 }
 
@@ -1111,7 +1115,7 @@ static inline void rq_clock_skip_update(struct rq *rq)
  */
 static inline void rq_clock_cancel_skipupdate(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 	rq->clock_update_flags &= ~RQCF_REQ_SKIP;
 }
 
@@ -1130,7 +1134,7 @@ struct rq_flags {
 
 static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
 {
-	rf->cookie = lockdep_pin_lock(&rq->lock);
+	rf->cookie = lockdep_pin_lock(rq_lockp(rq));
 
 #ifdef CONFIG_SCHED_DEBUG
 	rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
@@ -1145,12 +1149,12 @@ static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
 		rf->clock_update_flags = RQCF_UPDATED;
 #endif
 
-	lockdep_unpin_lock(&rq->lock, rf->cookie);
+	lockdep_unpin_lock(rq_lockp(rq), rf->cookie);
 }
 
 static inline void rq_repin_lock(struct rq *rq, struct rq_flags *rf)
 {
-	lockdep_repin_lock(&rq->lock, rf->cookie);
+	lockdep_repin_lock(rq_lockp(rq), rf->cookie);
 
 #ifdef CONFIG_SCHED_DEBUG
 	/*
@@ -1171,7 +1175,7 @@ static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf)
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 }
 
 static inline void
@@ -1180,7 +1184,7 @@ task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 	__releases(p->pi_lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 	raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
 }
 
@@ -1188,7 +1192,7 @@ static inline void
 rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock_irqsave(&rq->lock, rf->flags);
+	raw_spin_lock_irqsave(rq_lockp(rq), rf->flags);
 	rq_pin_lock(rq, rf);
 }
 
@@ -1196,7 +1200,7 @@ static inline void
 rq_lock_irq(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock_irq(&rq->lock);
+	raw_spin_lock_irq(rq_lockp(rq));
 	rq_pin_lock(rq, rf);
 }
 
@@ -1204,7 +1208,7 @@ static inline void
 rq_lock(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock(&rq->lock);
+	raw_spin_lock(rq_lockp(rq));
 	rq_pin_lock(rq, rf);
 }
 
@@ -1212,7 +1216,7 @@ static inline void
 rq_relock(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock(&rq->lock);
+	raw_spin_lock(rq_lockp(rq));
 	rq_repin_lock(rq, rf);
 }
 
@@ -1221,7 +1225,7 @@ rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf)
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock_irqrestore(&rq->lock, rf->flags);
+	raw_spin_unlock_irqrestore(rq_lockp(rq), rf->flags);
 }
 
 static inline void
@@ -1229,7 +1233,7 @@ rq_unlock_irq(struct rq *rq, struct rq_flags *rf)
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock_irq(&rq->lock);
+	raw_spin_unlock_irq(rq_lockp(rq));
 }
 
 static inline void
@@ -1237,7 +1241,7 @@ rq_unlock(struct rq *rq, struct rq_flags *rf)
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock(rq_lockp(rq));
 }
 
 static inline struct rq *
@@ -1300,7 +1304,7 @@ queue_balance_callback(struct rq *rq,
 		       struct callback_head *head,
 		       void (*func)(struct rq *rq))
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_held(rq_lockp(rq));
 
 	if (unlikely(head->next))
 		return;
@@ -1960,10 +1964,10 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
 	__acquires(busiest->lock)
 	__acquires(this_rq->lock)
 {
-	raw_spin_unlock(&this_rq->lock);
-	double_rq_lock(this_rq, busiest);
-
-	return 1;
+	raw_spin_unlock(rq_lockp(this_rq));
+ 	double_rq_lock(this_rq, busiest);
+ 
+ 	return 1;
 }
 
 #else
@@ -1979,20 +1983,22 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
 	__acquires(busiest->lock)
 	__acquires(this_rq->lock)
 {
-	int ret = 0;
-
-	if (unlikely(!raw_spin_trylock(&busiest->lock))) {
-		if (busiest < this_rq) {
-			raw_spin_unlock(&this_rq->lock);
-			raw_spin_lock(&busiest->lock);
-			raw_spin_lock_nested(&this_rq->lock,
-					      SINGLE_DEPTH_NESTING);
-			ret = 1;
-		} else
-			raw_spin_lock_nested(&busiest->lock,
-					      SINGLE_DEPTH_NESTING);
-	}
-	return ret;
+	if (rq_lockp(this_rq) == rq_lockp(busiest))
+		return 0;
+
+	if (likely(raw_spin_trylock(rq_lockp(busiest))))
+		return 0;
+
+	if (rq_lockp(busiest) >= rq_lockp(this_rq)) {
+		raw_spin_lock_nested(rq_lockp(busiest), SINGLE_DEPTH_NESTING);
+		return 0;
+ 	}
+
+	raw_spin_unlock(rq_lockp(this_rq));
+	raw_spin_lock(rq_lockp(busiest));
+	raw_spin_lock_nested(rq_lockp(this_rq), SINGLE_DEPTH_NESTING);
+
+	return 1;
 }
 
 #endif /* CONFIG_PREEMPT */
@@ -2002,11 +2008,7 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
  */
 static inline int double_lock_balance(struct rq *this_rq, struct rq *busiest)
 {
-	if (unlikely(!irqs_disabled())) {
-		/* printk() doesn't work well under rq->lock */
-		raw_spin_unlock(&this_rq->lock);
-		BUG_ON(1);
-	}
+	lockdep_assert_irqs_disabled();
 
 	return _double_lock_balance(this_rq, busiest);
 }
@@ -2014,8 +2016,9 @@ static inline int double_lock_balance(struct rq *this_rq, struct rq *busiest)
 static inline void double_unlock_balance(struct rq *this_rq, struct rq *busiest)
 	__releases(busiest->lock)
 {
-	raw_spin_unlock(&busiest->lock);
-	lock_set_subclass(&this_rq->lock.dep_map, 0, _RET_IP_);
+	if (rq_lockp(this_rq) != rq_lockp(busiest))
+		raw_spin_unlock(rq_lockp(busiest));
+	lock_set_subclass(&rq_lockp(this_rq)->dep_map, 0, _RET_IP_);
 }
 
 static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
@@ -2056,16 +2059,16 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
 	__acquires(rq2->lock)
 {
 	BUG_ON(!irqs_disabled());
-	if (rq1 == rq2) {
-		raw_spin_lock(&rq1->lock);
+	if (rq_lockp(rq1) == rq_lockp(rq2)) {
+		raw_spin_lock(rq_lockp(rq1));
 		__acquire(rq2->lock);	/* Fake it out ;) */
 	} else {
-		if (rq1 < rq2) {
-			raw_spin_lock(&rq1->lock);
-			raw_spin_lock_nested(&rq2->lock, SINGLE_DEPTH_NESTING);
+		if (rq_lockp(rq1) < rq_lockp(rq2)) {
+			raw_spin_lock(rq_lockp(rq1));
+			raw_spin_lock_nested(rq_lockp(rq2), SINGLE_DEPTH_NESTING);
 		} else {
-			raw_spin_lock(&rq2->lock);
-			raw_spin_lock_nested(&rq1->lock, SINGLE_DEPTH_NESTING);
+			raw_spin_lock(rq_lockp(rq2));
+			raw_spin_lock_nested(rq_lockp(rq1), SINGLE_DEPTH_NESTING);
 		}
 	}
 }
@@ -2080,9 +2083,9 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
 	__releases(rq1->lock)
 	__releases(rq2->lock)
 {
-	raw_spin_unlock(&rq1->lock);
-	if (rq1 != rq2)
-		raw_spin_unlock(&rq2->lock);
+	raw_spin_unlock(rq_lockp(rq1));
+	if (rq_lockp(rq1) != rq_lockp(rq2))
+		raw_spin_unlock(rq_lockp(rq2));
 	else
 		__release(rq2->lock);
 }
@@ -2105,7 +2108,7 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
 {
 	BUG_ON(!irqs_disabled());
 	BUG_ON(rq1 != rq2);
-	raw_spin_lock(&rq1->lock);
+	raw_spin_lock(rq_lockp(rq1));
 	__acquire(rq2->lock);	/* Fake it out ;) */
 }
 
@@ -2120,7 +2123,7 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
 	__releases(rq2->lock)
 {
 	BUG_ON(rq1 != rq2);
-	raw_spin_unlock(&rq1->lock);
+	raw_spin_unlock(rq_lockp(rq1));
 	__release(rq2->lock);
 }
 
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f751ce0b783e..ed9a0edf8024 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -442,7 +442,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
 	struct root_domain *old_rd = NULL;
 	unsigned long flags;
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	raw_spin_lock_irqsave(rq_lockp(rq), flags);
 
 	if (rq->rd) {
 		old_rd = rq->rd;
@@ -468,7 +468,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
 	if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
 		set_rq_online(rq);
 
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_unlock_irqrestore(rq_lockp(rq), flags);
 
 	if (old_rd)
 		call_rcu(&old_rd->rcu, free_rootdomain);
-- 
2.17.1


* [RFC PATCH v4 04/19] sched/{rt,deadline}: Fix set_next_task vs pick_next_task
  2019-10-30 18:33 [RFC PATCH v4 00/19] Core scheduling v4 Vineeth Remanan Pillai
                   ` (2 preceding siblings ...)
  2019-10-30 18:33 ` [RFC PATCH v4 03/19] sched: Wrap rq::lock access Vineeth Remanan Pillai
@ 2019-10-30 18:33 ` Vineeth Remanan Pillai
  2019-10-30 18:33 ` [RFC PATCH v4 05/19] sched: Add task_struct pointer to sched_class::set_curr_task Vineeth Remanan Pillai
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 84+ messages in thread
From: Vineeth Remanan Pillai @ 2019-10-30 18:33 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, Dario Faggioli, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

From: Peter Zijlstra <peterz@infradead.org>

Because pick_next_task() implies set_curr_task() and some of the
details haven't mattered too much, some of what _should_ be in
set_curr_task() ended up in pick_next_task(); correct this.

This prepares the way for a pick_next_task() variant that does not
affect the current state; allowing remote picking.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
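Note: the direction this is heading (see "sched: Introduce
sched_class::pick_task()" later in the series) is roughly the split sketched
below: the pure selection step has no side effects, so it can also be run
against a remote runqueue, while everything that makes a task 'current' lives
in set_next_task(). The helper names below are hypothetical:

  /* Conceptual shape only; these helpers are hypothetical. */
  struct toy_rq;
  struct toy_task;

  struct toy_task *toy_pick_task(struct toy_rq *rq);       /* selection only */
  void toy_put_prev_task(struct toy_rq *rq, struct toy_task *prev);
  void toy_set_next_task(struct toy_rq *rq, struct toy_task *next);

  static struct toy_task *toy_pick_next_task(struct toy_rq *rq,
                                             struct toy_task *prev)
  {
          struct toy_task *next;

          toy_put_prev_task(rq, prev);    /* balancing etc. happens here    */
          next = toy_pick_task(rq);       /* no side effects; remote-safe   */
          toy_set_next_task(rq, next);    /* 'become current' state changes */

          return next;
  }
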
 kernel/sched/deadline.c | 23 ++++++++++++-----------
 kernel/sched/rt.c       | 27 ++++++++++++++-------------
 2 files changed, 26 insertions(+), 24 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 899e988d4811..4b53e7c696c8 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1694,12 +1694,21 @@ static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
 }
 #endif
 
-static inline void set_next_task(struct rq *rq, struct task_struct *p)
+static void set_next_task_dl(struct rq *rq, struct task_struct *p)
 {
 	p->se.exec_start = rq_clock_task(rq);
 
 	/* You can't push away the running task */
 	dequeue_pushable_dl_task(rq, p);
+
+	if (hrtick_enabled(rq))
+		start_hrtick_dl(rq, p);
+
+	if (rq->curr->sched_class != &dl_sched_class)
+		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+
+	if (rq->curr != p)
+		deadline_queue_push_tasks(rq);
 }
 
 static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
@@ -1758,15 +1767,7 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	p = dl_task_of(dl_se);
 
-	set_next_task(rq, p);
-
-	if (hrtick_enabled(rq))
-		start_hrtick_dl(rq, p);
-
-	deadline_queue_push_tasks(rq);
-
-	if (rq->curr->sched_class != &dl_sched_class)
-		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+	set_next_task_dl(rq, p);
 
 	return p;
 }
@@ -1813,7 +1814,7 @@ static void task_fork_dl(struct task_struct *p)
 
 static void set_curr_task_dl(struct rq *rq)
 {
-	set_next_task(rq, rq->curr);
+	set_next_task_dl(rq, rq->curr);
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 5651d9e65897..f4590ac6fd92 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1498,12 +1498,23 @@ static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int flag
 #endif
 }
 
-static inline void set_next_task(struct rq *rq, struct task_struct *p)
+static inline void set_next_task_rt(struct rq *rq, struct task_struct *p)
 {
 	p->se.exec_start = rq_clock_task(rq);
 
 	/* The running task is never eligible for pushing */
 	dequeue_pushable_task(rq, p);
+
+	/*
+	 * If prev task was rt, put_prev_task() has already updated the
+	 * utilization. We only care of the case where we start to schedule a
+	 * rt task
+	 */
+	if (rq->curr->sched_class != &rt_sched_class)
+		update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+
+	if (rq->curr != p)
+		rt_queue_push_tasks(rq);
 }
 
 static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq,
@@ -1577,17 +1588,7 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	p = _pick_next_task_rt(rq);
 
-	set_next_task(rq, p);
-
-	rt_queue_push_tasks(rq);
-
-	/*
-	 * If prev task was rt, put_prev_task() has already updated the
-	 * utilization. We only care of the case where we start to schedule a
-	 * rt task
-	 */
-	if (rq->curr->sched_class != &rt_sched_class)
-		update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+	set_next_task_rt(rq, p);
 
 	return p;
 }
@@ -2356,7 +2357,7 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
 
 static void set_curr_task_rt(struct rq *rq)
 {
-	set_next_task(rq, rq->curr);
+	set_next_task_rt(rq, rq->curr);
 }
 
 static unsigned int get_rr_interval_rt(struct rq *rq, struct task_struct *task)
-- 
2.17.1


* [RFC PATCH v4 05/19] sched: Add task_struct pointer to sched_class::set_curr_task
  2019-10-30 18:33 [RFC PATCH v4 00/19] Core scheduling v4 Vineeth Remanan Pillai
                   ` (3 preceding siblings ...)
  2019-10-30 18:33 ` [RFC PATCH v4 04/19] sched/{rt,deadline}: Fix set_next_task vs pick_next_task Vineeth Remanan Pillai
@ 2019-10-30 18:33 ` Vineeth Remanan Pillai
  2019-10-30 18:33 ` [RFC PATCH v4 06/19] sched/fair: Export newidle_balance() Vineeth Remanan Pillai
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 84+ messages in thread
From: Vineeth Remanan Pillai @ 2019-10-30 18:33 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, Dario Faggioli, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

From: Peter Zijlstra <peterz@infradead.org>

In preparation for further separating pick_next_task() and
set_curr_task(), we have to pass the actual task into it; while there,
rename the thing to better pair with put_prev_task().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c      | 12 ++++++------
 kernel/sched/deadline.c  |  7 +------
 kernel/sched/fair.c      | 17 ++++++++++++++---
 kernel/sched/idle.c      | 27 +++++++++++++++------------
 kernel/sched/rt.c        |  7 +------
 kernel/sched/sched.h     |  8 +++++---
 kernel/sched/stop_task.c | 17 +++++++----------
 7 files changed, 49 insertions(+), 46 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a9c93cf71f5f..95cc10ecc7c9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1494,7 +1494,7 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
 	if (queued)
 		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
 	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);
 }
 
 /*
@@ -4276,7 +4276,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
 	if (queued)
 		enqueue_task(rq, p, queue_flag);
 	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);
 
 	check_class_changed(rq, p, prev_class, oldprio);
 out_unlock:
@@ -4343,7 +4343,7 @@ void set_user_nice(struct task_struct *p, long nice)
 			resched_curr(rq);
 	}
 	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);
 out_unlock:
 	task_rq_unlock(rq, p, &rf);
 }
@@ -4786,7 +4786,7 @@ static int __sched_setscheduler(struct task_struct *p,
 		enqueue_task(rq, p, queue_flags);
 	}
 	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);
 
 	check_class_changed(rq, p, prev_class, oldprio);
 
@@ -5975,7 +5975,7 @@ void sched_setnuma(struct task_struct *p, int nid)
 	if (queued)
 		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
 	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);
 	task_rq_unlock(rq, p, &rf);
 }
 #endif /* CONFIG_NUMA_BALANCING */
@@ -6856,7 +6856,7 @@ void sched_move_task(struct task_struct *tsk)
 	if (queued)
 		enqueue_task(rq, tsk, queue_flags);
 	if (running)
-		set_curr_task(rq, tsk);
+		set_next_task(rq, tsk);
 
 	task_rq_unlock(rq, tsk, &rf);
 }
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 4b53e7c696c8..38b45f2f890b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1812,11 +1812,6 @@ static void task_fork_dl(struct task_struct *p)
 	 */
 }
 
-static void set_curr_task_dl(struct rq *rq)
-{
-	set_next_task_dl(rq, rq->curr);
-}
-
 #ifdef CONFIG_SMP
 
 /* Only try algorithms three times */
@@ -2396,6 +2391,7 @@ const struct sched_class dl_sched_class = {
 
 	.pick_next_task		= pick_next_task_dl,
 	.put_prev_task		= put_prev_task_dl,
+	.set_next_task		= set_next_task_dl,
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_dl,
@@ -2406,7 +2402,6 @@ const struct sched_class dl_sched_class = {
 	.task_woken		= task_woken_dl,
 #endif
 
-	.set_curr_task		= set_curr_task_dl,
 	.task_tick		= task_tick_dl,
 	.task_fork              = task_fork_dl,
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a1e8b811ce1f..a58e5de1732d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10180,9 +10180,19 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p)
  * This routine is mostly called to set cfs_rq->curr field when a task
  * migrates between groups/classes.
  */
-static void set_curr_task_fair(struct rq *rq)
+static void set_next_task_fair(struct rq *rq, struct task_struct *p)
 {
-	struct sched_entity *se = &rq->curr->se;
+	struct sched_entity *se = &p->se;
+
+#ifdef CONFIG_SMP
+	if (task_on_rq_queued(p)) {
+		/*
+		 * Move the next running task to the front of the list, so our
+		 * cfs_tasks list becomes MRU one.
+		 */
+		list_move(&se->group_node, &rq->cfs_tasks);
+	}
+#endif
 
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
@@ -10453,7 +10463,9 @@ const struct sched_class fair_sched_class = {
 	.check_preempt_curr	= check_preempt_wakeup,
 
 	.pick_next_task		= pick_next_task_fair,
+
 	.put_prev_task		= put_prev_task_fair,
+	.set_next_task          = set_next_task_fair,
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_fair,
@@ -10466,7 +10478,6 @@ const struct sched_class fair_sched_class = {
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
 
-	.set_curr_task          = set_curr_task_fair,
 	.task_tick		= task_tick_fair,
 	.task_fork		= task_fork_fair,
 
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 0d2f83899c83..3ff4889196e1 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -374,14 +374,25 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
 	resched_curr(rq);
 }
 
+static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
+{
+}
+
+static void set_next_task_idle(struct rq *rq, struct task_struct *next)
+{
+	update_idle_core(rq);
+	schedstat_inc(rq->sched_goidle);
+}
+
 static struct task_struct *
 pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
+	struct task_struct *next = rq->idle;
+
 	put_prev_task(rq, prev);
-	update_idle_core(rq);
-	schedstat_inc(rq->sched_goidle);
+	set_next_task_idle(rq, next);
 
-	return rq->idle;
+	return next;
 }
 
 /*
@@ -397,10 +408,6 @@ dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags)
 	raw_spin_lock_irq(rq_lockp(rq));
 }
 
-static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
-{
-}
-
 /*
  * scheduler tick hitting a task of our scheduling class.
  *
@@ -413,10 +420,6 @@ static void task_tick_idle(struct rq *rq, struct task_struct *curr, int queued)
 {
 }
 
-static void set_curr_task_idle(struct rq *rq)
-{
-}
-
 static void switched_to_idle(struct rq *rq, struct task_struct *p)
 {
 	BUG();
@@ -451,13 +454,13 @@ const struct sched_class idle_sched_class = {
 
 	.pick_next_task		= pick_next_task_idle,
 	.put_prev_task		= put_prev_task_idle,
+	.set_next_task          = set_next_task_idle,
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_idle,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
 
-	.set_curr_task          = set_curr_task_idle,
 	.task_tick		= task_tick_idle,
 
 	.get_rr_interval	= get_rr_interval_idle,
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f4590ac6fd92..a857945772d1 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2355,11 +2355,6 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
 	}
 }
 
-static void set_curr_task_rt(struct rq *rq)
-{
-	set_next_task_rt(rq, rq->curr);
-}
-
 static unsigned int get_rr_interval_rt(struct rq *rq, struct task_struct *task)
 {
 	/*
@@ -2381,6 +2376,7 @@ const struct sched_class rt_sched_class = {
 
 	.pick_next_task		= pick_next_task_rt,
 	.put_prev_task		= put_prev_task_rt,
+	.set_next_task          = set_next_task_rt,
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_rt,
@@ -2392,7 +2388,6 @@ const struct sched_class rt_sched_class = {
 	.switched_from		= switched_from_rt,
 #endif
 
-	.set_curr_task          = set_curr_task_rt,
 	.task_tick		= task_tick_rt,
 
 	.get_rr_interval	= get_rr_interval_rt,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index af2a9972149a..657831e26008 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1715,6 +1715,7 @@ struct sched_class {
 					       struct task_struct *prev,
 					       struct rq_flags *rf);
 	void (*put_prev_task)(struct rq *rq, struct task_struct *p);
+	void (*set_next_task)(struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
 	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
@@ -1729,7 +1730,6 @@ struct sched_class {
 	void (*rq_offline)(struct rq *rq);
 #endif
 
-	void (*set_curr_task)(struct rq *rq);
 	void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
 	void (*task_fork)(struct task_struct *p);
 	void (*task_dead)(struct task_struct *p);
@@ -1759,12 +1759,14 @@ struct sched_class {
 
 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
+	WARN_ON_ONCE(rq->curr != prev);
 	prev->sched_class->put_prev_task(rq, prev);
 }
 
-static inline void set_curr_task(struct rq *rq, struct task_struct *curr)
+static inline void set_next_task(struct rq *rq, struct task_struct *next)
 {
-	curr->sched_class->set_curr_task(rq);
+	WARN_ON_ONCE(rq->curr != next);
+	next->sched_class->set_next_task(rq, next);
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index c183b790ca54..47a3d2a18a9a 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -23,6 +23,11 @@ check_preempt_curr_stop(struct rq *rq, struct task_struct *p, int flags)
 	/* we're never preempted */
 }
 
+static void set_next_task_stop(struct rq *rq, struct task_struct *stop)
+{
+	stop->se.exec_start = rq_clock_task(rq);
+}
+
 static struct task_struct *
 pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
@@ -32,8 +37,7 @@ pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 		return NULL;
 
 	put_prev_task(rq, prev);
-
-	stop->se.exec_start = rq_clock_task(rq);
+	set_next_task_stop(rq, stop);
 
 	return stop;
 }
@@ -86,13 +90,6 @@ static void task_tick_stop(struct rq *rq, struct task_struct *curr, int queued)
 {
 }
 
-static void set_curr_task_stop(struct rq *rq)
-{
-	struct task_struct *stop = rq->stop;
-
-	stop->se.exec_start = rq_clock_task(rq);
-}
-
 static void switched_to_stop(struct rq *rq, struct task_struct *p)
 {
 	BUG(); /* its impossible to change to this class */
@@ -128,13 +125,13 @@ const struct sched_class stop_sched_class = {
 
 	.pick_next_task		= pick_next_task_stop,
 	.put_prev_task		= put_prev_task_stop,
+	.set_next_task          = set_next_task_stop,
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_stop,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
 
-	.set_curr_task          = set_curr_task_stop,
 	.task_tick		= task_tick_stop,
 
 	.get_rr_interval	= get_rr_interval_stop,
-- 
2.17.1


* [RFC PATCH v4 06/19] sched/fair: Export newidle_balance()
  2019-10-30 18:33 [RFC PATCH v4 00/19] Core scheduling v4 Vineeth Remanan Pillai
                   ` (4 preceding siblings ...)
  2019-10-30 18:33 ` [RFC PATCH v4 05/19] sched: Add task_struct pointer to sched_class::set_curr_task Vineeth Remanan Pillai
@ 2019-10-30 18:33 ` Vineeth Remanan Pillai
  2019-10-30 18:33 ` [RFC PATCH v4 07/19] sched: Allow put_prev_task() to drop rq->lock Vineeth Remanan Pillai
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 84+ messages in thread
From: Vineeth Remanan Pillai @ 2019-10-30 18:33 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, Dario Faggioli, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

From: Peter Zijlstra <peterz@infradead.org>

For pick_next_task_fair() it is the newidle balance that requires
dropping the rq->lock; provided we do put_prev_task() early, we can
also detect the condition for doing newidle early.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c  | 18 ++++++++----------
 kernel/sched/sched.h |  4 ++++
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a58e5de1732d..c6b6835a5945 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3689,8 +3689,6 @@ static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
 	return cfs_rq->avg.load_avg;
 }
 
-static int idle_balance(struct rq *this_rq, struct rq_flags *rf);
-
 static inline unsigned long task_util(struct task_struct *p)
 {
 	return READ_ONCE(p->se.avg.util_avg);
@@ -6907,11 +6905,10 @@ done: __maybe_unused;
 	return p;
 
 idle:
-	update_misfit_status(NULL, rq);
-	new_tasks = idle_balance(rq, rf);
+	new_tasks = newidle_balance(rq, rf);
 
 	/*
-	 * Because idle_balance() releases (and re-acquires) rq->lock, it is
+	 * Because newidle_balance() releases (and re-acquires) rq->lock, it is
 	 * possible for any higher priority task to appear. In that case we
 	 * must re-start the pick_next_entity() loop.
 	 */
@@ -9075,10 +9072,10 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	ld_moved = 0;
 
 	/*
-	 * idle_balance() disregards balance intervals, so we could repeatedly
-	 * reach this code, which would lead to balance_interval skyrocketting
-	 * in a short amount of time. Skip the balance_interval increase logic
-	 * to avoid that.
+	 * newidle_balance() disregards balance intervals, so we could
+	 * repeatedly reach this code, which would lead to balance_interval
+	 * skyrocketting in a short amount of time. Skip the balance_interval
+	 * increase logic to avoid that.
 	 */
 	if (env.idle == CPU_NEWLY_IDLE)
 		goto out;
@@ -9788,7 +9785,7 @@ static inline void nohz_newidle_balance(struct rq *this_rq) { }
  * idle_balance is called by schedule() if this_cpu is about to become
  * idle. Attempts to pull tasks from other CPUs.
  */
-static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
+int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 {
 	unsigned long next_balance = jiffies + HZ;
 	int this_cpu = this_rq->cpu;
@@ -9796,6 +9793,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
 	int pulled_task = 0;
 	u64 curr_cost = 0;
 
+	update_misfit_status(NULL, this_rq);
 	/*
 	 * We must set idle_stamp _before_ calling idle_balance(), such that we
 	 * measure the duration of idle_balance() as idle time.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 657831e26008..dc889af24efb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1453,10 +1453,14 @@ static inline void unregister_sched_domain_sysctl(void)
 }
 #endif
 
+extern int newidle_balance(struct rq *this_rq, struct rq_flags *rf);
+
 #else
 
 static inline void sched_ttwu_pending(void) { }
 
+static inline int newidle_balance(struct rq *this_rq, struct rq_flags *rf) { return 0; }
+
 #endif /* CONFIG_SMP */
 
 #include "stats.h"
-- 
2.17.1



* [RFC PATCH v4 07/19] sched: Allow put_prev_task() to drop rq->lock
  2019-10-30 18:33 [RFC PATCH v4 00/19] Core scheduling v4 Vineeth Remanan Pillai
                   ` (5 preceding siblings ...)
  2019-10-30 18:33 ` [RFC PATCH v4 06/19] sched/fair: Export newidle_balance() Vineeth Remanan Pillai
@ 2019-10-30 18:33 ` Vineeth Remanan Pillai
  2019-10-30 18:33 ` [RFC PATCH v4 08/19] sched: Rework pick_next_task() slow-path Vineeth Remanan Pillai
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 84+ messages in thread
From: Vineeth Remanan Pillai @ 2019-10-30 18:33 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, Dario Faggioli, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

From: Peter Zijlstra <peterz@infradead.org>

Currently the pick_next_task() loop is convoluted and ugly because of
how it can drop the rq->lock and needs to restart the picking.

For the RT/Deadline classes, it is put_prev_task() where we do the
balancing, and we could do this before the picking loop. Make this
possible.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
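As an illustration of the new calling convention (a sketch, not part of
the diff; everything with an _example suffix is made up, only
rq_unpin_lock()/rq_repin_lock() are the real helpers used below): a
class may now balance from put_prev_task() when the caller hands in its
rq_flags, while the plain put_prev_task() wrapper keeps today's
behaviour by passing NULL:

static void put_prev_task_example(struct rq *rq, struct task_struct *p,
                                  struct rq_flags *rf)
{
        /* ... the usual per-class bookkeeping for @p ... */

        if (rf) {
                /*
                 * The caller passed its rq_flags, so it can cope with
                 * rq->lock being dropped and re-taken while we pull
                 * work from other runqueues.
                 */
                rq_unpin_lock(rq, rf);
                /* class specific pull, e.g. pull_rt_task(rq) */
                rq_repin_lock(rq, rf);
        }
        /* rf == NULL: called via the put_prev_task() wrapper, no balancing */
}
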
 kernel/sched/core.c      |  2 +-
 kernel/sched/deadline.c  | 14 +++++++++++++-
 kernel/sched/fair.c      |  2 +-
 kernel/sched/idle.c      |  2 +-
 kernel/sched/rt.c        | 14 +++++++++++++-
 kernel/sched/sched.h     |  4 ++--
 kernel/sched/stop_task.c |  2 +-
 7 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 95cc10ecc7c9..3fa7d37f2879 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6015,7 +6015,7 @@ static void calc_load_migrate(struct rq *rq)
 		atomic_long_add(delta, &calc_load_tasks);
 }
 
-static void put_prev_task_fake(struct rq *rq, struct task_struct *prev)
+static void put_prev_task_fake(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 }
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 38b45f2f890b..416e06c97e98 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1772,13 +1772,25 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	return p;
 }
 
-static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
+static void put_prev_task_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 {
 	update_curr_dl(rq);
 
 	update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 1);
 	if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_dl_task(rq, p);
+
+	if (rf && !on_dl_rq(&p->dl) && need_pull_dl_task(rq, p)) {
+		/*
+		 * This is OK, because current is on_cpu, which avoids it being
+		 * picked for load-balance and preemption/IRQs are still
+		 * disabled avoiding further scheduler activity on it and we've
+		 * not yet started the picking loop.
+		 */
+		rq_unpin_lock(rq, rf);
+		pull_dl_task(rq);
+		rq_repin_lock(rq, rf);
+	}
 }
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c6b6835a5945..7ee57472d88e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6930,7 +6930,7 @@ done: __maybe_unused;
 /*
  * Account for a descheduled task:
  */
-static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
+static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	struct sched_entity *se = &prev->se;
 	struct cfs_rq *cfs_rq;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 3ff4889196e1..5848ae3d5210 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -374,7 +374,7 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
 	resched_curr(rq);
 }
 
-static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
+static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 }
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index a857945772d1..a7d9656f70b5 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1593,7 +1593,7 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	return p;
 }
 
-static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
+static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 {
 	update_curr_rt(rq);
 
@@ -1605,6 +1605,18 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
 	 */
 	if (on_rt_rq(&p->rt) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_task(rq, p);
+
+	if (rf && !on_rt_rq(&p->rt) && need_pull_rt_task(rq, p)) {
+		/*
+		 * This is OK, because current is on_cpu, which avoids it being
+		 * picked for load-balance and preemption/IRQs are still
+		 * disabled avoiding further scheduler activity on it and we've
+		 * not yet started the picking loop.
+		 */
+		rq_unpin_lock(rq, rf);
+		pull_rt_task(rq);
+		rq_repin_lock(rq, rf);
+	}
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dc889af24efb..8b86b24a3ccd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1718,7 +1718,7 @@ struct sched_class {
 	struct task_struct * (*pick_next_task)(struct rq *rq,
 					       struct task_struct *prev,
 					       struct rq_flags *rf);
-	void (*put_prev_task)(struct rq *rq, struct task_struct *p);
+	void (*put_prev_task)(struct rq *rq, struct task_struct *p, struct rq_flags *rf);
 	void (*set_next_task)(struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
@@ -1764,7 +1764,7 @@ struct sched_class {
 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
 	WARN_ON_ONCE(rq->curr != prev);
-	prev->sched_class->put_prev_task(rq, prev);
+	prev->sched_class->put_prev_task(rq, prev, NULL);
 }
 
 static inline void set_next_task(struct rq *rq, struct task_struct *next)
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 47a3d2a18a9a..8f414018d5e0 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -59,7 +59,7 @@ static void yield_task_stop(struct rq *rq)
 	BUG(); /* the stop task should never yield, its pointless. */
 }
 
-static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
+static void put_prev_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	struct task_struct *curr = rq->curr;
 	u64 delta_exec;
-- 
2.17.1



* [RFC PATCH v4 08/19] sched: Rework pick_next_task() slow-path
  2019-10-30 18:33 [RFC PATCH v4 00/19] Core scheduling v4 Vineeth Remanan Pillai
                   ` (6 preceding siblings ...)
  2019-10-30 18:33 ` [RFC PATCH v4 07/19] sched: Allow put_prev_task() to drop rq->lock Vineeth Remanan Pillai
@ 2019-10-30 18:33 ` Vineeth Remanan Pillai
  2019-10-30 18:33 ` [RFC PATCH v4 09/19] sched: Introduce sched_class::pick_task() Vineeth Remanan Pillai
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 84+ messages in thread
From: Vineeth Remanan Pillai @ 2019-10-30 18:33 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, Dario Faggioli, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

From: Peter Zijlstra <peterz@infradead.org>

Avoid the RETRY_TASK case in the pick_next_task() slow path.

By doing the put_prev_task() early, we get the rt/deadline pull done,
and by testing rq->nr_running we know if we need newidle_balance().

This then gives a stable state to pick a task from.

Since the fast-path is fair-only, the other classes will always be
called with pick_next_task(.prev=NULL, .rf=NULL), which lets us
simplify them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
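A sketch of the class-side contract this creates (illustrative names
with an _example suffix; the dl/rt/stop hunks below follow this shape):
with prev and rf guaranteed NULL, the pick degenerates to a simple
queued-check plus selection:

static struct task_struct *
pick_next_task_example(struct rq *rq, struct task_struct *prev,
                       struct rq_flags *rf)
{
        struct task_struct *p;

        /* the slow path already did put_prev_task() and newidle_balance() */
        WARN_ON_ONCE(prev || rf);

        if (!example_has_queued_tasks(rq))      /* e.g. !rt_rq->rt_queued */
                return NULL;

        p = example_pick_highest_prio(rq);      /* class specific selection */
        set_next_task_example(rq, p);

        return p;
}
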
 kernel/sched/core.c      | 19 ++++++++++++-------
 kernel/sched/deadline.c  | 30 ++----------------------------
 kernel/sched/fair.c      |  9 ++++++---
 kernel/sched/idle.c      |  4 +++-
 kernel/sched/rt.c        | 29 +----------------------------
 kernel/sched/sched.h     | 13 ++++++++-----
 kernel/sched/stop_task.c |  3 ++-
 7 files changed, 34 insertions(+), 73 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3fa7d37f2879..1f40211ff29a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3739,7 +3739,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 		p = fair_sched_class.pick_next_task(rq, prev, rf);
 		if (unlikely(p == RETRY_TASK))
-			goto again;
+			goto restart;
 
 		/* Assumes fair_sched_class->next == idle_sched_class */
 		if (unlikely(!p))
@@ -3748,14 +3748,19 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		return p;
 	}
 
-again:
+restart:
+	/*
+	 * Ensure that we put DL/RT tasks before the pick loop, such that they
+	 * can PULL higher prio tasks when we lower the RQ 'priority'.
+	 */
+	prev->sched_class->put_prev_task(rq, prev, rf);
+	if (!rq->nr_running)
+		newidle_balance(rq, rf);
+
 	for_each_class(class) {
-		p = class->pick_next_task(rq, prev, rf);
-		if (p) {
-			if (unlikely(p == RETRY_TASK))
-				goto again;
+		p = class->pick_next_task(rq, NULL, NULL);
+		if (p)
 			return p;
-		}
 	}
 
 	/* The idle class should always have a runnable task: */
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 416e06c97e98..0c7a51745813 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1729,39 +1729,13 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	struct task_struct *p;
 	struct dl_rq *dl_rq;
 
-	dl_rq = &rq->dl;
-
-	if (need_pull_dl_task(rq, prev)) {
-		/*
-		 * This is OK, because current is on_cpu, which avoids it being
-		 * picked for load-balance and preemption/IRQs are still
-		 * disabled avoiding further scheduler activity on it and we're
-		 * being very careful to re-start the picking loop.
-		 */
-		rq_unpin_lock(rq, rf);
-		pull_dl_task(rq);
-		rq_repin_lock(rq, rf);
-		/*
-		 * pull_dl_task() can drop (and re-acquire) rq->lock; this
-		 * means a stop task can slip in, in which case we need to
-		 * re-start task selection.
-		 */
-		if (rq->stop && task_on_rq_queued(rq->stop))
-			return RETRY_TASK;
-	}
+	WARN_ON_ONCE(prev || rf);
 
-	/*
-	 * When prev is DL, we may throttle it in put_prev_task().
-	 * So, we update time before we check for dl_nr_running.
-	 */
-	if (prev->sched_class == &dl_sched_class)
-		update_curr_dl(rq);
+	dl_rq = &rq->dl;
 
 	if (unlikely(!dl_rq->dl_nr_running))
 		return NULL;
 
-	put_prev_task(rq, prev);
-
 	dl_se = pick_next_dl_entity(rq, dl_rq);
 	BUG_ON(!dl_se);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ee57472d88e..850e235c9209 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6799,7 +6799,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 		goto idle;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-	if (prev->sched_class != &fair_sched_class)
+	if (!prev || prev->sched_class != &fair_sched_class)
 		goto simple;
 
 	/*
@@ -6876,8 +6876,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	goto done;
 simple:
 #endif
-
-	put_prev_task(rq, prev);
+	if (prev)
+		put_prev_task(rq, prev);
 
 	do {
 		se = pick_next_entity(cfs_rq, NULL);
@@ -6905,6 +6905,9 @@ done: __maybe_unused;
 	return p;
 
 idle:
+	if (!rf)
+		return NULL;
+
 	new_tasks = newidle_balance(rq, rf);
 
 	/*
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 5848ae3d5210..5bb8d44fbff9 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -389,7 +389,9 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 {
 	struct task_struct *next = rq->idle;
 
-	put_prev_task(rq, prev);
+	if (prev)
+		put_prev_task(rq, prev);
+
 	set_next_task_idle(rq, next);
 
 	return next;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index a7d9656f70b5..09c317a916aa 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1554,38 +1554,11 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	struct task_struct *p;
 	struct rt_rq *rt_rq = &rq->rt;
 
-	if (need_pull_rt_task(rq, prev)) {
-		/*
-		 * This is OK, because current is on_cpu, which avoids it being
-		 * picked for load-balance and preemption/IRQs are still
-		 * disabled avoiding further scheduler activity on it and we're
-		 * being very careful to re-start the picking loop.
-		 */
-		rq_unpin_lock(rq, rf);
-		pull_rt_task(rq);
-		rq_repin_lock(rq, rf);
-		/*
-		 * pull_rt_task() can drop (and re-acquire) rq->lock; this
-		 * means a dl or stop task can slip in, in which case we need
-		 * to re-start task selection.
-		 */
-		if (unlikely((rq->stop && task_on_rq_queued(rq->stop)) ||
-			     rq->dl.dl_nr_running))
-			return RETRY_TASK;
-	}
-
-	/*
-	 * We may dequeue prev's rt_rq in put_prev_task().
-	 * So, we update time before rt_queued check.
-	 */
-	if (prev->sched_class == &rt_sched_class)
-		update_curr_rt(rq);
+	WARN_ON_ONCE(prev || rf);
 
 	if (!rt_rq->rt_queued)
 		return NULL;
 
-	put_prev_task(rq, prev);
-
 	p = _pick_next_task_rt(rq);
 
 	set_next_task_rt(rq, p);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8b86b24a3ccd..16c9654ac750 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1708,12 +1708,15 @@ struct sched_class {
 	void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);
 
 	/*
-	 * It is the responsibility of the pick_next_task() method that will
-	 * return the next task to call put_prev_task() on the @prev task or
-	 * something equivalent.
+	 * Both @prev and @rf are optional and may be NULL, in which case the
+	 * caller must already have invoked put_prev_task(rq, prev, rf).
 	 *
-	 * May return RETRY_TASK when it finds a higher prio class has runnable
-	 * tasks.
+	 * Otherwise it is the responsibility of the pick_next_task() to call
+	 * put_prev_task() on the @prev task or something equivalent, IFF it
+	 * returns a next task.
+	 *
+	 * In that case (@rf != NULL) it may return RETRY_TASK when it finds a
+	 * higher prio class has runnable tasks.
 	 */
 	struct task_struct * (*pick_next_task)(struct rq *rq,
 					       struct task_struct *prev,
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 8f414018d5e0..7e1cee4e65b2 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -33,10 +33,11 @@ pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 {
 	struct task_struct *stop = rq->stop;
 
+	WARN_ON_ONCE(prev || rf);
+
 	if (!stop || !task_on_rq_queued(stop))
 		return NULL;
 
-	put_prev_task(rq, prev);
 	set_next_task_stop(rq, stop);
 
 	return stop;
-- 
2.17.1



* [RFC PATCH v4 09/19] sched: Introduce sched_class::pick_task()
  2019-10-30 18:33 [RFC PATCH v4 00/19] Core scheduling v4 Vineeth Remanan Pillai
                   ` (7 preceding siblings ...)
  2019-10-30 18:33 ` [RFC PATCH v4 08/19] sched: Rework pick_next_task() slow-path Vineeth Remanan Pillai
@ 2019-10-30 18:33 ` Vineeth Remanan Pillai
  2019-10-30 18:33 ` [RFC PATCH v4 10/19] sched: Core-wide rq->lock Vineeth Remanan Pillai
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 84+ messages in thread
From: Vineeth Remanan Pillai @ 2019-10-30 18:33 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, Dario Faggioli, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Vineeth Remanan Pillai

From: Peter Zijlstra <peterz@infradead.org>

Because sched_class::pick_next_task() also implies
sched_class::set_next_task() (and possibly put_prev_task() and
newidle_balance()), it is not state invariant. This makes it unsuitable
for remote task selection.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
---
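The split can be summarised as follows (a sketch with made-up _example
names, mirroring the pattern applied to dl/rt/stop below): pick_task()
only inspects the runqueue, so a later patch can evaluate it against a
sibling's rq without mutating that sibling's state, and the local
pick_next_task() becomes pick_task() plus the set_next_task()
transition:

/* Side-effect free: safe to evaluate against a remote (sibling) runqueue. */
static struct task_struct *pick_task_example(struct rq *rq)
{
        if (!example_has_queued_tasks(rq))
                return NULL;

        return example_leftmost_task(rq);       /* no rq state is modified */
}

/* Local pick: the same selection, plus the state transition. */
static struct task_struct *
pick_next_task_example(struct rq *rq, struct task_struct *prev,
                       struct rq_flags *rf)
{
        struct task_struct *p;

        WARN_ON_ONCE(prev || rf);

        p = pick_task_example(rq);
        if (p)
                set_next_task_example(rq, p);

        return p;
}
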
 kernel/sched/deadline.c  | 21 ++++++++++++++++-----
 kernel/sched/fair.c      | 36 +++++++++++++++++++++++++++++++++---
 kernel/sched/idle.c      | 10 +++++++++-
 kernel/sched/rt.c        | 21 ++++++++++++++++-----
 kernel/sched/sched.h     |  2 ++
 kernel/sched/stop_task.c | 21 ++++++++++++++++-----
 6 files changed, 92 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0c7a51745813..461f1c1b791d 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1722,15 +1722,12 @@ static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
 	return rb_entry(left, struct sched_dl_entity, rb_node);
 }
 
-static struct task_struct *
-pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+static struct task_struct *pick_task_dl(struct rq *rq)
 {
 	struct sched_dl_entity *dl_se;
 	struct task_struct *p;
 	struct dl_rq *dl_rq;
 
-	WARN_ON_ONCE(prev || rf);
-
 	dl_rq = &rq->dl;
 
 	if (unlikely(!dl_rq->dl_nr_running))
@@ -1741,7 +1738,19 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	p = dl_task_of(dl_se);
 
-	set_next_task_dl(rq, p);
+	return p;
+}
+
+static struct task_struct *
+pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	struct task_struct *p;
+
+	WARN_ON_ONCE(prev || rf);
+
+	p = pick_task_dl(rq);
+	if (p)
+		set_next_task_dl(rq, p);
 
 	return p;
 }
@@ -2380,6 +2389,8 @@ const struct sched_class dl_sched_class = {
 	.set_next_task		= set_next_task_dl,
 
 #ifdef CONFIG_SMP
+	.pick_task		= pick_task_dl,
+
 	.select_task_rq		= select_task_rq_dl,
 	.migrate_task_rq	= migrate_task_rq_dl,
 	.set_cpus_allowed       = set_cpus_allowed_dl,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 850e235c9209..cc78b334e1c3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4211,7 +4211,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	 * Avoid running the skip buddy, if running something else can
 	 * be done without getting too unfair.
 	 */
-	if (cfs_rq->skip == se) {
+	if (cfs_rq->skip && cfs_rq->skip == se) {
 		struct sched_entity *second;
 
 		if (se == curr) {
@@ -4229,13 +4229,13 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	/*
 	 * Prefer last buddy, try to return the CPU to a preempted task.
 	 */
-	if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
+	if (left && cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
 		se = cfs_rq->last;
 
 	/*
 	 * Someone really wants this to run. If it's not unfair, run it.
 	 */
-	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
+	if (left && cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
 		se = cfs_rq->next;
 
 	clear_buddies(cfs_rq, se);
@@ -6786,6 +6786,34 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 		set_last_buddy(se);
 }
 
+static struct task_struct *
+pick_task_fair(struct rq *rq)
+{
+	struct cfs_rq *cfs_rq = &rq->cfs;
+	struct sched_entity *se;
+
+	if (!cfs_rq->nr_running)
+		return NULL;
+
+	do {
+		struct sched_entity *curr = cfs_rq->curr;
+
+		se = pick_next_entity(cfs_rq, NULL);
+
+		if (curr) {
+			if (se && curr->on_rq)
+				update_curr(cfs_rq);
+
+			if (!se || entity_before(curr, se))
+				se = curr;
+		}
+
+		cfs_rq = group_cfs_rq(se);
+	} while (cfs_rq);
+
+	return task_of(se);
+}
+
 static struct task_struct *
 pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
@@ -10469,6 +10497,8 @@ const struct sched_class fair_sched_class = {
 	.set_next_task          = set_next_task_fair,
 
 #ifdef CONFIG_SMP
+	.pick_task		= pick_task_fair,
+
 	.select_task_rq		= select_task_rq_fair,
 	.migrate_task_rq	= migrate_task_rq_fair,
 
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 5bb8d44fbff9..87c55e3b988d 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -374,6 +374,12 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
 	resched_curr(rq);
 }
 
+static struct task_struct *
+pick_task_idle(struct rq *rq)
+{
+	return rq->idle;
+}
+
 static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 }
@@ -387,11 +393,12 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next)
 static struct task_struct *
 pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
-	struct task_struct *next = rq->idle;
+	struct task_struct *next;
 
 	if (prev)
 		put_prev_task(rq, prev);
 
+	next = pick_task_idle(rq);
 	set_next_task_idle(rq, next);
 
 	return next;
@@ -459,6 +466,7 @@ const struct sched_class idle_sched_class = {
 	.set_next_task          = set_next_task_idle,
 
 #ifdef CONFIG_SMP
+	.pick_task		= pick_task_idle,
 	.select_task_rq		= select_task_rq_idle,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 09c317a916aa..4714630a90b9 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1548,20 +1548,29 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
 	return rt_task_of(rt_se);
 }
 
-static struct task_struct *
-pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+static struct task_struct *pick_task_rt(struct rq *rq)
 {
 	struct task_struct *p;
 	struct rt_rq *rt_rq = &rq->rt;
 
-	WARN_ON_ONCE(prev || rf);
-
 	if (!rt_rq->rt_queued)
 		return NULL;
 
 	p = _pick_next_task_rt(rq);
 
-	set_next_task_rt(rq, p);
+	return p;
+}
+
+static struct task_struct *
+pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	struct task_struct *p;
+
+	WARN_ON_ONCE(prev || rf);
+
+	p = pick_task_rt(rq);
+	if (p)
+		set_next_task_rt(rq, p);
 
 	return p;
 }
@@ -2364,6 +2373,8 @@ const struct sched_class rt_sched_class = {
 	.set_next_task          = set_next_task_rt,
 
 #ifdef CONFIG_SMP
+	.pick_task		= pick_task_rt,
+
 	.select_task_rq		= select_task_rq_rt,
 
 	.set_cpus_allowed       = set_cpus_allowed_common,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 16c9654ac750..44f45aad3f97 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1725,6 +1725,8 @@ struct sched_class {
 	void (*set_next_task)(struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
+	struct task_struct * (*pick_task)(struct rq *rq);
+
 	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
 	void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
 
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 7e1cee4e65b2..fb6c436cba6c 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -29,20 +29,30 @@ static void set_next_task_stop(struct rq *rq, struct task_struct *stop)
 }
 
 static struct task_struct *
-pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+pick_task_stop(struct rq *rq)
 {
 	struct task_struct *stop = rq->stop;
 
-	WARN_ON_ONCE(prev || rf);
-
 	if (!stop || !task_on_rq_queued(stop))
 		return NULL;
 
-	set_next_task_stop(rq, stop);
-
 	return stop;
 }
 
+static struct task_struct *
+pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	struct task_struct *p;
+
+	WARN_ON_ONCE(prev || rf);
+
+	p = pick_task_stop(rq);
+	if (p)
+		set_next_task_stop(rq, p);
+
+	return p;
+}
+
 static void
 enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
 {
@@ -129,6 +139,7 @@ const struct sched_class stop_sched_class = {
 	.set_next_task          = set_next_task_stop,
 
 #ifdef CONFIG_SMP
+	.pick_task		= pick_task_stop,
 	.select_task_rq		= select_task_rq_stop,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
-- 
2.17.1



* [RFC PATCH v4 10/19] sched: Core-wide rq->lock
  2019-10-30 18:33 [RFC PATCH v4 00/19] Core scheduling v4 Vineeth Remanan Pillai
                   ` (8 preceding siblings ...)
  2019-10-30 18:33 ` [RFC PATCH v4 09/19] sched: Introduce sched_class::pick_task() Vineeth Remanan Pillai
@ 2019-10-30 18:33 ` Vineeth Remanan Pillai
  2019-10-30 18:33 ` [RFC PATCH v4 11/19] sched: Basic tracking of matching tasks Vineeth Remanan Pillai
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 84+ messages in thread
From: Vineeth Remanan Pillai @ 2019-10-30 18:33 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, Dario Faggioli, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Vineeth Remanan Pillai

From: Peter Zijlstra <peterz@infradead.org>

Introduce the basic infrastructure to have a core-wide rq->lock.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
---
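The effect of rq_lockp() can be demonstrated outside the kernel with an
ordinary mutex; the toy types below are made up, only the idea (every
sibling's lock pointer resolves to the designated core runqueue's lock
while the feature is enabled) is taken from the patch:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct toy_rq {
        pthread_mutex_t __lock;
        struct toy_rq *core;            /* designated sibling, like rq->core */
        bool core_enabled;
};

static pthread_mutex_t *toy_rq_lockp(struct toy_rq *rq)
{
        if (rq->core_enabled)
                return &rq->core->__lock;       /* shared, core-wide lock */

        return &rq->__lock;                     /* private, per-rq lock */
}

int main(void)
{
        struct toy_rq a = { PTHREAD_MUTEX_INITIALIZER, &a, true };
        struct toy_rq b = { PTHREAD_MUTEX_INITIALIZER, &a, true };

        /* while enabled, both siblings agree on the same lock */
        printf("same lock: %d\n", toy_rq_lockp(&a) == toy_rq_lockp(&b));

        return 0;
}
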
 kernel/Kconfig.preempt |   6 +++
 kernel/sched/core.c    | 113 ++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h   |  31 +++++++++++
 3 files changed, 148 insertions(+), 2 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index deff97217496..63493a1b9ff6 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -80,3 +80,9 @@ config PREEMPT_COUNT
 config PREEMPTION
        bool
        select PREEMPT_COUNT
+
+config SCHED_CORE
+	bool
+	default y
+	depends on SCHED_SMT
+
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1f40211ff29a..0427cf610d6a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -72,6 +72,70 @@ __read_mostly int scheduler_running;
  */
 int sysctl_sched_rt_runtime = 950000;
 
+#ifdef CONFIG_SCHED_CORE
+
+DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+/*
+ * The static-key + stop-machine variable are needed such that:
+ *
+ *	spin_lock(rq_lockp(rq));
+ *	...
+ *	spin_unlock(rq_lockp(rq));
+ *
+ * ends up locking and unlocking the _same_ lock, and all CPUs
+ * always agree on what rq has what lock.
+ *
+ * XXX entirely possible to selectively enable cores, don't bother for now.
+ */
+static int __sched_core_stopper(void *data)
+{
+	bool enabled = !!(unsigned long)data;
+	int cpu;
+
+	for_each_online_cpu(cpu)
+		cpu_rq(cpu)->core_enabled = enabled;
+
+	return 0;
+}
+
+static DEFINE_MUTEX(sched_core_mutex);
+static int sched_core_count;
+
+static void __sched_core_enable(void)
+{
+	// XXX verify there are no cookie tasks (yet)
+
+	static_branch_enable(&__sched_core_enabled);
+	stop_machine(__sched_core_stopper, (void *)true, NULL);
+}
+
+static void __sched_core_disable(void)
+{
+	// XXX verify there are no cookie tasks (left)
+
+	stop_machine(__sched_core_stopper, (void *)false, NULL);
+	static_branch_disable(&__sched_core_enabled);
+}
+
+void sched_core_get(void)
+{
+	mutex_lock(&sched_core_mutex);
+	if (!sched_core_count++)
+		__sched_core_enable();
+	mutex_unlock(&sched_core_mutex);
+}
+
+void sched_core_put(void)
+{
+	mutex_lock(&sched_core_mutex);
+	if (!--sched_core_count)
+		__sched_core_disable();
+	mutex_unlock(&sched_core_mutex);
+}
+
+#endif /* CONFIG_SCHED_CORE */
+
 /*
  * __task_rq_lock - lock the rq @p resides on.
  */
@@ -6210,8 +6274,15 @@ int sched_cpu_activate(unsigned int cpu)
 	/*
 	 * When going up, increment the number of cores with SMT present.
 	 */
-	if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
+	if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
 		static_branch_inc_cpuslocked(&sched_smt_present);
+#ifdef CONFIG_SCHED_CORE
+		if (static_branch_unlikely(&__sched_core_enabled)) {
+			rq->core_enabled = true;
+		}
+#endif
+	}
+
 #endif
 	set_cpu_active(cpu, true);
 
@@ -6259,8 +6330,16 @@ int sched_cpu_deactivate(unsigned int cpu)
 	/*
 	 * When going down, decrement the number of cores with SMT present.
 	 */
-	if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
+	if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
+#ifdef CONFIG_SCHED_CORE
+		struct rq *rq = cpu_rq(cpu);
+		if (static_branch_unlikely(&__sched_core_enabled)) {
+			rq->core_enabled = false;
+		}
+#endif
 		static_branch_dec_cpuslocked(&sched_smt_present);
+
+	}
 #endif
 
 	if (!sched_smp_initialized)
@@ -6285,6 +6364,28 @@ static void sched_rq_cpu_starting(unsigned int cpu)
 
 int sched_cpu_starting(unsigned int cpu)
 {
+#ifdef CONFIG_SCHED_CORE
+	const struct cpumask *smt_mask = cpu_smt_mask(cpu);
+	struct rq *rq, *core_rq = NULL;
+	int i;
+
+	for_each_cpu(i, smt_mask) {
+		rq = cpu_rq(i);
+		if (rq->core && rq->core == rq)
+			core_rq = rq;
+	}
+
+	if (!core_rq)
+		core_rq = cpu_rq(cpu);
+
+	for_each_cpu(i, smt_mask) {
+		rq = cpu_rq(i);
+
+		WARN_ON_ONCE(rq->core && rq->core != core_rq);
+		rq->core = core_rq;
+	}
+#endif /* CONFIG_SCHED_CORE */
+
 	sched_rq_cpu_starting(cpu);
 	sched_tick_start(cpu);
 	return 0;
@@ -6313,6 +6414,9 @@ int sched_cpu_dying(unsigned int cpu)
 	update_max_interval();
 	nohz_balance_exit_idle(rq);
 	hrtick_clear(rq);
+#ifdef CONFIG_SCHED_CORE
+	rq->core = NULL;
+#endif
 	return 0;
 }
 #endif
@@ -6507,6 +6611,11 @@ void __init sched_init(void)
 #endif /* CONFIG_SMP */
 		hrtick_rq_init(rq);
 		atomic_set(&rq->nr_iowait, 0);
+
+#ifdef CONFIG_SCHED_CORE
+		rq->core = NULL;
+		rq->core_enabled = 0;
+#endif
 	}
 
 	set_load_weight(&init_task, false);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 44f45aad3f97..2ba68ab093ee 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -991,6 +991,12 @@ struct rq {
 	/* Must be inspected within a rcu lock section */
 	struct cpuidle_state	*idle_state;
 #endif
+
+#ifdef CONFIG_SCHED_CORE
+	/* per rq */
+	struct rq		*core;
+	unsigned int		core_enabled;
+#endif
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -1018,11 +1024,36 @@ static inline int cpu_of(struct rq *rq)
 #endif
 }
 
+#ifdef CONFIG_SCHED_CORE
+DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+	return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled;
+}
+
+static inline raw_spinlock_t *rq_lockp(struct rq *rq)
+{
+	if (sched_core_enabled(rq))
+		return &rq->core->__lock;
+
+	return &rq->__lock;
+}
+
+#else /* !CONFIG_SCHED_CORE */
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+	return false;
+}
+
 static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 {
 	return &rq->__lock;
 }
 
+#endif /* CONFIG_SCHED_CORE */
+
 #ifdef CONFIG_SCHED_SMT
 extern void __update_idle_core(struct rq *rq);
 
-- 
2.17.1



* [RFC PATCH v4 11/19] sched: Basic tracking of matching tasks
  2019-10-30 18:33 [RFC PATCH v4 00/19] Core scheduling v4 Vineeth Remanan Pillai
                   ` (9 preceding siblings ...)
  2019-10-30 18:33 ` [RFC PATCH v4 10/19] sched: Core-wide rq->lock Vineeth Remanan Pillai
@ 2019-10-30 18:33 ` Vineeth Remanan Pillai
  2019-10-30 18:33 ` [RFC PATCH v4 12/19] sched: A quick and dirty cgroup tagging interface Vineeth Remanan Pillai
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 84+ messages in thread
From: Vineeth Remanan Pillai @ 2019-10-30 18:33 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, Dario Faggioli, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Vineeth Remanan Pillai

From: Peter Zijlstra <peterz@infradead.org>

Introduce task_struct::core_cookie as an opaque identifier for core
scheduling. When enabled, core scheduling will only allow matching
tasks to be on the core, where the idle task matches everything.

When task_struct::core_cookie is set (and core scheduling is enabled),
these tasks are indexed in a second RB-tree, first on cookie value and
then on scheduling function, such that matching task selection always
finds the most eligible match.

NOTE: *shudder* at the overhead...

NOTE: *sigh*, a 3rd copy of the scheduling function; the alternative
is per class tracking of cookies and that just duplicates a lot of
stuff for no raisin (the 2nd copy lives in the rt-mutex PI code).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
---
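The RB-tree ordering amounts to a lexicographic sort on (core_cookie,
priority), with priority flipped so that the best task for a given
cookie is the leftmost matching node. A small self-contained model of
that comparison and of the "leftmost matching cookie" lookup (toy types
and a plain sorted array instead of the kernel rbtree):

#include <stdbool.h>
#include <stdio.h>

struct toy_task {
        unsigned long core_cookie;
        int prio;                       /* lower value == higher priority */
};

/* models __sched_core_less(): cookie first, then flipped priority */
static bool toy_core_less(const struct toy_task *a, const struct toy_task *b)
{
        if (a->core_cookie != b->core_cookie)
                return a->core_cookie < b->core_cookie;

        return a->prio < b->prio;       /* higher priority sorts leftmost */
}

/* models sched_core_find(): leftmost task matching @cookie */
static const struct toy_task *
toy_core_find(const struct toy_task *sorted, int n, unsigned long cookie)
{
        for (int i = 0; i < n; i++) {
                if (sorted[i].core_cookie == cookie)
                        return &sorted[i];      /* first hit == best priority */
        }

        return NULL;                            /* caller falls back to idle */
}

int main(void)
{
        /* already in toy_core_less() order: (1,10) (1,20) (2,5) */
        struct toy_task t[] = { { 1, 10 }, { 1, 20 }, { 2, 5 } };
        const struct toy_task *m;
        int i;

        /* sanity: the array really is in toy_core_less() order */
        for (i = 1; i < 3; i++) {
                if (!toy_core_less(&t[i - 1], &t[i]))
                        return 1;
        }

        m = toy_core_find(t, 3, 1);
        printf("cookie 1 -> prio %d\n", m ? m->prio : -1);

        return 0;
}
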
 include/linux/sched.h |   8 ++-
 kernel/sched/core.c   | 146 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c   |  46 -------------
 kernel/sched/sched.h  |  55 ++++++++++++++++
 4 files changed, 208 insertions(+), 47 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9f51932bd543..d319219ea58f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -688,10 +688,16 @@ struct task_struct {
 	const struct sched_class	*sched_class;
 	struct sched_entity		se;
 	struct sched_rt_entity		rt;
+	struct sched_dl_entity		dl;
+
+#ifdef CONFIG_SCHED_CORE
+	struct rb_node			core_node;
+	unsigned long			core_cookie;
+#endif
+
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group		*sched_task_group;
 #endif
-	struct sched_dl_entity		dl;
 
 #ifdef CONFIG_UCLAMP_TASK
 	/* Clamp values requested for a scheduling entity */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0427cf610d6a..44dadbe9a2d0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -76,6 +76,141 @@ int sysctl_sched_rt_runtime = 950000;
 
 DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
 
+/* kernel prio, less is more */
+static inline int __task_prio(struct task_struct *p)
+{
+	if (p->sched_class == &stop_sched_class) /* trumps deadline */
+		return -2;
+
+	if (rt_prio(p->prio)) /* includes deadline */
+		return p->prio; /* [-1, 99] */
+
+	if (p->sched_class == &idle_sched_class)
+		return MAX_RT_PRIO + NICE_WIDTH; /* 140 */
+
+	return MAX_RT_PRIO + MAX_NICE; /* 120, squash fair */
+}
+
+/*
+ * l(a,b)
+ * le(a,b) := !l(b,a)
+ * g(a,b)  := l(b,a)
+ * ge(a,b) := !l(a,b)
+ */
+
+/* real prio, less is less */
+static inline bool prio_less(struct task_struct *a, struct task_struct *b)
+{
+
+	int pa = __task_prio(a), pb = __task_prio(b);
+
+	if (-pa < -pb)
+		return true;
+
+	if (-pb < -pa)
+		return false;
+
+	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
+		return !dl_time_before(a->dl.deadline, b->dl.deadline);
+
+	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
+		u64 vruntime = b->se.vruntime;
+
+		/*
+		 * Normalize the vruntime if tasks are in different cpus.
+		 */
+		if (task_cpu(a) != task_cpu(b)) {
+			vruntime -= task_cfs_rq(b)->min_vruntime;
+			vruntime += task_cfs_rq(a)->min_vruntime;
+		}
+
+		return !((s64)(a->se.vruntime - vruntime) <= 0);
+	}
+
+	return false;
+}
+
+static inline bool __sched_core_less(struct task_struct *a, struct task_struct *b)
+{
+	if (a->core_cookie < b->core_cookie)
+		return true;
+
+	if (a->core_cookie > b->core_cookie)
+		return false;
+
+	/* flip prio, so high prio is leftmost */
+	if (prio_less(b, a))
+		return true;
+
+	return false;
+}
+
+static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+{
+	struct rb_node *parent, **node;
+	struct task_struct *node_task;
+
+	rq->core->core_task_seq++;
+
+	if (!p->core_cookie)
+		return;
+
+	node = &rq->core_tree.rb_node;
+	parent = *node;
+
+	while (*node) {
+		node_task = container_of(*node, struct task_struct, core_node);
+		parent = *node;
+
+		if (__sched_core_less(p, node_task))
+			node = &parent->rb_left;
+		else
+			node = &parent->rb_right;
+	}
+
+	rb_link_node(&p->core_node, parent, node);
+	rb_insert_color(&p->core_node, &rq->core_tree);
+}
+
+static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
+{
+	rq->core->core_task_seq++;
+
+	if (!p->core_cookie)
+		return;
+
+	rb_erase(&p->core_node, &rq->core_tree);
+}
+
+/*
+ * Find left-most (aka, highest priority) task matching @cookie.
+ */
+static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
+{
+	struct rb_node *node = rq->core_tree.rb_node;
+	struct task_struct *node_task, *match;
+
+	/*
+	 * The idle task always matches any cookie!
+	 */
+	match = idle_sched_class.pick_task(rq);
+
+	while (node) {
+		node_task = container_of(node, struct task_struct, core_node);
+
+		if (cookie < node_task->core_cookie) {
+			node = node->rb_left;
+		} else if (cookie > node_task->core_cookie) {
+			node = node->rb_right;
+		} else {
+			match = node_task;
+			node = node->rb_left;
+		}
+	}
+
+	return match;
+}
+
 /*
  * The static-key + stop-machine variable are needed such that:
  *
@@ -134,6 +269,11 @@ void sched_core_put(void)
 	mutex_unlock(&sched_core_mutex);
 }
 
+#else /* !CONFIG_SCHED_CORE */
+
+static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
+static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+
 #endif /* CONFIG_SCHED_CORE */
 
 /*
@@ -1233,6 +1373,9 @@ static inline void init_uclamp(void) { }
 
 static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	if (sched_core_enabled(rq))
+		sched_core_enqueue(rq, p);
+
 	if (!(flags & ENQUEUE_NOCLOCK))
 		update_rq_clock(rq);
 
@@ -1247,6 +1390,9 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 
 static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	if (sched_core_enabled(rq))
+		sched_core_dequeue(rq, p);
+
 	if (!(flags & DEQUEUE_NOCLOCK))
 		update_rq_clock(rq);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cc78b334e1c3..e3b7e8ba1420 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -248,33 +248,11 @@ const struct sched_class fair_sched_class;
  */
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-static inline struct task_struct *task_of(struct sched_entity *se)
-{
-	SCHED_WARN_ON(!entity_is_task(se));
-	return container_of(se, struct task_struct, se);
-}
 
 /* Walk up scheduling entities hierarchy */
 #define for_each_sched_entity(se) \
 		for (; se; se = se->parent)
 
-static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
-{
-	return p->se.cfs_rq;
-}
-
-/* runqueue on which this entity is (to be) queued */
-static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
-{
-	return se->cfs_rq;
-}
-
-/* runqueue "owned" by this group */
-static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
-{
-	return grp->my_q;
-}
-
 static inline void cfs_rq_tg_path(struct cfs_rq *cfs_rq, char *path, int len)
 {
 	if (!path)
@@ -435,33 +413,9 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #else	/* !CONFIG_FAIR_GROUP_SCHED */
 
-static inline struct task_struct *task_of(struct sched_entity *se)
-{
-	return container_of(se, struct task_struct, se);
-}
-
 #define for_each_sched_entity(se) \
 		for (; se; se = NULL)
 
-static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
-{
-	return &task_rq(p)->cfs;
-}
-
-static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
-{
-	struct task_struct *p = task_of(se);
-	struct rq *rq = task_rq(p);
-
-	return &rq->cfs;
-}
-
-/* runqueue "owned" by this group */
-static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
-{
-	return NULL;
-}
-
 static inline void cfs_rq_tg_path(struct cfs_rq *cfs_rq, char *path, int len)
 {
 	if (path)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2ba68ab093ee..666b3bc2cfc2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -996,6 +996,10 @@ struct rq {
 	/* per rq */
 	struct rq		*core;
 	unsigned int		core_enabled;
+	struct rb_root		core_tree;
+
+	/* shared state */
+	unsigned int		core_task_seq;
 #endif
 };
 
@@ -1075,6 +1079,57 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		raw_cpu_ptr(&runqueues)
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline struct task_struct *task_of(struct sched_entity *se)
+{
+	SCHED_WARN_ON(!entity_is_task(se));
+	return container_of(se, struct task_struct, se);
+}
+
+static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
+{
+	return p->se.cfs_rq;
+}
+
+/* runqueue on which this entity is (to be) queued */
+static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
+{
+	return se->cfs_rq;
+}
+
+/* runqueue "owned" by this group */
+static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
+{
+	return grp->my_q;
+}
+
+#else
+
+static inline struct task_struct *task_of(struct sched_entity *se)
+{
+	return container_of(se, struct task_struct, se);
+}
+
+static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
+{
+	return &task_rq(p)->cfs;
+}
+
+static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
+{
+	struct task_struct *p = task_of(se);
+	struct rq *rq = task_rq(p);
+
+	return &rq->cfs;
+}
+
+/* runqueue "owned" by this group */
+static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
+{
+	return NULL;
+}
+#endif
+
 extern void update_rq_clock(struct rq *rq);
 
 static inline u64 __rq_clock_broken(struct rq *rq)
-- 
2.17.1



* [RFC PATCH v4 12/19] sched: A quick and dirty cgroup tagging interface
  2019-10-30 18:33 [RFC PATCH v4 00/19] Core scheduling v4 Vineeth Remanan Pillai
                   ` (10 preceding siblings ...)
  2019-10-30 18:33 ` [RFC PATCH v4 11/19] sched: Basic tracking of matching tasks Vineeth Remanan Pillai
@ 2019-10-30 18:33 ` Vineeth Remanan Pillai
  2019-10-30 18:33 ` [RFC PATCH v4 13/19] sched: Add core wide task selection and scheduling Vineeth Remanan Pillai
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 84+ messages in thread
From: Vineeth Remanan Pillai @ 2019-10-30 18:33 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, Dario Faggioli, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Vineeth Remanan Pillai

From: Peter Zijlstra <peterz@infradead.org>

Marks all tasks in a cgroup as matching for core-scheduling.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
---
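From userspace the interface is just a write of "1" to the group's
cpu.tag file. A minimal sketch, assuming a cgroup v1 cpu controller
mounted at /sys/fs/cgroup/cpu and an already-created group called
"trusted" (both of those are assumptions of the example; only the
cpu.tag file name comes from this patch):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* assumed path; adjust to the local cgroup layout */
        const char *tag = "/sys/fs/cgroup/cpu/trusted/cpu.tag";
        int fd = open(tag, O_WRONLY);

        if (fd < 0) {
                perror("open cpu.tag");
                return 1;
        }

        /* writing "1" gives every task in the group the same core_cookie */
        if (write(fd, "1", 1) != 1)
                perror("write cpu.tag");

        close(fd);
        return 0;
}
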
 kernel/sched/core.c  | 78 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  4 +++
 2 files changed, 82 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 44dadbe9a2d0..27af86aa9a8b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7076,6 +7076,15 @@ static void sched_change_group(struct task_struct *tsk, int type)
 	tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
 			  struct task_group, css);
 	tg = autogroup_task_group(tsk, tg);
+
+#ifdef CONFIG_SCHED_CORE
+	if ((unsigned long)tsk->sched_task_group == tsk->core_cookie)
+		tsk->core_cookie = 0UL;
+
+	if (tg->tagged /* && !tsk->core_cookie ? */)
+		tsk->core_cookie = (unsigned long)tg;
+#endif
+
 	tsk->sched_task_group = tg;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -7155,6 +7164,18 @@ static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
 	return 0;
 }
 
+static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
+{
+#ifdef CONFIG_SCHED_CORE
+	struct task_group *tg = css_tg(css);
+
+	if (tg->tagged) {
+		sched_core_put();
+		tg->tagged = 0;
+	}
+#endif
+}
+
 static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
@@ -7542,6 +7563,46 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+#ifdef CONFIG_SCHED_CORE
+static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	struct task_group *tg = css_tg(css);
+
+	return !!tg->tagged;
+}
+
+static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
+{
+	struct task_group *tg = css_tg(css);
+	struct css_task_iter it;
+	struct task_struct *p;
+
+	if (val > 1)
+		return -ERANGE;
+
+	if (!static_branch_likely(&sched_smt_present))
+		return -EINVAL;
+
+	if (tg->tagged == !!val)
+		return 0;
+
+	tg->tagged = !!val;
+
+	if (!!val)
+		sched_core_get();
+
+	css_task_iter_start(css, 0, &it);
+	while ((p = css_task_iter_next(&it)))
+		p->core_cookie = !!val ? (unsigned long)tg : 0UL;
+	css_task_iter_end(&it);
+
+	if (!val)
+		sched_core_put();
+
+	return 0;
+}
+#endif
+
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -7577,6 +7638,14 @@ static struct cftype cpu_legacy_files[] = {
 		.read_u64 = cpu_rt_period_read_uint,
 		.write_u64 = cpu_rt_period_write_uint,
 	},
+#endif
+#ifdef CONFIG_SCHED_CORE
+	{
+		.name = "tag",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_core_tag_read_u64,
+		.write_u64 = cpu_core_tag_write_u64,
+	},
 #endif
 	{ }	/* Terminate */
 };
@@ -7744,6 +7813,14 @@ static struct cftype cpu_files[] = {
 		.seq_show = cpu_max_show,
 		.write = cpu_max_write,
 	},
+#endif
+#ifdef CONFIG_SCHED_CORE
+	{
+		.name = "tag",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_core_tag_read_u64,
+		.write_u64 = cpu_core_tag_write_u64,
+	},
 #endif
 	{ }	/* terminate */
 };
@@ -7751,6 +7828,7 @@ static struct cftype cpu_files[] = {
 struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_alloc	= cpu_cgroup_css_alloc,
 	.css_online	= cpu_cgroup_css_online,
+	.css_offline	= cpu_cgroup_css_offline,
 	.css_released	= cpu_cgroup_css_released,
 	.css_free	= cpu_cgroup_css_free,
 	.css_extra_stat_show = cpu_extra_stat_show,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 666b3bc2cfc2..0442a5abec01 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -357,6 +357,10 @@ struct cfs_bandwidth {
 struct task_group {
 	struct cgroup_subsys_state css;
 
+#ifdef CONFIG_SCHED_CORE
+	int			tagged;
+#endif
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* schedulable entities of this group on each CPU */
 	struct sched_entity	**se;
-- 
2.17.1



* [RFC PATCH v4 13/19] sched: Add core wide task selection and scheduling.
  2019-10-30 18:33 [RFC PATCH v4 00/19] Core scheduling v4 Vineeth Remanan Pillai
                   ` (11 preceding siblings ...)
  2019-10-30 18:33 ` [RFC PATCH v4 12/19] sched: A quick and dirty cgroup tagging interface Vineeth Remanan Pillai
@ 2019-10-30 18:33 ` Vineeth Remanan Pillai
  2019-10-30 18:33 ` [RFC PATCH v4 14/19] sched/fair: Add a few assertions Vineeth Remanan Pillai
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 84+ messages in thread
From: Vineeth Remanan Pillai @ 2019-10-30 18:33 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, Dario Faggioli, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Vineeth Remanan Pillai, Aaron Lu

From: Peter Zijlstra <peterz@infradead.org>

Instead of only selecting a local task, select a task for all SMT
siblings for every reschedule on the core (irrespective of which
logical CPU does the reschedule).

NOTE: there is still potential for siblings rivalry.
NOTE: this is far too complicated; but thus far I've failed to
      simplify it further.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
Signed-off-by: Aaron Lu <aaron.lu@linux.alibaba.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
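The per-sibling constraint can be modelled in a few lines (toy types
and values; the real pick_task()/pick_next_task() additionally handle
the no-cookie fast path, the max re-pick and the sequence counters):
once a core-wide max has been selected, a sibling runs its class pick
only if it matches the core cookie or beats the max, otherwise it gets
the best cookie match, which is the idle task when nothing on that
runqueue matches:

#include <stdio.h>

struct toy_task {
        const char *name;
        unsigned long cookie;
        int prio;                       /* lower value == higher priority */
};

/* 1 if @a is lower priority than @b */
static int toy_prio_less(const struct toy_task *a, const struct toy_task *b)
{
        return a->prio > b->prio;
}

/* per-sibling decision, modelled after pick_task() below */
static const struct toy_task *
toy_pick(const struct toy_task *class_pick,
         const struct toy_task *cookie_pick,
         const struct toy_task *max, unsigned long core_cookie)
{
        if (class_pick->cookie == core_cookie)
                return class_pick;

        if (toy_prio_less(cookie_pick, class_pick) &&
            (!max || toy_prio_less(max, class_pick)))
                return class_pick;      /* would become the new core-wide max */

        return cookie_pick;
}

int main(void)
{
        struct toy_task idle  = { "idle",  0, 1000 };
        struct toy_task vm1   = { "vm1",   1,   20 };   /* tagged, current max */
        struct toy_task other = { "other", 2,   30 };   /* different cookie, lower prio */

        /* the sibling only has a lower-prio, non-matching task: forced idle */
        printf("%s\n", toy_pick(&other, &idle, &vm1, 1)->name);

        return 0;
}
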
 kernel/sched/core.c  | 271 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |   6 +-
 2 files changed, 274 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 27af86aa9a8b..10b20de100e3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3932,7 +3932,7 @@ static inline void schedule_debug(struct task_struct *prev)
  * Pick up the highest-prio task:
  */
 static inline struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	const struct sched_class *class;
 	struct task_struct *p;
@@ -3977,6 +3977,268 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	BUG();
 }
 
+#ifdef CONFIG_SCHED_CORE
+
+static inline bool cookie_equals(struct task_struct *a, unsigned long cookie)
+{
+	return is_idle_task(a) || (a->core_cookie == cookie);
+}
+
+static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
+{
+	if (is_idle_task(a) || is_idle_task(b))
+		return true;
+
+	return a->core_cookie == b->core_cookie;
+}
+
+// XXX fairness/fwd progress conditions
+/*
+ * Returns
+ * - NULL if there is no runnable task for this class.
+ * - the highest priority task for this runqueue if it matches
+ *   rq->core->core_cookie or its priority is greater than max.
+ * - Else returns idle_task.
+ */
+static struct task_struct *
+pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
+{
+	struct task_struct *class_pick, *cookie_pick;
+	unsigned long cookie = rq->core->core_cookie;
+
+	class_pick = class->pick_task(rq);
+	if (!class_pick)
+		return NULL;
+
+	if (!cookie) {
+		/*
+		 * If class_pick is tagged, return it only if it has
+		 * higher priority than max.
+		 */
+		if (max && class_pick->core_cookie &&
+		    prio_less(class_pick, max))
+			return idle_sched_class.pick_task(rq);
+
+		return class_pick;
+	}
+
+	/*
+	 * If class_pick is idle or matches cookie, return early.
+	 */
+	if (cookie_equals(class_pick, cookie))
+		return class_pick;
+
+	cookie_pick = sched_core_find(rq, cookie);
+
+	/*
+	 * If class > max && class > cookie, it is the highest priority task on
+	 * the core (so far) and it must be selected, otherwise we must go with
+	 * the cookie pick in order to satisfy the constraint.
+	 */
+	if (prio_less(cookie_pick, class_pick) &&
+	    (!max || prio_less(max, class_pick)))
+		return class_pick;
+
+	return cookie_pick;
+}
+
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	struct task_struct *next, *max = NULL;
+	const struct sched_class *class;
+	const struct cpumask *smt_mask;
+	int i, j, cpu;
+	bool need_sync = false;
+
+	if (!sched_core_enabled(rq))
+		return __pick_next_task(rq, prev, rf);
+
+	/*
+	 * If there were no {en,de}queues since we picked (IOW, the task
+	 * pointers are all still valid), and we haven't scheduled the last
+	 * pick yet, do so now.
+	 */
+	if (rq->core->core_pick_seq == rq->core->core_task_seq &&
+	    rq->core->core_pick_seq != rq->core_sched_seq) {
+		WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
+
+		next = rq->core_pick;
+		if (next != prev) {
+			put_prev_task(rq, prev);
+			set_next_task(rq, next);
+		}
+		return next;
+	}
+
+	prev->sched_class->put_prev_task(rq, prev, rf);
+	if (!rq->nr_running)
+		newidle_balance(rq, rf);
+
+	cpu = cpu_of(rq);
+	smt_mask = cpu_smt_mask(cpu);
+
+	/*
+	 * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
+	 *
+	 * @task_seq guards the task state ({en,de}queues)
+	 * @pick_seq is the @task_seq we did a selection on
+	 * @sched_seq is the @pick_seq we scheduled
+	 *
+	 * However, preemptions can cause multiple picks on the same task set.
+	 * 'Fix' this by also increasing @task_seq for every pick.
+	 */
+	rq->core->core_task_seq++;
+	need_sync = !!rq->core->core_cookie;
+
+	/* reset state */
+	rq->core->core_cookie = 0UL;
+	for_each_cpu(i, smt_mask) {
+		struct rq *rq_i = cpu_rq(i);
+
+		rq_i->core_pick = NULL;
+
+		if (rq_i->core_forceidle) {
+			need_sync = true;
+			rq_i->core_forceidle = false;
+		}
+
+		if (i != cpu)
+			update_rq_clock(rq_i);
+	}
+
+	/*
+	 * Try and select tasks for each sibling in decending sched_class
+	 * order.
+	 */
+	for_each_class(class) {
+again:
+		for_each_cpu_wrap(i, smt_mask, cpu) {
+			struct rq *rq_i = cpu_rq(i);
+			struct task_struct *p;
+
+			if (cpu_is_offline(i))
+				continue;
+
+			if (rq_i->core_pick)
+				continue;
+
+			/*
+			 * If this sibling doesn't yet have a suitable task to
+			 * run; ask for the most elegible task, given the
+			 * highest priority task already selected for this
+			 * core.
+			 */
+			p = pick_task(rq_i, class, max);
+			if (!p) {
+				/*
+				 * If there weren't no cookies; we don't need
+				 * to bother with the other siblings.
+				 */
+				if (i == cpu && !need_sync)
+					goto next_class;
+
+				continue;
+			}
+
+			/*
+			 * Optimize the 'normal' case where there aren't any
+			 * cookies and we don't need to sync up.
+			 */
+			if (i == cpu && !need_sync && !p->core_cookie) {
+				next = p;
+				goto done;
+			}
+
+			rq_i->core_pick = p;
+
+			/*
+			 * If this new candidate is of higher priority than the
+			 * previous; and they're incompatible; we need to wipe
+			 * the slate and start over. pick_task makes sure that
+			 * p's priority is more than max if it doesn't match
+			 * max's cookie.
+			 *
+			 * NOTE: this is a linear max-filter and is thus bounded
+			 * in execution time.
+			 */
+			if (!max || !cookie_match(max, p)) {
+				struct task_struct *old_max = max;
+
+				rq->core->core_cookie = p->core_cookie;
+				max = p;
+
+				if (old_max) {
+					for_each_cpu(j, smt_mask) {
+						if (j == i)
+							continue;
+
+						cpu_rq(j)->core_pick = NULL;
+					}
+					goto again;
+				} else {
+					/*
+					 * Once we select a task for a cpu, we
+					 * should not be doing an unconstrained
+					 * pick because it might starve a task
+					 * on a forced idle cpu.
+					 */
+					need_sync = true;
+				}
+
+			}
+		}
+next_class:;
+	}
+
+	rq->core->core_pick_seq = rq->core->core_task_seq;
+	next = rq->core_pick;
+	rq->core_sched_seq = rq->core->core_pick_seq;
+
+	/*
+	 * Reschedule siblings
+	 *
+	 * NOTE: L1TF -- at this point we're no longer running the old task and
+	 * sending an IPI (below) ensures the sibling will no longer be running
+	 * their task. This ensures there is no inter-sibling overlap between
+	 * non-matching user state.
+	 */
+	for_each_cpu(i, smt_mask) {
+		struct rq *rq_i = cpu_rq(i);
+
+		if (cpu_is_offline(i))
+			continue;
+
+		WARN_ON_ONCE(!rq_i->core_pick);
+
+		if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
+			rq_i->core_forceidle = true;
+
+		if (i == cpu)
+			continue;
+
+		if (rq_i->curr != rq_i->core_pick)
+			resched_curr(rq_i);
+
+		/* Did we break L1TF mitigation requirements? */
+		WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+	}
+
+done:
+	set_next_task(rq, next);
+	return next;
+}
+
+#else /* !CONFIG_SCHED_CORE */
+
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	return __pick_next_task(rq, prev, rf);
+}
+
+#endif /* CONFIG_SCHED_CORE */
+
 /*
  * __schedule() is the main scheduler function.
  *
@@ -6290,7 +6552,7 @@ static void migrate_tasks(struct rq *dead_rq, struct rq_flags *rf)
 		/*
 		 * pick_next_task() assumes pinned rq->lock:
 		 */
-		next = pick_next_task(rq, &fake_task, rf);
+		next = __pick_next_task(rq, &fake_task, rf);
 		BUG_ON(!next);
 		put_prev_task(rq, next);
 
@@ -6760,7 +7022,12 @@ void __init sched_init(void)
 
 #ifdef CONFIG_SCHED_CORE
 		rq->core = NULL;
+		rq->core_pick = NULL;
 		rq->core_enabled = 0;
+		rq->core_tree = RB_ROOT;
+		rq->core_forceidle = false;
+
+		rq->core_cookie = 0UL;
 #endif
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0442a5abec01..fa92f6d3c73b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -999,11 +999,16 @@ struct rq {
 #ifdef CONFIG_SCHED_CORE
 	/* per rq */
 	struct rq		*core;
+	struct task_struct	*core_pick;
 	unsigned int		core_enabled;
+	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
+	bool			core_forceidle;
 
 	/* shared state */
 	unsigned int		core_task_seq;
+	unsigned int		core_pick_seq;
+	unsigned long		core_cookie;
 #endif
 };
 
@@ -1864,7 +1869,6 @@ static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 
 static inline void set_next_task(struct rq *rq, struct task_struct *next)
 {
-	WARN_ON_ONCE(rq->curr != next);
 	next->sched_class->set_next_task(rq, next);
 }
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [RFC PATCH v4 14/19] sched/fair: Add a few assertions
  2019-10-30 18:33 [RFC PATCH v4 00/19] Core scheduling v4 Vineeth Remanan Pillai
                   ` (12 preceding siblings ...)
  2019-10-30 18:33 ` [RFC PATCH v4 13/19] sched: Add core wide task selection and scheduling Vineeth Remanan Pillai
@ 2019-10-30 18:33 ` Vineeth Remanan Pillai
  2019-10-30 18:33 ` [RFC PATCH v4 15/19] sched: Trivial forced-newidle balancer Vineeth Remanan Pillai
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 84+ messages in thread
From: Vineeth Remanan Pillai @ 2019-10-30 18:33 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, Dario Faggioli, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

From: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e3b7e8ba1420..840d27628d36 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5984,6 +5984,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	struct sched_domain *sd;
 	int i, recent_used_cpu;
 
+	/*
+	 * per-cpu select_idle_mask usage
+	 */
+	lockdep_assert_irqs_disabled();
+
 	if (available_idle_cpu(target))
 		return target;
 
@@ -6439,8 +6444,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
  * certain conditions an idle sibling CPU if the domain has SD_WAKE_AFFINE set.
  *
  * Returns the target CPU number.
- *
- * preempt must be disabled.
  */
 static int
 select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
@@ -6451,6 +6454,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	int want_affine = 0;
 	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
 
+	/*
+	 * required for stable ->cpus_allowed
+	 */
+	lockdep_assert_held(&p->pi_lock);
+
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [RFC PATCH v4 15/19] sched: Trivial forced-newidle balancer
  2019-10-30 18:33 [RFC PATCH v4 00/19] Core scheduling v4 Vineeth Remanan Pillai
                   ` (13 preceding siblings ...)
  2019-10-30 18:33 ` [RFC PATCH v4 14/19] sched/fair: Add a few assertions Vineeth Remanan Pillai
@ 2019-10-30 18:33 ` Vineeth Remanan Pillai
  2019-10-30 18:33 ` [RFC PATCH v4 16/19] sched: Debug bits Vineeth Remanan Pillai
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 84+ messages in thread
From: Vineeth Remanan Pillai @ 2019-10-30 18:33 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, Dario Faggioli, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

From: Peter Zijlstra <peterz@infradead.org>

When a sibling is forced-idle to match the core-cookie, search for
matching tasks to fill the core.
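
As a reading aid, a condensed sketch of the call chain added below (the
names are those used in the diff; this is not separate code):

  /* idle was picked on a sibling that still has runnable tasks (forced idle) */
  set_next_task_idle(rq, next)
    -> queue_core_balance(rq)              /* only if rq->core carries a cookie  */
       -> sched_core_balance(rq)           /* balance callback, rq lock dropped  */
          -> steal_cookie_task(cpu, sd)    /* scan the sched domain span         */
             -> try_steal_cookie(this, that) /* pull over a cookie-matching task */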

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h |   1 +
 kernel/sched/core.c   | 131 +++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/idle.c   |   1 +
 kernel/sched/sched.h  |   6 ++
 4 files changed, 138 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d319219ea58f..896dd00e393f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -693,6 +693,7 @@ struct task_struct {
 #ifdef CONFIG_SCHED_CORE
 	struct rb_node			core_node;
 	unsigned long			core_cookie;
+	unsigned int			core_occupation;
 #endif
 
 #ifdef CONFIG_CGROUP_SCHED
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 10b20de100e3..f0a63fe5f0a4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -211,6 +211,21 @@ static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
 	return match;
 }
 
+static struct task_struct *sched_core_next(struct task_struct *p, unsigned long cookie)
+{
+	struct rb_node *node = &p->core_node;
+
+	node = rb_next(node);
+	if (!node)
+		return NULL;
+
+	p = container_of(node, struct task_struct, core_node);
+	if (p->core_cookie != cookie)
+		return NULL;
+
+	return p;
+}
+
 /*
  * The static-key + stop-machine variable are needed such that:
  *
@@ -4048,7 +4063,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	struct task_struct *next, *max = NULL;
 	const struct sched_class *class;
 	const struct cpumask *smt_mask;
-	int i, j, cpu;
+	int i, j, cpu, occ = 0;
 	bool need_sync = false;
 
 	if (!sched_core_enabled(rq))
@@ -4150,6 +4165,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 				goto done;
 			}
 
+			if (!is_idle_task(p))
+				occ++;
+
 			rq_i->core_pick = p;
 
 			/*
@@ -4175,6 +4193,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 						cpu_rq(j)->core_pick = NULL;
 					}
+					occ = 1;
 					goto again;
 				} else {
 					/*
@@ -4214,6 +4233,8 @@ next_class:;
 		if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
 			rq_i->core_forceidle = true;
 
+		rq_i->core_pick->core_occupation = occ;
+
 		if (i == cpu)
 			continue;
 
@@ -4229,6 +4250,114 @@ next_class:;
 	return next;
 }
 
+static bool try_steal_cookie(int this, int that)
+{
+	struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
+	struct task_struct *p;
+	unsigned long cookie;
+	bool success = false;
+
+	local_irq_disable();
+	double_rq_lock(dst, src);
+
+	cookie = dst->core->core_cookie;
+	if (!cookie)
+		goto unlock;
+
+	if (dst->curr != dst->idle)
+		goto unlock;
+
+	p = sched_core_find(src, cookie);
+	if (p == src->idle)
+		goto unlock;
+
+	do {
+		if (p == src->core_pick || p == src->curr)
+			goto next;
+
+		if (!cpumask_test_cpu(this, &p->cpus_mask))
+			goto next;
+
+		if (p->core_occupation > dst->idle->core_occupation)
+			goto next;
+
+		p->on_rq = TASK_ON_RQ_MIGRATING;
+		deactivate_task(src, p, 0);
+		set_task_cpu(p, this);
+		activate_task(dst, p, 0);
+		p->on_rq = TASK_ON_RQ_QUEUED;
+
+		resched_curr(dst);
+
+		success = true;
+		break;
+
+next:
+		p = sched_core_next(p, cookie);
+	} while (p);
+
+unlock:
+	double_rq_unlock(dst, src);
+	local_irq_enable();
+
+	return success;
+}
+
+static bool steal_cookie_task(int cpu, struct sched_domain *sd)
+{
+	int i;
+
+	for_each_cpu_wrap(i, sched_domain_span(sd), cpu) {
+		if (i == cpu)
+			continue;
+
+		if (need_resched())
+			break;
+
+		if (try_steal_cookie(cpu, i))
+			return true;
+	}
+
+	return false;
+}
+
+static void sched_core_balance(struct rq *rq)
+{
+	struct sched_domain *sd;
+	int cpu = cpu_of(rq);
+
+	rcu_read_lock();
+	raw_spin_unlock_irq(rq_lockp(rq));
+	for_each_domain(cpu, sd) {
+		if (!(sd->flags & SD_LOAD_BALANCE))
+			break;
+
+		if (need_resched())
+			break;
+
+		if (steal_cookie_task(cpu, sd))
+			break;
+	}
+	raw_spin_lock_irq(rq_lockp(rq));
+	rcu_read_unlock();
+}
+
+static DEFINE_PER_CPU(struct callback_head, core_balance_head);
+
+void queue_core_balance(struct rq *rq)
+{
+	if (!sched_core_enabled(rq))
+		return;
+
+	if (!rq->core->core_cookie)
+		return;
+
+	if (!rq->nr_running) /* not forced idle */
+		return;
+
+	queue_balance_callback(rq, &per_cpu(core_balance_head, rq->cpu), sched_core_balance);
+}
+
 #else /* !CONFIG_SCHED_CORE */
 
 static struct task_struct *
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 87c55e3b988d..bd21d9922a05 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -388,6 +388,7 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next)
 {
 	update_idle_core(rq);
 	schedstat_inc(rq->sched_goidle);
+	queue_core_balance(rq);
 }
 
 static struct task_struct *
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fa92f6d3c73b..311ab1e2a00e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1053,6 +1053,8 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 	return &rq->__lock;
 }
 
+extern void queue_core_balance(struct rq *rq);
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline bool sched_core_enabled(struct rq *rq)
@@ -1065,6 +1067,10 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 	return &rq->__lock;
 }
 
+static inline void queue_core_balance(struct rq *rq)
+{
+}
+
 #endif /* CONFIG_SCHED_CORE */
 
 #ifdef CONFIG_SCHED_SMT
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [RFC PATCH v4 16/19] sched: Debug bits...
  2019-10-30 18:33 [RFC PATCH v4 00/19] Core scheduling v4 Vineeth Remanan Pillai
                   ` (14 preceding siblings ...)
  2019-10-30 18:33 ` [RFC PATCH v4 15/19] sched: Trivial forced-newidle balancer Vineeth Remanan Pillai
@ 2019-10-30 18:33 ` Vineeth Remanan Pillai
  2019-10-30 18:33 ` [RFC PATCH v4 17/19] sched/fair: wrapper for cfs_rq->min_vruntime Vineeth Remanan Pillai
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 84+ messages in thread
From: Vineeth Remanan Pillai @ 2019-10-30 18:33 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: linux-kernel, Dario Faggioli, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

From: Peter Zijlstra <peterz@infradead.org>

Not-Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f0a63fe5f0a4..18fbaa85ec30 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -104,6 +104,10 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 
 	int pa = __task_prio(a), pb = __task_prio(b);
 
+	trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n",
+		     a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline,
+		     b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline);
+
 	if (-pa < -pb)
 		return true;
 
@@ -258,6 +262,8 @@ static void __sched_core_enable(void)
 
 	static_branch_enable(&__sched_core_enabled);
 	stop_machine(__sched_core_stopper, (void *)true, NULL);
+
+	printk("core sched enabled\n");
 }
 
 static void __sched_core_disable(void)
@@ -266,6 +272,8 @@ static void __sched_core_disable(void)
 
 	stop_machine(__sched_core_stopper, (void *)false, NULL);
 	static_branch_disable(&__sched_core_enabled);
+
+	printk("core sched disabled\n");
 }
 
 void sched_core_get(void)
@@ -4083,6 +4091,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			put_prev_task(rq, prev);
 			set_next_task(rq, next);
 		}
+
+		trace_printk("pick pre selected (%u %u %u): %s/%d %lx\n",
+			     rq->core->core_task_seq,
+			     rq->core->core_pick_seq,
+			     rq->core_sched_seq,
+			     next->comm, next->pid,
+			     next->core_cookie);
+
 		return next;
 	}
 
@@ -4162,6 +4178,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			 */
 			if (i == cpu && !need_sync && !p->core_cookie) {
 				next = p;
+				trace_printk("unconstrained pick: %s/%d %lx\n",
+					     next->comm, next->pid, next->core_cookie);
+
 				goto done;
 			}
 
@@ -4170,6 +4189,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 			rq_i->core_pick = p;
 
+			trace_printk("cpu(%d): selected: %s/%d %lx\n",
+				     i, p->comm, p->pid, p->core_cookie);
+
 			/*
 			 * If this new candidate is of higher priority than the
 			 * previous; and they're incompatible; we need to wipe
@@ -4186,6 +4208,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 				rq->core->core_cookie = p->core_cookie;
 				max = p;
 
+				trace_printk("max: %s/%d %lx\n", max->comm, max->pid, max->core_cookie);
+
 				if (old_max) {
 					for_each_cpu(j, smt_mask) {
 						if (j == i)
@@ -4213,6 +4237,7 @@ next_class:;
 	rq->core->core_pick_seq = rq->core->core_task_seq;
 	next = rq->core_pick;
 	rq->core_sched_seq = rq->core->core_pick_seq;
+	trace_printk("picked: %s/%d %lx\n", next->comm, next->pid, next->core_cookie);
 
 	/*
 	 * Reschedule siblings
@@ -4238,11 +4263,20 @@ next_class:;
 		if (i == cpu)
 			continue;
 
-		if (rq_i->curr != rq_i->core_pick)
+		if (rq_i->curr != rq_i->core_pick) {
+			trace_printk("IPI(%d)\n", i);
 			resched_curr(rq_i);
+		}
 
 		/* Did we break L1TF mitigation requirements? */
-		WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+		if (unlikely(!cookie_match(next, rq_i->core_pick))) {
+			trace_printk("[%d]: cookie mismatch. %s/%d/0x%lx/0x%lx\n",
+				     rq_i->cpu, rq_i->core_pick->comm,
+				     rq_i->core_pick->pid,
+				     rq_i->core_pick->core_cookie,
+				     rq_i->core->core_cookie);
+			WARN_ON_ONCE(1);
+		}
 	}
 
 done:
@@ -4281,6 +4315,10 @@ static bool try_steal_cookie(int this, int that)
 		if (p->core_occupation > dst->idle->core_occupation)
 			goto next;
 
+		trace_printk("core fill: %s/%d (%d->%d) %d %d %lx\n",
+			     p->comm, p->pid, that, this,
+			     p->core_occupation, dst->idle->core_occupation, cookie);
+
 		p->on_rq = TASK_ON_RQ_MIGRATING;
 		deactivate_task(src, p, 0);
 		set_task_cpu(p, this);
@@ -6921,6 +6959,8 @@ int sched_cpu_starting(unsigned int cpu)
 		WARN_ON_ONCE(rq->core && rq->core != core_rq);
 		rq->core = core_rq;
 	}
+
+	printk("core: %d -> %d\n", cpu, cpu_of(core_rq));
 #endif /* CONFIG_SCHED_CORE */
 
 	sched_rq_cpu_starting(cpu);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [RFC PATCH v4 17/19] sched/fair: wrapper for cfs_rq->min_vruntime
  2019-10-30 18:33 [RFC PATCH v4 00/19] Core scheduling v4 Vineeth Remanan Pillai
                   ` (15 preceding siblings ...)
  2019-10-30 18:33 ` [RFC PATCH v4 16/19] sched: Debug bits Vineeth Remanan Pillai
@ 2019-10-30 18:33 ` Vineeth Remanan Pillai
  2019-10-30 18:33 ` [RFC PATCH v4 18/19] sched/fair: core wide vruntime comparison Vineeth Remanan Pillai
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 84+ messages in thread
From: Vineeth Remanan Pillai @ 2019-10-30 18:33 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: Aaron Lu, linux-kernel, Dario Faggioli, fweisbec, keescook,
	kerrnel, Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, Aaron Lu

From: Aaron Lu <aaron.lu@linux.alibaba.com>

Add a wrapper function cfs_rq_min_vruntime(cfs_rq) to
return cfs_rq->min_vruntime.

It will be used in the following patch, no functionality
change.

Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com>
---
 kernel/sched/fair.c | 27 ++++++++++++++++-----------
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 840d27628d36..ab32b22b0574 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -450,6 +450,11 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->min_vruntime;
+}
+
 static __always_inline
 void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);
 
@@ -486,7 +491,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
 	struct sched_entity *curr = cfs_rq->curr;
 	struct rb_node *leftmost = rb_first_cached(&cfs_rq->tasks_timeline);
 
-	u64 vruntime = cfs_rq->min_vruntime;
+	u64 vruntime = cfs_rq_min_vruntime(cfs_rq);
 
 	if (curr) {
 		if (curr->on_rq)
@@ -506,7 +511,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
 	}
 
 	/* ensure we never gain time by being placed backwards. */
-	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
+	cfs_rq->min_vruntime = max_vruntime(cfs_rq_min_vruntime(cfs_rq), vruntime);
 #ifndef CONFIG_64BIT
 	smp_wmb();
 	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
@@ -3816,7 +3821,7 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq) {}
 static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 #ifdef CONFIG_SCHED_DEBUG
-	s64 d = se->vruntime - cfs_rq->min_vruntime;
+	s64 d = se->vruntime - cfs_rq_min_vruntime(cfs_rq);
 
 	if (d < 0)
 		d = -d;
@@ -3829,7 +3834,7 @@ static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
 static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {
-	u64 vruntime = cfs_rq->min_vruntime;
+	u64 vruntime = cfs_rq_min_vruntime(cfs_rq);
 
 	/*
 	 * The 'current' period is already promised to the current tasks,
@@ -3922,7 +3927,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * update_curr().
 	 */
 	if (renorm && curr)
-		se->vruntime += cfs_rq->min_vruntime;
+		se->vruntime += cfs_rq_min_vruntime(cfs_rq);
 
 	update_curr(cfs_rq);
 
@@ -3933,7 +3938,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * fairness detriment of existing tasks.
 	 */
 	if (renorm && !curr)
-		se->vruntime += cfs_rq->min_vruntime;
+		se->vruntime += cfs_rq_min_vruntime(cfs_rq);
 
 	/*
 	 * When enqueuing a sched_entity, we must:
@@ -4046,7 +4051,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * can move min_vruntime forward still more.
 	 */
 	if (!(flags & DEQUEUE_SLEEP))
-		se->vruntime -= cfs_rq->min_vruntime;
+		se->vruntime -= cfs_rq_min_vruntime(cfs_rq);
 
 	/* return excess runtime on last dequeue */
 	return_cfs_rq_runtime(cfs_rq);
@@ -6542,7 +6547,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 			min_vruntime = cfs_rq->min_vruntime;
 		} while (min_vruntime != min_vruntime_copy);
 #else
-		min_vruntime = cfs_rq->min_vruntime;
+		min_vruntime = cfs_rq_min_vruntime(cfs_rq);
 #endif
 
 		se->vruntime -= min_vruntime;
@@ -10007,7 +10012,7 @@ static void task_fork_fair(struct task_struct *p)
 		resched_curr(rq);
 	}
 
-	se->vruntime -= cfs_rq->min_vruntime;
+	se->vruntime -= cfs_rq_min_vruntime(cfs_rq);
 	rq_unlock(rq, &rf);
 }
 
@@ -10127,7 +10132,7 @@ static void detach_task_cfs_rq(struct task_struct *p)
 		 * cause 'unlimited' sleep bonus.
 		 */
 		place_entity(cfs_rq, se, 0);
-		se->vruntime -= cfs_rq->min_vruntime;
+		se->vruntime -= cfs_rq_min_vruntime(cfs_rq);
 	}
 
 	detach_entity_cfs_rq(se);
@@ -10141,7 +10146,7 @@ static void attach_task_cfs_rq(struct task_struct *p)
 	attach_entity_cfs_rq(se);
 
 	if (!vruntime_normalized(p))
-		se->vruntime += cfs_rq->min_vruntime;
+		se->vruntime += cfs_rq_min_vruntime(cfs_rq);
 }
 
 static void switched_from_fair(struct rq *rq, struct task_struct *p)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [RFC PATCH v4 18/19] sched/fair: core wide vruntime comparison
  2019-10-30 18:33 [RFC PATCH v4 00/19] Core scheduling v4 Vineeth Remanan Pillai
                   ` (16 preceding siblings ...)
  2019-10-30 18:33 ` [RFC PATCH v4 17/19] sched/fair: wrapper for cfs_rq->min_vruntime Vineeth Remanan Pillai
@ 2019-10-30 18:33 ` Vineeth Remanan Pillai
  2019-10-30 18:33 ` [RFC PATCH v4 19/19] sched/fair : Wake up forced idle siblings if needed Vineeth Remanan Pillai
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 84+ messages in thread
From: Vineeth Remanan Pillai @ 2019-10-30 18:33 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: Aaron Lu, linux-kernel, Dario Faggioli, fweisbec, keescook,
	kerrnel, Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, Aaron Lu,
	Vineeth Remanan Pillai

From: Aaron Lu <aaron.lu@linux.alibaba.com>

This patch provides a vruntime based way to compare two cfs task's
priority, be it on the same cpu or different threads of the same core.

When the two tasks are on the same CPU, we just need to find a common
cfs_rq both sched_entities are on and then do the comparison.

When the two tasks are on different threads of the same core, the root
level sched_entities to which the two tasks belong will be used to do
the comparison.

An ugly illustration for the cross CPU case:

   cpu0         cpu1
 /   |  \     /   |  \
se1 se2 se3  se4 se5 se6
    /  \            /   \
  se21 se22       se61  se62

Assume CPU0 and CPU1 are smt siblings and task A's se is se21 while
task B's se is se61. To compare the priority of task A and task B, we
compare the priority of se2 and se6: whichever has the smaller vruntime
wins.

To make this work, the root level se should have a common cfs_rq min
vruntime, which I call the core cfs_rq min vruntime.

When we adjust the min_vruntime of rq->core, we need to propagate
that down the tree so as to not cause starvation of existing tasks
based on previous vruntime.
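
Condensed, the cross-CPU branch of cfs_prio_less() in the diff below
amounts to:

  /* walk both sched entities up to their root se */
  while (sea->parent)
          sea = sea->parent;
  while (seb->parent)
          seb = seb->parent;

  /*
   * Both root cfs_rq's report the core wide min_vruntime through
   * cfs_rq_min_vruntime(), so the root se vruntimes are comparable.
   */
  return (s64)(sea->vruntime - seb->vruntime) > 0;  /* true: b wins */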

Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com>
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
---
 kernel/sched/core.c  | 15 +------
 kernel/sched/fair.c  | 99 +++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |  2 +
 3 files changed, 102 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 18fbaa85ec30..09e5c77e54c3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -117,19 +117,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
 		return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
-		u64 vruntime = b->se.vruntime;
-
-		/*
-		 * Normalize the vruntime if tasks are in different cpus.
-		 */
-		if (task_cpu(a) != task_cpu(b)) {
-			vruntime -= task_cfs_rq(b)->min_vruntime;
-			vruntime += task_cfs_rq(a)->min_vruntime;
-		}
-
-		return !((s64)(a->se.vruntime - vruntime) <= 0);
-	}
+	if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
+		return cfs_prio_less(a, b);
 
 	return false;
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ab32b22b0574..e8dd78a8c54d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -450,9 +450,105 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+static inline struct cfs_rq *root_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return &rq_of(cfs_rq)->cfs;
+}
+
+static inline bool is_root_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq == root_cfs_rq(cfs_rq);
+}
+
+static inline struct cfs_rq *core_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return &rq_of(cfs_rq)->core->cfs;
+}
+
 static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
 {
-	return cfs_rq->min_vruntime;
+	if (!sched_core_enabled(rq_of(cfs_rq)))
+		return cfs_rq->min_vruntime;
+
+	if (is_root_cfs_rq(cfs_rq))
+		return core_cfs_rq(cfs_rq)->min_vruntime;
+	else
+		return cfs_rq->min_vruntime;
+}
+
+static void coresched_adjust_vruntime(struct cfs_rq *cfs_rq, u64 delta)
+{
+	struct sched_entity *se, *next;
+
+	if (!cfs_rq)
+		return;
+
+	cfs_rq->min_vruntime -= delta;
+	rbtree_postorder_for_each_entry_safe(se, next,
+			&cfs_rq->tasks_timeline.rb_root, run_node) {
+		if (se->vruntime > delta)
+			se->vruntime -= delta;
+		if (se->my_q)
+			coresched_adjust_vruntime(se->my_q, delta);
+	}
+}
+
+static void update_core_cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
+{
+	struct cfs_rq *cfs_rq_core;
+
+	if (!sched_core_enabled(rq_of(cfs_rq)))
+		return;
+
+	if (!is_root_cfs_rq(cfs_rq))
+		return;
+
+	cfs_rq_core = core_cfs_rq(cfs_rq);
+	if (cfs_rq_core != cfs_rq &&
+	    cfs_rq->min_vruntime < cfs_rq_core->min_vruntime) {
+		u64 delta = cfs_rq_core->min_vruntime - cfs_rq->min_vruntime;
+		coresched_adjust_vruntime(cfs_rq_core, delta);
+	}
+}
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
+{
+	struct sched_entity *sea = &a->se;
+	struct sched_entity *seb = &b->se;
+	bool samecpu = task_cpu(a) == task_cpu(b);
+	struct task_struct *p;
+	s64 delta;
+
+	if (samecpu) {
+		/* vruntime is per cfs_rq */
+		while (!is_same_group(sea, seb)) {
+			int sea_depth = sea->depth;
+			int seb_depth = seb->depth;
+
+			if (sea_depth >= seb_depth)
+				sea = parent_entity(sea);
+			if (sea_depth <= seb_depth)
+				seb = parent_entity(seb);
+		}
+
+		delta = (s64)(sea->vruntime - seb->vruntime);
+		goto out;
+	}
+
+	/* crosscpu: compare root level se's vruntime to decide priority */
+	while (sea->parent)
+		sea = sea->parent;
+	while (seb->parent)
+		seb = seb->parent;
+	delta = (s64)(sea->vruntime - seb->vruntime);
+
+out:
+	p = delta > 0 ? b : a;
+	trace_printk("picked %s/%d %s: %Ld %Ld %Ld\n", p->comm, p->pid,
+			samecpu ? "samecpu" : "crosscpu",
+			sea->vruntime, seb->vruntime, delta);
+
+	return delta > 0;
 }
 
 static __always_inline
@@ -512,6 +608,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
 
 	/* ensure we never gain time by being placed backwards. */
 	cfs_rq->min_vruntime = max_vruntime(cfs_rq_min_vruntime(cfs_rq), vruntime);
+	update_core_cfs_rq_min_vruntime(cfs_rq);
 #ifndef CONFIG_64BIT
 	smp_wmb();
 	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 311ab1e2a00e..4844e703298a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2537,3 +2537,5 @@ static inline bool sched_energy_enabled(void)
 static inline bool sched_energy_enabled(void) { return false; }
 
 #endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [RFC PATCH v4 19/19] sched/fair : Wake up forced idle siblings if needed
  2019-10-30 18:33 [RFC PATCH v4 00/19] Core scheduling v4 Vineeth Remanan Pillai
                   ` (17 preceding siblings ...)
  2019-10-30 18:33 ` [RFC PATCH v4 18/19] sched/fair: core wide vruntime comparison Vineeth Remanan Pillai
@ 2019-10-30 18:33 ` Vineeth Remanan Pillai
  2019-10-31 11:42 ` [RFC PATCH v4 00/19] Core scheduling v4 Li, Aubrey
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 84+ messages in thread
From: Vineeth Remanan Pillai @ 2019-10-30 18:33 UTC (permalink / raw)
  To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds
  Cc: Aaron Lu, linux-kernel, Dario Faggioli, fweisbec, keescook,
	kerrnel, Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini, Vineeth Remanan Pillai

From: Aaron Lu <aaron.lu@linux.alibaba.com>

If the sibling of a forced idle cpu has only one task and if it has
used up its timeslice, then we should try to wake up the forced idle
cpu to give the starving task on it a chance.
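
Condensed, the tick-path check added below amounts to (the CPU itself
and offline siblings are skipped in the actual code):

  /* tick on a CPU whose single runnable task has used up its slice */
  if (rq->cfs.nr_running == 1 && __entity_slice_used(&curr->se)) {
          for_each_cpu(sibling, cpu_smt_mask(cpu)) {
                  if (cpu_rq(sibling)->core_forceidle)
                          resched_curr(cpu_rq(sibling));
          }
  }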

Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
Signed-off-by: Julien Desfossez <jdesfossez@digitalocean.com>
---
 kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e8dd78a8c54d..9d4cc97d4dd8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4165,6 +4165,13 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		update_min_vruntime(cfs_rq);
 }
 
+static inline bool
+__entity_slice_used(struct sched_entity *se)
+{
+	return (se->sum_exec_runtime - se->prev_sum_exec_runtime) >
+		sched_slice(cfs_rq_of(se), se);
+}
+
 /*
  * Preempt the current task with a newly woken task if needed:
  */
@@ -10052,6 +10059,34 @@ static void rq_offline_fair(struct rq *rq)
 
 #endif /* CONFIG_SMP */
 
+#ifdef CONFIG_SCHED_CORE
+/*
+ * If runqueue has only one task which used up its slice and
+ * if the sibling is forced idle, then trigger schedule
+ * to give forced idle task a chance.
+ */
+static void resched_forceidle_sibling(struct rq *rq, struct sched_entity *se)
+{
+	int cpu = cpu_of(rq), sibling_cpu;
+	if (rq->cfs.nr_running > 1 || !__entity_slice_used(se))
+		return;
+
+	for_each_cpu(sibling_cpu, cpu_smt_mask(cpu)) {
+		struct rq *sibling_rq;
+		if (sibling_cpu == cpu)
+			continue;
+		if (cpu_is_offline(sibling_cpu))
+			continue;
+
+		sibling_rq = cpu_rq(sibling_cpu);
+		if (sibling_rq->core_forceidle) {
+			resched_curr(sibling_rq);
+		}
+	}
+}
+#endif
+
+
 /*
  * scheduler tick hitting a task of our scheduling class.
  *
@@ -10075,6 +10110,11 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	update_misfit_status(curr, rq);
 	update_overutilized_status(task_rq(curr));
+
+#ifdef CONFIG_SCHED_CORE
+	if (sched_core_enabled(rq))
+		resched_forceidle_sibling(rq, &curr->se);
+#endif
 }
 
 /*
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2019-10-30 18:33 [RFC PATCH v4 00/19] Core scheduling v4 Vineeth Remanan Pillai
                   ` (18 preceding siblings ...)
  2019-10-30 18:33 ` [RFC PATCH v4 19/19] sched/fair : Wake up forced idle siblings if needed Vineeth Remanan Pillai
@ 2019-10-31 11:42 ` Li, Aubrey
  2019-11-01 11:33   ` Li, Aubrey
  2019-10-31 18:42 ` Phil Auld
  2019-11-11 19:10 ` Tim Chen
  21 siblings, 1 reply; 84+ messages in thread
From: Li, Aubrey @ 2019-10-31 11:42 UTC (permalink / raw)
  To: Vineeth Remanan Pillai, Nishanth Aravamudan, Julien Desfossez,
	Peter Zijlstra, Tim Chen, mingo, tglx, pjt, torvalds
  Cc: linux-kernel, Dario Faggioli, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 2019/10/31 2:33, Vineeth Remanan Pillai wrote:
> Fourth iteration of the Core-Scheduling feature.
> 
> This version was aiming mostly at addressing the vruntime comparison
> issues with v3. The main issue seen in v3 was the starvation of
> interactive tasks when competing with cpu intensive tasks. This issue
> is mitigated to a large extent.
> 
> We have tested and verified that incompatible processes are not
> selected during schedule. In terms of performance, the impact
> depends on the workload:
> - on CPU intensive applications that use all the logical CPUs with
>   SMT enabled, enabling core scheduling performs better than nosmt.
> - on mixed workloads with considerable io compared to cpu usage,
>   nosmt seems to perform better than core scheduling.
> 
> v4 is rebased on top of 5.3.5(dc073f193b70):
> https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.3.5

Thanks for posting V4. I refreshed the data on my side. Since we have
played with Aaron's core vruntime patch for a while, there are no
surprises in the result.

Thanks,
-Aubrey

Environment setup
--------------------------
Skylake 8170 server, 2 NUMA nodes, 52 cores, 104 CPUs (HT on)

Case 1:
-------
cgroup1 workload, sysbench CPU mode (non AVX workload)
cgroup2 workload, gemmbench (AVX512 workload)

sysbench throughput result:
.--------------------------------------------------------------------------------------------------------------------------------------.
|NA/AVX	vanilla-SMT	[std% / sem%]	  cpu% |coresched-SMT	[std% / sem%]	  +/-	  cpu% |  no-SMT [std% / sem%]	 +/-	  cpu% |
|--------------------------------------------------------------------------------------------------------------------------------------|
|  1/1	     1269.1	[ 0.1%/ 0.0%]	  1.9% |       1272.4	[ 0.1%/ 0.0%]	  0.3%	  1.9% |  1272.0 [ 0.1%/ 0.0%]   0.2%	  3.9% |
|  2/2	     2466.9	[ 0.6%/ 0.1%]	  3.9% |       2534.2	[ 0.6%/ 0.1%]	  2.7%	  3.8% |  2511.9 [ 0.2%/ 0.0%]   1.8%	  7.7% |
|  4/4	     4725.2	[ 0.3%/ 0.0%]	  7.7% |       4806.3	[ 0.2%/ 0.0%]	  1.7%	  7.7% |  4786.7 [ 0.9%/ 0.1%]   1.3%	 14.6% |
|  8/8	     9353.4	[ 0.1%/ 0.0%]	 14.6% |       9357.4	[ 0.1%/ 0.0%]	  0.0%	 14.6% |  9352.3 [ 0.1%/ 0.0%]  -0.0%	 30.0% |
| 16/16	    17543.1	[ 1.0%/ 0.1%]	 30.1% |      18120.7	[ 0.2%/ 0.0%]	  3.3%	 30.1% | 17864.8 [ 1.2%/ 0.1%]   1.8%	 60.1% |
| 32/32	    26968.8	[ 3.9%/ 0.4%]	 60.1% |      29448.9	[ 3.5%/ 0.3%]	  9.2%	 59.9% | 25308.1 [10.7%/ 0.9%]  -6.2%	 97.7% |
| 48/48	    30466.2	[10.4%/ 1.0%]	 89.3% |      38624.4	[ 4.2%/ 0.4%]	 26.8%	 89.1% | 26891.2 [14.8%/ 1.0%] -11.7%	 99.5% |
| 64/64	    37909.3	[11.1%/ 1.1%]	 97.7% |      41671.7	[ 8.7%/ 0.9%]	  9.9%	 97.6% | 25898.3 [16.2%/ 1.0%] -31.7%	100.0% |
|128/128    39479.4	[24.6%/ 2.5%]	100.0% |      42119.6	[ 6.3%/ 0.6%]	  6.7%	 99.5% | 26830.1 [16.5%/ 1.1%] -32.0%	100.0% |
|256/256    42602.1	[16.4%/ 1.6%]	100.0% |      40041.3	[ 7.0%/ 0.7%]	 -6.0%	 99.7% | 27634.7 [15.4%/ 1.1%] -35.1%	100.0% |
'--------------------------------------------------------------------------------------------------------------------------------------'

Case 2
------
cgroup1 workload, sysbench MySQL (non AVX workload)
cgroup2 workload, gemmbench (AVX512 workload)

sysbench throughput result:
.--------------------------------------------------------------------------------------------------------------------------------------.
|NA/AVX	vanilla-SMT	[std% / sem%]	  cpu% |coresched-SMT	[std% / sem%]	  +/-	  cpu% |  no-SMT [std% / sem%]	 +/-	  cpu% |
|--------------------------------------------------------------------------------------------------------------------------------------|
|  1/1	     1018.2	[ 1.0%/ 0.1%]	  1.9% |        915.8	[ 0.9%/ 0.1%]	-10.1%	  1.9% |   994.0 [ 1.4%/ 0.2%]  -2.4%	  3.9% |
|  2/2	     1941.2	[ 0.7%/ 0.1%]	  3.9% |       1746.0	[ 0.5%/ 0.1%]	-10.1%	  3.9% |  1946.2 [ 0.8%/ 0.1%]   0.3%	  7.8% |
|  4/4	     3763.9	[ 0.5%/ 0.0%]	  7.8% |       3426.0	[ 1.5%/ 0.2%]	 -9.0%	  7.8% |  3745.1 [ 1.1%/ 0.1%]  -0.5%	 15.6% |
|  8/8	     7375.5	[ 1.3%/ 0.1%]	 15.5% |       6647.1	[ 1.1%/ 0.1%]	 -9.9%	 16.1% |  7368.4 [ 0.8%/ 0.1%]  -0.1%	 31.1% |
| 16/16	    12990.3	[ 0.6%/ 0.1%]	 31.1% |      10903.7	[ 1.9%/ 0.2%]	-16.1%	 32.0% | 12082.6 [ 4.7%/ 0.5%]  -7.0%	 62.9% |
| 32/32	    18238.1	[ 6.1%/ 0.6%]	 62.1% |      16580.8	[ 3.0%/ 0.3%]	 -9.1%	 62.8% | 21193.6 [ 4.9%/ 0.6%]  16.2%	 97.8% |
| 48/48	    21708.6	[ 8.3%/ 0.8%]	 90.3% |      17064.1	[ 9.5%/ 0.9%]	-21.4%	 90.4% | 18531.4 [16.6%/ 1.8%] -14.6%	 99.5% |
| 64/64	    18636.9	[13.1%/ 1.3%]	 97.9% |      12376.1	[20.9%/ 2.1%]	-33.6%	 96.8% | 20025.8 [14.9%/ 2.4%]   7.5%	100.0% |
|128/128    16204.2	[16.8%/ 1.7%]	 99.4% |       3776.1	[88.7%/ 8.9%]	-76.7%	 97.6% | 20263.5 [12.7%/ 6.8%]  25.1%	100.0% |
|256/256    16730.5	[17.9%/ 1.8%]	 98.9% |       1499.7	[210.3%/21.0%]	-91.0%	 98.4% | 17633.1 [ 7.5%/ 8.9%]   5.4%	100.0% |
'--------------------------------------------------------------------------------------------------------------------------------------'

And for this case, we care about sysbench MySQL latency(ms):
.--------------------------------------------------------------------------------------------------------------------------------------.
|NA/AVX	vanilla-SMT	[std% / sem%]	  cpu% |coresched-SMT	[std% / sem%]	  +/-	  cpu% |  no-SMT [std% / sem%]	 +/-	  cpu% |
|--------------------------------------------------------------------------------------------------------------------------------------|
|  1/1	        1.1	[ 3.7%/ 0.4%]	  1.9% |          1.1	[ 1.0%/ 0.1%]	 -8.9%	  1.9% |     1.1 [ 4.1%/ 0.4%]  -2.0%	  3.9% |
|  2/2	        1.1	[ 0.7%/ 0.1%]	  3.9% |          1.2	[ 0.8%/ 0.1%]	-10.8%	  3.9% |     1.1 [ 0.8%/ 0.1%]   0.2%	  7.8% |
|  4/4	        1.1	[ 0.7%/ 0.1%]	  7.8% |          1.3	[ 3.8%/ 0.4%]	-11.8%	  7.8% |     1.2 [ 2.2%/ 0.2%]  -1.1%	 15.6% |
|  8/8	        1.2	[ 2.2%/ 0.2%]	 15.5% |          1.3	[ 3.0%/ 0.3%]	-11.7%	 16.1% |     1.2 [ 1.8%/ 0.2%]   0.4%	 31.1% |
| 16/16	        1.4	[ 1.5%/ 0.1%]	 31.1% |          2.0	[ 8.2%/ 0.8%]	-45.8%	 32.0% |     1.9 [18.2%/ 1.7%] -33.2%	 62.9% |
| 32/32	        2.4	[ 6.6%/ 0.7%]	 62.1% |          2.6	[ 3.1%/ 0.3%]	 -6.2%	 62.8% |     2.2 [23.5%/ 2.0%]   8.5%	 97.8% |
| 48/48	        2.7	[ 5.3%/ 0.5%]	 90.3% |          3.4	[ 3.5%/ 0.4%]	-26.1%	 90.4% |     6.2 [19.3%/ 3.5%] -128.0%	 99.5% |
| 64/64	        5.9	[13.0%/ 1.3%]	 97.9% |          8.3	[ 9.8%/ 1.0%]	-40.1%	 96.8% |     7.4 [16.6%/ 1.5%] -25.1%	100.0% |
|128/128       17.4	[46.8%/ 4.7%]	 99.4% |        248.0	[146.9%/14.7%]	-1327.8% 97.6% |    11.0 [10.5%/ 0.0%]  36.7%	100.0% |
|256/256       33.5	[67.1%/ 6.7%]	 98.9% |       1279.5	[245.6%/24.6%]	-3716.6% 98.4% |    21.5 [21.5%/ 0.0%]  36.0%	100.0% |
'--------------------------------------------------------------------------------------------------------------------------------------'

Note:
----
64/64:		64 sysbench threads(in one cgroup) and 64 gemmbench threads(in other cgroup) run simultaneously.
Vanilla-SMT:	baseline with HT on
coresched-SMT:	core scheduling enabled
no-SMT:		HT off thru /sys/devices/system/cpu/smt/control
std%:		standard deviation
sem%:		standard error of the mean
+/-:		improvement/regression against baseline
cpu%:		derived by vmstat.idle and vmstat.iowait

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2019-10-30 18:33 [RFC PATCH v4 00/19] Core scheduling v4 Vineeth Remanan Pillai
                   ` (19 preceding siblings ...)
  2019-10-31 11:42 ` [RFC PATCH v4 00/19] Core scheduling v4 Li, Aubrey
@ 2019-10-31 18:42 ` Phil Auld
  2019-11-01 14:03   ` Vineeth Remanan Pillai
  2019-11-11 19:10 ` Tim Chen
  21 siblings, 1 reply; 84+ messages in thread
From: Phil Auld @ 2019-10-31 18:42 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	mingo, tglx, pjt, torvalds, linux-kernel, Dario Faggioli,
	fweisbec, keescook, kerrnel, Aaron Lu, Aubrey Li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

Hi Vineeth,

On Wed, Oct 30, 2019 at 06:33:13PM +0000 Vineeth Remanan Pillai wrote:
> Fourth iteration of the Core-Scheduling feature.
> 
> This version was aiming mostly at addressing the vruntime comparison
> issues with v3. The main issue seen in v3 was the starvation of
> interactive tasks when competing with cpu intensive tasks. This issue
> is mitigated to a large extent.
> 
> We have tested and verified that incompatible processes are not
> selected during schedule. In terms of performance, the impact
> depends on the workload:
> - on CPU intensive applications that use all the logical CPUs with
>   SMT enabled, enabling core scheduling performs better than nosmt.
> - on mixed workloads with considerable io compared to cpu usage,
>   nosmt seems to perform better than core scheduling.
> 
> v4 is rebased on top of 5.3.5(dc073f193b70):
> https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.3.5
> 
> Changes in v4
> -------------
> - Implement a core wide min_vruntime for vruntime comparison of tasks
>   across cpus in a core.
> - Fixes a typo bug in setting the forced_idle cpu.
> 
> Changes in v3
> -------------
> - Fixes the issue of sibling picking up an incompatible task
>   - Aaron Lu
>   - Vineeth Pillai
>   - Julien Desfossez
> - Fixes the issue of starving threads due to forced idle
>   - Peter Zijlstra
> - Fixes the refcounting issue when deleting a cgroup with tag
>   - Julien Desfossez
> - Fixes a crash during cpu offline/online with coresched enabled
>   - Vineeth Pillai
> - Fixes a comparison logic issue in sched_core_find
>   - Aaron Lu
> 
> Changes in v2
> -------------
> - Fixes for couple of NULL pointer dereference crashes
>   - Subhra Mazumdar
>   - Tim Chen
> - Improves priority comparison logic for process in different cpus
>   - Peter Zijlstra
>   - Aaron Lu
> - Fixes a hard lockup in rq locking
>   - Vineeth Pillai
>   - Julien Desfossez
> - Fixes a performance issue seen on IO heavy workloads
>   - Vineeth Pillai
>   - Julien Desfossez
> - Fix for 32bit build
>   - Aubrey Li
> 
> TODO
> ----
> - Decide on the API for exposing the feature to userland
> - Investigate the source of the overhead even when no tasks are tagged:
>   https://lkml.org/lkml/2019/10/29/242
> - Investigate the performance scaling issue when we have a high number of
>   tagged threads: https://lkml.org/lkml/2019/10/29/248
> - Try to optimize the performance for IO-demanding applications:
>   https://lkml.org/lkml/2019/10/29/261
> 
> ---
> 
> Aaron Lu (3):
>   sched/fair: wrapper for cfs_rq->min_vruntime
>   sched/fair: core wide vruntime comparison
>   sched/fair : Wake up forced idle siblings if needed
> 
> Peter Zijlstra (16):
>   stop_machine: Fix stop_cpus_in_progress ordering
>   sched: Fix kerneldoc comment for ia64_set_curr_task
>   sched: Wrap rq::lock access
>   sched/{rt,deadline}: Fix set_next_task vs pick_next_task
>   sched: Add task_struct pointer to sched_class::set_curr_task
>   sched/fair: Export newidle_balance()
>   sched: Allow put_prev_task() to drop rq->lock
>   sched: Rework pick_next_task() slow-path
>   sched: Introduce sched_class::pick_task()
>   sched: Core-wide rq->lock
>   sched: Basic tracking of matching tasks
>   sched: A quick and dirty cgroup tagging interface
>   sched: Add core wide task selection and scheduling.
>   sched/fair: Add a few assertions
>   sched: Trivial forced-newidle balancer
>   sched: Debug bits...
> 
>  include/linux/sched.h    |   9 +-
>  kernel/Kconfig.preempt   |   6 +
>  kernel/sched/core.c      | 847 +++++++++++++++++++++++++++++++++++++--
>  kernel/sched/cpuacct.c   |  12 +-
>  kernel/sched/deadline.c  |  99 +++--
>  kernel/sched/debug.c     |   4 +-
>  kernel/sched/fair.c      | 346 +++++++++++-----
>  kernel/sched/idle.c      |  42 +-
>  kernel/sched/pelt.h      |   2 +-
>  kernel/sched/rt.c        |  96 ++---
>  kernel/sched/sched.h     | 246 +++++++++---
>  kernel/sched/stop_task.c |  35 +-
>  kernel/sched/topology.c  |   4 +-
>  kernel/stop_machine.c    |   2 +
>  14 files changed, 1399 insertions(+), 351 deletions(-)
> 
> -- 
> 2.17.1
> 

Unless I'm mistaken 7 of the first 8 of these went into sched/core
and are now in linux (from v5.4-rc1). It may make sense to rebase on 
that and simplify the series.


Cheers,
Phil
-- 


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2019-10-31 11:42 ` [RFC PATCH v4 00/19] Core scheduling v4 Li, Aubrey
@ 2019-11-01 11:33   ` Li, Aubrey
  2019-11-08  3:20     ` Li, Aubrey
  0 siblings, 1 reply; 84+ messages in thread
From: Li, Aubrey @ 2019-11-01 11:33 UTC (permalink / raw)
  To: Vineeth Remanan Pillai, Nishanth Aravamudan, Julien Desfossez,
	Peter Zijlstra, Tim Chen, mingo, tglx, pjt, torvalds
  Cc: linux-kernel, Dario Faggioli, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 2019/10/31 19:42, Li, Aubrey wrote:
> On 2019/10/31 2:33, Vineeth Remanan Pillai wrote:
>> Fourth iteration of the Core-Scheduling feature.
>>
>> This version was aiming mostly at addressing the vruntime comparison
>> issues with v3. The main issue seen in v3 was the starvation of
>> interactive tasks when competing with cpu intensive tasks. This issue
>> is mitigated to a large extent.
>>
>> We have tested and verified that incompatible processes are not
>> selected during schedule. In terms of performance, the impact
>> depends on the workload:
>> - on CPU intensive applications that use all the logical CPUs with
>>   SMT enabled, enabling core scheduling performs better than nosmt.
>> - on mixed workloads with considerable io compared to cpu usage,
>>   nosmt seems to perform better than core scheduling.
>>
>> v4 is rebased on top of 5.3.5(dc073f193b70):
>> https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.3.5
> 
> Thanks for posting V4. I refreshed the data on my side. Since we have
> played with Aaron's core vruntime patch for a while, there are no
> surprises in the result.
> 
I have three small patches against V4. The basic idea (condensed in the
sketch after this list) is:
- for load balance, don't pull/push a task if its cookie does not match
the destination CPU's core cookie
- for task wakeup, select an idle CPU whose core cookie matches the
task's cookie.
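
Both checks reduce to the same predicate; as an illustration only (the
helper name below is made up, the patches open-code the test):

  static inline bool core_cookie_match(struct task_struct *p, int cpu)
  {
          struct rq *rq = cpu_rq(cpu);

          if (!sched_core_enabled(rq))
                  return true;

          return p->core_cookie == rq->core->core_cookie;
  }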

Sysbench MySQL result shows significant improvement for the overload cases.

Here is original data of coresched_v4(for comparison):
.--------------------------------------------------------------------------------------------------------------------------------------.
|NA/AVX	vanilla-SMT	[std% / sem%]	  cpu% |coresched-SMT	[std% / sem%]	  +/-	  cpu% |  no-SMT [std% / sem%]	 +/-	  cpu% |
|--------------------------------------------------------------------------------------------------------------------------------------|
|  1/1	     1018.2	[ 1.0%/ 0.1%]	  1.9% |        915.8	[ 0.9%/ 0.1%]	-10.1%	  1.9% |   994.0 [ 1.4%/ 0.2%]  -2.4%	  3.9% |
|  2/2	     1941.2	[ 0.7%/ 0.1%]	  3.9% |       1746.0	[ 0.5%/ 0.1%]	-10.1%	  3.9% |  1946.2 [ 0.8%/ 0.1%]   0.3%	  7.8% |
|  4/4	     3763.9	[ 0.5%/ 0.0%]	  7.8% |       3426.0	[ 1.5%/ 0.2%]	 -9.0%	  7.8% |  3745.1 [ 1.1%/ 0.1%]  -0.5%	 15.6% |
|  8/8	     7375.5	[ 1.3%/ 0.1%]	 15.5% |       6647.1	[ 1.1%/ 0.1%]	 -9.9%	 16.1% |  7368.4 [ 0.8%/ 0.1%]  -0.1%	 31.1% |
| 16/16	    12990.3	[ 0.6%/ 0.1%]	 31.1% |      10903.7	[ 1.9%/ 0.2%]	-16.1%	 32.0% | 12082.6 [ 4.7%/ 0.5%]  -7.0%	 62.9% |
| 32/32	    18238.1	[ 6.1%/ 0.6%]	 62.1% |      16580.8	[ 3.0%/ 0.3%]	 -9.1%	 62.8% | 21193.6 [ 4.9%/ 0.6%]  16.2%	 97.8% |
| 48/48	    21708.6	[ 8.3%/ 0.8%]	 90.3% |      17064.1	[ 9.5%/ 0.9%]	-21.4%	 90.4% | 18531.4 [16.6%/ 1.8%] -14.6%	 99.5% |
| 64/64	    18636.9	[13.1%/ 1.3%]	 97.9% |      12376.1	[20.9%/ 2.1%]	-33.6%	 96.8% | 20025.8 [14.9%/ 2.4%]   7.5%	100.0% |
|128/128    16204.2	[16.8%/ 1.7%]	 99.4% |       3776.1	[88.7%/ 8.9%]	-76.7%	 97.6% | 20263.5 [12.7%/ 6.8%]  25.1%	100.0% |
|256/256    16730.5	[17.9%/ 1.8%]	 98.9% |       1499.7	[210.3%/21.0%]	-91.0%	 98.4% | 17633.1 [ 7.5%/ 8.9%]   5.4%	100.0% |
'--------------------------------------------------------------------------------------------------------------------------------------'

And the following is the new data with cookie match checking,
coresched-SMT is coresched_v4 plus my patches.
.--------------------------------------------------------------------------------------------------------------------------------------.
|NA/AVX	vanilla-SMT	[std% / sem%]	  cpu% |coresched-SMT	[std% / sem%]	  +/-	  cpu% |  no-SMT [std% / sem%]	 +/-	  cpu% |
|--------------------------------------------------------------------------------------------------------------------------------------|
|  1/1	     1018.2	[ 1.0%/ 0.1%]	  1.9% |        901.2	[ 0.7%/ 0.1%]	-11.5%	  1.9% |   994.0 [ 1.4%/ 0.2%]  -2.4%	  3.9% |
|  2/2	     1941.2	[ 0.7%/ 0.1%]	  3.9% |       1733.9	[ 0.8%/ 0.1%]	-10.7%	  3.9% |  1946.2 [ 0.8%/ 0.1%]   0.3%	  7.8% |
|  4/4	     3763.9	[ 0.5%/ 0.0%]	  7.8% |       3429.4	[ 1.0%/ 0.1%]	 -8.9%	  7.8% |  3745.1 [ 1.1%/ 0.1%]  -0.5%	 15.6% |
|  8/8	     7375.5	[ 1.3%/ 0.1%]	 15.5% |       6571.3	[ 1.2%/ 0.1%]	-10.9%	 16.1% |  7368.4 [ 0.8%/ 0.1%]  -0.1%	 31.1% |
| 16/16	    12990.3	[ 0.6%/ 0.1%]	 31.1% |       9365.1	[ 2.6%/ 0.3%]	-27.9%	 31.1% | 12082.6 [ 4.7%/ 0.6%]  -7.0%	 62.9% |
| 32/32	    18238.1	[ 6.1%/ 0.6%]	 62.1% |      16181.9	[ 4.0%/ 0.4%]	-11.3%	 63.6% | 21193.6 [ 4.9%/ 0.6%]  16.2%	 97.8% |
| 48/48	    21708.6	[ 8.3%/ 0.8%]	 90.3% |      18390.4	[ 6.0%/ 0.6%]	-15.3%	 85.3% | 18531.4 [16.6%/ 1.7%] -14.6%	 99.5% |
| 64/64	    18636.9	[13.1%/ 1.3%]	 97.9% |      12831.7	[22.5%/ 2.2%]	-31.1%	 96.2% | 20025.8 [14.9%/ 2.3%]   7.5%	100.0% |
|128/128    16204.2	[16.8%/ 1.7%]	 99.4% |      10199.4	[16.3%/ 1.6%]	-37.1%	 97.9% | 20263.5 [12.7%/ 2.5%]  25.1%	100.0% |
|256/256    16730.5	[17.9%/ 1.8%]	 98.9% |       7739.4	[15.6%/ 1.6%]	-53.7%	 96.4% | 17633.1 [ 7.5%/ 1.7%]   5.4%	100.0% |
'--------------------------------------------------------------------------------------------------------------------------------------'

This may be workload specific. Looking forward to more testing and comments.

Thanks,
-Aubrey
----------------------------------------------------------------------
From 48f27dde3a54915c25f08b0dfff0a8d148cd6093 Mon Sep 17 00:00:00 2001
From: Aubrey Li <aubrey.li@intel.com>
Date: Fri, 1 Nov 2019 14:17:01 +0800
Subject: [PATCH 1/3] sched/fair: don't migrate task if cookie not match

Load balancing tries to move a task from the busiest CPU to the
destination CPU. When core scheduling is enabled, if the task's cookie
does not match the destination CPU's core cookie, the task is skipped
by this CPU. This mitigates the forced idle time on the destination
CPU.

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
---
 kernel/sched/fair.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4728f5e..ba937b9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7366,8 +7366,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * We do not migrate tasks that are:
 	 * 1) throttled_lb_pair, or
 	 * 2) cannot be migrated to this CPU due to cpus_ptr, or
-	 * 3) running (obviously), or
-	 * 4) are cache-hot on their current CPU.
+	 * 3) task's cookie does not match with this CPU's core cookie
+	 * 4) running (obviously), or
+	 * 5) are cache-hot on their current CPU.
 	 */
 	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
 		return 0;
@@ -7402,6 +7403,17 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 		return 0;
 	}
 
+#ifdef CONFIG_SCHED_CORE
+	if (sched_core_enabled(cpu_rq(env->dst_cpu))) {
+		/*
+		 * Don't migrate task if task's cookie does not match
+		 * with core cookie
+		 */
+		if (p->core_cookie != cpu_rq(env->dst_cpu)->core->core_cookie)
+			return 0;
+	}
+#endif
+
 	/* Record that we found atleast one task that could run on dst_cpu */
 	env->flags &= ~LBF_ALL_PINNED;
 
-- 
2.7.4

----------------------------------------------------------------------
From c6cd478408097c47f60a124bf67562d0e2994f7d Mon Sep 17 00:00:00 2001
From: Aubrey Li <aubrey.li@intel.com>
Date: Fri, 1 Nov 2019 14:28:09 +0800
Subject: [PATCH 2/3] sched/fair: select cookie matched idle CPU

In the fast path of task wakeup, select the first cookie matched idle
CPU instead of the first idle CPU.
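
Note that the hunk below only breaks out of the scan when core
scheduling is enabled at runtime and the cookie matches; with
CONFIG_SCHED_CORE=y but core scheduling disabled, no idle CPU is ever
picked here. A variant that keeps the old behaviour in that case could
look like this (sketch only, not part of this patch):

  if (available_idle_cpu(cpu)) {
  #ifdef CONFIG_SCHED_CORE
          if (!sched_core_enabled(cpu_rq(cpu)) ||
              p->core_cookie == cpu_rq(cpu)->core->core_cookie)
                  break;
  #else
          break;
  #endif
  }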

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
---
 kernel/sched/fair.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ba937b9..19ea8d8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6074,7 +6074,13 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 		if (!cpumask_test_cpu(cpu, p->cpus_ptr))
 			continue;
 		if (available_idle_cpu(cpu))
+#ifdef CONFIG_SCHED_CORE
+			if (sched_core_enabled(cpu_rq(cpu)) &&
+			    (p->core_cookie == cpu_rq(cpu)->core->core_cookie))
+				break;
+#else
 			break;
+#endif
 	}
 
 	time = cpu_clock(this) - time;
-- 
2.7.4

----------------------------------------------------------------------
From 1dc91d6b3e95c81d1387f4989f160a1a9924a8f4 Mon Sep 17 00:00:00 2001
From: Aubrey Li <aubrey.li@intel.com>
Date: Fri, 1 Nov 2019 14:49:00 +0800
Subject: [PATCH 3/3] sched/fair: find cookie matched idlest CPU

In the slow path of task wakeup, find the idlest CPU whose core
cookie matches the task's cookie.

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
---
 kernel/sched/fair.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 19ea8d8..3aec68a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5692,6 +5692,22 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 					p->cpus_ptr))
 			continue;
 
+#ifdef CONFIG_SCHED_CORE
+		if (sched_core_enabled(cpu_rq(this_cpu))) {
+			bool cookie_match = false;
+
+			for_each_cpu(i, sched_group_span(group)) {
+				struct rq *rq = cpu_rq(i);
+
+				if (p->core_cookie == rq->core->core_cookie)
+					cookie_match = true;
+			}
+			/* Skip over this group if no cookie matched */
+			if (!cookie_match)
+				continue;
+		}
+#endif
+
 		local_group = cpumask_test_cpu(this_cpu,
 					       sched_group_span(group));
 
@@ -5817,8 +5833,14 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 
 	/* Traverse only the allowed CPUs */
 	for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+		struct rq *rq = cpu_rq(i);
+
+#ifdef CONFIG_SCHED_CORE
+		if (sched_core_enabled(rq) && (p->core_cookie !=
+			rq->core->core_cookie))
+			continue;
+#endif
 		if (available_idle_cpu(i)) {
-			struct rq *rq = cpu_rq(i);
 			struct cpuidle_state *idle = idle_get_state(rq);
 			if (idle && idle->exit_latency < min_exit_latency) {
 				/*
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2019-10-31 18:42 ` Phil Auld
@ 2019-11-01 14:03   ` Vineeth Remanan Pillai
  2019-11-01 16:35     ` Greg Kerr
  2019-11-12  1:45     ` Dario Faggioli
  0 siblings, 2 replies; 84+ messages in thread
From: Vineeth Remanan Pillai @ 2019-11-01 14:03 UTC (permalink / raw)
  To: Phil Auld
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Aaron Lu,
	Aubrey Li, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini

Hi Phil,

> Unless I'm mistaken 7 of the first 8 of these went into sched/core
> and are now in linux (from v5.4-rc1). It may make sense to rebase on
> that and simplify the series.
>
Thanks a lot for pointing this out. We shall test on a rebased 5.4 RC
and post the changes soon, if the tests go well. For v3, while rebasing
to an RC kernel, we saw perf regressions and hence did not check the
RC kernel this time. You are absolutely right that we can simplify the
patch series with 5.4 RC.


Thanks
Vineeth

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2019-11-01 14:03   ` Vineeth Remanan Pillai
@ 2019-11-01 16:35     ` Greg Kerr
  2019-11-01 18:07       ` Dario Faggioli
  2019-11-12  1:45     ` Dario Faggioli
  1 sibling, 1 reply; 84+ messages in thread
From: Greg Kerr @ 2019-11-01 16:35 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Phil Auld, Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra,
	Tim Chen, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Aaron Lu, Aubrey Li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Fri, Nov 1, 2019 at 7:03 AM Vineeth Remanan Pillai
<vpillai@digitalocean.com> wrote:
>
> Hi Phil,
>
> > Unless I'm mistaken 7 of the first 8 of these went into sched/core
> > and are now in linux (from v5.4-rc1). It may make sense to rebase on
> > that and simplify the series.
> >
> Thanks a lot for pointing this out. We shall test on a rebased 5.4 RC
> and post the changes soon, if the tests goes well. For v3, while rebasing
> to an RC kernel, we saw perf regressions and hence did not check the
> RC kernel this time. You are absolutely right that we can simplify the
> patch series with 5.4 RC.

Has anyone considered shipping a V1 implementation which just allows
threads from the same process to share a core together? And then
iterating on that? Would that be simpler to implement or do the same
fundamental problems exist as tagging arbitrary processes with
cookies?

Regards,

Greg Kerr

>
>
> Thanks
> Vineeth

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2019-11-01 16:35     ` Greg Kerr
@ 2019-11-01 18:07       ` Dario Faggioli
  0 siblings, 0 replies; 84+ messages in thread
From: Dario Faggioli @ 2019-11-01 18:07 UTC (permalink / raw)
  To: Greg Kerr, Vineeth Remanan Pillai
  Cc: Phil Auld, Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra,
	Tim Chen, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Aaron Lu, Aubrey Li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 953 bytes --]

On Fri, 2019-11-01 at 09:35 -0700, Greg Kerr wrote:
> Has anyone considering shipping a V1 implementation which just allows
> threads from the same process to share a core together? And then
> iterating on that? Would that be simpler to implement or do the same
> fundamental problems exist as tagging arbitrary processes with
> cookies?
> 
IMO, the latter, i.e., the same fundamental problems exist.

It's _tasks_ that have to be scheduled in a certain way.

It does not matter much, as far as the necessary modifications to the
scheduler are concerned, whether those tasks are independent processes,
or threads of the same process, or vcpus of a VM, etc...
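
As a rough illustration (a sketch only, not part of the series), tagging at
any of those granularities reduces to stamping the same cookie value, via
the per-task core_cookie field this series adds, on whichever set of tasks
is allowed to share a core. For example, for the threads of one process
(locking around the thread walk elided for brevity):

/*
 * Hypothetical helper, not in the posted patches: tag every thread in
 * @leader's thread group with the same opaque cookie.  A cgroup-based or
 * per-VM policy would differ only in which tasks get stamped.
 */
static void tag_thread_group(struct task_struct *leader, unsigned long cookie)
{
	struct task_struct *t;

	for_each_thread(leader, t)
		t->core_cookie = cookie;	/* field added by this series */
}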

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2019-11-01 11:33   ` Li, Aubrey
@ 2019-11-08  3:20     ` Li, Aubrey
  0 siblings, 0 replies; 84+ messages in thread
From: Li, Aubrey @ 2019-11-08  3:20 UTC (permalink / raw)
  To: Vineeth Remanan Pillai, Nishanth Aravamudan, Julien Desfossez,
	Peter Zijlstra, Tim Chen, mingo, tglx, pjt, torvalds
  Cc: linux-kernel, Dario Faggioli, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 2019/11/1 19:33, Li, Aubrey wrote:
> On 2019/10/31 19:42, Li, Aubrey wrote:
>> On 2019/10/31 2:33, Vineeth Remanan Pillai wrote:
>>> Fourth iteration of the Core-Scheduling feature.
>>>
>>> This version was aiming mostly at addressing the vruntime comparison
>>> issues with v3. The main issue seen in v3 was the starvation of
>>> interactive tasks when competing with cpu intensive tasks. This issue
>>> is mitigated to a large extent.
>>>
>>> We have tested and verified that incompatible processes are not
>>> selected during schedule. In terms of performance, the impact
>>> depends on the workload:
>>> - on CPU intensive applications that use all the logical CPUs with
>>>   SMT enabled, enabling core scheduling performs better than nosmt.
>>> - on mixed workloads with considerable io compared to cpu usage,
>>>   nosmt seems to perform better than core scheduling.
>>>
>>> v4 is rebased on top of 5.3.5(dc073f193b70):
>>> https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.3.5
>>
>> Thanks to post V4 out. Refresh the data on my side. Since we have played
>> with Aaron's core vruntime patch for a while, no surprise in the result.
>>
> I have three small patches against V4. The basic idea is,
> - for load balance, don't pull/push task if its cookie does not match with
> destination CPU's core cookie
> - for task wakeup, select idle CPU whose core cookie matches with task's
> cookie.
> 
> Sysbench MySQL result shows significant improvement for the overload cases.
>
> This may be the workload specific. Looking forward to more testing and comments.
> 

Here is another one for task NUMA migration. We saw a significant latency
improvement with the sysbench MySQL + gemmbench workload: for the overloaded
case on an 8-node system, the latency is reduced from 93.78ms to 28.36ms. So
I think it's worth posting this tweak to draw more ideas and better solutions.

Thanks,
-Aubrey

----------------------------------------------------------------------
From 89fc5fd70d5dcc426dc4724afdf35d2c916cd303 Mon Sep 17 00:00:00 2001
From: Aubrey Li <aubrey.li@intel.com>
Date: Thu, 7 Nov 2019 14:51:28 +0800
Subject: [PATCH 1/2] sched/numa: don't migrate task if cookie not match

For the NUMA load balance, don't migrate a task to a CPU whose core
cookie does not match the task's cookie.

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
---
 kernel/sched/fair.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3aec68a..2909030 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1827,6 +1827,16 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
 			continue;
 
+#ifdef CONFIG_SCHED_CORE
+		/*
+		 * Skip this cpu if source task's cookie does not match
+		 * with CPU's core cookie.
+		 */
+		if (sched_core_enabled(cpu_rq(cpu)) && (env->p->core_cookie !=
+		    cpu_rq(cpu)->core->core_cookie))
+				continue;
+#endif
+
 		env->dst_cpu = cpu;
 		task_numa_compare(env, taskimp, groupimp, maymove);
 	}
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2019-10-30 18:33 [RFC PATCH v4 00/19] Core scheduling v4 Vineeth Remanan Pillai
                   ` (20 preceding siblings ...)
  2019-10-31 18:42 ` Phil Auld
@ 2019-11-11 19:10 ` Tim Chen
  2020-01-14  1:12   ` Tim Chen
  21 siblings, 1 reply; 84+ messages in thread
From: Tim Chen @ 2019-11-11 19:10 UTC (permalink / raw)
  To: Vineeth Remanan Pillai, Nishanth Aravamudan, Julien Desfossez,
	Peter Zijlstra, mingo, tglx, pjt, torvalds
  Cc: linux-kernel, Dario Faggioli, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 10/30/19 11:33 AM, Vineeth Remanan Pillai wrote:
> Fourth iteration of the Core-Scheduling feature.
> 
> This version was aiming mostly at addressing the vruntime comparison
> issues with v3. The main issue seen in v3 was the starvation of
> interactive tasks when competing with cpu intensive tasks. This issue
> is mitigated to a large extent.
> 
> We have tested and verified that incompatible processes are not
> selected during schedule. In terms of performance, the impact
> depends on the workload:
> - on CPU intensive applications that use all the logical CPUs with
>   SMT enabled, enabling core scheduling performs better than nosmt.
> - on mixed workloads with considerable io compared to cpu usage,
>   nosmt seems to perform better than core scheduling.
> 
> v4 is rebased on top of 5.3.5(dc073f193b70):
> https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.3.5
> 

The v4 core scheduler will crash when you toggle the core scheduling
tag of the cgroup while there are active tasks in the cgroup running.

The reason is that we don't properly move tasks in and out of the
core scheduler queue according to the new core scheduling tag status.
A task's core scheduler status will then get misaligned with its core
cookie status.

The patch below fixes that.

Thanks.

Tim

------->8----------------

From 2f3f035a9b74013627069da62d6553548700eeac Mon Sep 17 00:00:00 2001
From: Tim Chen <tim.c.chen@linux.intel.com>
Date: Thu, 7 Nov 2019 15:45:24 -0500
Subject: [PATCH] sched: Update tasks in core queue when its cgroup tag is
 changed

A task will need to be moved into the core scheduler queue when the cgroup
it belongs to is tagged to run with core scheduling.  Similarly the task
will need to be moved out of the core scheduler queue when the cgroup
is untagged.

Also, after we fork a task, its presence in the core scheduler queue will
need to be updated according to its new cgroup's status.

Implement these missing core scheduler queue update mechanisms.
Otherwise, the core scheduler will OOPS the kernel when we toggle the
cgroup's core scheduling tag, due to inconsistent core scheduler queue
status.

Use the stop machine mechanism to update all tasks in a cgroup, to prevent
a new task from sneaking into the cgroup and being missed by the update
while we iterate through all the tasks in the cgroup.  A more complicated
scheme could probably avoid the stop machine.  Such a scheme would also
need to resolve inconsistencies between a task's cgroup core scheduling
tag and its residency in the core scheduler queue.

We are opting for the simple stop machine mechanism for now, which avoids
such complications.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/core.c | 119 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 106 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4778b5940c30..08e7fdd5972d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -138,6 +138,37 @@ static inline bool __sched_core_less(struct task_struct *a, struct task_struct *
 	return false;
 }
 
+static bool sched_core_empty(struct rq *rq)
+{
+	return RB_EMPTY_ROOT(&rq->core_tree);
+}
+
+static bool sched_core_enqueued(struct task_struct *task)
+{
+	return !RB_EMPTY_NODE(&task->core_node);
+}
+
+static struct task_struct *sched_core_first(struct rq *rq)
+{
+	struct task_struct *task;
+
+	task = container_of(rb_first(&rq->core_tree), struct task_struct, core_node);
+	return task;
+}
+
+static void sched_core_flush(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	struct task_struct *task;
+
+	while (!sched_core_empty(rq)) {
+		task = sched_core_first(rq);
+		rb_erase(&task->core_node, &rq->core_tree);
+		RB_CLEAR_NODE(&task->core_node);
+	}
+	rq->core->core_task_seq++;
+}
+
 static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 {
 	struct rb_node *parent, **node;
@@ -169,10 +200,11 @@ static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
 {
 	rq->core->core_task_seq++;
 
-	if (!p->core_cookie)
+	if (!sched_core_enqueued(p))
 		return;
 
 	rb_erase(&p->core_node, &rq->core_tree);
+	RB_CLEAR_NODE(&p->core_node);
 }
 
 /*
@@ -236,6 +268,18 @@ static int __sched_core_stopper(void *data)
 	bool enabled = !!(unsigned long)data;
 	int cpu;
 
+	if (!enabled) {
+		for_each_online_cpu(cpu) {
+		/*
+		 * All active and migrating tasks will have already been removed
+		 * from core queue when we clear the cgroup tags.
+		 * However, dying tasks could still be left in core queue.
+		 * Flush them here.
+		 */
+			sched_core_flush(cpu);
+		}
+	}
+
 	for_each_online_cpu(cpu)
 		cpu_rq(cpu)->core_enabled = enabled;
 
@@ -247,7 +291,11 @@ static int sched_core_count;
 
 static void __sched_core_enable(void)
 {
-	// XXX verify there are no cookie tasks (yet)
+	int cpu;
+
+	/* verify there are no cookie tasks (yet) */
+	for_each_online_cpu(cpu)
+		BUG_ON(!sched_core_empty(cpu_rq(cpu)));
 
 	static_branch_enable(&__sched_core_enabled);
 	stop_machine(__sched_core_stopper, (void *)true, NULL);
@@ -257,8 +305,6 @@ static void __sched_core_enable(void)
 
 static void __sched_core_disable(void)
 {
-	// XXX verify there are no cookie tasks (left)
-
 	stop_machine(__sched_core_stopper, (void *)false, NULL);
 	static_branch_disable(&__sched_core_enabled);
 
@@ -285,6 +331,7 @@ void sched_core_put(void)
 
 static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
 static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+static bool sched_core_enqueued(struct task_struct *task) { return false; }
 
 #endif /* CONFIG_SCHED_CORE */
 
@@ -3016,6 +3063,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 #ifdef CONFIG_SMP
 	plist_node_init(&p->pushable_tasks, MAX_PRIO);
 	RB_CLEAR_NODE(&p->pushable_dl_tasks);
+#endif
+#ifdef CONFIG_SCHED_CORE
+	RB_CLEAR_NODE(&p->core_node);
 #endif
 	return 0;
 }
@@ -6560,6 +6610,9 @@ void init_idle(struct task_struct *idle, int cpu)
 #ifdef CONFIG_SMP
 	sprintf(idle->comm, "%s/%d", INIT_TASK_COMM, cpu);
 #endif
+#ifdef CONFIG_SCHED_CORE
+	RB_CLEAR_NODE(&idle->core_node);
+#endif
 }
 
 #ifdef CONFIG_SMP
@@ -7671,7 +7724,12 @@ static void cpu_cgroup_fork(struct task_struct *task)
 	rq = task_rq_lock(task, &rf);
 
 	update_rq_clock(rq);
+	if (sched_core_enqueued(task))
+		sched_core_dequeue(rq, task);
 	sched_change_group(task, TASK_SET_GROUP);
+	if (sched_core_enabled(rq) && task_on_rq_queued(task) &&
+	    task->core_cookie)
+		sched_core_enqueue(rq, task);
 
 	task_rq_unlock(rq, task, &rf);
 }
@@ -8033,12 +8091,51 @@ static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct cftype
 	return !!tg->tagged;
 }
 
-static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
+struct write_core_tag {
+	struct cgroup_subsys_state *css;
+	int val;
+};
+
+static int __sched_write_tag(void *data)
 {
-	struct task_group *tg = css_tg(css);
+	struct write_core_tag *tag = (struct write_core_tag *) data;
+	struct cgroup_subsys_state *css = tag->css;
+	int val = tag->val;
+	struct task_group *tg = css_tg(tag->css);
 	struct css_task_iter it;
 	struct task_struct *p;
 
+	tg->tagged = !!val;
+
+	css_task_iter_start(css, 0, &it);
+	/*
+	 * Note: css_task_iter_next will skip dying tasks.
+	 * There could still be dying tasks left in the core queue
+	 * when we set cgroup tag to 0 when the loop is done below.
+	 */
+	while ((p = css_task_iter_next(&it))) {
+		p->core_cookie = !!val ? (unsigned long)tg : 0UL;
+
+		if (sched_core_enqueued(p)) {
+			sched_core_dequeue(task_rq(p), p);
+			if (!p->core_cookie)
+				continue;
+		}
+
+		if (p->core_cookie && task_on_rq_queued(p))
+			sched_core_enqueue(task_rq(p), p);
+
+	}
+	css_task_iter_end(&it);
+
+	return 0;
+}
+
+static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
+{
+	struct task_group *tg = css_tg(css);
+	struct write_core_tag wtag;
+
 	if (val > 1)
 		return -ERANGE;
 
@@ -8048,16 +8145,12 @@ static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype
 	if (tg->tagged == !!val)
 		return 0;
 
-	tg->tagged = !!val;
-
 	if (!!val)
 		sched_core_get();
 
-	css_task_iter_start(css, 0, &it);
-	while ((p = css_task_iter_next(&it)))
-		p->core_cookie = !!val ? (unsigned long)tg : 0UL;
-	css_task_iter_end(&it);
-
+	wtag.css = css;
+	wtag.val = val;
+	stop_machine(__sched_write_tag, (void *) &wtag, NULL);
 	if (!val)
 		sched_core_put();
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2019-11-01 14:03   ` Vineeth Remanan Pillai
  2019-11-01 16:35     ` Greg Kerr
@ 2019-11-12  1:45     ` Dario Faggioli
  2019-11-13 17:16       ` Tim Chen
  2020-01-02  2:28       ` Aubrey Li
  1 sibling, 2 replies; 84+ messages in thread
From: Dario Faggioli @ 2019-11-12  1:45 UTC (permalink / raw)
  To: Vineeth Remanan Pillai, Phil Auld
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Aaron Lu, Aubrey Li, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 1195 bytes --]

On Fri, 2019-11-01 at 10:03 -0400, Vineeth Remanan Pillai wrote:
> Hi Phil,
> 
> > Unless I'm mistaken 7 of the first 8 of these went into sched/core
> > and are now in linux (from v5.4-rc1). It may make sense to rebase
> > on
> > that and simplify the series.
> > 
> Thanks a lot for pointing this out. We shall test on a rebased 5.4 RC
> and post the changes soon, if the tests goes well. For v3, while
> rebasing
> to an RC kernel, we saw perf regressions and hence did not check the
> RC kernel this time. You are absolutely right that we can simplify
> the
> patch series with 5.4 RC.
> 
And, in case it's useful to anybody, here's a rebase of this series on
top of 5.4-rc7:

https://github.com/dfaggioli/linux/tree/wip/sched/v5.4-rc7-coresched

This also includes Tim's patch (this one
<5e3cea14-28d1-bf1e-cabe-fb5b48fdeadc@linux.intel.com>) and is, for
now, *only* compile tested.

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2019-11-12  1:45     ` Dario Faggioli
@ 2019-11-13 17:16       ` Tim Chen
  2020-01-02  2:28       ` Aubrey Li
  1 sibling, 0 replies; 84+ messages in thread
From: Tim Chen @ 2019-11-13 17:16 UTC (permalink / raw)
  To: Dario Faggioli, Vineeth Remanan Pillai, Phil Auld
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Aaron Lu, Aubrey Li, Valentin Schneider,
	Mel Gorman, Pawan Gupta, Paolo Bonzini

On 11/11/19 5:45 PM, Dario Faggioli wrote:
> On Fri, 2019-11-01 at 10:03 -0400, Vineeth Remanan Pillai wrote:
>> Hi Phil,
>>
>>> Unless I'm mistaken 7 of the first 8 of these went into sched/core
>>> and are now in linux (from v5.4-rc1). It may make sense to rebase
>>> on
>>> that and simplify the series.
>>>
>> Thanks a lot for pointing this out. We shall test on a rebased 5.4 RC
>> and post the changes soon, if the tests goes well. For v3, while
>> rebasing
>> to an RC kernel, we saw perf regressions and hence did not check the
>> RC kernel this time. You are absolutely right that we can simplify
>> the
>> patch series with 5.4 RC.
>>
> And, in case it's useful to anybody, here's a rebase of this series on
> top of 5.4-rc7:
> 
> https://github.com/dfaggioli/linux/tree/wip/sched/v5.4-rc7-coresched

Thanks for the port.
> 
> This also includes Tim's patch (this one
> <5e3cea14-28d1-bf1e-cabe-fb5b48fdeadc@linux.intel.com>) and is, for
> now, *only* compile tested.
> 

I tested my patch on top of your port and it seems to work fine.

Tim

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2019-11-12  1:45     ` Dario Faggioli
  2019-11-13 17:16       ` Tim Chen
@ 2020-01-02  2:28       ` Aubrey Li
  2020-01-10 23:19         ` Tim Chen
  1 sibling, 1 reply; 84+ messages in thread
From: Aubrey Li @ 2020-01-02  2:28 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Vineeth Remanan Pillai, Phil Auld, Nishanth Aravamudan,
	Julien Desfossez, Peter Zijlstra, Tim Chen, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Aaron Lu, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Tue, Nov 12, 2019 at 9:45 AM Dario Faggioli <dfaggioli@suse.com> wrote:
>
> On Fri, 2019-11-01 at 10:03 -0400, Vineeth Remanan Pillai wrote:
> > Hi Phil,
> >
> > > Unless I'm mistaken 7 of the first 8 of these went into sched/core
> > > and are now in linux (from v5.4-rc1). It may make sense to rebase
> > > on
> > > that and simplify the series.
> > >
> > Thanks a lot for pointing this out. We shall test on a rebased 5.4 RC
> > and post the changes soon, if the tests goes well. For v3, while
> > rebasing
> > to an RC kernel, we saw perf regressions and hence did not check the
> > RC kernel this time. You are absolutely right that we can simplify
> > the
> > patch series with 5.4 RC.
> >
> And, in case it's useful to anybody, here's a rebase of this series on
> top of 5.4-rc7:
>
> https://github.com/dfaggioli/linux/tree/wip/sched/v5.4-rc7-coresched
>

In case it's useful to anyone, I rebased the series on top of v5.5-rc4.
https://github.com/aubreyli/linux/tree/coresched_v4-v5.5-rc4

v5.5 includes a few scheduler reworks and fixes, so I modified patches 1/2/6;
patch 0002 has relatively big changes, but still no functional or logic
change.

0001-sched-Wrap-rq-lock-access.patch
0002-sched-Introduce-sched_class-pick_task.patch
0003-sched-Core-wide-rq-lock.patch
0004-sched-Basic-tracking-of-matching-tasks.patch
0005-sched-A-quick-and-dirty-cgroup-tagging-interface.patch
0006-sched-Add-core-wide-task-selection-and-scheduling.patch
0007-sched-fair-Add-a-few-assertions.patch
0008-sched-Trivial-forced-newidle-balancer.patch
0009-sched-Debug-bits.patch
0010-sched-fair-wrapper-for-cfs_rq-min_vruntime.patch
0011-sched-fair-core-wide-vruntime-comparison.patch
0012-sched-fair-Wake-up-forced-idle-siblings-if-needed.patch

I verified with my test suites, and it seems to work.

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-01-02  2:28       ` Aubrey Li
@ 2020-01-10 23:19         ` Tim Chen
  0 siblings, 0 replies; 84+ messages in thread
From: Tim Chen @ 2020-01-10 23:19 UTC (permalink / raw)
  To: Aubrey Li, Dario Faggioli
  Cc: Vineeth Remanan Pillai, Phil Auld, Nishanth Aravamudan,
	Julien Desfossez, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paul Turner, Linus Torvalds, Linux List Kernel Mailing,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Aaron Lu,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On 1/1/20 6:28 PM, Aubrey Li wrote:
> On Tue, Nov 12, 2019 at 9:45 AM Dario Faggioli <dfaggioli@suse.com> wrote:
>>
>> On Fri, 2019-11-01 at 10:03 -0400, Vineeth Remanan Pillai wrote:
>>> Hi Phil,
>>>
>>>> Unless I'm mistaken 7 of the first 8 of these went into sched/core
>>>> and are now in linux (from v5.4-rc1). It may make sense to rebase
>>>> on
>>>> that and simplify the series.
>>>>
>>> Thanks a lot for pointing this out. We shall test on a rebased 5.4 RC
>>> and post the changes soon, if the tests goes well. For v3, while
>>> rebasing
>>> to an RC kernel, we saw perf regressions and hence did not check the
>>> RC kernel this time. You are absolutely right that we can simplify
>>> the
>>> patch series with 5.4 RC.
>>>
>> And, in case it's useful to anybody, here's a rebase of this series on
>> top of 5.4-rc7:
>>
>> https://github.com/dfaggioli/linux/tree/wip/sched/v5.4-rc7-coresched
>>
> 
> In case it's useful to anyone, I rebased the series on top of v5.5-rc4.
> https://github.com/aubreyli/linux/tree/coresched_v4-v5.5-rc4
> 
> v5.5 includes a few scheduler rework and fix, so I modified patch1/2/6,
> patch-0002 has relatively big changes, but still has no functionality
> and logic change.
> 
> 0001-sched-Wrap-rq-lock-access.patch
> 0002-sched-Introduce-sched_class-pick_task.patch
> 0003-sched-Core-wide-rq-lock.patch
> 0004-sched-Basic-tracking-of-matching-tasks.patch
> 0005-sched-A-quick-and-dirty-cgroup-tagging-interface.patch
> 0006-sched-Add-core-wide-task-selection-and-scheduling.patch
> 0007-sched-fair-Add-a-few-assertions.patch
> 0008-sched-Trivial-forced-newidle-balancer.patch
> 0009-sched-Debug-bits.patch
> 0010-sched-fair-wrapper-for-cfs_rq-min_vruntime.patch
> 0011-sched-fair-core-wide-vruntime-comparison.patch
> 0012-sched-fair-Wake-up-forced-idle-siblings-if-needed.patch
> 
> I verified by my test suites, it seems to work.
> 

Peter,

What do you see as the next steps to move the core scheduling
functionality forward?

We would like to find out what gaps you see that need to be filled
for this patchset to be considered for mainline.

Thanks.

Tim


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2019-11-11 19:10 ` Tim Chen
@ 2020-01-14  1:12   ` Tim Chen
  2020-01-14 15:40     ` Vineeth Remanan Pillai
  0 siblings, 1 reply; 84+ messages in thread
From: Tim Chen @ 2020-01-14  1:12 UTC (permalink / raw)
  To: Vineeth Remanan Pillai, Nishanth Aravamudan, Julien Desfossez,
	Peter Zijlstra, mingo, tglx, pjt, torvalds
  Cc: linux-kernel, Dario Faggioli, fweisbec, keescook, kerrnel,
	Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On 11/11/19 11:10 AM, Tim Chen wrote:
> On 10/30/19 11:33 AM, Vineeth Remanan Pillai wrote:
>> Fourth iteration of the Core-Scheduling feature.
>>
>> This version was aiming mostly at addressing the vruntime comparison
>> issues with v3. The main issue seen in v3 was the starvation of
>> interactive tasks when competing with cpu intensive tasks. This issue
>> is mitigated to a large extent.
>>
>> We have tested and verified that incompatible processes are not
>> selected during schedule. In terms of performance, the impact
>> depends on the workload:
>> - on CPU intensive applications that use all the logical CPUs with
>>   SMT enabled, enabling core scheduling performs better than nosmt.
>> - on mixed workloads with considerable io compared to cpu usage,
>>   nosmt seems to perform better than core scheduling.
>>
>> v4 is rebased on top of 5.3.5(dc073f193b70):
>> https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.3.5
>>
> 
> The v4 core scheduler will crash when you toggle the core scheduling
> tag of the cgroup while there are active tasks in the cgroup running.
> 
> The reason is because we don't properly move tasks in and out of the
> core scheduler queue according to the new core scheduling tag status.
> A task's core scheduler status will then get misaligned with its core
> cookie status.
> 
> The patch below fixes that.
> 
> Thanks.
> 
> Tim
> 

I also encountered a kernel panic with the v4 code when taking a cpu offline or online
while the core scheduler is running.  I've refreshed the previous patch, along
with 3 other patches, to fix problems related to CPU online/offline.

As a side effect of the fix, each core can now operate in core-scheduling
mode or non core-scheduling mode, depending on how many online SMT threads it has.

Vineet, are you guys planning to refresh v4 and update it to v5?  Aubrey posted
a port to the latest kernel earlier.

Tim

--->8---
From 3c00f178538e2a33d075b5705adaf31784b19add Mon Sep 17 00:00:00 2001
Message-Id: <3c00f178538e2a33d075b5705adaf31784b19add.1578960120.git.tim.c.chen@linux.intel.com>
From: Tim Chen <tim.c.chen@linux.intel.com>
Date: Thu, 7 Nov 2019 15:45:24 -0500
Subject: [PATCH 1/4] sched: Update tasks in core queue when its cgroup tag is
 changed

A task will need to be moved into the core scheduler queue when the cgroup
it belongs to is tagged to run with core scheduling.  Similarly the task
will need to be moved out of the core scheduler queue when the cgroup
is untagged.

Also, after we fork a task, its presence in the core scheduler queue will
need to be updated according to its new cgroup's status.

Implement these missing core scheduler queue update mechanisms.
Otherwise, the core scheduler will OOPS the kernel when we toggle the
cgroup's core scheduling tag, due to inconsistent core scheduler queue
status.

Use the stop machine mechanism to update all tasks in a cgroup, to prevent
a new task from sneaking into the cgroup and being missed by the update
while we iterate through all the tasks in the cgroup.  A more complicated
scheme could probably avoid the stop machine.  Such a scheme would also
need to resolve inconsistencies between a task's cgroup core scheduling
tag and its residency in the core scheduler queue.

We are opting for the simple stop machine mechanism for now, which avoids
such complications.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/core.c | 120 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 107 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4778b5940c30..44399f5434f8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -138,6 +138,37 @@ static inline bool __sched_core_less(struct task_struct *a, struct task_struct *
 	return false;
 }
 
+static bool sched_core_empty(struct rq *rq)
+{
+	return RB_EMPTY_ROOT(&rq->core_tree);
+}
+
+static bool sched_core_enqueued(struct task_struct *task)
+{
+	return !RB_EMPTY_NODE(&task->core_node);
+}
+
+static struct task_struct *sched_core_first(struct rq *rq)
+{
+	struct task_struct *task;
+
+	task = container_of(rb_first(&rq->core_tree), struct task_struct, core_node);
+	return task;
+}
+
+static void sched_core_flush(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	struct task_struct *task;
+
+	while (!sched_core_empty(rq)) {
+		task = sched_core_first(rq);
+		rb_erase(&task->core_node, &rq->core_tree);
+		RB_CLEAR_NODE(&task->core_node);
+	}
+	rq->core->core_task_seq++;
+}
+
 static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 {
 	struct rb_node *parent, **node;
@@ -169,10 +200,11 @@ static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
 {
 	rq->core->core_task_seq++;
 
-	if (!p->core_cookie)
+	if (!sched_core_enqueued(p))
 		return;
 
 	rb_erase(&p->core_node, &rq->core_tree);
+	RB_CLEAR_NODE(&p->core_node);
 }
 
 /*
@@ -236,6 +268,18 @@ static int __sched_core_stopper(void *data)
 	bool enabled = !!(unsigned long)data;
 	int cpu;
 
+	if (!enabled) {
+		for_each_online_cpu(cpu) {
+		/*
+		 * All active and migrating tasks will have already been removed
+		 * from core queue when we clear the cgroup tags.
+		 * However, dying tasks could still be left in core queue.
+		 * Flush them here.
+		 */
+			sched_core_flush(cpu);
+		}
+	}
+
 	for_each_online_cpu(cpu)
 		cpu_rq(cpu)->core_enabled = enabled;
 
@@ -247,7 +291,11 @@ static int sched_core_count;
 
 static void __sched_core_enable(void)
 {
-	// XXX verify there are no cookie tasks (yet)
+	int cpu;
+
+	/* verify there are no cookie tasks (yet) */
+	for_each_online_cpu(cpu)
+		BUG_ON(!sched_core_empty(cpu_rq(cpu)));
 
 	static_branch_enable(&__sched_core_enabled);
 	stop_machine(__sched_core_stopper, (void *)true, NULL);
@@ -257,8 +305,6 @@ static void __sched_core_enable(void)
 
 static void __sched_core_disable(void)
 {
-	// XXX verify there are no cookie tasks (left)
-
 	stop_machine(__sched_core_stopper, (void *)false, NULL);
 	static_branch_disable(&__sched_core_enabled);
 
@@ -285,6 +331,7 @@ void sched_core_put(void)
 
 static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
 static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+static bool sched_core_enqueued(struct task_struct *task) { return false; }
 
 #endif /* CONFIG_SCHED_CORE */
 
@@ -3016,6 +3063,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 #ifdef CONFIG_SMP
 	plist_node_init(&p->pushable_tasks, MAX_PRIO);
 	RB_CLEAR_NODE(&p->pushable_dl_tasks);
+#endif
+#ifdef CONFIG_SCHED_CORE
+	RB_CLEAR_NODE(&p->core_node);
 #endif
 	return 0;
 }
@@ -6560,6 +6610,9 @@ void init_idle(struct task_struct *idle, int cpu)
 #ifdef CONFIG_SMP
 	sprintf(idle->comm, "%s/%d", INIT_TASK_COMM, cpu);
 #endif
+#ifdef CONFIG_SCHED_CORE
+	RB_CLEAR_NODE(&idle->core_node);
+#endif
 }
 
 #ifdef CONFIG_SMP
@@ -7671,7 +7724,12 @@ static void cpu_cgroup_fork(struct task_struct *task)
 	rq = task_rq_lock(task, &rf);
 
 	update_rq_clock(rq);
+	if (sched_core_enqueued(task))
+		sched_core_dequeue(rq, task);
 	sched_change_group(task, TASK_SET_GROUP);
+	if (sched_core_enabled(rq) && task_on_rq_queued(task) &&
+	    task->core_cookie)
+		sched_core_enqueue(rq, task);
 
 	task_rq_unlock(rq, task, &rf);
 }
@@ -8033,12 +8091,52 @@ static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct cftype
 	return !!tg->tagged;
 }
 
-static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
+struct write_core_tag {
+	struct cgroup_subsys_state *css;
+	int val;
+};
+
+static int __sched_write_tag(void *data)
 {
-	struct task_group *tg = css_tg(css);
+	struct write_core_tag *tag = (struct write_core_tag *) data;
+	struct cgroup_subsys_state *css = tag->css;
+	int val = tag->val;
+	struct task_group *tg = css_tg(tag->css);
 	struct css_task_iter it;
 	struct task_struct *p;
 
+	tg->tagged = !!val;
+
+	css_task_iter_start(css, 0, &it);
+	/*
+	 * Note: css_task_iter_next will skip dying tasks.
+	 * There could still be dying tasks left in the core queue
+	 * when we set cgroup tag to 0 when the loop is done below.
+	 */
+	while ((p = css_task_iter_next(&it))) {
+		p->core_cookie = !!val ? (unsigned long)tg : 0UL;
+
+		if (sched_core_enqueued(p)) {
+			sched_core_dequeue(task_rq(p), p);
+			if (!p->core_cookie)
+				continue;
+		}
+
+		if (sched_core_enabled(task_rq(p)) &&
+		    p->core_cookie && task_on_rq_queued(p))
+			sched_core_enqueue(task_rq(p), p);
+
+	}
+	css_task_iter_end(&it);
+
+	return 0;
+}
+
+static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
+{
+	struct task_group *tg = css_tg(css);
+	struct write_core_tag wtag;
+
 	if (val > 1)
 		return -ERANGE;
 
@@ -8048,16 +8146,12 @@ static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct cftype
 	if (tg->tagged == !!val)
 		return 0;
 
-	tg->tagged = !!val;
-
 	if (!!val)
 		sched_core_get();
 
-	css_task_iter_start(css, 0, &it);
-	while ((p = css_task_iter_next(&it)))
-		p->core_cookie = !!val ? (unsigned long)tg : 0UL;
-	css_task_iter_end(&it);
-
+	wtag.css = css;
+	wtag.val = val;
+	stop_machine(__sched_write_tag, (void *) &wtag, NULL);
 	if (!val)
 		sched_core_put();
 
-- 
2.20.1


From ee10a2f324702b5d940545322686fb6f1ec61431 Mon Sep 17 00:00:00 2001
Message-Id: <ee10a2f324702b5d940545322686fb6f1ec61431.1578960120.git.tim.c.chen@linux.intel.com>
In-Reply-To: <3c00f178538e2a33d075b5705adaf31784b19add.1578960120.git.tim.c.chen@linux.intel.com>
References: <3c00f178538e2a33d075b5705adaf31784b19add.1578960120.git.tim.c.chen@linux.intel.com>
From: Tim Chen <tim.c.chen@linux.intel.com>
Date: Mon, 22 Jul 2019 16:13:45 -0700
Subject: [PATCH 2/4] sched: Return idle task in pick_next_task for offlined
 CPU

There could be races in the core scheduler where a CPU is trying to pick
a task for its sibling when that CPU has just been offlined.  We should
not schedule any tasks on that CPU in this case.  Return the idle task
in pick_next_task for this situation.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/core.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 44399f5434f8..1a132beba3f8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4154,6 +4154,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	int i, j, cpu, occ = 0;
 	bool need_sync = false;
 
+	cpu = cpu_of(rq);
+	if (cpu_is_offline(cpu))
+		return idle_sched_class.pick_next_task(rq, prev, rf);
+
 	if (!sched_core_enabled(rq))
 		return __pick_next_task(rq, prev, rf);
 
@@ -4186,7 +4190,6 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	if (!rq->nr_running)
 		newidle_balance(rq, rf);
 
-	cpu = cpu_of(rq);
 	smt_mask = cpu_smt_mask(cpu);
 
 	/*
@@ -4228,8 +4231,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			struct rq *rq_i = cpu_rq(i);
 			struct task_struct *p;
 
-			if (cpu_is_offline(i))
+			if (cpu_is_offline(i)) {
+				rq_i->core_pick = rq_i->idle;
 				continue;
+			}
 
 			if (rq_i->core_pick)
 				continue;
-- 
2.20.1


From 226c811b3dc74452dc04244d57919d67a2555cc1 Mon Sep 17 00:00:00 2001
Message-Id: <226c811b3dc74452dc04244d57919d67a2555cc1.1578960120.git.tim.c.chen@linux.intel.com>
In-Reply-To: <3c00f178538e2a33d075b5705adaf31784b19add.1578960120.git.tim.c.chen@linux.intel.com>
References: <3c00f178538e2a33d075b5705adaf31784b19add.1578960120.git.tim.c.chen@linux.intel.com>
From: Tim Chen <tim.c.chen@linux.intel.com>
Date: Tue, 7 Jan 2020 13:22:28 -0800
Subject: [PATCH 3/4] sched/core: Enable core scheduler only for core with SMT
 threads

The core scheduler has extra overhead.  Enable it only for cores with
more than one SMT hardware thread.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/core.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1a132beba3f8..9d875d6ed3f3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -280,8 +280,10 @@ static int __sched_core_stopper(void *data)
 		}
 	}
 
-	for_each_online_cpu(cpu)
-		cpu_rq(cpu)->core_enabled = enabled;
+	for_each_online_cpu(cpu) {
+		if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2))
+			cpu_rq(cpu)->core_enabled = enabled;
+	}
 
 	return 0;
 }
-- 
2.20.1


From 38c85b523a88b7dd0f659a874c7a1793751ca5ab Mon Sep 17 00:00:00 2001
Message-Id: <38c85b523a88b7dd0f659a874c7a1793751ca5ab.1578960120.git.tim.c.chen@linux.intel.com>
In-Reply-To: <3c00f178538e2a33d075b5705adaf31784b19add.1578960120.git.tim.c.chen@linux.intel.com>
References: <3c00f178538e2a33d075b5705adaf31784b19add.1578960120.git.tim.c.chen@linux.intel.com>
From: Tim Chen <tim.c.chen@linux.intel.com>
Date: Tue, 7 Jan 2020 13:26:54 -0800
Subject: [PATCH 4/4] sched/core: Update core scheduler queue when taking cpu
 online/offline

When we bring a CPU online and enable the core scheduler, tasks that need
core scheduling need to be placed in the core's core scheduling queue.
Likewise, when we take a CPU offline or disable core scheduling on a
core, tasks in the core's core scheduling queue need to be removed.
Without such mechanisms, the core scheduler causes OOPSes due to
inconsistent core scheduling state of a task.

Implement such enqueue and dequeue mechanisms according to a CPU's change
in core scheduling status.  The switch of a core's core scheduling mode,
and the enqueue/dequeue of tasks on the core's queue due to that mode
change, have to run in a separate context, as they cannot be done in
the context taking the cpu online/offline.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/core.c     | 156 ++++++++++++++++++++++++++++++++++++----
 kernel/sched/deadline.c |  35 +++++++++
 kernel/sched/fair.c     |  38 ++++++++++
 kernel/sched/rt.c       |  43 +++++++++++
 kernel/sched/sched.h    |   7 ++
 5 files changed, 264 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9d875d6ed3f3..8db8960c6e69 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,11 @@ int sysctl_sched_rt_runtime = 950000;
 
 #ifdef CONFIG_SCHED_CORE
 
+struct core_sched_cpu_work {
+	struct work_struct work;
+	cpumask_t smt_mask;
+};
+
 DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
 
 /* kernel prio, less is more */
@@ -207,6 +212,18 @@ static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
 	RB_CLEAR_NODE(&p->core_node);
 }
 
+void sched_core_add(struct rq *rq, struct task_struct *p)
+{
+	if (p->core_cookie && task_on_rq_queued(p))
+		sched_core_enqueue(rq, p);
+}
+
+void sched_core_remove(struct rq *rq, struct task_struct *p)
+{
+	if (sched_core_enqueued(p))
+		sched_core_dequeue(rq, p);
+}
+
 /*
  * Find left-most (aka, highest priority) task matching @cookie.
  */
@@ -329,11 +346,133 @@ void sched_core_put(void)
 	mutex_unlock(&sched_core_mutex);
 }
 
+enum cpu_action {
+	CPU_ACTIVATE = 1,
+	CPU_DEACTIVATE = 2
+};
+
+static int __activate_cpu_core_sched(void *data);
+static int __deactivate_cpu_core_sched(void *data);
+static void core_sched_cpu_update(unsigned int cpu, enum cpu_action action);
+
+static int activate_cpu_core_sched(struct core_sched_cpu_work *work)
+{
+	if (static_branch_unlikely(&__sched_core_enabled))
+		stop_machine(__activate_cpu_core_sched, (void *) work, NULL);
+
+	return 0;
+}
+
+static int deactivate_cpu_core_sched(struct core_sched_cpu_work *work)
+{
+	if (static_branch_unlikely(&__sched_core_enabled))
+		stop_machine(__deactivate_cpu_core_sched, (void *) work, NULL);
+
+	return 0;
+}
+
+static void core_sched_cpu_activate_fn(struct work_struct *work)
+{
+	struct core_sched_cpu_work *cpu_work;
+
+	cpu_work = container_of(work, struct core_sched_cpu_work, work);
+	activate_cpu_core_sched(cpu_work);
+	kfree(cpu_work);
+}
+
+static void core_sched_cpu_deactivate_fn(struct work_struct *work)
+{
+	struct core_sched_cpu_work *cpu_work;
+
+	cpu_work = container_of(work, struct core_sched_cpu_work, work);
+	deactivate_cpu_core_sched(cpu_work);
+	kfree(cpu_work);
+}
+
+static void core_sched_cpu_update(unsigned int cpu, enum cpu_action action)
+{
+	struct core_sched_cpu_work *work;
+
+	work = kmalloc(sizeof(struct core_sched_cpu_work), GFP_ATOMIC);
+	if (!work)
+		return;
+
+	if (action == CPU_ACTIVATE)
+		INIT_WORK(&work->work, core_sched_cpu_activate_fn);
+	else
+		INIT_WORK(&work->work, core_sched_cpu_deactivate_fn);
+
+	cpumask_copy(&work->smt_mask, cpu_smt_mask(cpu));
+
+	queue_work(system_highpri_wq, &work->work);
+}
+
+static int __activate_cpu_core_sched(void *data)
+{
+	struct core_sched_cpu_work *work = (struct core_sched_cpu_work *) data;
+	struct rq *rq;
+	int i;
+
+	if (cpumask_weight(&work->smt_mask) < 2)
+		return 0;
+
+	for_each_cpu(i, &work->smt_mask) {
+		const struct sched_class *class;
+
+		rq = cpu_rq(i);
+
+		if (rq->core_enabled)
+			continue;
+
+		for_each_class(class) {
+			if (!class->core_sched_activate)
+				continue;
+
+			if (cpu_online(i))
+				class->core_sched_activate(rq);
+		}
+
+		rq->core_enabled = true;
+	}
+	return 0;
+}
+
+static int __deactivate_cpu_core_sched(void *data)
+{
+	struct core_sched_cpu_work *work = (struct core_sched_cpu_work *) data;
+	struct rq *rq;
+	int i;
+
+	if (cpumask_weight(&work->smt_mask) > 2)
+		return 0;
+
+	for_each_cpu(i, &work->smt_mask) {
+		const struct sched_class *class;
+
+		rq = cpu_rq(i);
+
+		if (!rq->core_enabled)
+			continue;
+
+		for_each_class(class) {
+			if (!class->core_sched_deactivate)
+				continue;
+
+			if (cpu_online(i))
+				class->core_sched_deactivate(cpu_rq(i));
+		}
+
+		rq->core_enabled = false;
+	}
+	return 0;
+}
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
 static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
 static bool sched_core_enqueued(struct task_struct *task) { return false; }
+static inline void core_sched_cpu_update(unsigned int cpu, int action) { }
 
 #endif /* CONFIG_SCHED_CORE */
 
@@ -6941,13 +7080,8 @@ int sched_cpu_activate(unsigned int cpu)
 	 */
 	if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
 		static_branch_inc_cpuslocked(&sched_smt_present);
-#ifdef CONFIG_SCHED_CORE
-		if (static_branch_unlikely(&__sched_core_enabled)) {
-			rq->core_enabled = true;
-		}
-#endif
 	}
-
+	core_sched_cpu_update(cpu, CPU_ACTIVATE);
 #endif
 	set_cpu_active(cpu, true);
 
@@ -6996,15 +7130,10 @@ int sched_cpu_deactivate(unsigned int cpu)
 	 * When going down, decrement the number of cores with SMT present.
 	 */
 	if (cpumask_weight(cpu_smt_mask(cpu)) == 2) {
-#ifdef CONFIG_SCHED_CORE
-		struct rq *rq = cpu_rq(cpu);
-		if (static_branch_unlikely(&__sched_core_enabled)) {
-			rq->core_enabled = false;
-		}
-#endif
 		static_branch_dec_cpuslocked(&sched_smt_present);
 
 	}
+	core_sched_cpu_update(cpu, CPU_DEACTIVATE);
 #endif
 
 	if (!sched_smp_initialized)
@@ -7081,9 +7210,6 @@ int sched_cpu_dying(unsigned int cpu)
 	update_max_interval();
 	nohz_balance_exit_idle(rq);
 	hrtick_clear(rq);
-#ifdef CONFIG_SCHED_CORE
-	rq->core = NULL;
-#endif
 	return 0;
 }
 #endif
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 514b6328262f..6bb69d42965b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1755,6 +1755,37 @@ static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
 	return rb_entry(left, struct sched_dl_entity, rb_node);
 }
 
+static void for_each_dl_task(struct rq *rq,
+                             void (*fn)(struct rq *rq, struct task_struct *p))
+{
+	struct dl_rq *dl_rq = &rq->dl;
+	struct sched_dl_entity *dl_ent;
+	struct task_struct *task;
+	struct rb_node *rb_node;
+
+	rb_node = rb_first_cached(&dl_rq->root);
+	while (rb_node) {
+		dl_ent = rb_entry(rb_node, struct sched_dl_entity, rb_node);
+		task = dl_task_of(dl_ent);
+		fn(rq, task);
+		rb_node = rb_next(rb_node);
+	}
+}
+
+#ifdef CONFIG_SCHED_CORE
+
+static void core_sched_activate_dl(struct rq *rq)
+{
+	for_each_dl_task(rq, sched_core_add);
+}
+
+static void core_sched_deactivate_dl(struct rq *rq)
+{
+	for_each_dl_task(rq, sched_core_remove);
+}
+
+#endif
+
 static struct task_struct *pick_task_dl(struct rq *rq)
 {
 	struct sched_dl_entity *dl_se;
@@ -2430,6 +2461,10 @@ const struct sched_class dl_sched_class = {
 	.rq_online              = rq_online_dl,
 	.rq_offline             = rq_offline_dl,
 	.task_woken		= task_woken_dl,
+#ifdef CONFIG_SCHED_CORE
+	.core_sched_activate    = core_sched_activate_dl,
+	.core_sched_deactivate  = core_sched_deactivate_dl,
+#endif
 #endif
 
 	.task_tick		= task_tick_dl,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4728f5ed45aa..6cfcced2b0bd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10058,6 +10058,40 @@ static void rq_offline_fair(struct rq *rq)
 	unthrottle_offline_cfs_rqs(rq);
 }
 
+static void for_each_fair_task(struct rq *rq,
+			       void (*fn)(struct rq *rq, struct task_struct *p))
+{
+	struct cfs_rq *cfs_rq, *pos;
+	struct sched_entity *se;
+	struct task_struct *task;
+
+	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
+		for (se = __pick_first_entity(cfs_rq);
+		     se != NULL;
+		     se = __pick_next_entity(se)) {
+
+			if (!entity_is_task(se))
+				continue;
+
+			task = task_of(se);
+			fn(rq, task);
+		}
+	}
+}
+
+#ifdef CONFIG_SCHED_CORE
+
+static void core_sched_activate_fair(struct rq *rq)
+{
+	for_each_fair_task(rq, sched_core_add);
+}
+
+static void core_sched_deactivate_fair(struct rq *rq)
+{
+	for_each_fair_task(rq, sched_core_remove);
+}
+
+#endif
 #endif /* CONFIG_SMP */
 
 #ifdef CONFIG_SCHED_CORE
@@ -10612,6 +10646,10 @@ const struct sched_class fair_sched_class = {
 
 	.task_dead		= task_dead_fair,
 	.set_cpus_allowed	= set_cpus_allowed_common,
+#ifdef CONFIG_SCHED_CORE
+	.core_sched_activate	= core_sched_activate_fair,
+	.core_sched_deactivate	= core_sched_deactivate_fair,
+#endif
 #endif
 
 	.task_tick		= task_tick_fair,
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 4714630a90b9..c6694e45b255 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1548,6 +1548,45 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
 	return rt_task_of(rt_se);
 }
 
+static void for_each_rt_task(struct rq *rq,
+			     void (*fn)(struct rq *rq, struct task_struct *p))
+{
+	rt_rq_iter_t iter;
+	struct rt_prio_array *array;
+	struct list_head *queue;
+	int i;
+	struct rt_rq *rt_rq = &rq->rt;
+	struct sched_rt_entity *rt_se = NULL;
+	struct task_struct *task;
+
+	for_each_rt_rq(rt_rq, iter, rq) {
+		array = &rt_rq->active;
+		for (i = 0; i < MAX_RT_PRIO; i++) {
+			queue = array->queue + i;
+			list_for_each_entry(rt_se, queue, run_list) {
+				if (rt_entity_is_task(rt_se)) {
+					task = rt_task_of(rt_se);
+					fn(rq, task);
+				}
+			}
+		}
+	}
+}
+
+#ifdef CONFIG_SCHED_CORE
+
+static void core_sched_activate_rt(struct rq *rq)
+{
+	for_each_rt_task(rq, sched_core_add);
+}
+
+static void core_sched_deactivate_rt(struct rq *rq)
+{
+	for_each_rt_task(rq, sched_core_remove);
+}
+
+#endif
+
 static struct task_struct *pick_task_rt(struct rq *rq)
 {
 	struct task_struct *p;
@@ -2382,6 +2421,10 @@ const struct sched_class rt_sched_class = {
 	.rq_offline             = rq_offline_rt,
 	.task_woken		= task_woken_rt,
 	.switched_from		= switched_from_rt,
+#ifdef CONFIG_SCHED_CORE
+	.core_sched_activate    = core_sched_activate_rt,
+	.core_sched_deactivate  = core_sched_deactivate_rt,
+#endif
 #endif
 
 	.task_tick		= task_tick_rt,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4844e703298a..c2068f2e2dd2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1055,6 +1055,9 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 
 extern void queue_core_balance(struct rq *rq);
 
+void sched_core_add(struct rq *rq, struct task_struct *p);
+void sched_core_remove(struct rq *rq, struct task_struct *p);
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline bool sched_core_enabled(struct rq *rq)
@@ -1838,6 +1841,10 @@ struct sched_class {
 
 	void (*rq_online)(struct rq *rq);
 	void (*rq_offline)(struct rq *rq);
+#ifdef CONFIG_SCHED_CORE
+	void (*core_sched_activate)(struct rq *rq);
+	void (*core_sched_deactivate)(struct rq *rq);
+#endif
 #endif
 
 	void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-01-14  1:12   ` Tim Chen
@ 2020-01-14 15:40     ` Vineeth Remanan Pillai
  2020-01-15  3:43       ` Li, Aubrey
                         ` (2 more replies)
  0 siblings, 3 replies; 84+ messages in thread
From: Vineeth Remanan Pillai @ 2020-01-14 15:40 UTC (permalink / raw)
  To: Tim Chen
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini

On Mon, Jan 13, 2020 at 8:12 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:

> I also encountered kernel panic with the v4 code when taking cpu offline or online
> when core scheduler is running.  I've refreshed the previous patch, along
> with 3 other patches to fix problems related to CPU online/offline.
>
> As a side effect of the fix, each core can now operate in core-scheduling
> mode or non core-scheduling mode, depending on how many online SMT threads it has.
>
> Vineet, are you guys planning to refresh v4 and update it to v5?  Aubrey posted
> a port to the latest kernel earlier.
>
Thanks for the updated patch Tim.

We have been testing with v4 rebased on 5.4.8, as RC kernels had given us
trouble in the past. v5 is due soon and we are planning to release v5 when
5.5 comes out. As of now, v5 has your crash fixes and Aubrey's changes
related to load balancing. We are investigating a performance issue with a
high-overcommit, io-intensive workload, and we are also trying to see if
we can add synchronization during VMEXITs so that a guest vm cannot run
alongside the host kernel. We also need to think about the userland
interface for core scheduling in preparation for the upstreaming work.

Does it make sense to wait until 5.5 release for v5 considering the
above details?

Thanks,
Vineeth

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-01-14 15:40     ` Vineeth Remanan Pillai
@ 2020-01-15  3:43       ` Li, Aubrey
  2020-01-15 19:33         ` Tim Chen
  2020-01-28  2:40       ` Dario Faggioli
  2020-02-06  0:28       ` Tim Chen
  2 siblings, 1 reply; 84+ messages in thread
From: Li, Aubrey @ 2020-01-15  3:43 UTC (permalink / raw)
  To: Vineeth Remanan Pillai, Tim Chen
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini

On 2020/1/14 23:40, Vineeth Remanan Pillai wrote:
> On Mon, Jan 13, 2020 at 8:12 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> 
>> I also encountered kernel panic with the v4 code when taking cpu offline or online
>> when core scheduler is running.  I've refreshed the previous patch, along
>> with 3 other patches to fix problems related to CPU online/offline.
>>
>> As a side effect of the fix, each core can now operate in core-scheduling
>> mode or non core-scheduling mode, depending on how many online SMT threads it has.
>>
>> Vineet, are you guys planning to refresh v4 and update it to v5?  Aubrey posted
>> a port to the latest kernel earlier.
>>
> Thanks for the updated patch Tim.
> 
> We have been testing with v4 rebased on 5.4.8 as RC kernels had given us
> trouble in the past. v5 is due soon and we are planning to release v5 when
> 5.5 comes out. As of now, v5 has your crash fixes and Aubrey's changes
> related to load balancing.

It turns out my load balancing related changes need to be refined.
For example, we don't migrate a task if the task's core cookie does not
match the destination CPU's core cookie, but if the entire core is idle,
we should allow the migration, something like the following:

I plan to do this after my Chinese New Year holiday (Feb 3rd).

Thanks,
-Aubrey

----------------------------------------------------------------------------------
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4728f5e..75e0172 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7366,8 +7366,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * We do not migrate tasks that are:
 	 * 1) throttled_lb_pair, or
 	 * 2) cannot be migrated to this CPU due to cpus_ptr, or
-	 * 3) running (obviously), or
-	 * 4) are cache-hot on their current CPU.
+	 * 3) task's cookie does not match with this CPU's core cookie
+	 * 4) running (obviously), or
+	 * 5) are cache-hot on their current CPU.
 	 */
 	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
 		return 0;
@@ -7402,6 +7403,25 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 		return 0;
 	}
 
+#ifdef CONFIG_SCHED_CORE
+	if (sched_core_enabled(cpu_rq(env->dst_cpu))) {
+		bool idle_core = true;
+		int cpu;
+
+		for_each_cpu(cpu, cpu_smt_mask(env->dst_cpu)) {
+			if (!available_idle_cpu(cpu))
+				idle_core = false;
+		}
+		/*
+		 * Don't migrate task if task's cookie does not match
+		 * with core cookie, unless the entire core is idle.
+		 */
+		if (!idle_core && p->core_cookie !=
+		    cpu_rq(env->dst_cpu)->core->core_cookie)
+			return 0;
+	}
+#endif
+
 	/* Record that we found atleast one task that could run on dst_cpu */
 	env->flags &= ~LBF_ALL_PINNED;
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-01-15  3:43       ` Li, Aubrey
@ 2020-01-15 19:33         ` Tim Chen
  2020-01-16  1:45           ` Aubrey Li
  0 siblings, 1 reply; 84+ messages in thread
From: Tim Chen @ 2020-01-15 19:33 UTC (permalink / raw)
  To: Aubrey Li, Vineeth Remanan Pillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini

On 1/14/20 7:43 PM, Li, Aubrey wrote:
> On 2020/1/14 23:40, Vineeth Remanan Pillai wrote:
>> On Mon, Jan 13, 2020 at 8:12 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>>
>>> I also encountered kernel panic with the v4 code when taking cpu offline or online
>>> when core scheduler is running.  I've refreshed the previous patch, along
>>> with 3 other patches to fix problems related to CPU online/offline.
>>>
>>> As a side effect of the fix, each core can now operate in core-scheduling
>>> mode or non core-scheduling mode, depending on how many online SMT threads it has.
>>>
>>> Vineet, are you guys planning to refresh v4 and update it to v5?  Aubrey posted
>>> a port to the latest kernel earlier.
>>>
>> Thanks for the updated patch Tim.
>>
>> We have been testing with v4 rebased on 5.4.8 as RC kernels had given us
>> trouble in the past. v5 is due soon and we are planning to release v5 when
>> 5.5 comes out. As of now, v5 has your crash fixes and Aubrey's changes
>> related to load balancing.
> 
> It turns out my load balancing related changes need to be refined.
> For example, we don't migrate task if the task's core cookie does not match
> with CPU's core cookie, but if the entire core is idle, we should allow task
> migration, something like the following:
> 
> I plan to do this after my Chinese New Year holiday(Feb 3rd).
> 
> Thanks,
> -Aubrey
> 

Aubrey's attached patch should replace his previous patch
"sched/fair: don't migrate task if cookie not match".

I've also added a fix below for Aubrey's patch
"sched/fair: find cookie matched idlest CPU".

Aubrey, can you merge this fix into that patch when you update
your patches?

Tim

---->8----
From 06c09a9c86db6a7c30e5ebca7a635005ac2df37b Mon Sep 17 00:00:00 2001
From: Tim Chen <tim.c.chen@linux.intel.com>
Date: Thu, 12 Dec 2019 09:08:18 -0800
Subject: [PATCH] sched: Check core scheduler in use when comparing cookie

When scanning scheduler groups to find the idlest cpu for a task waking
up, cookie matching should not be considered if the target cpu doesn't
use core scheduling.  The code introduced by the patch

sched/fair: find cookie matched idlest CPU

matches the cookie regardless of the cpu's core scheduler state.  Fix this.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a90179937f63..55e7b22522db 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5656,8 +5656,15 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 			for_each_cpu(i, sched_group_span(group)) {
 				struct rq *rq = cpu_rq(i);
 
-				if (p->core_cookie == rq->core->core_cookie)
+				if (!sched_core_enabled(rq)) {
 					cookie_match = true;
+					break;
+				}
+
+				if (p->core_cookie == rq->core->core_cookie) {
+					cookie_match = true;
+					break;
+				}
 			}
 			/* Skip over this group if no cookie matched */
 			if (!cookie_match)
-- 
2.20.1
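
With the fix applied, the per-group scan reads roughly as follows (a condensed
sketch rather than a literal quote of the resulting code; the two early exits
are simply folded into one test):

	for_each_cpu(i, sched_group_span(group)) {
		struct rq *rq = cpu_rq(i);

		/* Cookie filtering only applies where core scheduling is on. */
		if (!sched_core_enabled(rq) ||
		    p->core_cookie == rq->core->core_cookie) {
			cookie_match = true;
			break;
		}
	}
	/* Skip over this group if no cookie matched */
	if (!cookie_match)
		continue;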



* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-01-15 19:33         ` Tim Chen
@ 2020-01-16  1:45           ` Aubrey Li
  2020-01-17 16:00             ` Vineeth Remanan Pillai
  0 siblings, 1 reply; 84+ messages in thread
From: Aubrey Li @ 2020-01-16  1:45 UTC (permalink / raw)
  To: Tim Chen
  Cc: Aubrey Li, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Julien Desfossez, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Paul Turner, Linus Torvalds, Linux List Kernel Mailing,
	Dario Faggioli, Frédéric Weisbecker, Kees Cook,
	Greg Kerr, Phil Auld, Aaron Lu, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini

On Thu, Jan 16, 2020 at 3:33 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> On 1/14/20 7:43 PM, Li, Aubrey wrote:
> > On 2020/1/14 23:40, Vineeth Remanan Pillai wrote:
> >> On Mon, Jan 13, 2020 at 8:12 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> >>
> >>> I also encountered kernel panic with the v4 code when taking cpu offline or online
> >>> when core scheduler is running.  I've refreshed the previous patch, along
> >>> with 3 other patches to fix problems related to CPU online/offline.
> >>>
> >>> As a side effect of the fix, each core can now operate in core-scheduling
> >>> mode or non core-scheduling mode, depending on how many online SMT threads it has.
> >>>
> >>> Vineet, are you guys planning to refresh v4 and update it to v5?  Aubrey posted
> >>> a port to the latest kernel earlier.
> >>>
> >> Thanks for the updated patch Tim.
> >>
> >> We have been testing with v4 rebased on 5.4.8 as RC kernels had given us
> >> trouble in the past. v5 is due soon and we are planning to release v5 when
> >> 5.5 comes out. As of now, v5 has your crash fixes and Aubrey's changes
> >> related to load balancing.
> >
> > It turns out my load balancing related changes need to be refined.
> > For example, we don't migrate task if the task's core cookie does not match
> > with CPU's core cookie, but if the entire core is idle, we should allow task
> > migration, something like the following:
> >
> > I plan to do this after my Chinese New Year holiday(Feb 3rd).
> >
> > Thanks,
> > -Aubrey
> >
>
> Aubrey's attached patch should replace his previous patch
> sched/fair: don't migrate task if cookie not match
>
> I've also added a fix below for Aubrey's patch
> sched/fair: find cookie matched idlest CPU.
>
> Aubrey, can you merge this fix into that patch when you update
> your patches?

Thanks Tim, I'll include your fix when I update my patches.

-Aubrey


* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-01-16  1:45           ` Aubrey Li
@ 2020-01-17 16:00             ` Vineeth Remanan Pillai
  2020-01-22 18:04               ` Gruza, Agata
  0 siblings, 1 reply; 84+ messages in thread
From: Vineeth Remanan Pillai @ 2020-01-17 16:00 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Tim Chen, Aubrey Li, Nishanth Aravamudan, Julien Desfossez,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini

> >
> > Aubrey's attached patch should replace his previous patch
> > sched/fair: don't migrate task if cookie not match
> >
> > I've also added a fix below for Aubrey's patch
> > sched/fair: find cookie matched idlest CPU.
> >
> > Aubrey, can you merge this fix into that patch when you update
> > your patches?
>
> Thanks Tim, I'll include your fix when I update my patches.
>
We have included both these changes in our branch rebased to 5.4.y:
https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.4.y

This is our testing branch.

Many Thanks,
Vineeth


* RE: [RFC PATCH v4 00/19] Core scheduling v4
  2020-01-17 16:00             ` Vineeth Remanan Pillai
@ 2020-01-22 18:04               ` Gruza, Agata
  0 siblings, 0 replies; 84+ messages in thread
From: Gruza, Agata @ 2020-01-22 18:04 UTC (permalink / raw)
  To: Vineeth Remanan Pillai, Aubrey Li
  Cc: Tim Chen, Aubrey Li, Nishanth Aravamudan, Julien Desfossez,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini


-----Original Message-----
From: linux-kernel-owner@vger.kernel.org <linux-kernel-owner@vger.kernel.org> On Behalf Of Vineeth Remanan Pillai
Sent: Friday, January 17, 2020 8:01 AM
To: Aubrey Li <aubrey.intel@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>; Aubrey Li <aubrey.li@linux.intel.com>; Nishanth Aravamudan <naravamudan@digitalocean.com>; Julien Desfossez <jdesfossez@digitalocean.com>; Peter Zijlstra <peterz@infradead.org>; Ingo Molnar <mingo@kernel.org>; Thomas Gleixner <tglx@linutronix.de>; Paul Turner <pjt@google.com>; Linus Torvalds <torvalds@linux-foundation.org>; Linux List Kernel Mailing <linux-kernel@vger.kernel.org>; Dario Faggioli <dfaggioli@suse.com>; Frédéric Weisbecker <fweisbec@gmail.com>; Kees Cook <keescook@chromium.org>; Greg Kerr <kerrnel@google.com>; Phil Auld <pauld@redhat.com>; Aaron Lu <aaron.lwe@gmail.com>; Valentin Schneider <valentin.schneider@arm.com>; Mel Gorman <mgorman@techsingularity.net>; Pawan Gupta <pawan.kumar.gupta@linux.intel.com>; Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [RFC PATCH v4 00/19] Core scheduling v4

> >
> > Aubrey's attached patch should replace his previous patch
> > sched/fair: don't migrate task if cookie not match
> >
> > I've also added a fix below for Aubrey's patch
> > sched/fair: find cookie matched idlest CPU.
> >
> > Aubrey, can you merge this fix into that patch when you update your 
> > patches?
>
> Thanks Tim, I'll include your fix when I update my patches.
>
We have included both these changes in our branch rebased to 5.4.y:
https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.4.y

This is our testing branch.



----------------------------------------------------------------------
ABOUT:
----------------------------------------------------------------------
Hello,

Core scheduling is meant to protect against leakage of sensitive data 
to a sibling thread. Our goal is to measure the performance impact of 
core scheduling across different workloads and show how it has evolved 
over time. Below you will find data for core scheduling. The data 
presented here are based on the previous version of core-sched (v3) plus 
additional kernel patches from tim.c.chen@intel.com and the load 
balancer from aubrey.li@intel.com that are now in the (v4) core 
scheduler kernel:
https://github.com/digitalocean/linux-coresched/commits/coresched/v4-v5.4.y
Detailed specifics about each system setup are in the attached PDF.

----------------------------------------------------------------------
BENCHMARKS:
----------------------------------------------------------------------
- SPECrate2017int   : industry standard for measuring compute-intensive
                    CPU performance.
- CASSANDRA         : NoSQL distributed DB and benchmark. Provides
                    high availability with no single point of failure.
- HammerDB          : database benchmarking application.
- SPECvirt          : industry standard for measuring server performance
                    in a virtualized environment.

----------------------------------------------------------------------      
PERFORMANCE IMPACT:
----------------------------------------------------------------------

------------------------------------------------------------------------------------------------
Kernel & Benchmark       | # of cgroups | Overcommit | Performance Change % | Performance Change %
                         |              |            | (HT ON, core-sched)  | (HT OFF, no core-sched)
------------------------------------------------------------------------------------------------
SPECint_rate_base 2017   | 2 cgroups,   |    0.5x    |        -0.48         |          0.26
                         | 1 per VM     |    1x      |        -0.54         |          0.66
                         |              |    2x      |        -0.18         |         -0.07
------------------------------------------------------------------------------------------------
CASSANDRA                | 2 cgroups,   |    0.5x    |        -2.56         |        -44.61
                         | 1 per VM     |    1x      |        -4.71         |        -47.47
                         |              |    2x      |        -7.63         |        -28.47
------------------------------------------------------------------------------------------------
HammerDB                 | 2 cgroups,   |    0.5x    |       -19.99         |        -23.52
                         | 1 per VM     |    1x      |       -14.90         |        -24.63
                         |              |    2x      |       -13.31         |        -25.71
------------------------------------------------------------------------------------------------
SPECvirt                 | see attached PDF          |       -18            |        -31
------------------------------------------------------------------------------------------------
CASSANDRA                | 2 cgroups,   |    0.5x    |        -5.18         |        -44.61
                         | 1 per VM     |    1x      |        -7.90         |        -47.47
                         |              |    2x      |        -6.52         |        -28.47
------------------------------------------------------------------------------------------------

----------------------------------------------------------------------
TAKE AWAYS:
----------------------------------------------------------------------
1. Core scheduling performs better than turning off HT in all 
overcommitting scenarios. However, in certain cases we observed up to 
a 20% performance drop. 
2. The impact of core scheduling depends on the workload and thread 
scheduling intensity. 
3. Core scheduling requires cgroups: one cgroup per VM, with each 
VM running in its own independent cgroup.

Many thanks,
--Agata


[-- Attachment #2: LKML_core_scheduling.pdf --]
[-- Type: application/pdf, Size: 397498 bytes --]


* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-01-14 15:40     ` Vineeth Remanan Pillai
  2020-01-15  3:43       ` Li, Aubrey
@ 2020-01-28  2:40       ` Dario Faggioli
       [not found]         ` <CANaguZDDpzrzdTmvjXvCmV2c+wBt6mXWSz4Vn-LJ-onc_Oj=yw@mail.gmail.com>
  2020-02-06  0:28       ` Tim Chen
  2 siblings, 1 reply; 84+ messages in thread
From: Dario Faggioli @ 2020-01-28  2:40 UTC (permalink / raw)
  To: Vineeth Remanan Pillai, Tim Chen
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Aaron Lu, Aubrey Li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini


On Tue, 2020-01-14 at 10:40 -0500, Vineeth Remanan Pillai wrote:
> On Mon, Jan 13, 2020 at 8:12 PM Tim Chen <tim.c.chen@linux.intel.com>
> 
> > As a side effect of the fix, each core can now operate in core-
> > scheduling
> > mode or non core-scheduling mode, depending on how many online SMT
> > threads it has.
> > 
> > Vineet, are you guys planning to refresh v4 and update it to
> > v5?  Aubrey posted
> > a port to the latest kernel earlier.
> > 
> We are investigating a performance issue with a high-overcommit,
> IO-intensive workload, and we are also trying to see if we can add
> synchronization during VMEXITs so that a guest VM cannot run
> alongside the host kernel.
>
So, about this VMEXIT sync thing. I do agree that we should at least
try and do it (and assess performance).

I was wondering, however, what we think about core-scheduling + address
space isolation (or whatever it is/will be called). More specifically,
whether such a solution wouldn't be considered an equally safe setup
(at least for the virt use-cases, of course).

Basically, core-scheduling would prevent VM-to-VM attacks while ASI
would mitigate VM-to-hypervisor attacks.

Of course, such a solution would need to be fully implemented and
evaluated too... I just wanted to toss it around, mostly to know what
you think about it and whether or not it is already on your radar.

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)




* Re: [RFC PATCH v4 00/19] Core scheduling v4
       [not found]         ` <CANaguZDDpzrzdTmvjXvCmV2c+wBt6mXWSz4Vn-LJ-onc_Oj=yw@mail.gmail.com>
@ 2020-02-01 15:31           ` Dario Faggioli
  0 siblings, 0 replies; 84+ messages in thread
From: Dario Faggioli @ 2020-02-01 15:31 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Tim Chen, Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Frédéric Weisbecker,
	Kees Cook, Greg Kerr, Phil Auld, Aaron Lu, Aubrey Li,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini


On Fri, 2020-01-31 at 09:44 -0500, Vineeth Remanan Pillai wrote:
> > Basically, core-scheduling would prevent VM-to-VM attacks while ASI
> > would mitigate VM-to-hypervisor attacks.
> > 
> > Of course, such a solution would need to be fully implemented and
> > evaluated too... I just wanted to toss it around, mostly to know
> > what
> > you think about it and whether or not it is already on your radar.
> 
> We had this discussion during LPC. 
>
I know. I wanted to be there, but couldn't. But I watched the
recordings of the miniconf. :-)

> It's something on the radar, but we haven't yet spent any dedicated
> time looking into it.
> Theoretically it is very promising. While looking into the practical
> aspects, the main difficulty is to determine what is safe/unsafe to
> expose in the kernel when the sibling is running in userland/VM.
> Coming up with a minimal pagetable for the kernel when the sibling is
> running untrusted code would be non-trivial.
> 
It is. And this is exactly my point. :-)

I mean, what you're describing is pretty much what the memory isolation
efforts are mostly (all?) about, at least AFAIUI.

Therefore, I think we should see about "joining forces".

FWIW, there's a talk about ASI going on right now at FOSDEM2020:
https://fosdem.org/2020/schedule/event/kernel_address_space_isolation/
(this is also video recorded, so it will be possible for everyone to
watch it, in a few days time).

> Its definitely worth spending some time and effort on this idea.
> 
Cool! Happy to hear this. :-)

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)




* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-01-14 15:40     ` Vineeth Remanan Pillai
  2020-01-15  3:43       ` Li, Aubrey
  2020-01-28  2:40       ` Dario Faggioli
@ 2020-02-06  0:28       ` Tim Chen
  2020-02-06 22:37         ` Julien Desfossez
  2020-02-12 23:07         ` Julien Desfossez
  2 siblings, 2 replies; 84+ messages in thread
From: Tim Chen @ 2020-02-06  0:28 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini

On 1/14/20 7:40 AM, Vineeth Remanan Pillai wrote:
> On Mon, Jan 13, 2020 at 8:12 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> 
>> I also encountered kernel panic with the v4 code when taking cpu offline or online
>> when core scheduler is running.  I've refreshed the previous patch, along
>> with 3 other patches to fix problems related to CPU online/offline.
>>
>> As a side effect of the fix, each core can now operate in core-scheduling
>> mode or non core-scheduling mode, depending on how many online SMT threads it has.
>>
>> Vineet, are you guys planning to refresh v4 and update it to v5?  Aubrey posted
>> a port to the latest kernel earlier.
>>
> Thanks for the updated patch Tim.
> 
> We have been testing with v4 rebased on 5.4.8 as RC kernels had given us
> trouble in the past. v5 is due soon and we are planning to release v5 when
> 5.5 comes out. As of now, v5 has your crash fixes and Aubrey's changes
> related to load balancing. We are investigating a performance issue with
> high overcommit io intensive workload and also we are trying to see if
> we can add synchronization during VMEXITs so that a guest VM cannot
> run alongside the host kernel. We also need to think about the userland
> interface for corescheduling in preparation for upstreaming work.
> 

Vineet,

Have you guys been able to make progress on the issues with I/O intensive workload?

Can you explain a bit further what the problem is?  
Is the VM doing lots of I/O, causing frequent VMEXITs?
And the the host thread doing the I/O cannot run along side with vcpu
thread, resulting in extra forced idle?  So you are trying not to put vcpu
on the same core with such host thread?

Tim



* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-02-06  0:28       ` Tim Chen
@ 2020-02-06 22:37         ` Julien Desfossez
  2020-02-12 23:07         ` Julien Desfossez
  1 sibling, 0 replies; 84+ messages in thread
From: Julien Desfossez @ 2020-02-06 22:37 UTC (permalink / raw)
  To: Tim Chen
  Cc: Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini

On 05-Feb-2020 04:28:18 PM, Tim Chen wrote:
> Have you guys been able to make progress on the issues with I/O intensive workload?

Hi Tim,

We are in the process of retesting everything with the following branch:
https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.5.y

The CPU-intensive case that becomes overcommitted when we disable SMT
behaves the same as before (core scheduling is a clear win there). So we
are now digging a bit more in our mixed tests with CPU and IO (MySQL
benchmark running in parallel with low-activity VMs). The tests are
still running, but I expect to be able to present some numbers tomorrow.

We will keep the list informed.

Julien


* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-02-06  0:28       ` Tim Chen
  2020-02-06 22:37         ` Julien Desfossez
@ 2020-02-12 23:07         ` Julien Desfossez
  2020-02-13 18:37           ` Tim Chen
  1 sibling, 1 reply; 84+ messages in thread
From: Julien Desfossez @ 2020-02-12 23:07 UTC (permalink / raw)
  To: Tim Chen
  Cc: Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini

On 05-Feb-2020 04:28:18 PM, Tim Chen wrote:
> On 1/14/20 7:40 AM, Vineeth Remanan Pillai wrote:
> > On Mon, Jan 13, 2020 at 8:12 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> > 
> >> I also encountered kernel panic with the v4 code when taking cpu offline or online
> >> when core scheduler is running.  I've refreshed the previous patch, along
> >> with 3 other patches to fix problems related to CPU online/offline.
> >>
> >> As a side effect of the fix, each core can now operate in core-scheduling
> >> mode or non core-scheduling mode, depending on how many online SMT threads it has.
> >>
> >> Vineet, are you guys planning to refresh v4 and update it to v5?  Aubrey posted
> >> a port to the latest kernel earlier.
> >>
> > Thanks for the updated patch Tim.
> > 
> > We have been testing with v4 rebased on 5.4.8 as RC kernels had given us
> > trouble in the past. v5 is due soon and we are planning to release v5 when
> > 5.5 comes out. As of now, v5 has your crash fixes and Aubrey's changes
> > related to load balancing. We are investigating a performance issue with
> > high overcommit io intensive workload and also we are trying to see if
> > we can add synchronization during VMEXITs so that a guest VM cannot
> > run alongside the host kernel. We also need to think about the userland
> > interface for corescheduling in preparation for upstreaming work.
> > 
> 
> Vineet,
> 
> Have you guys been able to make progress on the issues with I/O intensive workload?

I finally have some results with the following branch:
https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.5.y

We tested the following classes of workloads in VMs (all vcpus in the
same cgroup/tag):
- linpack (pure CPU work)
- sysbench TPC-C (MySQL benchmark, good mix of CPU/net/disk)
  with/without noise VMs around
- FIO randrw VM with/without noise VMs around

Our "noise VMs" are 1-vcpu VMs running a simple workload that wakes up
every 30 seconds, sends a couple of metrics over a VPN and go back to
sleep. They use between 0% and 30% of CPU on the host all the time,
nothing sustained just ups and downs.

# linpack
3x 12-vcpu VMs pinned to a 36-hwthread NUMA node (with SMT on):
- core scheduling manages to perform slightly better than the baseline,
  by up to 20% in some cases !
- with nosmt (so 2:1 overcommit) the performance drops by 24%

# sysbench TPC-C
1x 12-vcpus MySQL server on each NUMA node, 48 client threads (running
on a different server):
- without noise: no performance difference between the 3 configurations
- with 96 noise VMs on each NUMA node:
  - Performance drops by 54% with core scheduling
  - Performance drops by 75% with nosmt
We write at about 130MB/s on disk with that test.

# FIO randrw 50%, 1 thread, libaio, bs=128k, iodepth=32
1x 12-vcpu FIO VM, usually only requiring up to 100% CPU overall (data
thread and all vcpus summed); it reads and writes at about 350MB/s when
running alone:
 - coresched drops 5%
 - nosmt drops 1%

1:1 vcpus vs hardware thread on the NUMA node (filled with noise VMs):
 - coresched drops 7%
 - nosmt drops 22%

3:1 ratio:
 - coresched drops 16%
 - nosmt drops 22%

5:1 ratio:
 - coresched drops 51%
 - nosmt drops 61%

So the main conclusion is that for all the test cases we have studied,
core scheduling performs better than nosmt ! This is different than what
we tested a while back, so it's looking really good !

Now I am looking for confirmation from others. Dario did you have time
to re-run your test suite against that same branch ?

After that, our next step is to trash all that with adding VMEXIT
synchronization points ;-)

Thanks,

Julien


* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-02-12 23:07         ` Julien Desfossez
@ 2020-02-13 18:37           ` Tim Chen
  2020-02-14  6:10             ` Aubrey Li
  0 siblings, 1 reply; 84+ messages in thread
From: Tim Chen @ 2020-02-13 18:37 UTC (permalink / raw)
  To: Julien Desfossez
  Cc: Vineeth Remanan Pillai, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini

On 2/12/20 3:07 PM, Julien Desfossez wrote:

>>
>> Have you guys been able to make progress on the issues with I/O intensive workload?
> 
> I finally have some results with the following branch:
> https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.5.y
> 
> 
> So the main conclusion is that for all the test cases we have studied,
> core scheduling performs better than nosmt ! This is different than what
> we tested a while back, so it's looking really good !

Thanks for the data.  They look really encouraging.

Aubrey is working on updating his patches so it will load balance
to the idle cores a bit better.  We are testing those and will post
the update soon.

Tim



* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-02-13 18:37           ` Tim Chen
@ 2020-02-14  6:10             ` Aubrey Li
       [not found]               ` <CANaguZC40mDHfL1H_9AA7H8cyd028t9PQVRqQ3kB4ha8R7hhqg@mail.gmail.com>
  2020-02-25  3:44               ` Aaron Lu
  0 siblings, 2 replies; 84+ messages in thread
From: Aubrey Li @ 2020-02-14  6:10 UTC (permalink / raw)
  To: Tim Chen
  Cc: Julien Desfossez, Vineeth Remanan Pillai, Nishanth Aravamudan,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini

On Fri, Feb 14, 2020 at 2:37 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> On 2/12/20 3:07 PM, Julien Desfossez wrote:
>
> >>
> >> Have you guys been able to make progress on the issues with I/O intensive workload?
> >
> > I finally have some results with the following branch:
> > https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.5.y
> >
> >
> > So the main conclusion is that for all the test cases we have studied,
> > core scheduling performs better than nosmt ! This is different than what
> > we tested a while back, so it's looking really good !
>
> Thanks for the data.  They look really encouraging.
>
> Aubrey is working on updating his patches so it will load balance
> to the idle cores a bit better.  We are testing those and will post
> the update soon.

I added a helper to check whether a task's cookie matches a CPU's,
including the entire-core-idle case. The refined patchset is updated here:
https://github.com/aubreyli/linux/tree/coresched_v4-v5.5.2

This branch also includes Tim's patchset. According to our testing
results, the performance looks on par with the previous version.
The good news is that the v5.4.y stability issue on our 8-NUMA-node
machine is gone on this v5.5.2 branch.

Thanks,
-Aubrey


* Re: [RFC PATCH v4 00/19] Core scheduling v4
       [not found]               ` <CANaguZC40mDHfL1H_9AA7H8cyd028t9PQVRqQ3kB4ha8R7hhqg@mail.gmail.com>
@ 2020-02-15  6:01                 ` Aubrey Li
       [not found]                   ` <CANaguZBj_x_2+9KwbHCQScsmraC_mHdQB6uRqMTYMmvhBYfv2Q@mail.gmail.com>
  0 siblings, 1 reply; 84+ messages in thread
From: Aubrey Li @ 2020-02-15  6:01 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Tim Chen, Julien Desfossez, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini

On Fri, Feb 14, 2020 at 11:05 PM Vineeth Remanan Pillai
<vpillai@digitalocean.com> wrote:
>
> Hi Aubrey,
>
> Thanks for the updated patches. Merged the changes to our testing branch
> in preparation for v5.
>
>>
>> I added a helper to check task and cpu cookie match, including the
>> entire core idle case. The refined patchset updated at here:
>> https://github.com/aubreyli/linux/tree/coresched_v4-v5.5.2
>
>
> I did not go through all the changes thoroughly, but on a quick glance,
> I feel a small change would optimize it a bit.
>
> +       /*
> +        * A CPU in an idle core is always the best choice for tasks with
> +        * cookies, and ignore cookie match if core scheduler is not enabled
> +        * on the CPU.
> +        */
> +       if (idle_core || !sched_core_enabled(rq))
> +               return true;
> +
> +       return rq->core->core_cookie == p->core_cookie;
> +}
> +
>
> I think check for sched_core_enabled would make sense to be done above the
> for loop. Something like this:
>
> +static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
> +{
> +       bool idle_core = true;
> +       int cpu;
> +
> +       if (!sched_core_enabled(rq))
> +               return true;
> +
> +       for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
> +               if (!available_idle_cpu(cpu)) {
> +                       idle_core = false;
> +                       break;
> +               }
> +       }
> +
> +       /*
> +        * A CPU in an idle core is always the best choice for tasks with
> +        * cookies.
> +        */
> +       return idle_core || rq->core->core_cookie == p->core_cookie;
> +}
> +
>
> We can avoid the unnecessary check for idle_core if sched_core is disabled.
> This would optimize things on systems with lots of cpus.
>
> Please let me know!

Yes, this makes sense. The patch is updated here; I put your name on it,
if you don't mind.
https://github.com/aubreyli/linux/tree/coresched_v4-v5.5.2-rc2

Thanks,
-Aubrey
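
For reference, a sketch of how a helper like the sched_core_cookie_match()
proposed above is meant to be consumed. The call site below is hypothetical
and only illustrates the shape of the check; the actual patches wire the
helper into the load-balance and idlest-CPU selection paths in
kernel/sched/fair.c.

/*
 * Hypothetical caller (the function name is invented for this sketch).
 */
static int can_pull_task(struct lb_env *env, struct task_struct *p)
{
	/*
	 * Destination core runs an incompatible cookie and is not fully
	 * idle: do not migrate. sched_core_cookie_match() also returns
	 * true when core scheduling is disabled on the destination rq.
	 */
	if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
		return 0;

	/* ... the usual affinity/cache-hotness checks would follow ... */
	return 1;
}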


* Re: [RFC PATCH v4 00/19] Core scheduling v4
       [not found]                   ` <CANaguZBj_x_2+9KwbHCQScsmraC_mHdQB6uRqMTYMmvhBYfv2Q@mail.gmail.com>
@ 2020-02-21 23:20                     ` Julien Desfossez
  2020-03-17  0:55                       ` Joel Fernandes
  0 siblings, 1 reply; 84+ messages in thread
From: Julien Desfossez @ 2020-02-21 23:20 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Aubrey Li, Tim Chen, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini

On 18-Feb-2020 04:58:02 PM, Vineeth Remanan Pillai wrote:
> > Yes, this makes sense, patch updated at here, I put your name there if
> > you don't mind.
> > https://github.com/aubreyli/linux/tree/coresched_v4-v5.5.2-rc2
> >
> > Thanks Aubrey!

Just a quick note, I ran a very cpu-intensive benchmark (9x12 vcpus VMs
running linpack), all affined to an 18 cores NUMA node (36 hardware
threads). Each VM is running in its own cgroup/tag with core scheduling
enabled. We know it already performed much better than nosmt, so for
this case, I measured various co-scheduling statistics:
- how much time the process spends co-scheduled with idle, a compatible
  or an incompatible task
- how long the process spends running in an inefficient
  configuration (more than 1 thread running alone on a core)

And I am very happy to report that even though the 9 VMs were configured
to float on the whole NUMA node, the scheduler / load balancer did a
very good job of keeping an efficient configuration:

Process 10667 (qemu-system-x86), 10 seconds trace:
  - total runtime: 46451472309 ns,
  - local neighbors (total: 45713285084 ns, 98.411 % of process runtime):
  - idle neighbors (total: 484054061 ns, 1.042 % of process runtime):
  - foreign neighbors (total: 4191002 ns, 0.009 % of process runtime):
  - unknown neighbors  (total: 92042503 ns, 0.198 % of process runtime)
  - inefficient periods (total: 464832 ns, 0.001 % of process runtime):
    - number of periods: 48
    - min period duration: 1424 ns
    - max period duration: 116988 ns
    - average period duration: 9684.000 ns
    - stdev: 19282.130

I thought you would enjoy seeing this :-)

Have a good weekend,

Julien


* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-02-14  6:10             ` Aubrey Li
       [not found]               ` <CANaguZC40mDHfL1H_9AA7H8cyd028t9PQVRqQ3kB4ha8R7hhqg@mail.gmail.com>
@ 2020-02-25  3:44               ` Aaron Lu
  2020-02-25  5:32                 ` Aubrey Li
       [not found]                 ` <CANaguZD205ccu1V_2W-QuMRrJA9SjJ5ng1do4NCdLy8NDKKrbA@mail.gmail.com>
  1 sibling, 2 replies; 84+ messages in thread
From: Aaron Lu @ 2020-02-25  3:44 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Tim Chen, Julien Desfossez, Vineeth Remanan Pillai,
	Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Fri, Feb 14, 2020 at 02:10:53PM +0800, Aubrey Li wrote:
> On Fri, Feb 14, 2020 at 2:37 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> >
> > On 2/12/20 3:07 PM, Julien Desfossez wrote:
> >
> > >>
> > >> Have you guys been able to make progress on the issues with I/O intensive workload?
> > >
> > > I finally have some results with the following branch:
> > > https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.5.y
> > >
> > >
> > > So the main conclusion is that for all the test cases we have studied,
> > > core scheduling performs better than nosmt ! This is different than what
> > > we tested a while back, so it's looking really good !
> >
> > Thanks for the data.  They look really encouraging.
> >
> > Aubrey is working on updating his patches so it will load balance
> > to the idle cores a bit better.  We are testing those and will post
> > the update soon.
> 
> I added a helper to check task and cpu cookie match, including the
> entire core idle case. The refined patchset updated at here:
> https://github.com/aubreyli/linux/tree/coresched_v4-v5.5.2
> 
> This branch also includes Tim's patchset. According to our testing
> result, the performance data looks on par with the previous version.
> A good news is, v5.4.y stability issue on our 8 numa node machine
> is gone on this v5.5.2 branch.

One problem I have when testing this branch: the weight of the cgroup
seems to be ignored.

On a 2sockets/16cores/32threads VM, I grouped 8 sysbench(cpu mode)
threads into one cgroup(cgA) and another 16 sysbench(cpu mode) threads
into another cgroup(cgB). cgA and cgB's cpusets are set to the same
socket's 8 cores/16 CPUs and cgA's cpu.shares is set to 10240 while cgB's
cpu.shares is set to 2(so consider cgB as noise workload and cgA as
the real workload).

I had expected cgA to occupy 8 cpus(with each cpu on a different core)
most of the time since it has way more weight than cgB, while cgB should
occupy almost no CPUs since:
 - when cgB's task is in the same CPU queue as cgA's task, then cgB's
   task is given very little CPU due to its small weight;
 - when cgB's task is in a CPU queue whose sibling's queue has cgA's
   task, cgB's task should be forced idle(again, due to its small weight).
 
But testing shows cgA occupies only 2 cpus during the entire run while
cgB enjoys the remaining 14 cpus. As a comparison, when coresched is off,
cgA can occupy 8 cpus during its run.

I haven't taken a look at the patches, but would like to raise the
problem first. My gut feeling is that we are not keeping the CPUs'
load balanced.

P.S. it's not that I care about VM's performance, it's just easier to
test kernel stuff using a VM than on a bare metal. Its CPU setup might
seem weird, I just set it up to be the same as my host setup.


* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-02-25  3:44               ` Aaron Lu
@ 2020-02-25  5:32                 ` Aubrey Li
  2020-02-25  7:34                   ` Aaron Lu
       [not found]                 ` <CANaguZD205ccu1V_2W-QuMRrJA9SjJ5ng1do4NCdLy8NDKKrbA@mail.gmail.com>
  1 sibling, 1 reply; 84+ messages in thread
From: Aubrey Li @ 2020-02-25  5:32 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Tim Chen, Julien Desfossez, Vineeth Remanan Pillai,
	Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Tue, Feb 25, 2020 at 11:44 AM Aaron Lu <aaron.lwe@gmail.com> wrote:
>
> On Fri, Feb 14, 2020 at 02:10:53PM +0800, Aubrey Li wrote:
> > On Fri, Feb 14, 2020 at 2:37 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> > >
> > > On 2/12/20 3:07 PM, Julien Desfossez wrote:
> > >
> > > >>
> > > >> Have you guys been able to make progress on the issues with I/O intensive workload?
> > > >
> > > > I finally have some results with the following branch:
> > > > https://github.com/digitalocean/linux-coresched/tree/coresched/v4-v5.5.y
> > > >
> > > >
> > > > So the main conclusion is that for all the test cases we have studied,
> > > > core scheduling performs better than nosmt ! This is different than what
> > > > we tested a while back, so it's looking really good !
> > >
> > > Thanks for the data.  They look really encouraging.
> > >
> > > Aubrey is working on updating his patches so it will load balance
> > > to the idle cores a bit better.  We are testing those and will post
> > > the update soon.
> >
> > I added a helper to check task and cpu cookie match, including the
> > entire core idle case. The refined patchset updated at here:
> > https://github.com/aubreyli/linux/tree/coresched_v4-v5.5.2
> >
> > This branch also includes Tim's patchset. According to our testing
> > result, the performance data looks on par with the previous version.
> > A good news is, v5.4.y stability issue on our 8 numa node machine
> > is gone on this v5.5.2 branch.
>
> One problem I have when testing this branch: the weight of the croup
> seems to be ignored.
>
> On a 2sockets/16cores/32threads VM, I grouped 8 sysbench(cpu mode)
> threads into one cgroup(cgA) and another 16 sysbench(cpu mode) threads
> into another cgroup(cgB). cgA and cgB's cpusets are set to the same
> socket's 8 cores/16 CPUs and cgA's cpu.shares is set to 10240 while cgB's
> cpu.shares is set to 2(so consider cgB as noise workload and cgA as
> the real workload).
>
> I had expected cgA to occupy 8 cpus(with each cpu on a different core)
> most of the time since it has way more weight than cgB, while cgB should
> occupy almost no CPUs since:
>  - when cgB's task is in the same CPU queue as cgA's task, then cgB's
>    task is given very little CPU due to its small weight;
>  - when cgB's task is in a CPU queue whose sibling's queue has cgA's
>    task, cgB's task should be forced idle(again, due to its small weight).
>
> But testing shows cgA occupies only 2 cpus during the entire run while
> cgB enjoys the remaining 14 cpus. As a comparison, when coresched is off,
> cgA can occupy 8 cpus during its run.
>
> I haven't taken a look at the patches, but would like to raise the
> problem first. My gut feeling is that, we didn't make the CPU's load
> balanced.
>
> P.S. it's not that I care about VM's performance, it's just easier to
> test kernel stuff using a VM than on a bare metal. Its CPU setup might
> seem weird, I just set it up to be the same as my host setup.

Aaron - did you test this before? In other words, if you reset repo to your
last commit:

- 5bd3c80 sched/fair : Wake up forced idle siblings if needed

Does the problem remain? Just want to check if this is a regression
 introduced by the subsequent patchset.

Thanks,
-Aubrey


* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-02-25  5:32                 ` Aubrey Li
@ 2020-02-25  7:34                   ` Aaron Lu
  2020-02-25 10:40                     ` Aubrey Li
  0 siblings, 1 reply; 84+ messages in thread
From: Aaron Lu @ 2020-02-25  7:34 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Tim Chen, Julien Desfossez, Vineeth Remanan Pillai,
	Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Tue, Feb 25, 2020 at 01:32:35PM +0800, Aubrey Li wrote:
> Aaron - did you test this before? In other words, if you reset repo to your
> last commit:

I did this test only recently when I started to think if I can use
coresched to boost main workload's performance in a colocated
environment.

> 
> - 5bd3c80 sched/fair : Wake up forced idle siblings if needed
> 
> Does the problem remain? Just want to check if this is a regression
>  introduced by the subsequent patchset.

The problem isn't there with commit 5bd3c80 as the head, so yes, it
does look like a regression introduced by the subsequent patchset.

P.S. I will need to take a closer look later at whether each cgA task is
running on a different core, but the cpu usage of cgA is back to 800%
with commit 5bd3c80.


* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-02-25  7:34                   ` Aaron Lu
@ 2020-02-25 10:40                     ` Aubrey Li
  2020-02-25 11:21                       ` Aaron Lu
  0 siblings, 1 reply; 84+ messages in thread
From: Aubrey Li @ 2020-02-25 10:40 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Tim Chen, Julien Desfossez, Vineeth Remanan Pillai,
	Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Tue, Feb 25, 2020 at 3:34 PM Aaron Lu <aaron.lwe@gmail.com> wrote:
>
> On Tue, Feb 25, 2020 at 01:32:35PM +0800, Aubrey Li wrote:
> > Aaron - did you test this before? In other words, if you reset repo to your
> > last commit:
>
> I did this test only recently when I started to think if I can use
> coresched to boost main workload's performance in a colocated
> environment.
>
> >
> > - 5bd3c80 sched/fair : Wake up forced idle siblings if needed
> >
> > Does the problem remain? Just want to check if this is a regression
> >  introduced by the subsequent patchset.
>
> The problem isn't there with commit 5bd3c80 as the head, so yes, it
> looks like indeed a regression introduced by subsequent patchset.
>
> P.S. I will need to take a closer look if each cgA's task is running
> on a different core later but the cpu usage of cgA is back to 800% with
> commit 5bd3c80.

Hmm..., I went through the subsequent patches, and I think this one

- 4041eeb8f3 sched/fair: don't migrate task if cookie not match

is probably the major cause, can you please revert this one to see
if the problem is gone?

From what I can tell, if the 16 threads in cgB occupy 8 cores, this
patch prevents any thread in cgA from migrating when load balancing
is triggered, and yes, cpu.shares is ignored at that point.

Thanks,
-Aubrey


* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-02-25 10:40                     ` Aubrey Li
@ 2020-02-25 11:21                       ` Aaron Lu
  2020-02-25 13:41                         ` Aubrey Li
  0 siblings, 1 reply; 84+ messages in thread
From: Aaron Lu @ 2020-02-25 11:21 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Tim Chen, Julien Desfossez, Vineeth Remanan Pillai,
	Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Tue, Feb 25, 2020 at 06:40:02PM +0800, Aubrey Li wrote:
> On Tue, Feb 25, 2020 at 3:34 PM Aaron Lu <aaron.lwe@gmail.com> wrote:
> >
> > On Tue, Feb 25, 2020 at 01:32:35PM +0800, Aubrey Li wrote:
> > > Aaron - did you test this before? In other words, if you reset repo to your
> > > last commit:
> >
> > I did this test only recently when I started to think if I can use
> > coresched to boost main workload's performance in a colocated
> > environment.
> >
> > >
> > > - 5bd3c80 sched/fair : Wake up forced idle siblings if needed
> > >
> > > Does the problem remain? Just want to check if this is a regression
> > >  introduced by the subsequent patchset.
> >
> > The problem isn't there with commit 5bd3c80 as the head, so yes, it
> > looks like indeed a regression introduced by subsequent patchset.
> >
> > P.S. I will need to take a closer look if each cgA's task is running
> > on a different core later but the cpu usage of cgA is back to 800% with
> > commit 5bd3c80.
> 
> Hmm..., I went through the subsequent patches, and I think this one
> 
> - 4041eeb8f3 sched/fair: don't migrate task if cookie not match
> 
> is probably the major cause, can you please revert this one to see
> if the problem is gone?

Yes, reverting this one fixed the problem.


* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-02-25 11:21                       ` Aaron Lu
@ 2020-02-25 13:41                         ` Aubrey Li
  0 siblings, 0 replies; 84+ messages in thread
From: Aubrey Li @ 2020-02-25 13:41 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Tim Chen, Julien Desfossez, Vineeth Remanan Pillai,
	Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Tue, Feb 25, 2020 at 7:22 PM Aaron Lu <aaron.lwe@gmail.com> wrote:
>
> On Tue, Feb 25, 2020 at 06:40:02PM +0800, Aubrey Li wrote:
> > On Tue, Feb 25, 2020 at 3:34 PM Aaron Lu <aaron.lwe@gmail.com> wrote:
> > >
> > > On Tue, Feb 25, 2020 at 01:32:35PM +0800, Aubrey Li wrote:
> > > > Aaron - did you test this before? In other words, if you reset repo to your
> > > > last commit:
> > >
> > > I did this test only recently when I started to think if I can use
> > > coresched to boost main workload's performance in a colocated
> > > environment.
> > >
> > > >
> > > > - 5bd3c80 sched/fair : Wake up forced idle siblings if needed
> > > >
> > > > Does the problem remain? Just want to check if this is a regression
> > > >  introduced by the subsequent patchset.
> > >
> > > The problem isn't there with commit 5bd3c80 as the head, so yes, it
> > > looks like indeed a regression introduced by subsequent patchset.
> > >
> > > P.S. I will need to take a closer look if each cgA's task is running
> > > on a different core later but the cpu usage of cgA is back to 800% with
> > > commit 5bd3c80.
> >
> > Hmm..., I went through the subsequent patches, and I think this one
> >
> > - 4041eeb8f3 sched/fair: don't migrate task if cookie not match
> >
> > is probably the major cause, can you please revert this one to see
> > if the problem is gone?
>
> Yes, reverting this one fixed the problem.

Okay, but this patch also contributed to the improvement of a few benchmarks
on my side. So we need a way to fix your case; my quick thought is to allow
task migration in this case (which sounds like a workaround). We need to take
a deeper look at the CPU resource control code path when core scheduling is
enabled.

Any ideas?

Thanks,
-Aubrey


* Re: [RFC PATCH v4 00/19] Core scheduling v4
       [not found]                 ` <CANaguZD205ccu1V_2W-QuMRrJA9SjJ5ng1do4NCdLy8NDKKrbA@mail.gmail.com>
@ 2020-02-26  3:13                   ` Aaron Lu
  2020-02-26  7:21                   ` Aubrey Li
  2020-02-27  2:04                   ` Aaron Lu
  2 siblings, 0 replies; 84+ messages in thread
From: Aaron Lu @ 2020-02-26  3:13 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Aubrey Li, Tim Chen, Julien Desfossez, Nishanth Aravamudan,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Tue, Feb 25, 2020 at 03:51:37PM -0500, Vineeth Remanan Pillai wrote:
> Hi Aaron,
> We tried reproducing this with a sample script here:
> https://gist.github.com/vineethrp/4356e66694269d1525ff254d7f213aef

Nice script.

> But the set1 cgroup processes always get their share of cpu time in
> our test. Could you please verify if its the same test that you were
> also doing? The only difference is that we run on a 2 16core/32thread
> socket bare metal using only socket 0. We also tried threads instead of
> processes, but the results are the same.

Sorry for missing one detail: I always start the noise workload first,
and then start the real workload. This is critical for this test, since
only then can the noise workload occupy all CPUs and present a
challenge for the load balancer to balance the real workload's tasks. If
both workloads are started at the same time, the initial task placement
might mitigate the problem.

BTW, your script has given 12cores/24 CPUs to the workloads and cgA
spawned 16 tasks, cgB spawned 32. This is an even more complex scenario
to test since the real workload's task number is already more than the
available core number. Perhaps just starting 12 tasks for cgA and 24
tasks for cgB is enough for now. As for the start sequence, simply sleep
5 seconds after cgB workload is started, then start cgA. I have left a
comment on the script's gist page.

> 
> 
> On a 2sockets/16cores/32threads VM, I grouped 8 sysbench(cpu mode)
> > threads into one cgroup(cgA) and another 16 sysbench(cpu mode) threads
> > into another cgroup(cgB). cgA and cgB's cpusets are set to the same
> > socket's 8 cores/16 CPUs and cgA's cpu.shares is set to 10240 while cgB's
> > cpu.shares is set to 2(so consider cgB as noise workload and cgA as
> > the real workload).
> >
> > I had expected cgA to occupy 8 cpus(with each cpu on a different core)
> 
> The expected behaviour could also be that 8 processes share 4 cores and
> 8 hw threads right? This is what we are seeing mostly
> 
> most of the time since it has way more weight than cgB, while cgB should
> > occupy almost no CPUs since:
> >  - when cgB's task is in the same CPU queue as cgA's task, then cgB's
> >    task is given very little CPU due to its small weight;
> >  - when cgB's task is in a CPU queue whose sibling's queue has cgA's
> >    task, cgB's task should be forced idle(again, due to its small weight).
> >
> We are seeing the cgA is taking half the cores and cgB taking rest half
> of the cores. Looks like the scheduler ultimately groups the tasks into
> its own cores.
> 
> 
> >
> > But testing shows cgA occupies only 2 cpus during the entire run while
> > cgB enjoys the remaining 14 cpus. As a comparison, when coresched is off,
> > cgA can occupy 8 cpus during its run.
> >
> > Not sure why we are not able to reproduce this. I have a quick patch
> which might fix this. The idea is that, we allow migration if p's
> hierarchical load or estimated utilization is more than dest_rq->curr.
> While thinking about this fix, I noticed that we are not holding the
> dest_rq lock for any of the migration patches. Migration patches would
> probably need a rework. Attaching my patch down, but it also does not
> take the dest_rq lock. I have also added a case of dest_core being
> forced_idle. I think that would be an opportunity to migrate. Ideally
> we should check if the forced idle task has the same cookie as p.
> 
> https://gist.github.com/vineethrp/887743608f42a6ce96bf7847b5b119ae

Is this on top of Aubrey's coresched_v4-v5.5.2 branch?


* Re: [RFC PATCH v4 00/19] Core scheduling v4
       [not found]                 ` <CANaguZD205ccu1V_2W-QuMRrJA9SjJ5ng1do4NCdLy8NDKKrbA@mail.gmail.com>
  2020-02-26  3:13                   ` Aaron Lu
@ 2020-02-26  7:21                   ` Aubrey Li
       [not found]                     ` <CANaguZDQZg-Z6aNpeLcjQ-cGm3X8CQOkZ_hnJNUyqDRM=yVDFQ@mail.gmail.com>
  2020-02-27  2:04                   ` Aaron Lu
  2 siblings, 1 reply; 84+ messages in thread
From: Aubrey Li @ 2020-02-26  7:21 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Aaron Lu, Tim Chen, Julien Desfossez, Nishanth Aravamudan,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Wed, Feb 26, 2020 at 4:51 AM Vineeth Remanan Pillai
<vpillai@digitalocean.com> wrote:
>
> I have a quick patch
> which might fix this. The idea is that, we allow migration if p's
> hierarchical load or estimated utilization is more than dest_rq->curr.
> While thinking about this fix, I noticed that we are not holding the
> dest_rq lock for any of the migration patches. Migration patches would
> probably need a rework. Attaching my patch down, but it also does not
> take the dest_rq lock. I have also added a case of dest_core being
> forced_idle. I think that would be an opportunity to migrate. Ideally
> we should check if the forced idle task has the same cookie as p.
>
> https://gist.github.com/vineethrp/887743608f42a6ce96bf7847b5b119ae
>

Hi Vineeth,

Thanks for the quick fix. I guess this patch should work for Aaron's case
because it almost removed the cookie checking here. Some comments
below:

+ if (rq->core->core_forceidle)
+         return true;

We check for a cookie match during load balance to avoid putting two tasks
with different cookies on the same core. If one cpu is forced idle, its
sibling should be running a non-idle task, so I'm not sure why we can return
true directly here. And if we do need to check the cookie, why is this
forced-idle case special enough to be picked?

+ if (task_h_load(p) > task_h_load(rq->curr))
+         return true;
+ if (task_util_est(p) > task_util_est(rq->curr))
+         return true;

Do these two need to exclude the rq->curr == rq->idle case?
Otherwise idle balance always returns true?
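
For illustration only, a minimal sketch of the checks being discussed,
with the rq->curr == rq->idle exclusion raised above. It assumes the
coresched_v4 fields (rq->core, core_forceidle) and the existing fair.c
helpers task_h_load()/task_util_est(); it is a sketch of the idea, not
the posted patch.

static inline bool coresched_migrate_hint(struct rq *dst_rq,
					  struct task_struct *p)
{
	/* A forced-idle destination core may be a migration opportunity. */
	if (dst_rq->core->core_forceidle)
		return true;

	/* Exclude the idle case so idle balance does not always pass. */
	if (dst_rq->curr == dst_rq->idle)
		return false;

	/* Otherwise prefer pulling p if it is heavier than what dst runs now. */
	if (task_h_load(p) > task_h_load(dst_rq->curr))
		return true;
	if (task_util_est(p) > task_util_est(dst_rq->curr))
		return true;

	return false;
}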

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
       [not found]                 ` <CANaguZD205ccu1V_2W-QuMRrJA9SjJ5ng1do4NCdLy8NDKKrbA@mail.gmail.com>
  2020-02-26  3:13                   ` Aaron Lu
  2020-02-26  7:21                   ` Aubrey Li
@ 2020-02-27  2:04                   ` Aaron Lu
  2020-02-27 14:10                     ` Phil Auld
  2 siblings, 1 reply; 84+ messages in thread
From: Aaron Lu @ 2020-02-27  2:04 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Aubrey Li, Tim Chen, Julien Desfossez, Nishanth Aravamudan,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Tue, Feb 25, 2020 at 03:51:37PM -0500, Vineeth Remanan Pillai wrote:
> On a 2sockets/16cores/32threads VM, I grouped 8 sysbench(cpu mode)
> > threads into one cgroup(cgA) and another 16 sysbench(cpu mode) threads
> > into another cgroup(cgB). cgA and cgB's cpusets are set to the same
> > socket's 8 cores/16 CPUs and cgA's cpu.shares is set to 10240 while cgB's
> > cpu.shares is set to 2(so consider cgB as noise workload and cgA as
> > the real workload).
> >
> > I had expected cgA to occupy 8 cpus(with each cpu on a different core)
> 
> The expected behaviour could also be that 8 processes share 4 cores and
> 8 hw threads right? This is what we are seeing mostly

I expect the 8 cgA tasks to spread across the cores, instead of occupying
4 cores/8 hw threads. If they stay on 4 cores/8 hw threads, then at the
core level these cores' load would be much higher than that of the other
cores which are running cgB's tasks; this doesn't look right to me.

I think the end result should be: each core has two tasks queued, one
cgA task and one cgB task (to maintain load balance at the core level).
The two tasks are queued on different hw threads, with cgA's task running
most of the time on one thread and cgB's task being forced idle most
of the time on the other thread.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
       [not found]                     ` <CANaguZDQZg-Z6aNpeLcjQ-cGm3X8CQOkZ_hnJNUyqDRM=yVDFQ@mail.gmail.com>
@ 2020-02-27  4:45                       ` Aubrey Li
  2020-02-28 23:55                       ` Tim Chen
  1 sibling, 0 replies; 84+ messages in thread
From: Aubrey Li @ 2020-02-27  4:45 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Aaron Lu, Tim Chen, Julien Desfossez, Nishanth Aravamudan,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Thu, Feb 27, 2020 at 5:54 AM Vineeth Remanan Pillai
<vpillai@digitalocean.com> wrote:
>
> Hi Aubrey,
>>
>>
>> Thanks for the quick fix. I guess this patch should work for Aaron's case
>> because it almost removed the cookie checking here. Some comments
>> below:
>>
>> + if (rq->core->core_forceidle)
>> +         return true;
>>
>> We check cookie match during load balance to avoid two different
>> cookie tasks on the same core. If one cpu is forced idle, its sibling should
>> be running a non-idle task, not sure why we can return true directly here.
>> And if we need to check cookie, with this forced idle case, why it's special
>> to be picked up?
>
> The idea behind this was: If a core is forced idle, then we have a
> scenario where one task is running without a companion and one task
> is forced to be idle. So if this source task is of the same cookie
> as the forced idle task, then migrating this source task will offer a
> chance for the force idle task to run together with task in subsequent
> schedules. We don't have a framework to check the cookie of forced
> idle tasks and hence this if block is only partial fix.

To mitigate forced idle, this helper is trying to prevent the unmatched case,
not to catch the matched case. If the task's cookie matches the forced-idle
cpu's sibling, it is able to pass the check later anyway. Compared to my
original intention, this fix increases the unmatched cases, as it returns
immediately and we never check the cookie afterwards.

>>
>>
>> + if (task_h_load(p) > task_h_load(rq->curr))
>> +         return true;
>> + if (task_util_est(p) > task_util_est(rq->curr))
>> +         return true;
>>
>> These two need to exclude rq->curr == rq->idle case?
>> otherwise idle balance always return true?
>
> rq->curr being NULL can mean that the sibling is idle or forced idle.
> In both the cases, I think it makes sense to migrate a task so that it can
> compete with the other sibling for a chance to run. This function
> can_migrate_task actually only says if this task is eligible and
> later part of the code decides whether it is okay to migrate it
> based on factors like load and util and capacity. So I think its
> fine to declare the task as eligible if the dest core is running
> idle. Does this thinking make sense?
>
The intention here is to stop migration, not to allow it.
task_h_load(idle) is zero, but if the idle task's sibling CPU is running
a task with cookie A, the fix returns immediately whenever
task_h_load(task with cookie B) > 0, which lets another unmatched case in.

> On our testing, it did not show much degradation in performance with
> this change. I am reworking the fix by removing the check for
> task_est_util. It doesn't seem to be valid to check for util to migrate
> the task.

I'm not sure the check on task_h_load is exactly what is wanted here either.
From my understanding, the problem is that we stopped migrating the high
weight task because its cookie does not match the target's; can we check
for high load here instead of high weight?

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-02-27  2:04                   ` Aaron Lu
@ 2020-02-27 14:10                     ` Phil Auld
  2020-02-27 14:37                       ` Aubrey Li
  2020-02-28  2:54                       ` Aaron Lu
  0 siblings, 2 replies; 84+ messages in thread
From: Phil Auld @ 2020-02-27 14:10 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Vineeth Remanan Pillai, Aubrey Li, Tim Chen, Julien Desfossez,
	Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

Hi Aaron,

On Thu, Feb 27, 2020 at 10:04:32AM +0800 Aaron Lu wrote:
> On Tue, Feb 25, 2020 at 03:51:37PM -0500, Vineeth Remanan Pillai wrote:
> > On a 2sockets/16cores/32threads VM, I grouped 8 sysbench(cpu mode)
> > > threads into one cgroup(cgA) and another 16 sysbench(cpu mode) threads
> > > into another cgroup(cgB). cgA and cgB's cpusets are set to the same
> > > socket's 8 cores/16 CPUs and cgA's cpu.shares is set to 10240 while cgB's
> > > cpu.shares is set to 2(so consider cgB as noise workload and cgA as
> > > the real workload).
> > >
> > > I had expected cgA to occupy 8 cpus(with each cpu on a different core)
> > 
> > The expected behaviour could also be that 8 processes share 4 cores and
> > 8 hw threads right? This is what we are seeing mostly
> 
> I expect the 8 cgA tasks to spread on each core, instead of occupying
> 4 cores/8 hw threads. If they stay on 4 cores/8 hw threads, than on the
> core level, these cores' load would be much higher than other cores
> which are running cgB's tasks, this doesn't look right to me.
> 

I don't think that's a valid assumption, at least since the load balancer rework.

The scheduler will be looking much more at the number of running tasks versus
the group weight. So in this case, 2 running tasks on the 2 siblings at the
core level will look fine. There will be no reason to migrate.

> I think the end result should be: each core has two tasks queued, one
> cgA task and one cgB task(to maintain load balance on the core level).
> The two tasks are queued on different hw thread, with cgA's task runs
> most of the time on one thread and cgB's task being forced idle most
> of the time on the other thread.
> 

With the core scheduler that does not seem to be a desired outcome. I think
grouping the 8 cgA tasks on the 8 cpus of 4 cores seems right. 



Cheers,
Phil

-- 


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-02-27 14:10                     ` Phil Auld
@ 2020-02-27 14:37                       ` Aubrey Li
  2020-02-28  2:54                       ` Aaron Lu
  1 sibling, 0 replies; 84+ messages in thread
From: Aubrey Li @ 2020-02-27 14:37 UTC (permalink / raw)
  To: Phil Auld
  Cc: Aaron Lu, Vineeth Remanan Pillai, Tim Chen, Julien Desfossez,
	Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Thu, Feb 27, 2020 at 10:10 PM Phil Auld <pauld@redhat.com> wrote:
>
> Hi Aaron,
>
> On Thu, Feb 27, 2020 at 10:04:32AM +0800 Aaron Lu wrote:
> > On Tue, Feb 25, 2020 at 03:51:37PM -0500, Vineeth Remanan Pillai wrote:
> > > On a 2sockets/16cores/32threads VM, I grouped 8 sysbench(cpu mode)
> > > > threads into one cgroup(cgA) and another 16 sysbench(cpu mode) threads
> > > > into another cgroup(cgB). cgA and cgB's cpusets are set to the same
> > > > socket's 8 cores/16 CPUs and cgA's cpu.shares is set to 10240 while cgB's
> > > > cpu.shares is set to 2(so consider cgB as noise workload and cgA as
> > > > the real workload).
> > > >
> > > > I had expected cgA to occupy 8 cpus(with each cpu on a different core)
> > >
> > > The expected behaviour could also be that 8 processes share 4 cores and
> > > 8 hw threads right? This is what we are seeing mostly
> >
> > I expect the 8 cgA tasks to spread on each core, instead of occupying
> > 4 cores/8 hw threads. If they stay on 4 cores/8 hw threads, than on the
> > core level, these cores' load would be much higher than other cores
> > which are running cgB's tasks, this doesn't look right to me.
> >
>
> I don't think that's a valid assumption, at least since the load balancer rework.
>
> The scheduler will be looking much more at the number of running task versus
> the group weight. So in this case 2 running tasks, 2 siblings at the core level
> will look fine. There will be no reason to migrate.

Can this be replicated?

>
> > I think the end result should be: each core has two tasks queued, one
> > cgA task and one cgB task(to maintain load balance on the core level).
> > The two tasks are queued on different hw thread, with cgA's task runs
> > most of the time on one thread and cgB's task being forced idle most
> > of the time on the other thread.
> >
>
> With the core scheduler that does not seem to be a desired outcome. I think
> grouping the 8 cgA tasks on the 8 cpus of 4 cores seems right.
>
Especially if the load of a cgA task + a cgB task > cpu capacity, grouping the
cgA tasks can avoid forced idle completely. Maintaining core-level balance does
not seem to be the best result. I guess that's why, with the core scheduler
enabled, we saw a 10-20% improvement in some cases over the default with the
core scheduler disabled.

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-02-27 14:10                     ` Phil Auld
  2020-02-27 14:37                       ` Aubrey Li
@ 2020-02-28  2:54                       ` Aaron Lu
  2020-03-05 13:45                         ` Aubrey Li
  1 sibling, 1 reply; 84+ messages in thread
From: Aaron Lu @ 2020-02-28  2:54 UTC (permalink / raw)
  To: Phil Auld
  Cc: Vineeth Remanan Pillai, Aubrey Li, Tim Chen, Julien Desfossez,
	Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Thu, Feb 27, 2020 at 09:10:33AM -0500, Phil Auld wrote:
> Hi Aaron,
> 
> On Thu, Feb 27, 2020 at 10:04:32AM +0800 Aaron Lu wrote:
> > On Tue, Feb 25, 2020 at 03:51:37PM -0500, Vineeth Remanan Pillai wrote:
> > > On a 2sockets/16cores/32threads VM, I grouped 8 sysbench(cpu mode)
> > > > threads into one cgroup(cgA) and another 16 sysbench(cpu mode) threads
> > > > into another cgroup(cgB). cgA and cgB's cpusets are set to the same
> > > > socket's 8 cores/16 CPUs and cgA's cpu.shares is set to 10240 while cgB's
> > > > cpu.shares is set to 2(so consider cgB as noise workload and cgA as
> > > > the real workload).
> > > >
> > > > I had expected cgA to occupy 8 cpus(with each cpu on a different core)
> > > 
> > > The expected behaviour could also be that 8 processes share 4 cores and
> > > 8 hw threads right? This is what we are seeing mostly
> > 
> > I expect the 8 cgA tasks to spread on each core, instead of occupying
> > 4 cores/8 hw threads. If they stay on 4 cores/8 hw threads, than on the
> > core level, these cores' load would be much higher than other cores
> > which are running cgB's tasks, this doesn't look right to me.
> > 
> 
> I don't think that's a valid assumption, at least since the load balancer rework.
> 
> The scheduler will be looking much more at the number of running task versus
> the group weight. So in this case 2 running tasks, 2 siblings at the core level
> will look fine. There will be no reason to migrate. 

In the absence of core scheduling, I agree there is no reason to migrate,
since no matter how we migrate, the end result is one high-weight task
sharing a core with another (high or low weight) task. But with core
scheduling, things can be different: if the high weight tasks are
spread among cores, then each of them can enjoy its core alone
(by force idling its sibling) and get better performance.

I'm thinking of using core scheduling to protect the main workload's
performance in a colocated environment, similar to the realtime use
case described here:
https://lwn.net/Articles/799454/

I'll quote the relevant part here:
"
But core scheduling can force sibling CPUs to go idle when a realtime
process is running on the core, thus preventing this kind of
interference. That opens the door to enabling SMT whenever a core has no
realtime work, but effectively disabling it when realtime constraints
apply, getting the best of both worlds. 
"

Using cpuset for the main workload, to only allow its tasks to run on one HT
of each core, might also solve this, but I had hoped not to need cpusets
as that can add complexity in deployment.

> > I think the end result should be: each core has two tasks queued, one
> > cgA task and one cgB task(to maintain load balance on the core level).
> > The two tasks are queued on different hw thread, with cgA's task runs
> > most of the time on one thread and cgB's task being forced idle most
> > of the time on the other thread.
> > 
> 
> With the core scheduler that does not seem to be a desired outcome. I think
> grouping the 8 cgA tasks on the 8 cpus of 4 cores seems right. 
> 

When the core wide weight is somewhat balanced, yes, I definitely agree.
But when the core wide weights mismatch a lot, I'm not so sure, since if these
high weight tasks are spread among cores, then with core scheduling they can
get better performance. So this appears to me like a question of: is it
desirable to protect/enhance high weight task performance in the presence of
core scheduling?

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
       [not found]                     ` <CANaguZDQZg-Z6aNpeLcjQ-cGm3X8CQOkZ_hnJNUyqDRM=yVDFQ@mail.gmail.com>
  2020-02-27  4:45                       ` Aubrey Li
@ 2020-02-28 23:55                       ` Tim Chen
  2020-03-03 14:59                         ` Li, Aubrey
  1 sibling, 1 reply; 84+ messages in thread
From: Tim Chen @ 2020-02-28 23:55 UTC (permalink / raw)
  To: Vineeth Remanan Pillai, Aubrey Li
  Cc: Aaron Lu, Julien Desfossez, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On 2/26/20 1:54 PM, Vineeth Remanan Pillai wrote:

> rq->curr being NULL can mean that the sibling is idle or forced idle.
> In both the cases, I think it makes sense to migrate a task so that it can
> compete with the other sibling for a chance to run. This function
> can_migrate_task actually only says if this task is eligible and
> later part of the code decides whether it is okay to migrate it
> based on factors like load and util and capacity. So I think its
> fine to declare the task as eligible if the dest core is running
> idle. Does this thinking make sense?
> 
> On our testing, it did not show much degradation in performance with
> this change. I am reworking the fix by removing the check for
> task_est_util. It doesn't seem to be valid to check for util to migrate
> the task.
> 

In Aaron's test case, there is a great load imbalance between the cores
where all the grp A tasks are and the other cores where the grp B tasks are
spread around.  Normally, the load balancer will move the grp A tasks.

Aubrey's can_migrate_task patch prevented the load balancer from migrating
tasks if the core cookie on the target queue doesn't match.  The thought was
that migrating a task there would induce force idle and reduce cpu utilization.
That kept all the grp A tasks from getting migrated and kept the imbalance
indefinitely in Aaron's test case.

Perhaps we should also look at the load imbalance between the src rq and
target rq.  If the imbalance is big (say two full cpu bound tasks' worth
of load), we should migrate anyway despite the cookie mismatch.  We are willing
to pay a bit of force idle in exchange for balancing the load out more.
I think Aubrey's patch on can_migrate_task should be more friendly to
Aaron's test scenario if such logic is incorporated.

In Vineeth's fix, we only look at the currently running task's weight in the
src and dst rq.  Perhaps the load on the src and dst rq needs to be considered
to prevent too great an imbalance between the run queues?

Tim

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-02-28 23:55                       ` Tim Chen
@ 2020-03-03 14:59                         ` Li, Aubrey
  2020-03-03 23:54                           ` Li, Aubrey
  0 siblings, 1 reply; 84+ messages in thread
From: Li, Aubrey @ 2020-03-03 14:59 UTC (permalink / raw)
  To: Tim Chen, Vineeth Remanan Pillai, Aubrey Li
  Cc: Aaron Lu, Julien Desfossez, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On 2020/2/29 7:55, Tim Chen wrote:
> On 2/26/20 1:54 PM, Vineeth Remanan Pillai wrote:
> 
>> rq->curr being NULL can mean that the sibling is idle or forced idle.
>> In both the cases, I think it makes sense to migrate a task so that it can
>> compete with the other sibling for a chance to run. This function
>> can_migrate_task actually only says if this task is eligible and
>> later part of the code decides whether it is okay to migrate it
>> based on factors like load and util and capacity. So I think its
>> fine to declare the task as eligible if the dest core is running
>> idle. Does this thinking make sense?
>>
>> On our testing, it did not show much degradation in performance with
>> this change. I am reworking the fix by removing the check for
>> task_est_util. It doesn't seem to be valid to check for util to migrate
>> the task.
>>
> 
> In Aaron's test case, there is a great imbalance in the load on one core
> where all the grp A tasks are vs the other cores where the grp B tasks are
> spread around.  Normally, load balancer will move the tasks for grp A.
> 
> Aubrey's can_migrate_task patch prevented the load balancer to migrate tasks if the core
> cookie on the target queue don't match.  The thought was it will induce
> force idle and reduces cpu utilization if we migrate task to it.
> That kept all the grp A tasks from getting migrated and kept the imbalance
> indefinitely in Aaron's test case.
> 
> Perhaps we should also look at the load imbalance between the src rq and
> target rq.  If the imbalance is big (say two full cpu bound tasks worth
> of load), we should migrate anyway despite the cookie mismatch.  We are willing
> to pay a bit for the force idle by balancing the load out more.
> I think Aubrey's patch on can_migrate_task should be more friendly to
> Aaron's test scenario if such logic is incorporated.
> 
> In Vinnet's fix, we only look at the currently running task's weight in
> src and dst rq.  Perhaps the load on the src and dst rq needs to be considered
> to prevent too great an imbalance between the run queues?

We are trying to migrate a task, can we just use cfs.h_nr_running? This signal
is used to find the busiest run queue as well.

Thanks,
-Aubrey


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-03-03 14:59                         ` Li, Aubrey
@ 2020-03-03 23:54                           ` Li, Aubrey
  2020-03-05  4:33                             ` Aaron Lu
  0 siblings, 1 reply; 84+ messages in thread
From: Li, Aubrey @ 2020-03-03 23:54 UTC (permalink / raw)
  To: Tim Chen, Vineeth Remanan Pillai, Aubrey Li
  Cc: Aaron Lu, Julien Desfossez, Nishanth Aravamudan, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On 2020/3/3 22:59, Li, Aubrey wrote:
> On 2020/2/29 7:55, Tim Chen wrote:
>> On 2/26/20 1:54 PM, Vineeth Remanan Pillai wrote:
>>
>>> rq->curr being NULL can mean that the sibling is idle or forced idle.
>>> In both the cases, I think it makes sense to migrate a task so that it can
>>> compete with the other sibling for a chance to run. This function
>>> can_migrate_task actually only says if this task is eligible and
>>> later part of the code decides whether it is okay to migrate it
>>> based on factors like load and util and capacity. So I think its
>>> fine to declare the task as eligible if the dest core is running
>>> idle. Does this thinking make sense?
>>>
>>> On our testing, it did not show much degradation in performance with
>>> this change. I am reworking the fix by removing the check for
>>> task_est_util. It doesn't seem to be valid to check for util to migrate
>>> the task.
>>>
>>
>> In Aaron's test case, there is a great imbalance in the load on one core
>> where all the grp A tasks are vs the other cores where the grp B tasks are
>> spread around.  Normally, load balancer will move the tasks for grp A.
>>
>> Aubrey's can_migrate_task patch prevented the load balancer to migrate tasks if the core
>> cookie on the target queue don't match.  The thought was it will induce
>> force idle and reduces cpu utilization if we migrate task to it.
>> That kept all the grp A tasks from getting migrated and kept the imbalance
>> indefinitely in Aaron's test case.
>>
>> Perhaps we should also look at the load imbalance between the src rq and
>> target rq.  If the imbalance is big (say two full cpu bound tasks worth
>> of load), we should migrate anyway despite the cookie mismatch.  We are willing
>> to pay a bit for the force idle by balancing the load out more.
>> I think Aubrey's patch on can_migrate_task should be more friendly to
>> Aaron's test scenario if such logic is incorporated.
>>
>> In Vinnet's fix, we only look at the currently running task's weight in
>> src and dst rq.  Perhaps the load on the src and dst rq needs to be considered
>> to prevent too great an imbalance between the run queues?
> 
> We are trying to migrate a task, can we just use cfs.h_nr_running? This signal
> is used to find the busiest run queue as well.

How about this one? The cgroup weight issue seems fixed on my side.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f42ceec..90024cf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1767,6 +1767,8 @@ static void task_numa_compare(struct task_numa_env *env,
 	rcu_read_unlock();
 }
 
+static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p);
+
 static void task_numa_find_cpu(struct task_numa_env *env,
 				long taskimp, long groupimp)
 {
@@ -5650,6 +5652,44 @@ static struct sched_group *
 find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		  int this_cpu, int sd_flag);
 
+#ifdef CONFIG_SCHED_CORE
+static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	struct rq *src_rq = task_rq(p);
+	bool idle_core = true;
+	int cpu;
+
+	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
+	if (!sched_core_enabled(rq))
+		return true;
+
+	if (rq->core->core_cookie == p->core_cookie)
+		return true;
+
+	for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
+		if (!available_idle_cpu(cpu)) {
+			idle_core = false;
+			break;
+		}
+	}
+	/*
+	 * A CPU in an idle core is always the best choice for tasks with
+	 * cookies.
+	 */
+	if (idle_core)
+		return true;
+
+	/*
+	 * Ignore cookie match if there is a big imbalance between the src rq
+	 * and dst rq.
+	 */
+	if ((src_rq->cfs.h_nr_running - rq->cfs.h_nr_running) > 1)
+		return true;
+
+	return false;
+}
+#endif
+
 /*
  * find_idlest_group_cpu - find the idlest CPU among the CPUs in the group.
  */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7ae6858..8c607e9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1061,28 +1061,6 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 	return &rq->__lock;
 }
 
-static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
-{
-	bool idle_core = true;
-	int cpu;
-
-	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
-	if (!sched_core_enabled(rq))
-		return true;
-
-	for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
-		if (!available_idle_cpu(cpu)) {
-			idle_core = false;
-			break;
-		}
-	}
-	/*
-	 * A CPU in an idle core is always the best choice for tasks with
-	 * cookies.
-	 */
-	return idle_core || rq->core->core_cookie == p->core_cookie;
-}
-
 extern void queue_core_balance(struct rq *rq);
 
 void sched_core_add(struct rq *rq, struct task_struct *p);

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-03-03 23:54                           ` Li, Aubrey
@ 2020-03-05  4:33                             ` Aaron Lu
  2020-03-05  6:10                               ` Li, Aubrey
  0 siblings, 1 reply; 84+ messages in thread
From: Aaron Lu @ 2020-03-05  4:33 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Tim Chen, Vineeth Remanan Pillai, Aubrey Li, Julien Desfossez,
	Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Wed, Mar 04, 2020 at 07:54:39AM +0800, Li, Aubrey wrote:
> On 2020/3/3 22:59, Li, Aubrey wrote:
> > On 2020/2/29 7:55, Tim Chen wrote:
...
> >> In Vinnet's fix, we only look at the currently running task's weight in
> >> src and dst rq.  Perhaps the load on the src and dst rq needs to be considered
> >> to prevent too great an imbalance between the run queues?
> > 
> > We are trying to migrate a task, can we just use cfs.h_nr_running? This signal
> > is used to find the busiest run queue as well.
> 
> How about this one? the cgroup weight issue seems fixed on my side.

It doesn't apply on top of your coresched_v4-v5.5.2 branch, so I
manually applied it. Not sure if I missed something.

cgA is now getting 4 cpus in 2 cores. Better, but not back to normal yet...

Thanks,
Aaron
 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f42ceec..90024cf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1767,6 +1767,8 @@ static void task_numa_compare(struct task_numa_env *env,
>  	rcu_read_unlock();
>  }
>  
> +static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p);
> +
>  static void task_numa_find_cpu(struct task_numa_env *env,
>  				long taskimp, long groupimp)
>  {
> @@ -5650,6 +5652,44 @@ static struct sched_group *
>  find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>  		  int this_cpu, int sd_flag);
>  
> +#ifdef CONFIG_SCHED_CORE
> +static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
> +{
> +	struct rq *src_rq = task_rq(p);
> +	bool idle_core = true;
> +	int cpu;
> +
> +	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
> +	if (!sched_core_enabled(rq))
> +		return true;
> +
> +	if (rq->core->core_cookie == p->core_cookie)
> +		return true;
> +
> +	for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
> +		if (!available_idle_cpu(cpu)) {
> +			idle_core = false;
> +			break;
> +		}
> +	}
> +	/*
> +	 * A CPU in an idle core is always the best choice for tasks with
> +	 * cookies.
> +	 */
> +	if (idle_core)
> +		return true;
> +
> +	/*
> +	 * Ignore cookie match if there is a big imbalance between the src rq
> +	 * and dst rq.
> +	 */
> +	if ((src_rq->cfs.h_nr_running - rq->cfs.h_nr_running) > 1)
> +		return true;
> +
> +	return false;
> +}
> +#endif
> +
>  /*
>   * find_idlest_group_cpu - find the idlest CPU among the CPUs in the group.
>   */
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 7ae6858..8c607e9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1061,28 +1061,6 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
>  	return &rq->__lock;
>  }
>  
> -static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
> -{
> -	bool idle_core = true;
> -	int cpu;
> -
> -	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
> -	if (!sched_core_enabled(rq))
> -		return true;
> -
> -	for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
> -		if (!available_idle_cpu(cpu)) {
> -			idle_core = false;
> -			break;
> -		}
> -	}
> -	/*
> -	 * A CPU in an idle core is always the best choice for tasks with
> -	 * cookies.
> -	 */
> -	return idle_core || rq->core->core_cookie == p->core_cookie;
> -}
> -
>  extern void queue_core_balance(struct rq *rq);
>  
>  void sched_core_add(struct rq *rq, struct task_struct *p);

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-03-05  4:33                             ` Aaron Lu
@ 2020-03-05  6:10                               ` Li, Aubrey
  2020-03-05  8:52                                 ` Aaron Lu
  0 siblings, 1 reply; 84+ messages in thread
From: Li, Aubrey @ 2020-03-05  6:10 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Tim Chen, Vineeth Remanan Pillai, Aubrey Li, Julien Desfossez,
	Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On 2020/3/5 12:33, Aaron Lu wrote:
> On Wed, Mar 04, 2020 at 07:54:39AM +0800, Li, Aubrey wrote:
>> On 2020/3/3 22:59, Li, Aubrey wrote:
>>> On 2020/2/29 7:55, Tim Chen wrote:
> ...
>>>> In Vinnet's fix, we only look at the currently running task's weight in
>>>> src and dst rq.  Perhaps the load on the src and dst rq needs to be considered
>>>> to prevent too great an imbalance between the run queues?
>>>
>>> We are trying to migrate a task, can we just use cfs.h_nr_running? This signal
>>> is used to find the busiest run queue as well.
>>
>> How about this one? the cgroup weight issue seems fixed on my side.
> 
> It doesn't apply on top of your coresched_v4-v5.5.2 branch, so I
> manually allied it. Not sure if I missed something.

Here is a rebase version on coresched_v5 Vineeth released this morning:
https://github.com/aubreyli/linux/tree/coresched_V5-v5.5.y-rc1

> 
> It's now getting 4 cpus in 2 cores. Better, but not back to normal yet..

I always saw higher weight tasks getting 8 cpus in 4 cores on my side.
Are you still running 8+16 sysbench cpu threads? 

I replicated your setup: a cpuset with 8 cores/16 threads, 8 sysbench (cpu mode)
threads with cpu.shares=10240 and 16 sysbench threads with cpu.shares=2, and
here is the data on my side.

				weight(10240)		weight(2)
coresched disabled		324.23(eps)		356.43(eps)
coresched enabled		340.74(eps)		311.62(eps)

It seems the higher weight tasks win this time and the lower weight tasks have
a ~15% regression (not a big deal?). Did you see anything worse?

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-03-05  6:10                               ` Li, Aubrey
@ 2020-03-05  8:52                                 ` Aaron Lu
  0 siblings, 0 replies; 84+ messages in thread
From: Aaron Lu @ 2020-03-05  8:52 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Tim Chen, Vineeth Remanan Pillai, Aubrey Li, Julien Desfossez,
	Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Thu, Mar 05, 2020 at 02:10:36PM +0800, Li, Aubrey wrote:
> On 2020/3/5 12:33, Aaron Lu wrote:
> > On Wed, Mar 04, 2020 at 07:54:39AM +0800, Li, Aubrey wrote:
> >> On 2020/3/3 22:59, Li, Aubrey wrote:
> >>> On 2020/2/29 7:55, Tim Chen wrote:
> > ...
> >>>> In Vinnet's fix, we only look at the currently running task's weight in
> >>>> src and dst rq.  Perhaps the load on the src and dst rq needs to be considered
> >>>> to prevent too great an imbalance between the run queues?
> >>>
> >>> We are trying to migrate a task, can we just use cfs.h_nr_running? This signal
> >>> is used to find the busiest run queue as well.
> >>
> >> How about this one? the cgroup weight issue seems fixed on my side.
> > 
> > It doesn't apply on top of your coresched_v4-v5.5.2 branch, so I
> > manually allied it. Not sure if I missed something.
> 
> Here is a rebase version on coresched_v5 Vineeth released this morning:
> https://github.com/aubreyli/linux/tree/coresched_V5-v5.5.y-rc1
> 
> > 
> > It's now getting 4 cpus in 2 cores. Better, but not back to normal yet..
> 
> I always saw higher weight tasks getting 8 cpus in 4 cores on my side.
> Are you still running 8+16 sysbench cpu threads? 

I used the wrong workload for the high weight cgroup. After using sysbench
for the high weight cgroup, I also see the problem fixed. Sorry for the noise.

P.S. I used redis-server as the high weight workload this morning; it has
some softirq processing and I guess that made things not behave as expected.

Thanks,
Aaron

> 
> I replicated your setup, the cpuset with 8cores 16threads, cpu mode 8 sysbench
> threads with cpu.shares=10240, 16 sysbench threads with cpu.shares=2, and here
> is the data on my side.
> 
> 				weight(10240)		weight(2)
> coresched disabled		324.23(eps)		356.43(eps)
> coresched enabled		340.74(eps)		311.62(eps)
> 
> It seems higher weight tasks win this time and lower weight tasks have ~15%
> regression(not big deal?), did you see anything worse?
> 
> Thanks,
> -Aubrey

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-02-28  2:54                       ` Aaron Lu
@ 2020-03-05 13:45                         ` Aubrey Li
  2020-03-06  2:41                           ` Aaron Lu
  0 siblings, 1 reply; 84+ messages in thread
From: Aubrey Li @ 2020-03-05 13:45 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Phil Auld, Vineeth Remanan Pillai, Tim Chen, Julien Desfossez,
	Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Fri, Feb 28, 2020 at 10:54 AM Aaron Lu <aaron.lwe@gmail.com> wrote:
>
> When the core wide weight is somewhat balanced, yes I definitely agree.
> But when core wide weight mismatch a lot, I'm not so sure since if these
> high weight task is spread among cores, with the feature of core
> scheduling, these high weight tasks can get better performance.

It depends.

Say TaskA (cookie 1) and TaskB (cookie 1) have high weight, and
TaskC (cookie 2) and TaskD (cookie 2) have low weight.
And we have two cores / 4 CPUs.

If we dispatch
- TaskA and TaskB on Core0,
- TaskC and TaskD on Core1,

with coresched enabled, all 4 tasks can run all the time.

But if we dispatch
- TaskA on Core0.CPU0, TaskB on Core1.CPU2,
- TaskC on Core0.CPU1, TaskD on Core1.CPU3,

with coresched enabled, when TaskC is running, TaskA will be forced
off CPU and replaced with a forced idle thread.

Things get worse if TaskA and TaskB share some data and can benefit
from the core-level cache.

> So this appeared to me like a question of: is it desirable to protect/enhance
> high weight task performance in the presence of core scheduling?

This sounds to me like a policy vs. mechanism question. Do you have any idea
how to spread the high weight tasks among the cores with coresched enabled?

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-03-05 13:45                         ` Aubrey Li
@ 2020-03-06  2:41                           ` Aaron Lu
  2020-03-06 18:06                             ` Tim Chen
  0 siblings, 1 reply; 84+ messages in thread
From: Aaron Lu @ 2020-03-06  2:41 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Phil Auld, Vineeth Remanan Pillai, Tim Chen, Julien Desfossez,
	Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Thu, Mar 05, 2020 at 09:45:15PM +0800, Aubrey Li wrote:
> On Fri, Feb 28, 2020 at 10:54 AM Aaron Lu <aaron.lwe@gmail.com> wrote:
> >
> > When the core wide weight is somewhat balanced, yes I definitely agree.
> > But when core wide weight mismatch a lot, I'm not so sure since if these
> > high weight task is spread among cores, with the feature of core
> > scheduling, these high weight tasks can get better performance.
> 
> It depends.
> 
> Say TaskA(cookie 1) and TaskB(cookie1) has high weight,
> TaskC(cookie 2) and Task D(cookie 2) has low weight.
> And we have two cores 4 CPU.
> 
> If we dispatch
> - TaskA and TaskB on Core0,
> - TaskC and TaskD on Core1,
> 
> with coresched enabled, all 4 tasks can run all the time.

Although all tasks get CPU, TaskA and TaskB are competing for hardware
resources and will run slower.

> But if we dispatch
> - TaskA on Core0.CPU0, TaskB on Core1.CPU2,
> - TaskC on Core0.CPU1, TaskB on Core1.CPU3,
> 
> with coresched enabled, when TaskC is running, TaskA will be forced
> off CPU and replaced with a forced idle thread.

Not likely to happen, since TaskA's and TaskB's shares will normally be a
lot higher, to make sure they get the CPU most of the time.

> 
> Things get worse if TaskA and TaskB share some data and can get
> benefit from the core level cache.

That's a good point and hard to argue.

I'm mostly considering colocating redis-server (the main workload) with
other compute intensive workloads. redis-server can be idle most of the
time but needs every hardware resource when it runs to meet its latency
and throughput requirements. Tests on my side show redis-server's
throughput can be about 30% lower when two redis-servers run on the
same core (throughput is about 80000 when it runs exclusively on a core
vs. about 56000 when it runs with the sibling thread busy), IIRC.

So my use case here is that I don't really care about the low weight task's
performance when the high weight task demands CPU. I understand that there
will be other use cases that also care about the low weight task's
performance. So what I have done is make the two tasks' weight
difference as large as possible to signal that the low weight task is
not important. Maybe I can also try to tag the low weight tasks as SCHED_IDLE
and then we can happily sacrifice the SCHED_IDLE tasks' performance?
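
A minimal userspace sketch of the SCHED_IDLE tagging idea, using only the
standard sched_setscheduler() interface (the wrapper program itself is an
assumption for illustration, not part of the coresched series):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	/* sched_priority must be 0 for SCHED_IDLE. */
	struct sched_param sp = { .sched_priority = 0 };

	if (argc < 2) {
		fprintf(stderr, "usage: %s <noise-workload> [args...]\n", argv[0]);
		return 1;
	}
	if (sched_setscheduler(0, SCHED_IDLE, &sp)) {
		perror("sched_setscheduler");
		return 1;
	}
	/* The exec'd noise workload keeps the SCHED_IDLE policy. */
	execvp(argv[1], &argv[1]);
	perror("execvp");
	return 1;
}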
 
> > So this appeared to me like a question of: is it desirable to protect/enhance
> > high weight task performance in the presence of core scheduling?
> 
> This sounds to me a policy VS mechanism question. Do you have any idea
> how to spread high weight task among the cores with coresched enabled?

Yes, I would like to get us on the same page about the expected behaviour
before jumping to the implementation details. As for how to achieve
that: I'm thinking about making the load core-wide balanced, so that the high
weight tasks spread onto different cores. This isn't just about load
balance; the initial task placement will also need to be considered, of
course, if the high weight task only runs for a small period.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-03-06  2:41                           ` Aaron Lu
@ 2020-03-06 18:06                             ` Tim Chen
  2020-03-06 18:33                               ` Phil Auld
  0 siblings, 1 reply; 84+ messages in thread
From: Tim Chen @ 2020-03-06 18:06 UTC (permalink / raw)
  To: Aaron Lu, Aubrey Li
  Cc: Phil Auld, Vineeth Remanan Pillai, Julien Desfossez,
	Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On 3/5/20 6:41 PM, Aaron Lu wrote:

>>> So this appeared to me like a question of: is it desirable to protect/enhance
>>> high weight task performance in the presence of core scheduling?
>>
>> This sounds to me a policy VS mechanism question. Do you have any idea
>> how to spread high weight task among the cores with coresched enabled?
> 
> Yes I would like to get us on the same page of the expected behaviour
> before jumping to the implementation details. As for how to achieve
> that: I'm thinking about to make core wide load balanced and then high
> weight task shall spread on different cores. This isn't just about load
> balance, the initial task placement will also need to be considered of
> course if the high weight task only runs a small period.
> 

I am wondering why this is not happening:  

When the low weight task group has exceeded its cfs allocation during a cfs
period, the task group should be throttled.  In that case, the CPU cores that
the low weight task group occupies will become idle, allowing load balance
from the overloaded CPUs so the high weight task group can migrate over.

Tim

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-03-06 18:06                             ` Tim Chen
@ 2020-03-06 18:33                               ` Phil Auld
  2020-03-06 21:44                                 ` Tim Chen
  0 siblings, 1 reply; 84+ messages in thread
From: Phil Auld @ 2020-03-06 18:33 UTC (permalink / raw)
  To: Tim Chen
  Cc: Aaron Lu, Aubrey Li, Vineeth Remanan Pillai, Julien Desfossez,
	Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Fri, Mar 06, 2020 at 10:06:16AM -0800 Tim Chen wrote:
> On 3/5/20 6:41 PM, Aaron Lu wrote:
> 
> >>> So this appeared to me like a question of: is it desirable to protect/enhance
> >>> high weight task performance in the presence of core scheduling?
> >>
> >> This sounds to me a policy VS mechanism question. Do you have any idea
> >> how to spread high weight task among the cores with coresched enabled?
> > 
> > Yes I would like to get us on the same page of the expected behaviour
> > before jumping to the implementation details. As for how to achieve
> > that: I'm thinking about to make core wide load balanced and then high
> > weight task shall spread on different cores. This isn't just about load
> > balance, the initial task placement will also need to be considered of
> > course if the high weight task only runs a small period.
> > 
> 
> I am wondering why this is not happening:  
> 
> When the low weight task group has exceeded its cfs allocation during a cfs period, the task group
> should be throttled.  In that case, the CPU cores that the low
> weight task group occupies will become idle, and allow load balance from the
> overloaded CPUs for the high weight task group to migrate over.  
> 

cpu.shares is not quota. I think it will only get throttled if it has, and
exceeds, a quota.  Shares are supposed to be used to weight contention
without providing a hard limit.
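
To make the distinction concrete, a small illustrative snippet showing the
cgroup v1 knobs involved (the mount point and the "noise" group name are
assumptions for the example; cpu.shares=2 matches the value used in the test
case discussed above):

#include <stdio.h>

static void set_knob(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return;
	}
	fprintf(f, "%s\n", val);
	fclose(f);
}

int main(void)
{
	/* cpu.shares: relative weight, work conserving, never throttles. */
	set_knob("/sys/fs/cgroup/cpu/noise/cpu.shares", "2");

	/*
	 * cpu.cfs_quota_us / cpu.cfs_period_us: a hard cap that does
	 * throttle, e.g. at most 10ms of CPU per 100ms period.
	 */
	set_knob("/sys/fs/cgroup/cpu/noise/cpu.cfs_period_us", "100000");
	set_knob("/sys/fs/cgroup/cpu/noise/cpu.cfs_quota_us", "10000");

	return 0;
}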


Cheers,
Phil


> Tim
> 

-- 


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-03-06 18:33                               ` Phil Auld
@ 2020-03-06 21:44                                 ` Tim Chen
  2020-03-07  3:13                                   ` Aaron Lu
  0 siblings, 1 reply; 84+ messages in thread
From: Tim Chen @ 2020-03-06 21:44 UTC (permalink / raw)
  To: Phil Auld
  Cc: Aaron Lu, Aubrey Li, Vineeth Remanan Pillai, Julien Desfossez,
	Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On 3/6/20 10:33 AM, Phil Auld wrote:
> On Fri, Mar 06, 2020 at 10:06:16AM -0800 Tim Chen wrote:
>> On 3/5/20 6:41 PM, Aaron Lu wrote:
>>
>>>>> So this appeared to me like a question of: is it desirable to protect/enhance
>>>>> high weight task performance in the presence of core scheduling?
>>>>
>>>> This sounds to me a policy VS mechanism question. Do you have any idea
>>>> how to spread high weight task among the cores with coresched enabled?
>>>
>>> Yes I would like to get us on the same page of the expected behaviour
>>> before jumping to the implementation details. As for how to achieve
>>> that: I'm thinking about to make core wide load balanced and then high
>>> weight task shall spread on different cores. This isn't just about load
>>> balance, the initial task placement will also need to be considered of
>>> course if the high weight task only runs a small period.
>>>
>>
>> I am wondering why this is not happening:  
>>
>> When the low weight task group has exceeded its cfs allocation during a cfs period, the task group
>> should be throttled.  In that case, the CPU cores that the low
>> weight task group occupies will become idle, and allow load balance from the
>> overloaded CPUs for the high weight task group to migrate over.  
>>
> 
> cpu.shares is not quota. I think it will only get throttled if it has and 
> exceeds quota.  Shares are supposed to be used to help weight contention
> without providing a hard limit. 
> 

Ah yes.  cpu.quota is not set in Aaron's test case.  

That said, if the time consumed is getting out of whack with the cpu shares
assigned, I wonder whether we can leverage the quota mechanism to throttle
those cgroups that have overused their share of cpu.  Most of the stats and
machinery needed are already in the throttling mechanism.

I am hoping that allowing task migration with task group mismatch
under large load imbalance between CPUs will be good enough.

Tim

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-03-06 21:44                                 ` Tim Chen
@ 2020-03-07  3:13                                   ` Aaron Lu
  0 siblings, 0 replies; 84+ messages in thread
From: Aaron Lu @ 2020-03-07  3:13 UTC (permalink / raw)
  To: Tim Chen
  Cc: Phil Auld, Aubrey Li, Vineeth Remanan Pillai, Julien Desfossez,
	Nishanth Aravamudan, Peter Zijlstra, Ingo Molnar,
	Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr,
	Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini

On Fri, Mar 06, 2020 at 01:44:08PM -0800, Tim Chen wrote:
> On 3/6/20 10:33 AM, Phil Auld wrote:
> > On Fri, Mar 06, 2020 at 10:06:16AM -0800 Tim Chen wrote:
> >> On 3/5/20 6:41 PM, Aaron Lu wrote:
> >>
> >>>>> So this appeared to me like a question of: is it desirable to protect/enhance
> >>>>> high weight task performance in the presence of core scheduling?
> >>>>
> >>>> This sounds to me a policy VS mechanism question. Do you have any idea
> >>>> how to spread high weight task among the cores with coresched enabled?
> >>>
> >>> Yes I would like to get us on the same page of the expected behaviour
> >>> before jumping to the implementation details. As for how to achieve
> >>> that: I'm thinking about to make core wide load balanced and then high
> >>> weight task shall spread on different cores. This isn't just about load
> >>> balance, the initial task placement will also need to be considered of
> >>> course if the high weight task only runs a small period.
> >>>
> >>
> >> I am wondering why this is not happening:  
> >>
> >> When the low weight task group has exceeded its cfs allocation during a cfs period, the task group
> >> should be throttled.  In that case, the CPU cores that the low
> >> weight task group occupies will become idle, and allow load balance from the
> >> overloaded CPUs for the high weight task group to migrate over.  
> >>
> > 
> > cpu.shares is not quota. I think it will only get throttled if it has and 
> > exceeds quota.  Shares are supposed to be used to help weight contention
> > without providing a hard limit. 
> > 
> 
> Ah yes.  cpu.quota is not set in Aaron's test case.  
> 
> That said, I wonder if the time consumed is getting out of whack with the 
> cpu shares assigned, we can leverage the quota mechanism to throttle
> those cgroup that have overused their shares of cpu.  Most of the stats and machinery
> needed are already in the throttling mechanism.  

cpu.quota is not work conserving IIUC; it will reduce the noise workload's
performance even when the real workload has no demand for CPU.

Also, while not exceeding its quota, the noise workload can still hurt the
real workload's performance. To protect the real workload from noise,
cpu.shares and SCHED_IDLE seem appropriate, but the implementation may not be
enough as of now.

> 
> I am hoping that allowing task migration with task group mismatch
> under large load imbalance between CPUs will be good enough.

I also hope so :-)

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-02-21 23:20                     ` Julien Desfossez
@ 2020-03-17  0:55                       ` Joel Fernandes
  2020-03-17 19:07                         ` Tim Chen
  0 siblings, 1 reply; 84+ messages in thread
From: Joel Fernandes @ 2020-03-17  0:55 UTC (permalink / raw)
  To: Julien Desfossez, Peter Zijlstra
  Cc: Vineeth Remanan Pillai, Aubrey Li, Tim Chen, Nishanth Aravamudan,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini

Hi Julien, Peter, all,

On Fri, Feb 21, 2020 at 06:20:57PM -0500, Julien Desfossez wrote:
> On 18-Feb-2020 04:58:02 PM, Vineeth Remanan Pillai wrote:
> > > Yes, this makes sense, patch updated at here, I put your name there if
> > > you don't mind.
> > > https://github.com/aubreyli/linux/tree/coresched_v4-v5.5.2-rc2
> > >
> > > Thanks Aubrey!
> 
> Just a quick note, I ran a very cpu-intensive benchmark (9x12 vcpus VMs
> running linpack), all affined to an 18 cores NUMA node (36 hardware
> threads). Each VM is running in its own cgroup/tag with core scheduling
> enabled. We know it already performed much better than nosmt, so for
> this case, I measured various co-scheduling statistics:
> - how much time the process spends co-scheduled with idle, a compatible
>   or an incompatible task
> - how long does the process spends running in a inefficient
>   configuration (more than 1 thread running alone on a core)
> 
> And I am very happy to report than even though the 9 VMs were configured
> to float on the whole NUMA node, the scheduler / load-balancer did a
> very good job at keeping an efficient configuration:
> 
> Process 10667 (qemu-system-x86), 10 seconds trace:
>   - total runtime: 46451472309 ns,
>   - local neighbors (total: 45713285084 ns, 98.411 % of process runtime):
>   - idle neighbors (total: 484054061 ns, 1.042 % of process runtime):
>   - foreign neighbors (total: 4191002 ns, 0.009 % of process runtime):
>   - unknown neighbors  (total: 92042503 ns, 0.198 % of process runtime)
>   - inefficient periods (total: 464832 ns, 0.001 % of process runtime):
>     - number of periods: 48
>     - min period duration: 1424 ns
>     - max period duration: 116988 ns
>     - average period duration: 9684.000 ns
>     - stdev: 19282.130
> 
> I thought you would enjoy seeing this :-)

Looks quite interesting. We are trying to apply this work to ChromeOS. What we
want to do is selectively mark tasks, instead of grouping sets of trusted
tasks. I have a patch that adds a prctl which a task can call, and it works
well (a task calls the prctl and gets a cookie which gives it a dedicated core).

However, I have the following questions; in particular, there are 4 scenarios
where I feel the current patches do not resolve MDS/L1TF. Would you guys
please share your thoughts?

1. HT1 is running either hostile guest or host code.
   HT2 is running an interrupt handler (victim).

   In this case I see there is a possible MDS issue between HT1 and HT2.

2. HT1 is executing hostile host code, and gets interrupted by a victim
   interrupt. HT2 is idle.

   In this case, I see there is a possible MDS issue between interrupt and
   the host code on the same HT1.

3. HT1 is executing hostile guest code, HT2 is executing a victim interrupt
   handler on the host.

   In this case, I see there is a possible L1TF issue between HT1 and HT2.
   This issue does not happen if HT1 is running host code, since the host
   kernel takes care of inverting PTE bits.

4. HT1 is idle, and HT2 is running a victim process. Now HT1 starts running
   hostile code on guest or host. HT2 is being forced idle. However, there is
   an overlap between HT1 starting to execute hostile code and HT2's victim
   process getting scheduled out.
   Speaking to Vineeth, we discussed an idea to monitor the core_sched_seq
   counter of the sibling being idled to detect that it is now idle.
   However we discussed today that looking at this data, it is not really an
   issue since it is such a small window.

My concern now is cases 1 and 2, for which there does not seem to be a good
solution short of disabling interrupts. For 3, we could still possibly do
something on the guest side, such as using shadow page tables. Any thoughts on
all this?

thanks,

- Joel



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-03-17  0:55                       ` Joel Fernandes
@ 2020-03-17 19:07                         ` Tim Chen
  2020-03-17 20:18                           ` Tim Chen
                                             ` (2 more replies)
  0 siblings, 3 replies; 84+ messages in thread
From: Tim Chen @ 2020-03-17 19:07 UTC (permalink / raw)
  To: Joel Fernandes, Julien Desfossez, Peter Zijlstra
  Cc: Vineeth Remanan Pillai, Aubrey Li, Nishanth Aravamudan,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, Luck, Tony

Joel,

> 
> Looks quite interesting. We are trying apply this work to ChromeOS. What we
> want to do is selectively marking tasks, instead of grouping sets of trusted
> tasks. I have a patch that adds a prctl which a task can call, and it works
> well (task calls prctl and gets a cookie which gives it a dedicated core).
> 
> However, I have the following questions, in particular there are 4 scenarios
> where I feel the current patches do not resolve MDS/L1TF, would you guys
> please share your thoughts?
> 
> 1. HT1 is running either hostile guest or host code.
>    HT2 is running an interrupt handler (victim).
> 
>    In this case I see there is a possible MDS issue between HT1 and HT2.

Core scheduling mitigates userspace-to-userspace attacks via MDS between the HTs.
It does not prevent the userspace-to-kernel-space attack.  That will
have to be mitigated via other means, e.g. redirecting interrupts to a core
that doesn't run potentially unsafe code.
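As a rough illustration of that kind of redirection (a sketch only, assuming
an IRQ whose affinity is actually changeable, i.e. not a per-CPU or managed
interrupt; the IRQ number and CPU list are placeholders):

#include <stdio.h>

int main(void)
{
    const int irq = 42;          /* placeholder IRQ number */
    const char *cpus = "0-1";    /* CPUs considered "safe" for this IRQ */
    char path[64];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity_list", irq);
    f = fopen(path, "w");
    if (!f) {
        perror(path);
        return 1;
    }
    if (fprintf(f, "%s\n", cpus) < 0)
        perror(path);
    fclose(f);
    return 0;
}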

> 
> 2. HT1 is executing hostile host code, and gets interrupted by a victim
>    interrupt. HT2 is idle.

Similar to above.

> 
>    In this case, I see there is a possible MDS issue between interrupt and
>    the host code on the same HT1.

The CPU buffers are cleared before returning to the hostile host code.  So
MDS shouldn't be an issue if the interrupt handler and the hostile code
run on the same HT thread.

> 
> 3. HT1 is executing hostile guest code, HT2 is executing a victim interrupt
>    handler on the host.
> 
>    In this case, I see there is a possible L1TF issue between HT1 and HT2.
>    This issue does not happen if HT1 is running host code, since the host
>    kernel takes care of inverting PTE bits.

The interrupt handler will be run with PTE inverted.  So I don't think
there's a leak via L1TF in this scenario.

> 
> 4. HT1 is idle, and HT2 is running a victim process. Now HT1 starts running
>    hostile code on guest or host. HT2 is being forced idle. However, there is
>    an overlap between HT1 starting to execute hostile code and HT2's victim
>    process getting scheduled out.
>    Speaking to Vineeth, we discussed an idea to monitor the core_sched_seq
>    counter of the sibling being idled to detect that it is now idle.
>    However we discussed today that looking at this data, it is not really an
>    issue since it is such a small window.
> 
> My concern is now cases 1, 2 to which there does not seem a good solution,
> short of disabling interrupts. For 3, we could still possibly do something on
> the guest side, such as using shadow page tables. Any thoughts on all this?
> 

+ Tony who may have more insights on L1TF and MDS.

Thanks.

Tim

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-03-17 19:07                         ` Tim Chen
@ 2020-03-17 20:18                           ` Tim Chen
  2020-03-18  1:10                             ` Joel Fernandes
  2020-03-17 21:17                           ` Thomas Gleixner
  2020-03-18  0:52                           ` Joel Fernandes
  2 siblings, 1 reply; 84+ messages in thread
From: Tim Chen @ 2020-03-17 20:18 UTC (permalink / raw)
  To: Joel Fernandes, Julien Desfossez, Peter Zijlstra
  Cc: Vineeth Remanan Pillai, Aubrey Li, Nishanth Aravamudan,
	Ingo Molnar, Thomas Gleixner, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, Luck, Tony



BTW Joel,

Did you guys have a chance to try out v5 core scheduler?

> - Joel(ChromeOS) found a deadlock and crash on PREEMPT kernel in the
>   coresched idle balance logic

We did some patches to fix a few stability issues in v4.  I wonder
if v5 still has the deadlock that you saw before?

Tim

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-03-17 19:07                         ` Tim Chen
  2020-03-17 20:18                           ` Tim Chen
@ 2020-03-17 21:17                           ` Thomas Gleixner
  2020-03-17 21:58                             ` Tim Chen
  2020-03-18  1:03                             ` Joel Fernandes
  2020-03-18  0:52                           ` Joel Fernandes
  2 siblings, 2 replies; 84+ messages in thread
From: Thomas Gleixner @ 2020-03-17 21:17 UTC (permalink / raw)
  To: Tim Chen, Joel Fernandes, Julien Desfossez, Peter Zijlstra
  Cc: Vineeth Remanan Pillai, Aubrey Li, Nishanth Aravamudan,
	Ingo Molnar, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, Luck, Tony

Tim,

Tim Chen <tim.c.chen@linux.intel.com> writes:
>> However, I have the following questions, in particular there are 4 scenarios
>> where I feel the current patches do not resolve MDS/L1TF, would you guys
>> please share your thoughts?
>> 
>> 1. HT1 is running either hostile guest or host code.
>>    HT2 is running an interrupt handler (victim).
>> 
>>    In this case I see there is a possible MDS issue between HT1 and HT2.
>
> Core scheduling mitigates the userspace to userspace attacks via MDS between the HT.
> It does not prevent the userspace to kernel space attack.  That will
> have to be mitigated via other means, e.g. redirecting interrupts to a core
> that don't run potentially unsafe code.

Which is in some cases simply impossible. Think multiqueue devices with
managed interrupts. You can't change the affinity of those. Neither can
you do that for the per cpu timer interrupt.

>> 2. HT1 is executing hostile host code, and gets interrupted by a victim
>>    interrupt. HT2 is idle.
>
> Similar to above.

No. It's the same HT so not similar at all.

>>    In this case, I see there is a possible MDS issue between interrupt and
>>    the host code on the same HT1.
>
> The cpu buffers are cleared before return to the hostile host code.  So
> MDS shouldn't be an issue if interrupt handler and hostile code
> runs on the same HT thread.

OTOH, that's mostly correct. Aside from the "shouldn't" wording:

  MDS _is_ no issue in this case when the full mitigation is enabled.

Assuming that I have no less information about MDS than you have :)

>> 3. HT1 is executing hostile guest code, HT2 is executing a victim interrupt
>>    handler on the host.
>> 
>>    In this case, I see there is a possible L1TF issue between HT1 and HT2.
>>    This issue does not happen if HT1 is running host code, since the host
>>    kernel takes care of inverting PTE bits.
>
> The interrupt handler will be run with PTE inverted.  So I don't think
> there's a leak via L1TF in this scenario.

How so?

Host memory is attackable, when one of the sibling SMT threads runs in
host OS (hypervisor) context and the other in guest context.

HT1 is in guest mode and attacking (has control over PTEs). HT2 is
running in host mode and executes an interrupt handler. The host PTE
inversion does not matter in this scenario at all.

So HT1 can very well see data which is brought into the shared L1 by
HT2.

The only way to mitigate that aside of disabling HT is disabling EPT.

>> 4. HT1 is idle, and HT2 is running a victim process. Now HT1 starts running
>>    hostile code on guest or host. HT2 is being forced idle. However, there is
>>    an overlap between HT1 starting to execute hostile code and HT2's victim
>>    process getting scheduled out.
>>    Speaking to Vineeth, we discussed an idea to monitor the core_sched_seq
>>    counter of the sibling being idled to detect that it is now idle.
>>    However we discussed today that looking at this data, it is not really an
>>    issue since it is such a small window.

If the victim HT is kicked out of execution with an IPI then the overlap
depends on the contexts:

        HT1 (attack)		HT2 (victim)

 A      idle -> user space      user space -> idle

 B      idle -> user space      guest -> idle

 C      idle -> guest           user space -> idle

 D      idle -> guest           guest -> idle

The IPI from HT1 brings HT2 immediately into the kernel when HT2 is in
host user mode or brings it immediately into VMEXIT when HT2 is in guest
mode.

#A On return from handling the IPI HT2 immediately reschedules to idle.
   To have an overlap the return to user space on HT1 must be faster.

#B Coming back from VMEXIT into schedule/idle might take slightly longer
   than #A.

#C Similar to #A, but reentering guest mode in HT1 after sending the IPI
   will probably take longer.

#D Similar to #C if you make the assumption that VMEXIT on HT2 and
   rescheduling into idle is not significantly slower than reaching
   VMENTER after sending the IPI.

In all cases the data exposed by a potential overlap shouldn't be that
interesting (e.g. scheduler state), but that obviously depends on what
the attacker is looking for.

But all of them are still problematic vs. interrupts / softinterrupts
which can happen on HT2 on the way to idle or while idling. i.e. #3 of
the original case list. #A and #B are only affected by MDS, #C and #D by
both MDS and L1TF (if EPT is in use).

>> My concern is now cases 1, 2 to which there does not seem a good solution,
>> short of disabling interrupts. For 3, we could still possibly do something on
>> the guest side, such as using shadow page tables. Any thoughts on all this?

#1 can be partially mitigated by changing interrupt affinities, which is
   not always possible and in the case of the local timer interrupt
   completely impossible. It's not only the timer interrupt itself, the
   timer callbacks which can run in the softirq on return from interrupt
   might be valuable attack surface depending on the nature of the
   callbacks, the random entropy timer just being a random example.

#2 is a non issue if MDS mitigation is on, i.e. buffers are flushed
   before returning to user space. It's pretty much a non SMT case,
   i.e. same CPU user to kernel attack.

#3 Can only be fully mitigated by disabling EPT.

#4 Assuming that my assumptions about transition times are correct, which
   I think they are, #4 is pretty much redirected to #1.

Hope that helps.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-03-17 21:17                           ` Thomas Gleixner
@ 2020-03-17 21:58                             ` Tim Chen
  2020-03-18  1:03                             ` Joel Fernandes
  1 sibling, 0 replies; 84+ messages in thread
From: Tim Chen @ 2020-03-17 21:58 UTC (permalink / raw)
  To: Thomas Gleixner, Joel Fernandes, Julien Desfossez, Peter Zijlstra
  Cc: Vineeth Remanan Pillai, Aubrey Li, Nishanth Aravamudan,
	Ingo Molnar, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, Luck, Tony



On 3/17/20 2:17 PM, Thomas Gleixner wrote:

>> The interrupt handler will be run with PTE inverted.  So I don't think
>> there's a leak via L1TF in this scenario.
> 
> How so?
> 
> Host memory is attackable, when one of the sibling SMT threads runs in
> host OS (hypervisor) context and the other in guest context.
> 
> HT1 is in guest mode and attacking (has control over PTEs). HT2 is
> running in host mode and executes an interrupt handler. The host PTE
> inversion does not matter in this scenario at all.
> 
> So HT1 can very well see data which is brought into the shared L1 by
> HT2.
> 
> The only way to mitigate that aside of disabling HT is disabling EPT.
> 

I had a brain lapse. Yes, PTE inversion mitigates attacks from malicious
user space code, not from a malicious guest.

Thanks for the correction.

Tim


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-03-17 19:07                         ` Tim Chen
  2020-03-17 20:18                           ` Tim Chen
  2020-03-17 21:17                           ` Thomas Gleixner
@ 2020-03-18  0:52                           ` Joel Fernandes
  2020-03-18 11:53                             ` Thomas Gleixner
  2 siblings, 1 reply; 84+ messages in thread
From: Joel Fernandes @ 2020-03-18  0:52 UTC (permalink / raw)
  To: Tim Chen
  Cc: Julien Desfossez, Peter Zijlstra, Vineeth Remanan Pillai,
	Aubrey Li, Nishanth Aravamudan, Ingo Molnar, Thomas Gleixner,
	Paul Turner, Linus Torvalds, Linux List Kernel Mailing,
	Dario Faggioli, Frédéric Weisbecker, Kees Cook,
	Greg Kerr, Phil Auld, Aaron Lu, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Luck, Tony

On Tue, Mar 17, 2020 at 3:07 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> Joel,
>
> >
> > Looks quite interesting. We are trying apply this work to ChromeOS. What we
> > want to do is selectively marking tasks, instead of grouping sets of trusted
> > tasks. I have a patch that adds a prctl which a task can call, and it works
> > well (task calls prctl and gets a cookie which gives it a dedicated core).
> >
> > However, I have the following questions, in particular there are 4 scenarios
> > where I feel the current patches do not resolve MDS/L1TF, would you guys
> > please share your thoughts?
> >
> > 1. HT1 is running either hostile guest or host code.
> >    HT2 is running an interrupt handler (victim).
> >
> >    In this case I see there is a possible MDS issue between HT1 and HT2.
>
> Core scheduling mitigates the userspace to userspace attacks via MDS between the HT.
> It does not prevent the userspace to kernel space attack.  That will
> have to be mitigated via other means, e.g. redirecting interrupts to a core
> that don't run potentially unsafe code.

We have only 2 cores (4 HTs) on many devices. It is not an option to
dedicate a core to only running trusted code, that would kill
performance. Another option is to designate a single HT of a
particular core to run both untrusted code and an interrupt handler --
but as Thomas pointed out, this does not work for per-CPU interrupts
or managed interrupts, and the softirqs that they trigger. But if we
just consider interrupts for which we can control the affinities (and
assuming that most interrupts can be controlled like that), then maybe
it will work? In the ChromeOS model, each untrusted task is in its own
domain (cookie). So untrusted tasks cannot benefit from parallelism
(in our case) anyway -- so it seems reasonable to run an affinable
interrupt and an untrusted task on a particular designated core.

(Just thinking out loud...).  Another option could be a patch that
Vineeth shared with me (that Peter experimentally wrote) where he
sends an IPI from an interrupt handler to a sibling running untrusted
guest code, which would result in it getting paused. I am hoping
something like this could work on the host side as well (not just for
guests). We could also set per-core state from the interrupted HT,
possibly IPI'ing the untrusted sibling if we have to. If the sibling runs
untrusted code *after* the other sibling's interrupt has already started,
then the schedule() loop on the untrusted sibling would spin, knowing
the other sibling has an interrupt in progress. The softirq is a real
problem though. Perhaps it can also set similar per-core state.
Thoughts?
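To make that per-core-state idea concrete, here is a very rough sketch (all
names, the hook points and the core_state[] indexing by topology_core_id(),
which assumes a single package, are assumptions for illustration, not the
actual patches):

#include <linux/atomic.h>
#include <linux/sched.h>
#include <linux/smp.h>
#include <linux/topology.h>

struct core_irq_state {
        atomic_t irq_nesting;   /* > 0: some HT of this core is in an IRQ */
};

static struct core_irq_state core_state[NR_CPUS];

static inline struct core_irq_state *this_core_state(void)
{
        /* callers run with preemption disabled (irq entry/exit, scheduler) */
        return &core_state[topology_core_id(smp_processor_id())];
}

/* assumed to be hooked into irq_enter()/irq_exit() on every HT */
void coresched_irq_enter(void)
{
        atomic_inc(&this_core_state()->irq_nesting);
}

void coresched_irq_exit(void)
{
        atomic_dec(&this_core_state()->irq_nesting);
}

/* assumed to be called before switching an HT to an untrusted task:
 * spin until no sibling on this core is inside an interrupt handler */
void coresched_wait_for_sibling_irq(void)
{
        while (atomic_read(&this_core_state()->irq_nesting) > 0)
                cpu_relax();
}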

> > 2. HT1 is executing hostile host code, and gets interrupted by a victim
> >    interrupt. HT2 is idle.
> >    In this case, I see there is a possible MDS issue between interrupt and
> >    the host code on the same HT1.
>
> The cpu buffers are cleared before return to the hostile host code.  So
> MDS shouldn't be an issue if interrupt handler and hostile code
> runs on the same HT thread.

Got it, agreed this is not an issue.

> > 3. HT1 is executing hostile guest code, HT2 is executing a victim interrupt
> >    handler on the host.
> >
> >    In this case, I see there is a possible L1TF issue between HT1 and HT2.
> >    This issue does not happen if HT1 is running host code, since the host
> >    kernel takes care of inverting PTE bits.
>
> The interrupt handler will be run with PTE inverted.  So I don't think
> there's a leak via L1TF in this scenario.

As Thomas and you later pointed out, this is still an issue and will
require a similar solution as described above.

thanks,

- Joel

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-03-17 21:17                           ` Thomas Gleixner
  2020-03-17 21:58                             ` Tim Chen
@ 2020-03-18  1:03                             ` Joel Fernandes
  2020-03-18  2:30                               ` Joel Fernandes
  1 sibling, 1 reply; 84+ messages in thread
From: Joel Fernandes @ 2020-03-18  1:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Tim Chen, Julien Desfossez, Peter Zijlstra,
	Vineeth Remanan Pillai, Aubrey Li, Nishanth Aravamudan,
	Ingo Molnar, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, Luck, Tony

Hi Thomas,

Thanks for the detailed email. I am on the same page with all your points. I
had a question on one of the points below (which I agree with as well), but
just to confirm:

On Tue, Mar 17, 2020 at 10:17:47PM +0100, Thomas Gleixner wrote:
[..] 
> >> 4. HT1 is idle, and HT2 is running a victim process. Now HT1 starts running
> >>    hostile code on guest or host. HT2 is being forced idle. However, there is
> >>    an overlap between HT1 starting to execute hostile code and HT2's victim
> >>    process getting scheduled out.
> >>    Speaking to Vineeth, we discussed an idea to monitor the core_sched_seq
> >>    counter of the sibling being idled to detect that it is now idle.
> >>    However we discussed today that looking at this data, it is not really an
> >>    issue since it is such a small window.
> 
> If the victim HT is kicked out of execution with an IPI then the overlap
> depends on the contexts:
> 
>         HT1 (attack)		HT2 (victim)
> 
>  A      idle -> user space      user space -> idle
> 
>  B      idle -> user space      guest -> idle
> 
>  C      idle -> guest           user space -> idle
> 
>  D      idle -> guest           guest -> idle
> 
> The IPI from HT1 brings HT2 immediately into the kernel when HT2 is in
> host user mode or brings it immediately into VMEXIT when HT2 is in guest
> mode.
> 
> #A On return from handling the IPI HT2 immediately reschedules to idle.
>    To have an overlap the return to user space on HT1 must be faster.
> 
> #B Coming back from VEMXIT into schedule/idle might take slightly longer
>    than #A.
> 
> #C Similar to #A, but reentering guest mode in HT1 after sending the IPI
>    will probably take longer.
> 
> #D Similar to #C if you make the assumption that VMEXIT on HT2 and
>    rescheduling into idle is not significantly slower than reaching
>    VMENTER after sending the IPI.
> 
> In all cases the data exposed by a potential overlap shouldn't be that
> interesting (e.g. scheduler state), but that obviously depends on what
> the attacker is looking for.

About the "shouldn't be that interesting" part, you are saying, the overlap
should not be that interesting because the act of one sibling IPI'ing the
other implies the sibling HT immediately entering kernel mode, right?

Thanks, your email really helped!!!

 - Joel



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-03-17 20:18                           ` Tim Chen
@ 2020-03-18  1:10                             ` Joel Fernandes
  0 siblings, 0 replies; 84+ messages in thread
From: Joel Fernandes @ 2020-03-18  1:10 UTC (permalink / raw)
  To: Tim Chen
  Cc: Julien Desfossez, Peter Zijlstra, Vineeth Remanan Pillai,
	Aubrey Li, Nishanth Aravamudan, Ingo Molnar, Thomas Gleixner,
	Paul Turner, Linus Torvalds, Linux List Kernel Mailing,
	Dario Faggioli, Frédéric Weisbecker, Kees Cook,
	Greg Kerr, Phil Auld, Aaron Lu, Valentin Schneider, Mel Gorman,
	Pawan Gupta, Paolo Bonzini, Luck, Tony

Hi Tim,

On Tue, Mar 17, 2020 at 01:18:45PM -0700, Tim Chen wrote:
[..] 
> Did you guys have a chance to try out v5 core scheduler?
> 
> > - Joel(ChromeOS) found a deadlock and crash on PREEMPT kernel in the
> >   coresched idle balance logic
> 
> We did some patches to fix a few stability issues in v4.  I wonder
> if v5 still has the deadlock that you saw before?

I have not yet tested v5. I fixed the RCU-related deadlock in the scheduler
with this patch sent few days ago:
https://lore.kernel.org/patchwork/patch/1209561/

I'll rebase my tree on v5 soon.

Thanks!

 - Joel


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-03-18  1:03                             ` Joel Fernandes
@ 2020-03-18  2:30                               ` Joel Fernandes
  0 siblings, 0 replies; 84+ messages in thread
From: Joel Fernandes @ 2020-03-18  2:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Tim Chen, Julien Desfossez, Peter Zijlstra,
	Vineeth Remanan Pillai, Aubrey Li, Nishanth Aravamudan,
	Ingo Molnar, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, Luck, Tony

> On Tue, Mar 17, 2020 at 10:17:47PM +0100, Thomas Gleixner wrote:
> [..]
> > >> 4. HT1 is idle, and HT2 is running a victim process. Now HT1 starts running
> > >>    hostile code on guest or host. HT2 is being forced idle. However, there is
> > >>    an overlap between HT1 starting to execute hostile code and HT2's victim
> > >>    process getting scheduled out.
> > >>    Speaking to Vineeth, we discussed an idea to monitor the core_sched_seq
> > >>    counter of the sibling being idled to detect that it is now idle.
> > >>    However we discussed today that looking at this data, it is not really an
> > >>    issue since it is such a small window.
> >
> > If the victim HT is kicked out of execution with an IPI then the overlap
> > depends on the contexts:
> >
> >         HT1 (attack)          HT2 (victim)
> >
> >  A      idle -> user space      user space -> idle
> >
> >  B      idle -> user space      guest -> idle
> >
> >  C      idle -> guest           user space -> idle
> >
> >  D      idle -> guest           guest -> idle
> >
> > The IPI from HT1 brings HT2 immediately into the kernel when HT2 is in
> > host user mode or brings it immediately into VMEXIT when HT2 is in guest
> > mode.
> >
> > #A On return from handling the IPI HT2 immediately reschedules to idle.
> >    To have an overlap the return to user space on HT1 must be faster.
> >
> > #B Coming back from VEMXIT into schedule/idle might take slightly longer
> >    than #A.
> >
> > #C Similar to #A, but reentering guest mode in HT1 after sending the IPI
> >    will probably take longer.
> >
> > #D Similar to #C if you make the assumption that VMEXIT on HT2 and
> >    rescheduling into idle is not significantly slower than reaching
> >    VMENTER after sending the IPI.
> >
> > In all cases the data exposed by a potential overlap shouldn't be that
> > interesting (e.g. scheduler state), but that obviously depends on what
> > the attacker is looking for.
>
> About the "shouldn't be that interesting" part, you are saying, the overlap
> should not be that interesting because the act of one sibling IPI'ing the
> other implies the sibling HT immediately entering kernel mode, right?

Hi Thomas,

I see what you mean now. Basically after the IPI is sent, the HT
sending the IPI would have buffers cleared before the attacker gets a
chance to run on it. Because the victim HT enters into the IPI
handler, all the attacker will get to see is possibly uninteresting
data (e.g. scheduler state) as you mentioned.

Thanks!

- Joel

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-03-18  0:52                           ` Joel Fernandes
@ 2020-03-18 11:53                             ` Thomas Gleixner
  2020-03-19  1:54                               ` Joel Fernandes
  0 siblings, 1 reply; 84+ messages in thread
From: Thomas Gleixner @ 2020-03-18 11:53 UTC (permalink / raw)
  To: Joel Fernandes, Tim Chen
  Cc: Julien Desfossez, Peter Zijlstra, Vineeth Remanan Pillai,
	Aubrey Li, Nishanth Aravamudan, Ingo Molnar, Paul Turner,
	Linus Torvalds, Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, Luck, Tony

Joel,

Joel Fernandes <joel@joelfernandes.org> writes:
> We have only 2 cores (4 HT) on many devices. It is not an option to
> dedicate a core to only running trusted code, that would kill
> performance. Another option is to designate a single HT of a
> particular core to run both untrusted code and an interrupt handler --
> but as Thomas pointed out, this does not work for per-CPU interrupts
> or managed interrupts, and the softirqs that they trigger. But if we
> just consider interrupts for which we can control the affinities (and
> assuming that most interrupts can be controlled like that), then maybe
> it will work? In the ChromeOS model, each untrusted task is in its own
> domain (cookie). So untrusted tasks cannot benefit from parallelism
> (in our case) anyway -- so it seems reasonable to run an affinable
> interrupt and an untrusted task, on a particular designated core.
>
> (Just thinking out loud...).  Another option could be a patch that
> Vineeth shared with me (that Peter experimentally wrote) where he
> sends IPI from an interrupt handler to a sibling running untrusted
> guest code which would result in it getting paused. I am hoping
> something like this could work on the host side as well (not just for
> guests). We could also set per-core state from the interrupted HT,
> possibly IPI'ing the untrusted sibling if we have to. If sibling runs
> untrusted code *after* the other's siblings interrupt already started,
> then the schedule() loop on the untrusted sibling would spin knowing
> the other sibling has an interrupt in progress. The softirq is a real
> problem though. Perhaps it can also set similar per-core state.

There is not much difference between bringing the sibling out of guest
mode and bringing it out of host user mode.

Adding state to force spinning until the other side has completed is not
rocket science either.

But the whole concept is prone to starvation issues and full of nasty
corner cases. From experiments I did back in the L1TF days I'm pretty
much convinced that this can't result in a generally usable solution.

Let me share a few thoughts on what might be doable with fewer horrors, but
be aware that this is mostly a brain dump of half-thought-out ideas.

1) Managed interrupts on multi queue devices

   It should be reasonably simple to force a reduced number of queues,
   which would in turn allow shielding one or two CPUs from such
   interrupts and queue handling, for the price of some indirection.

2) Timers and softirqs

   If device interrupts are targeted to "safe" CPUs then the amount of
   timers and soft interrupt processing will be reduced as well.

   That still leaves e.g. network TX side soft interrupts when the task
   running on a shielded core does networking. Maybe that's a non issue,
   but I'm not familiar enough with the network maze to give an answer.

   A possible workaround would be to force softirq processing into
   thread context so everything is under scheduler control. How well
   that scales is a different story.

   That would bring out the timer_list timers and reduce the potential
   surface to hrtimer expiry callbacks. Most of them should be fine
   (doing wakeups or scheduler housekeeping of some sort). For the
   others we might just utilize the mechanism which PREEMPT_RT uses and
   force them off into softirq expiry mode.
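For illustration, a minimal sketch of an hrtimer using the _SOFT expiry
mode mentioned above (the timer and its callback are made up; only the
hrtimer API calls are real):

#include <linux/hrtimer.h>
#include <linux/ktime.h>

static struct hrtimer demo_timer;

static enum hrtimer_restart demo_timer_fn(struct hrtimer *t)
{
        /* Runs from the hrtimer softirq instead of hard interrupt context,
         * i.e. the expiry is deferred the same way PREEMPT_RT does it. */
        return HRTIMER_NORESTART;
}

static void demo_timer_setup(void)
{
        hrtimer_init(&demo_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_SOFT);
        demo_timer.function = demo_timer_fn;
        hrtimer_start(&demo_timer, ms_to_ktime(100), HRTIMER_MODE_REL_SOFT);
}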

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [RFC PATCH v4 00/19] Core scheduling v4
  2020-03-18 11:53                             ` Thomas Gleixner
@ 2020-03-19  1:54                               ` Joel Fernandes
  0 siblings, 0 replies; 84+ messages in thread
From: Joel Fernandes @ 2020-03-19  1:54 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Tim Chen, Julien Desfossez, Peter Zijlstra,
	Vineeth Remanan Pillai, Aubrey Li, Nishanth Aravamudan,
	Ingo Molnar, Paul Turner, Linus Torvalds,
	Linux List Kernel Mailing, Dario Faggioli,
	Frédéric Weisbecker, Kees Cook, Greg Kerr, Phil Auld,
	Aaron Lu, Valentin Schneider, Mel Gorman, Pawan Gupta,
	Paolo Bonzini, Luck, Tony

Hi Thomas,

On Wed, Mar 18, 2020 at 12:53:01PM +0100, Thomas Gleixner wrote:
> Joel,
> 
> Joel Fernandes <joel@joelfernandes.org> writes:
> > We have only 2 cores (4 HT) on many devices. It is not an option to
> > dedicate a core to only running trusted code, that would kill
> > performance. Another option is to designate a single HT of a
> > particular core to run both untrusted code and an interrupt handler --
> > but as Thomas pointed out, this does not work for per-CPU interrupts
> > or managed interrupts, and the softirqs that they trigger. But if we
> > just consider interrupts for which we can control the affinities (and
> > assuming that most interrupts can be controlled like that), then maybe
> > it will work? In the ChromeOS model, each untrusted task is in its own
> > domain (cookie). So untrusted tasks cannot benefit from parallelism
> > (in our case) anyway -- so it seems reasonable to run an affinable
> > interrupt and an untrusted task, on a particular designated core.
> >
> > (Just thinking out loud...).  Another option could be a patch that
> > Vineeth shared with me (that Peter experimentally wrote) where he
> > sends IPI from an interrupt handler to a sibling running untrusted
> > guest code which would result in it getting paused. I am hoping
> > something like this could work on the host side as well (not just for
> > guests). We could also set per-core state from the interrupted HT,
> > possibly IPI'ing the untrusted sibling if we have to. If sibling runs
> > untrusted code *after* the other's siblings interrupt already started,
> > then the schedule() loop on the untrusted sibling would spin knowing
> > the other sibling has an interrupt in progress. The softirq is a real
> > problem though. Perhaps it can also set similar per-core state.
> 
> There is not much difference between bringing the sibling out of guest
> mode and bringing it out of host user mode.
> 
> Adding state to force spinning until the other side has completed is not
> rocket science either.
> 
> But the whole concept is prone to starvation issues and full of nasty
> corner cases. From experiments I did back in the L1TF days I'm pretty
> much convinced that this can't result in a generaly usable solution.
> Let me share a few thoughts what might be doable with less horrors, but
> be aware that this is mostly a brain dump of half thought out ideas.
> 
> 1) Managed interrupts on multi queue devices
> 
>    It should be reasonably simple to force a reduced number of queues
>    which would in turn allow to shield one ore two CPUs from such
>    interrupts and queue handling for the price of indirection.
> 
> 2) Timers and softirqs
> 
>    If device interrupts are targeted to "safe" CPUs then the amount of
>    timers and soft interrupt processing will be reduced as well.
> 
>    That still leaves e.g. network TX side soft interrupts when the task
>    running on a shielded core does networking. Maybe that's a non issue,
>    but I'm not familiar enough with the network maze to give an answer.
> 
>    A possible workaround would be to force softirq processing into
>    thread context so everything is under scheduler control. How well
>    that scales is a different story.
> 
>    That would bring out the timer_list timers and reduce the potential
>    surface to hrtimer expiry callbacks. Most of them should be fine
>    (doing wakeups or scheduler housekeeping of some sort). For the
>    others we might just utilize the mechanism which PREEMPT_RT uses and
>    force them off into softirq expiry mode.

Thanks a lot for your thoughts, it is really helpful.

One issue with interrupt affinities is that we have only 2 cores on many
devices (4 HTs).  If we assign 1 HT out of the 4 for receiving interrupts (at
least the ones whose affinity we can control), then that means we have to
devote that core to running only 1 trusted task at a time, since one of its 2
HTs is designated for interrupts. That leaves us with only the 2 HTs on the
other core for simultaneously running the 2 trusted tasks, thus reducing
parallelism.

Thanks for raising the point about running interrupts and softirqs in process
context to solve these; it turns out we do use threaded IRQs, though not for
most interrupts.  We could consider threading as many IRQs (and softirqs) as
possible, and for the ones we cannot thread, maybe we can audit the code and
assess the risk, possibly deferring processing of sensitive data to
workqueue context. Perhaps we can also conditionally force a softirq into
process context, say if we detect that the core has an untrusted task running
on the softirq HT's sibling, and otherwise let the softirq execute in softirq
context itself.
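As a hypothetical driver-side sketch of the threading part, using
request_threaded_irq() so the bulk of the handling runs in a schedulable
kernel thread (the names and the device cookie are placeholders):

#include <linux/interrupt.h>

static irqreturn_t demo_quick_handler(int irq, void *dev_id)
{
        /* acknowledge/silence the device, then defer the real work */
        return IRQ_WAKE_THREAD;
}

static irqreturn_t demo_thread_fn(int irq, void *dev_id)
{
        /* heavy / potentially sensitive processing, runs in a dedicated
         * kernel thread under scheduler control */
        return IRQ_HANDLED;
}

static int demo_setup_irq(unsigned int irq, void *dev_id)
{
        return request_threaded_irq(irq, demo_quick_handler, demo_thread_fn,
                                    IRQF_ONESHOT, "demo-threaded-irq", dev_id);
}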

Thanks, Thomas!!

 - Joel


^ permalink raw reply	[flat|nested] 84+ messages in thread

end of thread, other threads:[~2020-03-19  1:54 UTC | newest]

Thread overview: 84+ messages
2019-10-30 18:33 [RFC PATCH v4 00/19] Core scheduling v4 Vineeth Remanan Pillai
2019-10-30 18:33 ` [RFC PATCH v4 01/19] stop_machine: Fix stop_cpus_in_progress ordering Vineeth Remanan Pillai
2019-10-30 18:33 ` [RFC PATCH v4 02/19] sched: Fix kerneldoc comment for ia64_set_curr_task Vineeth Remanan Pillai
2019-10-30 18:33 ` [RFC PATCH v4 03/19] sched: Wrap rq::lock access Vineeth Remanan Pillai
2019-10-30 18:33 ` [RFC PATCH v4 04/19] sched/{rt,deadline}: Fix set_next_task vs pick_next_task Vineeth Remanan Pillai
2019-10-30 18:33 ` [RFC PATCH v4 05/19] sched: Add task_struct pointer to sched_class::set_curr_task Vineeth Remanan Pillai
2019-10-30 18:33 ` [RFC PATCH v4 06/19] sched/fair: Export newidle_balance() Vineeth Remanan Pillai
2019-10-30 18:33 ` [RFC PATCH v4 07/19] sched: Allow put_prev_task() to drop rq->lock Vineeth Remanan Pillai
2019-10-30 18:33 ` [RFC PATCH v4 08/19] sched: Rework pick_next_task() slow-path Vineeth Remanan Pillai
2019-10-30 18:33 ` [RFC PATCH v4 09/19] sched: Introduce sched_class::pick_task() Vineeth Remanan Pillai
2019-10-30 18:33 ` [RFC PATCH v4 10/19] sched: Core-wide rq->lock Vineeth Remanan Pillai
2019-10-30 18:33 ` [RFC PATCH v4 11/19] sched: Basic tracking of matching tasks Vineeth Remanan Pillai
2019-10-30 18:33 ` [RFC PATCH v4 12/19] sched: A quick and dirty cgroup tagging interface Vineeth Remanan Pillai
2019-10-30 18:33 ` [RFC PATCH v4 13/19] sched: Add core wide task selection and scheduling Vineeth Remanan Pillai
2019-10-30 18:33 ` [RFC PATCH v4 14/19] sched/fair: Add a few assertions Vineeth Remanan Pillai
2019-10-30 18:33 ` [RFC PATCH v4 15/19] sched: Trivial forced-newidle balancer Vineeth Remanan Pillai
2019-10-30 18:33 ` [RFC PATCH v4 16/19] sched: Debug bits Vineeth Remanan Pillai
2019-10-30 18:33 ` [RFC PATCH v4 17/19] sched/fair: wrapper for cfs_rq->min_vruntime Vineeth Remanan Pillai
2019-10-30 18:33 ` [RFC PATCH v4 18/19] sched/fair: core wide vruntime comparison Vineeth Remanan Pillai
2019-10-30 18:33 ` [RFC PATCH v4 19/19] sched/fair : Wake up forced idle siblings if needed Vineeth Remanan Pillai
2019-10-31 11:42 ` [RFC PATCH v4 00/19] Core scheduling v4 Li, Aubrey
2019-11-01 11:33   ` Li, Aubrey
2019-11-08  3:20     ` Li, Aubrey
2019-10-31 18:42 ` Phil Auld
2019-11-01 14:03   ` Vineeth Remanan Pillai
2019-11-01 16:35     ` Greg Kerr
2019-11-01 18:07       ` Dario Faggioli
2019-11-12  1:45     ` Dario Faggioli
2019-11-13 17:16       ` Tim Chen
2020-01-02  2:28       ` Aubrey Li
2020-01-10 23:19         ` Tim Chen
2019-11-11 19:10 ` Tim Chen
2020-01-14  1:12   ` Tim Chen
2020-01-14 15:40     ` Vineeth Remanan Pillai
2020-01-15  3:43       ` Li, Aubrey
2020-01-15 19:33         ` Tim Chen
2020-01-16  1:45           ` Aubrey Li
2020-01-17 16:00             ` Vineeth Remanan Pillai
2020-01-22 18:04               ` Gruza, Agata
2020-01-28  2:40       ` Dario Faggioli
     [not found]         ` <CANaguZDDpzrzdTmvjXvCmV2c+wBt6mXWSz4Vn-LJ-onc_Oj=yw@mail.gmail.com>
2020-02-01 15:31           ` Dario Faggioli
2020-02-06  0:28       ` Tim Chen
2020-02-06 22:37         ` Julien Desfossez
2020-02-12 23:07         ` Julien Desfossez
2020-02-13 18:37           ` Tim Chen
2020-02-14  6:10             ` Aubrey Li
     [not found]               ` <CANaguZC40mDHfL1H_9AA7H8cyd028t9PQVRqQ3kB4ha8R7hhqg@mail.gmail.com>
2020-02-15  6:01                 ` Aubrey Li
     [not found]                   ` <CANaguZBj_x_2+9KwbHCQScsmraC_mHdQB6uRqMTYMmvhBYfv2Q@mail.gmail.com>
2020-02-21 23:20                     ` Julien Desfossez
2020-03-17  0:55                       ` Joel Fernandes
2020-03-17 19:07                         ` Tim Chen
2020-03-17 20:18                           ` Tim Chen
2020-03-18  1:10                             ` Joel Fernandes
2020-03-17 21:17                           ` Thomas Gleixner
2020-03-17 21:58                             ` Tim Chen
2020-03-18  1:03                             ` Joel Fernandes
2020-03-18  2:30                               ` Joel Fernandes
2020-03-18  0:52                           ` Joel Fernandes
2020-03-18 11:53                             ` Thomas Gleixner
2020-03-19  1:54                               ` Joel Fernandes
2020-02-25  3:44               ` Aaron Lu
2020-02-25  5:32                 ` Aubrey Li
2020-02-25  7:34                   ` Aaron Lu
2020-02-25 10:40                     ` Aubrey Li
2020-02-25 11:21                       ` Aaron Lu
2020-02-25 13:41                         ` Aubrey Li
     [not found]                 ` <CANaguZD205ccu1V_2W-QuMRrJA9SjJ5ng1do4NCdLy8NDKKrbA@mail.gmail.com>
2020-02-26  3:13                   ` Aaron Lu
2020-02-26  7:21                   ` Aubrey Li
     [not found]                     ` <CANaguZDQZg-Z6aNpeLcjQ-cGm3X8CQOkZ_hnJNUyqDRM=yVDFQ@mail.gmail.com>
2020-02-27  4:45                       ` Aubrey Li
2020-02-28 23:55                       ` Tim Chen
2020-03-03 14:59                         ` Li, Aubrey
2020-03-03 23:54                           ` Li, Aubrey
2020-03-05  4:33                             ` Aaron Lu
2020-03-05  6:10                               ` Li, Aubrey
2020-03-05  8:52                                 ` Aaron Lu
2020-02-27  2:04                   ` Aaron Lu
2020-02-27 14:10                     ` Phil Auld
2020-02-27 14:37                       ` Aubrey Li
2020-02-28  2:54                       ` Aaron Lu
2020-03-05 13:45                         ` Aubrey Li
2020-03-06  2:41                           ` Aaron Lu
2020-03-06 18:06                             ` Tim Chen
2020-03-06 18:33                               ` Phil Auld
2020-03-06 21:44                                 ` Tim Chen
2020-03-07  3:13                                   ` Aaron Lu
