* [PATCH 00/19] sched: Core Scheduling
@ 2021-04-22 12:04 Peter Zijlstra
  2021-04-22 12:05 ` [PATCH 01/19] sched/fair: Add a few assertions Peter Zijlstra
                   ` (22 more replies)
  0 siblings, 23 replies; 103+ messages in thread
From: Peter Zijlstra @ 2021-04-22 12:04 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tglx

Hai,

This is an aggressive fold of all the core-scheduling work so far. I've stripped
a whole bunch of tags along the way (hopefully not too many; please yell if you
feel I made a mistake), including Tested-by. Please retest.

The changes since the last partial post are dropping all the cgroup stuff and
PR_SCHED_CORE_CLEAR, as well as that exec() behaviour, in order to resolve the
cgroup issue later.

Since we're really rather late for the coming merge window, my plan is to
merge the lot right after the merge window.

Again, please test.

These patches should shortly be available in my queue.git.

---
 b/kernel/sched/core_sched.c                     |  229 ++++++
 b/tools/testing/selftests/sched/.gitignore      |    1 
 b/tools/testing/selftests/sched/Makefile        |   14 
 b/tools/testing/selftests/sched/config          |    1 
 b/tools/testing/selftests/sched/cs_prctl_test.c |  338 +++++++++
 include/linux/sched.h                           |   19 
 include/uapi/linux/prctl.h                      |    8 
 kernel/Kconfig.preempt                          |    6 
 kernel/fork.c                                   |    4 
 kernel/sched/Makefile                           |    1 
 kernel/sched/core.c                             |  858 ++++++++++++++++++++++--
 kernel/sched/cpuacct.c                          |   12 
 kernel/sched/deadline.c                         |   38 -
 kernel/sched/debug.c                            |    4 
 kernel/sched/fair.c                             |  276 +++++--
 kernel/sched/idle.c                             |   13 
 kernel/sched/pelt.h                             |    2 
 kernel/sched/rt.c                               |   31 
 kernel/sched/sched.h                            |  393 ++++++++--
 kernel/sched/stop_task.c                        |   14 
 kernel/sched/topology.c                         |    4 
 kernel/sys.c                                    |    5 
 tools/include/uapi/linux/prctl.h                |    8 
 23 files changed, 2057 insertions(+), 222 deletions(-)



* [PATCH 01/19] sched/fair: Add a few assertions
  2021-04-22 12:04 [PATCH 00/19] sched: Core Scheduling Peter Zijlstra
@ 2021-04-22 12:05 ` Peter Zijlstra
  2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2021-04-22 12:05 ` [PATCH 02/19] sched: Provide raw_spin_rq_*lock*() helpers Peter Zijlstra
                   ` (21 subsequent siblings)
  22 siblings, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2021-04-22 12:05 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tglx


Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6246,6 +6246,11 @@ static int select_idle_sibling(struct ta
 		task_util = uclamp_task_util(p);
 	}
 
+	/*
+	 * per-cpu select_idle_mask usage
+	 */
+	lockdep_assert_irqs_disabled();
+
 	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
 	    asym_fits_capacity(task_util, target))
 		return target;
@@ -6711,8 +6716,6 @@ static int find_energy_efficient_cpu(str
  * certain conditions an idle sibling CPU if the domain has SD_WAKE_AFFINE set.
  *
  * Returns the target CPU number.
- *
- * preempt must be disabled.
  */
 static int
 select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
@@ -6725,6 +6728,10 @@ select_task_rq_fair(struct task_struct *
 	/* SD_flags and WF_flags share the first nibble */
 	int sd_flag = wake_flags & 0xF;
 
+	/*
+	 * required for stable ->cpus_allowed
+	 */
+	lockdep_assert_held(&p->pi_lock);
 	if (wake_flags & WF_TTWU) {
 		record_wakee(p);
 




* [PATCH 02/19] sched: Provide raw_spin_rq_*lock*() helpers
  2021-04-22 12:04 [PATCH 00/19] sched: Core Scheduling Peter Zijlstra
  2021-04-22 12:05 ` [PATCH 01/19] sched/fair: Add a few assertions Peter Zijlstra
@ 2021-04-22 12:05 ` Peter Zijlstra
  2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2021-04-22 12:05 ` [PATCH 03/19] sched: Wrap rq::lock access Peter Zijlstra
                   ` (20 subsequent siblings)
  22 siblings, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2021-04-22 12:05 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tglx

In preparation for playing games with rq->lock, add some rq_lock
wrappers.
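
As an illustration (not part of the patch), this is the pattern the later
patches apply to the call sites; resched_cpu() is used here only as a
familiar example and the function name is made up:

	void example_resched_cpu(int cpu)
	{
		struct rq *rq = cpu_rq(cpu);
		unsigned long flags;

		/* previously: raw_spin_lock_irqsave(&rq->lock, flags); */
		raw_spin_rq_lock_irqsave(rq, flags);
		if (cpu_online(cpu) || cpu == smp_processor_id())
			resched_curr(rq);
		/* previously: raw_spin_unlock_irqrestore(&rq->lock, flags); */
		raw_spin_rq_unlock_irqrestore(rq, flags);
	}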

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c  |   15 +++++++++++++++
 kernel/sched/sched.h |   50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 65 insertions(+)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -184,6 +184,21 @@ int sysctl_sched_rt_runtime = 950000;
  *
  */
 
+void raw_spin_rq_lock_nested(struct rq *rq, int subclass)
+{
+	raw_spin_lock_nested(rq_lockp(rq), subclass);
+}
+
+bool raw_spin_rq_trylock(struct rq *rq)
+{
+	return raw_spin_trylock(rq_lockp(rq));
+}
+
+void raw_spin_rq_unlock(struct rq *rq)
+{
+	raw_spin_unlock(rq_lockp(rq));
+}
+
 /*
  * __task_rq_lock - lock the rq @p resides on.
  */
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1113,6 +1113,56 @@ static inline bool is_migration_disabled
 #endif
 }
 
+static inline raw_spinlock_t *rq_lockp(struct rq *rq)
+{
+	return &rq->lock;
+}
+
+static inline void lockdep_assert_rq_held(struct rq *rq)
+{
+	lockdep_assert_held(rq_lockp(rq));
+}
+
+extern void raw_spin_rq_lock_nested(struct rq *rq, int subclass);
+extern bool raw_spin_rq_trylock(struct rq *rq);
+extern void raw_spin_rq_unlock(struct rq *rq);
+
+static inline void raw_spin_rq_lock(struct rq *rq)
+{
+	raw_spin_rq_lock_nested(rq, 0);
+}
+
+static inline void raw_spin_rq_lock_irq(struct rq *rq)
+{
+	local_irq_disable();
+	raw_spin_rq_lock(rq);
+}
+
+static inline void raw_spin_rq_unlock_irq(struct rq *rq)
+{
+	raw_spin_rq_unlock(rq);
+	local_irq_enable();
+}
+
+static inline unsigned long _raw_spin_rq_lock_irqsave(struct rq *rq)
+{
+	unsigned long flags;
+	local_irq_save(flags);
+	raw_spin_rq_lock(rq);
+	return flags;
+}
+
+static inline void raw_spin_rq_unlock_irqrestore(struct rq *rq, unsigned long flags)
+{
+	raw_spin_rq_unlock(rq);
+	local_irq_restore(flags);
+}
+
+#define raw_spin_rq_lock_irqsave(rq, flags)	\
+do {						\
+	flags = _raw_spin_rq_lock_irqsave(rq);	\
+} while (0)
+
 #ifdef CONFIG_SCHED_SMT
 extern void __update_idle_core(struct rq *rq);
 




* [PATCH 03/19] sched: Wrap rq::lock access
  2021-04-22 12:04 [PATCH 00/19] sched: Core Scheduling Peter Zijlstra
  2021-04-22 12:05 ` [PATCH 01/19] sched/fair: Add a few assertions Peter Zijlstra
  2021-04-22 12:05 ` [PATCH 02/19] sched: Provide raw_spin_rq_*lock*() helpers Peter Zijlstra
@ 2021-04-22 12:05 ` Peter Zijlstra
  2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2021-04-22 12:05 ` [PATCH 04/19] sched: Prepare for Core-wide rq->lock Peter Zijlstra
                   ` (19 subsequent siblings)
  22 siblings, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2021-04-22 12:05 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tglx

In preparation for playing games with rq->lock, abstract the thing
using an accessor.
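
In short, the substitution pattern is (a summary drawn from the hunks
below, not an exhaustive list):

	raw_spin_lock(&rq->lock)              ->  raw_spin_rq_lock(rq)
	raw_spin_unlock(&rq->lock)            ->  raw_spin_rq_unlock(rq)
	lockdep_assert_held(&rq->lock)        ->  lockdep_assert_rq_held(rq)
	lockdep_unpin_lock(&rq->lock, cookie) ->  lockdep_unpin_lock(rq_lockp(rq), cookie)

The field itself is renamed rq->lock -> rq->__lock so that any direct
user missed by the conversion fails to compile.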

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c     |   70 ++++++++++++++++----------------
 kernel/sched/cpuacct.c  |   12 ++---
 kernel/sched/deadline.c |   22 +++++-----
 kernel/sched/debug.c    |    4 -
 kernel/sched/fair.c     |   35 +++++++---------
 kernel/sched/idle.c     |    4 -
 kernel/sched/pelt.h     |    2 
 kernel/sched/rt.c       |   16 +++----
 kernel/sched/sched.h    |  103 +++++++++++++++++++++++-------------------------
 kernel/sched/topology.c |    4 -
 10 files changed, 135 insertions(+), 137 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -211,12 +211,12 @@ struct rq *__task_rq_lock(struct task_st
 
 	for (;;) {
 		rq = task_rq(p);
-		raw_spin_lock(&rq->lock);
+		raw_spin_rq_lock(rq);
 		if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
 			rq_pin_lock(rq, rf);
 			return rq;
 		}
-		raw_spin_unlock(&rq->lock);
+		raw_spin_rq_unlock(rq);
 
 		while (unlikely(task_on_rq_migrating(p)))
 			cpu_relax();
@@ -235,7 +235,7 @@ struct rq *task_rq_lock(struct task_stru
 	for (;;) {
 		raw_spin_lock_irqsave(&p->pi_lock, rf->flags);
 		rq = task_rq(p);
-		raw_spin_lock(&rq->lock);
+		raw_spin_rq_lock(rq);
 		/*
 		 *	move_queued_task()		task_rq_lock()
 		 *
@@ -257,7 +257,7 @@ struct rq *task_rq_lock(struct task_stru
 			rq_pin_lock(rq, rf);
 			return rq;
 		}
-		raw_spin_unlock(&rq->lock);
+		raw_spin_rq_unlock(rq);
 		raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
 
 		while (unlikely(task_on_rq_migrating(p)))
@@ -327,7 +327,7 @@ void update_rq_clock(struct rq *rq)
 {
 	s64 delta;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 
 	if (rq->clock_update_flags & RQCF_ACT_SKIP)
 		return;
@@ -626,7 +626,7 @@ void resched_curr(struct rq *rq)
 	struct task_struct *curr = rq->curr;
 	int cpu;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 
 	if (test_tsk_need_resched(curr))
 		return;
@@ -650,10 +650,10 @@ void resched_cpu(int cpu)
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	raw_spin_rq_lock_irqsave(rq, flags);
 	if (cpu_online(cpu) || cpu == smp_processor_id())
 		resched_curr(rq);
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_rq_unlock_irqrestore(rq, flags);
 }
 
 #ifdef CONFIG_SMP
@@ -1152,7 +1152,7 @@ static inline void uclamp_rq_inc_id(stru
 	struct uclamp_se *uc_se = &p->uclamp[clamp_id];
 	struct uclamp_bucket *bucket;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 
 	/* Update task effective clamp */
 	p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
@@ -1192,7 +1192,7 @@ static inline void uclamp_rq_dec_id(stru
 	unsigned int bkt_clamp;
 	unsigned int rq_clamp;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 
 	/*
 	 * If sched_uclamp_used was enabled after task @p was enqueued,
@@ -1865,7 +1865,7 @@ static inline bool is_cpu_allowed(struct
 static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
 				   struct task_struct *p, int new_cpu)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 
 	deactivate_task(rq, p, DEQUEUE_NOCLOCK);
 	set_task_cpu(p, new_cpu);
@@ -2039,7 +2039,7 @@ int push_cpu_stop(void *arg)
 	struct task_struct *p = arg;
 
 	raw_spin_lock_irq(&p->pi_lock);
-	raw_spin_lock(&rq->lock);
+	raw_spin_rq_lock(rq);
 
 	if (task_rq(p) != rq)
 		goto out_unlock;
@@ -2069,7 +2069,7 @@ int push_cpu_stop(void *arg)
 
 out_unlock:
 	rq->push_busy = false;
-	raw_spin_unlock(&rq->lock);
+	raw_spin_rq_unlock(rq);
 	raw_spin_unlock_irq(&p->pi_lock);
 
 	put_task_struct(p);
@@ -2122,7 +2122,7 @@ __do_set_cpus_allowed(struct task_struct
 		 * Because __kthread_bind() calls this on blocked tasks without
 		 * holding rq->lock.
 		 */
-		lockdep_assert_held(&rq->lock);
+		lockdep_assert_rq_held(rq);
 		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
 	}
 	if (running)
@@ -2463,7 +2463,7 @@ void set_task_cpu(struct task_struct *p,
 	 * task_rq_lock().
 	 */
 	WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
-				      lockdep_is_held(&task_rq(p)->lock)));
+				      lockdep_is_held(rq_lockp(task_rq(p)))));
 #endif
 	/*
 	 * Clearly, migrating tasks to offline CPUs is a fairly daft thing.
@@ -3005,7 +3005,7 @@ ttwu_do_activate(struct rq *rq, struct t
 {
 	int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 
 	if (p->sched_contributes_to_load)
 		rq->nr_uninterruptible--;
@@ -4016,7 +4016,7 @@ static void do_balance_callbacks(struct
 	void (*func)(struct rq *rq);
 	struct callback_head *next;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 
 	while (head) {
 		func = (void (*)(struct rq *))head->func;
@@ -4039,7 +4039,7 @@ static inline struct callback_head *spli
 {
 	struct callback_head *head = rq->balance_callback;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 	if (head)
 		rq->balance_callback = NULL;
 
@@ -4056,9 +4056,9 @@ static inline void balance_callbacks(str
 	unsigned long flags;
 
 	if (unlikely(head)) {
-		raw_spin_lock_irqsave(&rq->lock, flags);
+		raw_spin_rq_lock_irqsave(rq, flags);
 		do_balance_callbacks(rq, head);
-		raw_spin_unlock_irqrestore(&rq->lock, flags);
+		raw_spin_rq_unlock_irqrestore(rq, flags);
 	}
 }
 
@@ -4089,10 +4089,10 @@ prepare_lock_switch(struct rq *rq, struc
 	 * do an early lockdep release here:
 	 */
 	rq_unpin_lock(rq, rf);
-	spin_release(&rq->lock.dep_map, _THIS_IP_);
+	spin_release(&rq_lockp(rq)->dep_map, _THIS_IP_);
 #ifdef CONFIG_DEBUG_SPINLOCK
 	/* this is a valid case when another task releases the spinlock */
-	rq->lock.owner = next;
+	rq_lockp(rq)->owner = next;
 #endif
 }
 
@@ -4103,9 +4103,9 @@ static inline void finish_lock_switch(st
 	 * fix up the runqueue lock - which gets 'carried over' from
 	 * prev into current:
 	 */
-	spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
+	spin_acquire(&rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
 	__balance_callbacks(rq);
-	raw_spin_unlock_irq(&rq->lock);
+	raw_spin_rq_unlock_irq(rq);
 }
 
 /*
@@ -5165,7 +5165,7 @@ static void __sched notrace __schedule(b
 
 		rq_unpin_lock(rq, &rf);
 		__balance_callbacks(rq);
-		raw_spin_unlock_irq(&rq->lock);
+		raw_spin_rq_unlock_irq(rq);
 	}
 }
 
@@ -5707,7 +5707,7 @@ void rt_mutex_setprio(struct task_struct
 
 	rq_unpin_lock(rq, &rf);
 	__balance_callbacks(rq);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_rq_unlock(rq);
 
 	preempt_enable();
 }
@@ -7456,7 +7456,7 @@ void init_idle(struct task_struct *idle,
 	__sched_fork(0, idle);
 
 	raw_spin_lock_irqsave(&idle->pi_lock, flags);
-	raw_spin_lock(&rq->lock);
+	raw_spin_rq_lock(rq);
 
 	idle->state = TASK_RUNNING;
 	idle->se.exec_start = sched_clock();
@@ -7494,7 +7494,7 @@ void init_idle(struct task_struct *idle,
 #ifdef CONFIG_SMP
 	idle->on_cpu = 1;
 #endif
-	raw_spin_unlock(&rq->lock);
+	raw_spin_rq_unlock(rq);
 	raw_spin_unlock_irqrestore(&idle->pi_lock, flags);
 
 	/* Set the preempt count _outside_ the spinlocks! */
@@ -7660,7 +7660,7 @@ static void balance_push(struct rq *rq)
 {
 	struct task_struct *push_task = rq->curr;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 	SCHED_WARN_ON(rq->cpu != smp_processor_id());
 
 	/*
@@ -7698,9 +7698,9 @@ static void balance_push(struct rq *rq)
 		 */
 		if (!rq->nr_running && !rq_has_pinned_tasks(rq) &&
 		    rcuwait_active(&rq->hotplug_wait)) {
-			raw_spin_unlock(&rq->lock);
+			raw_spin_rq_unlock(rq);
 			rcuwait_wake_up(&rq->hotplug_wait);
-			raw_spin_lock(&rq->lock);
+			raw_spin_rq_lock(rq);
 		}
 		return;
 	}
@@ -7710,7 +7710,7 @@ static void balance_push(struct rq *rq)
 	 * Temporarily drop rq->lock such that we can wake-up the stop task.
 	 * Both preemption and IRQs are still disabled.
 	 */
-	raw_spin_unlock(&rq->lock);
+	raw_spin_rq_unlock(rq);
 	stop_one_cpu_nowait(rq->cpu, __balance_push_cpu_stop, push_task,
 			    this_cpu_ptr(&push_work));
 	/*
@@ -7718,7 +7718,7 @@ static void balance_push(struct rq *rq)
 	 * schedule(). The next pick is obviously going to be the stop task
 	 * which kthread_is_per_cpu() and will push this task away.
 	 */
-	raw_spin_lock(&rq->lock);
+	raw_spin_rq_lock(rq);
 }
 
 static void balance_push_set(int cpu, bool on)
@@ -8008,7 +8008,7 @@ static void dump_rq_tasks(struct rq *rq,
 	struct task_struct *g, *p;
 	int cpu = cpu_of(rq);
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 
 	printk("%sCPU%d enqueued tasks (%u total):\n", loglvl, cpu, rq->nr_running);
 	for_each_process_thread(g, p) {
@@ -8181,7 +8181,7 @@ void __init sched_init(void)
 		struct rq *rq;
 
 		rq = cpu_rq(i);
-		raw_spin_lock_init(&rq->lock);
+		raw_spin_lock_init(&rq->__lock);
 		rq->nr_running = 0;
 		rq->calc_load_active = 0;
 		rq->calc_load_update = jiffies + LOAD_FREQ;
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -112,7 +112,7 @@ static u64 cpuacct_cpuusage_read(struct
 	/*
 	 * Take rq->lock to make 64-bit read safe on 32-bit platforms.
 	 */
-	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_rq_lock_irq(cpu_rq(cpu));
 #endif
 
 	if (index == CPUACCT_STAT_NSTATS) {
@@ -126,7 +126,7 @@ static u64 cpuacct_cpuusage_read(struct
 	}
 
 #ifndef CONFIG_64BIT
-	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_rq_unlock_irq(cpu_rq(cpu));
 #endif
 
 	return data;
@@ -141,14 +141,14 @@ static void cpuacct_cpuusage_write(struc
 	/*
 	 * Take rq->lock to make 64-bit write safe on 32-bit platforms.
 	 */
-	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_rq_lock_irq(cpu_rq(cpu));
 #endif
 
 	for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
 		cpuusage->usages[i] = val;
 
 #ifndef CONFIG_64BIT
-	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_rq_unlock_irq(cpu_rq(cpu));
 #endif
 }
 
@@ -253,13 +253,13 @@ static int cpuacct_all_seq_show(struct s
 			 * Take rq->lock to make 64-bit read safe on 32-bit
 			 * platforms.
 			 */
-			raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+			raw_spin_rq_lock_irq(cpu_rq(cpu));
 #endif
 
 			seq_printf(m, " %llu", cpuusage->usages[index]);
 
 #ifndef CONFIG_64BIT
-			raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+			raw_spin_rq_unlock_irq(cpu_rq(cpu));
 #endif
 		}
 		seq_puts(m, "\n");
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -157,7 +157,7 @@ void __add_running_bw(u64 dl_bw, struct
 {
 	u64 old = dl_rq->running_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_rq_held(rq_of_dl_rq(dl_rq));
 	dl_rq->running_bw += dl_bw;
 	SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
 	SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
@@ -170,7 +170,7 @@ void __sub_running_bw(u64 dl_bw, struct
 {
 	u64 old = dl_rq->running_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_rq_held(rq_of_dl_rq(dl_rq));
 	dl_rq->running_bw -= dl_bw;
 	SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
 	if (dl_rq->running_bw > old)
@@ -184,7 +184,7 @@ void __add_rq_bw(u64 dl_bw, struct dl_rq
 {
 	u64 old = dl_rq->this_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_rq_held(rq_of_dl_rq(dl_rq));
 	dl_rq->this_bw += dl_bw;
 	SCHED_WARN_ON(dl_rq->this_bw < old); /* overflow */
 }
@@ -194,7 +194,7 @@ void __sub_rq_bw(u64 dl_bw, struct dl_rq
 {
 	u64 old = dl_rq->this_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_rq_held(rq_of_dl_rq(dl_rq));
 	dl_rq->this_bw -= dl_bw;
 	SCHED_WARN_ON(dl_rq->this_bw > old); /* underflow */
 	if (dl_rq->this_bw > old)
@@ -987,7 +987,7 @@ static int start_dl_timer(struct task_st
 	ktime_t now, act;
 	s64 delta;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 
 	/*
 	 * We want the timer to fire at the deadline, but considering
@@ -1097,9 +1097,9 @@ static enum hrtimer_restart dl_task_time
 		 * If the runqueue is no longer available, migrate the
 		 * task elsewhere. This necessarily changes rq.
 		 */
-		lockdep_unpin_lock(&rq->lock, rf.cookie);
+		lockdep_unpin_lock(rq_lockp(rq), rf.cookie);
 		rq = dl_task_offline_migration(rq, p);
-		rf.cookie = lockdep_pin_lock(&rq->lock);
+		rf.cookie = lockdep_pin_lock(rq_lockp(rq));
 		update_rq_clock(rq);
 
 		/*
@@ -1731,7 +1731,7 @@ static void migrate_task_rq_dl(struct ta
 	 * from try_to_wake_up(). Hence, p->pi_lock is locked, but
 	 * rq->lock is not... So, lock it
 	 */
-	raw_spin_lock(&rq->lock);
+	raw_spin_rq_lock(rq);
 	if (p->dl.dl_non_contending) {
 		sub_running_bw(&p->dl, &rq->dl);
 		p->dl.dl_non_contending = 0;
@@ -1746,7 +1746,7 @@ static void migrate_task_rq_dl(struct ta
 			put_task_struct(p);
 	}
 	sub_rq_bw(&p->dl, &rq->dl);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_rq_unlock(rq);
 }
 
 static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
@@ -2291,10 +2291,10 @@ static void pull_dl_task(struct rq *this
 		double_unlock_balance(this_rq, src_rq);
 
 		if (push_task) {
-			raw_spin_unlock(&this_rq->lock);
+			raw_spin_rq_unlock(this_rq);
 			stop_one_cpu_nowait(src_rq->cpu, push_cpu_stop,
 					    push_task, &src_rq->push_work);
-			raw_spin_lock(&this_rq->lock);
+			raw_spin_rq_lock(this_rq);
 		}
 	}
 
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -576,7 +576,7 @@ void print_cfs_rq(struct seq_file *m, in
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "exec_clock",
 			SPLIT_NS(cfs_rq->exec_clock));
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	raw_spin_rq_lock_irqsave(rq, flags);
 	if (rb_first_cached(&cfs_rq->tasks_timeline))
 		MIN_vruntime = (__pick_first_entity(cfs_rq))->vruntime;
 	last = __pick_last_entity(cfs_rq);
@@ -584,7 +584,7 @@ void print_cfs_rq(struct seq_file *m, in
 		max_vruntime = last->vruntime;
 	min_vruntime = cfs_rq->min_vruntime;
 	rq0_min_vruntime = cpu_rq(0)->cfs.min_vruntime;
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_rq_unlock_irqrestore(rq, flags);
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "MIN_vruntime",
 			SPLIT_NS(MIN_vruntime));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "min_vruntime",
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1107,7 +1107,7 @@ struct numa_group {
 static struct numa_group *deref_task_numa_group(struct task_struct *p)
 {
 	return rcu_dereference_check(p->numa_group, p == current ||
-		(lockdep_is_held(&task_rq(p)->lock) && !READ_ONCE(p->on_cpu)));
+		(lockdep_is_held(rq_lockp(task_rq(p))) && !READ_ONCE(p->on_cpu)));
 }
 
 static struct numa_group *deref_curr_numa_group(struct task_struct *p)
@@ -5328,7 +5328,7 @@ static void __maybe_unused update_runtim
 {
 	struct task_group *tg;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(tg, &task_groups, list) {
@@ -5347,7 +5347,7 @@ static void __maybe_unused unthrottle_of
 {
 	struct task_group *tg;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(tg, &task_groups, list) {
@@ -6874,7 +6874,7 @@ static void migrate_task_rq_fair(struct
 		 * In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old'
 		 * rq->lock and can modify state directly.
 		 */
-		lockdep_assert_held(&task_rq(p)->lock);
+		lockdep_assert_rq_held(task_rq(p));
 		detach_entity_cfs_rq(&p->se);
 
 	} else {
@@ -7501,7 +7501,7 @@ static int task_hot(struct task_struct *
 {
 	s64 delta;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_rq_held(env->src_rq);
 
 	if (p->sched_class != &fair_sched_class)
 		return 0;
@@ -7599,7 +7599,7 @@ int can_migrate_task(struct task_struct
 {
 	int tsk_cache_hot;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_rq_held(env->src_rq);
 
 	/*
 	 * We do not migrate tasks that are:
@@ -7688,7 +7688,7 @@ int can_migrate_task(struct task_struct
  */
 static void detach_task(struct task_struct *p, struct lb_env *env)
 {
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_rq_held(env->src_rq);
 
 	deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
 	set_task_cpu(p, env->dst_cpu);
@@ -7704,7 +7704,7 @@ static struct task_struct *detach_one_ta
 {
 	struct task_struct *p;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_rq_held(env->src_rq);
 
 	list_for_each_entry_reverse(p,
 			&env->src_rq->cfs_tasks, se.group_node) {
@@ -7740,7 +7740,7 @@ static int detach_tasks(struct lb_env *e
 	struct task_struct *p;
 	int detached = 0;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_rq_held(env->src_rq);
 
 	/*
 	 * Source run queue has been emptied by another CPU, clear
@@ -7870,7 +7870,7 @@ static int detach_tasks(struct lb_env *e
  */
 static void attach_task(struct rq *rq, struct task_struct *p)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 
 	BUG_ON(task_rq(p) != rq);
 	activate_task(rq, p, ENQUEUE_NOCLOCK);
@@ -9781,7 +9781,7 @@ static int load_balance(int this_cpu, st
 		if (need_active_balance(&env)) {
 			unsigned long flags;
 
-			raw_spin_lock_irqsave(&busiest->lock, flags);
+			raw_spin_rq_lock_irqsave(busiest, flags);
 
 			/*
 			 * Don't kick the active_load_balance_cpu_stop,
@@ -9789,8 +9789,7 @@ static int load_balance(int this_cpu, st
 			 * moved to this_cpu:
 			 */
 			if (!cpumask_test_cpu(this_cpu, busiest->curr->cpus_ptr)) {
-				raw_spin_unlock_irqrestore(&busiest->lock,
-							    flags);
+				raw_spin_rq_unlock_irqrestore(busiest, flags);
 				goto out_one_pinned;
 			}
 
@@ -9807,7 +9806,7 @@ static int load_balance(int this_cpu, st
 				busiest->push_cpu = this_cpu;
 				active_balance = 1;
 			}
-			raw_spin_unlock_irqrestore(&busiest->lock, flags);
+			raw_spin_rq_unlock_irqrestore(busiest, flags);
 
 			if (active_balance) {
 				stop_one_cpu_nowait(cpu_of(busiest),
@@ -10624,7 +10623,7 @@ static int newidle_balance(struct rq *th
 		goto out;
 	}
 
-	raw_spin_unlock(&this_rq->lock);
+	raw_spin_rq_unlock(this_rq);
 
 	update_blocked_averages(this_cpu);
 	rcu_read_lock();
@@ -10662,7 +10661,7 @@ static int newidle_balance(struct rq *th
 	}
 	rcu_read_unlock();
 
-	raw_spin_lock(&this_rq->lock);
+	raw_spin_rq_lock(this_rq);
 
 	if (curr_cost > this_rq->max_idle_balance_cost)
 		this_rq->max_idle_balance_cost = curr_cost;
@@ -11143,9 +11142,9 @@ void unregister_fair_sched_group(struct
 
 		rq = cpu_rq(cpu);
 
-		raw_spin_lock_irqsave(&rq->lock, flags);
+		raw_spin_rq_lock_irqsave(rq, flags);
 		list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
-		raw_spin_unlock_irqrestore(&rq->lock, flags);
+		raw_spin_rq_unlock_irqrestore(rq, flags);
 	}
 }
 
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -455,10 +455,10 @@ struct task_struct *pick_next_task_idle(
 static void
 dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags)
 {
-	raw_spin_unlock_irq(&rq->lock);
+	raw_spin_rq_unlock_irq(rq);
 	printk(KERN_ERR "bad: scheduling from the idle thread!\n");
 	dump_stack();
-	raw_spin_lock_irq(&rq->lock);
+	raw_spin_rq_lock_irq(rq);
 }
 
 /*
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -141,7 +141,7 @@ static inline void update_idle_rq_clock_
 
 static inline u64 rq_clock_pelt(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 	assert_clock_updated(rq);
 
 	return rq->clock_pelt - rq->lost_idle_time;
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -888,7 +888,7 @@ static int do_sched_rt_period_timer(stru
 		if (skip)
 			continue;
 
-		raw_spin_lock(&rq->lock);
+		raw_spin_rq_lock(rq);
 		update_rq_clock(rq);
 
 		if (rt_rq->rt_time) {
@@ -926,7 +926,7 @@ static int do_sched_rt_period_timer(stru
 
 		if (enqueue)
 			sched_rt_rq_enqueue(rt_rq);
-		raw_spin_unlock(&rq->lock);
+		raw_spin_rq_unlock(rq);
 	}
 
 	if (!throttled && (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF))
@@ -1894,10 +1894,10 @@ static int push_rt_task(struct rq *rq, b
 		 */
 		push_task = get_push_task(rq);
 		if (push_task) {
-			raw_spin_unlock(&rq->lock);
+			raw_spin_rq_unlock(rq);
 			stop_one_cpu_nowait(rq->cpu, push_cpu_stop,
 					    push_task, &rq->push_work);
-			raw_spin_lock(&rq->lock);
+			raw_spin_rq_lock(rq);
 		}
 
 		return 0;
@@ -2122,10 +2122,10 @@ void rto_push_irq_work_func(struct irq_w
 	 * When it gets updated, a check is made if a push is possible.
 	 */
 	if (has_pushable_tasks(rq)) {
-		raw_spin_lock(&rq->lock);
+		raw_spin_rq_lock(rq);
 		while (push_rt_task(rq, true))
 			;
-		raw_spin_unlock(&rq->lock);
+		raw_spin_rq_unlock(rq);
 	}
 
 	raw_spin_lock(&rd->rto_lock);
@@ -2243,10 +2243,10 @@ static void pull_rt_task(struct rq *this
 		double_unlock_balance(this_rq, src_rq);
 
 		if (push_task) {
-			raw_spin_unlock(&this_rq->lock);
+			raw_spin_rq_unlock(this_rq);
 			stop_one_cpu_nowait(src_rq->cpu, push_cpu_stop,
 					    push_task, &src_rq->push_work);
-			raw_spin_lock(&this_rq->lock);
+			raw_spin_rq_lock(this_rq);
 		}
 	}
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -905,7 +905,7 @@ DECLARE_STATIC_KEY_FALSE(sched_uclamp_us
  */
 struct rq {
 	/* runqueue lock: */
-	raw_spinlock_t		lock;
+	raw_spinlock_t		__lock;
 
 	/*
 	 * nr_running and cpu_load should be in the same cacheline because
@@ -1115,7 +1115,7 @@ static inline bool is_migration_disabled
 
 static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 {
-	return &rq->lock;
+	return &rq->__lock;
 }
 
 static inline void lockdep_assert_rq_held(struct rq *rq)
@@ -1229,7 +1229,7 @@ static inline void assert_clock_updated(
 
 static inline u64 rq_clock(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 	assert_clock_updated(rq);
 
 	return rq->clock;
@@ -1237,7 +1237,7 @@ static inline u64 rq_clock(struct rq *rq
 
 static inline u64 rq_clock_task(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 	assert_clock_updated(rq);
 
 	return rq->clock_task;
@@ -1263,7 +1263,7 @@ static inline u64 rq_clock_thermal(struc
 
 static inline void rq_clock_skip_update(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 	rq->clock_update_flags |= RQCF_REQ_SKIP;
 }
 
@@ -1273,7 +1273,7 @@ static inline void rq_clock_skip_update(
  */
 static inline void rq_clock_cancel_skipupdate(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 	rq->clock_update_flags &= ~RQCF_REQ_SKIP;
 }
 
@@ -1304,7 +1304,7 @@ extern struct callback_head balance_push
  */
 static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
 {
-	rf->cookie = lockdep_pin_lock(&rq->lock);
+	rf->cookie = lockdep_pin_lock(rq_lockp(rq));
 
 #ifdef CONFIG_SCHED_DEBUG
 	rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
@@ -1322,12 +1322,12 @@ static inline void rq_unpin_lock(struct
 		rf->clock_update_flags = RQCF_UPDATED;
 #endif
 
-	lockdep_unpin_lock(&rq->lock, rf->cookie);
+	lockdep_unpin_lock(rq_lockp(rq), rf->cookie);
 }
 
 static inline void rq_repin_lock(struct rq *rq, struct rq_flags *rf)
 {
-	lockdep_repin_lock(&rq->lock, rf->cookie);
+	lockdep_repin_lock(rq_lockp(rq), rf->cookie);
 
 #ifdef CONFIG_SCHED_DEBUG
 	/*
@@ -1348,7 +1348,7 @@ static inline void __task_rq_unlock(stru
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_rq_unlock(rq);
 }
 
 static inline void
@@ -1357,7 +1357,7 @@ task_rq_unlock(struct rq *rq, struct tas
 	__releases(p->pi_lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_rq_unlock(rq);
 	raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
 }
 
@@ -1365,7 +1365,7 @@ static inline void
 rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock_irqsave(&rq->lock, rf->flags);
+	raw_spin_rq_lock_irqsave(rq, rf->flags);
 	rq_pin_lock(rq, rf);
 }
 
@@ -1373,7 +1373,7 @@ static inline void
 rq_lock_irq(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock_irq(&rq->lock);
+	raw_spin_rq_lock_irq(rq);
 	rq_pin_lock(rq, rf);
 }
 
@@ -1381,7 +1381,7 @@ static inline void
 rq_lock(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock(&rq->lock);
+	raw_spin_rq_lock(rq);
 	rq_pin_lock(rq, rf);
 }
 
@@ -1389,7 +1389,7 @@ static inline void
 rq_relock(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock(&rq->lock);
+	raw_spin_rq_lock(rq);
 	rq_repin_lock(rq, rf);
 }
 
@@ -1398,7 +1398,7 @@ rq_unlock_irqrestore(struct rq *rq, stru
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock_irqrestore(&rq->lock, rf->flags);
+	raw_spin_rq_unlock_irqrestore(rq, rf->flags);
 }
 
 static inline void
@@ -1406,7 +1406,7 @@ rq_unlock_irq(struct rq *rq, struct rq_f
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock_irq(&rq->lock);
+	raw_spin_rq_unlock_irq(rq);
 }
 
 static inline void
@@ -1414,7 +1414,7 @@ rq_unlock(struct rq *rq, struct rq_flags
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_rq_unlock(rq);
 }
 
 static inline struct rq *
@@ -1479,7 +1479,7 @@ queue_balance_callback(struct rq *rq,
 		       struct callback_head *head,
 		       void (*func)(struct rq *rq))
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 
 	if (unlikely(head->next || rq->balance_callback == &balance_push_callback))
 		return;
@@ -2019,7 +2019,7 @@ static inline struct task_struct *get_pu
 {
 	struct task_struct *p = rq->curr;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 
 	if (rq->push_busy)
 		return NULL;
@@ -2249,7 +2249,7 @@ static inline int _double_lock_balance(s
 	__acquires(busiest->lock)
 	__acquires(this_rq->lock)
 {
-	raw_spin_unlock(&this_rq->lock);
+	raw_spin_rq_unlock(this_rq);
 	double_rq_lock(this_rq, busiest);
 
 	return 1;
@@ -2268,20 +2268,22 @@ static inline int _double_lock_balance(s
 	__acquires(busiest->lock)
 	__acquires(this_rq->lock)
 {
-	int ret = 0;
+	if (rq_lockp(this_rq) == rq_lockp(busiest))
+		return 0;
 
-	if (unlikely(!raw_spin_trylock(&busiest->lock))) {
-		if (busiest < this_rq) {
-			raw_spin_unlock(&this_rq->lock);
-			raw_spin_lock(&busiest->lock);
-			raw_spin_lock_nested(&this_rq->lock,
-					      SINGLE_DEPTH_NESTING);
-			ret = 1;
-		} else
-			raw_spin_lock_nested(&busiest->lock,
-					      SINGLE_DEPTH_NESTING);
+	if (likely(raw_spin_rq_trylock(busiest)))
+		return 0;
+
+	if (rq_lockp(busiest) >= rq_lockp(this_rq)) {
+		raw_spin_rq_lock_nested(busiest, SINGLE_DEPTH_NESTING);
+		return 0;
 	}
-	return ret;
+
+	raw_spin_rq_unlock(this_rq);
+	raw_spin_rq_lock(busiest);
+	raw_spin_rq_lock_nested(this_rq, SINGLE_DEPTH_NESTING);
+
+	return 1;
 }
 
 #endif /* CONFIG_PREEMPTION */
@@ -2291,11 +2293,7 @@ static inline int _double_lock_balance(s
  */
 static inline int double_lock_balance(struct rq *this_rq, struct rq *busiest)
 {
-	if (unlikely(!irqs_disabled())) {
-		/* printk() doesn't work well under rq->lock */
-		raw_spin_unlock(&this_rq->lock);
-		BUG_ON(1);
-	}
+	lockdep_assert_irqs_disabled();
 
 	return _double_lock_balance(this_rq, busiest);
 }
@@ -2303,8 +2301,9 @@ static inline int double_lock_balance(st
 static inline void double_unlock_balance(struct rq *this_rq, struct rq *busiest)
 	__releases(busiest->lock)
 {
-	raw_spin_unlock(&busiest->lock);
-	lock_set_subclass(&this_rq->lock.dep_map, 0, _RET_IP_);
+	if (rq_lockp(this_rq) != rq_lockp(busiest))
+		raw_spin_rq_unlock(busiest);
+	lock_set_subclass(&rq_lockp(this_rq)->dep_map, 0, _RET_IP_);
 }
 
 static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
@@ -2345,16 +2344,16 @@ static inline void double_rq_lock(struct
 	__acquires(rq2->lock)
 {
 	BUG_ON(!irqs_disabled());
-	if (rq1 == rq2) {
-		raw_spin_lock(&rq1->lock);
+	if (rq_lockp(rq1) == rq_lockp(rq2)) {
+		raw_spin_rq_lock(rq1);
 		__acquire(rq2->lock);	/* Fake it out ;) */
 	} else {
-		if (rq1 < rq2) {
-			raw_spin_lock(&rq1->lock);
-			raw_spin_lock_nested(&rq2->lock, SINGLE_DEPTH_NESTING);
+		if (rq_lockp(rq1) < rq_lockp(rq2)) {
+			raw_spin_rq_lock(rq1);
+			raw_spin_rq_lock_nested(rq2, SINGLE_DEPTH_NESTING);
 		} else {
-			raw_spin_lock(&rq2->lock);
-			raw_spin_lock_nested(&rq1->lock, SINGLE_DEPTH_NESTING);
+			raw_spin_rq_lock(rq2);
+			raw_spin_rq_lock_nested(rq1, SINGLE_DEPTH_NESTING);
 		}
 	}
 }
@@ -2369,9 +2368,9 @@ static inline void double_rq_unlock(stru
 	__releases(rq1->lock)
 	__releases(rq2->lock)
 {
-	raw_spin_unlock(&rq1->lock);
-	if (rq1 != rq2)
-		raw_spin_unlock(&rq2->lock);
+	raw_spin_rq_unlock(rq1);
+	if (rq_lockp(rq1) != rq_lockp(rq2))
+		raw_spin_rq_unlock(rq2);
 	else
 		__release(rq2->lock);
 }
@@ -2394,7 +2393,7 @@ static inline void double_rq_lock(struct
 {
 	BUG_ON(!irqs_disabled());
 	BUG_ON(rq1 != rq2);
-	raw_spin_lock(&rq1->lock);
+	raw_spin_rq_lock(rq1);
 	__acquire(rq2->lock);	/* Fake it out ;) */
 }
 
@@ -2409,7 +2408,7 @@ static inline void double_rq_unlock(stru
 	__releases(rq2->lock)
 {
 	BUG_ON(rq1 != rq2);
-	raw_spin_unlock(&rq1->lock);
+	raw_spin_rq_unlock(rq1);
 	__release(rq2->lock);
 }
 
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -467,7 +467,7 @@ void rq_attach_root(struct rq *rq, struc
 	struct root_domain *old_rd = NULL;
 	unsigned long flags;
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	raw_spin_rq_lock_irqsave(rq, flags);
 
 	if (rq->rd) {
 		old_rd = rq->rd;
@@ -493,7 +493,7 @@ void rq_attach_root(struct rq *rq, struc
 	if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
 		set_rq_online(rq);
 
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_rq_unlock_irqrestore(rq, flags);
 
 	if (old_rd)
 		call_rcu(&old_rd->rcu, free_rootdomain);




* [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-22 12:04 [PATCH 00/19] sched: Core Scheduling Peter Zijlstra
                   ` (2 preceding siblings ...)
  2021-04-22 12:05 ` [PATCH 03/19] sched: Wrap rq::lock access Peter Zijlstra
@ 2021-04-22 12:05 ` Peter Zijlstra
  2021-04-24  1:22   ` Josh Don
                     ` (3 more replies)
  2021-04-22 12:05 ` [PATCH 05/19] sched: " Peter Zijlstra
                   ` (18 subsequent siblings)
  22 siblings, 4 replies; 103+ messages in thread
From: Peter Zijlstra @ 2021-04-22 12:05 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tglx

When switching on core-sched, CPUs need to agree which lock to use for
their RQ.

The new rule will be that rq->core_enabled will be toggled while
holding all rq->__locks that belong to a core. This means we need to
double check the rq->core_enabled value after each lock acquire and
retry if it changed.

This also has implications for those sites that take multiple RQ
locks: they need to be careful that the second lock doesn't end up
being the first lock.

Verify the lock pointer after acquiring the first lock, because if
they're on the same core, holding any of the rq->__lock instances will
pin the core state.

While there, change the rq->__lock order to CPU number instead of rq
address; this greatly simplifies the next patch.
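
For reference, the acquire-and-recheck loop this boils down to (a sketch
of what raw_spin_rq_lock_nested() becomes in the hunk below):

	for (;;) {
		raw_spinlock_t *lock = rq_lockp(rq);

		raw_spin_lock_nested(lock, subclass);
		/*
		 * If the core state flipped while we were spinning,
		 * rq_lockp() may now point at a different lock; drop it
		 * and retry.  Once the right lock is held, the core
		 * state is pinned and cannot change under us.
		 */
		if (likely(lock == rq_lockp(rq)))
			return;
		raw_spin_unlock(lock);
	}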

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c  |   48 ++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h |   41 +++++++++++------------------------------
 2 files changed, 57 insertions(+), 32 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -186,12 +186,37 @@ int sysctl_sched_rt_runtime = 950000;
 
 void raw_spin_rq_lock_nested(struct rq *rq, int subclass)
 {
-	raw_spin_lock_nested(rq_lockp(rq), subclass);
+	raw_spinlock_t *lock;
+
+	if (sched_core_disabled()) {
+		raw_spin_lock_nested(&rq->__lock, subclass);
+		return;
+	}
+
+	for (;;) {
+		lock = rq_lockp(rq);
+		raw_spin_lock_nested(lock, subclass);
+		if (likely(lock == rq_lockp(rq)))
+			return;
+		raw_spin_unlock(lock);
+	}
 }
 
 bool raw_spin_rq_trylock(struct rq *rq)
 {
-	return raw_spin_trylock(rq_lockp(rq));
+	raw_spinlock_t *lock;
+	bool ret;
+
+	if (sched_core_disabled())
+		return raw_spin_trylock(&rq->__lock);
+
+	for (;;) {
+		lock = rq_lockp(rq);
+		ret = raw_spin_trylock(lock);
+		if (!ret || (likely(lock == rq_lockp(rq))))
+			return ret;
+		raw_spin_unlock(lock);
+	}
 }
 
 void raw_spin_rq_unlock(struct rq *rq)
@@ -199,6 +224,25 @@ void raw_spin_rq_unlock(struct rq *rq)
 	raw_spin_unlock(rq_lockp(rq));
 }
 
+#ifdef CONFIG_SMP
+/*
+ * double_rq_lock - safely lock two runqueues
+ */
+void double_rq_lock(struct rq *rq1, struct rq *rq2)
+{
+	lockdep_assert_irqs_disabled();
+
+	if (rq1->cpu > rq2->cpu)
+		swap(rq1, rq2);
+
+	raw_spin_rq_lock(rq1);
+	if (rq_lockp(rq1) == rq_lockp(rq2))
+		return;
+
+	raw_spin_rq_lock_nested(rq2, SINGLE_DEPTH_NESTING);
+}
+#endif
+
 /*
  * __task_rq_lock - lock the rq @p resides on.
  */
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1113,6 +1113,11 @@ static inline bool is_migration_disabled
 #endif
 }
 
+static inline bool sched_core_disabled(void)
+{
+	return true;
+}
+
 static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 {
 	return &rq->__lock;
@@ -2231,10 +2236,12 @@ unsigned long arch_scale_freq_capacity(i
 }
 #endif
 
+
 #ifdef CONFIG_SMP
-#ifdef CONFIG_PREEMPTION
 
-static inline void double_rq_lock(struct rq *rq1, struct rq *rq2);
+extern void double_rq_lock(struct rq *rq1, struct rq *rq2);
+
+#ifdef CONFIG_PREEMPTION
 
 /*
  * fair double_lock_balance: Safely acquires both rq->locks in a fair
@@ -2274,14 +2281,13 @@ static inline int _double_lock_balance(s
 	if (likely(raw_spin_rq_trylock(busiest)))
 		return 0;
 
-	if (rq_lockp(busiest) >= rq_lockp(this_rq)) {
+	if (busiest->cpu > this_rq->cpu) {
 		raw_spin_rq_lock_nested(busiest, SINGLE_DEPTH_NESTING);
 		return 0;
 	}
 
 	raw_spin_rq_unlock(this_rq);
-	raw_spin_rq_lock(busiest);
-	raw_spin_rq_lock_nested(this_rq, SINGLE_DEPTH_NESTING);
+	double_rq_lock(this_rq, busiest);
 
 	return 1;
 }
@@ -2334,31 +2340,6 @@ static inline void double_raw_lock(raw_s
 }
 
 /*
- * double_rq_lock - safely lock two runqueues
- *
- * Note this does not disable interrupts like task_rq_lock,
- * you need to do so manually before calling.
- */
-static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
-	__acquires(rq1->lock)
-	__acquires(rq2->lock)
-{
-	BUG_ON(!irqs_disabled());
-	if (rq_lockp(rq1) == rq_lockp(rq2)) {
-		raw_spin_rq_lock(rq1);
-		__acquire(rq2->lock);	/* Fake it out ;) */
-	} else {
-		if (rq_lockp(rq1) < rq_lockp(rq2)) {
-			raw_spin_rq_lock(rq1);
-			raw_spin_rq_lock_nested(rq2, SINGLE_DEPTH_NESTING);
-		} else {
-			raw_spin_rq_lock(rq2);
-			raw_spin_rq_lock_nested(rq1, SINGLE_DEPTH_NESTING);
-		}
-	}
-}
-
-/*
  * double_rq_unlock - safely unlock two runqueues
  *
  * Note this does not restore interrupts like task_rq_unlock,




* [PATCH 05/19] sched: Core-wide rq->lock
  2021-04-22 12:04 [PATCH 00/19] sched: Core Scheduling Peter Zijlstra
                   ` (3 preceding siblings ...)
  2021-04-22 12:05 ` [PATCH 04/19] sched: Prepare for Core-wide rq->lock Peter Zijlstra
@ 2021-04-22 12:05 ` Peter Zijlstra
  2021-05-07  9:50   ` [PATCH v2 " Peter Zijlstra
  2021-04-22 12:05 ` [PATCH 06/19] sched: Optimize rq_lockp() usage Peter Zijlstra
                   ` (17 subsequent siblings)
  22 siblings, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2021-04-22 12:05 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tglx

Introduce the basic infrastructure to have a core wide rq->lock.

This relies on the rq->__lock ordering being by increasing CPU number.
It is also constrained to SMT8 per lockdep (and SMT256 per preempt_count).

Luckily SMT8 is the maximum SMT count supported by Linux (MIPS, SPARC and
Power are known to have it).
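
The SMT8 constraint comes from the flip itself, which nests every
sibling's rq->__lock with an increasing lockdep subclass (a sketch of
the loop added below; lockdep supports 8 subclasses):

	local_irq_disable();
	i = 0;
	for_each_cpu(t, smt_mask)
		raw_spin_lock_nested(&cpu_rq(t)->__lock, i++);

	for_each_cpu(t, smt_mask)
		cpu_rq(t)->core_enabled = enabled;

	for_each_cpu(t, smt_mask)
		raw_spin_unlock(&cpu_rq(t)->__lock);
	local_irq_enable();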

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/Kconfig.preempt |    6 ++
 kernel/sched/core.c    |  139 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h   |   37 +++++++++++++
 3 files changed, 182 insertions(+)

--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -99,3 +99,9 @@ config PREEMPT_DYNAMIC
 
 	  Interesting if you want the same pre-built kernel should be used for
 	  both Server and Desktop workloads.
+
+config SCHED_CORE
+	bool "Core Scheduling for SMT"
+	default y
+	depends on SCHED_SMT
+
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -84,6 +84,103 @@ unsigned int sysctl_sched_rt_period = 10
 
 __read_mostly int scheduler_running;
 
+#ifdef CONFIG_SCHED_CORE
+
+DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+/*
+ * Magic required such that:
+ *
+ *	raw_spin_rq_lock(rq);
+ *	...
+ *	raw_spin_rq_unlock(rq);
+ *
+ * ends up locking and unlocking the _same_ lock, and all CPUs
+ * always agree on what rq has what lock.
+ *
+ * XXX entirely possible to selectively enable cores, don't bother for now.
+ */
+
+static DEFINE_MUTEX(sched_core_mutex);
+static int sched_core_count;
+static struct cpumask sched_core_mask;
+
+static void __sched_core_flip(bool enabled)
+{
+	int cpu, t, i;
+
+	cpus_read_lock();
+
+	/*
+	 * Toggle the online cores, one by one.
+	 */
+	cpumask_copy(&sched_core_mask, cpu_online_mask);
+	for_each_cpu(cpu, &sched_core_mask) {
+		const struct cpumask *smt_mask = cpu_smt_mask(cpu);
+
+		i = 0;
+		local_irq_disable();
+		for_each_cpu(t, smt_mask) {
+			/* supports up to SMT8 */
+			raw_spin_lock_nested(&cpu_rq(t)->__lock, i++);
+		}
+
+		for_each_cpu(t, smt_mask)
+			cpu_rq(t)->core_enabled = enabled;
+
+		for_each_cpu(t, smt_mask)
+			raw_spin_unlock(&cpu_rq(t)->__lock);
+		local_irq_enable();
+
+		cpumask_andnot(&sched_core_mask, &sched_core_mask, smt_mask);
+	}
+
+	/*
+	 * Toggle the offline CPUs.
+	 */
+	cpumask_copy(&sched_core_mask, cpu_possible_mask);
+	cpumask_andnot(&sched_core_mask, &sched_core_mask, cpu_online_mask);
+
+	for_each_cpu(cpu, &sched_core_mask)
+		cpu_rq(cpu)->core_enabled = enabled;
+
+	cpus_read_unlock();
+}
+
+static void __sched_core_enable(void)
+{
+	// XXX verify there are no cookie tasks (yet)
+
+	static_branch_enable(&__sched_core_enabled);
+	__sched_core_flip(true);
+}
+
+static void __sched_core_disable(void)
+{
+	// XXX verify there are no cookie tasks (left)
+
+	__sched_core_flip(false);
+	static_branch_disable(&__sched_core_enabled);
+}
+
+void sched_core_get(void)
+{
+	mutex_lock(&sched_core_mutex);
+	if (!sched_core_count++)
+		__sched_core_enable();
+	mutex_unlock(&sched_core_mutex);
+}
+
+void sched_core_put(void)
+{
+	mutex_lock(&sched_core_mutex);
+	if (!--sched_core_count)
+		__sched_core_disable();
+	mutex_unlock(&sched_core_mutex);
+}
+
+#endif /* CONFIG_SCHED_CORE */
+
 /*
  * part of the period that we allow rt tasks to run in us.
  * default: 0.95s
@@ -5042,6 +5139,40 @@ pick_next_task(struct rq *rq, struct tas
 	BUG();
 }
 
+#ifdef CONFIG_SCHED_CORE
+
+static inline void sched_core_cpu_starting(unsigned int cpu)
+{
+	const struct cpumask *smt_mask = cpu_smt_mask(cpu);
+	struct rq *rq, *core_rq = NULL;
+	int i;
+
+	core_rq = cpu_rq(cpu)->core;
+
+	if (!core_rq) {
+		for_each_cpu(i, smt_mask) {
+			rq = cpu_rq(i);
+			if (rq->core && rq->core == rq)
+				core_rq = rq;
+		}
+
+		if (!core_rq)
+			core_rq = cpu_rq(cpu);
+
+		for_each_cpu(i, smt_mask) {
+			rq = cpu_rq(i);
+
+			WARN_ON_ONCE(rq->core && rq->core != core_rq);
+			rq->core = core_rq;
+		}
+	}
+}
+#else /* !CONFIG_SCHED_CORE */
+
+static inline void sched_core_cpu_starting(unsigned int cpu) {}
+
+#endif /* CONFIG_SCHED_CORE */
+
 /*
  * __schedule() is the main scheduler function.
  *
@@ -8006,6 +8137,7 @@ static void sched_rq_cpu_starting(unsign
 
 int sched_cpu_starting(unsigned int cpu)
 {
+	sched_core_cpu_starting(cpu);
 	sched_rq_cpu_starting(cpu);
 	sched_tick_start(cpu);
 	return 0;
@@ -8290,6 +8424,11 @@ void __init sched_init(void)
 #endif /* CONFIG_SMP */
 		hrtick_rq_init(rq);
 		atomic_set(&rq->nr_iowait, 0);
+
+#ifdef CONFIG_SCHED_CORE
+		rq->core = NULL;
+		rq->core_enabled = 0;
+#endif
 	}
 
 	set_load_weight(&init_task, false);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1075,6 +1075,12 @@ struct rq {
 #endif
 	unsigned int		push_busy;
 	struct cpu_stop_work	push_work;
+
+#ifdef CONFIG_SCHED_CORE
+	/* per rq */
+	struct rq		*core;
+	unsigned int		core_enabled;
+#endif
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -1113,6 +1119,35 @@ static inline bool is_migration_disabled
 #endif
 }
 
+#ifdef CONFIG_SCHED_CORE
+
+DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+	return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled;
+}
+
+static inline bool sched_core_disabled(void)
+{
+	return !static_branch_unlikely(&__sched_core_enabled);
+}
+
+static inline raw_spinlock_t *rq_lockp(struct rq *rq)
+{
+	if (sched_core_enabled(rq))
+		return &rq->core->__lock;
+
+	return &rq->__lock;
+}
+
+#else /* !CONFIG_SCHED_CORE */
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+	return false;
+}
+
 static inline bool sched_core_disabled(void)
 {
 	return true;
@@ -1123,6 +1158,8 @@ static inline raw_spinlock_t *rq_lockp(s
 	return &rq->__lock;
 }
 
+#endif /* CONFIG_SCHED_CORE */
+
 static inline void lockdep_assert_rq_held(struct rq *rq)
 {
 	lockdep_assert_held(rq_lockp(rq));




* [PATCH 06/19] sched: Optimize rq_lockp() usage
  2021-04-22 12:04 [PATCH 00/19] sched: Core Scheduling Peter Zijlstra
                   ` (4 preceding siblings ...)
  2021-04-22 12:05 ` [PATCH 05/19] sched: " Peter Zijlstra
@ 2021-04-22 12:05 ` Peter Zijlstra
  2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2021-04-22 12:05 ` [PATCH 07/19] sched: Allow sched_core_put() from atomic context Peter Zijlstra
                   ` (16 subsequent siblings)
  22 siblings, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2021-04-22 12:05 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tglx

rq_lockp() includes a static_branch(), which is asm-goto, which is
asm volatile, which defeats regular CSE. This means that:

	if (!static_branch(&foo))
		return simple;

	if (static_branch(&foo) && cond)
		return complex;

This doesn't fold and we get horrible code. Introduce __rq_lockp(), a
variant without the static_branch().
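
A sketch of the resulting split (matching the sched.h hunk below):
__rq_lockp() skips the static key and reads rq->core_enabled directly, so
it folds nicely, but it is only meant for places where the answer is
already stable, e.g. lockdep annotations taken while the lock is (about
to be) held:

	static inline raw_spinlock_t *__rq_lockp(struct rq *rq)
	{
		if (rq->core_enabled)
			return &rq->core->__lock;

		return &rq->__lock;
	}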

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c     |   16 ++++++++--------
 kernel/sched/deadline.c |    4 ++--
 kernel/sched/fair.c     |    2 +-
 kernel/sched/sched.h    |   33 +++++++++++++++++++++++++--------
 4 files changed, 36 insertions(+), 19 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -281,9 +281,9 @@ void raw_spin_rq_lock_nested(struct rq *
 	}
 
 	for (;;) {
-		lock = rq_lockp(rq);
+		lock = __rq_lockp(rq);
 		raw_spin_lock_nested(lock, subclass);
-		if (likely(lock == rq_lockp(rq)))
+		if (likely(lock == __rq_lockp(rq)))
 			return;
 		raw_spin_unlock(lock);
 	}
@@ -298,9 +298,9 @@ bool raw_spin_rq_trylock(struct rq *rq)
 		return raw_spin_trylock(&rq->__lock);
 
 	for (;;) {
-		lock = rq_lockp(rq);
+		lock = __rq_lockp(rq);
 		ret = raw_spin_trylock(lock);
-		if (!ret || (likely(lock == rq_lockp(rq))))
+		if (!ret || (likely(lock == __rq_lockp(rq))))
 			return ret;
 		raw_spin_unlock(lock);
 	}
@@ -323,7 +323,7 @@ void double_rq_lock(struct rq *rq1, stru
 		swap(rq1, rq2);
 
 	raw_spin_rq_lock(rq1);
-	if (rq_lockp(rq1) == rq_lockp(rq2))
+	if (__rq_lockp(rq1) == __rq_lockp(rq2))
 		return;
 
 	raw_spin_rq_lock_nested(rq2, SINGLE_DEPTH_NESTING);
@@ -2594,7 +2594,7 @@ void set_task_cpu(struct task_struct *p,
 	 * task_rq_lock().
 	 */
 	WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
-				      lockdep_is_held(rq_lockp(task_rq(p)))));
+				      lockdep_is_held(__rq_lockp(task_rq(p)))));
 #endif
 	/*
 	 * Clearly, migrating tasks to offline CPUs is a fairly daft thing.
@@ -4220,7 +4220,7 @@ prepare_lock_switch(struct rq *rq, struc
 	 * do an early lockdep release here:
 	 */
 	rq_unpin_lock(rq, rf);
-	spin_release(&rq_lockp(rq)->dep_map, _THIS_IP_);
+	spin_release(&__rq_lockp(rq)->dep_map, _THIS_IP_);
 #ifdef CONFIG_DEBUG_SPINLOCK
 	/* this is a valid case when another task releases the spinlock */
 	rq_lockp(rq)->owner = next;
@@ -4234,7 +4234,7 @@ static inline void finish_lock_switch(st
 	 * fix up the runqueue lock - which gets 'carried over' from
 	 * prev into current:
 	 */
-	spin_acquire(&rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
+	spin_acquire(&__rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
 	__balance_callbacks(rq);
 	raw_spin_rq_unlock_irq(rq);
 }
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1097,9 +1097,9 @@ static enum hrtimer_restart dl_task_time
 		 * If the runqueue is no longer available, migrate the
 		 * task elsewhere. This necessarily changes rq.
 		 */
-		lockdep_unpin_lock(rq_lockp(rq), rf.cookie);
+		lockdep_unpin_lock(__rq_lockp(rq), rf.cookie);
 		rq = dl_task_offline_migration(rq, p);
-		rf.cookie = lockdep_pin_lock(rq_lockp(rq));
+		rf.cookie = lockdep_pin_lock(__rq_lockp(rq));
 		update_rq_clock(rq);
 
 		/*
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1107,7 +1107,7 @@ struct numa_group {
 static struct numa_group *deref_task_numa_group(struct task_struct *p)
 {
 	return rcu_dereference_check(p->numa_group, p == current ||
-		(lockdep_is_held(rq_lockp(task_rq(p))) && !READ_ONCE(p->on_cpu)));
+		(lockdep_is_held(__rq_lockp(task_rq(p))) && !READ_ONCE(p->on_cpu)));
 }
 
 static struct numa_group *deref_curr_numa_group(struct task_struct *p)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1127,6 +1127,10 @@ static inline bool sched_core_disabled(v
 	return !static_branch_unlikely(&__sched_core_enabled);
 }
 
+/*
+ * Be careful with this function; not for general use. The return value isn't
+ * stable unless you actually hold a relevant rq->__lock.
+ */
 static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 {
 	if (sched_core_enabled(rq))
@@ -1135,6 +1139,14 @@ static inline raw_spinlock_t *rq_lockp(s
 	return &rq->__lock;
 }
 
+static inline raw_spinlock_t *__rq_lockp(struct rq *rq)
+{
+	if (rq->core_enabled)
+		return &rq->core->__lock;
+
+	return &rq->__lock;
+}
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline bool sched_core_enabled(struct rq *rq)
@@ -1152,11 +1164,16 @@ static inline raw_spinlock_t *rq_lockp(s
 	return &rq->__lock;
 }
 
+static inline raw_spinlock_t *__rq_lockp(struct rq *rq)
+{
+	return &rq->__lock;
+}
+
 #endif /* CONFIG_SCHED_CORE */
 
 static inline void lockdep_assert_rq_held(struct rq *rq)
 {
-	lockdep_assert_held(rq_lockp(rq));
+	lockdep_assert_held(__rq_lockp(rq));
 }
 
 extern void raw_spin_rq_lock_nested(struct rq *rq, int subclass);
@@ -1340,7 +1357,7 @@ extern struct callback_head balance_push
  */
 static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
 {
-	rf->cookie = lockdep_pin_lock(rq_lockp(rq));
+	rf->cookie = lockdep_pin_lock(__rq_lockp(rq));
 
 #ifdef CONFIG_SCHED_DEBUG
 	rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
@@ -1358,12 +1375,12 @@ static inline void rq_unpin_lock(struct
 		rf->clock_update_flags = RQCF_UPDATED;
 #endif
 
-	lockdep_unpin_lock(rq_lockp(rq), rf->cookie);
+	lockdep_unpin_lock(__rq_lockp(rq), rf->cookie);
 }
 
 static inline void rq_repin_lock(struct rq *rq, struct rq_flags *rf)
 {
-	lockdep_repin_lock(rq_lockp(rq), rf->cookie);
+	lockdep_repin_lock(__rq_lockp(rq), rf->cookie);
 
 #ifdef CONFIG_SCHED_DEBUG
 	/*
@@ -2306,7 +2323,7 @@ static inline int _double_lock_balance(s
 	__acquires(busiest->lock)
 	__acquires(this_rq->lock)
 {
-	if (rq_lockp(this_rq) == rq_lockp(busiest))
+	if (__rq_lockp(this_rq) == __rq_lockp(busiest))
 		return 0;
 
 	if (likely(raw_spin_rq_trylock(busiest)))
@@ -2338,9 +2355,9 @@ static inline int double_lock_balance(st
 static inline void double_unlock_balance(struct rq *this_rq, struct rq *busiest)
 	__releases(busiest->lock)
 {
-	if (rq_lockp(this_rq) != rq_lockp(busiest))
+	if (__rq_lockp(this_rq) != __rq_lockp(busiest))
 		raw_spin_rq_unlock(busiest);
-	lock_set_subclass(&rq_lockp(this_rq)->dep_map, 0, _RET_IP_);
+	lock_set_subclass(&__rq_lockp(this_rq)->dep_map, 0, _RET_IP_);
 }
 
 static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
@@ -2381,7 +2398,7 @@ static inline void double_rq_unlock(stru
 	__releases(rq2->lock)
 {
 	raw_spin_rq_unlock(rq1);
-	if (rq_lockp(rq1) != rq_lockp(rq2))
+	if (__rq_lockp(rq1) != __rq_lockp(rq2))
 		raw_spin_rq_unlock(rq2);
 	else
 		__release(rq2->lock);




* [PATCH 07/19] sched: Allow sched_core_put() from atomic context
  2021-04-22 12:04 [PATCH 00/19] sched: Core Scheduling Peter Zijlstra
                   ` (5 preceding siblings ...)
  2021-04-22 12:05 ` [PATCH 06/19] sched: Optimize rq_lockp() usage Peter Zijlstra
@ 2021-04-22 12:05 ` Peter Zijlstra
  2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2021-04-22 12:05 ` [PATCH 08/19] sched: Introduce sched_class::pick_task() Peter Zijlstra
                   ` (15 subsequent siblings)
  22 siblings, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2021-04-22 12:05 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tglx

Stuff the meat of sched_core_put() into a work such that we can use
sched_core_put() from atomic context.
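
The non-obvious bit is the put() fast path: atomic_add_unless(v, -1, 1)
refuses to decrement when the count is already 1, so only the final
reference falls through to schedule_work(). A minimal sketch of the
pattern (not the diff below; count, lock, put_work_fn(), put() and
disable() are generic stand-ins for the real sched_core_* names):

	static atomic_t count;
	static DEFINE_MUTEX(lock);

	static void put_work_fn(struct work_struct *work)
	{
		/* runs in process context; safe to take the mutex */
		if (atomic_dec_and_mutex_lock(&count, &lock)) {
			disable();		/* last reference gone */
			mutex_unlock(&lock);
		}
	}

	void put(void)				/* safe from atomic context */
	{
		static DECLARE_WORK(work, put_work_fn);

		/* N -> N-1 for N > 1 is a plain decrement; 1 -> 0 is deferred */
		if (!atomic_add_unless(&count, -1, 1))
			schedule_work(&work);
	}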

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   33 +++++++++++++++++++++++++++------
 1 file changed, 27 insertions(+), 6 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -102,7 +102,7 @@ DEFINE_STATIC_KEY_FALSE(__sched_core_ena
  */
 
 static DEFINE_MUTEX(sched_core_mutex);
-static int sched_core_count;
+static atomic_t sched_core_count;
 static struct cpumask sched_core_mask;
 
 static void __sched_core_flip(bool enabled)
@@ -165,18 +165,39 @@ static void __sched_core_disable(void)
 
 void sched_core_get(void)
 {
+	if (atomic_inc_not_zero(&sched_core_count))
+		return;
+
 	mutex_lock(&sched_core_mutex);
-	if (!sched_core_count++)
+	if (!atomic_read(&sched_core_count))
 		__sched_core_enable();
+
+	smp_mb__before_atomic();
+	atomic_inc(&sched_core_count);
 	mutex_unlock(&sched_core_mutex);
 }
 
-void sched_core_put(void)
+static void __sched_core_put(struct work_struct *work)
 {
-	mutex_lock(&sched_core_mutex);
-	if (!--sched_core_count)
+	if (atomic_dec_and_mutex_lock(&sched_core_count, &sched_core_mutex)) {
 		__sched_core_disable();
-	mutex_unlock(&sched_core_mutex);
+		mutex_unlock(&sched_core_mutex);
+	}
+}
+
+void sched_core_put(void)
+{
+	static DECLARE_WORK(_work, __sched_core_put);
+
+	/*
+	 * "There can be only one"
+	 *
+	 * Either this is the last one, or we don't actually need to do any
+	 * 'work'. If it is the last *again*, we rely on
+	 * WORK_STRUCT_PENDING_BIT.
+	 */
+	if (!atomic_add_unless(&sched_core_count, -1, 1))
+		schedule_work(&_work);
 }
 
 #endif /* CONFIG_SCHED_CORE */



^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH 08/19] sched: Introduce sched_class::pick_task()
  2021-04-22 12:04 [PATCH 00/19] sched: Core Scheduling Peter Zijlstra
                   ` (6 preceding siblings ...)
  2021-04-22 12:05 ` [PATCH 07/19] sched: Allow sched_core_put() from atomic context Peter Zijlstra
@ 2021-04-22 12:05 ` Peter Zijlstra
  2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2021-04-22 12:05 ` [PATCH 09/19] sched: Basic tracking of matching tasks Peter Zijlstra
                   ` (14 subsequent siblings)
  22 siblings, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2021-04-22 12:05 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tglx, Vineeth Remanan Pillai

Because sched_class::pick_next_task() also implies
sched_class::set_next_task() (and possibly put_prev_task() and
newidle_balance()), it is not state invariant. This makes it
unsuitable for remote task selection.
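
The conversion per class then follows one pattern: pick_task() only
looks, and the local pick_next_task() becomes a thin wrapper that
also commits the state change. A hedged sketch ("foo" is a
placeholder class and highest_prio_runnable_foo() a hypothetical
helper; the real per-class versions are in the diff below):

	static struct task_struct *pick_task_foo(struct rq *rq)
	{
		/* no set_next_task(), no put_prev_task(), no balancing */
		return highest_prio_runnable_foo(rq);
	}

	static struct task_struct *pick_next_task_foo(struct rq *rq)
	{
		struct task_struct *p = pick_task_foo(rq);

		if (p)
			set_next_task_foo(rq, p, true);	/* local pick commits */

		return p;
	}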

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[Vineeth: folded fixes]
Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/deadline.c  |   16 ++++++++++++++--
 kernel/sched/fair.c      |   40 +++++++++++++++++++++++++++++++++++++---
 kernel/sched/idle.c      |    8 ++++++++
 kernel/sched/rt.c        |   15 +++++++++++++--
 kernel/sched/sched.h     |    3 +++
 kernel/sched/stop_task.c |   14 ++++++++++++--
 6 files changed, 87 insertions(+), 9 deletions(-)

--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1873,7 +1873,7 @@ static struct sched_dl_entity *pick_next
 	return rb_entry(left, struct sched_dl_entity, rb_node);
 }
 
-static struct task_struct *pick_next_task_dl(struct rq *rq)
+static struct task_struct *pick_task_dl(struct rq *rq)
 {
 	struct sched_dl_entity *dl_se;
 	struct dl_rq *dl_rq = &rq->dl;
@@ -1885,7 +1885,18 @@ static struct task_struct *pick_next_tas
 	dl_se = pick_next_dl_entity(rq, dl_rq);
 	BUG_ON(!dl_se);
 	p = dl_task_of(dl_se);
-	set_next_task_dl(rq, p, true);
+
+	return p;
+}
+
+static struct task_struct *pick_next_task_dl(struct rq *rq)
+{
+	struct task_struct *p;
+
+	p = pick_task_dl(rq);
+	if (p)
+		set_next_task_dl(rq, p, true);
+
 	return p;
 }
 
@@ -2557,6 +2568,7 @@ DEFINE_SCHED_CLASS(dl) = {
 
 #ifdef CONFIG_SMP
 	.balance		= balance_dl,
+	.pick_task		= pick_task_dl,
 	.select_task_rq		= select_task_rq_dl,
 	.migrate_task_rq	= migrate_task_rq_dl,
 	.set_cpus_allowed       = set_cpus_allowed_dl,
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4400,6 +4400,8 @@ check_preempt_tick(struct cfs_rq *cfs_rq
 static void
 set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	clear_buddies(cfs_rq, se);
+
 	/* 'current' is not kept within the tree. */
 	if (se->on_rq) {
 		/*
@@ -4459,7 +4461,7 @@ pick_next_entity(struct cfs_rq *cfs_rq,
 	 * Avoid running the skip buddy, if running something else can
 	 * be done without getting too unfair.
 	 */
-	if (cfs_rq->skip == se) {
+	if (cfs_rq->skip && cfs_rq->skip == se) {
 		struct sched_entity *second;
 
 		if (se == curr) {
@@ -4486,8 +4488,6 @@ pick_next_entity(struct cfs_rq *cfs_rq,
 		se = cfs_rq->last;
 	}
 
-	clear_buddies(cfs_rq, se);
-
 	return se;
 }
 
@@ -7018,6 +7018,39 @@ static void check_preempt_wakeup(struct
 		set_last_buddy(se);
 }
 
+#ifdef CONFIG_SMP
+static struct task_struct *pick_task_fair(struct rq *rq)
+{
+	struct sched_entity *se;
+	struct cfs_rq *cfs_rq;
+
+again:
+	cfs_rq = &rq->cfs;
+	if (!cfs_rq->nr_running)
+		return NULL;
+
+	do {
+		struct sched_entity *curr = cfs_rq->curr;
+
+		/* When we pick for a remote RQ, we'll not have done put_prev_entity() */
+		if (curr) {
+			if (curr->on_rq)
+				update_curr(cfs_rq);
+			else
+				curr = NULL;
+
+			if (unlikely(check_cfs_rq_runtime(cfs_rq)))
+				goto again;
+		}
+
+		se = pick_next_entity(cfs_rq, curr);
+		cfs_rq = group_cfs_rq(se);
+	} while (cfs_rq);
+
+	return task_of(se);
+}
+#endif
+
 struct task_struct *
 pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
@@ -11220,6 +11253,7 @@ DEFINE_SCHED_CLASS(fair) = {
 
 #ifdef CONFIG_SMP
 	.balance		= balance_fair,
+	.pick_task		= pick_task_fair,
 	.select_task_rq		= select_task_rq_fair,
 	.migrate_task_rq	= migrate_task_rq_fair,
 
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -406,6 +406,13 @@ static void set_next_task_idle(struct rq
 	schedstat_inc(rq->sched_goidle);
 }
 
+#ifdef CONFIG_SMP
+static struct task_struct *pick_task_idle(struct rq *rq)
+{
+	return rq->idle;
+}
+#endif
+
 struct task_struct *pick_next_task_idle(struct rq *rq)
 {
 	struct task_struct *next = rq->idle;
@@ -473,6 +480,7 @@ DEFINE_SCHED_CLASS(idle) = {
 
 #ifdef CONFIG_SMP
 	.balance		= balance_idle,
+	.pick_task		= pick_task_idle,
 	.select_task_rq		= select_task_rq_idle,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1626,7 +1626,7 @@ static struct task_struct *_pick_next_ta
 	return rt_task_of(rt_se);
 }
 
-static struct task_struct *pick_next_task_rt(struct rq *rq)
+static struct task_struct *pick_task_rt(struct rq *rq)
 {
 	struct task_struct *p;
 
@@ -1634,7 +1634,17 @@ static struct task_struct *pick_next_tas
 		return NULL;
 
 	p = _pick_next_task_rt(rq);
-	set_next_task_rt(rq, p, true);
+
+	return p;
+}
+
+static struct task_struct *pick_next_task_rt(struct rq *rq)
+{
+	struct task_struct *p = pick_task_rt(rq);
+
+	if (p)
+		set_next_task_rt(rq, p, true);
+
 	return p;
 }
 
@@ -2483,6 +2493,7 @@ DEFINE_SCHED_CLASS(rt) = {
 
 #ifdef CONFIG_SMP
 	.balance		= balance_rt,
+	.pick_task		= pick_task_rt,
 	.select_task_rq		= select_task_rq_rt,
 	.set_cpus_allowed       = set_cpus_allowed_common,
 	.rq_online              = rq_online_rt,
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1839,6 +1839,9 @@ struct sched_class {
 #ifdef CONFIG_SMP
 	int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
 	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int flags);
+
+	struct task_struct * (*pick_task)(struct rq *rq);
+
 	void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
 
 	void (*task_woken)(struct rq *this_rq, struct task_struct *task);
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -34,15 +34,24 @@ static void set_next_task_stop(struct rq
 	stop->se.exec_start = rq_clock_task(rq);
 }
 
-static struct task_struct *pick_next_task_stop(struct rq *rq)
+static struct task_struct *pick_task_stop(struct rq *rq)
 {
 	if (!sched_stop_runnable(rq))
 		return NULL;
 
-	set_next_task_stop(rq, rq->stop, true);
 	return rq->stop;
 }
 
+static struct task_struct *pick_next_task_stop(struct rq *rq)
+{
+	struct task_struct *p = pick_task_stop(rq);
+
+	if (p)
+		set_next_task_stop(rq, p, true);
+
+	return p;
+}
+
 static void
 enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
 {
@@ -123,6 +132,7 @@ DEFINE_SCHED_CLASS(stop) = {
 
 #ifdef CONFIG_SMP
 	.balance		= balance_stop,
+	.pick_task		= pick_task_stop,
 	.select_task_rq		= select_task_rq_stop,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif



^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH 09/19] sched: Basic tracking of matching tasks
  2021-04-22 12:04 [PATCH 00/19] sched: Core Scheduling Peter Zijlstra
                   ` (7 preceding siblings ...)
  2021-04-22 12:05 ` [PATCH 08/19] sched: Introduce sched_class::pick_task() Peter Zijlstra
@ 2021-04-22 12:05 ` Peter Zijlstra
  2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2021-04-22 12:05 ` [PATCH 10/19] sched: Add core wide task selection and scheduling Peter Zijlstra
                   ` (13 subsequent siblings)
  22 siblings, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2021-04-22 12:05 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tglx

Introduce task_struct::core_cookie as an opaque identifier for core
scheduling. When core scheduling is enabled, only tasks with matching
cookies are allowed on the core at the same time, where idle matches
everything.

When task_struct::core_cookie is set (and core scheduling is enabled)
these tasks are indexed in a second RB-tree, first on cookie value
then on scheduling function, such that matching task selection always
finds the most eligible match.

NOTE: *shudder* at the overhead...

NOTE: *sigh*, a 3rd copy of the scheduling function; the alternative
is per class tracking of cookies and that just duplicates a lot of
stuff for no raisin (the 2nd copy lives in the rt-mutex PI code).
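
The resulting tree order is (cookie, -prio); a hedged sketch of what
the comparator amounts to (the real one is __sched_core_less() in the
diff below, core_less() here is just an illustrative name):

	/* does @a sort to the left of @b in rq->core_tree? */
	static bool core_less(struct task_struct *a, struct task_struct *b)
	{
		if (a->core_cookie != b->core_cookie)
			return a->core_cookie < b->core_cookie;

		/* same cookie: flip prio so the highest prio task is leftmost */
		return prio_less(b, a);
	}

so that rb_find_first() on a cookie value lands on the highest
priority runnable task carrying that cookie.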

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[Joel: folded fixes]
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h |    8 ++
 kernel/sched/core.c   |  152 ++++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/fair.c   |   46 ---------------
 kernel/sched/sched.h  |   55 ++++++++++++++++++
 4 files changed, 210 insertions(+), 51 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -699,10 +699,16 @@ struct task_struct {
 	const struct sched_class	*sched_class;
 	struct sched_entity		se;
 	struct sched_rt_entity		rt;
+	struct sched_dl_entity		dl;
+
+#ifdef CONFIG_SCHED_CORE
+	struct rb_node			core_node;
+	unsigned long			core_cookie;
+#endif
+
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group		*sched_task_group;
 #endif
-	struct sched_dl_entity		dl;
 
 #ifdef CONFIG_UCLAMP_TASK
 	/*
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -88,6 +88,133 @@ __read_mostly int scheduler_running;
 
 DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
 
+/* kernel prio, less is more */
+static inline int __task_prio(struct task_struct *p)
+{
+	if (p->sched_class == &stop_sched_class) /* trumps deadline */
+		return -2;
+
+	if (rt_prio(p->prio)) /* includes deadline */
+		return p->prio; /* [-1, 99] */
+
+	if (p->sched_class == &idle_sched_class)
+		return MAX_RT_PRIO + NICE_WIDTH; /* 140 */
+
+	return MAX_RT_PRIO + MAX_NICE; /* 120, squash fair */
+}
+
+/*
+ * l(a,b)
+ * le(a,b) := !l(b,a)
+ * g(a,b)  := l(b,a)
+ * ge(a,b) := !l(a,b)
+ */
+
+/* real prio, less is less */
+static inline bool prio_less(struct task_struct *a, struct task_struct *b)
+{
+
+	int pa = __task_prio(a), pb = __task_prio(b);
+
+	if (-pa < -pb)
+		return true;
+
+	if (-pb < -pa)
+		return false;
+
+	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
+		return !dl_time_before(a->dl.deadline, b->dl.deadline);
+
+	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
+		u64 vruntime = b->se.vruntime;
+
+		/*
+		 * Normalize the vruntime if tasks are in different cpus.
+		 */
+		if (task_cpu(a) != task_cpu(b)) {
+			vruntime -= task_cfs_rq(b)->min_vruntime;
+			vruntime += task_cfs_rq(a)->min_vruntime;
+		}
+
+		return !((s64)(a->se.vruntime - vruntime) <= 0);
+	}
+
+	return false;
+}
+
+static inline bool __sched_core_less(struct task_struct *a, struct task_struct *b)
+{
+	if (a->core_cookie < b->core_cookie)
+		return true;
+
+	if (a->core_cookie > b->core_cookie)
+		return false;
+
+	/* flip prio, so high prio is leftmost */
+	if (prio_less(b, a))
+		return true;
+
+	return false;
+}
+
+#define __node_2_sc(node) rb_entry((node), struct task_struct, core_node)
+
+static inline bool rb_sched_core_less(struct rb_node *a, const struct rb_node *b)
+{
+	return __sched_core_less(__node_2_sc(a), __node_2_sc(b));
+}
+
+static inline int rb_sched_core_cmp(const void *key, const struct rb_node *node)
+{
+	const struct task_struct *p = __node_2_sc(node);
+	unsigned long cookie = (unsigned long)key;
+
+	if (cookie < p->core_cookie)
+		return -1;
+
+	if (cookie > p->core_cookie)
+		return 1;
+
+	return 0;
+}
+
+static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+{
+	rq->core->core_task_seq++;
+
+	if (!p->core_cookie)
+		return;
+
+	rb_add(&p->core_node, &rq->core_tree, rb_sched_core_less);
+}
+
+static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
+{
+	rq->core->core_task_seq++;
+
+	if (!p->core_cookie)
+		return;
+
+	rb_erase(&p->core_node, &rq->core_tree);
+}
+
+/*
+ * Find left-most (aka, highest priority) task matching @cookie.
+ */
+static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
+{
+	struct rb_node *node;
+
+	node = rb_find_first((void *)cookie, &rq->core_tree, rb_sched_core_cmp);
+	/*
+	 * The idle task always matches any cookie!
+	 */
+	if (!node)
+		return idle_sched_class.pick_task(rq);
+
+	return __node_2_sc(node);
+}
+
 /*
  * Magic required such that:
  *
@@ -147,18 +274,24 @@ static void __sched_core_flip(bool enabl
 	cpus_read_unlock();
 }
 
-static void __sched_core_enable(void)
+static void sched_core_assert_empty(void)
 {
-	// XXX verify there are no cookie tasks (yet)
+	int cpu;
 
+	for_each_possible_cpu(cpu)
+		WARN_ON_ONCE(!RB_EMPTY_ROOT(&cpu_rq(cpu)->core_tree));
+}
+
+static void __sched_core_enable(void)
+{
 	static_branch_enable(&__sched_core_enabled);
 	__sched_core_flip(true);
+	sched_core_assert_empty();
 }
 
 static void __sched_core_disable(void)
 {
-	// XXX verify there are no cookie tasks (left)
-
+	sched_core_assert_empty();
 	__sched_core_flip(false);
 	static_branch_disable(&__sched_core_enabled);
 }
@@ -200,6 +333,11 @@ void sched_core_put(void)
 		schedule_work(&_work);
 }
 
+#else /* !CONFIG_SCHED_CORE */
+
+static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
+static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+
 #endif /* CONFIG_SCHED_CORE */
 
 /*
@@ -1779,10 +1917,16 @@ static inline void enqueue_task(struct r
 
 	uclamp_rq_inc(rq, p);
 	p->sched_class->enqueue_task(rq, p, flags);
+
+	if (sched_core_enabled(rq))
+		sched_core_enqueue(rq, p);
 }
 
 static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	if (sched_core_enabled(rq))
+		sched_core_dequeue(rq, p);
+
 	if (!(flags & DEQUEUE_NOCLOCK))
 		update_rq_clock(rq);
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -268,33 +268,11 @@ const struct sched_class fair_sched_clas
  */
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-static inline struct task_struct *task_of(struct sched_entity *se)
-{
-	SCHED_WARN_ON(!entity_is_task(se));
-	return container_of(se, struct task_struct, se);
-}
 
 /* Walk up scheduling entities hierarchy */
 #define for_each_sched_entity(se) \
 		for (; se; se = se->parent)
 
-static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
-{
-	return p->se.cfs_rq;
-}
-
-/* runqueue on which this entity is (to be) queued */
-static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
-{
-	return se->cfs_rq;
-}
-
-/* runqueue "owned" by this group */
-static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
-{
-	return grp->my_q;
-}
-
 static inline void cfs_rq_tg_path(struct cfs_rq *cfs_rq, char *path, int len)
 {
 	if (!path)
@@ -455,33 +433,9 @@ find_matching_se(struct sched_entity **s
 
 #else	/* !CONFIG_FAIR_GROUP_SCHED */
 
-static inline struct task_struct *task_of(struct sched_entity *se)
-{
-	return container_of(se, struct task_struct, se);
-}
-
 #define for_each_sched_entity(se) \
 		for (; se; se = NULL)
 
-static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
-{
-	return &task_rq(p)->cfs;
-}
-
-static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
-{
-	struct task_struct *p = task_of(se);
-	struct rq *rq = task_rq(p);
-
-	return &rq->cfs;
-}
-
-/* runqueue "owned" by this group */
-static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
-{
-	return NULL;
-}
-
 static inline void cfs_rq_tg_path(struct cfs_rq *cfs_rq, char *path, int len)
 {
 	if (path)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1080,6 +1080,10 @@ struct rq {
 	/* per rq */
 	struct rq		*core;
 	unsigned int		core_enabled;
+	struct rb_root		core_tree;
+
+	/* shared state */
+	unsigned int		core_task_seq;
 #endif
 };
 
@@ -1238,6 +1242,57 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		raw_cpu_ptr(&runqueues)
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline struct task_struct *task_of(struct sched_entity *se)
+{
+	SCHED_WARN_ON(!entity_is_task(se));
+	return container_of(se, struct task_struct, se);
+}
+
+static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
+{
+	return p->se.cfs_rq;
+}
+
+/* runqueue on which this entity is (to be) queued */
+static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
+{
+	return se->cfs_rq;
+}
+
+/* runqueue "owned" by this group */
+static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
+{
+	return grp->my_q;
+}
+
+#else
+
+static inline struct task_struct *task_of(struct sched_entity *se)
+{
+	return container_of(se, struct task_struct, se);
+}
+
+static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
+{
+	return &task_rq(p)->cfs;
+}
+
+static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
+{
+	struct task_struct *p = task_of(se);
+	struct rq *rq = task_rq(p);
+
+	return &rq->cfs;
+}
+
+/* runqueue "owned" by this group */
+static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
+{
+	return NULL;
+}
+#endif
+
 extern void update_rq_clock(struct rq *rq);
 
 static inline u64 __rq_clock_broken(struct rq *rq)



^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH 10/19] sched: Add core wide task selection and scheduling
  2021-04-22 12:04 [PATCH 00/19] sched: Core Scheduling Peter Zijlstra
                   ` (8 preceding siblings ...)
  2021-04-22 12:05 ` [PATCH 09/19] sched: Basic tracking of matching tasks Peter Zijlstra
@ 2021-04-22 12:05 ` Peter Zijlstra
  2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2021-04-22 12:05 ` [PATCH 11/19] sched/fair: Fix forced idle sibling starvation corner case Peter Zijlstra
                   ` (12 subsequent siblings)
  22 siblings, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2021-04-22 12:05 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tglx

Instead of only selecting a local task, select a task for all SMT
siblings for every reschedule on the core (irrespective of which
logical CPU does the reschedule).
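
The overall shape of the core-wide pick, heavily condensed (a hedged
sketch; the real pick_next_task() in the diff below additionally
deals with sequence counts, offline CPUs and the no-cookie fast
path):

	for_each_class(class) {
		for_each_cpu_wrap(i, smt_mask, cpu) {
			p = pick_task(cpu_rq(i), class, max);
			if (!p)
				continue;

			cpu_rq(i)->core_pick = p;

			if (!max || !cookie_match(max, p)) {
				/*
				 * New, incompatible maximum: wipe the other
				 * siblings' picks and redo the scan against it.
				 */
				max = p;
			}
		}
	}

	/* then IPI every sibling whose pick differs from its current task */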

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c  |  301 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |    6 -
 2 files changed, 305 insertions(+), 2 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5077,7 +5077,7 @@ static void put_prev_task_balance(struct
  * Pick up the highest-prio task:
  */
 static inline struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	const struct sched_class *class;
 	struct task_struct *p;
@@ -5118,6 +5118,294 @@ pick_next_task(struct rq *rq, struct tas
 }
 
 #ifdef CONFIG_SCHED_CORE
+static inline bool is_task_rq_idle(struct task_struct *t)
+{
+	return (task_rq(t)->idle == t);
+}
+
+static inline bool cookie_equals(struct task_struct *a, unsigned long cookie)
+{
+	return is_task_rq_idle(a) || (a->core_cookie == cookie);
+}
+
+static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
+{
+	if (is_task_rq_idle(a) || is_task_rq_idle(b))
+		return true;
+
+	return a->core_cookie == b->core_cookie;
+}
+
+// XXX fairness/fwd progress conditions
+/*
+ * Returns
+ * - NULL if there is no runnable task for this class.
+ * - the highest priority task for this runqueue if it matches
+ *   rq->core->core_cookie or its priority is greater than max.
+ * - Else returns idle_task.
+ */
+static struct task_struct *
+pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
+{
+	struct task_struct *class_pick, *cookie_pick;
+	unsigned long cookie = rq->core->core_cookie;
+
+	class_pick = class->pick_task(rq);
+	if (!class_pick)
+		return NULL;
+
+	if (!cookie) {
+		/*
+		 * If class_pick is tagged, return it only if it has
+		 * higher priority than max.
+		 */
+		if (max && class_pick->core_cookie &&
+		    prio_less(class_pick, max))
+			return idle_sched_class.pick_task(rq);
+
+		return class_pick;
+	}
+
+	/*
+	 * If class_pick is idle or matches cookie, return early.
+	 */
+	if (cookie_equals(class_pick, cookie))
+		return class_pick;
+
+	cookie_pick = sched_core_find(rq, cookie);
+
+	/*
+	 * If class > max && class > cookie, it is the highest priority task on
+	 * the core (so far) and it must be selected, otherwise we must go with
+	 * the cookie pick in order to satisfy the constraint.
+	 */
+	if (prio_less(cookie_pick, class_pick) &&
+	    (!max || prio_less(max, class_pick)))
+		return class_pick;
+
+	return cookie_pick;
+}
+
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	struct task_struct *next, *max = NULL;
+	const struct sched_class *class;
+	const struct cpumask *smt_mask;
+	bool need_sync;
+	int i, j, cpu;
+
+	if (!sched_core_enabled(rq))
+		return __pick_next_task(rq, prev, rf);
+
+	cpu = cpu_of(rq);
+
+	/* Stopper task is switching into idle, no need core-wide selection. */
+	if (cpu_is_offline(cpu)) {
+		/*
+		 * Reset core_pick so that we don't enter the fastpath when
+		 * coming online. core_pick would already be migrated to
+		 * another cpu during offline.
+		 */
+		rq->core_pick = NULL;
+		return __pick_next_task(rq, prev, rf);
+	}
+
+	/*
+	 * If there were no {en,de}queues since we picked (IOW, the task
+	 * pointers are all still valid), and we haven't scheduled the last
+	 * pick yet, do so now.
+	 *
+	 * rq->core_pick can be NULL if no selection was made for a CPU because
+	 * it was either offline or went offline during a sibling's core-wide
+	 * selection. In this case, do a core-wide selection.
+	 */
+	if (rq->core->core_pick_seq == rq->core->core_task_seq &&
+	    rq->core->core_pick_seq != rq->core_sched_seq &&
+	    rq->core_pick) {
+		WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
+
+		next = rq->core_pick;
+		if (next != prev) {
+			put_prev_task(rq, prev);
+			set_next_task(rq, next);
+		}
+
+		rq->core_pick = NULL;
+		return next;
+	}
+
+	put_prev_task_balance(rq, prev, rf);
+
+	smt_mask = cpu_smt_mask(cpu);
+
+	/*
+	 * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
+	 *
+	 * @task_seq guards the task state ({en,de}queues)
+	 * @pick_seq is the @task_seq we did a selection on
+	 * @sched_seq is the @pick_seq we scheduled
+	 *
+	 * However, preemptions can cause multiple picks on the same task set.
+	 * 'Fix' this by also increasing @task_seq for every pick.
+	 */
+	rq->core->core_task_seq++;
+	need_sync = !!rq->core->core_cookie;
+
+	/* reset state */
+	rq->core->core_cookie = 0UL;
+	for_each_cpu(i, smt_mask) {
+		struct rq *rq_i = cpu_rq(i);
+
+		rq_i->core_pick = NULL;
+
+		if (rq_i->core_forceidle) {
+			need_sync = true;
+			rq_i->core_forceidle = false;
+		}
+
+		if (i != cpu)
+			update_rq_clock(rq_i);
+	}
+
+	/*
+	 * Try and select tasks for each sibling in descending sched_class
+	 * order.
+	 */
+	for_each_class(class) {
+again:
+		for_each_cpu_wrap(i, smt_mask, cpu) {
+			struct rq *rq_i = cpu_rq(i);
+			struct task_struct *p;
+
+			if (rq_i->core_pick)
+				continue;
+
+			/*
+			 * If this sibling doesn't yet have a suitable task to
+			 * run; ask for the most eligible task, given the
+			 * highest priority task already selected for this
+			 * core.
+			 */
+			p = pick_task(rq_i, class, max);
+			if (!p) {
+				/*
+				 * If there weren't no cookies; we don't need to
+				 * bother with the other siblings.
+				 * If the rest of the core is not running a tagged
+				 * task, i.e.  need_sync == 0, and the current CPU
+				 * which called into the schedule() loop does not
+				 * have any tasks for this class, skip selecting for
+				 * other siblings since there's no point. We don't skip
+				 * for RT/DL because that could make CFS force-idle RT.
+				 */
+				if (i == cpu && !need_sync && class == &fair_sched_class)
+					goto next_class;
+
+				continue;
+			}
+
+			/*
+			 * Optimize the 'normal' case where there aren't any
+			 * cookies and we don't need to sync up.
+			 */
+			if (i == cpu && !need_sync && !p->core_cookie) {
+				next = p;
+				goto done;
+			}
+
+			rq_i->core_pick = p;
+
+			/*
+			 * If this new candidate is of higher priority than the
+			 * previous; and they're incompatible; we need to wipe
+			 * the slate and start over. pick_task makes sure that
+			 * p's priority is more than max if it doesn't match
+			 * max's cookie.
+			 *
+			 * NOTE: this is a linear max-filter and is thus bounded
+			 * in execution time.
+			 */
+			if (!max || !cookie_match(max, p)) {
+				struct task_struct *old_max = max;
+
+				rq->core->core_cookie = p->core_cookie;
+				max = p;
+
+				if (old_max) {
+					for_each_cpu(j, smt_mask) {
+						if (j == i)
+							continue;
+
+						cpu_rq(j)->core_pick = NULL;
+					}
+					goto again;
+				} else {
+					/*
+					 * Once we select a task for a cpu, we
+					 * should not be doing an unconstrained
+					 * pick because it might starve a task
+					 * on a forced idle cpu.
+					 */
+					need_sync = true;
+				}
+
+			}
+		}
+next_class:;
+	}
+
+	rq->core->core_pick_seq = rq->core->core_task_seq;
+	next = rq->core_pick;
+	rq->core_sched_seq = rq->core->core_pick_seq;
+
+	/* Something should have been selected for current CPU */
+	WARN_ON_ONCE(!next);
+
+	/*
+	 * Reschedule siblings
+	 *
+	 * NOTE: L1TF -- at this point we're no longer running the old task and
+	 * sending an IPI (below) ensures the sibling will no longer be running
+	 * their task. This ensures there is no inter-sibling overlap between
+	 * non-matching user state.
+	 */
+	for_each_cpu(i, smt_mask) {
+		struct rq *rq_i = cpu_rq(i);
+
+		/*
+		 * An online sibling might have gone offline before a task
+		 * could be picked for it, or it might be offline but later
+		 * happen to come online, but it's too late and nothing was
+		 * picked for it.  That's Ok - it will pick tasks for itself,
+		 * so ignore it.
+		 */
+		if (!rq_i->core_pick)
+			continue;
+
+		if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running)
+			rq_i->core_forceidle = true;
+
+		if (i == cpu) {
+			rq_i->core_pick = NULL;
+			continue;
+		}
+
+		/* Did we break L1TF mitigation requirements? */
+		WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+
+		if (rq_i->curr == rq_i->core_pick) {
+			rq_i->core_pick = NULL;
+			continue;
+		}
+
+		resched_curr(rq_i);
+	}
+
+done:
+	set_next_task(rq, next);
+	return next;
+}
 
 static inline void sched_core_cpu_starting(unsigned int cpu)
 {
@@ -5151,6 +5439,12 @@ static inline void sched_core_cpu_starti
 
 static inline void sched_core_cpu_starting(unsigned int cpu) {}
 
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	return __pick_next_task(rq, prev, rf);
+}
+
 #endif /* CONFIG_SCHED_CORE */
 
 /*
@@ -8048,7 +8342,12 @@ void __init sched_init(void)
 
 #ifdef CONFIG_SCHED_CORE
 		rq->core = NULL;
+		rq->core_pick = NULL;
 		rq->core_enabled = 0;
+		rq->core_tree = RB_ROOT;
+		rq->core_forceidle = false;
+
+		rq->core_cookie = 0UL;
 #endif
 	}
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1065,11 +1065,16 @@ struct rq {
 #ifdef CONFIG_SCHED_CORE
 	/* per rq */
 	struct rq		*core;
+	struct task_struct	*core_pick;
 	unsigned int		core_enabled;
+	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
+	unsigned char		core_forceidle;
 
 	/* shared state */
 	unsigned int		core_task_seq;
+	unsigned int		core_pick_seq;
+	unsigned long		core_cookie;
 #endif
 };
 
@@ -1977,7 +1982,6 @@ static inline void put_prev_task(struct
 
 static inline void set_next_task(struct rq *rq, struct task_struct *next)
 {
-	WARN_ON_ONCE(rq->curr != next);
 	next->sched_class->set_next_task(rq, next, false);
 }
 



^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH 11/19] sched/fair: Fix forced idle sibling starvation corner case
  2021-04-22 12:04 [PATCH 00/19] sched: Core Scheduling Peter Zijlstra
                   ` (9 preceding siblings ...)
  2021-04-22 12:05 ` [PATCH 10/19] sched: Add core wide task selection and scheduling Peter Zijlstra
@ 2021-04-22 12:05 ` Peter Zijlstra
  2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Vineeth Pillai
  2021-04-22 12:05 ` [PATCH 12/19] sched: Fix priority inversion of cookied task with sibling Peter Zijlstra
                   ` (11 subsequent siblings)
  22 siblings, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2021-04-22 12:05 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tglx, Vineeth Pillai

From: Vineeth Pillai <viremana@linux.microsoft.com>

If there is only one long-running local task and the sibling is
forced idle, the forced-idle sibling might not get a chance to run
until a schedule event happens on any cpu in the core.

So we check for this condition during a tick to see if a sibling
is starved and then give it a chance to schedule.
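
With MIN_NR_TASKS_DURING_FORCEIDLE == 2 (see the diff), the trigger
condition in __entity_slice_used() amounts to

	rtime * 2 > sched_slice()	i.e.	rtime > slice / 2

that is, treat the forced-idle sibling as contributing one more
runnable task to a notional shared runqueue and yield once the local
task has consumed its share of the slice.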

Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c  | 15 ++++++++-------
 kernel/sched/fair.c  | 40 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  2 +-
 3 files changed, 49 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1bd0b0bbb040..52d0e83072a4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5206,16 +5206,15 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	/* reset state */
 	rq->core->core_cookie = 0UL;
+	if (rq->core->core_forceidle) {
+		need_sync = true;
+		rq->core->core_forceidle = false;
+	}
 	for_each_cpu(i, smt_mask) {
 		struct rq *rq_i = cpu_rq(i);
 
 		rq_i->core_pick = NULL;
 
-		if (rq_i->core_forceidle) {
-			need_sync = true;
-			rq_i->core_forceidle = false;
-		}
-
 		if (i != cpu)
 			update_rq_clock(rq_i);
 	}
@@ -5335,8 +5334,10 @@ next_class:;
 		if (!rq_i->core_pick)
 			continue;
 
-		if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running)
-			rq_i->core_forceidle = true;
+		if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running &&
+		    !rq_i->core->core_forceidle) {
+			rq_i->core->core_forceidle = true;
+		}
 
 		if (i == cpu) {
 			rq_i->core_pick = NULL;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f53681cd263e..42965c4fd71f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10692,6 +10692,44 @@ static void rq_offline_fair(struct rq *rq)
 
 #endif /* CONFIG_SMP */
 
+#ifdef CONFIG_SCHED_CORE
+static inline bool
+__entity_slice_used(struct sched_entity *se, int min_nr_tasks)
+{
+	u64 slice = sched_slice(cfs_rq_of(se), se);
+	u64 rtime = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+
+	return (rtime * min_nr_tasks > slice);
+}
+
+#define MIN_NR_TASKS_DURING_FORCEIDLE	2
+static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
+{
+	if (!sched_core_enabled(rq))
+		return;
+
+	/*
+	 * If runqueue has only one task which used up its slice and
+	 * if the sibling is forced idle, then trigger schedule to
+	 * give forced idle task a chance.
+	 *
+	 * sched_slice() considers only this active rq and it gets the
+	 * whole slice. But during force idle, we have siblings acting
+	 * like a single runqueue and hence we need to consider runnable
+	 * tasks on this cpu and the forced idle cpu. Ideally, we should
+	 * go through the forced idle rq, but that would be a perf hit.
+	 * We can assume that the forced idle cpu has at least
+	 * MIN_NR_TASKS_DURING_FORCEIDLE - 1 tasks and use that to check
+	 * if we need to give up the cpu.
+	 */
+	if (rq->core->core_forceidle && rq->cfs.nr_running == 1 &&
+	    __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
+		resched_curr(rq);
+}
+#else
+static inline void task_tick_core(struct rq *rq, struct task_struct *curr) {}
+#endif
+
 /*
  * scheduler tick hitting a task of our scheduling class.
  *
@@ -10715,6 +10753,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	update_misfit_status(curr, rq);
 	update_overutilized_status(task_rq(curr));
+
+	task_tick_core(rq, curr);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 63b28e1843ee..be656ca8693d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1069,12 +1069,12 @@ struct rq {
 	unsigned int		core_enabled;
 	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
-	unsigned char		core_forceidle;
 
 	/* shared state */
 	unsigned int		core_task_seq;
 	unsigned int		core_pick_seq;
 	unsigned long		core_cookie;
+	unsigned char		core_forceidle;
 #endif
 };
 
-- 
2.29.2.299.gdc1121823c-goog




^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH 12/19] sched: Fix priority inversion of cookied task with sibling
  2021-04-22 12:04 [PATCH 00/19] sched: Core Scheduling Peter Zijlstra
                   ` (10 preceding siblings ...)
  2021-04-22 12:05 ` [PATCH 11/19] sched/fair: Fix forced idle sibling starvation corner case Peter Zijlstra
@ 2021-04-22 12:05 ` Peter Zijlstra
  2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Joel Fernandes (Google)
  2021-04-22 12:05 ` [PATCH 13/19] sched/fair: Snapshot the min_vruntime of CPUs on force idle Peter Zijlstra
                   ` (10 subsequent siblings)
  22 siblings, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2021-04-22 12:05 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tglx

From: "Joel Fernandes (Google)" <joel@joelfernandes.org>

The rationale is as follows. In the core-wide pick logic, even if
need_sync == false, we need to go look at other CPUs (non-local CPUs)
to see if they could be running RT.

Say the RQs in a particular core look like this:

Let CFS1 and CFS2 be two tagged CFS tasks.
Let RT1 be an untagged RT task.

	rq0		rq1
	CFS1 (tagged)	RT1 (no tag)
	CFS2 (tagged)

Say schedule() runs on rq0. Now, it will enter the above loop and
pick_task(RT) will return NULL for 'p'. It will enter the above if()
block and see that need_sync == false and will skip RT entirely.

The end result of the selection will be (say prio(CFS1) > prio(CFS2)):

	rq0             rq1
	CFS1            IDLE

When it should have selected:

	rq0             rq1
	IDLE            RT
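
The fix keeps a single-rq fast path, but only takes it when the
locally picked task itself carries no cookie; a hedged sketch of the
resulting shape (the full version is in the diff below):

	if (!need_sync) {		/* no cookie on the core so far */
		for_each_class(class) {
			next = class->pick_task(rq);
			if (next)
				break;
		}

		/* an uncookied pick cannot constrain the siblings */
		if (!next->core_cookie)
			goto done;

		/* cookied pick: fall through to the core-wide selection */
		need_sync = true;
	}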

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   65 ++++++++++++++++++++--------------------------------
 1 file changed, 26 insertions(+), 39 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5425,6 +5425,15 @@ pick_next_task(struct rq *rq, struct tas
 	put_prev_task_balance(rq, prev, rf);
 
 	smt_mask = cpu_smt_mask(cpu);
+	need_sync = !!rq->core->core_cookie;
+
+	/* reset state */
+	rq->core->core_cookie = 0UL;
+	if (rq->core->core_forceidle) {
+		need_sync = true;
+		fi_before = true;
+		rq->core->core_forceidle = false;
+	}
 
 	/*
 	 * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
@@ -5437,14 +5446,25 @@ pick_next_task(struct rq *rq, struct tas
 	 * 'Fix' this by also increasing @task_seq for every pick.
 	 */
 	rq->core->core_task_seq++;
-	need_sync = !!rq->core->core_cookie;
 
-	/* reset state */
-	rq->core->core_cookie = 0UL;
-	if (rq->core->core_forceidle) {
+	/*
+	 * Optimize for common case where this CPU has no cookies
+	 * and there are no cookied tasks running on siblings.
+	 */
+	if (!need_sync) {
+		for_each_class(class) {
+			next = class->pick_task(rq);
+			if (next)
+				break;
+		}
+
+		if (!next->core_cookie) {
+			rq->core_pick = NULL;
+			goto done;
+		}
 		need_sync = true;
-		rq->core->core_forceidle = false;
 	}
+
 	for_each_cpu(i, smt_mask) {
 		struct rq *rq_i = cpu_rq(i);
 
@@ -5474,31 +5494,8 @@ pick_next_task(struct rq *rq, struct tas
 			 * core.
 			 */
 			p = pick_task(rq_i, class, max);
-			if (!p) {
-				/*
-				 * If there weren't no cookies; we don't need to
-				 * bother with the other siblings.
-				 * If the rest of the core is not running a tagged
-				 * task, i.e.  need_sync == 0, and the current CPU
-				 * which called into the schedule() loop does not
-				 * have any tasks for this class, skip selecting for
-				 * other siblings since there's no point. We don't skip
-				 * for RT/DL because that could make CFS force-idle RT.
-				 */
-				if (i == cpu && !need_sync && class == &fair_sched_class)
-					goto next_class;
-
+			if (!p)
 				continue;
-			}
-
-			/*
-			 * Optimize the 'normal' case where there aren't any
-			 * cookies and we don't need to sync up.
-			 */
-			if (i == cpu && !need_sync && !p->core_cookie) {
-				next = p;
-				goto done;
-			}
 
 			rq_i->core_pick = p;
 
@@ -5526,19 +5523,9 @@ pick_next_task(struct rq *rq, struct tas
 						cpu_rq(j)->core_pick = NULL;
 					}
 					goto again;
-				} else {
-					/*
-					 * Once we select a task for a cpu, we
-					 * should not be doing an unconstrained
-					 * pick because it might starve a task
-					 * on a forced idle cpu.
-					 */
-					need_sync = true;
 				}
-
 			}
 		}
-next_class:;
 	}
 
 	rq->core->core_pick_seq = rq->core->core_task_seq;



^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH 13/19] sched/fair: Snapshot the min_vruntime of CPUs on force idle
  2021-04-22 12:04 [PATCH 00/19] sched: Core Scheduling Peter Zijlstra
                   ` (11 preceding siblings ...)
  2021-04-22 12:05 ` [PATCH 12/19] sched: Fix priority inversion of cookied task with sibling Peter Zijlstra
@ 2021-04-22 12:05 ` Peter Zijlstra
  2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Joel Fernandes (Google)
  2021-04-22 12:05 ` [PATCH 14/19] sched: Trivial forced-newidle balancer Peter Zijlstra
                   ` (9 subsequent siblings)
  22 siblings, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2021-04-22 12:05 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tglx

From: "Joel Fernandes (Google)" <joel@joelfernandes.org>

During force-idle, we end up doing cross-cpu comparison of vruntimes
during pick_next_task. If we simply compare (vruntime-min_vruntime)
across CPUs, and if the CPUs only have 1 task each, we will always
end up comparing 0 with 0 and pick just one of the tasks all the time.
This starves the task that was not picked. To fix this, take a snapshot
of the min_vruntime when entering force idle and use it for comparison.
This min_vruntime snapshot will only be used for cross-CPU vruntime
comparison, and nothing else.

A note about the min_vruntime snapshot and force idling:

During selection:

  When we're not fi, we need to update snapshot.
  When we're fi and we were not fi, we must update snapshot.
  When we're fi and we were already fi, we must not update snapshot.

Which gives:

  fib     fi      update
  0       0       1
  0       1       1
  1       0       1
  1       1       0

Where:

  fi:  force-idled now
  fib: force-idled before

So the min_vruntime snapshot needs to be updated when: !(fib && fi).

Also, the cfs_prio_less() function needs to be aware of whether the
core is in force idle or not, since it will use this information to
decide whether to advance a cfs_rq's min_vruntime_fi in the hierarchy.
So pass this information along via pick_task() -> prio_less().
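
A worked example of why the snapshot matters (numbers made up): say
at the start of a force-idle episode cpu0 snapshots
min_vruntime_fi = 100 and cpu1 snapshots min_vruntime_fi = 500.
Later, task A on cpu0 has vruntime 130 and task B on cpu1 has
vruntime 510. cfs_prio_less(A, B) then computes

	delta = (130 - 510) + (500 - 100) = 20 > 0

so A compares as having made more progress within this episode and
the pick favours B, whereas comparing (vruntime - min_vruntime) with
one task per CPU would keep yielding 0 vs 0 and starve one side, as
described above.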

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c  |   59 +++++++++++++++++++++++-----------------
 kernel/sched/fair.c  |   75 +++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |    8 +++++
 3 files changed, 117 insertions(+), 25 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -101,7 +101,7 @@ static inline int __task_prio(struct tas
  */
 
 /* real prio, less is less */
-static inline bool prio_less(struct task_struct *a, struct task_struct *b)
+static inline bool prio_less(struct task_struct *a, struct task_struct *b, bool in_fi)
 {
 
 	int pa = __task_prio(a), pb = __task_prio(b);
@@ -115,19 +115,8 @@ static inline bool prio_less(struct task
 	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
 		return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
-		u64 vruntime = b->se.vruntime;
-
-		/*
-		 * Normalize the vruntime if tasks are in different cpus.
-		 */
-		if (task_cpu(a) != task_cpu(b)) {
-			vruntime -= task_cfs_rq(b)->min_vruntime;
-			vruntime += task_cfs_rq(a)->min_vruntime;
-		}
-
-		return !((s64)(a->se.vruntime - vruntime) <= 0);
-	}
+	if (pa == MAX_RT_PRIO + MAX_NICE)	/* fair */
+		return cfs_prio_less(a, b, in_fi);
 
 	return false;
 }
@@ -141,7 +130,7 @@ static inline bool __sched_core_less(str
 		return false;
 
 	/* flip prio, so high prio is leftmost */
-	if (prio_less(b, a))
+	if (prio_less(b, a, task_rq(a)->core->core_forceidle))
 		return true;
 
 	return false;
@@ -5145,7 +5134,7 @@ static inline bool cookie_match(struct t
  * - Else returns idle_task.
  */
 static struct task_struct *
-pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
+pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max, bool in_fi)
 {
 	struct task_struct *class_pick, *cookie_pick;
 	unsigned long cookie = rq->core->core_cookie;
@@ -5160,7 +5149,7 @@ pick_task(struct rq *rq, const struct sc
 		 * higher priority than max.
 		 */
 		if (max && class_pick->core_cookie &&
-		    prio_less(class_pick, max))
+		    prio_less(class_pick, max, in_fi))
 			return idle_sched_class.pick_task(rq);
 
 		return class_pick;
@@ -5179,19 +5168,22 @@ pick_task(struct rq *rq, const struct sc
 	 * the core (so far) and it must be selected, otherwise we must go with
 	 * the cookie pick in order to satisfy the constraint.
 	 */
-	if (prio_less(cookie_pick, class_pick) &&
-	    (!max || prio_less(max, class_pick)))
+	if (prio_less(cookie_pick, class_pick, in_fi) &&
+	    (!max || prio_less(max, class_pick, in_fi)))
 		return class_pick;
 
 	return cookie_pick;
 }
 
+extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi);
+
 static struct task_struct *
 pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	struct task_struct *next, *max = NULL;
 	const struct sched_class *class;
 	const struct cpumask *smt_mask;
+	bool fi_before = false;
 	bool need_sync;
 	int i, j, cpu;
 
@@ -5273,9 +5265,14 @@ pick_next_task(struct rq *rq, struct tas
 
 		if (!next->core_cookie) {
 			rq->core_pick = NULL;
+			/*
+			 * For robustness, update the min_vruntime_fi for
+			 * unconstrained picks as well.
+			 */
+			WARN_ON_ONCE(fi_before);
+			task_vruntime_update(rq, next, false);
 			goto done;
 		}
-		need_sync = true;
 	}
 
 	for_each_cpu(i, smt_mask) {
@@ -5306,11 +5303,16 @@ pick_next_task(struct rq *rq, struct tas
 			 * highest priority task already selected for this
 			 * core.
 			 */
-			p = pick_task(rq_i, class, max);
+			p = pick_task(rq_i, class, max, fi_before);
 			if (!p)
 				continue;
 
 			rq_i->core_pick = p;
+			if (rq_i->idle == p && rq_i->nr_running) {
+				rq->core->core_forceidle = true;
+				if (!fi_before)
+					rq->core->core_forceidle_seq++;
+			}
 
 			/*
 			 * If this new candidate is of higher priority than the
@@ -5329,6 +5331,7 @@ pick_next_task(struct rq *rq, struct tas
 				max = p;
 
 				if (old_max) {
+					rq->core->core_forceidle = false;
 					for_each_cpu(j, smt_mask) {
 						if (j == i)
 							continue;
@@ -5369,10 +5372,16 @@ pick_next_task(struct rq *rq, struct tas
 		if (!rq_i->core_pick)
 			continue;
 
-		if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running &&
-		    !rq_i->core->core_forceidle) {
-			rq_i->core->core_forceidle = true;
-		}
+		/*
+		 * Update for new !FI->FI transitions, or if continuing to be in !FI:
+		 * fi_before     fi      update?
+		 *  0            0       1
+		 *  0            1       1
+		 *  1            0       1
+		 *  1            1       0
+		 */
+		if (!(fi_before && rq->core->core_forceidle))
+			task_vruntime_update(rq_i, rq_i->core_pick, rq->core->core_forceidle);
 
 		if (i == cpu) {
 			rq_i->core_pick = NULL;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10727,6 +10727,81 @@ static inline void task_tick_core(struct
 	    __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
 		resched_curr(rq);
 }
+
+/*
+ * se_fi_update - Update the cfs_rq->min_vruntime_fi in a CFS hierarchy if needed.
+ */
+static void se_fi_update(struct sched_entity *se, unsigned int fi_seq, bool forceidle)
+{
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		if (forceidle) {
+			if (cfs_rq->forceidle_seq == fi_seq)
+				break;
+			cfs_rq->forceidle_seq = fi_seq;
+		}
+
+		cfs_rq->min_vruntime_fi = cfs_rq->min_vruntime;
+	}
+}
+
+void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi)
+{
+	struct sched_entity *se = &p->se;
+
+	if (p->sched_class != &fair_sched_class)
+		return;
+
+	se_fi_update(se, rq->core->core_forceidle_seq, in_fi);
+}
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool in_fi)
+{
+	struct rq *rq = task_rq(a);
+	struct sched_entity *sea = &a->se;
+	struct sched_entity *seb = &b->se;
+	struct cfs_rq *cfs_rqa;
+	struct cfs_rq *cfs_rqb;
+	s64 delta;
+
+	SCHED_WARN_ON(task_rq(b)->core != rq->core);
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	/*
+	 * Find an se in the hierarchy for tasks a and b, such that the se's
+	 * are immediate siblings.
+	 */
+	while (sea->cfs_rq->tg != seb->cfs_rq->tg) {
+		int sea_depth = sea->depth;
+		int seb_depth = seb->depth;
+
+		if (sea_depth >= seb_depth)
+			sea = parent_entity(sea);
+		if (sea_depth <= seb_depth)
+			seb = parent_entity(seb);
+	}
+
+	se_fi_update(sea, rq->core->core_forceidle_seq, in_fi);
+	se_fi_update(seb, rq->core->core_forceidle_seq, in_fi);
+
+	cfs_rqa = sea->cfs_rq;
+	cfs_rqb = seb->cfs_rq;
+#else
+	cfs_rqa = &task_rq(a)->cfs;
+	cfs_rqb = &task_rq(b)->cfs;
+#endif
+
+	/*
+	 * Find delta after normalizing se's vruntime with its cfs_rq's
+	 * min_vruntime_fi, which would have been updated in prior calls
+	 * to se_fi_update().
+	 */
+	delta = (s64)(sea->vruntime - seb->vruntime) +
+		(s64)(cfs_rqb->min_vruntime_fi - cfs_rqa->min_vruntime_fi);
+
+	return delta > 0;
+}
 #else
 static inline void task_tick_core(struct rq *rq, struct task_struct *curr) {}
 #endif
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -517,6 +517,11 @@ struct cfs_rq {
 
 	u64			exec_clock;
 	u64			min_vruntime;
+#ifdef CONFIG_SCHED_CORE
+	unsigned int		forceidle_seq;
+	u64			min_vruntime_fi;
+#endif
+
 #ifndef CONFIG_64BIT
 	u64			min_vruntime_copy;
 #endif
@@ -1075,6 +1080,7 @@ struct rq {
 	unsigned int		core_pick_seq;
 	unsigned long		core_cookie;
 	unsigned char		core_forceidle;
+	unsigned int		core_forceidle_seq;
 #endif
 };
 
@@ -1130,6 +1136,8 @@ static inline raw_spinlock_t *rq_lockp(s
 	return &rq->__lock;
 }
 
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool fi);
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline bool sched_core_enabled(struct rq *rq)



^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH 14/19] sched: Trivial forced-newidle balancer
  2021-04-22 12:04 [PATCH 00/19] sched: Core Scheduling Peter Zijlstra
                   ` (12 preceding siblings ...)
  2021-04-22 12:05 ` [PATCH 13/19] sched/fair: Snapshot the min_vruntime of CPUs on force idle Peter Zijlstra
@ 2021-04-22 12:05 ` Peter Zijlstra
  2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2021-04-22 12:05 ` [PATCH 15/19] sched: Migration changes for core scheduling Peter Zijlstra
                   ` (8 subsequent siblings)
  22 siblings, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2021-04-22 12:05 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tglx

When a sibling is forced-idle to match the core-cookie, search for
matching tasks to fill the core.

rcu_read_unlock() can incur an infrequent deadlock in
sched_core_balance(). Fix this by using the RCU-sched flavor instead.
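
Roughly, the call chain added below is (a hedged summary of the diff,
not new code):

	set_next_task_idle()
	  queue_core_balance()		/* only when this rq is forced idle */
	    sched_core_balance()	/* runs as a balance callback */
	      for_each_domain(cpu, sd)
	        steal_cookie_task(cpu, sd)
	          try_steal_cookie(this, that)
	            /* walk src's cookie tree, migrate the first stealable match */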

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h |    1 
 kernel/sched/core.c   |  130 +++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/idle.c   |    1 
 kernel/sched/sched.h  |    6 ++
 4 files changed, 137 insertions(+), 1 deletion(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -704,6 +704,7 @@ struct task_struct {
 #ifdef CONFIG_SCHED_CORE
 	struct rb_node			core_node;
 	unsigned long			core_cookie;
+	unsigned int			core_occupation;
 #endif
 
 #ifdef CONFIG_CGROUP_SCHED
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -204,6 +204,21 @@ static struct task_struct *sched_core_fi
 	return __node_2_sc(node);
 }
 
+static struct task_struct *sched_core_next(struct task_struct *p, unsigned long cookie)
+{
+	struct rb_node *node = &p->core_node;
+
+	node = rb_next(node);
+	if (!node)
+		return NULL;
+
+	p = container_of(node, struct task_struct, core_node);
+	if (p->core_cookie != cookie)
+		return NULL;
+
+	return p;
+}
+
 /*
  * Magic required such that:
  *
@@ -5371,8 +5386,8 @@ pick_next_task(struct rq *rq, struct tas
 	const struct sched_class *class;
 	const struct cpumask *smt_mask;
 	bool fi_before = false;
+	int i, j, cpu, occ = 0;
 	bool need_sync;
-	int i, j, cpu;
 
 	if (!sched_core_enabled(rq))
 		return __pick_next_task(rq, prev, rf);
@@ -5494,6 +5509,9 @@ pick_next_task(struct rq *rq, struct tas
 			if (!p)
 				continue;
 
+			if (!is_task_rq_idle(p))
+				occ++;
+
 			rq_i->core_pick = p;
 			if (rq_i->idle == p && rq_i->nr_running) {
 				rq->core->core_forceidle = true;
@@ -5525,6 +5543,7 @@ pick_next_task(struct rq *rq, struct tas
 
 						cpu_rq(j)->core_pick = NULL;
 					}
+					occ = 1;
 					goto again;
 				}
 			}
@@ -5570,6 +5589,8 @@ pick_next_task(struct rq *rq, struct tas
 		if (!(fi_before && rq->core->core_forceidle))
 			task_vruntime_update(rq_i, rq_i->core_pick, rq->core->core_forceidle);
 
+		rq_i->core_pick->core_occupation = occ;
+
 		if (i == cpu) {
 			rq_i->core_pick = NULL;
 			continue;
@@ -5591,6 +5612,113 @@ pick_next_task(struct rq *rq, struct tas
 	return next;
 }
 
+static bool try_steal_cookie(int this, int that)
+{
+	struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
+	struct task_struct *p;
+	unsigned long cookie;
+	bool success = false;
+
+	local_irq_disable();
+	double_rq_lock(dst, src);
+
+	cookie = dst->core->core_cookie;
+	if (!cookie)
+		goto unlock;
+
+	if (dst->curr != dst->idle)
+		goto unlock;
+
+	p = sched_core_find(src, cookie);
+	if (p == src->idle)
+		goto unlock;
+
+	do {
+		if (p == src->core_pick || p == src->curr)
+			goto next;
+
+		if (!cpumask_test_cpu(this, &p->cpus_mask))
+			goto next;
+
+		if (p->core_occupation > dst->idle->core_occupation)
+			goto next;
+
+		p->on_rq = TASK_ON_RQ_MIGRATING;
+		deactivate_task(src, p, 0);
+		set_task_cpu(p, this);
+		activate_task(dst, p, 0);
+		p->on_rq = TASK_ON_RQ_QUEUED;
+
+		resched_curr(dst);
+
+		success = true;
+		break;
+
+next:
+		p = sched_core_next(p, cookie);
+	} while (p);
+
+unlock:
+	double_rq_unlock(dst, src);
+	local_irq_enable();
+
+	return success;
+}
+
+static bool steal_cookie_task(int cpu, struct sched_domain *sd)
+{
+	int i;
+
+	for_each_cpu_wrap(i, sched_domain_span(sd), cpu) {
+		if (i == cpu)
+			continue;
+
+		if (need_resched())
+			break;
+
+		if (try_steal_cookie(cpu, i))
+			return true;
+	}
+
+	return false;
+}
+
+static void sched_core_balance(struct rq *rq)
+{
+	struct sched_domain *sd;
+	int cpu = cpu_of(rq);
+
+	preempt_disable();
+	rcu_read_lock();
+	raw_spin_rq_unlock_irq(rq);
+	for_each_domain(cpu, sd) {
+		if (need_resched())
+			break;
+
+		if (steal_cookie_task(cpu, sd))
+			break;
+	}
+	raw_spin_rq_lock_irq(rq);
+	rcu_read_unlock();
+	preempt_enable();
+}
+
+static DEFINE_PER_CPU(struct callback_head, core_balance_head);
+
+void queue_core_balance(struct rq *rq)
+{
+	if (!sched_core_enabled(rq))
+		return;
+
+	if (!rq->core->core_cookie)
+		return;
+
+	if (!rq->nr_running) /* not forced idle */
+		return;
+
+	queue_balance_callback(rq, &per_cpu(core_balance_head, rq->cpu), sched_core_balance);
+}
+
 static inline void sched_core_cpu_starting(unsigned int cpu)
 {
 	const struct cpumask *smt_mask = cpu_smt_mask(cpu);
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -437,6 +437,7 @@ static void set_next_task_idle(struct rq
 {
 	update_idle_core(rq);
 	schedstat_inc(rq->sched_goidle);
+	queue_core_balance(rq);
 }
 
 #ifdef CONFIG_SMP
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1170,6 +1170,8 @@ static inline raw_spinlock_t *__rq_lockp
 
 bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool fi);
 
+extern void queue_core_balance(struct rq *rq);
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline bool sched_core_enabled(struct rq *rq)
@@ -1187,6 +1189,10 @@ static inline raw_spinlock_t *rq_lockp(s
 	return &rq->__lock;
 }
 
+static inline void queue_core_balance(struct rq *rq)
+{
+}
+
 #endif /* CONFIG_SCHED_CORE */
 
 static inline void lockdep_assert_rq_held(struct rq *rq)



^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH 15/19] sched: Migration changes for core scheduling
  2021-04-22 12:04 [PATCH 00/19] sched: Core Scheduling Peter Zijlstra
                   ` (13 preceding siblings ...)
  2021-04-22 12:05 ` [PATCH 14/19] sched: Trivial forced-newidle balancer Peter Zijlstra
@ 2021-04-22 12:05 ` Peter Zijlstra
  2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Aubrey Li
  2021-04-22 12:05 ` [PATCH 16/19] sched: Trivial core scheduling cookie management Peter Zijlstra
                   ` (7 subsequent siblings)
  22 siblings, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2021-04-22 12:05 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tglx, Aubrey Li

From: Aubrey Li <aubrey.li@linux.intel.com>

 - Don't migrate if there is a cookie mismatch
     Load balance tries to move a task from the busiest CPU to the
     destination CPU. When core scheduling is enabled, if the
     task's cookie does not match the destination CPU's core
     cookie, the task may be skipped by this CPU. This mitigates
     the forced idle time on the destination CPU.

 - Select cookie matched idle CPU
     In the fast path of task wakeup, select the first cookie matched
     idle CPU instead of the first idle CPU.

 - Find cookie matched idlest CPU
     In the slow path of task wakeup, find the idlest CPU whose core
     cookie matches with task's cookie
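
As a stand-alone illustration of the matching rule the sched_*_cookie_match()
helpers below implement (this mirrors the per-core check; the fast-path
per-CPU check omits the idle-core exception), here is a small user-space
model. The *_model types are invented for illustration and are not kernel
types:

	#include <stdbool.h>
	#include <stdio.h>

	struct core_model {
		unsigned long core_cookie;	/* cookie currently running on the core */
		bool          core_is_idle;	/* every SMT sibling is idle */
		bool          core_sched_on;	/* core scheduling enabled for this rq */
	};

	struct task_model {
		unsigned long core_cookie;
	};

	/* Mirrors the intent of sched_core_cookie_match() added below. */
	static bool cookie_match(const struct core_model *c, const struct task_model *t)
	{
		if (!c->core_sched_on)
			return true;	/* no constraint without core scheduling */
		if (c->core_is_idle)
			return true;	/* an idle core is always an acceptable target */
		return c->core_cookie == t->core_cookie;
	}

	int main(void)
	{
		struct core_model busy = {
			.core_cookie = 0xabc, .core_is_idle = false, .core_sched_on = true,
		};
		struct task_model t1 = { .core_cookie = 0xabc };
		struct task_model t2 = { .core_cookie = 0xdef };

		printf("same cookie: %d, different cookie: %d\n",
		       cookie_match(&busy, &t1), cookie_match(&busy, &t2));
		return 0;
	}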

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c  |   29 ++++++++++++++++----
 kernel/sched/sched.h |   73 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 96 insertions(+), 6 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5889,11 +5889,15 @@ find_idlest_group_cpu(struct sched_group
 
 	/* Traverse only the allowed CPUs */
 	for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+		struct rq *rq = cpu_rq(i);
+
+		if (!sched_core_cookie_match(rq, p))
+			continue;
+
 		if (sched_idle_cpu(i))
 			return i;
 
 		if (available_idle_cpu(i)) {
-			struct rq *rq = cpu_rq(i);
 			struct cpuidle_state *idle = idle_get_state(rq);
 			if (idle && idle->exit_latency < min_exit_latency) {
 				/*
@@ -5979,9 +5983,10 @@ static inline int find_idlest_cpu(struct
 	return new_cpu;
 }
 
-static inline int __select_idle_cpu(int cpu)
+static inline int __select_idle_cpu(int cpu, struct task_struct *p)
 {
-	if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
+	    sched_cpu_cookie_match(cpu_rq(cpu), p))
 		return cpu;
 
 	return -1;
@@ -6051,7 +6056,7 @@ static int select_idle_core(struct task_
 	int cpu;
 
 	if (!static_branch_likely(&sched_smt_present))
-		return __select_idle_cpu(core);
+		return __select_idle_cpu(core, p);
 
 	for_each_cpu(cpu, cpu_smt_mask(core)) {
 		if (!available_idle_cpu(cpu)) {
@@ -6107,7 +6112,7 @@ static inline bool test_idle_cores(int c
 
 static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
 {
-	return __select_idle_cpu(core);
+	return __select_idle_cpu(core, p);
 }
 
 static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
@@ -6164,7 +6169,7 @@ static int select_idle_cpu(struct task_s
 		} else {
 			if (!--nr)
 				return -1;
-			idle_cpu = __select_idle_cpu(cpu);
+			idle_cpu = __select_idle_cpu(cpu, p);
 			if ((unsigned int)idle_cpu < nr_cpumask_bits)
 				break;
 		}
@@ -7517,6 +7522,14 @@ static int task_hot(struct task_struct *
 
 	if (sysctl_sched_migration_cost == -1)
 		return 1;
+
+	/*
+	 * Don't migrate task if the task's cookie does not match
+	 * with the destination CPU's core cookie.
+	 */
+	if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
+		return 1;
+
 	if (sysctl_sched_migration_cost == 0)
 		return 0;
 
@@ -8847,6 +8860,10 @@ find_idlest_group(struct sched_domain *s
 					p->cpus_ptr))
 			continue;
 
+		/* Skip over this group if no cookie matched */
+		if (!sched_group_cookie_match(cpu_rq(this_cpu), p, group))
+			continue;
+
 		local_group = cpumask_test_cpu(this_cpu,
 					       sched_group_span(group));
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1134,7 +1134,9 @@ static inline bool is_migration_disabled
 #endif
 }
 
+struct sched_group;
 #ifdef CONFIG_SCHED_CORE
+static inline struct cpumask *sched_group_span(struct sched_group *sg);
 
 DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
 
@@ -1170,6 +1172,61 @@ static inline raw_spinlock_t *__rq_lockp
 
 bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool fi);
 
+/*
+ * Helpers to check if the CPU's core cookie matches with the task's cookie
+ * when core scheduling is enabled.
+ * A special case is that the task's cookie always matches with CPU's core
+ * cookie if the CPU is in an idle core.
+ */
+static inline bool sched_cpu_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
+	if (!sched_core_enabled(rq))
+		return true;
+
+	return rq->core->core_cookie == p->core_cookie;
+}
+
+static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	bool idle_core = true;
+	int cpu;
+
+	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
+	if (!sched_core_enabled(rq))
+		return true;
+
+	for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
+		if (!available_idle_cpu(cpu)) {
+			idle_core = false;
+			break;
+		}
+	}
+
+	/*
+	 * A CPU in an idle core is always the best choice for tasks with
+	 * cookies.
+	 */
+	return idle_core || rq->core->core_cookie == p->core_cookie;
+}
+
+static inline bool sched_group_cookie_match(struct rq *rq,
+					    struct task_struct *p,
+					    struct sched_group *group)
+{
+	int cpu;
+
+	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
+	if (!sched_core_enabled(rq))
+		return true;
+
+	for_each_cpu_and(cpu, sched_group_span(group), p->cpus_ptr) {
+		if (sched_core_cookie_match(rq, p))
+			return true;
+	}
+	return false;
+}
+
 extern void queue_core_balance(struct rq *rq);
 
 #else /* !CONFIG_SCHED_CORE */
@@ -1193,6 +1250,22 @@ static inline void queue_core_balance(st
 {
 }
 
+static inline bool sched_cpu_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	return true;
+}
+
+static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	return true;
+}
+
+static inline bool sched_group_cookie_match(struct rq *rq,
+					    struct task_struct *p,
+					    struct sched_group *group)
+{
+	return true;
+}
 #endif /* CONFIG_SCHED_CORE */
 
 static inline void lockdep_assert_rq_held(struct rq *rq)



^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH 16/19] sched: Trivial core scheduling cookie management
  2021-04-22 12:04 [PATCH 00/19] sched: Core Scheduling Peter Zijlstra
                   ` (14 preceding siblings ...)
  2021-04-22 12:05 ` [PATCH 15/19] sched: Migration changes for core scheduling Peter Zijlstra
@ 2021-04-22 12:05 ` Peter Zijlstra
  2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2021-04-22 12:05 ` [PATCH 17/19] sched: Inherit task cookie on fork() Peter Zijlstra
                   ` (6 subsequent siblings)
  22 siblings, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2021-04-22 12:05 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tglx

In order to not have to use pid_struct, create a new, smaller,
structure to manage task cookies for core scheduling.
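
As a rough user-space model of the scheme (plain malloc() and an int stand
in for kmalloc() and refcount_t, the *_model names are invented, and the
real helpers additionally take/drop a sched_core_get()/put() reference as
seen in the diff below):

	#include <stdio.h>
	#include <stdlib.h>

	struct cookie_model {
		int refcnt;
	};

	static unsigned long cookie_alloc(void)
	{
		struct cookie_model *ck = malloc(sizeof(*ck));

		if (!ck)
			return 0;
		ck->refcnt = 1;			/* creator holds the first reference */
		return (unsigned long)ck;	/* the address itself is the cookie value */
	}

	static unsigned long cookie_get(unsigned long cookie)
	{
		struct cookie_model *ck = (void *)cookie;

		if (ck)
			ck->refcnt++;		/* each sharing task bumps the count */
		return cookie;
	}

	static void cookie_put(unsigned long cookie)
	{
		struct cookie_model *ck = (void *)cookie;

		if (ck && --ck->refcnt == 0)
			free(ck);		/* last task dropping it frees it */
	}

	int main(void)
	{
		unsigned long c = cookie_alloc();	/* task A creates a cookie */
		unsigned long d = cookie_get(c);	/* task B shares it */

		printf("shared: %s\n", c == d ? "yes" : "no");
		cookie_put(c);				/* A exits */
		cookie_put(d);				/* B exits, cookie is freed */
		return 0;
	}

Tasks sharing a cookie simply share the same address value; two separately
allocated cookies can never compare equal.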

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h     |    6 ++
 kernel/fork.c             |    1 
 kernel/sched/Makefile     |    1 
 kernel/sched/core.c       |    7 +-
 kernel/sched/core_sched.c |  109 ++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h      |   16 ++++++
 6 files changed, 137 insertions(+), 3 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2170,4 +2170,10 @@ int sched_trace_rq_nr_running(struct rq
 
 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
+#ifdef CONFIG_SCHED_CORE
+extern void sched_core_free(struct task_struct *tsk);
+#else
+static inline void sched_core_free(struct task_struct *tsk) { }
+#endif
+
 #endif
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -737,6 +737,7 @@ void __put_task_struct(struct task_struc
 	exit_creds(tsk);
 	delayacct_tsk_free(tsk);
 	put_signal_struct(tsk->signal);
+	sched_core_free(tsk);
 
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -36,3 +36,4 @@ obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) +=
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
 obj-$(CONFIG_PSI) += psi.o
+obj-$(CONFIG_SCHED_CORE) += core_sched.o
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -157,7 +157,7 @@ static inline int rb_sched_core_cmp(cons
 	return 0;
 }
 
-static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 {
 	rq->core->core_task_seq++;
 
@@ -167,14 +167,15 @@ static void sched_core_enqueue(struct rq
 	rb_add(&p->core_node, &rq->core_tree, rb_sched_core_less);
 }
 
-static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
+void sched_core_dequeue(struct rq *rq, struct task_struct *p)
 {
 	rq->core->core_task_seq++;
 
-	if (!p->core_cookie)
+	if (!sched_core_enqueued(p))
 		return;
 
 	rb_erase(&p->core_node, &rq->core_tree);
+	RB_CLEAR_NODE(&p->core_node);
 }
 
 /*
--- /dev/null
+++ b/kernel/sched/core_sched.c
@@ -0,0 +1,109 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "sched.h"
+
+/*
+ * A simple wrapper around refcount. An allocated sched_core_cookie's
+ * address is used to compute the cookie of the task.
+ */
+struct sched_core_cookie {
+	refcount_t refcnt;
+};
+
+unsigned long sched_core_alloc_cookie(void)
+{
+	struct sched_core_cookie *ck = kmalloc(sizeof(*ck), GFP_KERNEL);
+	if (!ck)
+		return 0;
+
+	refcount_set(&ck->refcnt, 1);
+	sched_core_get();
+
+	return (unsigned long)ck;
+}
+
+void sched_core_put_cookie(unsigned long cookie)
+{
+	struct sched_core_cookie *ptr = (void *)cookie;
+
+	if (ptr && refcount_dec_and_test(&ptr->refcnt)) {
+		kfree(ptr);
+		sched_core_put();
+	}
+}
+
+unsigned long sched_core_get_cookie(unsigned long cookie)
+{
+	struct sched_core_cookie *ptr = (void *)cookie;
+
+	if (ptr)
+		refcount_inc(&ptr->refcnt);
+
+	return cookie;
+}
+
+/*
+ * sched_core_update_cookie - replace the cookie on a task
+ * @p: the task to update
+ * @cookie: the new cookie
+ *
+ * Effectively exchange the task cookie; caller is responsible for lifetimes on
+ * both ends.
+ *
+ * Returns: the old cookie
+ */
+unsigned long sched_core_update_cookie(struct task_struct *p, unsigned long cookie)
+{
+	unsigned long old_cookie;
+	struct rq_flags rf;
+	struct rq *rq;
+	bool enqueued;
+
+	rq = task_rq_lock(p, &rf);
+
+	/*
+	 * Since creating a cookie implies sched_core_get(), and we cannot set
+	 * a cookie until after we've created it, similarly, we cannot destroy
+	 * a cookie until after we've removed it, we must have core scheduling
+	 * enabled here.
+	 */
+	SCHED_WARN_ON((p->core_cookie || cookie) && !sched_core_enabled(rq));
+
+	enqueued = sched_core_enqueued(p);
+	if (enqueued)
+		sched_core_dequeue(rq, p);
+
+	old_cookie = p->core_cookie;
+	p->core_cookie = cookie;
+
+	if (enqueued)
+		sched_core_enqueue(rq, p);
+
+	/*
+	 * If task is currently running, it may not be compatible anymore after
+	 * the cookie change, so enter the scheduler on its CPU to schedule it
+	 * away.
+	 */
+	if (task_running(rq, p))
+		resched_curr(rq);
+
+	task_rq_unlock(rq, p, &rf);
+
+	return old_cookie;
+}
+
+static unsigned long sched_core_clone_cookie(struct task_struct *p)
+{
+	unsigned long cookie, flags;
+
+	raw_spin_lock_irqsave(&p->pi_lock, flags);
+	cookie = sched_core_get_cookie(p->core_cookie);
+	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+
+	return cookie;
+}
+
+void sched_core_free(struct task_struct *p)
+{
+	sched_core_put_cookie(p->core_cookie);
+}
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1222,6 +1222,22 @@ static inline bool sched_group_cookie_ma
 
 extern void queue_core_balance(struct rq *rq);
 
+static inline bool sched_core_enqueued(struct task_struct *p)
+{
+	return !RB_EMPTY_NODE(&p->core_node);
+}
+
+extern void sched_core_enqueue(struct rq *rq, struct task_struct *p);
+extern void sched_core_dequeue(struct rq *rq, struct task_struct *p);
+
+extern void sched_core_get(void);
+extern void sched_core_put(void);
+
+extern unsigned long sched_core_alloc_cookie(void);
+extern void sched_core_put_cookie(unsigned long cookie);
+extern unsigned long sched_core_get_cookie(unsigned long cookie);
+extern unsigned long sched_core_update_cookie(struct task_struct *p, unsigned long cookie);
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline bool sched_core_enabled(struct rq *rq)



^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH 17/19] sched: Inherit task cookie on fork()
  2021-04-22 12:04 [PATCH 00/19] sched: Core Scheduling Peter Zijlstra
                   ` (15 preceding siblings ...)
  2021-04-22 12:05 ` [PATCH 16/19] sched: Trivial core scheduling cookie management Peter Zijlstra
@ 2021-04-22 12:05 ` Peter Zijlstra
  2021-05-10 16:06   ` Joel Fernandes
  2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2021-04-22 12:05 ` [PATCH 18/19] sched: prctl() core-scheduling interface Peter Zijlstra
                   ` (5 subsequent siblings)
  22 siblings, 2 replies; 103+ messages in thread
From: Peter Zijlstra @ 2021-04-22 12:05 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tglx

Note that sched_core_fork() is called from under tasklist_lock, and
not from sched_fork() earlier. This avoids a few races later.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h     |    2 ++
 kernel/fork.c             |    3 +++
 kernel/sched/core_sched.c |    6 ++++++
 3 files changed, 11 insertions(+)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2172,8 +2172,10 @@ const struct cpumask *sched_trace_rd_spa
 
 #ifdef CONFIG_SCHED_CORE
 extern void sched_core_free(struct task_struct *tsk);
+extern void sched_core_fork(struct task_struct *p);
 #else
 static inline void sched_core_free(struct task_struct *tsk) { }
+static inline void sched_core_fork(struct task_struct *p) { }
 #endif
 
 #endif
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2249,6 +2249,8 @@ static __latent_entropy struct task_stru
 
 	klp_copy_process(p);
 
+	sched_core_fork(p);
+
 	spin_lock(&current->sighand->siglock);
 
 	/*
@@ -2336,6 +2338,7 @@ static __latent_entropy struct task_stru
 	return p;
 
 bad_fork_cancel_cgroup:
+	sched_core_free(p);
 	spin_unlock(&current->sighand->siglock);
 	write_unlock_irq(&tasklist_lock);
 	cgroup_cancel_fork(p, args);
--- a/kernel/sched/core_sched.c
+++ b/kernel/sched/core_sched.c
@@ -100,6 +100,12 @@ static unsigned long sched_core_clone_co
 	return cookie;
 }
 
+void sched_core_fork(struct task_struct *p)
+{
+	RB_CLEAR_NODE(&p->core_node);
+	p->core_cookie = sched_core_clone_cookie(current);
+}
+
 void sched_core_free(struct task_struct *p)
 {
 	sched_core_put_cookie(p->core_cookie);



^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH 18/19] sched: prctl() core-scheduling interface
  2021-04-22 12:04 [PATCH 00/19] sched: Core Scheduling Peter Zijlstra
                   ` (16 preceding siblings ...)
  2021-04-22 12:05 ` [PATCH 17/19] sched: Inherit task cookie on fork() Peter Zijlstra
@ 2021-04-22 12:05 ` Peter Zijlstra
  2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Chris Hyser
                     ` (3 more replies)
  2021-04-22 12:05 ` [PATCH 19/19] kselftest: Add test for core sched prctl interface Peter Zijlstra
                   ` (4 subsequent siblings)
  22 siblings, 4 replies; 103+ messages in thread
From: Peter Zijlstra @ 2021-04-22 12:05 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tglx

From: Chris Hyser <chris.hyser@oracle.com>

This patch provides support for setting and copying core scheduling
'task cookies' between threads (PID), processes (TGID), and process
groups (PGID).

The value of core scheduling isn't that tasks don't share a core,
'nosmt' can do that. The value lies in exploiting all the sharing
opportunities that exist to recover possible lost performance and that
requires a degree of flexibility in the API.

From a security perspective (and there are others), the thread,
process and process group distinction is an existing hierarchical
categorization of tasks that reflects many of the security concerns
about 'data sharing'. For example, protecting against cache-snooping
by a thread that can just read the memory directly isn't all that
useful.

With this in mind, the CREATE/SHARE (TO/FROM) subcommands provide a
mechanism to create and share cookies. CREATE/SHARE_TO take a target
pid, with enum pidtype used to specify the scope of the targeted
tasks. For example, PIDTYPE_TGID will share the cookie with the
process and all of its threads, as is typically desired in a security
scenario.

API:

  prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, tgtpid, pidtype, &cookie)
  prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, tgtpid, pidtype, NULL)
  prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, tgtpid, pidtype, NULL)
  prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, srcpid, pidtype, NULL)

where 'tgtpid/srcpid == 0' implies the current process and pidtype is
kernel enum pid_type {PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, ...}.

For return values, EINVAL and ENOMEM mean what they usually do. ESRCH
means the tgtpid/srcpid was not found. EPERM indicates a lack of PTRACE
permission to access tgtpid/srcpid. ENODEV indicates your machine lacks
SMT.
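
A minimal user-space example of the API (error handling is reduced to
perror(); the PIDTYPE_* constants are a local copy of the kernel's
enum pid_type values described above, and the PR_SCHED_CORE_* fallback
defines mirror the uapi header added by this patch):

	#include <stdio.h>
	#include <sys/prctl.h>

	#ifndef PR_SCHED_CORE
	# define PR_SCHED_CORE			60
	# define PR_SCHED_CORE_GET		0
	# define PR_SCHED_CORE_CREATE		1
	# define PR_SCHED_CORE_SHARE_TO	2
	# define PR_SCHED_CORE_SHARE_FROM	3
	#endif

	enum pid_type { PIDTYPE_PID = 0, PIDTYPE_TGID, PIDTYPE_PGID };

	int main(void)
	{
		unsigned long long cookie = 0;

		/* pid == 0: operate on the calling process; TGID scope here. */
		if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0, PIDTYPE_TGID, 0))
			perror("PR_SCHED_CORE_CREATE");

		/* GET only accepts PIDTYPE_PID and wants a u64-aligned pointer. */
		if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, 0, PIDTYPE_PID,
			  (unsigned long)&cookie))
			perror("PR_SCHED_CORE_GET");
		else
			printf("cookie id: %llx\n", cookie);

		return 0;
	}

Note that GET reports a hashed identifier derived from the cookie, useful
only for comparing whether two tasks share a cookie.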

[peterz: complete rewrite]
Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h            |    2 
 include/uapi/linux/prctl.h       |    8 ++
 kernel/sched/core_sched.c        |  114 +++++++++++++++++++++++++++++++++++++++
 kernel/sys.c                     |    5 +
 tools/include/uapi/linux/prctl.h |    8 ++
 5 files changed, 137 insertions(+)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2173,6 +2173,8 @@ const struct cpumask *sched_trace_rd_spa
 #ifdef CONFIG_SCHED_CORE
 extern void sched_core_free(struct task_struct *tsk);
 extern void sched_core_fork(struct task_struct *p);
+extern int sched_core_share_pid(unsigned int cmd, pid_t pid, enum pid_type type,
+				unsigned long uaddr);
 #else
 static inline void sched_core_free(struct task_struct *tsk) { }
 static inline void sched_core_fork(struct task_struct *p) { }
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -255,4 +255,12 @@ struct prctl_mm_map {
 # define SYSCALL_DISPATCH_FILTER_ALLOW	0
 # define SYSCALL_DISPATCH_FILTER_BLOCK	1
 
+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE			60
+# define PR_SCHED_CORE_GET		0
+# define PR_SCHED_CORE_CREATE		1 /* create unique core_sched cookie */
+# define PR_SCHED_CORE_SHARE_TO		2 /* push core_sched cookie to pid */
+# define PR_SCHED_CORE_SHARE_FROM	3 /* pull core_sched cookie to pid */
+# define PR_SCHED_CORE_MAX		4
+
 #endif /* _LINUX_PRCTL_H */
--- a/kernel/sched/core_sched.c
+++ b/kernel/sched/core_sched.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0-only
 
+#include <linux/prctl.h>
 #include "sched.h"
 
 /*
@@ -110,3 +111,116 @@ void sched_core_free(struct task_struct
 {
 	sched_core_put_cookie(p->core_cookie);
 }
+
+static void __sched_core_set(struct task_struct *p, unsigned long cookie)
+{
+	cookie = sched_core_get_cookie(cookie);
+	cookie = sched_core_update_cookie(p, cookie);
+	sched_core_put_cookie(cookie);
+}
+
+/* Called from prctl interface: PR_SCHED_CORE */
+int sched_core_share_pid(unsigned int cmd, pid_t pid, enum pid_type type,
+			 unsigned long uaddr)
+{
+	unsigned long cookie = 0, id = 0;
+	struct task_struct *task, *p;
+	struct pid *grp;
+	int err = 0;
+
+	if (!static_branch_likely(&sched_smt_present))
+		return -ENODEV;
+
+	if (type > PIDTYPE_PGID || cmd >= PR_SCHED_CORE_MAX || pid < 0 ||
+	    (cmd != PR_SCHED_CORE_GET && uaddr))
+		return -EINVAL;
+
+	rcu_read_lock();
+	if (pid == 0) {
+		task = current;
+	} else {
+		task = find_task_by_vpid(pid);
+		if (!task) {
+			rcu_read_unlock();
+			return -ESRCH;
+		}
+	}
+	get_task_struct(task);
+	rcu_read_unlock();
+
+	/*
+	 * Check if this process has the right to modify the specified
+	 * process. Use the regular "ptrace_may_access()" checks.
+	 */
+	if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
+		err = -EPERM;
+		goto out;
+	}
+
+	switch (cmd) {
+	case PR_SCHED_CORE_GET:
+		if (type != PIDTYPE_PID || uaddr & 7) {
+			err = -EINVAL;
+			goto out;
+		}
+		cookie = sched_core_clone_cookie(task);
+		if (cookie) {
+			/* XXX improve ? */
+			ptr_to_hashval((void *)cookie, &id);
+		}
+		err = put_user(id, (u64 __user *)uaddr);
+		goto out;
+
+	case PR_SCHED_CORE_CREATE:
+		cookie = sched_core_alloc_cookie();
+		if (!cookie) {
+			err = -ENOMEM;
+			goto out;
+		}
+		break;
+
+	case PR_SCHED_CORE_SHARE_TO:
+		cookie = sched_core_clone_cookie(current);
+		break;
+
+	case PR_SCHED_CORE_SHARE_FROM:
+		if (type != PIDTYPE_PID) {
+			err = -EINVAL;
+			goto out;
+		}
+		cookie = sched_core_clone_cookie(task);
+		__sched_core_set(current, cookie);
+		goto out;
+
+	default:
+		err = -EINVAL;
+		goto out;
+	};
+
+	if (type == PIDTYPE_PID) {
+		__sched_core_set(task, cookie);
+		goto out;
+	}
+
+	read_lock(&tasklist_lock);
+	grp = task_pid_type(task, type);
+
+	do_each_pid_thread(grp, type, p) {
+		if (!ptrace_may_access(p, PTRACE_MODE_READ_REALCREDS)) {
+			err = -EPERM;
+			goto out_tasklist;
+		}
+	} while_each_pid_thread(grp, type, p);
+
+	do_each_pid_thread(grp, type, p) {
+		__sched_core_set(p, cookie);
+	} while_each_pid_thread(grp, type, p);
+out_tasklist:
+	read_unlock(&tasklist_lock);
+
+out:
+	sched_core_put_cookie(cookie);
+	put_task_struct(task);
+	return err;
+}
+
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2534,6 +2534,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 		error = set_syscall_user_dispatch(arg2, arg3, arg4,
 						  (char __user *) arg5);
 		break;
+#ifdef CONFIG_SCHED_CORE
+	case PR_SCHED_CORE:
+		error = sched_core_share_pid(arg2, arg3, arg4, arg5);
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -255,4 +255,12 @@ struct prctl_mm_map {
 # define SYSCALL_DISPATCH_FILTER_ALLOW	0
 # define SYSCALL_DISPATCH_FILTER_BLOCK	1
 
+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE			60
+# define PR_SCHED_CORE_GET		0
+# define PR_SCHED_CORE_CREATE		1 /* create unique core_sched cookie */
+# define PR_SCHED_CORE_SHARE_TO		2 /* push core_sched cookie to pid */
+# define PR_SCHED_CORE_SHARE_FROM	3 /* pull core_sched cookie to pid */
+# define PR_SCHED_CORE_MAX		4
+
 #endif /* _LINUX_PRCTL_H */



^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH 19/19] kselftest: Add test for core sched prctl interface
  2021-04-22 12:04 [PATCH 00/19] sched: Core Scheduling Peter Zijlstra
                   ` (17 preceding siblings ...)
  2021-04-22 12:05 ` [PATCH 18/19] sched: prctl() core-scheduling interface Peter Zijlstra
@ 2021-04-22 12:05 ` Peter Zijlstra
  2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Chris Hyser
  2021-04-22 16:43 ` [PATCH 00/19] sched: Core Scheduling Don Hiatt
                   ` (3 subsequent siblings)
  22 siblings, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2021-04-22 12:05 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, peterz, tglx

From: Chris Hyser <chris.hyser@oracle.com>

Provides a selftest and examples of using the interface.

[peterz: updated to not use sched_debug]
Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 tools/testing/selftests/sched/.gitignore      |    1 
 tools/testing/selftests/sched/Makefile        |   14 +
 tools/testing/selftests/sched/config          |    1 
 tools/testing/selftests/sched/cs_prctl_test.c |  338 ++++++++++++++++++++++++++
 4 files changed, 354 insertions(+)
 create mode 100644 tools/testing/selftests/sched/.gitignore
 create mode 100644 tools/testing/selftests/sched/Makefile
 create mode 100644 tools/testing/selftests/sched/config
 create mode 100644 tools/testing/selftests/sched/cs_prctl_test.c

--- /dev/null
+++ b/tools/testing/selftests/sched/.gitignore
@@ -0,0 +1 @@
+cs_prctl_test
--- /dev/null
+++ b/tools/testing/selftests/sched/Makefile
@@ -0,0 +1,14 @@
+# SPDX-License-Identifier: GPL-2.0+
+
+ifneq ($(shell $(CC) --version 2>&1 | head -n 1 | grep clang),)
+CLANG_FLAGS += -no-integrated-as
+endif
+
+CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/  -Wl,-rpath=./ \
+	  $(CLANG_FLAGS)
+LDLIBS += -lpthread
+
+TEST_GEN_FILES := cs_prctl_test
+TEST_PROGS := cs_prctl_test
+
+include ../lib.mk
--- /dev/null
+++ b/tools/testing/selftests/sched/config
@@ -0,0 +1 @@
+CONFIG_SCHED_DEBUG=y
--- /dev/null
+++ b/tools/testing/selftests/sched/cs_prctl_test.c
@@ -0,0 +1,338 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Use the core scheduling prctl() to test core scheduling cookies control.
+ *
+ * Copyright (c) 2021 Oracle and/or its affiliates.
+ * Author: Chris Hyser <chris.hyser@oracle.com>
+ *
+ *
+ * This library is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This library is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License
+ * for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public License
+ * along with this library; if not, see <http://www.gnu.org/licenses>.
+ */
+
+#define _GNU_SOURCE
+#include <sys/eventfd.h>
+#include <sys/wait.h>
+#include <sys/types.h>
+#include <sched.h>
+#include <sys/prctl.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include <time.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#if __GLIBC_PREREQ(2, 30) == 0
+#include <sys/syscall.h>
+static pid_t gettid(void)
+{
+	return syscall(SYS_gettid);
+}
+#endif
+
+#ifndef PR_SCHED_CORE
+#define PR_SCHED_CORE			60
+# define PR_SCHED_CORE_GET		0
+# define PR_SCHED_CORE_CREATE		1 /* create unique core_sched cookie */
+# define PR_SCHED_CORE_SHARE_TO		2 /* push core_sched cookie to pid */
+# define PR_SCHED_CORE_SHARE_FROM	3 /* pull core_sched cookie to pid */
+# define PR_SCHED_CORE_MAX		4
+#endif
+
+#define MAX_PROCESSES 128
+#define MAX_THREADS   128
+
+static const char USAGE[] = "cs_prctl_test [options]\n"
+"    options:\n"
+"	-P  : number of processes to create.\n"
+"	-T  : number of threads per process to create.\n"
+"	-d  : delay time to keep tasks alive.\n"
+"	-k  : keep tasks alive until keypress.\n";
+
+enum pid_type {PIDTYPE_PID = 0, PIDTYPE_TGID, PIDTYPE_PGID};
+
+const int THREAD_CLONE_FLAGS = CLONE_THREAD | CLONE_SIGHAND | CLONE_FS | CLONE_VM | CLONE_FILES;
+
+static int _prctl(int option, unsigned long arg2, unsigned long arg3, unsigned long arg4,
+		  unsigned long arg5)
+{
+	int res;
+
+	res = prctl(option, arg2, arg3, arg4, arg5);
+	printf("%d = prctl(%d, %ld, %ld, %ld, %lx)\n", res, option, (long)arg2, (long)arg3,
+	       (long)arg4, arg5);
+	return res;
+}
+
+#define STACK_SIZE (1024 * 1024)
+
+#define handle_error(msg) __handle_error(__FILE__, __LINE__, msg)
+static void __handle_error(char *fn, int ln, char *msg)
+{
+	printf("(%s:%d) - ", fn, ln);
+	perror(msg);
+	exit(EXIT_FAILURE);
+}
+
+static void handle_usage(int rc, char *msg)
+{
+	puts(USAGE);
+	puts(msg);
+	putchar('\n');
+	exit(rc);
+}
+
+static unsigned long get_cs_cookie(int pid)
+{
+	unsigned long long cookie;
+	int ret;
+
+	ret = prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, pid, PIDTYPE_PID,
+		    (unsigned long)&cookie);
+	if (ret) {
+		printf("Not a core sched system\n");
+		return -1UL;
+	}
+
+	return cookie;
+}
+
+struct child_args {
+	int num_threads;
+	int pfd[2];
+	int cpid;
+	int thr_tids[MAX_THREADS];
+};
+
+static int child_func_thread(void __attribute__((unused))*arg)
+{
+	while (1)
+		usleep(20000);
+	return 0;
+}
+
+static void create_threads(int num_threads, int thr_tids[])
+{
+	void *child_stack;
+	pid_t tid;
+	int i;
+
+	for (i = 0; i < num_threads; ++i) {
+		child_stack = malloc(STACK_SIZE);
+		if (!child_stack)
+			handle_error("child stack allocate");
+
+		tid = clone(child_func_thread, child_stack + STACK_SIZE, THREAD_CLONE_FLAGS, NULL);
+		if (tid == -1)
+			handle_error("clone thread");
+		thr_tids[i] = tid;
+	}
+}
+
+static int child_func_process(void *arg)
+{
+	struct child_args *ca = (struct child_args *)arg;
+
+	close(ca->pfd[0]);
+
+	create_threads(ca->num_threads, ca->thr_tids);
+
+	write(ca->pfd[1], &ca->thr_tids, sizeof(int) * ca->num_threads);
+	close(ca->pfd[1]);
+
+	while (1)
+		usleep(20000);
+	return 0;
+}
+
+static unsigned char child_func_process_stack[STACK_SIZE];
+
+void create_processes(int num_processes, int num_threads, struct child_args proc[])
+{
+	pid_t cpid;
+	int i;
+
+	for (i = 0; i < num_processes; ++i) {
+		proc[i].num_threads = num_threads;
+
+		if (pipe(proc[i].pfd) == -1)
+			handle_error("pipe() failed");
+
+		cpid = clone(child_func_process, child_func_process_stack + STACK_SIZE,
+			     SIGCHLD, &proc[i]);
+		proc[i].cpid = cpid;
+		close(proc[i].pfd[1]);
+	}
+
+	for (i = 0; i < num_processes; ++i) {
+		read(proc[i].pfd[0], &proc[i].thr_tids, sizeof(int) * proc[i].num_threads);
+		close(proc[i].pfd[0]);
+	}
+}
+
+void disp_processes(int num_processes, struct child_args proc[])
+{
+	int i, j;
+
+	printf("tid=%d, / tgid=%d / pgid=%d: %lx\n", gettid(), getpid(), getpgid(0),
+	       get_cs_cookie(getpid()));
+
+	for (i = 0; i < num_processes; ++i) {
+		printf("    tid=%d, / tgid=%d / pgid=%d: %lx\n", proc[i].cpid, proc[i].cpid,
+		       getpgid(proc[i].cpid), get_cs_cookie(proc[i].cpid));
+		for (j = 0; j < proc[i].num_threads; ++j) {
+			printf("        tid=%d, / tgid=%d / pgid=%d: %lx\n", proc[i].thr_tids[j],
+			       proc[i].cpid, getpgid(0), get_cs_cookie(proc[i].thr_tids[j]));
+		}
+	}
+	puts("\n");
+}
+
+static int errors;
+
+#define validate(v) _validate(__LINE__, v, #v)
+void _validate(int line, int val, char *msg)
+{
+	if (!val) {
+		++errors;
+		printf("(%d) FAILED: %s\n", line, msg);
+	} else {
+		printf("(%d) PASSED: %s\n", line, msg);
+	}
+}
+
+int main(int argc, char *argv[])
+{
+	struct child_args procs[MAX_PROCESSES];
+
+	int keypress = 0;
+	int num_processes = 2;
+	int num_threads = 3;
+	int delay = 0;
+	int res = 0;
+	int pidx;
+	int pid;
+	int opt;
+
+	while ((opt = getopt(argc, argv, ":hkT:P:d:")) != -1) {
+		switch (opt) {
+		case 'P':
+			num_processes = (int)strtol(optarg, NULL, 10);
+			break;
+		case 'T':
+			num_threads = (int)strtoul(optarg, NULL, 10);
+			break;
+		case 'd':
+			delay = (int)strtol(optarg, NULL, 10);
+			break;
+		case 'k':
+			keypress = 1;
+			break;
+		case 'h':
+			printf(USAGE);
+			exit(EXIT_SUCCESS);
+		default:
+			handle_usage(20, "unknown option");
+		}
+	}
+
+	if (num_processes < 1 || num_processes > MAX_PROCESSES)
+		handle_usage(1, "Bad processes value");
+
+	if (num_threads < 1 || num_threads > MAX_THREADS)
+		handle_usage(2, "Bad thread value");
+
+	if (keypress)
+		delay = -1;
+
+	srand(time(NULL));
+
+	/* put into separate process group */
+	if (setpgid(0, 0) != 0)
+		handle_error("process group");
+
+	printf("\n## Create a thread/process/process group hierarchy\n");
+	create_processes(num_processes, num_threads, procs);
+	disp_processes(num_processes, procs);
+	validate(get_cs_cookie(0) == 0);
+
+	printf("\n## Set a cookie on entire process group\n");
+	if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0, PIDTYPE_PGID, 0) < 0)
+		handle_error("core_sched create failed -- PGID");
+	disp_processes(num_processes, procs);
+
+	validate(get_cs_cookie(0) != 0);
+
+	/* get a random process pid */
+	pidx = rand() % num_processes;
+	pid = procs[pidx].cpid;
+
+	validate(get_cs_cookie(0) == get_cs_cookie(pid));
+	validate(get_cs_cookie(0) == get_cs_cookie(procs[pidx].thr_tids[0]));
+
+	printf("\n## Set a new cookie on entire process/TGID [%d]\n", pid);
+	if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, pid, PIDTYPE_TGID, 0) < 0)
+		handle_error("core_sched create failed -- TGID");
+	disp_processes(num_processes, procs);
+
+	validate(get_cs_cookie(0) != get_cs_cookie(pid));
+	validate(get_cs_cookie(pid) != 0);
+	validate(get_cs_cookie(pid) == get_cs_cookie(procs[pidx].thr_tids[0]));
+
+	printf("\n## Copy the cookie of current/PGID[%d], to pid [%d] as PIDTYPE_PID\n",
+	       getpid(), pid);
+	if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, pid, PIDTYPE_PID, 0) < 0)
+		handle_error("core_sched share to itself failed -- PID");
+	disp_processes(num_processes, procs);
+
+	validate(get_cs_cookie(0) == get_cs_cookie(pid));
+	validate(get_cs_cookie(pid) != 0);
+	validate(get_cs_cookie(pid) != get_cs_cookie(procs[pidx].thr_tids[0]));
+
+	printf("\n## Copy cookie from a thread [%d] to current/PGID [%d] as PIDTYPE_PID\n",
+	       procs[pidx].thr_tids[0], getpid());
+	if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, procs[pidx].thr_tids[0],
+		   PIDTYPE_PID, 0) < 0)
+		handle_error("core_sched share from thread failed -- PID");
+	disp_processes(num_processes, procs);
+
+	validate(get_cs_cookie(0) == get_cs_cookie(procs[pidx].thr_tids[0]));
+	validate(get_cs_cookie(pid) != get_cs_cookie(procs[pidx].thr_tids[0]));
+
+	printf("\n## Copy cookie from current [%d] to current as pidtype PGID\n", getpid());
+	if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, 0, PIDTYPE_PGID, 0) < 0)
+		handle_error("core_sched share to self failed -- PGID");
+	disp_processes(num_processes, procs);
+
+	validate(get_cs_cookie(0) == get_cs_cookie(pid));
+	validate(get_cs_cookie(pid) != 0);
+	validate(get_cs_cookie(pid) == get_cs_cookie(procs[pidx].thr_tids[0]));
+
+	if (errors) {
+		printf("TESTS FAILED. errors: %d\n", errors);
+		res = 10;
+	} else {
+		printf("SUCCESS !!!\n");
+	}
+
+	if (keypress)
+		getchar();
+	else
+		sleep(delay);
+
+	for (pidx = 0; pidx < num_processes; ++pidx)
+		kill(procs[pidx].cpid, 15);
+
+	return res;
+}



^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 00/19] sched: Core Scheduling
  2021-04-22 12:04 [PATCH 00/19] sched: Core Scheduling Peter Zijlstra
                   ` (18 preceding siblings ...)
  2021-04-22 12:05 ` [PATCH 19/19] kselftest: Add test for core sched prctl interface Peter Zijlstra
@ 2021-04-22 16:43 ` Don Hiatt
  2021-04-22 17:29   ` Peter Zijlstra
  2021-04-30  6:47 ` Ning, Hongyu
                   ` (2 subsequent siblings)
  22 siblings, 1 reply; 103+ messages in thread
From: Don Hiatt @ 2021-04-22 16:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes, Hyser,Chris, Josh Don, mingo, vincent.guittot,
	valentin.schneider, mgorman, linux-kernel, tglx

On Thu, Apr 22, 2021 at 5:37 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Hai,
>
> This is an agressive fold of all the core-scheduling work so far. I've stripped
> a whole bunch of tags along the way (hopefully not too many, please yell if you
> feel I made a mistake), including tested-by. Please retest.
>
> Changes since the last partial post is dropping all the cgroup stuff and
> PR_SCHED_CORE_CLEAR as well as that exec() behaviour in order to later resolve
> the cgroup issue.
>
Hi Peter,

Is there a reason that PR_SCHED_CORE_CLEAR got removed? It's handy to
be able to clear cookies.

Thanks,

Don

> Since we're really rather late for the coming merge window, my plan was to
> merge the lot right after the merge window.
>
> Again, please test.
>
> These patches should shortly be available in my queue.git.
>
> ---
>  b/kernel/sched/core_sched.c                     |  229 ++++++
>  b/tools/testing/selftests/sched/.gitignore      |    1
>  b/tools/testing/selftests/sched/Makefile        |   14
>  b/tools/testing/selftests/sched/config          |    1
>  b/tools/testing/selftests/sched/cs_prctl_test.c |  338 +++++++++
>  include/linux/sched.h                           |   19
>  include/uapi/linux/prctl.h                      |    8
>  kernel/Kconfig.preempt                          |    6
>  kernel/fork.c                                   |    4
>  kernel/sched/Makefile                           |    1
>  kernel/sched/core.c                             |  858 ++++++++++++++++++++++--
>  kernel/sched/cpuacct.c                          |   12
>  kernel/sched/deadline.c                         |   38 -
>  kernel/sched/debug.c                            |    4
>  kernel/sched/fair.c                             |  276 +++++--
>  kernel/sched/idle.c                             |   13
>  kernel/sched/pelt.h                             |    2
>  kernel/sched/rt.c                               |   31
>  kernel/sched/sched.h                            |  393 ++++++++--
>  kernel/sched/stop_task.c                        |   14
>  kernel/sched/topology.c                         |    4
>  kernel/sys.c                                    |    5
>  tools/include/uapi/linux/prctl.h                |    8
>  23 files changed, 2057 insertions(+), 222 deletions(-)
>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 00/19] sched: Core Scheduling
  2021-04-22 16:43 ` [PATCH 00/19] sched: Core Scheduling Don Hiatt
@ 2021-04-22 17:29   ` Peter Zijlstra
  0 siblings, 0 replies; 103+ messages in thread
From: Peter Zijlstra @ 2021-04-22 17:29 UTC (permalink / raw)
  To: Don Hiatt
  Cc: Joel Fernandes, Hyser,Chris, Josh Don, mingo, vincent.guittot,
	valentin.schneider, mgorman, linux-kernel, tglx

On Thu, Apr 22, 2021 at 09:43:59AM -0700, Don Hiatt wrote:
> On Thu, Apr 22, 2021 at 5:37 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Hai,
> >
> > This is an agressive fold of all the core-scheduling work so far. I've stripped
> > a whole bunch of tags along the way (hopefully not too many, please yell if you
> > feel I made a mistake), including tested-by. Please retest.
> >
> > Changes since the last partial post is dropping all the cgroup stuff and
> > PR_SCHED_CORE_CLEAR as well as that exec() behaviour in order to later resolve
> > the cgroup issue.
> >
> Hi Peter,
> 
> Is there a reason that PR_SCHED_CORE_CLEAR got removed? It's handy to
> be able to clear cookies.

I agree, but if we're forced to do cgroup with cookie inherit, then
CLEAR allows a trivial escape.

Until that whole cgroup thing is sorted, it is easiest to not have
CLEAR, lest we have to change/augment CLEAR semantics in unfortunate
ways, which is always harder once an API is out there.

Also, see the entire previous discussion here:

  https://lore.kernel.org/lkml/20210401131012.395311786@infradead.org/

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-22 12:05 ` [PATCH 04/19] sched: Prepare for Core-wide rq->lock Peter Zijlstra
@ 2021-04-24  1:22   ` Josh Don
  2021-04-26  8:31     ` Peter Zijlstra
  2021-04-28  6:02   ` Aubrey Li
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 103+ messages in thread
From: Josh Don @ 2021-04-24  1:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes, Hyser,Chris, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Mel Gorman, linux-kernel, Thomas Gleixner

Hi Peter,

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -186,12 +186,37 @@ int sysctl_sched_rt_runtime = 950000;
>
>  void raw_spin_rq_lock_nested(struct rq *rq, int subclass)
>  {
> -       raw_spin_lock_nested(rq_lockp(rq), subclass);
> +       raw_spinlock_t *lock;
> +
> +       if (sched_core_disabled()) {

Nothing to stop sched_core from being enabled right here? Leading to
us potentially taking the wrong lock.

> +               raw_spin_lock_nested(&rq->__lock, subclass);
> +               return;
> +       }
> +
> +       for (;;) {
> +               lock = rq_lockp(rq);
> +               raw_spin_lock_nested(lock, subclass);
> +               if (likely(lock == rq_lockp(rq)))
> +                       return;
> +               raw_spin_unlock(lock);
> +       }
>  }

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-24  1:22   ` Josh Don
@ 2021-04-26  8:31     ` Peter Zijlstra
  2021-04-26 22:21       ` Josh Don
  0 siblings, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2021-04-26  8:31 UTC (permalink / raw)
  To: Josh Don
  Cc: Joel Fernandes, Hyser,Chris, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Mel Gorman, linux-kernel, Thomas Gleixner

On Fri, Apr 23, 2021 at 06:22:52PM -0700, Josh Don wrote:
> Hi Peter,
> 
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -186,12 +186,37 @@ int sysctl_sched_rt_runtime = 950000;
> >
> >  void raw_spin_rq_lock_nested(struct rq *rq, int subclass)
> >  {
> > -       raw_spin_lock_nested(rq_lockp(rq), subclass);
> > +       raw_spinlock_t *lock;
> > +
> > +       if (sched_core_disabled()) {
> 
> Nothing to stop sched_core from being enabled right here? Leading to
> us potentially taking the wrong lock.
> 
> > +               raw_spin_lock_nested(&rq->__lock, subclass);
> > +               return;
> > +       }
> > +
> > +       for (;;) {
> > +               lock = rq_lockp(rq);
> > +               raw_spin_lock_nested(lock, subclass);
> > +               if (likely(lock == rq_lockp(rq)))
> > +                       return;
> > +               raw_spin_unlock(lock);
> > +       }
> >  }

Very good; something like the below seems to be the best I can make of
it.

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f732642e3e09..1a81e9cc9e5d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -290,6 +290,10 @@ static void sched_core_assert_empty(void)
 static void __sched_core_enable(void)
 {
 	static_branch_enable(&__sched_core_enabled);
+	/*
+	 * Ensure raw_spin_rq_*lock*() have completed before flipping.
+	 */
+	synchronize_sched();
 	__sched_core_flip(true);
 	sched_core_assert_empty();
 }
@@ -449,16 +453,22 @@ void raw_spin_rq_lock_nested(struct rq *rq, int subclass)
 {
 	raw_spinlock_t *lock;
 
+	preempt_disable();
 	if (sched_core_disabled()) {
 		raw_spin_lock_nested(&rq->__lock, subclass);
+		/* preempt *MUST* still be disabled here */
+		preempt_enable_no_resched();
 		return;
 	}
 
 	for (;;) {
 		lock = __rq_lockp(rq);
 		raw_spin_lock_nested(lock, subclass);
-		if (likely(lock == __rq_lockp(rq)))
+		if (likely(lock == __rq_lockp(rq))) {
+			/* preempt *MUST* still be disabled here */
+			preempt_enable_no_resched();
 			return;
+		}
 		raw_spin_unlock(lock);
 	}
 }
@@ -468,14 +478,20 @@ bool raw_spin_rq_trylock(struct rq *rq)
 	raw_spinlock_t *lock;
 	bool ret;
 
-	if (sched_core_disabled())
-		return raw_spin_trylock(&rq->__lock);
+	preempt_disable();
+	if (sched_core_disabled()) {
+		ret = raw_spin_trylock(&rq->__lock);
+		preempt_enable();
+		return ret;
+	}
 
 	for (;;) {
 		lock = __rq_lockp(rq);
 		ret = raw_spin_trylock(lock);
-		if (!ret || (likely(lock == __rq_lockp(rq))))
+		if (!ret || (likely(lock == __rq_lockp(rq)))) {
+			preempt_enable();
 			return ret;
+		}
 		raw_spin_unlock(lock);
 	}
 }

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-26  8:31     ` Peter Zijlstra
@ 2021-04-26 22:21       ` Josh Don
  2021-04-27 17:10         ` Don Hiatt
                           ` (2 more replies)
  0 siblings, 3 replies; 103+ messages in thread
From: Josh Don @ 2021-04-26 22:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes, Hyser,Chris, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Mel Gorman, linux-kernel, Thomas Gleixner

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f732642e3e09..1a81e9cc9e5d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -290,6 +290,10 @@ static void sched_core_assert_empty(void)
>  static void __sched_core_enable(void)
>  {
>         static_branch_enable(&__sched_core_enabled);
> +       /*
> +        * Ensure raw_spin_rq_*lock*() have completed before flipping.
> +        */
> +       synchronize_sched();

synchronize_rcu()

>         __sched_core_flip(true);
>         sched_core_assert_empty();
>  }
> @@ -449,16 +453,22 @@ void raw_spin_rq_lock_nested(struct rq *rq, int subclass)
>  {
>         raw_spinlock_t *lock;
>
> +       preempt_disable();
>         if (sched_core_disabled()) {
>                 raw_spin_lock_nested(&rq->__lock, subclass);
> +               /* preempt *MUST* still be disabled here */
> +               preempt_enable_no_resched();
>                 return;
>         }

This approach looks good to me. I'm guessing you went this route
instead of doing the re-check after locking in order to optimize the
disabled case?

Recommend a comment that the preempt_disable() here pairs with the
synchronize_rcu() in __sched_core_enable().

>
>         for (;;) {
>                 lock = __rq_lockp(rq);
>                 raw_spin_lock_nested(lock, subclass);
> -               if (likely(lock == __rq_lockp(rq)))
> +               if (likely(lock == __rq_lockp(rq))) {
> +                       /* preempt *MUST* still be disabled here */
> +                       preempt_enable_no_resched();
>                         return;
> +               }
>                 raw_spin_unlock(lock);
>         }
>  }

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-26 22:21       ` Josh Don
@ 2021-04-27 17:10         ` Don Hiatt
  2021-04-27 23:35           ` Josh Don
  2021-04-27 23:30         ` Josh Don
  2021-04-28  7:13         ` Peter Zijlstra
  2 siblings, 1 reply; 103+ messages in thread
From: Don Hiatt @ 2021-04-27 17:10 UTC (permalink / raw)
  To: Josh Don
  Cc: Peter Zijlstra, Joel Fernandes, Hyser,Chris, Ingo Molnar,
	Vincent Guittot, Valentin Schneider, Mel Gorman, linux-kernel,
	Thomas Gleixner

On Mon, Apr 26, 2021 at 3:21 PM Josh Don <joshdon@google.com> wrote:
>
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index f732642e3e09..1a81e9cc9e5d 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -290,6 +290,10 @@ static void sched_core_assert_empty(void)
> >  static void __sched_core_enable(void)
> >  {
> >         static_branch_enable(&__sched_core_enabled);
> > +       /*
> > +        * Ensure raw_spin_rq_*lock*() have completed before flipping.
> > +        */
> > +       synchronize_sched();
>
> synchronize_rcu()
>
> >         __sched_core_flip(true);
> >         sched_core_assert_empty();
> >  }
> > @@ -449,16 +453,22 @@ void raw_spin_rq_lock_nested(struct rq *rq, int subclass)
> >  {
> >         raw_spinlock_t *lock;
> >
> > +       preempt_disable();
> >         if (sched_core_disabled()) {
> >                 raw_spin_lock_nested(&rq->__lock, subclass);
> > +               /* preempt *MUST* still be disabled here */
> > +               preempt_enable_no_resched();
> >                 return;
> >         }
>
> This approach looks good to me. I'm guessing you went this route
> instead of doing the re-check after locking in order to optimize the
> disabled case?
>
Hi Josh and Peter,

I've been running into soft lockups and hard lockups when running a script
that just cycles setting the cookie of a group of processes over and over again.

Unfortunately the only way I can reproduce this is by setting the cookies
on qemu. I've tried sysbench, stress-ng but those seem to work just fine.

I'm running Peter's branch and even tried the suggested changes here but
still see the same behavior. I enabled panic on hard lockup and here below
is a snippet of the log.

Is there anything you'd like me to try or have any debugging you'd like me to
do? I'd certainly like to get to the bottom of this.

Thanks and have a great day,

Don

[ 1164.129706] NMI backtrace for cpu 10
[ 1164.129707] CPU: 10 PID: 6994 Comm: CPU 2/KVM Kdump: loaded
Tainted: G          I  L    5.12.0-rc8+ #4
[ 1164.129707] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS
2.9.4 11/06/2020
[ 1164.129708] RIP: 0010:native_queued_spin_lock_slowpath+0x69/0x1e0
[ 1164.129708] Code: 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09
d0 a9 00 01 ff ff 75 20 85 c0 75 0c b8 01 00 00 00 66 89 07 5d c3 f3
90 8b 07 <84> c0 75 f8 b8 01 00 00 00 66 89 07 eb ec f6 c4 01 75 04 c6
47 01
[ 1164.129709] RSP: 0018:ffffb35d591e0e20 EFLAGS: 00000002
[ 1164.129710] RAX: 0000000000200101 RBX: ffff9ecb0096c180 RCX: 0000000000000001
[ 1164.129710] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9ecb0096c180
[ 1164.129711] RBP: ffffb35d591e0e20 R08: 0000000000000000 R09: 000000010002e2a6
[ 1164.129711] R10: 0000000000000000 R11: ffffb35d591e0ff8 R12: ffff9ecb0096c180
[ 1164.129712] R13: 000000000000000a R14: 000000000000000a R15: ffff9e6e8dc88000
[ 1164.129712] FS:  00007f91bd79f700(0000) GS:ffff9ecb00940000(0000)
knlGS:0000000000000000
[ 1164.129713] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1164.129713] CR2: 000000000255d05c CR3: 000000022eb8c005 CR4: 00000000007726e0
[ 1164.129714] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1164.129715] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1164.129715] PKRU: 55555554
[ 1164.129715] Call Trace:
[ 1164.129716]  <IRQ>
[ 1164.129716]  _raw_spin_lock+0x1f/0x30
[ 1164.129717]  raw_spin_rq_lock_nested.part.93+0x2b/0x60
[ 1164.129717]  raw_spin_rq_lock_nested.constprop.126+0x1a/0x20
[ 1164.129718]  scheduler_tick+0x4c/0x240
[ 1164.129718]  ? tick_sched_handle.isra.17+0x60/0x60
[ 1164.129719]  update_process_times+0xab/0xc0
[ 1164.129719]  tick_sched_handle.isra.17+0x21/0x60
[ 1164.129720]  tick_sched_timer+0x6b/0x80
[ 1164.129720]  __hrtimer_run_queues+0x10d/0x230
[ 1164.129721]  hrtimer_interrupt+0xe7/0x230
[ 1164.129721]  __sysvec_apic_timer_interrupt+0x62/0x100
[ 1164.129721]  sysvec_apic_timer_interrupt+0x51/0xa0
[ 1164.129722]  </IRQ>
[ 1164.129722]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 1164.129722] RIP: 0010:smp_call_function_single+0xf6/0x130
[ 1164.129723] Code: 00 75 4d 48 83 c4 48 41 5a 5d 49 8d 62 f8 c3 48
89 75 c0 48 8d 75 b0 48 89 55 c8 e8 84 fe ff ff 8b 55 b8 83 e2 01 74
0a f3 90 <8b> 55 b8 83 e2 01 75 f6 eb c0 8b 05 4a d8 b0 01 85 c0 0f 85
67 ff
[ 1164.129724] RSP: 0018:ffffb35d5e20fb20 EFLAGS: 00000202
[ 1164.129725] RAX: 0000000000000000 RBX: ffff9e6e8df12600 RCX: ffffb35d5f2ffcc0
[ 1164.129725] RDX: 0000000000000001 RSI: ffffb35d5e20fb20 RDI: ffffb35d5e20fb20
[ 1164.129725] RBP: ffffb35d5e20fb70 R08: 0000000000001000 R09: 0000000000001907
[ 1164.129726] R10: ffffb35d5e20fb98 R11: 00000000003d0900 R12: 000000000000000a
[ 1164.129726] R13: 000000000000000c R14: 0000000000000000 R15: 0000000000000000
[ 1164.129727]  ? vmclear_error+0x150/0x390 [kvm_intel]
[ 1164.129727]  vmx_vcpu_load_vmcs+0x1b2/0x330 [kvm_intel]
[ 1164.129728]  ? update_load_avg+0x1b8/0x640
[ 1164.129728]  ? vmx_vcpu_load_vmcs+0x1b2/0x330 [kvm_intel]
[ 1164.129728]  ? load_fixmap_gdt+0x23/0x30
[ 1164.129729]  vmx_vcpu_load_vmcs+0x309/0x330 [kvm_intel]
[ 1164.129729]  kvm_arch_vcpu_load+0x41/0x260 [kvm]
[ 1164.129729]  ? __switch_to_xtra+0xe3/0x4e0
[ 1164.129730]  kvm_io_bus_write+0xbc/0xf0 [kvm]
[ 1164.129730]  ? kvm_io_bus_write+0xbc/0xf0 [kvm]
[ 1164.129730]  finish_task_switch+0x184/0x290
[ 1164.129731]  __schedule+0x30a/0x12f0
[ 1164.129731]  schedule+0x40/0xb0
[ 1164.129732]  kvm_vcpu_block+0x57/0x4b0 [kvm]
[ 1164.129732]  kvm_arch_vcpu_ioctl_run+0x3fe/0x1b60 [kvm]
[ 1164.129732]  kvm_vcpu_kick+0x59c/0xa10 [kvm]
[ 1164.129733]  ? kvm_vcpu_kick+0x59c/0xa10 [kvm]
[ 1164.129733]  __x64_sys_ioctl+0x96/0xd0
[ 1164.129733]  do_syscall_64+0x37/0x80
[ 1164.129734]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1164.129734] RIP: 0033:0x7fb1c8a4c317
[ 1164.129735] Code: b3 66 90 48 8b 05 71 4b 2d 00 64 c7 00 26 00 00
00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00
00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 4b 2d 00 f7 d8 64 89
01 48
[ 1164.129735] RSP: 002b:00007f91bd79e7e8 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[ 1164.129736] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007fb1c8a4c317
[ 1164.129737] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000001a
[ 1164.129737] RBP: 0000000000000000 R08: 000055f3e6d027c0 R09: 00007f91a80040b8
[ 1164.129738] R10: 00007f91a80050a0 R11: 0000000000000246 R12: 00007fb1c4023010
[ 1164.129739] R13: 000055f3e569d880 R14: 00007fb1cb8e7000 R15: 00007ffd86806f70


> Recommend a comment that the preempt_disable() here pairs with the
> synchronize_rcu() in __sched_core_enable().
>
> >
> >         for (;;) {
> >                 lock = __rq_lockp(rq);
> >                 raw_spin_lock_nested(lock, subclass);
> > -               if (likely(lock == __rq_lockp(rq)))
> > +               if (likely(lock == __rq_lockp(rq))) {
> > +                       /* preempt *MUST* still be disabled here */
> > +                       preempt_enable_no_resched();
> >                         return;
> > +               }
> >                 raw_spin_unlock(lock);
> >         }
> >  }

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-26 22:21       ` Josh Don
  2021-04-27 17:10         ` Don Hiatt
@ 2021-04-27 23:30         ` Josh Don
  2021-04-28  9:13           ` Peter Zijlstra
  2021-04-28  7:13         ` Peter Zijlstra
  2 siblings, 1 reply; 103+ messages in thread
From: Josh Don @ 2021-04-27 23:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes, Hyser,Chris, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Mel Gorman, linux-kernel, Thomas Gleixner,
	dhiatt

On Mon, Apr 26, 2021 at 3:21 PM Josh Don <joshdon@google.com> wrote:
>
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index f732642e3e09..1a81e9cc9e5d 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -290,6 +290,10 @@ static void sched_core_assert_empty(void)
> >  static void __sched_core_enable(void)
> >  {
> >         static_branch_enable(&__sched_core_enabled);
> > +       /*
> > +        * Ensure raw_spin_rq_*lock*() have completed before flipping.
> > +        */
> > +       synchronize_sched();
>
> synchronize_rcu()
>
> >         __sched_core_flip(true);
> >         sched_core_assert_empty();
> >  }
> > @@ -449,16 +453,22 @@ void raw_spin_rq_lock_nested(struct rq *rq, int subclass)
> >  {
> >         raw_spinlock_t *lock;
> >
> > +       preempt_disable();
> >         if (sched_core_disabled()) {
> >                 raw_spin_lock_nested(&rq->__lock, subclass);
> > +               /* preempt *MUST* still be disabled here */
> > +               preempt_enable_no_resched();
> >                 return;
> >         }
>
> This approach looks good to me. I'm guessing you went this route
> instead of doing the re-check after locking in order to optimize the
> disabled case?
>
> Recommend a comment that the preempt_disable() here pairs with the
> synchronize_rcu() in __sched_core_enable().
>
> >
> >         for (;;) {
> >                 lock = __rq_lockp(rq);
> >                 raw_spin_lock_nested(lock, subclass);
> > -               if (likely(lock == __rq_lockp(rq)))
> > +               if (likely(lock == __rq_lockp(rq))) {
> > +                       /* preempt *MUST* still be disabled here */
> > +                       preempt_enable_no_resched();
> >                         return;
> > +               }
> >                 raw_spin_unlock(lock);
> >         }
> >  }

Also, did you mean to have a preempt_enable_no_resched() rather than
preempt_enable() in raw_spin_rq_trylock?

I went over the rq_lockp stuff again after Don's reported lockup. Most
uses are safe due to already holding an rq lock. However,
double_rq_unlock() is prone to race:

double_rq_unlock(rq1, rq2):
/* Initial state: core sched enabled, and rq1 and rq2 are smt
siblings. So, double_rq_lock(rq1, rq2) only took a single rq lock */
raw_spin_rq_unlock(rq1);
/* now not holding any rq lock */
/* sched core disabled. Now __rq_lockp(rq1) != __rq_lockp(rq2), so we
falsely unlock rq2 */
if (__rq_lockp(rq1) != __rq_lockp(rq2))
        raw_spin_rq_unlock(rq2);
else
        __release(rq2->lock);

Instead we can cache __rq_lockp(rq1) and __rq_lockp(rq2) before
releasing the lock, in order to prevent this. FWIW I think it is
likely that Don is seeing a different issue.
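
For illustration, one untested way of applying that caching could look
like the below (a sketch against the double_rq_unlock() in this series,
not a posted fix):

	static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
		__releases(rq1->lock)
		__releases(rq2->lock)
	{
		/* Sample both lock pointers while rq1's lock is still held. */
		raw_spinlock_t *lock1 = __rq_lockp(rq1);
		raw_spinlock_t *lock2 = __rq_lockp(rq2);

		raw_spin_rq_unlock(rq1);
		if (lock1 != lock2)
			raw_spin_rq_unlock(rq2);
		else
			__release(rq2->lock);
	}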

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-27 17:10         ` Don Hiatt
@ 2021-04-27 23:35           ` Josh Don
  2021-04-28  1:03             ` Aubrey Li
  2021-04-28 16:04             ` Don Hiatt
  0 siblings, 2 replies; 103+ messages in thread
From: Josh Don @ 2021-04-27 23:35 UTC (permalink / raw)
  To: Don Hiatt
  Cc: Peter Zijlstra, Joel Fernandes, Hyser,Chris, Ingo Molnar,
	Vincent Guittot, Valentin Schneider, Mel Gorman, linux-kernel,
	Thomas Gleixner

On Tue, Apr 27, 2021 at 10:10 AM Don Hiatt <dhiatt@digitalocean.com> wrote:
> Hi Josh and Peter,
>
> I've been running into soft lockups and hard lockups when running a script
> that just cycles setting the cookie of a group of processes over and over again.
>
> Unfortunately the only way I can reproduce this is by setting the cookies
> on qemu. I've tried sysbench, stress-ng but those seem to work just fine.
>
> I'm running Peter's branch and even tried the suggested changes here but
> still see the same behavior. I enabled panic on hard lockup and here below
> is a snippet of the log.
>
> Is there anything you'd like me to try or have any debugging you'd like me to
> do? I'd certainly like to get to the bottom of this.

Hi Don,

I tried to repro using qemu, but did not generate a lockup. Could you
provide more details on what your script is doing (or better yet,
share the script directly)? I would have expected you to potentially
hit a lockup if you were cycling sched_core being enabled and
disabled, but it sounds like you are just recreating the cookie for a
process group over and over?

Best,
Josh

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-27 23:35           ` Josh Don
@ 2021-04-28  1:03             ` Aubrey Li
  2021-04-28  6:05               ` Aubrey Li
  2021-04-28 16:04             ` Don Hiatt
  1 sibling, 1 reply; 103+ messages in thread
From: Aubrey Li @ 2021-04-28  1:03 UTC (permalink / raw)
  To: Josh Don
  Cc: Don Hiatt, Peter Zijlstra, Joel Fernandes, Hyser,Chris,
	Ingo Molnar, Vincent Guittot, Valentin Schneider, Mel Gorman,
	linux-kernel, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 1910 bytes --]

On Wed, Apr 28, 2021 at 7:36 AM Josh Don <joshdon@google.com> wrote:
>
> On Tue, Apr 27, 2021 at 10:10 AM Don Hiatt <dhiatt@digitalocean.com> wrote:
> > Hi Josh and Peter,
> >
> > I've been running into soft lockups and hard lockups when running a script
> > that just cycles setting the cookie of a group of processes over and over again.
> >
> > Unfortunately the only way I can reproduce this is by setting the cookies
> > on qemu. I've tried sysbench, stress-ng but those seem to work just fine.
> >
> > I'm running Peter's branch and even tried the suggested changes here but
> > still see the same behavior. I enabled panic on hard lockup and here below
> > is a snippet of the log.
> >
> > Is there anything you'd like me to try or have any debugging you'd like me to
> > do? I'd certainly like to get to the bottom of this.
>
> Hi Don,
>
> I tried to repro using qemu, but did not generate a lockup. Could you
> provide more details on what your script is doing (or better yet,
> share the script directly)? I would have expected you to potentially
> hit a lockup if you were cycling sched_core being enabled and
> disabled, but it sounds like you are just recreating the cookie for a
> process group over and over?
>

I saw something similar on a bare metal hardware. Also tried the suggested
patch here and no luck. Panic stack attached with
softlockup_all_cpu_backtrace=1.
(sorry, my system has 192 cpus and somehow putting 184 cpus offline causes
system hang without any message...)

My script created the core cookie for two different process groups.
The one is for sysbench cpu, the other is for sysbench mysql,
mysqld(cookie=0) is
also on the same machine. The number of tasks in each category is the same as
the number of CPUs on the system. And cookie is created just during task
startup.

Please let me know if the script is needed, I'll push it to github
with some cleanup.

Thanks,
-Aubrey

[-- Attachment #2: aubrey-ubuntu_2021-04-28_01-00-04.log --]
[-- Type: application/octet-stream, Size: 542032 bytes --]

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-22 12:05 ` [PATCH 04/19] sched: Prepare for Core-wide rq->lock Peter Zijlstra
  2021-04-24  1:22   ` Josh Don
@ 2021-04-28  6:02   ` Aubrey Li
  2021-04-29  8:03   ` Aubrey Li
  2021-05-07  9:50   ` [PATCH v2 " Peter Zijlstra
  3 siblings, 0 replies; 103+ messages in thread
From: Aubrey Li @ 2021-04-28  6:02 UTC (permalink / raw)
  To: Peter Zijlstra, joel, chris.hyser, joshdon, mingo,
	vincent.guittot, valentin.schneider, mgorman
  Cc: linux-kernel, tglx

On 4/22/21 8:05 PM, Peter Zijlstra wrote:
> When switching on core-sched, CPUs need to agree which lock to use for
> their RQ.
> 
> The new rule will be that rq->core_enabled will be toggled while
> holding all rq->__locks that belong to a core. This means we need to
> double check the rq->core_enabled value after each lock acquire and
> retry if it changed.
> 
> This also has implications for those sites that take multiple RQ
> locks, they need to be careful that the second lock doesn't end up
> being the first lock.
> 
> Verify the lock pointer after acquiring the first lock, because if
> they're on the same core, holding any of the rq->__lock instances will
> pin the core state.
> 
> While there, change the rq->__lock order to CPU number, instead of rq
> address, this greatly simplifies the next patch.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/core.c  |   48 ++++++++++++++++++++++++++++++++++++++++++++++--
>  kernel/sched/sched.h |   41 +++++++++++------------------------------
>  2 files changed, 57 insertions(+), 32 deletions(-)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -186,12 +186,37 @@ int sysctl_sched_rt_runtime = 950000;
>  
>  void raw_spin_rq_lock_nested(struct rq *rq, int subclass)
>  {
> -	raw_spin_lock_nested(rq_lockp(rq), subclass);
> +	raw_spinlock_t *lock;
> +
> +	if (sched_core_disabled()) {
> +		raw_spin_lock_nested(&rq->__lock, subclass);
> +		return;
> +	}
> +
> +	for (;;) {
> +		lock = rq_lockp(rq);
> +		raw_spin_lock_nested(lock, subclass);
> +		if (likely(lock == rq_lockp(rq)))
> +			return;
> +		raw_spin_unlock(lock);
> +	}
>  }
>  
>  bool raw_spin_rq_trylock(struct rq *rq)
>  {
> -	return raw_spin_trylock(rq_lockp(rq));
> +	raw_spinlock_t *lock;
> +	bool ret;
> +
> +	if (sched_core_disabled())
> +		return raw_spin_trylock(&rq->__lock);
> +
> +	for (;;) {
> +		lock = rq_lockp(rq);
> +		ret = raw_spin_trylock(lock);
> +		if (!ret || (likely(lock == rq_lockp(rq))))
> +			return ret;
> +		raw_spin_unlock(lock);
> +	}
>  }
>  
>  void raw_spin_rq_unlock(struct rq *rq)
> @@ -199,6 +224,25 @@ void raw_spin_rq_unlock(struct rq *rq)
>  	raw_spin_unlock(rq_lockp(rq));
>  }
>  
> +#ifdef CONFIG_SMP
> +/*
> + * double_rq_lock - safely lock two runqueues
> + */
> +void double_rq_lock(struct rq *rq1, struct rq *rq2)
> +{
> +	lockdep_assert_irqs_disabled();
> +
> +	if (rq1->cpu > rq2->cpu)
> +		swap(rq1, rq2);

I'm not sure why we swap the rq here instead of the rq lock? This swaps dst rq
and src rq and makes the subsequent logic wrong, at least in try_steal_cookie().

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-28  1:03             ` Aubrey Li
@ 2021-04-28  6:05               ` Aubrey Li
  2021-04-28 10:57                 ` Aubrey Li
  0 siblings, 1 reply; 103+ messages in thread
From: Aubrey Li @ 2021-04-28  6:05 UTC (permalink / raw)
  To: Aubrey Li, Josh Don
  Cc: Don Hiatt, Peter Zijlstra, Joel Fernandes, Hyser,Chris,
	Ingo Molnar, Vincent Guittot, Valentin Schneider, Mel Gorman,
	linux-kernel, Thomas Gleixner

On 4/28/21 9:03 AM, Aubrey Li wrote:
> On Wed, Apr 28, 2021 at 7:36 AM Josh Don <joshdon@google.com> wrote:
>>
>> On Tue, Apr 27, 2021 at 10:10 AM Don Hiatt <dhiatt@digitalocean.com> wrote:
>>> Hi Josh and Peter,
>>>
>>> I've been running into soft lockups and hard lockups when running a script
>>> that just cycles setting the cookie of a group of processes over and over again.
>>>
>>> Unfortunately the only way I can reproduce this is by setting the cookies
>>> on qemu. I've tried sysbench, stress-ng but those seem to work just fine.
>>>
>>> I'm running Peter's branch and even tried the suggested changes here but
>>> still see the same behavior. I enabled panic on hard lockup and here below
>>> is a snippet of the log.
>>>
>>> Is there anything you'd like me to try or have any debugging you'd like me to
>>> do? I'd certainly like to get to the bottom of this.
>>
>> Hi Don,
>>
>> I tried to repro using qemu, but did not generate a lockup. Could you
>> provide more details on what your script is doing (or better yet,
>> share the script directly)? I would have expected you to potentially
>> hit a lockup if you were cycling sched_core being enabled and
>> disabled, but it sounds like you are just recreating the cookie for a
>> process group over and over?
>>
> 
> I saw something similar on a bare metal hardware. Also tried the suggested
> patch here and no luck. Panic stack attached with
> softlockup_all_cpu_backtrace=1.
> (sorry, my system has 192 cpus and somehow putting 184 cpus offline causes
> system hang without any message...)

Can you please try the following change to see if the problem is gone on your side?

Thanks,
-Aubrey

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f732642e3e09..1ef13b50dfcd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -493,14 +493,17 @@ void double_rq_lock(struct rq *rq1, struct rq *rq2)
 {
 	lockdep_assert_irqs_disabled();
 
-	if (rq1->cpu > rq2->cpu)
-		swap(rq1, rq2);
-
-	raw_spin_rq_lock(rq1);
-	if (__rq_lockp(rq1) == __rq_lockp(rq2))
-		return;
-
-	raw_spin_rq_lock_nested(rq2, SINGLE_DEPTH_NESTING);
+	if (__rq_lockp(rq1) == __rq_lockp(rq2)) {
+		raw_spin_rq_lock(rq1);
+	} else {
+		if (__rq_lockp(rq1) < __rq_lockp(rq2)) {
+			raw_spin_rq_lock(rq1);
+			raw_spin_rq_lock_nested(rq2, SINGLE_DEPTH_NESTING);
+		} else {
+			raw_spin_rq_lock(rq2);
+			raw_spin_rq_lock_nested(rq1, SINGLE_DEPTH_NESTING);
+		}
+	}
 }
 #endif
 

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-26 22:21       ` Josh Don
  2021-04-27 17:10         ` Don Hiatt
  2021-04-27 23:30         ` Josh Don
@ 2021-04-28  7:13         ` Peter Zijlstra
  2 siblings, 0 replies; 103+ messages in thread
From: Peter Zijlstra @ 2021-04-28  7:13 UTC (permalink / raw)
  To: Josh Don
  Cc: Joel Fernandes, Hyser,Chris, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Mel Gorman, linux-kernel, Thomas Gleixner

On Mon, Apr 26, 2021 at 03:21:36PM -0700, Josh Don wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index f732642e3e09..1a81e9cc9e5d 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -290,6 +290,10 @@ static void sched_core_assert_empty(void)
> >  static void __sched_core_enable(void)
> >  {
> >         static_branch_enable(&__sched_core_enabled);
> > +       /*
> > +        * Ensure raw_spin_rq_*lock*() have completed before flipping.
> > +        */
> > +       synchronize_sched();
> 
> synchronize_rcu()

Moo, I actually like synchronize_sched() because it indicates it matches
a preempt disabled region.

> >         __sched_core_flip(true);
> >         sched_core_assert_empty();
> >  }
> > @@ -449,16 +453,22 @@ void raw_spin_rq_lock_nested(struct rq *rq, int subclass)
> >  {
> >         raw_spinlock_t *lock;
> >
> > +       preempt_disable();
> >         if (sched_core_disabled()) {
> >                 raw_spin_lock_nested(&rq->__lock, subclass);
> > +               /* preempt *MUST* still be disabled here */
> > +               preempt_enable_no_resched();
> >                 return;
> >         }
> 
> This approach looks good to me. I'm guessing you went this route
> instead of doing the re-check after locking in order to optimize the
> disabled case?

Exactly.

> Recommend a comment that the preempt_disable() here pairs with the
> synchronize_rcu() in __sched_core_enable().

Fair enough.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-27 23:30         ` Josh Don
@ 2021-04-28  9:13           ` Peter Zijlstra
  2021-04-28 10:35             ` Aubrey Li
  2021-04-29 20:11             ` Josh Don
  0 siblings, 2 replies; 103+ messages in thread
From: Peter Zijlstra @ 2021-04-28  9:13 UTC (permalink / raw)
  To: Josh Don
  Cc: Joel Fernandes, Hyser,Chris, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Mel Gorman, linux-kernel, Thomas Gleixner,
	dhiatt

On Tue, Apr 27, 2021 at 04:30:02PM -0700, Josh Don wrote:

> Also, did you mean to have a preempt_enable_no_resched() rather than
> preempt_enable() in raw_spin_rq_trylock?

No, trylock really needs to be preempt_enable(), because it can have
failed, at which point it will not have incremented the preemption count
and our decrement can hit 0, at which point we really should reschedule.

> I went over the rq_lockp stuff again after Don's reported lockup. Most
> uses are safe due to already holding an rq lock. However,
> double_rq_unlock() is prone to race:
> 
> double_rq_unlock(rq1, rq2):
> /* Initial state: core sched enabled, and rq1 and rq2 are smt
> siblings. So, double_rq_lock(rq1, rq2) only took a single rq lock */
> raw_spin_rq_unlock(rq1);
> /* now not holding any rq lock */
> /* sched core disabled. Now __rq_lockp(rq1) != __rq_lockp(rq2), so we
> falsely unlock rq2 */
> if (__rq_lockp(rq1) != __rq_lockp(rq2))
>         raw_spin_rq_unlock(rq2);
> else
>         __release(rq2->lock);
> 
> Instead we can cache __rq_lockp(rq1) and __rq_lockp(rq2) before
> releasing the lock, in order to prevent this. FWIW I think it is
> likely that Don is seeing a different issue.

Ah, indeed so.. rq_lockp() could do with an assertion, not sure how to
sanely do that. Anyway, double_rq_unlock() is simple enough to fix, we
can simply flip the unlock()s.

( I'm suffering a cold and am really quite slow atm )

How's this then?

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f732642e3e09..3a534c0c1c46 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -290,6 +290,10 @@ static void sched_core_assert_empty(void)
 static void __sched_core_enable(void)
 {
 	static_branch_enable(&__sched_core_enabled);
+	/*
+	 * Ensure raw_spin_rq_*lock*() have completed before flipping.
+	 */
+	synchronize_sched();
 	__sched_core_flip(true);
 	sched_core_assert_empty();
 }
@@ -449,16 +453,23 @@ void raw_spin_rq_lock_nested(struct rq *rq, int subclass)
 {
 	raw_spinlock_t *lock;
 
+	/* Matches synchronize_sched() in __sched_core_enabled() */
+	preempt_disable();
 	if (sched_core_disabled()) {
 		raw_spin_lock_nested(&rq->__lock, subclass);
+		/* preempt-count *MUST* be > 1 */
+		preempt_enable_no_resched();
 		return;
 	}
 
 	for (;;) {
 		lock = __rq_lockp(rq);
 		raw_spin_lock_nested(lock, subclass);
-		if (likely(lock == __rq_lockp(rq)))
+		if (likely(lock == __rq_lockp(rq))) {
+			/* preempt-count *MUST* be > 1 */
+			preempt_enable_no_resched();
 			return;
+		}
 		raw_spin_unlock(lock);
 	}
 }
@@ -468,14 +479,21 @@ bool raw_spin_rq_trylock(struct rq *rq)
 	raw_spinlock_t *lock;
 	bool ret;
 
-	if (sched_core_disabled())
-		return raw_spin_trylock(&rq->__lock);
+	/* Matches synchronize_sched() in __sched_core_enabled() */
+	preempt_disable();
+	if (sched_core_disabled()) {
+		ret = raw_spin_trylock(&rq->__lock);
+		preempt_enable();
+		return ret;
+	}
 
 	for (;;) {
 		lock = __rq_lockp(rq);
 		ret = raw_spin_trylock(lock);
-		if (!ret || (likely(lock == __rq_lockp(rq))))
+		if (!ret || (likely(lock == __rq_lockp(rq)))) {
+			preempt_enable();
 			return ret;
+		}
 		raw_spin_unlock(lock);
 	}
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6a905fe19eef..c9a52231d58a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2568,11 +2568,12 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
 	__releases(rq1->lock)
 	__releases(rq2->lock)
 {
-	raw_spin_rq_unlock(rq1);
 	if (__rq_lockp(rq1) != __rq_lockp(rq2))
 		raw_spin_rq_unlock(rq2);
 	else
 		__release(rq2->lock);
+
+	raw_spin_rq_unlock(rq1);
 }
 
 extern void set_rq_online (struct rq *rq);

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-28  9:13           ` Peter Zijlstra
@ 2021-04-28 10:35             ` Aubrey Li
  2021-04-28 11:03               ` Peter Zijlstra
  2021-04-29 20:11             ` Josh Don
  1 sibling, 1 reply; 103+ messages in thread
From: Aubrey Li @ 2021-04-28 10:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Josh Don, Joel Fernandes, Hyser,Chris, Ingo Molnar,
	Vincent Guittot, Valentin Schneider, Mel Gorman, linux-kernel,
	Thomas Gleixner, Don Hiatt

On Wed, Apr 28, 2021 at 5:14 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Apr 27, 2021 at 04:30:02PM -0700, Josh Don wrote:
>
> > Also, did you mean to have a preempt_enable_no_resched() rather than
> > preempt_enable() in raw_spin_rq_trylock?
>
> No, trylock really needs to be preempt_enable(), because it can have
> failed, at which point it will not have incremented the preemption count
> and our decrement can hit 0, at which point we really should reschedule.
>
> > I went over the rq_lockp stuff again after Don's reported lockup. Most
> > uses are safe due to already holding an rq lock. However,
> > double_rq_unlock() is prone to race:
> >
> > double_rq_unlock(rq1, rq2):
> > /* Initial state: core sched enabled, and rq1 and rq2 are smt
> > siblings. So, double_rq_lock(rq1, rq2) only took a single rq lock */
> > raw_spin_rq_unlock(rq1);
> > /* now not holding any rq lock */
> > /* sched core disabled. Now __rq_lockp(rq1) != __rq_lockp(rq2), so we
> > falsely unlock rq2 */
> > if (__rq_lockp(rq1) != __rq_lockp(rq2))
> >         raw_spin_rq_unlock(rq2);
> > else
> >         __release(rq2->lock);
> >
> > Instead we can cache __rq_lockp(rq1) and __rq_lockp(rq2) before
> > releasing the lock, in order to prevent this. FWIW I think it is
> > likely that Don is seeing a different issue.
>
> Ah, indeed so.. rq_lockp() could do with an assertion, not sure how to
> sanely do that. Anyway, double_rq_unlock() is simple enough to fix, we
> can simply flip the unlock()s.
>
> ( I'm suffering a cold and am really quite slow atm )
>
> How's this then?
>
> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f732642e3e09..3a534c0c1c46 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -290,6 +290,10 @@ static void sched_core_assert_empty(void)
>  static void __sched_core_enable(void)
>  {
>         static_branch_enable(&__sched_core_enabled);
> +       /*
> +        * Ensure raw_spin_rq_*lock*() have completed before flipping.
> +        */
> +       synchronize_sched();

synchronize_sched() seems no longer exist...

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-28  6:05               ` Aubrey Li
@ 2021-04-28 10:57                 ` Aubrey Li
  2021-04-28 16:41                   ` Don Hiatt
  0 siblings, 1 reply; 103+ messages in thread
From: Aubrey Li @ 2021-04-28 10:57 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Josh Don, Don Hiatt, Peter Zijlstra, Joel Fernandes, Hyser,Chris,
	Ingo Molnar, Vincent Guittot, Valentin Schneider, Mel Gorman,
	linux-kernel, Thomas Gleixner

On Wed, Apr 28, 2021 at 2:05 PM Aubrey Li <aubrey.li@linux.intel.com> wrote:
>
> On 4/28/21 9:03 AM, Aubrey Li wrote:
> > On Wed, Apr 28, 2021 at 7:36 AM Josh Don <joshdon@google.com> wrote:
> >>
> >> On Tue, Apr 27, 2021 at 10:10 AM Don Hiatt <dhiatt@digitalocean.com> wrote:
> >>> Hi Josh and Peter,
> >>>
> > >>> I've been running into soft lockups and hard lockups when running a script
> >>> that just cycles setting the cookie of a group of processes over and over again.
> >>>
> >>> Unfortunately the only way I can reproduce this is by setting the cookies
> >>> on qemu. I've tried sysbench, stress-ng but those seem to work just fine.
> >>>
> >>> I'm running Peter's branch and even tried the suggested changes here but
> >>> still see the same behavior. I enabled panic on hard lockup and here below
> >>> is a snippet of the log.
> >>>
> >>> Is there anything you'd like me to try or have any debugging you'd like me to
> >>> do? I'd certainly like to get to the bottom of this.
> >>
> >> Hi Don,
> >>
> >> I tried to repro using qemu, but did not generate a lockup. Could you
> >> provide more details on what your script is doing (or better yet,
> >> share the script directly)? I would have expected you to potentially
> >> hit a lockup if you were cycling sched_core being enabled and
> >> disabled, but it sounds like you are just recreating the cookie for a
> >> process group over and over?
> >>
> >
> > I saw something similar on a bare metal hardware. Also tried the suggested
> > patch here and no luck. Panic stack attached with
> > softlockup_all_cpu_backtrace=1.
> > (sorry, my system has 192 cpus and somehow putting 184 cpus offline causes
> > system hang without any message...)
>
> Can you please try the following change to see if the problem is gone on your side?
>

Please ignore this patch, as the change to double_rq_unlock() in
Peter's last patch fixed the problem.

Thanks,
-Aubrey

>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f732642e3e09..1ef13b50dfcd 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -493,14 +493,17 @@ void double_rq_lock(struct rq *rq1, struct rq *rq2)
>  {
>         lockdep_assert_irqs_disabled();
>
> -       if (rq1->cpu > rq2->cpu)
> -               swap(rq1, rq2);
> -
> -       raw_spin_rq_lock(rq1);
> -       if (__rq_lockp(rq1) == __rq_lockp(rq2))
> -               return;
> -
> -       raw_spin_rq_lock_nested(rq2, SINGLE_DEPTH_NESTING);
> +       if (__rq_lockp(rq1) == __rq_lockp(rq2)) {
> +               raw_spin_rq_lock(rq1);
> +       } else {
> +               if (__rq_lockp(rq1) < __rq_lockp(rq2)) {
> +                       raw_spin_rq_lock(rq1);
> +                       raw_spin_rq_lock_nested(rq2, SINGLE_DEPTH_NESTING);
> +               } else {
> +                       raw_spin_rq_lock(rq2);
> +                       raw_spin_rq_lock_nested(rq1, SINGLE_DEPTH_NESTING);
> +               }
> +       }
>  }
>  #endif
>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-28 10:35             ` Aubrey Li
@ 2021-04-28 11:03               ` Peter Zijlstra
  2021-04-28 14:18                 ` Paul E. McKenney
  0 siblings, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2021-04-28 11:03 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Josh Don, Joel Fernandes, Hyser,Chris, Ingo Molnar,
	Vincent Guittot, Valentin Schneider, Mel Gorman, linux-kernel,
	Thomas Gleixner, Don Hiatt, Paul McKenney

On Wed, Apr 28, 2021 at 06:35:36PM +0800, Aubrey Li wrote:
> On Wed, Apr 28, 2021 at 5:14 PM Peter Zijlstra <peterz@infradead.org> wrote:

> > Ah, indeed so.. rq_lockp() could do with an assertion, not sure how to
> > sanely do that. Anyway, double_rq_unlock() is simple enough to fix, we
> > can simply flip the unlock()s.
> >
> > ( I'm suffering a cold and am really quite slow atm )
> >
> > How's this then?
> >
> > ---
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index f732642e3e09..3a534c0c1c46 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -290,6 +290,10 @@ static void sched_core_assert_empty(void)
> >  static void __sched_core_enable(void)
> >  {
> >         static_branch_enable(&__sched_core_enabled);
> > +       /*
> > +        * Ensure raw_spin_rq_*lock*() have completed before flipping.
> > +        */
> > +       synchronize_sched();
> 
> synchronize_sched() seems no longer exist...

Bah.. Paul, why did that go away? I realize RCU merged in the sched and
bh flavours, but I still find it expressive to use sync_sched() vs
preempt_disable().

Anyway, just use sync_rcu().

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-28 11:03               ` Peter Zijlstra
@ 2021-04-28 14:18                 ` Paul E. McKenney
  0 siblings, 0 replies; 103+ messages in thread
From: Paul E. McKenney @ 2021-04-28 14:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Aubrey Li, Josh Don, Joel Fernandes, Hyser,Chris, Ingo Molnar,
	Vincent Guittot, Valentin Schneider, Mel Gorman, linux-kernel,
	Thomas Gleixner, Don Hiatt

On Wed, Apr 28, 2021 at 01:03:29PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 28, 2021 at 06:35:36PM +0800, Aubrey Li wrote:
> > On Wed, Apr 28, 2021 at 5:14 PM Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > Ah, indeed so.. rq_lockp() could do with an assertion, not sure how to
> > > sanely do that. Anyway, double_rq_unlock() is simple enough to fix, we
> > > can simply flip the unlock()s.
> > >
> > > ( I'm suffering a cold and am really quite slow atm )
> > >
> > > How's this then?
> > >
> > > ---
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index f732642e3e09..3a534c0c1c46 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -290,6 +290,10 @@ static void sched_core_assert_empty(void)
> > >  static void __sched_core_enable(void)
> > >  {
> > >         static_branch_enable(&__sched_core_enabled);
> > > +       /*
> > > +        * Ensure raw_spin_rq_*lock*() have completed before flipping.
> > > +        */
> > > +       synchronize_sched();
> > 
> > synchronize_sched() seems no longer exist...
> 
> Bah.. Paul, why did that go away? I realize RCU merged in the sched and
> bh flavours, but I still find it expressive to use sync_sched() vs
> preempt_disable().

I could have made synchronize_sched() a synonym for synchronize_rcu(),
but that would be more likely to mislead than to help.

> Anyway, just use sync_rcu().

And yes, just use synchronize_rcu().

							Thanx, Paul

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-27 23:35           ` Josh Don
  2021-04-28  1:03             ` Aubrey Li
@ 2021-04-28 16:04             ` Don Hiatt
  1 sibling, 0 replies; 103+ messages in thread
From: Don Hiatt @ 2021-04-28 16:04 UTC (permalink / raw)
  To: Josh Don
  Cc: Peter Zijlstra, Joel Fernandes, Hyser,Chris, Ingo Molnar,
	Vincent Guittot, Valentin Schneider, Mel Gorman, linux-kernel,
	Thomas Gleixner

On Tue, Apr 27, 2021 at 4:35 PM Josh Don <joshdon@google.com> wrote:
>
> On Tue, Apr 27, 2021 at 10:10 AM Don Hiatt <dhiatt@digitalocean.com> wrote:
> > Hi Josh and Peter,
> >
> > I've been running into soft lockups and hard lockups when running a script
> > that just cycles setting the cookie of a group of processes over and over again.
> >
> > Unfortunately the only way I can reproduce this is by setting the cookies
> > on qemu. I've tried sysbench, stress-ng but those seem to work just fine.
> >
> > I'm running Peter's branch and even tried the suggested changes here but
> > still see the same behavior. I enabled panic on hard lockup and here below
> > is a snippet of the log.
> >
> > Is there anything you'd like me to try or have any debugging you'd like me to
> > do? I'd certainly like to get to the bottom of this.
>
> Hi Don,
>
> I tried to repro using qemu, but did not generate a lockup. Could you
> provide more details on what your script is doing (or better yet,
> share the script directly)? I would have expected you to potentially
> hit a lockup if you were cycling sched_core being enabled and
> disabled, but it sounds like you are just recreating the cookie for a
> process group over and over?
>
> Best,
> Josh

Hi Josh,

Sorry if I wasn't clear, but I'm running on bare metal (Peter's
5.12-rc8 repo) and have two qemu-system-x86_64 vms (1 with 8vcpu, the
other with 24 but it doesn't really matter). I then run a script [1]
that cycles setting the cookie for all the processes given the pid of
qemu-system-x86_64.

I also just found a lockup when I set the cookies for those two vms,
pinned them both to the same processor pair, and ran sysbench within
each vm to generate some load [2]. This is without cycling the
cookies, just setting once.

Thanks!

Don

----[1]---
This is a little test harness (I can provide the source but it is
based on your kselftests)
dhiatt@s2r5node34:~/do_coresched$ ./do_coresched -h
Usage for ./do_coresched
   -c Create sched cookie for <pid>
   -f Share sched cookie from <pid>
   -g Get sched cookie from <pid>
   -t Share sched cookie to <pid>
   -z Clear sched cookie from <pid>
   -p PID

Create: _prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, pid, PIDTYPE_PID, 0)
Share: _prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, pid, PIDTYPE_PID, 0)
Get: prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, pid, PIDTYPE_PID,
                    (unsigned long)&cookie);

--

dhiatt@s2r5node34:~/do_coresched$ cat set.sh
#!/bin/bash
# usage: set.sh pid

target=$1
echo target pid: $target
pids=`ps -eL | grep $target | awk '{print $2}' | sort -g`

echo "Setting cookies for $target"
./do_coresched -c -p $BASHPID   # create a cookie for this shell
for i in $pids
do
./do_coresched -t -p $i
./do_coresched -g -p $i
done
---
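
(For anyone without the harness, here is a minimal standalone sketch of the prctl() calls it wraps. The PR_SCHED_CORE* names come from this series' include/uapi/linux/prctl.h; the numeric fallbacks and the PIDTYPE_PID value below are only illustrative, for building where the updated header is not installed.)

/* cs_prctl_sketch.c - build against headers from a kernel carrying this series */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/prctl.h>

#ifndef PR_SCHED_CORE
# define PR_SCHED_CORE			62
# define PR_SCHED_CORE_GET		0
# define PR_SCHED_CORE_CREATE		1
# define PR_SCHED_CORE_SHARE_TO		2
#endif
#define PIDTYPE_PID 0	/* mirrors kernel enum pid_type */

int main(int argc, char **argv)
{
	unsigned long cookie = 0;
	pid_t pid;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	pid = atoi(argv[1]);

	/* create a new core scheduling cookie for <pid> */
	if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, pid, PIDTYPE_PID, 0))
		perror("PR_SCHED_CORE_CREATE");

	/* read the cookie back */
	if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, pid, PIDTYPE_PID,
		  (unsigned long)&cookie))
		perror("PR_SCHED_CORE_GET");

	printf("pid %d cookie 0x%lx\n", (int)pid, cookie);
	return 0;
}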

---[2]---
[ 8911.926989] watchdog: BUG: soft lockup - CPU#53 stuck for 22s! [CPU
1/KVM:19539]
[ 8911.935727] NMI watchdog: Watchdog detected hard LOCKUP on cpu 6
[ 8911.935791] NMI watchdog: Watchdog detected hard LOCKUP on cpu 7
[ 8911.935908] NMI watchdog: Watchdog detected hard LOCKUP on cpu 12
[ 8911.935967] NMI watchdog: Watchdog detected hard LOCKUP on cpu 13
[ 8911.936070] NMI watchdog: Watchdog detected hard LOCKUP on cpu 19
[ 8911.936145] NMI watchdog: Watchdog detected hard LOCKUP on cpu 21
[ 8911.936220] NMI watchdog: Watchdog detected hard LOCKUP on cpu 23
[ 8911.936361] NMI watchdog: Watchdog detected hard LOCKUP on cpu 31
[ 8911.936453] NMI watchdog: Watchdog detected hard LOCKUP on cpu 34
[ 8911.936567] NMI watchdog: Watchdog detected hard LOCKUP on cpu 42
[ 8911.936627] NMI watchdog: Watchdog detected hard LOCKUP on cpu 46
[ 8911.936712] NMI watchdog: Watchdog detected hard LOCKUP on cpu 49
[ 8911.936827] NMI watchdog: Watchdog detected hard LOCKUP on cpu 58
[ 8911.936887] NMI watchdog: Watchdog detected hard LOCKUP on cpu 60
[ 8911.936969] NMI watchdog: Watchdog detected hard LOCKUP on cpu 70
[ 8915.926847] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 8915.932904] rcu: 2-...!: (7 ticks this GP)
idle=38e/1/0x4000000000000000 softirq=181538/181540 fqs=1178
[ 8915.942617] rcu: 14-...!: (2 GPs behind)
idle=d0e/1/0x4000000000000000 softirq=44825/44825 fqs=1178
[ 8915.954034] rcu: rcu_sched kthread timer wakeup didn't happen for
12568 jiffies! g462469 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[ 8915.965775] rcu: Possible timer handling issue on cpu=6 timer-softirq=10747
[ 8915.973021] rcu: rcu_sched kthread starved for 12572 jiffies!
g462469 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=6
[ 8915.983681] rcu: Unless rcu_sched kthread gets sufficient CPU time,
OOM is now expected behavior.
[ 8915.992879] rcu: RCU grace-period kthread stack dump:
[ 8915.998211] rcu: Stack dump where RCU GP kthread last ran:
[ 8939.925995] watchdog: BUG: soft lockup - CPU#53 stuck for 22s! [CPU
1/KVM:19539]
[ 8939.935274] NMI watchdog: Watchdog detected hard LOCKUP on cpu 25
[ 8939.935351] NMI watchdog: Watchdog detected hard LOCKUP on cpu 27
[ 8939.935425] NMI watchdog: Watchdog detected hard LOCKUP on cpu 29
[ 8939.935653] NMI watchdog: Watchdog detected hard LOCKUP on cpu 44
[ 8939.935845] NMI watchdog: Watchdog detected hard LOCKUP on cpu 63
[ 8963.997140] watchdog: BUG: soft lockup - CPU#71 stuck for 22s!
[SchedulerRunner:4405]

-----

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-28 10:57                 ` Aubrey Li
@ 2021-04-28 16:41                   ` Don Hiatt
  2021-04-29 20:48                     ` Josh Don
  0 siblings, 1 reply; 103+ messages in thread
From: Don Hiatt @ 2021-04-28 16:41 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Aubrey Li, Josh Don, Peter Zijlstra, Joel Fernandes, Hyser,Chris,
	Ingo Molnar, Vincent Guittot, Valentin Schneider, Mel Gorman,
	linux-kernel, Thomas Gleixner

On Wed, Apr 28, 2021 at 3:57 AM Aubrey Li <aubrey.intel@gmail.com> wrote:
>
> On Wed, Apr 28, 2021 at 2:05 PM Aubrey Li <aubrey.li@linux.intel.com> wrote:
> >
> > On 4/28/21 9:03 AM, Aubrey Li wrote:
> > > On Wed, Apr 28, 2021 at 7:36 AM Josh Don <joshdon@google.com> wrote:
> > >>
> > >> On Tue, Apr 27, 2021 at 10:10 AM Don Hiatt <dhiatt@digitalocean.com> wrote:
> > >>> Hi Josh and Peter,
> > >>>
> > >>> I've been running into soft lockups and hard lockups when running a script
> > >>> that just cycles setting the cookie of a group of processes over and over again.
> > >>>
> > >>> Unfortunately the only way I can reproduce this is by setting the cookies
> > >>> on qemu. I've tried sysbench, stress-ng but those seem to work just fine.
> > >>>
> > >>> I'm running Peter's branch and even tried the suggested changes here but
> > >>> still see the same behavior. I enabled panic on hard lockup and here below
> > >>> is a snippet of the log.
> > >>>
> > >>> Is there anything you'd like me to try or have any debugging you'd like me to
> > >>> do? I'd certainly like to get to the bottom of this.
> > >>
> > >> Hi Don,
> > >>
> > >> I tried to repro using qemu, but did not generate a lockup. Could you
> > >> provide more details on what your script is doing (or better yet,
> > >> share the script directly)? I would have expected you to potentially
> > >> hit a lockup if you were cycling sched_core being enabled and
> > >> disabled, but it sounds like you are just recreating the cookie for a
> > >> process group over and over?
> > >>
> > >
> > > I saw something similar on a bare metal hardware. Also tried the suggested
> > > patch here and no luck. Panic stack attached with
> > > softlockup_all_cpu_backtrace=1.
> > > (sorry, my system has 192 cpus and somehow putting 184 cpus offline causes
> > > system hang without any message...)
> >
> > Can you please try the following change to see if the problem is gone on your side?
> >
>
> Please ignore this patch, as the change of double_rq_unlock() in
> Peter's last patch
> fixed the problem.
>
> Thanks,
> -Aubrey
>
I'm still seeing hard lockups while repeatedly setting cookies on qemu
processes even with
the updated patch. If there is any debug you'd like me to turn on,
just let me know.

Thanks!

don



> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index f732642e3e09..1ef13b50dfcd 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -493,14 +493,17 @@ void double_rq_lock(struct rq *rq1, struct rq *rq2)
> >  {
> >         lockdep_assert_irqs_disabled();
> >
> > -       if (rq1->cpu > rq2->cpu)
> > -               swap(rq1, rq2);
> > -
> > -       raw_spin_rq_lock(rq1);
> > -       if (__rq_lockp(rq1) == __rq_lockp(rq2))
> > -               return;
> > -
> > -       raw_spin_rq_lock_nested(rq2, SINGLE_DEPTH_NESTING);
> > +       if (__rq_lockp(rq1) == __rq_lockp(rq2)) {
> > +               raw_spin_rq_lock(rq1);
> > +       } else {
> > +               if (__rq_lockp(rq1) < __rq_lockp(rq2)) {
> > +                       raw_spin_rq_lock(rq1);
> > +                       raw_spin_rq_lock_nested(rq2, SINGLE_DEPTH_NESTING);
> > +               } else {
> > +                       raw_spin_rq_lock(rq2);
> > +                       raw_spin_rq_lock_nested(rq1, SINGLE_DEPTH_NESTING);
> > +               }
> > +       }
> >  }
> >  #endif
> >

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-22 12:05 ` [PATCH 04/19] sched: Prepare for Core-wide rq->lock Peter Zijlstra
  2021-04-24  1:22   ` Josh Don
  2021-04-28  6:02   ` Aubrey Li
@ 2021-04-29  8:03   ` Aubrey Li
  2021-04-29 20:39     ` Josh Don
  2021-05-07  9:50   ` [PATCH v2 " Peter Zijlstra
  3 siblings, 1 reply; 103+ messages in thread
From: Aubrey Li @ 2021-04-29  8:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes, Hyser,Chris, Josh Don, Ingo Molnar,
	Vincent Guittot, Valentin Schneider, Mel Gorman,
	Linux List Kernel Mailing, Thomas Gleixner

On Thu, Apr 22, 2021 at 8:39 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> When switching on core-sched, CPUs need to agree which lock to use for
> their RQ.
>
> The new rule will be that rq->core_enabled will be toggled while
> holding all rq->__locks that belong to a core. This means we need to
> double check the rq->core_enabled value after each lock acquire and
> retry if it changed.
>
> This also has implications for those sites that take multiple RQ
> locks, they need to be careful that the second lock doesn't end up
> being the first lock.
>
> Verify the lock pointer after acquiring the first lock, because if
> they're on the same core, holding any of the rq->__lock instances will
> pin the core state.
>
> While there, change the rq->__lock order to CPU number, instead of rq
> address, this greatly simplifies the next patch.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/core.c  |   48 ++++++++++++++++++++++++++++++++++++++++++++++--
>  kernel/sched/sched.h |   41 +++++++++++------------------------------
>  2 files changed, 57 insertions(+), 32 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
----snip----
> @@ -199,6 +224,25 @@ void raw_spin_rq_unlock(struct rq *rq)
>         raw_spin_unlock(rq_lockp(rq));
>  }
>
> +#ifdef CONFIG_SMP
> +/*
> + * double_rq_lock - safely lock two runqueues
> + */
> +void double_rq_lock(struct rq *rq1, struct rq *rq2)
> +{
> +       lockdep_assert_irqs_disabled();
> +
> +       if (rq1->cpu > rq2->cpu)

It's still a bit hard for me to digest this function; I guess ordering on (rq->cpu)
can't guarantee a consistent locking sequence when coresched is enabled.

- cpu1 and cpu7 share lockA
- cpu2 and cpu8 share lockB

double_rq_lock(1,8) leads to lock(A) and lock(B)
double_rq_lock(7,2) leads to lock(B) and lock(A)

Change to the below to avoid ABBA?
+       if (__rq_lockp(rq1) > __rq_lockp(rq2))

Please correct me if I was wrong.

Thanks,
-Aubrey

> +               swap(rq1, rq2);
> +
> +       raw_spin_rq_lock(rq1);
> +       if (rq_lockp(rq1) == rq_lockp(rq2))
> +               return;
> +
> +       raw_spin_rq_lock_nested(rq2, SINGLE_DEPTH_NESTING);
> +}
> +#endif
> +

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-28  9:13           ` Peter Zijlstra
  2021-04-28 10:35             ` Aubrey Li
@ 2021-04-29 20:11             ` Josh Don
  2021-05-03 19:17               ` Peter Zijlstra
  1 sibling, 1 reply; 103+ messages in thread
From: Josh Don @ 2021-04-29 20:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes, Hyser,Chris, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Mel Gorman, linux-kernel, Thomas Gleixner,
	Don Hiatt

On Wed, Apr 28, 2021 at 2:13 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Apr 27, 2021 at 04:30:02PM -0700, Josh Don wrote:
>
> > Also, did you mean to have a preempt_enable_no_resched() rather than
> > preempt_enable() in raw_spin_rq_trylock?
>
> No, trylock really needs to be preempt_enable(), because it can have
> failed, at which point it will not have incremented the preemption count
> and our decrement can hit 0, at which point we really should reschedule.

Ah yes, makes sense. Did you want to use preempt_enable_no_resched()
at all then? No chance of preempt_count() being 1 at the point of
enable, as you comment below.

> > I went over the rq_lockp stuff again after Don's reported lockup. Most
> > uses are safe due to already holding an rq lock. However,
> > double_rq_unlock() is prone to race:
> >
> > double_rq_unlock(rq1, rq2):
> > /* Initial state: core sched enabled, and rq1 and rq2 are smt
> > siblings. So, double_rq_lock(rq1, rq2) only took a single rq lock */
> > raw_spin_rq_unlock(rq1);
> > /* now not holding any rq lock */
> > /* sched core disabled. Now __rq_lockp(rq1) != __rq_lockp(rq2), so we
> > falsely unlock rq2 */
> > if (__rq_lockp(rq1) != __rq_lockp(rq2))
> >         raw_spin_rq_unlock(rq2);
> > else
> >         __release(rq2->lock);
> >
> > Instead we can cache __rq_lockp(rq1) and __rq_lockp(rq2) before
> > releasing the lock, in order to prevent this. FWIW I think it is
> > likely that Don is seeing a different issue.
>
> Ah, indeed so.. rq_lockp() could do with an assertion, not sure how to
> sanely do that. Anyway, double_rq_unlock() is simple enough to fix, we
> can simply flip the unlock()s.
>
> ( I'm suffering a cold and am really quite slow atm )

No worries, hope it's a mild one.

> How's this then?

Looks good to me (other than the synchronize_sched()->synchronize_rcu()).

For these locking patches,
Reviewed-by: Josh Don <joshdon@google.com>

I'll see if I can repro that lockup.

> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f732642e3e09..3a534c0c1c46 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -290,6 +290,10 @@ static void sched_core_assert_empty(void)
>  static void __sched_core_enable(void)
>  {
>         static_branch_enable(&__sched_core_enabled);
> +       /*
> +        * Ensure raw_spin_rq_*lock*() have completed before flipping.
> +        */
> +       synchronize_sched();
>         __sched_core_flip(true);
>         sched_core_assert_empty();
>  }
> @@ -449,16 +453,23 @@ void raw_spin_rq_lock_nested(struct rq *rq, int subclass)
>  {
>         raw_spinlock_t *lock;
>
> +       /* Matches synchronize_sched() in __sched_core_enabled() */
> +       preempt_disable();
>         if (sched_core_disabled()) {
>                 raw_spin_lock_nested(&rq->__lock, subclass);
> +               /* preempt-count *MUST* be > 1 */
> +               preempt_enable_no_resched();
>                 return;
>         }
>
>         for (;;) {
>                 lock = __rq_lockp(rq);
>                 raw_spin_lock_nested(lock, subclass);
> -               if (likely(lock == __rq_lockp(rq)))
> +               if (likely(lock == __rq_lockp(rq))) {
> +                       /* preempt-count *MUST* be > 1 */
> +                       preempt_enable_no_resched();
>                         return;
> +               }
>                 raw_spin_unlock(lock);
>         }
>  }
> @@ -468,14 +479,21 @@ bool raw_spin_rq_trylock(struct rq *rq)
>         raw_spinlock_t *lock;
>         bool ret;
>
> -       if (sched_core_disabled())
> -               return raw_spin_trylock(&rq->__lock);
> +       /* Matches synchronize_sched() in __sched_core_enabled() */
> +       preempt_disable();
> +       if (sched_core_disabled()) {
> +               ret = raw_spin_trylock(&rq->__lock);
> +               preempt_enable();
> +               return ret;
> +       }
>
>         for (;;) {
>                 lock = __rq_lockp(rq);
>                 ret = raw_spin_trylock(lock);
> -               if (!ret || (likely(lock == __rq_lockp(rq))))
> +               if (!ret || (likely(lock == __rq_lockp(rq)))) {
> +                       preempt_enable();
>                         return ret;
> +               }
>                 raw_spin_unlock(lock);
>         }
>  }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 6a905fe19eef..c9a52231d58a 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2568,11 +2568,12 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
>         __releases(rq1->lock)
>         __releases(rq2->lock)
>  {
> -       raw_spin_rq_unlock(rq1);
>         if (__rq_lockp(rq1) != __rq_lockp(rq2))
>                 raw_spin_rq_unlock(rq2);
>         else
>                 __release(rq2->lock);
> +
> +       raw_spin_rq_unlock(rq1);
>  }
>
>  extern void set_rq_online (struct rq *rq);

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-29  8:03   ` Aubrey Li
@ 2021-04-29 20:39     ` Josh Don
  2021-04-30  8:20       ` Aubrey Li
  2021-05-04  7:38       ` Peter Zijlstra
  0 siblings, 2 replies; 103+ messages in thread
From: Josh Don @ 2021-04-29 20:39 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Peter Zijlstra, Joel Fernandes, Hyser,Chris, Ingo Molnar,
	Vincent Guittot, Valentin Schneider, Mel Gorman,
	Linux List Kernel Mailing, Thomas Gleixner, Don Hiatt

On Thu, Apr 29, 2021 at 1:03 AM Aubrey Li <aubrey.intel@gmail.com> wrote:
>
> On Thu, Apr 22, 2021 at 8:39 PM Peter Zijlstra <peterz@infradead.org> wrote:
> ----snip----
> > @@ -199,6 +224,25 @@ void raw_spin_rq_unlock(struct rq *rq)
> >         raw_spin_unlock(rq_lockp(rq));
> >  }
> >
> > +#ifdef CONFIG_SMP
> > +/*
> > + * double_rq_lock - safely lock two runqueues
> > + */
> > +void double_rq_lock(struct rq *rq1, struct rq *rq2)
> > +{
> > +       lockdep_assert_irqs_disabled();
> > +
> > +       if (rq1->cpu > rq2->cpu)
>
> It's still a bit hard for me to digest this function, I guess using (rq->cpu)
> can't guarantee the sequence of locking when coresched is enabled.
>
> - cpu1 and cpu7 shares lockA
> - cpu2 and cpu8 shares lockB
>
> double_rq_lock(1,8) leads to lock(A) and lock(B)
> double_rq_lock(7,2) leads to lock(B) and lock(A)
>
> change to below to avoid ABBA?
> +       if (__rq_lockp(rq1) > __rq_lockp(rq2))
>
> Please correct me if I was wrong.

Great catch Aubrey. This is possibly what is causing the lockups that
Don is seeing.

The proposed usage of __rq_lockp() is prone to race with sched core
being enabled/disabled. It also won't order properly if we do
double_rq_lock(smt0, smt1) vs double_rq_lock(smt1, smt0), since these
would have equivalent __rq_lockp(). I'd propose an alternative but
similar idea: order by core, then break ties by ordering on cpu.

+#ifdef CONFIG_SCHED_CORE
+       if (rq1->core->cpu > rq2->core->cpu)
+               swap(rq1, rq2);
+       else if (rq1->core->cpu == rq2->core->cpu && rq1->cpu > rq2->cpu)
+               swap(rq1, rq2);
+#else
        if (rq1->cpu > rq2->cpu)
                swap(rq1, rq2);
+#endif
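
For context, a rough sketch of how that ordering slots into double_rq_lock() as posted earlier in this thread (a sketch only; it assumes the rq->core back-pointer and the raw_spin_rq_*lock() helpers introduced by this series):

void double_rq_lock(struct rq *rq1, struct rq *rq2)
{
	lockdep_assert_irqs_disabled();

#ifdef CONFIG_SCHED_CORE
	/*
	 * Order by core first, then break ties by CPU within the core, so
	 * that callers passing (rq1, rq2) and (rq2, rq1) always take the
	 * underlying (possibly core-wide) locks in the same order.
	 */
	if (rq1->core->cpu > rq2->core->cpu ||
	    (rq1->core->cpu == rq2->core->cpu && rq1->cpu > rq2->cpu))
		swap(rq1, rq2);
#else
	if (rq1->cpu > rq2->cpu)
		swap(rq1, rq2);
#endif

	raw_spin_rq_lock(rq1);
	if (__rq_lockp(rq1) == __rq_lockp(rq2))
		return;	/* SMT siblings share one core-wide lock */

	raw_spin_rq_lock_nested(rq2, SINGLE_DEPTH_NESTING);
}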

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-28 16:41                   ` Don Hiatt
@ 2021-04-29 20:48                     ` Josh Don
  2021-04-29 21:09                       ` Don Hiatt
  0 siblings, 1 reply; 103+ messages in thread
From: Josh Don @ 2021-04-29 20:48 UTC (permalink / raw)
  To: Don Hiatt
  Cc: Aubrey Li, Aubrey Li, Peter Zijlstra, Joel Fernandes,
	Hyser,Chris, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Mel Gorman, linux-kernel, Thomas Gleixner

On Wed, Apr 28, 2021 at 9:41 AM Don Hiatt <dhiatt@digitalocean.com> wrote:
>
> I'm still seeing hard lockups while repeatedly setting cookies on qemu
> processes even with
> the updated patch. If there is any debug you'd like me to turn on,
> just let me know.
>
> Thanks!
>
> don

Thanks for the added context on your repro configuration. In addition
to the updated patch from earlier, could you try the modification to
double_rq_lock() from
https://lkml.kernel.org/r/CABk29NuS-B3n4sbmavo0NDA1OCCsz6Zf2VDjjFQvAxBMQoJ_Lg@mail.gmail.com
? I have a feeling this is what's causing your lockup.

Best,
Josh

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-29 20:48                     ` Josh Don
@ 2021-04-29 21:09                       ` Don Hiatt
  2021-04-29 23:22                         ` Josh Don
  2021-04-30  8:26                         ` Aubrey Li
  0 siblings, 2 replies; 103+ messages in thread
From: Don Hiatt @ 2021-04-29 21:09 UTC (permalink / raw)
  To: Josh Don
  Cc: Aubrey Li, Aubrey Li, Peter Zijlstra, Joel Fernandes,
	Hyser,Chris, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Mel Gorman, linux-kernel, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 1288 bytes --]

On Thu, Apr 29, 2021 at 1:48 PM Josh Don <joshdon@google.com> wrote:
>
> On Wed, Apr 28, 2021 at 9:41 AM Don Hiatt <dhiatt@digitalocean.com> wrote:
> >
> > I'm still seeing hard lockups while repeatedly setting cookies on qemu
> > processes even with
> > the updated patch. If there is any debug you'd like me to turn on,
> > just let me know.
> >
> > Thanks!
> >
> > don
>
> Thanks for the added context on your repro configuration. In addition
> to the updated patch from earlier, could you try the modification to
> double_rq_lock() from
> https://lkml.kernel.org/r/CABk29NuS-B3n4sbmavo0NDA1OCCsz6Zf2VDjjFQvAxBMQoJ_Lg@mail.gmail.com
> ? I have a feeling this is what's causing your lockup.
>
> Best,
> Josh

Hi Josh,

I've been running Aubrey+Peter's patch (attached) for almost 5 hours
and haven't had a single issue. :)

I'm running a set-cookie script every 5 seconds on the two VMs (each
vm is running
'sysbench --threads=1 --time=0 cpu run' to generate some load in the vm) and
I'm running two of the same sysbench runs on the HV while setting cookies
every 5 seconds.

Unless I jinxed us it looks like a great fix. :)

Let me know if there is anything else you'd like me to try. I'm going
to leave the tests running
and see what happens. I'll update with what I find.

Thanks!

don

[-- Attachment #2: aubrey-cosched.diff --]
[-- Type: application/octet-stream, Size: 2320 bytes --]

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f732642e3e09..1d7bc87007cd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -290,6 +290,10 @@ static void sched_core_assert_empty(void)
 static void __sched_core_enable(void)
 {
 	static_branch_enable(&__sched_core_enabled);
+	/*
+	 * Ensure raw_spin_rq_*lock*() have completed before flipping.
+	 */
+	synchronize_rcu();
 	__sched_core_flip(true);
 	sched_core_assert_empty();
 }
@@ -449,16 +453,23 @@ void raw_spin_rq_lock_nested(struct rq *rq, int subclass)
 {
 	raw_spinlock_t *lock;

+	/* Matches synchronize_sched() in __sched_core_enabled() */
+	preempt_disable();
 	if (sched_core_disabled()) {
 		raw_spin_lock_nested(&rq->__lock, subclass);
+		/* preempt-count *MUST* be > 1 */
+		preempt_enable_no_resched();
 		return;
 	}

 	for (;;) {
 		lock = __rq_lockp(rq);
 		raw_spin_lock_nested(lock, subclass);
-		if (likely(lock == __rq_lockp(rq)))
+		if (likely(lock == __rq_lockp(rq))) {
+			/* preempt-count *MUST* be > 1 */
+			preempt_enable_no_resched();
 			return;
+		}
 		raw_spin_unlock(lock);
 	}
 }
@@ -468,14 +479,21 @@ bool raw_spin_rq_trylock(struct rq *rq)
 	raw_spinlock_t *lock;
 	bool ret;

-	if (sched_core_disabled())
-		return raw_spin_trylock(&rq->__lock);
+	/* Matches synchronize_sched() in __sched_core_enabled() */
+	preempt_disable();
+	if (sched_core_disabled()) {
+		ret = raw_spin_trylock(&rq->__lock);
+		preempt_enable();
+		return ret;
+	}

 	for (;;) {
 		lock = __rq_lockp(rq);
 		ret = raw_spin_trylock(lock);
-		if (!ret || (likely(lock == __rq_lockp(rq))))
+		if (!ret || (likely(lock == __rq_lockp(rq)))) {
+			preempt_enable();
 			return ret;
+		}
 		raw_spin_unlock(lock);
 	}
 }
@@ -493,14 +511,17 @@ void double_rq_lock(struct rq *rq1, struct rq *rq2)
 {
 	lockdep_assert_irqs_disabled();

-	if (rq1->cpu > rq2->cpu)
-		swap(rq1, rq2);
-
-	raw_spin_rq_lock(rq1);
-	if (__rq_lockp(rq1) == __rq_lockp(rq2))
-		return;
-
-	raw_spin_rq_lock_nested(rq2, SINGLE_DEPTH_NESTING);
+	if (__rq_lockp(rq1) == __rq_lockp(rq2)) {
+		raw_spin_rq_lock(rq1);
+	} else {
+		if (__rq_lockp(rq1) < __rq_lockp(rq2)) {
+			raw_spin_rq_lock(rq1);
+			raw_spin_rq_lock_nested(rq2, SINGLE_DEPTH_NESTING);
+		} else {
+			raw_spin_rq_lock(rq2);
+			raw_spin_rq_lock_nested(rq1, SINGLE_DEPTH_NESTING);
+		}
+	}
 }
 #endif

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-29 21:09                       ` Don Hiatt
@ 2021-04-29 23:22                         ` Josh Don
  2021-04-30 16:18                           ` Don Hiatt
  2021-04-30  8:26                         ` Aubrey Li
  1 sibling, 1 reply; 103+ messages in thread
From: Josh Don @ 2021-04-29 23:22 UTC (permalink / raw)
  To: Don Hiatt
  Cc: Aubrey Li, Aubrey Li, Peter Zijlstra, Joel Fernandes,
	Hyser,Chris, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Mel Gorman, linux-kernel, Thomas Gleixner

On Thu, Apr 29, 2021 at 2:09 PM Don Hiatt <dhiatt@digitalocean.com> wrote:
>
> On Thu, Apr 29, 2021 at 1:48 PM Josh Don <joshdon@google.com> wrote:
> >
> > On Wed, Apr 28, 2021 at 9:41 AM Don Hiatt <dhiatt@digitalocean.com> wrote:
> > >
> > > I'm still seeing hard lockups while repeatedly setting cookies on qemu
> > > processes even with
> > > the updated patch. If there is any debug you'd like me to turn on,
> > > just let me know.
> > >
> > > Thanks!
> > >
> > > don
> >
> > Thanks for the added context on your repro configuration. In addition
> > to the updated patch from earlier, could you try the modification to
> > double_rq_lock() from
> > https://lkml.kernel.org/r/CABk29NuS-B3n4sbmavo0NDA1OCCsz6Zf2VDjjFQvAxBMQoJ_Lg@mail.gmail.com
> > ? I have a feeling this is what's causing your lockup.
> >
> > Best,
> > Josh
>
> Hi Josh,
>
> I've been running Aubrey+Peter's patch (attached) for almost 5 hours
> and haven't had a single issue. :)
>
> I'm running a set-cookie script every 5 seconds on the two VMs (each
> vm is running
> 'sysbench --threads=1 --time=0 cpu run' to generate some load in the vm) and
> I'm running two of the same sysbench runs on the HV while setting cookies
> every 5 seconds.
>
> Unless I jinxed us it looks like a great fix. :)
>
> Let me know if there is anything else you'd like me to try. I'm going
> to leave the tests running
> and see what happens. I update with what I find.
>
> Thanks!
>
> don

That's awesome news, thanks for validating. Note that with Aubrey's
patch there is still a race window if sched core is being
enabled/disabled (i.e. if you alternate between there being some
cookies in the system, and no cookies). In my reply I posted an
alternative version to avoid that. If your script were to do the
on-off flipping with the old patch, you'd might eventually see another
lockup.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 00/19] sched: Core Scheduling
  2021-04-22 12:04 [PATCH 00/19] sched: Core Scheduling Peter Zijlstra
                   ` (19 preceding siblings ...)
  2021-04-22 16:43 ` [PATCH 00/19] sched: Core Scheduling Don Hiatt
@ 2021-04-30  6:47 ` Ning, Hongyu
  2021-05-06 10:29   ` Peter Zijlstra
  2021-05-07 18:02 ` Joel Fernandes
  2021-05-10 16:16 ` Vincent Guittot
  22 siblings, 1 reply; 103+ messages in thread
From: Ning, Hongyu @ 2021-04-30  6:47 UTC (permalink / raw)
  To: Peter Zijlstra, joel, chris.hyser, joshdon, mingo,
	vincent.guittot, valentin.schneider, mgorman
  Cc: linux-kernel, tglx, Li, Aubrey, Tim Chen


On 2021/4/22 20:04, Peter Zijlstra wrote:
> Hai,
> 
> This is an agressive fold of all the core-scheduling work so far. I've stripped
> a whole bunch of tags along the way (hopefully not too many, please yell if you
> feel I made a mistake), including tested-by. Please retest.
> 
> Changes since the last partial post is dropping all the cgroup stuff and
> PR_SCHED_CORE_CLEAR as well as that exec() behaviour in order to later resolve
> the cgroup issue.
> 
> Since we're really rather late for the coming merge window, my plan was to
> merge the lot right after the merge window.
> 
> Again, please test.
> 
> These patches should shortly be available in my queue.git.
> 
> ---
>  b/kernel/sched/core_sched.c                     |  229 ++++++
>  b/tools/testing/selftests/sched/.gitignore      |    1 
>  b/tools/testing/selftests/sched/Makefile        |   14 
>  b/tools/testing/selftests/sched/config          |    1 
>  b/tools/testing/selftests/sched/cs_prctl_test.c |  338 +++++++++
>  include/linux/sched.h                           |   19 
>  include/uapi/linux/prctl.h                      |    8 
>  kernel/Kconfig.preempt                          |    6 
>  kernel/fork.c                                   |    4 
>  kernel/sched/Makefile                           |    1 
>  kernel/sched/core.c                             |  858 ++++++++++++++++++++++--
>  kernel/sched/cpuacct.c                          |   12 
>  kernel/sched/deadline.c                         |   38 -
>  kernel/sched/debug.c                            |    4 
>  kernel/sched/fair.c                             |  276 +++++--
>  kernel/sched/idle.c                             |   13 
>  kernel/sched/pelt.h                             |    2 
>  kernel/sched/rt.c                               |   31 
>  kernel/sched/sched.h                            |  393 ++++++++--
>  kernel/sched/stop_task.c                        |   14 
>  kernel/sched/topology.c                         |    4 
>  kernel/sys.c                                    |    5 
>  tools/include/uapi/linux/prctl.h                |    8 
>  23 files changed, 2057 insertions(+), 222 deletions(-)
> 


Adding sysbench/uperf/wis performance results for reference:

- kernel under test:
	-- above patchset of core-scheduling + local fix for softlockup issue: https://lore.kernel.org/lkml/5c289c5a-a120-a1d0-ca89-2724a1445fe8@linux.intel.com/
	-- coresched_v10 kernel source: https://github.com/digitalocean/linux-coresched/commits/coresched/v10-v5.10.y

- workloads: 
	-- A. sysbench cpu (192 threads) + sysbench cpu (192 threads)
	-- B. sysbench cpu (192 threads) + sysbench mysql (192 threads)
	-- C. uperf netperf.xml (192 threads over TCP or UDP protocol separately)
	-- D. will-it-scale context_switch via pipe (192 threads)

- test machine setup: 
	CPU(s):              192
	On-line CPU(s) list: 0-191
	Thread(s) per core:  2
	Core(s) per socket:  48
	Socket(s):           2
	NUMA node(s):        4

- performance change key info:
	--workload B: coresched (cs_on), sysbench mysql performance drops around 20% vs coresched_v10
	--workload C, coresched (cs_on), uperf performance nearly doubles vs coresched_v10
	--workload C, default (cs_off), uperf performance drops over 20% vs coresched_v10, same issue seen on v5.12-rc8 base (w/o coresched patchset)
	--workload D, coresched (cs_on), wis performance nearly doubles vs coresched_v10

- performance info of workloads, normalized based on coresched_v10 results
	--workload A:
	Note: 
	* no performance change compared to coresched_v10
+---------------------------------------+------+----------------------------------------------+------------------------------------------------+-------+-------------------------------+---------------------------------+
|                                       | **   | coresched_peterz_aubrey_fix_base_v5.12-rc8   | coresched_peterz_aubrey_fix_base_v5.12-rc8     | ***   | coresched_v10_base_v5.10.11   | coresched_v10_base_v5.10.11     |
+=======================================+======+==============================================+================================================+=======+===============================+=================================+
| workload                              | **   | sysbench cpu * 192                           | sysbench cpu * 192                             | ***   | sysbench cpu * 192            | sysbench cpu * 192              |
+---------------------------------------+------+----------------------------------------------+------------------------------------------------+-------+-------------------------------+---------------------------------+
| prctl/cgroup                          | **   | prctl on workload cpu_0                      | prctl on workload cpu_1                        | ***   | cg_sysbench_cpu_0             | cg_sysbench_cpu_1               |
+---------------------------------------+------+----------------------------------------------+------------------------------------------------+-------+-------------------------------+---------------------------------+
| record_item                           | **   | Tput_avg (events/s)                          | Tput_avg (events/s)                            | ***   | Tput_avg (events/s)           | Tput_avg (events/s)             |
+---------------------------------------+------+----------------------------------------------+------------------------------------------------+-------+-------------------------------+---------------------------------+
| coresched normalized vs coresched_v10 | **   | 0.99                                         | 1.01                                           | ***   | 1                             | 1                               |
+---------------------------------------+------+----------------------------------------------+------------------------------------------------+-------+-------------------------------+---------------------------------+
| default normalized vs coresched_v10   | **   | 1.03                                         | 0.98                                           | ***   | 1                             | 1                               |
+---------------------------------------+------+----------------------------------------------+------------------------------------------------+-------+-------------------------------+---------------------------------+
| smtoff normalized vs coresched_v10    | **   | 1.01                                         | 0.99                                           | ***   | 1                             | 1                               |
+---------------------------------------+------+----------------------------------------------+------------------------------------------------+-------+-------------------------------+---------------------------------+

	--workload B:
	Note: 
	* coresched (cs_on), sysbench mysql performance drops around 20% vs coresched_v10
+---------------------------------------+------+----------------------------------------------+------------------------------------------------+-------+-------------------------------+---------------------------------+
|                                       | **   | coresched_peterz_aubrey_fix_base_v5.12-rc8   | coresched_peterz_aubrey_fix_base_v5.12-rc8     | ***   | coresched_v10_base_v5.10.11   | coresched_v10_base_v5.10.11     |
+=======================================+======+==============================================+================================================+=======+===============================+=================================+
| workload                              | **   | sysbench cpu * 192                           | sysbench mysql * 192                           | ***   | sysbench cpu * 192            | sysbench mysql * 192            |
+---------------------------------------+------+----------------------------------------------+------------------------------------------------+-------+-------------------------------+---------------------------------+
| prctl/cgroup                          | **   | prctl on workload cpu_0                      | prctl on workload mysql_0                      | ***   | cg_sysbench_cpu_0             | cg_sysbench_mysql_0             |
+---------------------------------------+------+----------------------------------------------+------------------------------------------------+-------+-------------------------------+---------------------------------+
| record_item                           | **   | Tput_avg (events/s)                          | Tput_avg (events/s)                            | ***   | Tput_avg (events/s)           | Tput_avg (events/s)             |
+---------------------------------------+------+----------------------------------------------+------------------------------------------------+-------+-------------------------------+---------------------------------+
| coresched normalized vs coresched_v10 | **   | 1.03                                         | 0.77                                           | ***   | 1                             | 1                               |
+---------------------------------------+------+----------------------------------------------+------------------------------------------------+-------+-------------------------------+---------------------------------+
| default normalized vs coresched_v10   | **   | 1.02                                         | 0.9                                            | ***   | 1                             | 1                               |
+---------------------------------------+------+----------------------------------------------+------------------------------------------------+-------+-------------------------------+---------------------------------+
| smtoff normalized vs coresched_v10    | **   | 0.94                                         | 1.14                                           | ***   | 1                             | 1                               |
+---------------------------------------+------+----------------------------------------------+------------------------------------------------+-------+-------------------------------+---------------------------------+
	
	--workload C:
	Note: 
	* coresched (cs_on), uperf performance nearly doubles vs coresched_v10
	* default (cs_off), uperf performance drops over 20% vs coresched_v10, same issue seen on v5.12-rc8 base (w/o coresched patchset)
+---------------------------------------+------+----------------------------------------------+------------------------------------------------+-------+-------------------------------+---------------------------------+
|                                       | **   | coresched_peterz_aubrey_fix_base_v5.12-rc8   | coresched_peterz_aubrey_fix_base_v5.12-rc8     | ***   | coresched_v10_base_v5.10.11   | coresched_v10_base_v5.10.11     |
+=======================================+======+==============================================+================================================+=======+===============================+=================================+
| workload                              | **   | uperf netperf TCP * 192                      | uperf netperf UDP * 192                        | ***   | uperf netperf TCP * 192       | uperf netperf UDP * 192         |
+---------------------------------------+------+----------------------------------------------+------------------------------------------------+-------+-------------------------------+---------------------------------+
| prctl/cgroup                          | **   | prctl on workload uperf                      | prctl on workload uperf                        | ***   | cg_uperf                      | cg_uperf                        |
+---------------------------------------+------+----------------------------------------------+------------------------------------------------+-------+-------------------------------+---------------------------------+
| record_item                           | **   | Tput_avg (Gb/s)                              | Tput_avg (Gb/s)                                | ***   | Tput_avg (Gb/s)               | Tput_avg (Gb/s)                 |
+---------------------------------------+------+----------------------------------------------+------------------------------------------------+-------+-------------------------------+---------------------------------+
| coresched normalized vs coresched_v10 | **   | 1.87                                         | 1.99                                           | ***   | 1                             | 1                               |
+---------------------------------------+------+----------------------------------------------+------------------------------------------------+-------+-------------------------------+---------------------------------+
| default normalized vs coresched_v10   | **   | 0.78                                         | 0.74                                           | ***   | 1                             | 1                               |
+---------------------------------------+------+----------------------------------------------+------------------------------------------------+-------+-------------------------------+---------------------------------+
| smtoff normalized vs coresched_v10    | **   | 0.87                                         | 0.95                                           | ***   | 1                             | 1                               |
+---------------------------------------+------+----------------------------------------------+------------------------------------------------+-------+-------------------------------+---------------------------------+

	--workload D:
	Note: 
	* coresched (cs_on), wis performance nearly doubles vs coresched_v10
+---------------------------------------+------+----------------------------------------------+-------+-------------------------------+
|                                       | **   | coresched_peterz_aubrey_fix_base_v5.12-rc8   | ***   | coresched_v10_base_v5.10.11   |
+=======================================+======+==============================================+=======+===============================+
| workload                              | **   | will-it-scale  * 192                         | ***   | will-it-scale  * 192          |
|                                       |      | (pipe based context_switch)                  |       | (pipe based context_switch)   |
+---------------------------------------+------+----------------------------------------------+-------+-------------------------------+
| prctl/cgroup                          | **   | prctl on workload wis                        | ***   | cg_wis                        |
+---------------------------------------+------+----------------------------------------------+-------+-------------------------------+
| record_item                           | **   | threads_avg                                  | ***   | threads_avg                   |
+---------------------------------------+------+----------------------------------------------+-------+-------------------------------+
| coresched normalized vs coresched_v10 | **   | 1.98                                         | ***   | 1                             |
+---------------------------------------+------+----------------------------------------------+-------+-------------------------------+
| default normalized vs coresched_v10   | **   | 1.13                                         | ***   | 1                             |
+---------------------------------------+------+----------------------------------------------+-------+-------------------------------+
| smtoff normalized vs coresched_v10    | **   | 1.32                                         | ***   | 1                             |
+---------------------------------------+------+----------------------------------------------+-------+-------------------------------+

	-- notes on record_item:
	* coresched normalized vs coresched_v10: smton, cs enabled, test result normalized by result of coresched_v10 under same config
	* default normalized vs coresched_v10: smton, cs disabled, test result normalized by result of coresched_v10 under same config
	* smtoff normalized vs coresched_v10: smtoff, test result normalized by result of coresched_v10 under same config

Hongyu

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-29 20:39     ` Josh Don
@ 2021-04-30  8:20       ` Aubrey Li
  2021-04-30  8:48         ` Josh Don
  2021-05-04  7:38       ` Peter Zijlstra
  1 sibling, 1 reply; 103+ messages in thread
From: Aubrey Li @ 2021-04-30  8:20 UTC (permalink / raw)
  To: Josh Don
  Cc: Peter Zijlstra, Joel Fernandes, Hyser,Chris, Ingo Molnar,
	Vincent Guittot, Valentin Schneider, Mel Gorman,
	Linux List Kernel Mailing, Thomas Gleixner, Don Hiatt

On Fri, Apr 30, 2021 at 4:40 AM Josh Don <joshdon@google.com> wrote:
>
> On Thu, Apr 29, 2021 at 1:03 AM Aubrey Li <aubrey.intel@gmail.com> wrote:
> >
> > On Thu, Apr 22, 2021 at 8:39 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > ----snip----
> > > @@ -199,6 +224,25 @@ void raw_spin_rq_unlock(struct rq *rq)
> > >         raw_spin_unlock(rq_lockp(rq));
> > >  }
> > >
> > > +#ifdef CONFIG_SMP
> > > +/*
> > > + * double_rq_lock - safely lock two runqueues
> > > + */
> > > +void double_rq_lock(struct rq *rq1, struct rq *rq2)
> > > +{
> > > +       lockdep_assert_irqs_disabled();
> > > +
> > > +       if (rq1->cpu > rq2->cpu)
> >
> > It's still a bit hard for me to digest this function, I guess using (rq->cpu)
> > can't guarantee the sequence of locking when coresched is enabled.
> >
> > - cpu1 and cpu7 shares lockA
> > - cpu2 and cpu8 shares lockB
> >
> > double_rq_lock(1,8) leads to lock(A) and lock(B)
> > double_rq_lock(7,2) leads to lock(B) and lock(A)
> >
> > change to below to avoid ABBA?
> > +       if (__rq_lockp(rq1) > __rq_lockp(rq2))
> >
> > Please correct me if I was wrong.
>
> Great catch Aubrey. This is possibly what is causing the lockups that
> Don is seeing.
>
> The proposed usage of __rq_lockp() is prone to race with sched core
> being enabled/disabled. It also won't order properly if we do
> double_rq_lock(smt0, smt1) vs double_rq_lock(smt1, smt0), since these
> would have equivalent __rq_lockp()

If __rq_lockp(smt0) == __rq_lockp(smt1), rq0 and rq1 won't swap;
later only one rq lock is taken and the function just returns. I'm not sure
how that fails to order properly?

> I'd propose an alternative but similar idea: order by core, then break ties
> by ordering on cpu.
>
> +#ifdef CONFIG_SCHED_CORE
> +       if (rq1->core->cpu > rq2->core->cpu)
> +               swap(rq1, rq2);
> +       else if (rq1->core->cpu == rq2->core->cpu && rq1->cpu > rq2->cpu)
> +               swap(rq1, rq2);

That is, why is the "else if" branch needed?

> +#else
>         if (rq1->cpu > rq2->cpu)
>                 swap(rq1, rq2);
> +#endif

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-29 21:09                       ` Don Hiatt
  2021-04-29 23:22                         ` Josh Don
@ 2021-04-30  8:26                         ` Aubrey Li
  1 sibling, 0 replies; 103+ messages in thread
From: Aubrey Li @ 2021-04-30  8:26 UTC (permalink / raw)
  To: Don Hiatt
  Cc: Josh Don, Aubrey Li, Peter Zijlstra, Joel Fernandes, Hyser,Chris,
	Ingo Molnar, Vincent Guittot, Valentin Schneider, Mel Gorman,
	linux-kernel, Thomas Gleixner

On Fri, Apr 30, 2021 at 5:09 AM Don Hiatt <dhiatt@digitalocean.com> wrote:
>
> On Thu, Apr 29, 2021 at 1:48 PM Josh Don <joshdon@google.com> wrote:
> >
> > On Wed, Apr 28, 2021 at 9:41 AM Don Hiatt <dhiatt@digitalocean.com> wrote:
> > >
> > > I'm still seeing hard lockups while repeatedly setting cookies on qemu
> > > processes even with
> > > the updated patch. If there is any debug you'd like me to turn on,
> > > just let me know.
> > >
> > > Thanks!
> > >
> > > don
> >
> > Thanks for the added context on your repro configuration. In addition
> > to the updated patch from earlier, could you try the modification to
> > double_rq_lock() from
> > https://lkml.kernel.org/r/CABk29NuS-B3n4sbmavo0NDA1OCCsz6Zf2VDjjFQvAxBMQoJ_Lg@mail.gmail.com
> > ? I have a feeling this is what's causing your lockup.
> >
> > Best,
> > Josh
>
> Hi Josh,
>
> I've been running Aubrey+Peter's patch (attached) for almost 5 hours
> and haven't had a single issue. :)
>
> I'm running a set-cookie script every 5 seconds on the two VMs (each
> vm is running
> 'sysbench --threads=1 --time=0 cpu run' to generate some load in the vm) and
> I'm running two of the same sysbench runs on the HV while setting cookies
> every 5 seconds.
>
> Unless I jinxed us it looks like a great fix. :)
>
> Let me know if there is anything else you'd like me to try. I'm going
> to leave the tests running
> and see what happens. I'll update with what I find.
>

This sounds great, thanks Don.

However, I guess we have the same problem in _double_lock_balance(), which
needs to be fixed as well.

I or Josh can make a better patch for this after Peter's comments.

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-30  8:20       ` Aubrey Li
@ 2021-04-30  8:48         ` Josh Don
  2021-04-30 14:15           ` Aubrey Li
  0 siblings, 1 reply; 103+ messages in thread
From: Josh Don @ 2021-04-30  8:48 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Peter Zijlstra, Joel Fernandes, Hyser,Chris, Ingo Molnar,
	Vincent Guittot, Valentin Schneider, Mel Gorman,
	Linux List Kernel Mailing, Thomas Gleixner, Don Hiatt

On Fri, Apr 30, 2021 at 1:20 AM Aubrey Li <aubrey.intel@gmail.com> wrote:
>
> On Fri, Apr 30, 2021 at 4:40 AM Josh Don <joshdon@google.com> wrote:
> >
> > On Thu, Apr 29, 2021 at 1:03 AM Aubrey Li <aubrey.intel@gmail.com> wrote:
> > >
> > > On Thu, Apr 22, 2021 at 8:39 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > > ----snip----
> > > > @@ -199,6 +224,25 @@ void raw_spin_rq_unlock(struct rq *rq)
> > > >         raw_spin_unlock(rq_lockp(rq));
> > > >  }
> > > >
> > > > +#ifdef CONFIG_SMP
> > > > +/*
> > > > + * double_rq_lock - safely lock two runqueues
> > > > + */
> > > > +void double_rq_lock(struct rq *rq1, struct rq *rq2)
> > > > +{
> > > > +       lockdep_assert_irqs_disabled();
> > > > +
> > > > +       if (rq1->cpu > rq2->cpu)
> > >
> > > It's still a bit hard for me to digest this function, I guess using (rq->cpu)
> > > can't guarantee the sequence of locking when coresched is enabled.
> > >
> > > - cpu1 and cpu7 shares lockA
> > > - cpu2 and cpu8 shares lockB
> > >
> > > double_rq_lock(1,8) leads to lock(A) and lock(B)
> > > double_rq_lock(7,2) leads to lock(B) and lock(A)
> > >
> > > change to below to avoid ABBA?
> > > +       if (__rq_lockp(rq1) > __rq_lockp(rq2))
> > >
> > > Please correct me if I was wrong.
> >
> > Great catch Aubrey. This is possibly what is causing the lockups that
> > Don is seeing.
> >
> > The proposed usage of __rq_lockp() is prone to race with sched core
> > being enabled/disabled. It also won't order properly if we do
> > double_rq_lock(smt0, smt1) vs double_rq_lock(smt1, smt0), since these
> > would have equivalent __rq_lockp()
>
> > If __rq_lockp(smt0) == __rq_lockp(smt1), rq0 and rq1 won't swap;
> > later only one rq lock is taken and the function just returns. I'm not sure
> > how that fails to order properly?

If there is a concurrent switch from sched_core enable <-> disable,
the value of __rq_lockp() will race.

In the version you posted directly above, where we swap rq1 and rq2 if
__rq_lockp(rq1) > __rq_lockp(rq2) rather than comparing the cpu, the
following can happen:

cpu 1 and cpu 7 share a core lock when coresched is enabled

- schedcore enabled
- double_lock(7, 1)
- __rq_lockp compares equal for 7 and 1; no swap is done
- schedcore disabled; now __rq_lockp returns the per-rq lock
- lock(__rq_lockp(7)) => lock(7)
- lock(__rq_lockp(1)) => lock(1)

Then we can also have

- schedcore disabled
- double_lock(1, 7)
- __rq_lockp(1) < __rq_lockp(7), so no swap
- lock(__rq_lockp(1)) => lock(1)
- lock(__rq_lockp(7)) => lock(7)

So the first case takes the locks in the order 7->1 and the second in the order 1->7.
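
To make the proposed ordering rule concrete, here is a stand-alone sketch
in user-space C (pthread mutexes instead of raw spinlocks; fake_rq,
rq_order_less() and double_lock() are illustrative names only, not the
kernel implementation). The point is to order the two acquisitions by
identifiers that never change (core id first, cpu id as tie-break) instead
of by a lock pointer that changes when core scheduling is flipped, so every
caller agrees on the order and ABBA cannot happen:

#include <pthread.h>

struct fake_rq {
	int cpu;		/* stable identity of this runqueue */
	int core_cpu;		/* stable: lowest cpu id in the core */
	pthread_mutex_t *lock;	/* may point at a lock shared with the sibling */
};

/* order by core first, then break ties by cpu */
static int rq_order_less(struct fake_rq *rq1, struct fake_rq *rq2)
{
	if (rq1->core_cpu != rq2->core_cpu)
		return rq1->core_cpu < rq2->core_cpu;
	return rq1->cpu < rq2->cpu;
}

static void double_lock(struct fake_rq *rq1, struct fake_rq *rq2)
{
	if (rq_order_less(rq2, rq1)) {	/* always start with the "smaller" rq */
		struct fake_rq *tmp = rq1;
		rq1 = rq2;
		rq2 = tmp;
	}
	pthread_mutex_lock(rq1->lock);
	if (rq1->lock == rq2->lock)	/* siblings sharing one core lock */
		return;
	pthread_mutex_lock(rq2->lock);
}

With the cpu1/cpu7 (lockA) and cpu2/cpu8 (lockB) example from earlier in
the thread, double_lock(1, 8) and double_lock(7, 2) both take lockA before
lockB, and the comparison keys do not change when core scheduling is
toggled.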

>
> > I'd propose an alternative but similar idea: order by core, then break ties
> > by ordering on cpu.
> >
> > +#ifdef CONFIG_SCHED_CORE
> > +       if (rq1->core->cpu > rq2->core->cpu)
> > +               swap(rq1, rq2);
> > +       else if (rq1->core->cpu == rq2->core->cpu && rq1->cpu > rq2->cpu)
> > +               swap(rq1, rq2);
>
> > That is, why is the "else if" branch needed?

Ensuring that core siblings always take their locks in the same order
if coresched is disabled.

>
> > +#else
> >         if (rq1->cpu > rq2->cpu)
> >                 swap(rq1, rq2);
> > +#endif

Best,
Josh

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-30  8:48         ` Josh Don
@ 2021-04-30 14:15           ` Aubrey Li
  0 siblings, 0 replies; 103+ messages in thread
From: Aubrey Li @ 2021-04-30 14:15 UTC (permalink / raw)
  To: Josh Don
  Cc: Peter Zijlstra, Joel Fernandes, Hyser,Chris, Ingo Molnar,
	Vincent Guittot, Valentin Schneider, Mel Gorman,
	Linux List Kernel Mailing, Thomas Gleixner, Don Hiatt

On Fri, Apr 30, 2021 at 4:48 PM Josh Don <joshdon@google.com> wrote:
>
> On Fri, Apr 30, 2021 at 1:20 AM Aubrey Li <aubrey.intel@gmail.com> wrote:
> >
> > On Fri, Apr 30, 2021 at 4:40 AM Josh Don <joshdon@google.com> wrote:
> > >
> > > On Thu, Apr 29, 2021 at 1:03 AM Aubrey Li <aubrey.intel@gmail.com> wrote:
> > > >
> > > > On Thu, Apr 22, 2021 at 8:39 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > > > ----snip----
> > > > > @@ -199,6 +224,25 @@ void raw_spin_rq_unlock(struct rq *rq)
> > > > >         raw_spin_unlock(rq_lockp(rq));
> > > > >  }
> > > > >
> > > > > +#ifdef CONFIG_SMP
> > > > > +/*
> > > > > + * double_rq_lock - safely lock two runqueues
> > > > > + */
> > > > > +void double_rq_lock(struct rq *rq1, struct rq *rq2)
> > > > > +{
> > > > > +       lockdep_assert_irqs_disabled();
> > > > > +
> > > > > +       if (rq1->cpu > rq2->cpu)
> > > >
> > > > It's still a bit hard for me to digest this function, I guess using (rq->cpu)
> > > > can't guarantee the sequence of locking when coresched is enabled.
> > > >
> > > > - cpu1 and cpu7 shares lockA
> > > > - cpu2 and cpu8 shares lockB
> > > >
> > > > double_rq_lock(1,8) leads to lock(A) and lock(B)
> > > > double_rq_lock(7,2) leads to lock(B) and lock(A)
> > > >
> > > > change to below to avoid ABBA?
> > > > +       if (__rq_lockp(rq1) > __rq_lockp(rq2))
> > > >
> > > > Please correct me if I was wrong.
> > >
> > > Great catch Aubrey. This is possibly what is causing the lockups that
> > > Don is seeing.
> > >
> > > The proposed usage of __rq_lockp() is prone to race with sched core
> > > being enabled/disabled. It also won't order properly if we do
> > > double_rq_lock(smt0, smt1) vs double_rq_lock(smt1, smt0), since these
> > > would have equivalent __rq_lockp()
> >
> > > If __rq_lockp(smt0) == __rq_lockp(smt1), rq0 and rq1 won't swap;
> > > later only one rq lock is taken and the function just returns. I'm not sure
> > > how that fails to order properly?
>
> If there is a concurrent switch from sched_core enable <-> disable,
> the value of __rq_lockp() will race.
>
> In the version you posted directly above, where we swap rq1 and rq2 if
> __rq_lockp(rq1) > __rq_lockp(rq2) rather than comparing the cpu, the
> following can happen:
>
> cpu 1 and cpu 7 share a core lock when coresched is enabled
>
> - schedcore enabled
> - double_lock(7, 1)
> - __rq_lockp compares equal for 7 and 1; no swap is done
> - schedcore disabled; now __rq_lockp returns the per-rq lock
> - lock(__rq_lockp(7)) => lock(7)
> - lock(__rq_lockp(1)) => lock(1)
>
> Then we can also have
>
> - schedcore disabled
> - double_lock(1, 7)
> - __rq_lockp(1) < __rq_lockp(7), so no swap
> - lock(__rq_lockp(1)) => lock(1)
> - lock(__rq_lockp(7)) => lock(7)
>
> So the first case takes the locks in the order 7->1 and the second in the order 1->7.
>
> >
> > > I'd propose an alternative but similar idea: order by core, then break ties
> > > by ordering on cpu.
> > >
> > > +#ifdef CONFIG_SCHED_CORE
> > > +       if (rq1->core->cpu > rq2->core->cpu)
> > > +               swap(rq1, rq2);
> > > +       else if (rq1->core->cpu == rq2->core->cpu && rq1->cpu > rq2->cpu)
> > > +               swap(rq1, rq2);
> >
> > That is, why is the "else if" branch needed?
>
> Ensuring that core siblings always take their locks in the same order
> if coresched is disabled.
>

Both this and above make sense to me. Thanks for the great elaboration, Josh!

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-29 23:22                         ` Josh Don
@ 2021-04-30 16:18                           ` Don Hiatt
  0 siblings, 0 replies; 103+ messages in thread
From: Don Hiatt @ 2021-04-30 16:18 UTC (permalink / raw)
  To: Josh Don
  Cc: Aubrey Li, Aubrey Li, Peter Zijlstra, Joel Fernandes,
	Hyser,Chris, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Mel Gorman, linux-kernel, Thomas Gleixner

On Thu, Apr 29, 2021 at 4:22 PM Josh Don <joshdon@google.com> wrote:
>
> On Thu, Apr 29, 2021 at 2:09 PM Don Hiatt <dhiatt@digitalocean.com> wrote:
> >
> > On Thu, Apr 29, 2021 at 1:48 PM Josh Don <joshdon@google.com> wrote:
> > >
> > > On Wed, Apr 28, 2021 at 9:41 AM Don Hiatt <dhiatt@digitalocean.com> wrote:
> > > >
> > > > I'm still seeing hard lockups while repeatedly setting cookies on qemu
> > > > processes even with
> > > > the updated patch. If there is any debug you'd like me to turn on,
> > > > just let me know.
> > > >
> > > > Thanks!
> > > >
> > > > don
> > >
> > > Thanks for the added context on your repro configuration. In addition
> > > to the updated patch from earlier, could you try the modification to
> > > double_rq_lock() from
> > > https://lkml.kernel.org/r/CABk29NuS-B3n4sbmavo0NDA1OCCsz6Zf2VDjjFQvAxBMQoJ_Lg@mail.gmail.com
> > > ? I have a feeling this is what's causing your lockup.
> > >
> > > Best,
> > > Josh
> >
> > Hi Josh,
> >
> > I've been running Aubrey+Peter's patch (attached) for almost 5 hours
> > and haven't had a single issue. :)
> >
> > I'm running a set-cookie script every 5 seconds on the two VMs (each
> > vm is running
> > 'sysbench --threads=1 --time=0 cpu run' to generate some load in the vm) and
> > I'm running two of the same sysbench runs on the HV while setting cookies
> > every 5 seconds.
> >
> > Unless I jinxed us it looks like a great fix. :)
> >
> > Let me know if there is anything else you'd like me to try. I'm going
> > to leave the tests running
> > and see what happens. I'll update with what I find.
> >
> > Thanks!
> >
> > don
>
> That's awesome news, thanks for validating. Note that with Aubrey's
> patch there is still a race window if sched core is being
> enabled/disabled (ie. if you alternate between there being some
> cookies in the system, and no cookies). In my reply I posted an
> alternative version to avoid that. If your script were to do the
> on-off flipping with the old patch, you might eventually see another
> lockup.

My tests have been running for 24 hours now and all is good. I'll do another
test with your changes next as well as continue to test the core-sched queue.

This and all the other patches in Peter's repo:
    Tested-by: Don Hiatt <dhiatt@digitalocean.com>

Have a great day and thanks again.

don

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-29 20:11             ` Josh Don
@ 2021-05-03 19:17               ` Peter Zijlstra
  0 siblings, 0 replies; 103+ messages in thread
From: Peter Zijlstra @ 2021-05-03 19:17 UTC (permalink / raw)
  To: Josh Don
  Cc: Joel Fernandes, Hyser,Chris, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Mel Gorman, linux-kernel, Thomas Gleixner,
	Don Hiatt

On Thu, Apr 29, 2021 at 01:11:54PM -0700, Josh Don wrote:
> On Wed, Apr 28, 2021 at 2:13 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Tue, Apr 27, 2021 at 04:30:02PM -0700, Josh Don wrote:
> >
> > > Also, did you mean to have a preempt_enable_no_resched() rather than
> > > preempt_enable() in raw_spin_rq_trylock?
> >
> > No, trylock really needs to be preempt_enable(), because it can have
> > failed, at which point it will not have incremented the preemption count
> > and our decrement can hit 0, at which point we really should reschedule.
> 
> Ah yes, makes sense. Did you want to use preempt_enable_no_resched()
> at all then? No chance of preempt_count() being 1 at the point of
> enable, as you comment below.

preempt_enable_no_resched() avoids the branch which we know we'll never
take. It generates slightly saner code. It's not much, but every little
bit helps, right? ;-)
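
For reference, the two helpers differ roughly as follows (paraphrased from
include/linux/preempt.h, CONFIG_PREEMPTION case): the _no_resched variant
only drops the count, while preempt_enable() additionally tests whether the
count hit zero and then calls into the scheduler.

#define preempt_enable_no_resched() \
do { \
	barrier(); \
	preempt_count_dec(); \
} while (0)

#define preempt_enable() \
do { \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) \
		__preempt_schedule(); \
} while (0)

Since raw_spin_rq_lock_nested() still holds the just-acquired rq lock when
it re-enables preemption, the count is known to be greater than one there,
so that test could never fire anyway.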

> > ( I'm suffering a cold and am really quite slow atm )
> 
> No worries, hope it's a mild one.

It's not the super popular one from '19, and the family seems to be
mostly recovered again, so all's well.

> > How's this then?
> 
> Looks good to me (other than the synchronize_sched()->synchronize_rcu()).
> 
> For these locking patches,
> Reviewed-by: Josh Don <joshdon@google.com>

Thanks! I'll go fold them into the proper place and update the repo.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-29 20:39     ` Josh Don
  2021-04-30  8:20       ` Aubrey Li
@ 2021-05-04  7:38       ` Peter Zijlstra
  2021-05-05 16:20         ` Don Hiatt
  1 sibling, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2021-05-04  7:38 UTC (permalink / raw)
  To: Josh Don
  Cc: Aubrey Li, Joel Fernandes, Hyser,Chris, Ingo Molnar,
	Vincent Guittot, Valentin Schneider, Mel Gorman,
	Linux List Kernel Mailing, Thomas Gleixner, Don Hiatt

On Thu, Apr 29, 2021 at 01:39:54PM -0700, Josh Don wrote:

> > > +void double_rq_lock(struct rq *rq1, struct rq *rq2)
> > > +{
> > > +       lockdep_assert_irqs_disabled();
> > > +
> > > +       if (rq1->cpu > rq2->cpu)
> >
> > It's still a bit hard for me to digest this function, I guess using (rq->cpu)
> > can't guarantee the sequence of locking when coresched is enabled.
> >
> > - cpu1 and cpu7 shares lockA
> > - cpu2 and cpu8 shares lockB
> >
> > double_rq_lock(1,8) leads to lock(A) and lock(B)
> > double_rq_lock(7,2) leads to lock(B) and lock(A)

Good one!

> > change to below to avoid ABBA?
> > +       if (__rq_lockp(rq1) > __rq_lockp(rq2))

This, however, is badly broken: not only does it suffer from the problem Josh
pointed out, it also breaks the rq->__lock ordering vs
__sched_core_flip(), which was the whole reason the ordering needed
changing in the first place.

> I'd propose an alternative but
> similar idea: order by core, then break ties by ordering on cpu.
> 
> +#ifdef CONFIG_SCHED_CORE
> +       if (rq1->core->cpu > rq2->core->cpu)
> +               swap(rq1, rq2);
> +       else if (rq1->core->cpu == rq2->core->cpu && rq1->cpu > rq2->cpu)
> +               swap(rq1, rq2);
> +#else
>         if (rq1->cpu > rq2->cpu)
>                 swap(rq1, rq2);
> +#endif

I've written it like so:

static inline bool rq_order_less(struct rq *rq1, struct rq *rq2)
{
#ifdef CONFIG_SCHED_CORE
	if (rq1->core->cpu < rq2->core->cpu)
		return true;
	if (rq1->core->cpu > rq2->core->cpu)
		return false;
#endif
	return rq1->cpu < rq2->cpu;
}

/*
 * double_rq_lock - safely lock two runqueues
 */
void double_rq_lock(struct rq *rq1, struct rq *rq2)
{
	lockdep_assert_irqs_disabled();

	if (rq_order_less(rq2, rq1))
		swap(rq1, rq2);

	raw_spin_rq_lock(rq1);
	if (rq_lockp(rq1) == rq_lockp(rq2))
		return;

	raw_spin_rq_lock_nested(rq2, SINGLE_DEPTH_NESTING);
}

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-05-04  7:38       ` Peter Zijlstra
@ 2021-05-05 16:20         ` Don Hiatt
  2021-05-06 10:25           ` Peter Zijlstra
  0 siblings, 1 reply; 103+ messages in thread
From: Don Hiatt @ 2021-05-05 16:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Josh Don, Aubrey Li, Joel Fernandes, Hyser,Chris, Ingo Molnar,
	Vincent Guittot, Valentin Schneider, Mel Gorman,
	Linux List Kernel Mailing, Thomas Gleixner

On Tue, May 4, 2021 at 12:38 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Apr 29, 2021 at 01:39:54PM -0700, Josh Don wrote:
>
> > > > +void double_rq_lock(struct rq *rq1, struct rq *rq2)
> > > > +{
> > > > +       lockdep_assert_irqs_disabled();
> > > > +
> > > > +       if (rq1->cpu > rq2->cpu)
> > >
> > > It's still a bit hard for me to digest this function, I guess using (rq->cpu)
> > > can't guarantee the sequence of locking when coresched is enabled.
> > >
> > > - cpu1 and cpu7 shares lockA
> > > - cpu2 and cpu8 shares lockB
> > >
> > > double_rq_lock(1,8) leads to lock(A) and lock(B)
> > > double_rq_lock(7,2) leads to lock(B) and lock(A)
>
> Good one!

Hi Peter,

I've been running the same set-cookie tests on your latest repo for
the last 24 hours and haven't had a single lockup. Thank you very
much!

don

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] sched: Prepare for Core-wide rq->lock
  2021-05-05 16:20         ` Don Hiatt
@ 2021-05-06 10:25           ` Peter Zijlstra
  0 siblings, 0 replies; 103+ messages in thread
From: Peter Zijlstra @ 2021-05-06 10:25 UTC (permalink / raw)
  To: Don Hiatt
  Cc: Josh Don, Aubrey Li, Joel Fernandes, Hyser,Chris, Ingo Molnar,
	Vincent Guittot, Valentin Schneider, Mel Gorman,
	Linux List Kernel Mailing, Thomas Gleixner

On Wed, May 05, 2021 at 09:20:38AM -0700, Don Hiatt wrote:
> On Tue, May 4, 2021 at 12:38 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, Apr 29, 2021 at 01:39:54PM -0700, Josh Don wrote:
> >
> > > > > +void double_rq_lock(struct rq *rq1, struct rq *rq2)
> > > > > +{
> > > > > +       lockdep_assert_irqs_disabled();
> > > > > +
> > > > > +       if (rq1->cpu > rq2->cpu)
> > > >
> > > > It's still a bit hard for me to digest this function, I guess using (rq->cpu)
> > > > can't guarantee the sequence of locking when coresched is enabled.
> > > >
> > > > - cpu1 and cpu7 shares lockA
> > > > - cpu2 and cpu8 shares lockB
> > > >
> > > > double_rq_lock(1,8) leads to lock(A) and lock(B)
> > > > double_rq_lock(7,2) leads to lock(B) and lock(A)
> >
> > Good one!
> 
> Hi Peter,
> 
> I've been running the same set-cookie tests on your latest repo for
> the last 24 hours and haven't had a single lockup. Thank you very
> much!

Excellent, applied your Tested-by, thanks!

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 00/19] sched: Core Scheduling
  2021-04-30  6:47 ` Ning, Hongyu
@ 2021-05-06 10:29   ` Peter Zijlstra
  2021-05-06 12:53     ` Ning, Hongyu
  0 siblings, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2021-05-06 10:29 UTC (permalink / raw)
  To: Ning, Hongyu
  Cc: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman, linux-kernel, tglx, Li, Aubrey,
	Tim Chen

On Fri, Apr 30, 2021 at 02:47:00PM +0800, Ning, Hongyu wrote:
> Adding sysbench/uperf/wis performance results for reference:
> 
> - kernel under test:
> 	-- above patchset of core-scheduling + local fix for softlockup issue: https://lore.kernel.org/lkml/5c289c5a-a120-a1d0-ca89-2724a1445fe8@linux.intel.com/
> 	-- coresched_v10 kernel source: https://github.com/digitalocean/linux-coresched/commits/coresched/v10-v5.10.y

Shall I summarize all this as:

Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>

?

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 00/19] sched: Core Scheduling
  2021-05-06 10:29   ` Peter Zijlstra
@ 2021-05-06 12:53     ` Ning, Hongyu
  0 siblings, 0 replies; 103+ messages in thread
From: Ning, Hongyu @ 2021-05-06 12:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman, linux-kernel, tglx, Li, Aubrey,
	Tim Chen



On 2021/5/6 18:29, Peter Zijlstra wrote:
> On Fri, Apr 30, 2021 at 02:47:00PM +0800, Ning, Hongyu wrote:
>> Adding sysbench/uperf/wis performance results for reference:
>>
>> - kernel under test:
>> 	-- above patchset of core-scheduling + local fix for softlockup issue: https://lore.kernel.org/lkml/5c289c5a-a120-a1d0-ca89-2724a1445fe8@linux.intel.com/
>> 	-- coresched_v10 kernel source: https://github.com/digitalocean/linux-coresched/commits/coresched/v10-v5.10.y
> 
> Shall I summarize all this as:
> 
> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
> 
> ?
> 

Yes, that would be great.

Regards,
Hongyu

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v2 04/19] sched: Prepare for Core-wide rq->lock
  2021-04-22 12:05 ` [PATCH 04/19] sched: Prepare for Core-wide rq->lock Peter Zijlstra
                     ` (2 preceding siblings ...)
  2021-04-29  8:03   ` Aubrey Li
@ 2021-05-07  9:50   ` Peter Zijlstra
  2021-05-08  8:07     ` Aubrey Li
  3 siblings, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2021-05-07  9:50 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, tglx


When switching on core-sched, CPUs need to agree which lock to use for
their RQ.

The new rule will be that rq->core_enabled will be toggled while
holding all rq->__locks that belong to a core. This means we need to
double check the rq->core_enabled value after each lock acquire and
retry if it changed.

This also has implications for those sites that take multiple RQ
locks, they need to be careful that the second lock doesn't end up
being the first lock.

Verify the lock pointer after acquiring the first lock, because if
they're on the same core, holding any of the rq->__lock instances will
pin the core state.

While there, change the rq->__lock order to CPU number, instead of rq
address; this greatly simplifies the next patch.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
---
 kernel/sched/core.c  |   48 ++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h |   48 +++++++++++++++++-------------------------------
 2 files changed, 63 insertions(+), 33 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -186,12 +186,37 @@ int sysctl_sched_rt_runtime = 950000;
 
 void raw_spin_rq_lock_nested(struct rq *rq, int subclass)
 {
-	raw_spin_lock_nested(rq_lockp(rq), subclass);
+	raw_spinlock_t *lock;
+
+	if (sched_core_disabled()) {
+		raw_spin_lock_nested(&rq->__lock, subclass);
+		return;
+	}
+
+	for (;;) {
+		lock = rq_lockp(rq);
+		raw_spin_lock_nested(lock, subclass);
+		if (likely(lock == rq_lockp(rq)))
+			return;
+		raw_spin_unlock(lock);
+	}
 }
 
 bool raw_spin_rq_trylock(struct rq *rq)
 {
-	return raw_spin_trylock(rq_lockp(rq));
+	raw_spinlock_t *lock;
+	bool ret;
+
+	if (sched_core_disabled())
+		return raw_spin_trylock(&rq->__lock);
+
+	for (;;) {
+		lock = rq_lockp(rq);
+		ret = raw_spin_trylock(lock);
+		if (!ret || (likely(lock == rq_lockp(rq))))
+			return ret;
+		raw_spin_unlock(lock);
+	}
 }
 
 void raw_spin_rq_unlock(struct rq *rq)
@@ -199,6 +224,25 @@ void raw_spin_rq_unlock(struct rq *rq)
 	raw_spin_unlock(rq_lockp(rq));
 }
 
+#ifdef CONFIG_SMP
+/*
+ * double_rq_lock - safely lock two runqueues
+ */
+void double_rq_lock(struct rq *rq1, struct rq *rq2)
+{
+	lockdep_assert_irqs_disabled();
+
+	if (rq_order_less(rq2, rq1))
+		swap(rq1, rq2);
+
+	raw_spin_rq_lock(rq1);
+	if (rq_lockp(rq1) == rq_lockp(rq2))
+		return;
+
+	raw_spin_rq_lock_nested(rq2, SINGLE_DEPTH_NESTING);
+}
+#endif
+
 /*
  * __task_rq_lock - lock the rq @p resides on.
  */
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1113,6 +1113,11 @@ static inline bool is_migration_disabled
 #endif
 }
 
+static inline bool sched_core_disabled(void)
+{
+	return true;
+}
+
 static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 {
 	return &rq->__lock;
@@ -2231,10 +2236,17 @@ unsigned long arch_scale_freq_capacity(i
 }
 #endif
 
+
 #ifdef CONFIG_SMP
-#ifdef CONFIG_PREEMPTION
 
-static inline void double_rq_lock(struct rq *rq1, struct rq *rq2);
+static inline bool rq_order_less(struct rq *rq1, struct rq *rq2)
+{
+	return rq1->cpu < rq2->cpu;
+}
+
+extern void double_rq_lock(struct rq *rq1, struct rq *rq2);
+
+#ifdef CONFIG_PREEMPTION
 
 /*
  * fair double_lock_balance: Safely acquires both rq->locks in a fair
@@ -2274,14 +2286,13 @@ static inline int _double_lock_balance(s
 	if (likely(raw_spin_rq_trylock(busiest)))
 		return 0;
 
-	if (rq_lockp(busiest) >= rq_lockp(this_rq)) {
+	if (rq_order_less(this_rq, busiest)) {
 		raw_spin_rq_lock_nested(busiest, SINGLE_DEPTH_NESTING);
 		return 0;
 	}
 
 	raw_spin_rq_unlock(this_rq);
-	raw_spin_rq_lock(busiest);
-	raw_spin_rq_lock_nested(this_rq, SINGLE_DEPTH_NESTING);
+	double_rq_lock(this_rq, busiest);
 
 	return 1;
 }
@@ -2334,31 +2345,6 @@ static inline void double_raw_lock(raw_s
 }
 
 /*
- * double_rq_lock - safely lock two runqueues
- *
- * Note this does not disable interrupts like task_rq_lock,
- * you need to do so manually before calling.
- */
-static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
-	__acquires(rq1->lock)
-	__acquires(rq2->lock)
-{
-	BUG_ON(!irqs_disabled());
-	if (rq_lockp(rq1) == rq_lockp(rq2)) {
-		raw_spin_rq_lock(rq1);
-		__acquire(rq2->lock);	/* Fake it out ;) */
-	} else {
-		if (rq_lockp(rq1) < rq_lockp(rq2)) {
-			raw_spin_rq_lock(rq1);
-			raw_spin_rq_lock_nested(rq2, SINGLE_DEPTH_NESTING);
-		} else {
-			raw_spin_rq_lock(rq2);
-			raw_spin_rq_lock_nested(rq1, SINGLE_DEPTH_NESTING);
-		}
-	}
-}
-
-/*
  * double_rq_unlock - safely unlock two runqueues
  *
  * Note this does not restore interrupts like task_rq_unlock,
@@ -2368,11 +2354,11 @@ static inline void double_rq_unlock(stru
 	__releases(rq1->lock)
 	__releases(rq2->lock)
 {
-	raw_spin_rq_unlock(rq1);
 	if (rq_lockp(rq1) != rq_lockp(rq2))
 		raw_spin_rq_unlock(rq2);
 	else
 		__release(rq2->lock);
+	raw_spin_rq_unlock(rq1);
 }
 
 extern void set_rq_online (struct rq *rq);

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH v2 05/19] sched: Core-wide rq->lock
  2021-04-22 12:05 ` [PATCH 05/19] sched: " Peter Zijlstra
@ 2021-05-07  9:50   ` Peter Zijlstra
  2021-05-12 10:28     ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  0 siblings, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2021-05-07  9:50 UTC (permalink / raw)
  To: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman
  Cc: linux-kernel, tglx


Introduce the basic infrastructure to have a core wide rq->lock.

This relies on the rq->__lock order being in increasing CPU number
(inside a core). It is also constrained to SMT8 per lockdep (and
SMT256 per preempt_count).

Luckily SMT8 is the max supported SMT count for Linux (Mips, Sparc and
Power are known to have this).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
---
 kernel/Kconfig.preempt |    6 +
 kernel/sched/core.c    |  164 +++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h   |   58 +++++++++++++++++
 3 files changed, 224 insertions(+), 4 deletions(-)

--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -99,3 +99,9 @@ config PREEMPT_DYNAMIC
 
 	  Interesting if you want the same pre-built kernel should be used for
 	  both Server and Desktop workloads.
+
+config SCHED_CORE
+	bool "Core Scheduling for SMT"
+	default y
+	depends on SCHED_SMT
+
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -84,6 +84,108 @@ unsigned int sysctl_sched_rt_period = 10
 
 __read_mostly int scheduler_running;
 
+#ifdef CONFIG_SCHED_CORE
+
+DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+/*
+ * Magic required such that:
+ *
+ *	raw_spin_rq_lock(rq);
+ *	...
+ *	raw_spin_rq_unlock(rq);
+ *
+ * ends up locking and unlocking the _same_ lock, and all CPUs
+ * always agree on what rq has what lock.
+ *
+ * XXX entirely possible to selectively enable cores, don't bother for now.
+ */
+
+static DEFINE_MUTEX(sched_core_mutex);
+static int sched_core_count;
+static struct cpumask sched_core_mask;
+
+static void __sched_core_flip(bool enabled)
+{
+	int cpu, t, i;
+
+	cpus_read_lock();
+
+	/*
+	 * Toggle the online cores, one by one.
+	 */
+	cpumask_copy(&sched_core_mask, cpu_online_mask);
+	for_each_cpu(cpu, &sched_core_mask) {
+		const struct cpumask *smt_mask = cpu_smt_mask(cpu);
+
+		i = 0;
+		local_irq_disable();
+		for_each_cpu(t, smt_mask) {
+			/* supports up to SMT8 */
+			raw_spin_lock_nested(&cpu_rq(t)->__lock, i++);
+		}
+
+		for_each_cpu(t, smt_mask)
+			cpu_rq(t)->core_enabled = enabled;
+
+		for_each_cpu(t, smt_mask)
+			raw_spin_unlock(&cpu_rq(t)->__lock);
+		local_irq_enable();
+
+		cpumask_andnot(&sched_core_mask, &sched_core_mask, smt_mask);
+	}
+
+	/*
+	 * Toggle the offline CPUs.
+	 */
+	cpumask_copy(&sched_core_mask, cpu_possible_mask);
+	cpumask_andnot(&sched_core_mask, &sched_core_mask, cpu_online_mask);
+
+	for_each_cpu(cpu, &sched_core_mask)
+		cpu_rq(cpu)->core_enabled = enabled;
+
+	cpus_read_unlock();
+}
+
+static void __sched_core_enable(void)
+{
+	// XXX verify there are no cookie tasks (yet)
+
+	static_branch_enable(&__sched_core_enabled);
+	/*
+	 * Ensure all previous instances of raw_spin_rq_*lock() have finished
+	 * and future ones will observe !sched_core_disabled().
+	 */
+	synchronize_rcu();
+	__sched_core_flip(true);
+}
+
+static void __sched_core_disable(void)
+{
+	// XXX verify there are no cookie tasks (left)
+
+	__sched_core_flip(false);
+	static_branch_disable(&__sched_core_enabled);
+}
+
+void sched_core_get(void)
+{
+	mutex_lock(&sched_core_mutex);
+	if (!sched_core_count++)
+		__sched_core_enable();
+	mutex_unlock(&sched_core_mutex);
+}
+
+void sched_core_put(void)
+{
+	mutex_lock(&sched_core_mutex);
+	if (!--sched_core_count)
+		__sched_core_disable();
+	mutex_unlock(&sched_core_mutex);
+}
+
+#endif /* CONFIG_SCHED_CORE */
+
 /*
  * part of the period that we allow rt tasks to run in us.
  * default: 0.95s
@@ -188,16 +290,23 @@ void raw_spin_rq_lock_nested(struct rq *
 {
 	raw_spinlock_t *lock;
 
+	/* Matches synchronize_rcu() in __sched_core_enable() */
+	preempt_disable();
 	if (sched_core_disabled()) {
 		raw_spin_lock_nested(&rq->__lock, subclass);
+		/* preempt_count *MUST* be > 1 */
+		preempt_enable_no_resched();
 		return;
 	}
 
 	for (;;) {
 		lock = rq_lockp(rq);
 		raw_spin_lock_nested(lock, subclass);
-		if (likely(lock == rq_lockp(rq)))
+		if (likely(lock == rq_lockp(rq))) {
+			/* preempt_count *MUST* be > 1 */
+			preempt_enable_no_resched();
 			return;
+		}
 		raw_spin_unlock(lock);
 	}
 }
@@ -207,14 +316,21 @@ bool raw_spin_rq_trylock(struct rq *rq)
 	raw_spinlock_t *lock;
 	bool ret;
 
-	if (sched_core_disabled())
-		return raw_spin_trylock(&rq->__lock);
+	/* Matches synchronize_rcu() in __sched_core_enable() */
+	preempt_disable();
+	if (sched_core_disabled()) {
+		ret = raw_spin_trylock(&rq->__lock);
+		preempt_enable();
+		return ret;
+	}
 
 	for (;;) {
 		lock = rq_lockp(rq);
 		ret = raw_spin_trylock(lock);
-		if (!ret || (likely(lock == rq_lockp(rq))))
+		if (!ret || (likely(lock == rq_lockp(rq)))) {
+			preempt_enable();
 			return ret;
+		}
 		raw_spin_unlock(lock);
 	}
 }
@@ -5042,6 +5158,40 @@ pick_next_task(struct rq *rq, struct tas
 	BUG();
 }
 
+#ifdef CONFIG_SCHED_CORE
+
+static inline void sched_core_cpu_starting(unsigned int cpu)
+{
+	const struct cpumask *smt_mask = cpu_smt_mask(cpu);
+	struct rq *rq, *core_rq = NULL;
+	int i;
+
+	core_rq = cpu_rq(cpu)->core;
+
+	if (!core_rq) {
+		for_each_cpu(i, smt_mask) {
+			rq = cpu_rq(i);
+			if (rq->core && rq->core == rq)
+				core_rq = rq;
+		}
+
+		if (!core_rq)
+			core_rq = cpu_rq(cpu);
+
+		for_each_cpu(i, smt_mask) {
+			rq = cpu_rq(i);
+
+			WARN_ON_ONCE(rq->core && rq->core != core_rq);
+			rq->core = core_rq;
+		}
+	}
+}
+#else /* !CONFIG_SCHED_CORE */
+
+static inline void sched_core_cpu_starting(unsigned int cpu) {}
+
+#endif /* CONFIG_SCHED_CORE */
+
 /*
  * __schedule() is the main scheduler function.
  *
@@ -8007,6 +8157,7 @@ static void sched_rq_cpu_starting(unsign
 
 int sched_cpu_starting(unsigned int cpu)
 {
+	sched_core_cpu_starting(cpu);
 	sched_rq_cpu_starting(cpu);
 	sched_tick_start(cpu);
 	return 0;
@@ -8291,6 +8442,11 @@ void __init sched_init(void)
 #endif /* CONFIG_SMP */
 		hrtick_rq_init(rq);
 		atomic_set(&rq->nr_iowait, 0);
+
+#ifdef CONFIG_SCHED_CORE
+		rq->core = NULL;
+		rq->core_enabled = 0;
+#endif
 	}
 
 	set_load_weight(&init_task, false);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1075,6 +1075,12 @@ struct rq {
 #endif
 	unsigned int		push_busy;
 	struct cpu_stop_work	push_work;
+
+#ifdef CONFIG_SCHED_CORE
+	/* per rq */
+	struct rq		*core;
+	unsigned int		core_enabled;
+#endif
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -1113,6 +1119,35 @@ static inline bool is_migration_disabled
 #endif
 }
 
+#ifdef CONFIG_SCHED_CORE
+
+DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+	return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled;
+}
+
+static inline bool sched_core_disabled(void)
+{
+	return !static_branch_unlikely(&__sched_core_enabled);
+}
+
+static inline raw_spinlock_t *rq_lockp(struct rq *rq)
+{
+	if (sched_core_enabled(rq))
+		return &rq->core->__lock;
+
+	return &rq->__lock;
+}
+
+#else /* !CONFIG_SCHED_CORE */
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+	return false;
+}
+
 static inline bool sched_core_disabled(void)
 {
 	return true;
@@ -1123,6 +1158,8 @@ static inline raw_spinlock_t *rq_lockp(s
 	return &rq->__lock;
 }
 
+#endif /* CONFIG_SCHED_CORE */
+
 static inline void lockdep_assert_rq_held(struct rq *rq)
 {
 	lockdep_assert_held(rq_lockp(rq));
@@ -2241,6 +2278,27 @@ unsigned long arch_scale_freq_capacity(i
 
 static inline bool rq_order_less(struct rq *rq1, struct rq *rq2)
 {
+#ifdef CONFIG_SCHED_CORE
+	/*
+	 * In order to not have {0,2},{1,3} turn into an AB-BA,
+	 * order by core-id first and cpu-id second.
+	 *
+	 * Notably:
+	 *
+	 *	double_rq_lock(0,3); will take core-0, core-1 lock
+	 *	double_rq_lock(1,2); will take core-1, core-0 lock
+	 *
+	 * when only cpu-id is considered.
+	 */
+	if (rq1->core->cpu < rq2->core->cpu)
+		return true;
+	if (rq1->core->cpu > rq2->core->cpu)
+		return false;
+
+	/*
+	 * __sched_core_flip() relies on SMT having cpu-id lock order.
+	 */
+#endif
 	return rq1->cpu < rq2->cpu;
 }
 

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 00/19] sched: Core Scheduling
  2021-04-22 12:04 [PATCH 00/19] sched: Core Scheduling Peter Zijlstra
                   ` (20 preceding siblings ...)
  2021-04-30  6:47 ` Ning, Hongyu
@ 2021-05-07 18:02 ` Joel Fernandes
  2021-05-10 16:16 ` Vincent Guittot
  22 siblings, 0 replies; 103+ messages in thread
From: Joel Fernandes @ 2021-05-07 18:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hyser,Chris, Josh Don, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Mel Gorman, LKML, Thomas Glexiner

On Thu, Apr 22, 2021 at 8:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Hai,
>
> This is an agressive fold of all the core-scheduling work so far. I've stripped
> a whole bunch of tags along the way (hopefully not too many, please yell if you
> feel I made a mistake), including tested-by. Please retest.
>
> Changes since the last partial post is dropping all the cgroup stuff and
> PR_SCHED_CORE_CLEAR as well as that exec() behaviour in order to later resolve
> the cgroup issue.
>
> Since we're really rather late for the coming merge window, my plan was to
> merge the lot right after the merge window.
>
> Again, please test.

Thanks Peter, I uploaded this set to our ChromeOS 5.10 kernel tree
with minor changes to make it apply, to:
https://chromium-review.googlesource.com/q/topic:cs510
Let me know if anything looks weird, but it does build. I will be
testing it further in the coming days.

Of course product kernels slightly lag upstream but such is life <:-)

 - Joel

>
> These patches should shortly be available in my queue.git.
>
> ---
>  b/kernel/sched/core_sched.c                     |  229 ++++++
>  b/tools/testing/selftests/sched/.gitignore      |    1
>  b/tools/testing/selftests/sched/Makefile        |   14
>  b/tools/testing/selftests/sched/config          |    1
>  b/tools/testing/selftests/sched/cs_prctl_test.c |  338 +++++++++
>  include/linux/sched.h                           |   19
>  include/uapi/linux/prctl.h                      |    8
>  kernel/Kconfig.preempt                          |    6
>  kernel/fork.c                                   |    4
>  kernel/sched/Makefile                           |    1
>  kernel/sched/core.c                             |  858 ++++++++++++++++++++++--
>  kernel/sched/cpuacct.c                          |   12
>  kernel/sched/deadline.c                         |   38 -
>  kernel/sched/debug.c                            |    4
>  kernel/sched/fair.c                             |  276 +++++--
>  kernel/sched/idle.c                             |   13
>  kernel/sched/pelt.h                             |    2
>  kernel/sched/rt.c                               |   31
>  kernel/sched/sched.h                            |  393 ++++++++--
>  kernel/sched/stop_task.c                        |   14
>  kernel/sched/topology.c                         |    4
>  kernel/sys.c                                    |    5
>  tools/include/uapi/linux/prctl.h                |    8
>  23 files changed, 2057 insertions(+), 222 deletions(-)
>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 04/19] sched: Prepare for Core-wide rq->lock
  2021-05-07  9:50   ` [PATCH v2 " Peter Zijlstra
@ 2021-05-08  8:07     ` Aubrey Li
  2021-05-12  9:07       ` Peter Zijlstra
  0 siblings, 1 reply; 103+ messages in thread
From: Aubrey Li @ 2021-05-08  8:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes, Hyser,Chris, Josh Don, Ingo Molnar,
	Vincent Guittot, Valentin Schneider, Mel Gorman,
	Linux List Kernel Mailing, Thomas Gleixner

On Fri, May 7, 2021 at 8:34 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
>
> When switching on core-sched, CPUs need to agree which lock to use for
> their RQ.
>
> The new rule will be that rq->core_enabled will be toggled while
> holding all rq->__locks that belong to a core. This means we need to
> double check the rq->core_enabled value after each lock acquire and
> retry if it changed.
>
> This also has implications for those sites that take multiple RQ
> locks, they need to be careful that the second lock doesn't end up
> being the first lock.
>
> Verify the lock pointer after acquiring the first lock, because if
> they're on the same core, holding any of the rq->__lock instances will
> pin the core state.
>
> While there, change the rq->__lock order to CPU number, instead of rq
> address; this greatly simplifies the next patch.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Tested-by: Don Hiatt <dhiatt@digitalocean.com>
> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
> ---
>  kernel/sched/core.c  |   48 ++++++++++++++++++++++++++++++++++++++++++++++--
>  kernel/sched/sched.h |   48 +++++++++++++++++-------------------------------
>  2 files changed, 63 insertions(+), 33 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -186,12 +186,37 @@ int sysctl_sched_rt_runtime = 950000;
>
>  void raw_spin_rq_lock_nested(struct rq *rq, int subclass)
>  {
> -       raw_spin_lock_nested(rq_lockp(rq), subclass);
> +       raw_spinlock_t *lock;
> +
> +       if (sched_core_disabled()) {
> +               raw_spin_lock_nested(&rq->__lock, subclass);
> +               return;
> +       }
> +
> +       for (;;) {
> +               lock = rq_lockp(rq);
> +               raw_spin_lock_nested(lock, subclass);
> +               if (likely(lock == rq_lockp(rq)))
> +                       return;
> +               raw_spin_unlock(lock);
> +       }
>  }
>
>  bool raw_spin_rq_trylock(struct rq *rq)
>  {
> -       return raw_spin_trylock(rq_lockp(rq));
> +       raw_spinlock_t *lock;
> +       bool ret;
> +
> +       if (sched_core_disabled())
> +               return raw_spin_trylock(&rq->__lock);
> +
> +       for (;;) {
> +               lock = rq_lockp(rq);
> +               ret = raw_spin_trylock(lock);
> +               if (!ret || (likely(lock == rq_lockp(rq))))
> +                       return ret;
> +               raw_spin_unlock(lock);
> +       }
>  }
>
>  void raw_spin_rq_unlock(struct rq *rq)
> @@ -199,6 +224,25 @@ void raw_spin_rq_unlock(struct rq *rq)
>         raw_spin_unlock(rq_lockp(rq));
>  }
>
> +#ifdef CONFIG_SMP
> +/*
> + * double_rq_lock - safely lock two runqueues
> + */
> +void double_rq_lock(struct rq *rq1, struct rq *rq2)

Do we need the static lock checking here?
        __acquires(rq1->lock)
        __acquires(rq2->lock)

> +{
> +       lockdep_assert_irqs_disabled();
> +
> +       if (rq_order_less(rq2, rq1))
> +               swap(rq1, rq2);
> +
> +       raw_spin_rq_lock(rq1);
> +       if (rq_lockp(rq1) == rq_lockp(rq2)) {

And here?
                __acquire(rq2->lock);

> +               return;
> +       }
> +
> +       raw_spin_rq_lock_nested(rq2, SINGLE_DEPTH_NESTING);
> +}
> +#endif
> +
>  /*
>   * __task_rq_lock - lock the rq @p resides on.
>   */
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1113,6 +1113,11 @@ static inline bool is_migration_disabled
>  #endif
>  }
>
> +static inline bool sched_core_disabled(void)
> +{
> +       return true;
> +}
> +
>  static inline raw_spinlock_t *rq_lockp(struct rq *rq)
>  {
>         return &rq->__lock;
> @@ -2231,10 +2236,17 @@ unsigned long arch_scale_freq_capacity(i
>  }
>  #endif
>
> +
>  #ifdef CONFIG_SMP
> -#ifdef CONFIG_PREEMPTION
>
> -static inline void double_rq_lock(struct rq *rq1, struct rq *rq2);
> +static inline bool rq_order_less(struct rq *rq1, struct rq *rq2)
> +{
> +       return rq1->cpu < rq2->cpu;
> +}
> +
> +extern void double_rq_lock(struct rq *rq1, struct rq *rq2);
> +
> +#ifdef CONFIG_PREEMPTION
>
>  /*
>   * fair double_lock_balance: Safely acquires both rq->locks in a fair
> @@ -2274,14 +2286,13 @@ static inline int _double_lock_balance(s
>         if (likely(raw_spin_rq_trylock(busiest)))
>                 return 0;
>
> -       if (rq_lockp(busiest) >= rq_lockp(this_rq)) {
> +       if (rq_order_less(this_rq, busiest)) {
>                 raw_spin_rq_lock_nested(busiest, SINGLE_DEPTH_NESTING);
>                 return 0;
>         }
>
>         raw_spin_rq_unlock(this_rq);
> -       raw_spin_rq_lock(busiest);
> -       raw_spin_rq_lock_nested(this_rq, SINGLE_DEPTH_NESTING);
> +       double_rq_lock(this_rq, busiest);
>
>         return 1;
>  }
> @@ -2334,31 +2345,6 @@ static inline void double_raw_lock(raw_s
>  }
>
>  /*
> - * double_rq_lock - safely lock two runqueues
> - *
> - * Note this does not disable interrupts like task_rq_lock,
> - * you need to do so manually before calling.
> - */
> -static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
> -       __acquires(rq1->lock)
> -       __acquires(rq2->lock)
> -{
> -       BUG_ON(!irqs_disabled());
> -       if (rq_lockp(rq1) == rq_lockp(rq2)) {
> -               raw_spin_rq_lock(rq1);
> -               __acquire(rq2->lock);   /* Fake it out ;) */
> -       } else {
> -               if (rq_lockp(rq1) < rq_lockp(rq2)) {
> -                       raw_spin_rq_lock(rq1);
> -                       raw_spin_rq_lock_nested(rq2, SINGLE_DEPTH_NESTING);
> -               } else {
> -                       raw_spin_rq_lock(rq2);
> -                       raw_spin_rq_lock_nested(rq1, SINGLE_DEPTH_NESTING);
> -               }
> -       }
> -}
> -
> -/*
>   * double_rq_unlock - safely unlock two runqueues
>   *
>   * Note this does not restore interrupts like task_rq_unlock,
> @@ -2368,11 +2354,11 @@ static inline void double_rq_unlock(stru
>         __releases(rq1->lock)
>         __releases(rq2->lock)
>  {
> -       raw_spin_rq_unlock(rq1);
>         if (rq_lockp(rq1) != rq_lockp(rq2))
>                 raw_spin_rq_unlock(rq2);
>         else
>                 __release(rq2->lock);
> +       raw_spin_rq_unlock(rq1);

This change seems unnecessary, as the softlockup's root cause is not
the misordered lock release.

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 17/19] sched: Inherit task cookie on fork()
  2021-04-22 12:05 ` [PATCH 17/19] sched: Inherit task cookie on fork() Peter Zijlstra
@ 2021-05-10 16:06   ` Joel Fernandes
  2021-05-10 16:22     ` Chris Hyser
  2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  1 sibling, 1 reply; 103+ messages in thread
From: Joel Fernandes @ 2021-05-10 16:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hyser,Chris, Josh Don, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Mel Gorman, LKML, Thomas Gleixner

Hi Peter,

On Thu, Apr 22, 2021 at 8:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Note that sched_core_fork() is called from under tasklist_lock, and
> not from sched_fork() earlier. This avoids a few races later.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/sched.h     |    2 ++
>  kernel/fork.c             |    3 +++
>  kernel/sched/core_sched.c |    6 ++++++
>  3 files changed, 11 insertions(+)
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2172,8 +2172,10 @@ const struct cpumask *sched_trace_rd_spa
>
>  #ifdef CONFIG_SCHED_CORE
>  extern void sched_core_free(struct task_struct *tsk);
> +extern void sched_core_fork(struct task_struct *p);
>  #else
>  static inline void sched_core_free(struct task_struct *tsk) { }
> +static inline void sched_core_fork(struct task_struct *p) { }
>  #endif
>
>  #endif
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2249,6 +2249,8 @@ static __latent_entropy struct task_stru
>
>         klp_copy_process(p);
>
> +       sched_core_fork(p);
> +
>         spin_lock(&current->sighand->siglock);
>
>         /*
> @@ -2336,6 +2338,7 @@ static __latent_entropy struct task_stru
>         return p;
>
>  bad_fork_cancel_cgroup:
> +       sched_core_free(p);
>         spin_unlock(&current->sighand->siglock);
>         write_unlock_irq(&tasklist_lock);
>         cgroup_cancel_fork(p, args);
> --- a/kernel/sched/core_sched.c
> +++ b/kernel/sched/core_sched.c
> @@ -100,6 +100,12 @@ static unsigned long sched_core_clone_co
>         return cookie;
>  }
>
> +void sched_core_fork(struct task_struct *p)
> +{
> +       RB_CLEAR_NODE(&p->core_node);
> +       p->core_cookie = sched_core_clone_cookie(current);

Does this make sense also for !CLONE_THREAD forks?

With earlier versions of core scheduling, we have done the following
on ChromeOS. Basically, if it is a "thread clone", share the cookie,
since threads share the process's address space. Otherwise, set the
cookie to a new unique one so that the new process does not initially
share a core with its parent (since their address spaces will differ).

Example pseudo-code in sched_fork():

        if (current->core_cookie && (clone_flags & CLONE_THREAD)) {
                p->core_cookie = clone_cookie(current);
        } else {
                p->core_cookie = create_new_cookie();
        }

In your version though, it looks like the cookie is always cloned,
whether it is a CLONE_THREAD clone or not. Is that correct? I feel
that's a security issue.

-Joel

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 00/19] sched: Core Scheduling
  2021-04-22 12:04 [PATCH 00/19] sched: Core Scheduling Peter Zijlstra
                   ` (21 preceding siblings ...)
  2021-05-07 18:02 ` Joel Fernandes
@ 2021-05-10 16:16 ` Vincent Guittot
  2021-05-11  7:00   ` Vincent Guittot
  22 siblings, 1 reply; 103+ messages in thread
From: Vincent Guittot @ 2021-05-10 16:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes, Hyser,Chris, Josh Don, Ingo Molnar,
	Valentin Schneider, Mel Gorman, linux-kernel, Thomas Gleixner

Hi Peter,

On Thu, 22 Apr 2021 at 14:36, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Hai,
>
> This is an agressive fold of all the core-scheduling work so far. I've stripped
> a whole bunch of tags along the way (hopefully not too many, please yell if you
> feel I made a mistake), including tested-by. Please retest.
>
> Changes since the last partial post is dropping all the cgroup stuff and
> PR_SCHED_CORE_CLEAR as well as that exec() behaviour in order to later resolve
> the cgroup issue.
>
> Since we're really rather late for the coming merge window, my plan was to
> merge the lot right after the merge window.
>
> Again, please test.

I have run various tests with your sched/core-sched branch on my arm64
machine: 2 NUMA nodes * 28 cores * SMT4

List of tests:
perf sched pipe
hackbench process and threads with 1,4,16,32,64,128,256 groups
tbench with 1,4,16,32,64,128,256 groups
Unixbench shell and exec
reaim

core-sched was compiled but not used: CONFIG_SCHED_CORE=y

I haven't seen any regressions or problems; it's a mix of +/- 1-2%. I
have only seen one significant improvement, for hackbench -g 1 (+4.92%),
that I can't really explain. Will dig a bit more into this later.

So Tested-by: Vincent Guittot <vincent.guitto@linaro.org>



>
> These patches should shortly be available in my queue.git.
>
> ---
>  b/kernel/sched/core_sched.c                     |  229 ++++++
>  b/tools/testing/selftests/sched/.gitignore      |    1
>  b/tools/testing/selftests/sched/Makefile        |   14
>  b/tools/testing/selftests/sched/config          |    1
>  b/tools/testing/selftests/sched/cs_prctl_test.c |  338 +++++++++
>  include/linux/sched.h                           |   19
>  include/uapi/linux/prctl.h                      |    8
>  kernel/Kconfig.preempt                          |    6
>  kernel/fork.c                                   |    4
>  kernel/sched/Makefile                           |    1
>  kernel/sched/core.c                             |  858 ++++++++++++++++++++++--
>  kernel/sched/cpuacct.c                          |   12
>  kernel/sched/deadline.c                         |   38 -
>  kernel/sched/debug.c                            |    4
>  kernel/sched/fair.c                             |  276 +++++--
>  kernel/sched/idle.c                             |   13
>  kernel/sched/pelt.h                             |    2
>  kernel/sched/rt.c                               |   31
>  kernel/sched/sched.h                            |  393 ++++++++--
>  kernel/sched/stop_task.c                        |   14
>  kernel/sched/topology.c                         |    4
>  kernel/sys.c                                    |    5
>  tools/include/uapi/linux/prctl.h                |    8
>  23 files changed, 2057 insertions(+), 222 deletions(-)
>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 17/19] sched: Inherit task cookie on fork()
  2021-05-10 16:06   ` Joel Fernandes
@ 2021-05-10 16:22     ` Chris Hyser
  2021-05-10 20:47       ` Joel Fernandes
  0 siblings, 1 reply; 103+ messages in thread
From: Chris Hyser @ 2021-05-10 16:22 UTC (permalink / raw)
  To: Joel Fernandes, Peter Zijlstra
  Cc: Josh Don, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Mel Gorman, LKML, Thomas Gleixner

On 5/10/21 12:06 PM, Joel Fernandes wrote:
> Hi Peter,
> 
> On Thu, Apr 22, 2021 at 8:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> Note that sched_core_fork() is called from under tasklist_lock, and
>> not from sched_fork() earlier. This avoids a few races later.
>>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> ---
>>   include/linux/sched.h     |    2 ++
>>   kernel/fork.c             |    3 +++
>>   kernel/sched/core_sched.c |    6 ++++++
>>   3 files changed, 11 insertions(+)
>>
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -2172,8 +2172,10 @@ const struct cpumask *sched_trace_rd_spa
>>
>>   #ifdef CONFIG_SCHED_CORE
>>   extern void sched_core_free(struct task_struct *tsk);
>> +extern void sched_core_fork(struct task_struct *p);
>>   #else
>>   static inline void sched_core_free(struct task_struct *tsk) { }
>> +static inline void sched_core_fork(struct task_struct *p) { }
>>   #endif
>>
>>   #endif
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -2249,6 +2249,8 @@ static __latent_entropy struct task_stru
>>
>>          klp_copy_process(p);
>>
>> +       sched_core_fork(p);
>> +
>>          spin_lock(&current->sighand->siglock);
>>
>>          /*
>> @@ -2336,6 +2338,7 @@ static __latent_entropy struct task_stru
>>          return p;
>>
>>   bad_fork_cancel_cgroup:
>> +       sched_core_free(p);
>>          spin_unlock(&current->sighand->siglock);
>>          write_unlock_irq(&tasklist_lock);
>>          cgroup_cancel_fork(p, args);
>> --- a/kernel/sched/core_sched.c
>> +++ b/kernel/sched/core_sched.c
>> @@ -100,6 +100,12 @@ static unsigned long sched_core_clone_co
>>          return cookie;
>>   }
>>
>> +void sched_core_fork(struct task_struct *p)
>> +{
>> +       RB_CLEAR_NODE(&p->core_node);
>> +       p->core_cookie = sched_core_clone_cookie(current);
> 
> Does this make sense also for !CLONE_THREAD forks?

Yes. Given the absence of a cgroup interface, fork inheritance (clone the cookie) is the best way to create shared 
cookie hierarchies. The security issue you mentioned was handled in my original code by setting a unique cookie on 
'exec', but Peter took that out for the reason mentioned above. It was part of the "lets get this in compromise" effort.

-chrish

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 17/19] sched: Inherit task cookie on fork()
  2021-05-10 16:22     ` Chris Hyser
@ 2021-05-10 20:47       ` Joel Fernandes
  2021-05-10 21:38         ` Chris Hyser
  0 siblings, 1 reply; 103+ messages in thread
From: Joel Fernandes @ 2021-05-10 20:47 UTC (permalink / raw)
  To: Chris Hyser
  Cc: Peter Zijlstra, Josh Don, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Mel Gorman, LKML, Thomas Gleixner

On Mon, May 10, 2021 at 12:23 PM Chris Hyser <chris.hyser@oracle.com> wrote:
>
> On 5/10/21 12:06 PM, Joel Fernandes wrote:
> > Hi Peter,
> >
> > On Thu, Apr 22, 2021 at 8:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >>
> >> Note that sched_core_fork() is called from under tasklist_lock, and
> >> not from sched_fork() earlier. This avoids a few races later.
> >>
> >> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> >> ---
> >>   include/linux/sched.h     |    2 ++
> >>   kernel/fork.c             |    3 +++
> >>   kernel/sched/core_sched.c |    6 ++++++
> >>   3 files changed, 11 insertions(+)
> >>
> >> --- a/include/linux/sched.h
> >> +++ b/include/linux/sched.h
> >> @@ -2172,8 +2172,10 @@ const struct cpumask *sched_trace_rd_spa
> >>
> >>   #ifdef CONFIG_SCHED_CORE
> >>   extern void sched_core_free(struct task_struct *tsk);
> >> +extern void sched_core_fork(struct task_struct *p);
> >>   #else
> >>   static inline void sched_core_free(struct task_struct *tsk) { }
> >> +static inline void sched_core_fork(struct task_struct *p) { }
> >>   #endif
> >>
> >>   #endif
> >> --- a/kernel/fork.c
> >> +++ b/kernel/fork.c
> >> @@ -2249,6 +2249,8 @@ static __latent_entropy struct task_stru
> >>
> >>          klp_copy_process(p);
> >>
> >> +       sched_core_fork(p);
> >> +
> >>          spin_lock(&current->sighand->siglock);
> >>
> >>          /*
> >> @@ -2336,6 +2338,7 @@ static __latent_entropy struct task_stru
> >>          return p;
> >>
> >>   bad_fork_cancel_cgroup:
> >> +       sched_core_free(p);
> >>          spin_unlock(&current->sighand->siglock);
> >>          write_unlock_irq(&tasklist_lock);
> >>          cgroup_cancel_fork(p, args);
> >> --- a/kernel/sched/core_sched.c
> >> +++ b/kernel/sched/core_sched.c
> >> @@ -100,6 +100,12 @@ static unsigned long sched_core_clone_co
> >>          return cookie;
> >>   }
> >>
> >> +void sched_core_fork(struct task_struct *p)
> >> +{
> >> +       RB_CLEAR_NODE(&p->core_node);
> >> +       p->core_cookie = sched_core_clone_cookie(current);
> >
> > Does this make sense also for !CLONE_THREAD forks?
>
> Yes. Given the absence of a cgroup interface, fork inheritance (clone the cookie) is the best way to create shared
> cookie hierarchies. The security issue you mentioned was handled in my original code by setting a unique cookie on
> 'exec', but Peter took that out for the reason mentioned above. It was part of the "lets get this in compromise" effort.

Thanks for sharing the history of it. I guess one can argue that this
policy is better left to userspace, since core-scheduling can be used
for non-security use cases as well. Maybe one could simply call
prctl(2) from userspace, if they so desire, before calling exec()?
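
Something like the following untested sketch, say; it assumes the
PR_SCHED_CORE constants from patch 18, mirrors the kernel's pid_type
value locally the way the selftest in patch 19 does, and the wrapper
name is made up:

#include <sys/prctl.h>
#include <unistd.h>
#include <err.h>

#ifndef PR_SCHED_CORE
#define PR_SCHED_CORE		62
#define PR_SCHED_CORE_CREATE	1	/* create unique core_sched cookie */
#endif
#define PIDTYPE_PID		0	/* mirrors kernel enum pid_type */

/* Hypothetical helper: give ourselves a fresh cookie, then exec, so the
 * new program no longer shares a core with the parent's cookie group. */
static void exec_isolated(char *const argv[])
{
	if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0, PIDTYPE_PID, 0))
		err(1, "PR_SCHED_CORE_CREATE");
	execvp(argv[0], argv);
	err(1, "execvp");
}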

- Joel

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 17/19] sched: Inherit task cookie on fork()
  2021-05-10 20:47       ` Joel Fernandes
@ 2021-05-10 21:38         ` Chris Hyser
  2021-05-12  9:05           ` Peter Zijlstra
  0 siblings, 1 reply; 103+ messages in thread
From: Chris Hyser @ 2021-05-10 21:38 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Peter Zijlstra, Josh Don, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Mel Gorman, LKML, Thomas Gleixner

On 5/10/21 4:47 PM, Joel Fernandes wrote:
> On Mon, May 10, 2021 at 12:23 PM Chris Hyser <chris.hyser@oracle.com> wrote:

>>>> +void sched_core_fork(struct task_struct *p)
>>>> +{
>>>> +       RB_CLEAR_NODE(&p->core_node);
>>>> +       p->core_cookie = sched_core_clone_cookie(current);
>>>
>>> Does this make sense also for !CLONE_THREAD forks?
>>
>> Yes. Given the absence of a cgroup interface, fork inheritance (clone the cookie) is the best way to create shared
>> cookie hierarchies. The security issue you mentioned was handled in my original code by setting a unique cookie on
>> 'exec', but Peter took that out for the reason mentioned above. It was part of the "lets get this in compromise" effort.
> 
> Thanks for sharing the history of it. I guess one can argue that this
> policy is better left to userspace, since core-scheduling can be used
> for non-security use cases as well. Maybe one could simply call
> prctl(2) from userspace, if they so desire, before calling exec()?

I think the defining use case is a container's init. If the container creator sets the cookie for it, then without
any other user code knowing about core_sched, every descendant spawned will have the same cookie and be in the same
core_sched group, much like the cgroup interface had provided. If we create a unique cookie in the kernel on either
fork or exec, we are secure, but we will now have thousands of core sched groups.
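
As a rough, untested sketch of that flow (constants mirrored from the prctl patch; the helper name is invented), the
creator tags the soon-to-be init once and fork inheritance does the rest:

#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <err.h>

#ifndef PR_SCHED_CORE
#define PR_SCHED_CORE		62
#define PR_SCHED_CORE_CREATE	1
#endif
#define PIDTYPE_PID		0	/* mirrors kernel enum pid_type */

/* Hypothetical container-manager helper: the child creates its own unique
 * cookie before exec'ing init; every task it subsequently fork()s inherits
 * that cookie without ever hearing about core_sched. */
static pid_t spawn_container_init(char *const argv[])
{
	pid_t pid = fork();

	if (pid == 0) {
		if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0, PIDTYPE_PID, 0))
			err(1, "PR_SCHED_CORE_CREATE");
		execv(argv[0], argv);
		err(1, "execv");
	}
	return pid;
}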

CLEAR was also removed (temporarily, I hope) because a core_sched-knowledgeable program in the example core_sched
container group should not be able to remove itself from _all_ core sched groups. It can modify its cookie, but that is
no different from the normal case.

Both of these beg for a kernel policy, but that discussion was TBD.

-chrish

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 00/19] sched: Core Scheduling
  2021-05-10 16:16 ` Vincent Guittot
@ 2021-05-11  7:00   ` Vincent Guittot
  0 siblings, 0 replies; 103+ messages in thread
From: Vincent Guittot @ 2021-05-11  7:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes, Hyser,Chris, Josh Don, Ingo Molnar,
	Valentin Schneider, Mel Gorman, linux-kernel, Thomas Gleixner

On Mon, 10 May 2021 at 18:16, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> Hi Peter,
>
> On Thu, 22 Apr 2021 at 14:36, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Hai,
> >
> > This is an agressive fold of all the core-scheduling work so far. I've stripped
> > a whole bunch of tags along the way (hopefully not too many, please yell if you
> > feel I made a mistake), including tested-by. Please retest.
> >
> > Changes since the last partial post is dropping all the cgroup stuff and
> > PR_SCHED_CORE_CLEAR as well as that exec() behaviour in order to later resolve
> > the cgroup issue.
> >
> > Since we're really rather late for the coming merge window, my plan was to
> > merge the lot right after the merge window.
> >
> > Again, please test.
>
> I have run various tests with your sched/core-sched branch on my arm64
> machine: 2 NUMA nodes * 28 cores * SMT4
>
> List of tests:
> perf sched pipe
> hackbench process and threads with 1,4,16,32,64,128,256 groups
> tbench with 1,4,16,32,64,128,256 groups
> Unixbench shell and exec
> reaim
>
> core-sched was compiled but not used: CONFIG_SCHED_CORE=y
>
> I haven't seen any regressions or problems; it's a mix of +/- 1-2%. I
> have only seen one significant improvement, for hackbench -g 1 (+4.92%),
> that I can't really explain. Will dig a bit more into this later.
>
> So Tested-by: Vincent Guittot <vincent.guitto@linaro.org>

Tested-by: Vincent Guittot <vincent.guittot@linaro.org>

With the correct email address

>
>
>
> >
> > These patches should shortly be available in my queue.git.
> >
> > ---
> >  b/kernel/sched/core_sched.c                     |  229 ++++++
> >  b/tools/testing/selftests/sched/.gitignore      |    1
> >  b/tools/testing/selftests/sched/Makefile        |   14
> >  b/tools/testing/selftests/sched/config          |    1
> >  b/tools/testing/selftests/sched/cs_prctl_test.c |  338 +++++++++
> >  include/linux/sched.h                           |   19
> >  include/uapi/linux/prctl.h                      |    8
> >  kernel/Kconfig.preempt                          |    6
> >  kernel/fork.c                                   |    4
> >  kernel/sched/Makefile                           |    1
> >  kernel/sched/core.c                             |  858 ++++++++++++++++++++++--
> >  kernel/sched/cpuacct.c                          |   12
> >  kernel/sched/deadline.c                         |   38 -
> >  kernel/sched/debug.c                            |    4
> >  kernel/sched/fair.c                             |  276 +++++--
> >  kernel/sched/idle.c                             |   13
> >  kernel/sched/pelt.h                             |    2
> >  kernel/sched/rt.c                               |   31
> >  kernel/sched/sched.h                            |  393 ++++++++--
> >  kernel/sched/stop_task.c                        |   14
> >  kernel/sched/topology.c                         |    4
> >  kernel/sys.c                                    |    5
> >  tools/include/uapi/linux/prctl.h                |    8
> >  23 files changed, 2057 insertions(+), 222 deletions(-)
> >

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 17/19] sched: Inherit task cookie on fork()
  2021-05-10 21:38         ` Chris Hyser
@ 2021-05-12  9:05           ` Peter Zijlstra
  2021-05-12 20:20             ` Josh Don
  0 siblings, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2021-05-12  9:05 UTC (permalink / raw)
  To: Chris Hyser
  Cc: Joel Fernandes, Josh Don, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Mel Gorman, LKML, Thomas Gleixner

On Mon, May 10, 2021 at 05:38:18PM -0400, Chris Hyser wrote:
> On 5/10/21 4:47 PM, Joel Fernandes wrote:
> > On Mon, May 10, 2021 at 12:23 PM Chris Hyser <chris.hyser@oracle.com> wrote:
> 
> > > > > +void sched_core_fork(struct task_struct *p)
> > > > > +{
> > > > > +       RB_CLEAR_NODE(&p->core_node);
> > > > > +       p->core_cookie = sched_core_clone_cookie(current);
> > > > 
> > > > Does this make sense also for !CLONE_THREAD forks?
> > > 
> > > Yes. Given the absence of a cgroup interface, fork inheritance
> > > (clone the cookie) is the best way to create shared cookie
> > > hierarchies. The security issue you mentioned was handled in my
> > > original code by setting a unique cookie on 'exec', but Peter took
> > > that out for the reason mentioned above. It was part of the "lets
> > > get this in compromise" effort.

Right, and not only that: all this is moot when parent and child have
the same PTRACE permissions, since then they can inspect one another's
innards anyway; exec()/fork() just fundamentally isn't a magical
boundary we should care about.

The only special case there is SUID exec(), because in that case the
actual credentials change and the PTRACE permissions do actually change.

I sorta had a patch to do that, but it's yuck, because that cred change
happens after the point of no return and we need an allocation for the
new cookie. Now, we could rely on the fact that a task-context
allocation (GFP_KERNEL) for something as small as our cookie will never
fail, and hence not be bothered by it; otherwise we would have to do
the error path there, and yuck.

> > Thanks for sharing the history of it. I guess one can argue that this
> > policy is better left to userspace, since core-scheduling can be used
> > for non-security use cases as well. Maybe one could simply call
> > prctl(2) from userspace, if they so desire, before calling exec()?
> 
> I think the defining use case is a container's init. If the container
> creator sets the cookie for it, then without any other user code knowing
> about core_sched, every descendant spawned will have the same cookie and be
> in the same core_sched group, much like the cgroup interface had provided.
> If we create a unique cookie in the kernel on either fork or exec, we are
> secure, but we will now have thousands of core sched groups.
> 
> CLEAR was also removed (temporarily, I hope) because a
> core_sched-knowledgeable program in the example core_sched container group
> should not be able to remove itself from _all_ core sched groups. It can
> modify its cookie, but that is no different from the normal case.

Note that much of what CLEAR did is possible by using SHARE_FROM on
your parent to reset the cookie.
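
For instance, as an untested fragment (PR_SCHED_CORE_* and PIDTYPE_PID
mirrored in userspace, as the selftest does), a task can fall back to
whatever its parent has, which may well be no cookie at all:

	/* Adopt the parent's cookie (possibly 0, i.e. unconstrained); with
	 * PR_SCHED_CORE_CLEAR gone, this is the closest thing to a "clear". */
	if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, getppid(), PIDTYPE_PID, 0))
		perror("PR_SCHED_CORE_SHARE_FROM");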

> Both of these beg for a kernel policy, but that discussion was TBD.

Right, I need a Champion that actually cares about cgroups and has
use-cases to go argue with TJ on this. I've proposed code that I think
has sane semantics, but I'm not in a position to argue for it, given I
think a world without cgroups is a better world :-)))

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v2 04/19] sched: Prepare for Core-wide rq->lock
  2021-05-08  8:07     ` Aubrey Li
@ 2021-05-12  9:07       ` Peter Zijlstra
  0 siblings, 0 replies; 103+ messages in thread
From: Peter Zijlstra @ 2021-05-12  9:07 UTC (permalink / raw)
  To: Aubrey Li
  Cc: Joel Fernandes, Hyser,Chris, Josh Don, Ingo Molnar,
	Vincent Guittot, Valentin Schneider, Mel Gorman,
	Linux List Kernel Mailing, Thomas Gleixner

On Sat, May 08, 2021 at 04:07:35PM +0800, Aubrey Li wrote:
> > +#ifdef CONFIG_SMP
> > +/*
> > + * double_rq_lock - safely lock two runqueues
> > + */
> > +void double_rq_lock(struct rq *rq1, struct rq *rq2)
> 
> Do we need the static lock checking here?
>         __acquires(rq1->lock)
>         __acquires(rq2->lock)
> 
> > +{
> > +       lockdep_assert_irqs_disabled();
> > +
> > +       if (rq_order_less(rq2, rq1))
> > +               swap(rq1, rq2);
> > +
> > +       raw_spin_rq_lock(rq1);
> > +       if (rq_lockp(rq1) == rq_lockp(rq2)) {
> 
> And here?
>                 __acquire(rq2->lock);
> 
> > +               return;
> > +       }
> > +
> > +       raw_spin_rq_lock_nested(rq2, SINGLE_DEPTH_NESTING);
> > +}
> > +#endif

I'd as soon rip out all that sparse annotation crud; I don't think I've
ever had any benefit from it.
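
(For the record, the suggested annotations would look roughly like the
sketch below; they are consumed only by sparse's context checking and
are no-ops at runtime. Sketch only, not a proposed patch.)

void double_rq_lock(struct rq *rq1, struct rq *rq2)
	__acquires(rq1->lock)
	__acquires(rq2->lock)		/* sparse: both locks held on return */
{
	lockdep_assert_irqs_disabled();

	if (rq_order_less(rq2, rq1))
		swap(rq1, rq2);

	raw_spin_rq_lock(rq1);
	if (rq_lockp(rq1) == rq_lockp(rq2)) {
		__acquire(rq2->lock);	/* fake it out, as the old double_rq_lock() did */
		return;
	}

	raw_spin_rq_lock_nested(rq2, SINGLE_DEPTH_NESTING);
}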


> > @@ -2368,11 +2354,11 @@ static inline void double_rq_unlock(stru
> >         __releases(rq1->lock)
> >         __releases(rq2->lock)
> >  {
> > -       raw_spin_rq_unlock(rq1);
> >         if (rq_lockp(rq1) != rq_lockp(rq2))
> >                 raw_spin_rq_unlock(rq2);
> >         else
> >                 __release(rq2->lock);
> > +       raw_spin_rq_unlock(rq1);
> 
> This change seems unnecessary, as the softlockup's root cause is not
> the misordered lock release.

No, it really is needed; rq_lockp() is not stable if we don't hold a
lock.
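
Concretely, as an annotated restatement of the hunk above rather than
new code: the comparison has to happen while rq1 is still held, because
holding any of the core's rq locks is what keeps rq_lockp() stable:

static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
	__releases(rq1->lock)
	__releases(rq2->lock)
{
	/*
	 * Do the rq_lockp() comparison before dropping rq1: as long as rq1
	 * is held the core-sched state cannot flip under us, so rq_lockp()
	 * gives the same answer it gave at lock time.  Unlocking rq1 first,
	 * as the old code did, would make this comparison racy.
	 */
	if (rq_lockp(rq1) != rq_lockp(rq2))
		raw_spin_rq_unlock(rq2);
	else
		__release(rq2->lock);
	raw_spin_rq_unlock(rq1);
}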

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [tip: sched/core] sched: prctl() core-scheduling interface
  2021-04-22 12:05 ` [PATCH 18/19] sched: prctl() core-scheduling interface Peter Zijlstra
@ 2021-05-12 10:28   ` tip-bot2 for Chris Hyser
  2021-06-14 23:36   ` [PATCH 18/19] " Josh Don
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 103+ messages in thread
From: tip-bot2 for Chris Hyser @ 2021-05-12 10:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Chris Hyser, Peter Zijlstra (Intel),
	Don Hiatt, Hongyu Ning, Vincent Guittot, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     7ac592aa35a684ff1858fb9ec282886b9e3575ac
Gitweb:        https://git.kernel.org/tip/7ac592aa35a684ff1858fb9ec282886b9e3575ac
Author:        Chris Hyser <chris.hyser@oracle.com>
AuthorDate:    Wed, 24 Mar 2021 17:40:15 -04:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 12 May 2021 11:43:31 +02:00

sched: prctl() core-scheduling interface

This patch provides support for setting and copying core scheduling
'task cookies' between threads (PID), processes (TGID), and process
groups (PGID).

The value of core scheduling isn't that tasks don't share a core,
'nosmt' can do that. The value lies in exploiting all the sharing
opportunities that exist to recover possible lost performance and that
requires a degree of flexibility in the API.

From a security perspective (and there are others), the thread,
process and process group distinction is an existing hierarchical
categorization of tasks that reflects many of the security concerns
about 'data sharing'. For example, protecting against cache-snooping
by a thread that can just read the memory directly isn't all that
useful.

With this in mind, subcommands to CREATE/SHARE (TO/FROM) provide a
mechanism to create and share cookies. CREATE/SHARE_TO specify a
target pid with enum pidtype used to specify the scope of the targeted
tasks. For example, PIDTYPE_TGID will share the cookie with the
process and all of its threads, as typically desired in a security
scenario.

API:

  prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, tgtpid, pidtype, &cookie)
  prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, tgtpid, pidtype, NULL)
  prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, tgtpid, pidtype, NULL)
  prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, srcpid, pidtype, NULL)

where 'tgtpid/srcpid == 0' implies the current process and pidtype is
kernel enum pid_type {PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, ...}.

For return values, EINVAL, ENOMEM are what they say. ESRCH means the
tgtpid/srcpid was not found. EPERM indicates lack of PTRACE permission
access to tgtpid/srcpid. ENODEV indicates your machine lacks SMT.
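
As an illustrative fragment (not part of this patch; PIDTYPE_PID
mirrored in userspace), querying one's own cookie id looks like:

	/* uaddr must be 8-byte aligned (uaddr & 7 is rejected); the value
	 * written back is an opaque hashed id, not a kernel pointer. */
	unsigned long long cookie;

	if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, 0 /* self */, PIDTYPE_PID,
		  (unsigned long)&cookie))
		perror("PR_SCHED_CORE_GET");
	else
		printf("core sched cookie id: %llx\n", cookie);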

[peterz: complete rewrite]
Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123309.039845339@infradead.org
---
 include/linux/sched.h            |   2 +-
 include/uapi/linux/prctl.h       |   8 ++-
 kernel/sched/core_sched.c        | 114 ++++++++++++++++++++++++++++++-
 kernel/sys.c                     |   5 +-
 tools/include/uapi/linux/prctl.h |   8 ++-
 5 files changed, 137 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index fba47e5..c7e7d50 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2182,6 +2182,8 @@ const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 #ifdef CONFIG_SCHED_CORE
 extern void sched_core_free(struct task_struct *tsk);
 extern void sched_core_fork(struct task_struct *p);
+extern int sched_core_share_pid(unsigned int cmd, pid_t pid, enum pid_type type,
+				unsigned long uaddr);
 #else
 static inline void sched_core_free(struct task_struct *tsk) { }
 static inline void sched_core_fork(struct task_struct *p) { }
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 18a9f59..967d9c5 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -259,4 +259,12 @@ struct prctl_mm_map {
 #define PR_PAC_SET_ENABLED_KEYS		60
 #define PR_PAC_GET_ENABLED_KEYS		61
 
+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE			62
+# define PR_SCHED_CORE_GET		0
+# define PR_SCHED_CORE_CREATE		1 /* create unique core_sched cookie */
+# define PR_SCHED_CORE_SHARE_TO		2 /* push core_sched cookie to pid */
+# define PR_SCHED_CORE_SHARE_FROM	3 /* pull core_sched cookie to pid */
+# define PR_SCHED_CORE_MAX		4
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c
index dcbbeae..9a80e9a 100644
--- a/kernel/sched/core_sched.c
+++ b/kernel/sched/core_sched.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0-only
 
+#include <linux/prctl.h>
 #include "sched.h"
 
 /*
@@ -113,3 +114,116 @@ void sched_core_free(struct task_struct *p)
 {
 	sched_core_put_cookie(p->core_cookie);
 }
+
+static void __sched_core_set(struct task_struct *p, unsigned long cookie)
+{
+	cookie = sched_core_get_cookie(cookie);
+	cookie = sched_core_update_cookie(p, cookie);
+	sched_core_put_cookie(cookie);
+}
+
+/* Called from prctl interface: PR_SCHED_CORE */
+int sched_core_share_pid(unsigned int cmd, pid_t pid, enum pid_type type,
+			 unsigned long uaddr)
+{
+	unsigned long cookie = 0, id = 0;
+	struct task_struct *task, *p;
+	struct pid *grp;
+	int err = 0;
+
+	if (!static_branch_likely(&sched_smt_present))
+		return -ENODEV;
+
+	if (type > PIDTYPE_PGID || cmd >= PR_SCHED_CORE_MAX || pid < 0 ||
+	    (cmd != PR_SCHED_CORE_GET && uaddr))
+		return -EINVAL;
+
+	rcu_read_lock();
+	if (pid == 0) {
+		task = current;
+	} else {
+		task = find_task_by_vpid(pid);
+		if (!task) {
+			rcu_read_unlock();
+			return -ESRCH;
+		}
+	}
+	get_task_struct(task);
+	rcu_read_unlock();
+
+	/*
+	 * Check if this process has the right to modify the specified
+	 * process. Use the regular "ptrace_may_access()" checks.
+	 */
+	if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
+		err = -EPERM;
+		goto out;
+	}
+
+	switch (cmd) {
+	case PR_SCHED_CORE_GET:
+		if (type != PIDTYPE_PID || uaddr & 7) {
+			err = -EINVAL;
+			goto out;
+		}
+		cookie = sched_core_clone_cookie(task);
+		if (cookie) {
+			/* XXX improve ? */
+			ptr_to_hashval((void *)cookie, &id);
+		}
+		err = put_user(id, (u64 __user *)uaddr);
+		goto out;
+
+	case PR_SCHED_CORE_CREATE:
+		cookie = sched_core_alloc_cookie();
+		if (!cookie) {
+			err = -ENOMEM;
+			goto out;
+		}
+		break;
+
+	case PR_SCHED_CORE_SHARE_TO:
+		cookie = sched_core_clone_cookie(current);
+		break;
+
+	case PR_SCHED_CORE_SHARE_FROM:
+		if (type != PIDTYPE_PID) {
+			err = -EINVAL;
+			goto out;
+		}
+		cookie = sched_core_clone_cookie(task);
+		__sched_core_set(current, cookie);
+		goto out;
+
+	default:
+		err = -EINVAL;
+		goto out;
+	};
+
+	if (type == PIDTYPE_PID) {
+		__sched_core_set(task, cookie);
+		goto out;
+	}
+
+	read_lock(&tasklist_lock);
+	grp = task_pid_type(task, type);
+
+	do_each_pid_thread(grp, type, p) {
+		if (!ptrace_may_access(p, PTRACE_MODE_READ_REALCREDS)) {
+			err = -EPERM;
+			goto out_tasklist;
+		}
+	} while_each_pid_thread(grp, type, p);
+
+	do_each_pid_thread(grp, type, p) {
+		__sched_core_set(p, cookie);
+	} while_each_pid_thread(grp, type, p);
+out_tasklist:
+	read_unlock(&tasklist_lock);
+
+out:
+	sched_core_put_cookie(cookie);
+	put_task_struct(task);
+	return err;
+}
+
diff --git a/kernel/sys.c b/kernel/sys.c
index 3a583a2..9de46a4 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2550,6 +2550,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		error = set_syscall_user_dispatch(arg2, arg3, arg4,
 						  (char __user *) arg5);
 		break;
+#ifdef CONFIG_SCHED_CORE
+	case PR_SCHED_CORE:
+		error = sched_core_share_pid(arg2, arg3, arg4, arg5);
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;
diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
index 18a9f59..967d9c5 100644
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -259,4 +259,12 @@ struct prctl_mm_map {
 #define PR_PAC_SET_ENABLED_KEYS		60
 #define PR_PAC_GET_ENABLED_KEYS		61
 
+/* Request the scheduler to share a core */
+#define PR_SCHED_CORE			62
+# define PR_SCHED_CORE_GET		0
+# define PR_SCHED_CORE_CREATE		1 /* create unique core_sched cookie */
+# define PR_SCHED_CORE_SHARE_TO		2 /* push core_sched cookie to pid */
+# define PR_SCHED_CORE_SHARE_FROM	3 /* pull core_sched cookie to pid */
+# define PR_SCHED_CORE_MAX		4
+
 #endif /* _LINUX_PRCTL_H */

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [tip: sched/core] kselftest: Add test for core sched prctl interface
  2021-04-22 12:05 ` [PATCH 19/19] kselftest: Add test for core sched prctl interface Peter Zijlstra
@ 2021-05-12 10:28   ` tip-bot2 for Chris Hyser
  0 siblings, 0 replies; 103+ messages in thread
From: tip-bot2 for Chris Hyser @ 2021-05-12 10:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Chris Hyser, Peter Zijlstra (Intel),
	Don Hiatt, Hongyu Ning, Vincent Guittot, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     9f26990074931bbf797373e53104216059b300b1
Gitweb:        https://git.kernel.org/tip/9f26990074931bbf797373e53104216059b300b1
Author:        Chris Hyser <chris.hyser@oracle.com>
AuthorDate:    Wed, 24 Mar 2021 17:40:16 -04:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 12 May 2021 11:43:32 +02:00

kselftest: Add test for core sched prctl interface

Provides a selftest and examples of using the interface.

[peterz: updated to not use sched_debug]
Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123309.100860030@infradead.org
---
 tools/testing/selftests/sched/.gitignore      |   1 +-
 tools/testing/selftests/sched/Makefile        |  14 +-
 tools/testing/selftests/sched/config          |   1 +-
 tools/testing/selftests/sched/cs_prctl_test.c | 338 +++++++++++++++++-
 4 files changed, 354 insertions(+)
 create mode 100644 tools/testing/selftests/sched/.gitignore
 create mode 100644 tools/testing/selftests/sched/Makefile
 create mode 100644 tools/testing/selftests/sched/config
 create mode 100644 tools/testing/selftests/sched/cs_prctl_test.c

diff --git a/tools/testing/selftests/sched/.gitignore b/tools/testing/selftests/sched/.gitignore
new file mode 100644
index 0000000..6996d46
--- /dev/null
+++ b/tools/testing/selftests/sched/.gitignore
@@ -0,0 +1 @@
+cs_prctl_test
diff --git a/tools/testing/selftests/sched/Makefile b/tools/testing/selftests/sched/Makefile
new file mode 100644
index 0000000..10c72f1
--- /dev/null
+++ b/tools/testing/selftests/sched/Makefile
@@ -0,0 +1,14 @@
+# SPDX-License-Identifier: GPL-2.0+
+
+ifneq ($(shell $(CC) --version 2>&1 | head -n 1 | grep clang),)
+CLANG_FLAGS += -no-integrated-as
+endif
+
+CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/  -Wl,-rpath=./ \
+	  $(CLANG_FLAGS)
+LDLIBS += -lpthread
+
+TEST_GEN_FILES := cs_prctl_test
+TEST_PROGS := cs_prctl_test
+
+include ../lib.mk
diff --git a/tools/testing/selftests/sched/config b/tools/testing/selftests/sched/config
new file mode 100644
index 0000000..e8b09aa
--- /dev/null
+++ b/tools/testing/selftests/sched/config
@@ -0,0 +1 @@
+CONFIG_SCHED_DEBUG=y
diff --git a/tools/testing/selftests/sched/cs_prctl_test.c b/tools/testing/selftests/sched/cs_prctl_test.c
new file mode 100644
index 0000000..63fe652
--- /dev/null
+++ b/tools/testing/selftests/sched/cs_prctl_test.c
@@ -0,0 +1,338 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Use the core scheduling prctl() to test core scheduling cookies control.
+ *
+ * Copyright (c) 2021 Oracle and/or its affiliates.
+ * Author: Chris Hyser <chris.hyser@oracle.com>
+ *
+ *
+ * This library is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This library is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License
+ * for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public License
+ * along with this library; if not, see <http://www.gnu.org/licenses>.
+ */
+
+#define _GNU_SOURCE
+#include <sys/eventfd.h>
+#include <sys/wait.h>
+#include <sys/types.h>
+#include <sched.h>
+#include <sys/prctl.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include <time.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#if __GLIBC_PREREQ(2, 30) == 0
+#include <sys/syscall.h>
+static pid_t gettid(void)
+{
+	return syscall(SYS_gettid);
+}
+#endif
+
+#ifndef PR_SCHED_CORE
+#define PR_SCHED_CORE			62
+# define PR_SCHED_CORE_GET		0
+# define PR_SCHED_CORE_CREATE		1 /* create unique core_sched cookie */
+# define PR_SCHED_CORE_SHARE_TO		2 /* push core_sched cookie to pid */
+# define PR_SCHED_CORE_SHARE_FROM	3 /* pull core_sched cookie to pid */
+# define PR_SCHED_CORE_MAX		4
+#endif
+
+#define MAX_PROCESSES 128
+#define MAX_THREADS   128
+
+static const char USAGE[] = "cs_prctl_test [options]\n"
+"    options:\n"
+"	-P  : number of processes to create.\n"
+"	-T  : number of threads per process to create.\n"
+"	-d  : delay time to keep tasks alive.\n"
+"	-k  : keep tasks alive until keypress.\n";
+
+enum pid_type {PIDTYPE_PID = 0, PIDTYPE_TGID, PIDTYPE_PGID};
+
+const int THREAD_CLONE_FLAGS = CLONE_THREAD | CLONE_SIGHAND | CLONE_FS | CLONE_VM | CLONE_FILES;
+
+static int _prctl(int option, unsigned long arg2, unsigned long arg3, unsigned long arg4,
+		  unsigned long arg5)
+{
+	int res;
+
+	res = prctl(option, arg2, arg3, arg4, arg5);
+	printf("%d = prctl(%d, %ld, %ld, %ld, %lx)\n", res, option, (long)arg2, (long)arg3,
+	       (long)arg4, arg5);
+	return res;
+}
+
+#define STACK_SIZE (1024 * 1024)
+
+#define handle_error(msg) __handle_error(__FILE__, __LINE__, msg)
+static void __handle_error(char *fn, int ln, char *msg)
+{
+	printf("(%s:%d) - ", fn, ln);
+	perror(msg);
+	exit(EXIT_FAILURE);
+}
+
+static void handle_usage(int rc, char *msg)
+{
+	puts(USAGE);
+	puts(msg);
+	putchar('\n');
+	exit(rc);
+}
+
+static unsigned long get_cs_cookie(int pid)
+{
+	unsigned long long cookie;
+	int ret;
+
+	ret = prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, pid, PIDTYPE_PID,
+		    (unsigned long)&cookie);
+	if (ret) {
+		printf("Not a core sched system\n");
+		return -1UL;
+	}
+
+	return cookie;
+}
+
+struct child_args {
+	int num_threads;
+	int pfd[2];
+	int cpid;
+	int thr_tids[MAX_THREADS];
+};
+
+static int child_func_thread(void __attribute__((unused))*arg)
+{
+	while (1)
+		usleep(20000);
+	return 0;
+}
+
+static void create_threads(int num_threads, int thr_tids[])
+{
+	void *child_stack;
+	pid_t tid;
+	int i;
+
+	for (i = 0; i < num_threads; ++i) {
+		child_stack = malloc(STACK_SIZE);
+		if (!child_stack)
+			handle_error("child stack allocate");
+
+		tid = clone(child_func_thread, child_stack + STACK_SIZE, THREAD_CLONE_FLAGS, NULL);
+		if (tid == -1)
+			handle_error("clone thread");
+		thr_tids[i] = tid;
+	}
+}
+
+static int child_func_process(void *arg)
+{
+	struct child_args *ca = (struct child_args *)arg;
+
+	close(ca->pfd[0]);
+
+	create_threads(ca->num_threads, ca->thr_tids);
+
+	write(ca->pfd[1], &ca->thr_tids, sizeof(int) * ca->num_threads);
+	close(ca->pfd[1]);
+
+	while (1)
+		usleep(20000);
+	return 0;
+}
+
+static unsigned char child_func_process_stack[STACK_SIZE];
+
+void create_processes(int num_processes, int num_threads, struct child_args proc[])
+{
+	pid_t cpid;
+	int i;
+
+	for (i = 0; i < num_processes; ++i) {
+		proc[i].num_threads = num_threads;
+
+		if (pipe(proc[i].pfd) == -1)
+			handle_error("pipe() failed");
+
+		cpid = clone(child_func_process, child_func_process_stack + STACK_SIZE,
+			     SIGCHLD, &proc[i]);
+		proc[i].cpid = cpid;
+		close(proc[i].pfd[1]);
+	}
+
+	for (i = 0; i < num_processes; ++i) {
+		read(proc[i].pfd[0], &proc[i].thr_tids, sizeof(int) * proc[i].num_threads);
+		close(proc[i].pfd[0]);
+	}
+}
+
+void disp_processes(int num_processes, struct child_args proc[])
+{
+	int i, j;
+
+	printf("tid=%d, / tgid=%d / pgid=%d: %lx\n", gettid(), getpid(), getpgid(0),
+	       get_cs_cookie(getpid()));
+
+	for (i = 0; i < num_processes; ++i) {
+		printf("    tid=%d, / tgid=%d / pgid=%d: %lx\n", proc[i].cpid, proc[i].cpid,
+		       getpgid(proc[i].cpid), get_cs_cookie(proc[i].cpid));
+		for (j = 0; j < proc[i].num_threads; ++j) {
+			printf("        tid=%d, / tgid=%d / pgid=%d: %lx\n", proc[i].thr_tids[j],
+			       proc[i].cpid, getpgid(0), get_cs_cookie(proc[i].thr_tids[j]));
+		}
+	}
+	puts("\n");
+}
+
+static int errors;
+
+#define validate(v) _validate(__LINE__, v, #v)
+void _validate(int line, int val, char *msg)
+{
+	if (!val) {
+		++errors;
+		printf("(%d) FAILED: %s\n", line, msg);
+	} else {
+		printf("(%d) PASSED: %s\n", line, msg);
+	}
+}
+
+int main(int argc, char *argv[])
+{
+	struct child_args procs[MAX_PROCESSES];
+
+	int keypress = 0;
+	int num_processes = 2;
+	int num_threads = 3;
+	int delay = 0;
+	int res = 0;
+	int pidx;
+	int pid;
+	int opt;
+
+	while ((opt = getopt(argc, argv, ":hkT:P:d:")) != -1) {
+		switch (opt) {
+		case 'P':
+			num_processes = (int)strtol(optarg, NULL, 10);
+			break;
+		case 'T':
+			num_threads = (int)strtoul(optarg, NULL, 10);
+			break;
+		case 'd':
+			delay = (int)strtol(optarg, NULL, 10);
+			break;
+		case 'k':
+			keypress = 1;
+			break;
+		case 'h':
+			printf(USAGE);
+			exit(EXIT_SUCCESS);
+		default:
+			handle_usage(20, "unknown option");
+		}
+	}
+
+	if (num_processes < 1 || num_processes > MAX_PROCESSES)
+		handle_usage(1, "Bad processes value");
+
+	if (num_threads < 1 || num_threads > MAX_THREADS)
+		handle_usage(2, "Bad thread value");
+
+	if (keypress)
+		delay = -1;
+
+	srand(time(NULL));
+
+	/* put into separate process group */
+	if (setpgid(0, 0) != 0)
+		handle_error("process group");
+
+	printf("\n## Create a thread/process/process group hierarchy\n");
+	create_processes(num_processes, num_threads, procs);
+	disp_processes(num_processes, procs);
+	validate(get_cs_cookie(0) == 0);
+
+	printf("\n## Set a cookie on entire process group\n");
+	if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0, PIDTYPE_PGID, 0) < 0)
+		handle_error("core_sched create failed -- PGID");
+	disp_processes(num_processes, procs);
+
+	validate(get_cs_cookie(0) != 0);
+
+	/* get a random process pid */
+	pidx = rand() % num_processes;
+	pid = procs[pidx].cpid;
+
+	validate(get_cs_cookie(0) == get_cs_cookie(pid));
+	validate(get_cs_cookie(0) == get_cs_cookie(procs[pidx].thr_tids[0]));
+
+	printf("\n## Set a new cookie on entire process/TGID [%d]\n", pid);
+	if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, pid, PIDTYPE_TGID, 0) < 0)
+		handle_error("core_sched create failed -- TGID");
+	disp_processes(num_processes, procs);
+
+	validate(get_cs_cookie(0) != get_cs_cookie(pid));
+	validate(get_cs_cookie(pid) != 0);
+	validate(get_cs_cookie(pid) == get_cs_cookie(procs[pidx].thr_tids[0]));
+
+	printf("\n## Copy the cookie of current/PGID[%d], to pid [%d] as PIDTYPE_PID\n",
+	       getpid(), pid);
+	if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, pid, PIDTYPE_PID, 0) < 0)
+		handle_error("core_sched share to itself failed -- PID");
+	disp_processes(num_processes, procs);
+
+	validate(get_cs_cookie(0) == get_cs_cookie(pid));
+	validate(get_cs_cookie(pid) != 0);
+	validate(get_cs_cookie(pid) != get_cs_cookie(procs[pidx].thr_tids[0]));
+
+	printf("\n## Copy cookie from a thread [%d] to current/PGID [%d] as PIDTYPE_PID\n",
+	       procs[pidx].thr_tids[0], getpid());
+	if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, procs[pidx].thr_tids[0],
+		   PIDTYPE_PID, 0) < 0)
+		handle_error("core_sched share from thread failed -- PID");
+	disp_processes(num_processes, procs);
+
+	validate(get_cs_cookie(0) == get_cs_cookie(procs[pidx].thr_tids[0]));
+	validate(get_cs_cookie(pid) != get_cs_cookie(procs[pidx].thr_tids[0]));
+
+	printf("\n## Copy cookie from current [%d] to current as pidtype PGID\n", getpid());
+	if (_prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, 0, PIDTYPE_PGID, 0) < 0)
+		handle_error("core_sched share to self failed -- PGID");
+	disp_processes(num_processes, procs);
+
+	validate(get_cs_cookie(0) == get_cs_cookie(pid));
+	validate(get_cs_cookie(pid) != 0);
+	validate(get_cs_cookie(pid) == get_cs_cookie(procs[pidx].thr_tids[0]));
+
+	if (errors) {
+		printf("TESTS FAILED. errors: %d\n", errors);
+		res = 10;
+	} else {
+		printf("SUCCESS !!!\n");
+	}
+
+	if (keypress)
+		getchar();
+	else
+		sleep(delay);
+
+	for (pidx = 0; pidx < num_processes; ++pidx)
+		kill(procs[pidx].cpid, 15);
+
+	return res;
+}

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [tip: sched/core] sched: Inherit task cookie on fork()
  2021-04-22 12:05 ` [PATCH 17/19] sched: Inherit task cookie on fork() Peter Zijlstra
  2021-05-10 16:06   ` Joel Fernandes
@ 2021-05-12 10:28   ` tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 103+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2021-05-12 10:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel),
	Don Hiatt, Hongyu Ning, Vincent Guittot, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     85dd3f61203c5cfa72b308ff327b5fbf3fc1ce5e
Gitweb:        https://git.kernel.org/tip/85dd3f61203c5cfa72b308ff327b5fbf3fc1ce5e
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Mon, 29 Mar 2021 15:18:35 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 12 May 2021 11:43:31 +02:00

sched: Inherit task cookie on fork()

Note that sched_core_fork() is called from under tasklist_lock, and
not from sched_fork() earlier. This avoids a few races later.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123308.980003687@infradead.org
---
 include/linux/sched.h     | 2 ++
 kernel/fork.c             | 3 +++
 kernel/sched/core_sched.c | 6 ++++++
 3 files changed, 11 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index eab3f7c..fba47e5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2181,8 +2181,10 @@ const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
 #ifdef CONFIG_SCHED_CORE
 extern void sched_core_free(struct task_struct *tsk);
+extern void sched_core_fork(struct task_struct *p);
 #else
 static inline void sched_core_free(struct task_struct *tsk) { }
+static inline void sched_core_fork(struct task_struct *p) { }
 #endif
 
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index d16c60c..e7fd928 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2251,6 +2251,8 @@ static __latent_entropy struct task_struct *copy_process(
 
 	klp_copy_process(p);
 
+	sched_core_fork(p);
+
 	spin_lock(&current->sighand->siglock);
 
 	/*
@@ -2338,6 +2340,7 @@ static __latent_entropy struct task_struct *copy_process(
 	return p;
 
 bad_fork_cancel_cgroup:
+	sched_core_free(p);
 	spin_unlock(&current->sighand->siglock);
 	write_unlock_irq(&tasklist_lock);
 	cgroup_cancel_fork(p, args);
diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c
index 8d0869a..dcbbeae 100644
--- a/kernel/sched/core_sched.c
+++ b/kernel/sched/core_sched.c
@@ -103,6 +103,12 @@ static unsigned long sched_core_clone_cookie(struct task_struct *p)
 	return cookie;
 }
 
+void sched_core_fork(struct task_struct *p)
+{
+	RB_CLEAR_NODE(&p->core_node);
+	p->core_cookie = sched_core_clone_cookie(current);
+}
+
 void sched_core_free(struct task_struct *p)
 {
 	sched_core_put_cookie(p->core_cookie);

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [tip: sched/core] sched: Trivial core scheduling cookie management
  2021-04-22 12:05 ` [PATCH 16/19] sched: Trivial core scheduling cookie management Peter Zijlstra
@ 2021-05-12 10:28   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 103+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2021-05-12 10:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel),
	Don Hiatt, Hongyu Ning, Vincent Guittot, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     6e33cad0af49336952e5541464bd02f5b5fd433e
Gitweb:        https://git.kernel.org/tip/6e33cad0af49336952e5541464bd02f5b5fd433e
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Fri, 26 Mar 2021 18:55:06 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 12 May 2021 11:43:31 +02:00

sched: Trivial core scheduling cookie management

In order to not have to use pid_struct, create a new, smaller,
structure to manage task cookies for core scheduling.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123308.919768100@infradead.org
---
 include/linux/sched.h     |   6 ++-
 kernel/fork.c             |   1 +-
 kernel/sched/Makefile     |   1 +-
 kernel/sched/core.c       |   7 +-
 kernel/sched/core_sched.c | 109 +++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h      |  16 +++++-
 6 files changed, 137 insertions(+), 3 deletions(-)
 create mode 100644 kernel/sched/core_sched.c

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9b822e3..eab3f7c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2179,4 +2179,10 @@ int sched_trace_rq_nr_running(struct rq *rq);
 
 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
+#ifdef CONFIG_SCHED_CORE
+extern void sched_core_free(struct task_struct *tsk);
+#else
+static inline void sched_core_free(struct task_struct *tsk) { }
+#endif
+
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index dc06afd..d16c60c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -742,6 +742,7 @@ void __put_task_struct(struct task_struct *tsk)
 	exit_creds(tsk);
 	delayacct_tsk_free(tsk);
 	put_signal_struct(tsk->signal);
+	sched_core_free(tsk);
 
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 5fc9c9b..978fcfc 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -36,3 +36,4 @@ obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
 obj-$(CONFIG_PSI) += psi.o
+obj-$(CONFIG_SCHED_CORE) += core_sched.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b498888..55b2d93 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -167,7 +167,7 @@ static inline int rb_sched_core_cmp(const void *key, const struct rb_node *node)
 	return 0;
 }
 
-static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 {
 	rq->core->core_task_seq++;
 
@@ -177,14 +177,15 @@ static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 	rb_add(&p->core_node, &rq->core_tree, rb_sched_core_less);
 }
 
-static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
+void sched_core_dequeue(struct rq *rq, struct task_struct *p)
 {
 	rq->core->core_task_seq++;
 
-	if (!p->core_cookie)
+	if (!sched_core_enqueued(p))
 		return;
 
 	rb_erase(&p->core_node, &rq->core_tree);
+	RB_CLEAR_NODE(&p->core_node);
 }
 
 /*
diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c
new file mode 100644
index 0000000..8d0869a
--- /dev/null
+++ b/kernel/sched/core_sched.c
@@ -0,0 +1,109 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "sched.h"
+
+/*
+ * A simple wrapper around refcount. An allocated sched_core_cookie's
+ * address is used to compute the cookie of the task.
+ */
+struct sched_core_cookie {
+	refcount_t refcnt;
+};
+
+unsigned long sched_core_alloc_cookie(void)
+{
+	struct sched_core_cookie *ck = kmalloc(sizeof(*ck), GFP_KERNEL);
+	if (!ck)
+		return 0;
+
+	refcount_set(&ck->refcnt, 1);
+	sched_core_get();
+
+	return (unsigned long)ck;
+}
+
+void sched_core_put_cookie(unsigned long cookie)
+{
+	struct sched_core_cookie *ptr = (void *)cookie;
+
+	if (ptr && refcount_dec_and_test(&ptr->refcnt)) {
+		kfree(ptr);
+		sched_core_put();
+	}
+}
+
+unsigned long sched_core_get_cookie(unsigned long cookie)
+{
+	struct sched_core_cookie *ptr = (void *)cookie;
+
+	if (ptr)
+		refcount_inc(&ptr->refcnt);
+
+	return cookie;
+}
+
+/*
+ * sched_core_update_cookie - replace the cookie on a task
+ * @p: the task to update
+ * @cookie: the new cookie
+ *
+ * Effectively exchange the task cookie; caller is responsible for lifetimes on
+ * both ends.
+ *
+ * Returns: the old cookie
+ */
+unsigned long sched_core_update_cookie(struct task_struct *p, unsigned long cookie)
+{
+	unsigned long old_cookie;
+	struct rq_flags rf;
+	struct rq *rq;
+	bool enqueued;
+
+	rq = task_rq_lock(p, &rf);
+
+	/*
+	 * Since creating a cookie implies sched_core_get(), and we cannot set
+	 * a cookie until after we've created it, similarly, we cannot destroy
+	 * a cookie until after we've removed it, we must have core scheduling
+	 * enabled here.
+	 */
+	SCHED_WARN_ON((p->core_cookie || cookie) && !sched_core_enabled(rq));
+
+	enqueued = sched_core_enqueued(p);
+	if (enqueued)
+		sched_core_dequeue(rq, p);
+
+	old_cookie = p->core_cookie;
+	p->core_cookie = cookie;
+
+	if (enqueued)
+		sched_core_enqueue(rq, p);
+
+	/*
+	 * If task is currently running, it may not be compatible anymore after
+	 * the cookie change, so enter the scheduler on its CPU to schedule it
+	 * away.
+	 */
+	if (task_running(rq, p))
+		resched_curr(rq);
+
+	task_rq_unlock(rq, p, &rf);
+
+	return old_cookie;
+}
+
+static unsigned long sched_core_clone_cookie(struct task_struct *p)
+{
+	unsigned long cookie, flags;
+
+	raw_spin_lock_irqsave(&p->pi_lock, flags);
+	cookie = sched_core_get_cookie(p->core_cookie);
+	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+
+	return cookie;
+}
+
+void sched_core_free(struct task_struct *p)
+{
+	sched_core_put_cookie(p->core_cookie);
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3878386..904c52b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1229,6 +1229,22 @@ static inline bool sched_group_cookie_match(struct rq *rq,
 
 extern void queue_core_balance(struct rq *rq);
 
+static inline bool sched_core_enqueued(struct task_struct *p)
+{
+	return !RB_EMPTY_NODE(&p->core_node);
+}
+
+extern void sched_core_enqueue(struct rq *rq, struct task_struct *p);
+extern void sched_core_dequeue(struct rq *rq, struct task_struct *p);
+
+extern void sched_core_get(void);
+extern void sched_core_put(void);
+
+extern unsigned long sched_core_alloc_cookie(void);
+extern void sched_core_put_cookie(unsigned long cookie);
+extern unsigned long sched_core_get_cookie(unsigned long cookie);
+extern unsigned long sched_core_update_cookie(struct task_struct *p, unsigned long cookie);
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline bool sched_core_enabled(struct rq *rq)

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [tip: sched/core] sched: Migration changes for core scheduling
  2021-04-22 12:05 ` [PATCH 15/19] sched: Migration changes for core scheduling Peter Zijlstra
@ 2021-05-12 10:28   ` tip-bot2 for Aubrey Li
  0 siblings, 0 replies; 103+ messages in thread
From: tip-bot2 for Aubrey Li @ 2021-05-12 10:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Aubrey Li, Peter Zijlstra (Intel),
	Don Hiatt, Hongyu Ning, Vincent Guittot, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     97886d9dcd86820bdbc1fa73b455982809cbc8c2
Gitweb:        https://git.kernel.org/tip/97886d9dcd86820bdbc1fa73b455982809cbc8c2
Author:        Aubrey Li <aubrey.li@linux.intel.com>
AuthorDate:    Wed, 24 Mar 2021 17:40:13 -04:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 12 May 2021 11:43:30 +02:00

sched: Migration changes for core scheduling

 - Don't migrate if there is a cookie mismatch
     Load balancing tries to move a task from the busiest CPU to the
     destination CPU. When core scheduling is enabled, if the task's
     cookie does not match the destination CPU's core cookie, the task
     may be skipped by this CPU. This mitigates the forced idle time
     on the destination CPU.

 - Select a cookie-matched idle CPU
     In the fast path of task wakeup, select the first cookie-matched
     idle CPU instead of the first idle CPU.

 - Find the cookie-matched idlest CPU
     In the slow path of task wakeup, find the idlest CPU whose core
     cookie matches the task's cookie. All three cases reduce to the
     cookie-match check sketched below.
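
A loose sketch of that check, not part of the patch (cookie_ok() is an
illustrative name; the helpers in the diff below, sched_cpu_cookie_match()
and friends, additionally special-case fully idle cores and walk the SMT
and group masks):

	/* Skip a candidate CPU whose core runs a different cookie. */
	static bool cookie_ok(struct rq *rq, struct task_struct *p)
	{
		if (!sched_core_enabled(rq))
			return true;	/* core scheduling off: never filter */

		return rq->core->core_cookie == p->core_cookie;
	}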

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123308.860083871@infradead.org
---
 kernel/sched/fair.c  | 29 +++++++++++++----
 kernel/sched/sched.h | 73 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 96 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5948dc1..2635e10 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5889,11 +5889,15 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 
 	/* Traverse only the allowed CPUs */
 	for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+		struct rq *rq = cpu_rq(i);
+
+		if (!sched_core_cookie_match(rq, p))
+			continue;
+
 		if (sched_idle_cpu(i))
 			return i;
 
 		if (available_idle_cpu(i)) {
-			struct rq *rq = cpu_rq(i);
 			struct cpuidle_state *idle = idle_get_state(rq);
 			if (idle && idle->exit_latency < min_exit_latency) {
 				/*
@@ -5979,9 +5983,10 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
 	return new_cpu;
 }
 
-static inline int __select_idle_cpu(int cpu)
+static inline int __select_idle_cpu(int cpu, struct task_struct *p)
 {
-	if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
+	    sched_cpu_cookie_match(cpu_rq(cpu), p))
 		return cpu;
 
 	return -1;
@@ -6051,7 +6056,7 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
 	int cpu;
 
 	if (!static_branch_likely(&sched_smt_present))
-		return __select_idle_cpu(core);
+		return __select_idle_cpu(core, p);
 
 	for_each_cpu(cpu, cpu_smt_mask(core)) {
 		if (!available_idle_cpu(cpu)) {
@@ -6107,7 +6112,7 @@ static inline bool test_idle_cores(int cpu, bool def)
 
 static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
 {
-	return __select_idle_cpu(core);
+	return __select_idle_cpu(core, p);
 }
 
 static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
@@ -6164,7 +6169,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool 
 		} else {
 			if (!--nr)
 				return -1;
-			idle_cpu = __select_idle_cpu(cpu);
+			idle_cpu = __select_idle_cpu(cpu, p);
 			if ((unsigned int)idle_cpu < nr_cpumask_bits)
 				break;
 		}
@@ -7527,6 +7532,14 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 
 	if (sysctl_sched_migration_cost == -1)
 		return 1;
+
+	/*
+	 * Don't migrate task if the task's cookie does not match
+	 * with the destination CPU's core cookie.
+	 */
+	if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
+		return 1;
+
 	if (sysctl_sched_migration_cost == 0)
 		return 0;
 
@@ -8857,6 +8870,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 					p->cpus_ptr))
 			continue;
 
+		/* Skip over this group if no cookie matched */
+		if (!sched_group_cookie_match(cpu_rq(this_cpu), p, group))
+			continue;
+
 		local_group = cpumask_test_cpu(this_cpu,
 					       sched_group_span(group));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 91ca1fe..3878386 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1134,7 +1134,9 @@ static inline bool is_migration_disabled(struct task_struct *p)
 #endif
 }
 
+struct sched_group;
 #ifdef CONFIG_SCHED_CORE
+static inline struct cpumask *sched_group_span(struct sched_group *sg);
 
 DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
 
@@ -1170,6 +1172,61 @@ static inline raw_spinlock_t *__rq_lockp(struct rq *rq)
 
 bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool fi);
 
+/*
+ * Helpers to check if the CPU's core cookie matches with the task's cookie
+ * when core scheduling is enabled.
+ * A special case is that the task's cookie always matches with CPU's core
+ * cookie if the CPU is in an idle core.
+ */
+static inline bool sched_cpu_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
+	if (!sched_core_enabled(rq))
+		return true;
+
+	return rq->core->core_cookie == p->core_cookie;
+}
+
+static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	bool idle_core = true;
+	int cpu;
+
+	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
+	if (!sched_core_enabled(rq))
+		return true;
+
+	for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
+		if (!available_idle_cpu(cpu)) {
+			idle_core = false;
+			break;
+		}
+	}
+
+	/*
+	 * A CPU in an idle core is always the best choice for tasks with
+	 * cookies.
+	 */
+	return idle_core || rq->core->core_cookie == p->core_cookie;
+}
+
+static inline bool sched_group_cookie_match(struct rq *rq,
+					    struct task_struct *p,
+					    struct sched_group *group)
+{
+	int cpu;
+
+	/* Ignore cookie match if core scheduler is not enabled on the CPU. */
+	if (!sched_core_enabled(rq))
+		return true;
+
+	for_each_cpu_and(cpu, sched_group_span(group), p->cpus_ptr) {
+		if (sched_core_cookie_match(rq, p))
+			return true;
+	}
+	return false;
+}
+
 extern void queue_core_balance(struct rq *rq);
 
 #else /* !CONFIG_SCHED_CORE */
@@ -1198,6 +1255,22 @@ static inline void queue_core_balance(struct rq *rq)
 {
 }
 
+static inline bool sched_cpu_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	return true;
+}
+
+static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
+{
+	return true;
+}
+
+static inline bool sched_group_cookie_match(struct rq *rq,
+					    struct task_struct *p,
+					    struct sched_group *group)
+{
+	return true;
+}
 #endif /* CONFIG_SCHED_CORE */
 
 static inline void lockdep_assert_rq_held(struct rq *rq)

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [tip: sched/core] sched: Trivial forced-newidle balancer
  2021-04-22 12:05 ` [PATCH 14/19] sched: Trivial forced-newidle balancer Peter Zijlstra
@ 2021-05-12 10:28   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 103+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2021-05-12 10:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel),
	Don Hiatt, Hongyu Ning, Vincent Guittot, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     d2dfa17bc7de67e99685c4d6557837bf801a102c
Gitweb:        https://git.kernel.org/tip/d2dfa17bc7de67e99685c4d6557837bf801a102c
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 17 Nov 2020 18:19:43 -05:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 12 May 2021 11:43:30 +02:00

sched: Trivial forced-newidle balancer

When a sibling is forced idle to match the core cookie, search for
matching tasks to fill the core.

rcu_read_unlock() can incur an infrequent deadlock in
sched_core_balance(). Fix this by using the RCU-sched flavor instead.
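
For reference, the trigger condition condenses to this (taken from
queue_core_balance() in the diff below):

	/* Queue the balance callback only when this core is cookie
	 * constrained and this rq still has runnable tasks, i.e. the
	 * idle pick was forced rather than genuine. */
	if (sched_core_enabled(rq) && rq->core->core_cookie && rq->nr_running)
		queue_balance_callback(rq, &per_cpu(core_balance_head, rq->cpu),
				       sched_core_balance);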

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123308.800048269@infradead.org
---
 include/linux/sched.h |   1 +-
 kernel/sched/core.c   | 130 ++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/idle.c   |   1 +-
 kernel/sched/sched.h  |   6 ++-
 4 files changed, 137 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 45eedcc..9b822e3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -705,6 +705,7 @@ struct task_struct {
 #ifdef CONFIG_SCHED_CORE
 	struct rb_node			core_node;
 	unsigned long			core_cookie;
+	unsigned int			core_occupation;
 #endif
 
 #ifdef CONFIG_CGROUP_SCHED
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e45c1d2..b498888 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -204,6 +204,21 @@ static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
 	return __node_2_sc(node);
 }
 
+static struct task_struct *sched_core_next(struct task_struct *p, unsigned long cookie)
+{
+	struct rb_node *node = &p->core_node;
+
+	node = rb_next(node);
+	if (!node)
+		return NULL;
+
+	p = container_of(node, struct task_struct, core_node);
+	if (p->core_cookie != cookie)
+		return NULL;
+
+	return p;
+}
+
 /*
  * Magic required such that:
  *
@@ -5389,8 +5404,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	const struct sched_class *class;
 	const struct cpumask *smt_mask;
 	bool fi_before = false;
+	int i, j, cpu, occ = 0;
 	bool need_sync;
-	int i, j, cpu;
 
 	if (!sched_core_enabled(rq))
 		return __pick_next_task(rq, prev, rf);
@@ -5512,6 +5527,9 @@ again:
 			if (!p)
 				continue;
 
+			if (!is_task_rq_idle(p))
+				occ++;
+
 			rq_i->core_pick = p;
 			if (rq_i->idle == p && rq_i->nr_running) {
 				rq->core->core_forceidle = true;
@@ -5543,6 +5561,7 @@ again:
 
 						cpu_rq(j)->core_pick = NULL;
 					}
+					occ = 1;
 					goto again;
 				}
 			}
@@ -5588,6 +5607,8 @@ again:
 		if (!(fi_before && rq->core->core_forceidle))
 			task_vruntime_update(rq_i, rq_i->core_pick, rq->core->core_forceidle);
 
+		rq_i->core_pick->core_occupation = occ;
+
 		if (i == cpu) {
 			rq_i->core_pick = NULL;
 			continue;
@@ -5609,6 +5630,113 @@ done:
 	return next;
 }
 
+static bool try_steal_cookie(int this, int that)
+{
+	struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
+	struct task_struct *p;
+	unsigned long cookie;
+	bool success = false;
+
+	local_irq_disable();
+	double_rq_lock(dst, src);
+
+	cookie = dst->core->core_cookie;
+	if (!cookie)
+		goto unlock;
+
+	if (dst->curr != dst->idle)
+		goto unlock;
+
+	p = sched_core_find(src, cookie);
+	if (p == src->idle)
+		goto unlock;
+
+	do {
+		if (p == src->core_pick || p == src->curr)
+			goto next;
+
+		if (!cpumask_test_cpu(this, &p->cpus_mask))
+			goto next;
+
+		if (p->core_occupation > dst->idle->core_occupation)
+			goto next;
+
+		p->on_rq = TASK_ON_RQ_MIGRATING;
+		deactivate_task(src, p, 0);
+		set_task_cpu(p, this);
+		activate_task(dst, p, 0);
+		p->on_rq = TASK_ON_RQ_QUEUED;
+
+		resched_curr(dst);
+
+		success = true;
+		break;
+
+next:
+		p = sched_core_next(p, cookie);
+	} while (p);
+
+unlock:
+	double_rq_unlock(dst, src);
+	local_irq_enable();
+
+	return success;
+}
+
+static bool steal_cookie_task(int cpu, struct sched_domain *sd)
+{
+	int i;
+
+	for_each_cpu_wrap(i, sched_domain_span(sd), cpu) {
+		if (i == cpu)
+			continue;
+
+		if (need_resched())
+			break;
+
+		if (try_steal_cookie(cpu, i))
+			return true;
+	}
+
+	return false;
+}
+
+static void sched_core_balance(struct rq *rq)
+{
+	struct sched_domain *sd;
+	int cpu = cpu_of(rq);
+
+	preempt_disable();
+	rcu_read_lock();
+	raw_spin_rq_unlock_irq(rq);
+	for_each_domain(cpu, sd) {
+		if (need_resched())
+			break;
+
+		if (steal_cookie_task(cpu, sd))
+			break;
+	}
+	raw_spin_rq_lock_irq(rq);
+	rcu_read_unlock();
+	preempt_enable();
+}
+
+static DEFINE_PER_CPU(struct callback_head, core_balance_head);
+
+void queue_core_balance(struct rq *rq)
+{
+	if (!sched_core_enabled(rq))
+		return;
+
+	if (!rq->core->core_cookie)
+		return;
+
+	if (!rq->nr_running) /* not forced idle */
+		return;
+
+	queue_balance_callback(rq, &per_cpu(core_balance_head, rq->cpu), sched_core_balance);
+}
+
 static inline void sched_core_cpu_starting(unsigned int cpu)
 {
 	const struct cpumask *smt_mask = cpu_smt_mask(cpu);
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 43646e7..912b47a 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -437,6 +437,7 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
 {
 	update_idle_core(rq);
 	schedstat_inc(rq->sched_goidle);
+	queue_core_balance(rq);
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4a898ab..91ca1fe 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1170,6 +1170,8 @@ static inline raw_spinlock_t *__rq_lockp(struct rq *rq)
 
 bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool fi);
 
+extern void queue_core_balance(struct rq *rq);
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline bool sched_core_enabled(struct rq *rq)
@@ -1192,6 +1194,10 @@ static inline raw_spinlock_t *__rq_lockp(struct rq *rq)
 	return &rq->__lock;
 }
 
+static inline void queue_core_balance(struct rq *rq)
+{
+}
+
 #endif /* CONFIG_SCHED_CORE */
 
 static inline void lockdep_assert_rq_held(struct rq *rq)

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [tip: sched/core] sched/fair: Snapshot the min_vruntime of CPUs on force idle
  2021-04-22 12:05 ` [PATCH 13/19] sched/fair: Snapshot the min_vruntime of CPUs on force idle Peter Zijlstra
@ 2021-05-12 10:28   ` tip-bot2 for Joel Fernandes (Google)
  0 siblings, 0 replies; 103+ messages in thread
From: tip-bot2 for Joel Fernandes (Google) @ 2021-05-12 10:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Joel Fernandes (Google),
	Don Hiatt, Hongyu Ning, Vincent Guittot, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     c6047c2e3af68dae23ad884249e0d42ff28d2d1b
Gitweb:        https://git.kernel.org/tip/c6047c2e3af68dae23ad884249e0d42ff28d2d1b
Author:        Joel Fernandes (Google) <joel@joelfernandes.org>
AuthorDate:    Tue, 17 Nov 2020 18:19:39 -05:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 12 May 2021 11:43:29 +02:00

sched/fair: Snapshot the min_vruntime of CPUs on force idle

During force-idle, we end up doing a cross-CPU comparison of vruntimes
during pick_next_task. If we simply compare (vruntime - min_vruntime)
across CPUs, and if the CPUs only have one task each, we will always
end up comparing 0 with 0 and picking just one of the tasks every time.
This starves the task that was not picked. To fix this, take a snapshot
of the min_vruntime when entering force idle and use it for comparison.
This min_vruntime snapshot will only be used for cross-CPU vruntime
comparison, and nothing else.

A note about the min_vruntime snapshot and force idling:

During selection:

  When we're not fi, we need to update snapshot.
  When we're fi and we were not fi, we must update snapshot.
  When we're fi and we were already fi, we must not update snapshot.

Which gives:

  fib     fi      update
  0       0       1
  0       1       1
  1       0       1
  1       1       0

Where:

  fi:  force-idled now
  fib: force-idled before

So the min_vruntime snapshot needs to be updated when: !(fib && fi).

Also, the cfs_prio_less() function needs to be aware of whether the
core is in force idle or not, since it will use this information to
know whether to advance a cfs_rq's min_vruntime_fi in the hierarchy.
So pass this information along via pick_task() -> prio_less().
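
Condensed from the diff below, the two pieces look like this: the snapshot
refresh implements the truth table above, and cfs_prio_less() then compares
vruntimes normalized against the snapshots:

	/* refresh the snapshot unless we stay in force idle */
	if (!(fi_before && rq->core->core_forceidle))
		task_vruntime_update(rq_i, rq_i->core_pick, rq->core->core_forceidle);

	/* cross-CPU comparison, relative to each cfs_rq's snapshot */
	delta = (s64)(sea->vruntime - seb->vruntime) +
		(s64)(cfs_rqb->min_vruntime_fi - cfs_rqa->min_vruntime_fi);
	return delta > 0;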

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123308.738542617@infradead.org
---
 kernel/sched/core.c  | 59 +++++++++++++++++++---------------
 kernel/sched/fair.c  | 75 +++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |  8 +++++-
 3 files changed, 117 insertions(+), 25 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e506d9d..e45c1d2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -111,7 +111,7 @@ static inline int __task_prio(struct task_struct *p)
  */
 
 /* real prio, less is less */
-static inline bool prio_less(struct task_struct *a, struct task_struct *b)
+static inline bool prio_less(struct task_struct *a, struct task_struct *b, bool in_fi)
 {
 
 	int pa = __task_prio(a), pb = __task_prio(b);
@@ -125,19 +125,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
 		return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
-		u64 vruntime = b->se.vruntime;
-
-		/*
-		 * Normalize the vruntime if tasks are in different cpus.
-		 */
-		if (task_cpu(a) != task_cpu(b)) {
-			vruntime -= task_cfs_rq(b)->min_vruntime;
-			vruntime += task_cfs_rq(a)->min_vruntime;
-		}
-
-		return !((s64)(a->se.vruntime - vruntime) <= 0);
-	}
+	if (pa == MAX_RT_PRIO + MAX_NICE)	/* fair */
+		return cfs_prio_less(a, b, in_fi);
 
 	return false;
 }
@@ -151,7 +140,7 @@ static inline bool __sched_core_less(struct task_struct *a, struct task_struct *
 		return false;
 
 	/* flip prio, so high prio is leftmost */
-	if (prio_less(b, a))
+	if (prio_less(b, a, task_rq(a)->core->core_forceidle))
 		return true;
 
 	return false;
@@ -5350,7 +5339,7 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
  * - Else returns idle_task.
  */
 static struct task_struct *
-pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
+pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max, bool in_fi)
 {
 	struct task_struct *class_pick, *cookie_pick;
 	unsigned long cookie = rq->core->core_cookie;
@@ -5365,7 +5354,7 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
 		 * higher priority than max.
 		 */
 		if (max && class_pick->core_cookie &&
-		    prio_less(class_pick, max))
+		    prio_less(class_pick, max, in_fi))
 			return idle_sched_class.pick_task(rq);
 
 		return class_pick;
@@ -5384,19 +5373,22 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
 	 * the core (so far) and it must be selected, otherwise we must go with
 	 * the cookie pick in order to satisfy the constraint.
 	 */
-	if (prio_less(cookie_pick, class_pick) &&
-	    (!max || prio_less(max, class_pick)))
+	if (prio_less(cookie_pick, class_pick, in_fi) &&
+	    (!max || prio_less(max, class_pick, in_fi)))
 		return class_pick;
 
 	return cookie_pick;
 }
 
+extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi);
+
 static struct task_struct *
 pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	struct task_struct *next, *max = NULL;
 	const struct sched_class *class;
 	const struct cpumask *smt_mask;
+	bool fi_before = false;
 	bool need_sync;
 	int i, j, cpu;
 
@@ -5478,9 +5470,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 		if (!next->core_cookie) {
 			rq->core_pick = NULL;
+			/*
+			 * For robustness, update the min_vruntime_fi for
+			 * unconstrained picks as well.
+			 */
+			WARN_ON_ONCE(fi_before);
+			task_vruntime_update(rq, next, false);
 			goto done;
 		}
-		need_sync = true;
 	}
 
 	for_each_cpu(i, smt_mask) {
@@ -5511,11 +5508,16 @@ again:
 			 * highest priority task already selected for this
 			 * core.
 			 */
-			p = pick_task(rq_i, class, max);
+			p = pick_task(rq_i, class, max, fi_before);
 			if (!p)
 				continue;
 
 			rq_i->core_pick = p;
+			if (rq_i->idle == p && rq_i->nr_running) {
+				rq->core->core_forceidle = true;
+				if (!fi_before)
+					rq->core->core_forceidle_seq++;
+			}
 
 			/*
 			 * If this new candidate is of higher priority than the
@@ -5534,6 +5536,7 @@ again:
 				max = p;
 
 				if (old_max) {
+					rq->core->core_forceidle = false;
 					for_each_cpu(j, smt_mask) {
 						if (j == i)
 							continue;
@@ -5574,10 +5577,16 @@ again:
 		if (!rq_i->core_pick)
 			continue;
 
-		if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running &&
-		    !rq_i->core->core_forceidle) {
-			rq_i->core->core_forceidle = true;
-		}
+		/*
+		 * Update for new !FI->FI transitions, or if continuing to be in !FI:
+		 * fi_before     fi      update?
+		 *  0            0       1
+		 *  0            1       1
+		 *  1            0       1
+		 *  1            1       0
+		 */
+		if (!(fi_before && rq->core->core_forceidle))
+			task_vruntime_update(rq_i, rq_i->core_pick, rq->core->core_forceidle);
 
 		if (i == cpu) {
 			rq_i->core_pick = NULL;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4d1ecab..5948dc1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10801,6 +10801,81 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
 	    __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
 		resched_curr(rq);
 }
+
+/*
+ * se_fi_update - Update the cfs_rq->min_vruntime_fi in a CFS hierarchy if needed.
+ */
+static void se_fi_update(struct sched_entity *se, unsigned int fi_seq, bool forceidle)
+{
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		if (forceidle) {
+			if (cfs_rq->forceidle_seq == fi_seq)
+				break;
+			cfs_rq->forceidle_seq = fi_seq;
+		}
+
+		cfs_rq->min_vruntime_fi = cfs_rq->min_vruntime;
+	}
+}
+
+void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi)
+{
+	struct sched_entity *se = &p->se;
+
+	if (p->sched_class != &fair_sched_class)
+		return;
+
+	se_fi_update(se, rq->core->core_forceidle_seq, in_fi);
+}
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool in_fi)
+{
+	struct rq *rq = task_rq(a);
+	struct sched_entity *sea = &a->se;
+	struct sched_entity *seb = &b->se;
+	struct cfs_rq *cfs_rqa;
+	struct cfs_rq *cfs_rqb;
+	s64 delta;
+
+	SCHED_WARN_ON(task_rq(b)->core != rq->core);
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	/*
+	 * Find an se in the hierarchy for tasks a and b, such that the se's
+	 * are immediate siblings.
+	 */
+	while (sea->cfs_rq->tg != seb->cfs_rq->tg) {
+		int sea_depth = sea->depth;
+		int seb_depth = seb->depth;
+
+		if (sea_depth >= seb_depth)
+			sea = parent_entity(sea);
+		if (sea_depth <= seb_depth)
+			seb = parent_entity(seb);
+	}
+
+	se_fi_update(sea, rq->core->core_forceidle_seq, in_fi);
+	se_fi_update(seb, rq->core->core_forceidle_seq, in_fi);
+
+	cfs_rqa = sea->cfs_rq;
+	cfs_rqb = seb->cfs_rq;
+#else
+	cfs_rqa = &task_rq(a)->cfs;
+	cfs_rqb = &task_rq(b)->cfs;
+#endif
+
+	/*
+	 * Find delta after normalizing se's vruntime with its cfs_rq's
+	 * min_vruntime_fi, which would have been updated in prior calls
+	 * to se_fi_update().
+	 */
+	delta = (s64)(sea->vruntime - seb->vruntime) +
+		(s64)(cfs_rqb->min_vruntime_fi - cfs_rqa->min_vruntime_fi);
+
+	return delta > 0;
+}
 #else
 static inline void task_tick_core(struct rq *rq, struct task_struct *curr) {}
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index db55514..4a898ab 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -526,6 +526,11 @@ struct cfs_rq {
 
 	u64			exec_clock;
 	u64			min_vruntime;
+#ifdef CONFIG_SCHED_CORE
+	unsigned int		forceidle_seq;
+	u64			min_vruntime_fi;
+#endif
+
 #ifndef CONFIG_64BIT
 	u64			min_vruntime_copy;
 #endif
@@ -1089,6 +1094,7 @@ struct rq {
 	unsigned int		core_pick_seq;
 	unsigned long		core_cookie;
 	unsigned char		core_forceidle;
+	unsigned int		core_forceidle_seq;
 #endif
 };
 
@@ -1162,6 +1168,8 @@ static inline raw_spinlock_t *__rq_lockp(struct rq *rq)
 	return &rq->__lock;
 }
 
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool fi);
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline bool sched_core_enabled(struct rq *rq)

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [tip: sched/core] sched: Fix priority inversion of cookied task with sibling
  2021-04-22 12:05 ` [PATCH 12/19] sched: Fix priority inversion of cookied task with sibling Peter Zijlstra
@ 2021-05-12 10:28   ` tip-bot2 for Joel Fernandes (Google)
  0 siblings, 0 replies; 103+ messages in thread
From: tip-bot2 for Joel Fernandes (Google) @ 2021-05-12 10:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Joel Fernandes (Google),
	Don Hiatt, Hongyu Ning, Vincent Guittot, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     7afbba119f0da09824d723f8081608ea1f74ff57
Gitweb:        https://git.kernel.org/tip/7afbba119f0da09824d723f8081608ea1f74ff57
Author:        Joel Fernandes (Google) <joel@joelfernandes.org>
AuthorDate:    Tue, 17 Nov 2020 18:19:42 -05:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 12 May 2021 11:43:29 +02:00

sched: Fix priority inversion of cookied task with sibling

The rationale is as follows. In the core-wide pick logic, even if
need_sync == false, we need to go look at other CPUs (non-local CPUs)
to see if they could be running RT.

Say the RQs in a particular core look like this:

Let CFS1 and CFS2 be two tagged CFS tasks.
Let RT1 be an untagged RT task.

	rq0		rq1
	CFS1 (tagged)	RT1 (no tag)
	CFS2 (tagged)

Say schedule() runs on rq0. Now, it will enter the above loop and
pick_task(RT) will return NULL for 'p'. It will enter the above if()
block and see that need_sync == false and will skip RT entirely.

The end result of the selection will be (say prio(CFS1) > prio(CFS2)):

	rq0             rq1
	CFS1            IDLE

When it should have selected:

	rq0             rq1
	IDLE            RT
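
In code terms (a loose condensation of the diff below), the fix derives
need_sync from the core state up front and only takes the unconstrained
local fast path when nothing on the core is constrained:

	need_sync = !!rq->core->core_cookie;
	if (rq->core->core_forceidle)
		need_sync = true;

	if (!need_sync) {
		/* local pick across all classes, RT included */
		for_each_class(class) {
			next = class->pick_task(rq);
			if (next)
				break;
		}
		if (!next->core_cookie)
			goto done;	/* truly unconstrained pick */
		need_sync = true;	/* cookied task: do the core-wide selection */
	}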

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123308.678425748@infradead.org
---
 kernel/sched/core.c | 65 +++++++++++++++++---------------------------
 1 file changed, 26 insertions(+), 39 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f5e1e6f..e506d9d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5443,6 +5443,15 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	put_prev_task_balance(rq, prev, rf);
 
 	smt_mask = cpu_smt_mask(cpu);
+	need_sync = !!rq->core->core_cookie;
+
+	/* reset state */
+	rq->core->core_cookie = 0UL;
+	if (rq->core->core_forceidle) {
+		need_sync = true;
+		fi_before = true;
+		rq->core->core_forceidle = false;
+	}
 
 	/*
 	 * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
@@ -5455,14 +5464,25 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	 * 'Fix' this by also increasing @task_seq for every pick.
 	 */
 	rq->core->core_task_seq++;
-	need_sync = !!rq->core->core_cookie;
 
-	/* reset state */
-	rq->core->core_cookie = 0UL;
-	if (rq->core->core_forceidle) {
+	/*
+	 * Optimize for common case where this CPU has no cookies
+	 * and there are no cookied tasks running on siblings.
+	 */
+	if (!need_sync) {
+		for_each_class(class) {
+			next = class->pick_task(rq);
+			if (next)
+				break;
+		}
+
+		if (!next->core_cookie) {
+			rq->core_pick = NULL;
+			goto done;
+		}
 		need_sync = true;
-		rq->core->core_forceidle = false;
 	}
+
 	for_each_cpu(i, smt_mask) {
 		struct rq *rq_i = cpu_rq(i);
 
@@ -5492,31 +5512,8 @@ again:
 			 * core.
 			 */
 			p = pick_task(rq_i, class, max);
-			if (!p) {
-				/*
-				 * If there weren't no cookies; we don't need to
-				 * bother with the other siblings.
-				 * If the rest of the core is not running a tagged
-				 * task, i.e.  need_sync == 0, and the current CPU
-				 * which called into the schedule() loop does not
-				 * have any tasks for this class, skip selecting for
-				 * other siblings since there's no point. We don't skip
-				 * for RT/DL because that could make CFS force-idle RT.
-				 */
-				if (i == cpu && !need_sync && class == &fair_sched_class)
-					goto next_class;
-
+			if (!p)
 				continue;
-			}
-
-			/*
-			 * Optimize the 'normal' case where there aren't any
-			 * cookies and we don't need to sync up.
-			 */
-			if (i == cpu && !need_sync && !p->core_cookie) {
-				next = p;
-				goto done;
-			}
 
 			rq_i->core_pick = p;
 
@@ -5544,19 +5541,9 @@ again:
 						cpu_rq(j)->core_pick = NULL;
 					}
 					goto again;
-				} else {
-					/*
-					 * Once we select a task for a cpu, we
-					 * should not be doing an unconstrained
-					 * pick because it might starve a task
-					 * on a forced idle cpu.
-					 */
-					need_sync = true;
 				}
-
 			}
 		}
-next_class:;
 	}
 
 	rq->core->core_pick_seq = rq->core->core_task_seq;

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [tip: sched/core] sched/fair: Fix forced idle sibling starvation corner case
  2021-04-22 12:05 ` [PATCH 11/19] sched/fair: Fix forced idle sibling starvation corner case Peter Zijlstra
@ 2021-05-12 10:28   ` tip-bot2 for Vineeth Pillai
  0 siblings, 0 replies; 103+ messages in thread
From: tip-bot2 for Vineeth Pillai @ 2021-05-12 10:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Vineeth Pillai, Peter Zijlstra (Intel),
	Don Hiatt, Hongyu Ning, Vincent Guittot, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     8039e96fcc1de30d5bcaf05da9ca2de46a800826
Gitweb:        https://git.kernel.org/tip/8039e96fcc1de30d5bcaf05da9ca2de46a800826
Author:        Vineeth Pillai <viremana@linux.microsoft.com>
AuthorDate:    Tue, 17 Nov 2020 18:19:38 -05:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 12 May 2021 11:43:29 +02:00

sched/fair: Fix forced idle sibling starvation corner case

If there is only one long-running local task and the sibling is
forced idle, the forced-idle sibling might not get a chance to run
until a schedule event happens on any CPU in the core.

So we check for this condition during a tick to see if a sibling
is starved and then give it a chance to schedule.
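
The tick-side check, condensed from task_tick_core() in the diff below:

	/* One runnable task here, the sibling is forced idle, and we have
	 * already used the slice we would get with two runnable tasks. */
	if (rq->core->core_forceidle && rq->cfs.nr_running == 1 &&
	    __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
		resched_curr(rq);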

Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123308.617407840@infradead.org
---
 kernel/sched/core.c  | 15 ++++++++-------
 kernel/sched/fair.c  | 40 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  2 +-
 3 files changed, 49 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index db763f4..f5e1e6f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5459,16 +5459,15 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	/* reset state */
 	rq->core->core_cookie = 0UL;
+	if (rq->core->core_forceidle) {
+		need_sync = true;
+		rq->core->core_forceidle = false;
+	}
 	for_each_cpu(i, smt_mask) {
 		struct rq *rq_i = cpu_rq(i);
 
 		rq_i->core_pick = NULL;
 
-		if (rq_i->core_forceidle) {
-			need_sync = true;
-			rq_i->core_forceidle = false;
-		}
-
 		if (i != cpu)
 			update_rq_clock(rq_i);
 	}
@@ -5588,8 +5587,10 @@ next_class:;
 		if (!rq_i->core_pick)
 			continue;
 
-		if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running)
-			rq_i->core_forceidle = true;
+		if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running &&
+		    !rq_i->core->core_forceidle) {
+			rq_i->core->core_forceidle = true;
+		}
 
 		if (i == cpu) {
 			rq_i->core_pick = NULL;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 08be7a2..4d1ecab 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10767,6 +10767,44 @@ static void rq_offline_fair(struct rq *rq)
 
 #endif /* CONFIG_SMP */
 
+#ifdef CONFIG_SCHED_CORE
+static inline bool
+__entity_slice_used(struct sched_entity *se, int min_nr_tasks)
+{
+	u64 slice = sched_slice(cfs_rq_of(se), se);
+	u64 rtime = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+
+	return (rtime * min_nr_tasks > slice);
+}
+
+#define MIN_NR_TASKS_DURING_FORCEIDLE	2
+static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
+{
+	if (!sched_core_enabled(rq))
+		return;
+
+	/*
+	 * If runqueue has only one task which used up its slice and
+	 * if the sibling is forced idle, then trigger schedule to
+	 * give forced idle task a chance.
+	 *
+	 * sched_slice() considers only this active rq and it gets the
+	 * whole slice. But during force idle, we have siblings acting
+	 * like a single runqueue and hence we need to consider runnable
+	 * tasks on this cpu and the forced idle cpu. Ideally, we should
+	 * go through the forced idle rq, but that would be a perf hit.
+	 * We can assume that the forced idle cpu has atleast
+	 * MIN_NR_TASKS_DURING_FORCEIDLE - 1 tasks and use that to check
+	 * if we need to give up the cpu.
+	 */
+	if (rq->core->core_forceidle && rq->cfs.nr_running == 1 &&
+	    __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
+		resched_curr(rq);
+}
+#else
+static inline void task_tick_core(struct rq *rq, struct task_struct *curr) {}
+#endif
+
 /*
  * scheduler tick hitting a task of our scheduling class.
  *
@@ -10790,6 +10828,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	update_misfit_status(curr, rq);
 	update_overutilized_status(task_rq(curr));
+
+	task_tick_core(rq, curr);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dd44a31..db55514 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1083,12 +1083,12 @@ struct rq {
 	unsigned int		core_enabled;
 	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
-	unsigned char		core_forceidle;
 
 	/* shared state */
 	unsigned int		core_task_seq;
 	unsigned int		core_pick_seq;
 	unsigned long		core_cookie;
+	unsigned char		core_forceidle;
 #endif
 };
 

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [tip: sched/core] sched: Basic tracking of matching tasks
  2021-04-22 12:05 ` [PATCH 09/19] sched: Basic tracking of matching tasks Peter Zijlstra
@ 2021-05-12 10:28   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 103+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2021-05-12 10:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Joel Fernandes (Google),
	Don Hiatt, Hongyu Ning, Vincent Guittot, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     8a311c740b53324ec584e0e3bb7077d56b123c28
Gitweb:        https://git.kernel.org/tip/8a311c740b53324ec584e0e3bb7077d56b123c28
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 17 Nov 2020 18:19:36 -05:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 12 May 2021 11:43:28 +02:00

sched: Basic tracking of matching tasks

Introduce task_struct::core_cookie as an opaque identifier for core
scheduling. When enabled, core scheduling will only allow matching
tasks to be on the core, where idle matches everything.

When task_struct::core_cookie is set (and core scheduling is enabled),
these tasks are indexed in a second RB-tree, first on cookie value and
then on scheduling function, such that matching task selection always
finds the most eligible match.

NOTE: *shudder* at the overhead...

NOTE: *sigh*, a 3rd copy of the scheduling function; the alternative
is per class tracking of cookies and that just duplicates a lot of
stuff for no raisin (the 2nd copy lives in the rt-mutex PI code).
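
The resulting tree order, condensed from __sched_core_less() in the diff
below: sort on cookie first, and within a cookie keep the highest-priority
task leftmost so sched_core_find() can return it directly:

	static inline bool __sched_core_less(struct task_struct *a, struct task_struct *b)
	{
		if (a->core_cookie != b->core_cookie)
			return a->core_cookie < b->core_cookie;

		/* flip prio, so high prio is leftmost */
		return prio_less(b, a);
	}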

[Joel: folded fixes]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123308.496975854@infradead.org
---
 include/linux/sched.h |   8 +-
 kernel/sched/core.c   | 152 +++++++++++++++++++++++++++++++++++++++--
 kernel/sched/fair.c   |  46 +------------
 kernel/sched/sched.h  |  55 +++++++++++++++-
 4 files changed, 210 insertions(+), 51 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d2c8813..45eedcc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -700,10 +700,16 @@ struct task_struct {
 	const struct sched_class	*sched_class;
 	struct sched_entity		se;
 	struct sched_rt_entity		rt;
+	struct sched_dl_entity		dl;
+
+#ifdef CONFIG_SCHED_CORE
+	struct rb_node			core_node;
+	unsigned long			core_cookie;
+#endif
+
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group		*sched_task_group;
 #endif
-	struct sched_dl_entity		dl;
 
 #ifdef CONFIG_UCLAMP_TASK
 	/*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 85147be..c057d47 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -88,6 +88,133 @@ __read_mostly int scheduler_running;
 
 DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
 
+/* kernel prio, less is more */
+static inline int __task_prio(struct task_struct *p)
+{
+	if (p->sched_class == &stop_sched_class) /* trumps deadline */
+		return -2;
+
+	if (rt_prio(p->prio)) /* includes deadline */
+		return p->prio; /* [-1, 99] */
+
+	if (p->sched_class == &idle_sched_class)
+		return MAX_RT_PRIO + NICE_WIDTH; /* 140 */
+
+	return MAX_RT_PRIO + MAX_NICE; /* 120, squash fair */
+}
+
+/*
+ * l(a,b)
+ * le(a,b) := !l(b,a)
+ * g(a,b)  := l(b,a)
+ * ge(a,b) := !l(a,b)
+ */
+
+/* real prio, less is less */
+static inline bool prio_less(struct task_struct *a, struct task_struct *b)
+{
+
+	int pa = __task_prio(a), pb = __task_prio(b);
+
+	if (-pa < -pb)
+		return true;
+
+	if (-pb < -pa)
+		return false;
+
+	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
+		return !dl_time_before(a->dl.deadline, b->dl.deadline);
+
+	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
+		u64 vruntime = b->se.vruntime;
+
+		/*
+		 * Normalize the vruntime if tasks are in different cpus.
+		 */
+		if (task_cpu(a) != task_cpu(b)) {
+			vruntime -= task_cfs_rq(b)->min_vruntime;
+			vruntime += task_cfs_rq(a)->min_vruntime;
+		}
+
+		return !((s64)(a->se.vruntime - vruntime) <= 0);
+	}
+
+	return false;
+}
+
+static inline bool __sched_core_less(struct task_struct *a, struct task_struct *b)
+{
+	if (a->core_cookie < b->core_cookie)
+		return true;
+
+	if (a->core_cookie > b->core_cookie)
+		return false;
+
+	/* flip prio, so high prio is leftmost */
+	if (prio_less(b, a))
+		return true;
+
+	return false;
+}
+
+#define __node_2_sc(node) rb_entry((node), struct task_struct, core_node)
+
+static inline bool rb_sched_core_less(struct rb_node *a, const struct rb_node *b)
+{
+	return __sched_core_less(__node_2_sc(a), __node_2_sc(b));
+}
+
+static inline int rb_sched_core_cmp(const void *key, const struct rb_node *node)
+{
+	const struct task_struct *p = __node_2_sc(node);
+	unsigned long cookie = (unsigned long)key;
+
+	if (cookie < p->core_cookie)
+		return -1;
+
+	if (cookie > p->core_cookie)
+		return 1;
+
+	return 0;
+}
+
+static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+{
+	rq->core->core_task_seq++;
+
+	if (!p->core_cookie)
+		return;
+
+	rb_add(&p->core_node, &rq->core_tree, rb_sched_core_less);
+}
+
+static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
+{
+	rq->core->core_task_seq++;
+
+	if (!p->core_cookie)
+		return;
+
+	rb_erase(&p->core_node, &rq->core_tree);
+}
+
+/*
+ * Find left-most (aka, highest priority) task matching @cookie.
+ */
+static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
+{
+	struct rb_node *node;
+
+	node = rb_find_first((void *)cookie, &rq->core_tree, rb_sched_core_cmp);
+	/*
+	 * The idle task always matches any cookie!
+	 */
+	if (!node)
+		return idle_sched_class.pick_task(rq);
+
+	return __node_2_sc(node);
+}
+
 /*
  * Magic required such that:
  *
@@ -147,10 +274,16 @@ static void __sched_core_flip(bool enabled)
 	cpus_read_unlock();
 }
 
-static void __sched_core_enable(void)
+static void sched_core_assert_empty(void)
 {
-	// XXX verify there are no cookie tasks (yet)
+	int cpu;
 
+	for_each_possible_cpu(cpu)
+		WARN_ON_ONCE(!RB_EMPTY_ROOT(&cpu_rq(cpu)->core_tree));
+}
+
+static void __sched_core_enable(void)
+{
 	static_branch_enable(&__sched_core_enabled);
 	/*
 	 * Ensure all previous instances of raw_spin_rq_*lock() have finished
@@ -158,12 +291,12 @@ static void __sched_core_enable(void)
 	 */
 	synchronize_rcu();
 	__sched_core_flip(true);
+	sched_core_assert_empty();
 }
 
 static void __sched_core_disable(void)
 {
-	// XXX verify there are no cookie tasks (left)
-
+	sched_core_assert_empty();
 	__sched_core_flip(false);
 	static_branch_disable(&__sched_core_enabled);
 }
@@ -205,6 +338,11 @@ void sched_core_put(void)
 		schedule_work(&_work);
 }
 
+#else /* !CONFIG_SCHED_CORE */
+
+static inline void sched_core_enqueue(struct rq *rq, struct task_struct *p) { }
+static inline void sched_core_dequeue(struct rq *rq, struct task_struct *p) { }
+
 #endif /* CONFIG_SCHED_CORE */
 
 /*
@@ -1797,10 +1935,16 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 
 	uclamp_rq_inc(rq, p);
 	p->sched_class->enqueue_task(rq, p, flags);
+
+	if (sched_core_enabled(rq))
+		sched_core_enqueue(rq, p);
 }
 
 static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	if (sched_core_enabled(rq))
+		sched_core_dequeue(rq, p);
+
 	if (!(flags & DEQUEUE_NOCLOCK))
 		update_rq_clock(rq);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 51d72ab..08be7a2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -268,33 +268,11 @@ const struct sched_class fair_sched_class;
  */
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-static inline struct task_struct *task_of(struct sched_entity *se)
-{
-	SCHED_WARN_ON(!entity_is_task(se));
-	return container_of(se, struct task_struct, se);
-}
 
 /* Walk up scheduling entities hierarchy */
 #define for_each_sched_entity(se) \
 		for (; se; se = se->parent)
 
-static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
-{
-	return p->se.cfs_rq;
-}
-
-/* runqueue on which this entity is (to be) queued */
-static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
-{
-	return se->cfs_rq;
-}
-
-/* runqueue "owned" by this group */
-static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
-{
-	return grp->my_q;
-}
-
 static inline void cfs_rq_tg_path(struct cfs_rq *cfs_rq, char *path, int len)
 {
 	if (!path)
@@ -455,33 +433,9 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #else	/* !CONFIG_FAIR_GROUP_SCHED */
 
-static inline struct task_struct *task_of(struct sched_entity *se)
-{
-	return container_of(se, struct task_struct, se);
-}
-
 #define for_each_sched_entity(se) \
 		for (; se; se = NULL)
 
-static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
-{
-	return &task_rq(p)->cfs;
-}
-
-static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
-{
-	struct task_struct *p = task_of(se);
-	struct rq *rq = task_rq(p);
-
-	return &rq->cfs;
-}
-
-/* runqueue "owned" by this group */
-static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
-{
-	return NULL;
-}
-
 static inline void cfs_rq_tg_path(struct cfs_rq *cfs_rq, char *path, int len)
 {
 	if (path)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fa990cd..e43a217 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1080,6 +1080,10 @@ struct rq {
 	/* per rq */
 	struct rq		*core;
 	unsigned int		core_enabled;
+	struct rb_root		core_tree;
+
+	/* shared state */
+	unsigned int		core_task_seq;
 #endif
 };
 
@@ -1243,6 +1247,57 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		raw_cpu_ptr(&runqueues)
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline struct task_struct *task_of(struct sched_entity *se)
+{
+	SCHED_WARN_ON(!entity_is_task(se));
+	return container_of(se, struct task_struct, se);
+}
+
+static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
+{
+	return p->se.cfs_rq;
+}
+
+/* runqueue on which this entity is (to be) queued */
+static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
+{
+	return se->cfs_rq;
+}
+
+/* runqueue "owned" by this group */
+static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
+{
+	return grp->my_q;
+}
+
+#else
+
+static inline struct task_struct *task_of(struct sched_entity *se)
+{
+	return container_of(se, struct task_struct, se);
+}
+
+static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
+{
+	return &task_rq(p)->cfs;
+}
+
+static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
+{
+	struct task_struct *p = task_of(se);
+	struct rq *rq = task_rq(p);
+
+	return &rq->cfs;
+}
+
+/* runqueue "owned" by this group */
+static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
+{
+	return NULL;
+}
+#endif
+
 extern void update_rq_clock(struct rq *rq);
 
 static inline u64 __rq_clock_broken(struct rq *rq)

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [tip: sched/core] sched: Add core wide task selection and scheduling
  2021-04-22 12:05 ` [PATCH 10/19] sched: Add core wide task selection and scheduling Peter Zijlstra
@ 2021-05-12 10:28   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 103+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2021-05-12 10:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel),
	Don Hiatt, Hongyu Ning, Vincent Guittot, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     539f65125d20aacab54d02d77f10a839f45b09dc
Gitweb:        https://git.kernel.org/tip/539f65125d20aacab54d02d77f10a839f45b09dc
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 17 Nov 2020 18:19:37 -05:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 12 May 2021 11:43:28 +02:00

sched: Add core wide task selection and scheduling

Instead of only selecting a local task, select a task for all SMT
siblings for every reschedule on the core (irrespective of which logical
CPU does the reschedule).
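
A loose sketch of the selection pass (the diff below additionally handles
the max-filter restart, the sequence counters, forced idle and offline
siblings):

	for_each_class(class) {
		for_each_cpu_wrap(i, smt_mask, cpu) {
			struct task_struct *p;

			p = pick_task(cpu_rq(i), class, max);	/* honours the core cookie */
			if (!p)
				continue;

			cpu_rq(i)->core_pick = p;
			if (!max || !cookie_match(max, p))
				max = p;	/* new core-wide max; the real code restarts picks */
		}
	}
	/* every sibling now has a compatible core_pick; resched the others */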

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123308.557559654@infradead.org
---
 kernel/sched/core.c  | 301 +++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |   6 +-
 2 files changed, 305 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c057d47..db763f4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5282,7 +5282,7 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev,
  * Pick up the highest-prio task:
  */
 static inline struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	const struct sched_class *class;
 	struct task_struct *p;
@@ -5323,6 +5323,294 @@ restart:
 }
 
 #ifdef CONFIG_SCHED_CORE
+static inline bool is_task_rq_idle(struct task_struct *t)
+{
+	return (task_rq(t)->idle == t);
+}
+
+static inline bool cookie_equals(struct task_struct *a, unsigned long cookie)
+{
+	return is_task_rq_idle(a) || (a->core_cookie == cookie);
+}
+
+static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
+{
+	if (is_task_rq_idle(a) || is_task_rq_idle(b))
+		return true;
+
+	return a->core_cookie == b->core_cookie;
+}
+
+// XXX fairness/fwd progress conditions
+/*
+ * Returns
+ * - NULL if there is no runnable task for this class.
+ * - the highest priority task for this runqueue if it matches
+ *   rq->core->core_cookie or its priority is greater than max.
+ * - Else returns idle_task.
+ */
+static struct task_struct *
+pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
+{
+	struct task_struct *class_pick, *cookie_pick;
+	unsigned long cookie = rq->core->core_cookie;
+
+	class_pick = class->pick_task(rq);
+	if (!class_pick)
+		return NULL;
+
+	if (!cookie) {
+		/*
+		 * If class_pick is tagged, return it only if it has
+		 * higher priority than max.
+		 */
+		if (max && class_pick->core_cookie &&
+		    prio_less(class_pick, max))
+			return idle_sched_class.pick_task(rq);
+
+		return class_pick;
+	}
+
+	/*
+	 * If class_pick is idle or matches cookie, return early.
+	 */
+	if (cookie_equals(class_pick, cookie))
+		return class_pick;
+
+	cookie_pick = sched_core_find(rq, cookie);
+
+	/*
+	 * If class > max && class > cookie, it is the highest priority task on
+	 * the core (so far) and it must be selected, otherwise we must go with
+	 * the cookie pick in order to satisfy the constraint.
+	 */
+	if (prio_less(cookie_pick, class_pick) &&
+	    (!max || prio_less(max, class_pick)))
+		return class_pick;
+
+	return cookie_pick;
+}
+
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	struct task_struct *next, *max = NULL;
+	const struct sched_class *class;
+	const struct cpumask *smt_mask;
+	bool need_sync;
+	int i, j, cpu;
+
+	if (!sched_core_enabled(rq))
+		return __pick_next_task(rq, prev, rf);
+
+	cpu = cpu_of(rq);
+
+	/* Stopper task is switching into idle, no need core-wide selection. */
+	if (cpu_is_offline(cpu)) {
+		/*
+		 * Reset core_pick so that we don't enter the fastpath when
+		 * coming online. core_pick would already be migrated to
+		 * another cpu during offline.
+		 */
+		rq->core_pick = NULL;
+		return __pick_next_task(rq, prev, rf);
+	}
+
+	/*
+	 * If there were no {en,de}queues since we picked (IOW, the task
+	 * pointers are all still valid), and we haven't scheduled the last
+	 * pick yet, do so now.
+	 *
+	 * rq->core_pick can be NULL if no selection was made for a CPU because
+	 * it was either offline or went offline during a sibling's core-wide
+	 * selection. In this case, do a core-wide selection.
+	 */
+	if (rq->core->core_pick_seq == rq->core->core_task_seq &&
+	    rq->core->core_pick_seq != rq->core_sched_seq &&
+	    rq->core_pick) {
+		WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
+
+		next = rq->core_pick;
+		if (next != prev) {
+			put_prev_task(rq, prev);
+			set_next_task(rq, next);
+		}
+
+		rq->core_pick = NULL;
+		return next;
+	}
+
+	put_prev_task_balance(rq, prev, rf);
+
+	smt_mask = cpu_smt_mask(cpu);
+
+	/*
+	 * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
+	 *
+	 * @task_seq guards the task state ({en,de}queues)
+	 * @pick_seq is the @task_seq we did a selection on
+	 * @sched_seq is the @pick_seq we scheduled
+	 *
+	 * However, preemptions can cause multiple picks on the same task set.
+	 * 'Fix' this by also increasing @task_seq for every pick.
+	 */
+	rq->core->core_task_seq++;
+	need_sync = !!rq->core->core_cookie;
+
+	/* reset state */
+	rq->core->core_cookie = 0UL;
+	for_each_cpu(i, smt_mask) {
+		struct rq *rq_i = cpu_rq(i);
+
+		rq_i->core_pick = NULL;
+
+		if (rq_i->core_forceidle) {
+			need_sync = true;
+			rq_i->core_forceidle = false;
+		}
+
+		if (i != cpu)
+			update_rq_clock(rq_i);
+	}
+
+	/*
+	 * Try and select tasks for each sibling in decending sched_class
+	 * order.
+	 */
+	for_each_class(class) {
+again:
+		for_each_cpu_wrap(i, smt_mask, cpu) {
+			struct rq *rq_i = cpu_rq(i);
+			struct task_struct *p;
+
+			if (rq_i->core_pick)
+				continue;
+
+			/*
+			 * If this sibling doesn't yet have a suitable task to
+			 * run, ask for the most eligible task, given the
+			 * highest priority task already selected for this
+			 * core.
+			 */
+			p = pick_task(rq_i, class, max);
+			if (!p) {
+				/*
+				 * If there are no cookies, we don't need to
+				 * bother with the other siblings.
+				 * If the rest of the core is not running a tagged
+				 * task, i.e.  need_sync == 0, and the current CPU
+				 * which called into the schedule() loop does not
+				 * have any tasks for this class, skip selecting for
+				 * other siblings since there's no point. We don't skip
+				 * for RT/DL because that could make CFS force-idle RT.
+				 */
+				if (i == cpu && !need_sync && class == &fair_sched_class)
+					goto next_class;
+
+				continue;
+			}
+
+			/*
+			 * Optimize the 'normal' case where there aren't any
+			 * cookies and we don't need to sync up.
+			 */
+			if (i == cpu && !need_sync && !p->core_cookie) {
+				next = p;
+				goto done;
+			}
+
+			rq_i->core_pick = p;
+
+			/*
+			 * If this new candidate is of higher priority than the
+			 * previous, and they're incompatible, we need to wipe
+			 * the slate and start over. pick_task makes sure that
+			 * p's priority is more than max if it doesn't match
+			 * max's cookie.
+			 *
+			 * NOTE: this is a linear max-filter and is thus bounded
+			 * in execution time.
+			 */
+			if (!max || !cookie_match(max, p)) {
+				struct task_struct *old_max = max;
+
+				rq->core->core_cookie = p->core_cookie;
+				max = p;
+
+				if (old_max) {
+					for_each_cpu(j, smt_mask) {
+						if (j == i)
+							continue;
+
+						cpu_rq(j)->core_pick = NULL;
+					}
+					goto again;
+				} else {
+					/*
+					 * Once we select a task for a cpu, we
+					 * should not be doing an unconstrained
+					 * pick because it might starve a task
+					 * on a forced idle cpu.
+					 */
+					need_sync = true;
+				}
+
+			}
+		}
+next_class:;
+	}
+
+	rq->core->core_pick_seq = rq->core->core_task_seq;
+	next = rq->core_pick;
+	rq->core_sched_seq = rq->core->core_pick_seq;
+
+	/* Something should have been selected for current CPU */
+	WARN_ON_ONCE(!next);
+
+	/*
+	 * Reschedule siblings
+	 *
+	 * NOTE: L1TF -- at this point we're no longer running the old task and
+	 * sending an IPI (below) ensures the sibling will no longer be running
+	 * their task. This ensures there is no inter-sibling overlap between
+	 * non-matching user state.
+	 */
+	for_each_cpu(i, smt_mask) {
+		struct rq *rq_i = cpu_rq(i);
+
+		/*
+		 * An online sibling might have gone offline before a task
+		 * could be picked for it, or it might be offline but later
+		 * happen to come online, but it's too late and nothing was
+		 * picked for it.  That's Ok - it will pick tasks for itself,
+		 * so ignore it.
+		 */
+		if (!rq_i->core_pick)
+			continue;
+
+		if (is_task_rq_idle(rq_i->core_pick) && rq_i->nr_running)
+			rq_i->core_forceidle = true;
+
+		if (i == cpu) {
+			rq_i->core_pick = NULL;
+			continue;
+		}
+
+		/* Did we break L1TF mitigation requirements? */
+		WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+
+		if (rq_i->curr == rq_i->core_pick) {
+			rq_i->core_pick = NULL;
+			continue;
+		}
+
+		resched_curr(rq_i);
+	}
+
+done:
+	set_next_task(rq, next);
+	return next;
+}
 
 static inline void sched_core_cpu_starting(unsigned int cpu)
 {
@@ -5354,6 +5642,12 @@ static inline void sched_core_cpu_starting(unsigned int cpu)
 
 static inline void sched_core_cpu_starting(unsigned int cpu) {}
 
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	return __pick_next_task(rq, prev, rf);
+}
+
 #endif /* CONFIG_SCHED_CORE */
 
 /*
@@ -8609,7 +8903,12 @@ void __init sched_init(void)
 
 #ifdef CONFIG_SCHED_CORE
 		rq->core = NULL;
+		rq->core_pick = NULL;
 		rq->core_enabled = 0;
+		rq->core_tree = RB_ROOT;
+		rq->core_forceidle = false;
+
+		rq->core_cookie = 0UL;
 #endif
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e43a217..dd44a31 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1079,11 +1079,16 @@ struct rq {
 #ifdef CONFIG_SCHED_CORE
 	/* per rq */
 	struct rq		*core;
+	struct task_struct	*core_pick;
 	unsigned int		core_enabled;
+	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
+	unsigned char		core_forceidle;
 
 	/* shared state */
 	unsigned int		core_task_seq;
+	unsigned int		core_pick_seq;
+	unsigned long		core_cookie;
 #endif
 };
 
@@ -2060,7 +2065,6 @@ static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 
 static inline void set_next_task(struct rq *rq, struct task_struct *next)
 {
-	WARN_ON_ONCE(rq->curr != next);
 	next->sched_class->set_next_task(rq, next, false);
 }
 

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [tip: sched/core] sched: Introduce sched_class::pick_task()
  2021-04-22 12:05 ` [PATCH 08/19] sched: Introduce sched_class::pick_task() Peter Zijlstra
@ 2021-05-12 10:28   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 103+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2021-05-12 10:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel),
	Vineeth Remanan Pillai, Don Hiatt, Hongyu Ning, Vincent Guittot,
	x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     21f56ffe4482e501b9e83737612493eeaac21f5a
Gitweb:        https://git.kernel.org/tip/21f56ffe4482e501b9e83737612493eeaac21f5a
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 17 Nov 2020 18:19:32 -05:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 12 May 2021 11:43:28 +02:00

sched: Introduce sched_class::pick_task()

Because sched_class::pick_next_task() also implies
sched_class::set_next_task() (and possibly put_prev_task() and
newidle_balance), it is not state invariant. This makes it unsuitable
for remote task selection.
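
Schematically, each class can then express its pick_next_task() as a
state-invariant pick followed by an explicit commit, while core
scheduling calls only ->pick_task() on a remote rq. A rough sketch with
illustrative (non-kernel) types:

	struct task;

	struct class_ops {
		struct task *(*pick_task)(void *rq);              /* inspect only  */
		void (*set_next_task)(void *rq, struct task *t);  /* commit choice */
	};

	/* Local pick: choose and then commit to the choice. */
	static struct task *class_pick_next_task(struct class_ops *ops, void *rq)
	{
		struct task *t = ops->pick_task(rq);

		if (t)
			ops->set_next_task(rq, t);

		return t;
	}

This is the shape the per-class hunks below take (pick_task_dl(),
pick_task_rt(), pick_task_fair(), pick_task_stop(), pick_task_idle()).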

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[Vineeth: folded fixes]
Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123308.437092775@infradead.org
---
 kernel/sched/deadline.c  | 16 ++++++++++++++--
 kernel/sched/fair.c      | 40 ++++++++++++++++++++++++++++++++++++---
 kernel/sched/idle.c      |  8 ++++++++-
 kernel/sched/rt.c        | 15 +++++++++++++--
 kernel/sched/sched.h     |  3 +++-
 kernel/sched/stop_task.c | 14 ++++++++++++--
 6 files changed, 87 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index fb0eb9a..3829c5a 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1852,7 +1852,7 @@ static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
 	return rb_entry(left, struct sched_dl_entity, rb_node);
 }
 
-static struct task_struct *pick_next_task_dl(struct rq *rq)
+static struct task_struct *pick_task_dl(struct rq *rq)
 {
 	struct sched_dl_entity *dl_se;
 	struct dl_rq *dl_rq = &rq->dl;
@@ -1864,7 +1864,18 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
 	dl_se = pick_next_dl_entity(rq, dl_rq);
 	BUG_ON(!dl_se);
 	p = dl_task_of(dl_se);
-	set_next_task_dl(rq, p, true);
+
+	return p;
+}
+
+static struct task_struct *pick_next_task_dl(struct rq *rq)
+{
+	struct task_struct *p;
+
+	p = pick_task_dl(rq);
+	if (p)
+		set_next_task_dl(rq, p, true);
+
 	return p;
 }
 
@@ -2539,6 +2550,7 @@ DEFINE_SCHED_CLASS(dl) = {
 
 #ifdef CONFIG_SMP
 	.balance		= balance_dl,
+	.pick_task		= pick_task_dl,
 	.select_task_rq		= select_task_rq_dl,
 	.migrate_task_rq	= migrate_task_rq_dl,
 	.set_cpus_allowed       = set_cpus_allowed_dl,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 18960d0..51d72ab 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4419,6 +4419,8 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 static void
 set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	clear_buddies(cfs_rq, se);
+
 	/* 'current' is not kept within the tree. */
 	if (se->on_rq) {
 		/*
@@ -4478,7 +4480,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	 * Avoid running the skip buddy, if running something else can
 	 * be done without getting too unfair.
 	 */
-	if (cfs_rq->skip == se) {
+	if (cfs_rq->skip && cfs_rq->skip == se) {
 		struct sched_entity *second;
 
 		if (se == curr) {
@@ -4505,8 +4507,6 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 		se = cfs_rq->last;
 	}
 
-	clear_buddies(cfs_rq, se);
-
 	return se;
 }
 
@@ -7095,6 +7095,39 @@ preempt:
 		set_last_buddy(se);
 }
 
+#ifdef CONFIG_SMP
+static struct task_struct *pick_task_fair(struct rq *rq)
+{
+	struct sched_entity *se;
+	struct cfs_rq *cfs_rq;
+
+again:
+	cfs_rq = &rq->cfs;
+	if (!cfs_rq->nr_running)
+		return NULL;
+
+	do {
+		struct sched_entity *curr = cfs_rq->curr;
+
+		/* When we pick for a remote RQ, we'll not have done put_prev_entity() */
+		if (curr) {
+			if (curr->on_rq)
+				update_curr(cfs_rq);
+			else
+				curr = NULL;
+
+			if (unlikely(check_cfs_rq_runtime(cfs_rq)))
+				goto again;
+		}
+
+		se = pick_next_entity(cfs_rq, curr);
+		cfs_rq = group_cfs_rq(se);
+	} while (cfs_rq);
+
+	return task_of(se);
+}
+#endif
+
 struct task_struct *
 pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
@@ -11298,6 +11331,7 @@ DEFINE_SCHED_CLASS(fair) = {
 
 #ifdef CONFIG_SMP
 	.balance		= balance_fair,
+	.pick_task		= pick_task_fair,
 	.select_task_rq		= select_task_rq_fair,
 	.migrate_task_rq	= migrate_task_rq_fair,
 
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 0194768..43646e7 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -439,6 +439,13 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
 	schedstat_inc(rq->sched_goidle);
 }
 
+#ifdef CONFIG_SMP
+static struct task_struct *pick_task_idle(struct rq *rq)
+{
+	return rq->idle;
+}
+#endif
+
 struct task_struct *pick_next_task_idle(struct rq *rq)
 {
 	struct task_struct *next = rq->idle;
@@ -506,6 +513,7 @@ DEFINE_SCHED_CLASS(idle) = {
 
 #ifdef CONFIG_SMP
 	.balance		= balance_idle,
+	.pick_task		= pick_task_idle,
 	.select_task_rq		= select_task_rq_idle,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index b3d39c3..a525447 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1626,7 +1626,7 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
 	return rt_task_of(rt_se);
 }
 
-static struct task_struct *pick_next_task_rt(struct rq *rq)
+static struct task_struct *pick_task_rt(struct rq *rq)
 {
 	struct task_struct *p;
 
@@ -1634,7 +1634,17 @@ static struct task_struct *pick_next_task_rt(struct rq *rq)
 		return NULL;
 
 	p = _pick_next_task_rt(rq);
-	set_next_task_rt(rq, p, true);
+
+	return p;
+}
+
+static struct task_struct *pick_next_task_rt(struct rq *rq)
+{
+	struct task_struct *p = pick_task_rt(rq);
+
+	if (p)
+		set_next_task_rt(rq, p, true);
+
 	return p;
 }
 
@@ -2483,6 +2493,7 @@ DEFINE_SCHED_CLASS(rt) = {
 
 #ifdef CONFIG_SMP
 	.balance		= balance_rt,
+	.pick_task		= pick_task_rt,
 	.select_task_rq		= select_task_rq_rt,
 	.set_cpus_allowed       = set_cpus_allowed_common,
 	.rq_online              = rq_online_rt,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ca30af3..fa990cd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1953,6 +1953,9 @@ struct sched_class {
 #ifdef CONFIG_SMP
 	int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
 	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int flags);
+
+	struct task_struct * (*pick_task)(struct rq *rq);
+
 	void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
 
 	void (*task_woken)(struct rq *this_rq, struct task_struct *task);
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 55f3912..f988ebe 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -34,15 +34,24 @@ static void set_next_task_stop(struct rq *rq, struct task_struct *stop, bool fir
 	stop->se.exec_start = rq_clock_task(rq);
 }
 
-static struct task_struct *pick_next_task_stop(struct rq *rq)
+static struct task_struct *pick_task_stop(struct rq *rq)
 {
 	if (!sched_stop_runnable(rq))
 		return NULL;
 
-	set_next_task_stop(rq, rq->stop, true);
 	return rq->stop;
 }
 
+static struct task_struct *pick_next_task_stop(struct rq *rq)
+{
+	struct task_struct *p = pick_task_stop(rq);
+
+	if (p)
+		set_next_task_stop(rq, p, true);
+
+	return p;
+}
+
 static void
 enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
 {
@@ -123,6 +132,7 @@ DEFINE_SCHED_CLASS(stop) = {
 
 #ifdef CONFIG_SMP
 	.balance		= balance_stop,
+	.pick_task		= pick_task_stop,
 	.select_task_rq		= select_task_rq_stop,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [tip: sched/core] sched: Allow sched_core_put() from atomic context
  2021-04-22 12:05 ` [PATCH 07/19] sched: Allow sched_core_put() from atomic context Peter Zijlstra
@ 2021-05-12 10:28   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 103+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2021-05-12 10:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel),
	Don Hiatt, Hongyu Ning, Vincent Guittot, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     875feb41fd20f6bd6054c9e79a5bcd9da6d8d2b2
Gitweb:        https://git.kernel.org/tip/875feb41fd20f6bd6054c9e79a5bcd9da6d8d2b2
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Mon, 29 Mar 2021 10:08:58 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 12 May 2021 11:43:27 +02:00

sched: Allow sched_core_put() from atomic context

Stuff the meat of sched_core_put() into a work such that we can use
sched_core_put() from atomic context.
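
The resulting shape (lockless fast path, mutex-serialized enable on the
first get, and the possibly-sleeping disable on the last put pushed out
to a work item) can be sketched as a userspace analogue, with stdatomic
plus a detached pthread standing in for schedule_work(); names are
illustrative, not kernel API:

	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdio.h>

	static atomic_int count;
	static pthread_mutex_t slow_lock = PTHREAD_MUTEX_INITIALIZER;

	static void feature_enable(void)  { puts("enable");  }
	static void feature_disable(void) { puts("disable"); }

	void feature_get(void)
	{
		int c = atomic_load(&count);

		/* Fast path: already enabled, just take another reference. */
		while (c > 0)
			if (atomic_compare_exchange_weak(&count, &c, c + 1))
				return;

		/* Slow path: may need to enable; serialize on the mutex. */
		pthread_mutex_lock(&slow_lock);
		if (atomic_load(&count) == 0)
			feature_enable();
		atomic_fetch_add(&count, 1);
		pthread_mutex_unlock(&slow_lock);
	}

	static void *put_work(void *arg)
	{
		(void)arg;
		/* Deferred part: drop the reference, disable if it was the last. */
		pthread_mutex_lock(&slow_lock);
		if (atomic_fetch_sub(&count, 1) == 1)
			feature_disable();
		pthread_mutex_unlock(&slow_lock);
		return NULL;
	}

	void feature_put(void)
	{
		int c = atomic_load(&count);

		/* Fast path: not the last reference, no state change needed. */
		while (c > 1)
			if (atomic_compare_exchange_weak(&count, &c, c - 1))
				return;

		/* Possibly the last reference; the disable side may block, so
		 * hand it off rather than doing it in the caller's context.
		 * (The kernel reuses one static work item and relies on
		 * WORK_STRUCT_PENDING_BIT instead of spawning a thread.) */
		pthread_t t;
		if (pthread_create(&t, NULL, put_work, NULL) == 0)
			pthread_detach(&t);
	}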

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123308.377455632@infradead.org
---
 kernel/sched/core.c | 33 +++++++++++++++++++++++++++------
 1 file changed, 27 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 42c1c88..85147be 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -102,7 +102,7 @@ DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
  */
 
 static DEFINE_MUTEX(sched_core_mutex);
-static int sched_core_count;
+static atomic_t sched_core_count;
 static struct cpumask sched_core_mask;
 
 static void __sched_core_flip(bool enabled)
@@ -170,18 +170,39 @@ static void __sched_core_disable(void)
 
 void sched_core_get(void)
 {
+	if (atomic_inc_not_zero(&sched_core_count))
+		return;
+
 	mutex_lock(&sched_core_mutex);
-	if (!sched_core_count++)
+	if (!atomic_read(&sched_core_count))
 		__sched_core_enable();
+
+	smp_mb__before_atomic();
+	atomic_inc(&sched_core_count);
 	mutex_unlock(&sched_core_mutex);
 }
 
-void sched_core_put(void)
+static void __sched_core_put(struct work_struct *work)
 {
-	mutex_lock(&sched_core_mutex);
-	if (!--sched_core_count)
+	if (atomic_dec_and_mutex_lock(&sched_core_count, &sched_core_mutex)) {
 		__sched_core_disable();
-	mutex_unlock(&sched_core_mutex);
+		mutex_unlock(&sched_core_mutex);
+	}
+}
+
+void sched_core_put(void)
+{
+	static DECLARE_WORK(_work, __sched_core_put);
+
+	/*
+	 * "There can be only one"
+	 *
+	 * Either this is the last one, or we don't actually need to do any
+	 * 'work'. If it is the last *again*, we rely on
+	 * WORK_STRUCT_PENDING_BIT.
+	 */
+	if (!atomic_add_unless(&sched_core_count, -1, 1))
+		schedule_work(&_work);
 }
 
 #endif /* CONFIG_SCHED_CORE */

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [tip: sched/core] sched: Optimize rq_lockp() usage
  2021-04-22 12:05 ` [PATCH 06/19] sched: Optimize rq_lockp() usage Peter Zijlstra
@ 2021-05-12 10:28   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 103+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2021-05-12 10:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel),
	Don Hiatt, Hongyu Ning, Vincent Guittot, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     9ef7e7e33bcdb57be1afb28884053c28b5f05240
Gitweb:        https://git.kernel.org/tip/9ef7e7e33bcdb57be1afb28884053c28b5f05240
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Wed, 03 Mar 2021 16:45:41 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 12 May 2021 11:43:27 +02:00

sched: Optimize rq_lockp() usage

rq_lockp() includes a static_branch(), which is asm-goto, which is
asm volatile, which defeats regular CSE. This means that:

	if (!static_branch(&foo))
		return simple;

	if (static_branch(&foo) && cond)
		return complex;

Doesn't fold and we get horrible code. Introduce __rq_lockp() without
the static_branch() on.
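
In generic terms the split is: a checked accessor whose branch the
optimizer has to treat as opaque (the static_branch() analogue), and a
raw __ variant that is an ordinary load and therefore folds/CSEs, for
paths that are already serialized against the flag changing. A sketch
with made-up types:

	struct thing {
		int	shared_enabled;
		int	shared_lock;
		int	own_lock;
	};

	/* Stand-in for static_branch(): correct, but opaque to the
	 * optimizer, so back-to-back calls don't fold into one. */
	int thing_shared(struct thing *t);

	static inline int *thing_lockp(struct thing *t)
	{
		if (thing_shared(t))
			return &t->shared_lock;
		return &t->own_lock;
	}

	/* Raw variant: a plain load the compiler can CSE freely. */
	static inline int *__thing_lockp(struct thing *t)
	{
		if (t->shared_enabled)
			return &t->shared_lock;
		return &t->own_lock;
	}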

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123308.316696988@infradead.org
---
 kernel/sched/core.c     | 16 ++++++++--------
 kernel/sched/deadline.c |  4 ++--
 kernel/sched/fair.c     |  2 +-
 kernel/sched/sched.h    | 33 +++++++++++++++++++++++++--------
 4 files changed, 36 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 384b793..42c1c88 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -300,9 +300,9 @@ void raw_spin_rq_lock_nested(struct rq *rq, int subclass)
 	}
 
 	for (;;) {
-		lock = rq_lockp(rq);
+		lock = __rq_lockp(rq);
 		raw_spin_lock_nested(lock, subclass);
-		if (likely(lock == rq_lockp(rq))) {
+		if (likely(lock == __rq_lockp(rq))) {
 			/* preempt_count *MUST* be > 1 */
 			preempt_enable_no_resched();
 			return;
@@ -325,9 +325,9 @@ bool raw_spin_rq_trylock(struct rq *rq)
 	}
 
 	for (;;) {
-		lock = rq_lockp(rq);
+		lock = __rq_lockp(rq);
 		ret = raw_spin_trylock(lock);
-		if (!ret || (likely(lock == rq_lockp(rq)))) {
+		if (!ret || (likely(lock == __rq_lockp(rq)))) {
 			preempt_enable();
 			return ret;
 		}
@@ -352,7 +352,7 @@ void double_rq_lock(struct rq *rq1, struct rq *rq2)
 		swap(rq1, rq2);
 
 	raw_spin_rq_lock(rq1);
-	if (rq_lockp(rq1) == rq_lockp(rq2))
+	if (__rq_lockp(rq1) == __rq_lockp(rq2))
 		return;
 
 	raw_spin_rq_lock_nested(rq2, SINGLE_DEPTH_NESTING);
@@ -2622,7 +2622,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	 * task_rq_lock().
 	 */
 	WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
-				      lockdep_is_held(rq_lockp(task_rq(p)))));
+				      lockdep_is_held(__rq_lockp(task_rq(p)))));
 #endif
 	/*
 	 * Clearly, migrating tasks to offline CPUs is a fairly daft thing.
@@ -4248,7 +4248,7 @@ prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf
 	 * do an early lockdep release here:
 	 */
 	rq_unpin_lock(rq, rf);
-	spin_release(&rq_lockp(rq)->dep_map, _THIS_IP_);
+	spin_release(&__rq_lockp(rq)->dep_map, _THIS_IP_);
 #ifdef CONFIG_DEBUG_SPINLOCK
 	/* this is a valid case when another task releases the spinlock */
 	rq_lockp(rq)->owner = next;
@@ -4262,7 +4262,7 @@ static inline void finish_lock_switch(struct rq *rq)
 	 * fix up the runqueue lock - which gets 'carried over' from
 	 * prev into current:
 	 */
-	spin_acquire(&rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
+	spin_acquire(&__rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
 	__balance_callbacks(rq);
 	raw_spin_rq_unlock_irq(rq);
 }
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 6e99b8b..fb0eb9a 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1097,9 +1097,9 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 		 * If the runqueue is no longer available, migrate the
 		 * task elsewhere. This necessarily changes rq.
 		 */
-		lockdep_unpin_lock(rq_lockp(rq), rf.cookie);
+		lockdep_unpin_lock(__rq_lockp(rq), rf.cookie);
 		rq = dl_task_offline_migration(rq, p);
-		rf.cookie = lockdep_pin_lock(rq_lockp(rq));
+		rf.cookie = lockdep_pin_lock(__rq_lockp(rq));
 		update_rq_clock(rq);
 
 		/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e50bd75..18960d0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1107,7 +1107,7 @@ struct numa_group {
 static struct numa_group *deref_task_numa_group(struct task_struct *p)
 {
 	return rcu_dereference_check(p->numa_group, p == current ||
-		(lockdep_is_held(rq_lockp(task_rq(p))) && !READ_ONCE(p->on_cpu)));
+		(lockdep_is_held(__rq_lockp(task_rq(p))) && !READ_ONCE(p->on_cpu)));
 }
 
 static struct numa_group *deref_curr_numa_group(struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 29418b8..ca30af3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1133,6 +1133,10 @@ static inline bool sched_core_disabled(void)
 	return !static_branch_unlikely(&__sched_core_enabled);
 }
 
+/*
+ * Be careful with this function; not for general use. The return value isn't
+ * stable unless you actually hold a relevant rq->__lock.
+ */
 static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 {
 	if (sched_core_enabled(rq))
@@ -1141,6 +1145,14 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 	return &rq->__lock;
 }
 
+static inline raw_spinlock_t *__rq_lockp(struct rq *rq)
+{
+	if (rq->core_enabled)
+		return &rq->core->__lock;
+
+	return &rq->__lock;
+}
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline bool sched_core_enabled(struct rq *rq)
@@ -1158,11 +1170,16 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 	return &rq->__lock;
 }
 
+static inline raw_spinlock_t *__rq_lockp(struct rq *rq)
+{
+	return &rq->__lock;
+}
+
 #endif /* CONFIG_SCHED_CORE */
 
 static inline void lockdep_assert_rq_held(struct rq *rq)
 {
-	lockdep_assert_held(rq_lockp(rq));
+	lockdep_assert_held(__rq_lockp(rq));
 }
 
 extern void raw_spin_rq_lock_nested(struct rq *rq, int subclass);
@@ -1346,7 +1363,7 @@ extern struct callback_head balance_push_callback;
  */
 static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
 {
-	rf->cookie = lockdep_pin_lock(rq_lockp(rq));
+	rf->cookie = lockdep_pin_lock(__rq_lockp(rq));
 
 #ifdef CONFIG_SCHED_DEBUG
 	rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
@@ -1364,12 +1381,12 @@ static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
 		rf->clock_update_flags = RQCF_UPDATED;
 #endif
 
-	lockdep_unpin_lock(rq_lockp(rq), rf->cookie);
+	lockdep_unpin_lock(__rq_lockp(rq), rf->cookie);
 }
 
 static inline void rq_repin_lock(struct rq *rq, struct rq_flags *rf)
 {
-	lockdep_repin_lock(rq_lockp(rq), rf->cookie);
+	lockdep_repin_lock(__rq_lockp(rq), rf->cookie);
 
 #ifdef CONFIG_SCHED_DEBUG
 	/*
@@ -2338,7 +2355,7 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
 	__acquires(busiest->lock)
 	__acquires(this_rq->lock)
 {
-	if (rq_lockp(this_rq) == rq_lockp(busiest))
+	if (__rq_lockp(this_rq) == __rq_lockp(busiest))
 		return 0;
 
 	if (likely(raw_spin_rq_trylock(busiest)))
@@ -2370,9 +2387,9 @@ static inline int double_lock_balance(struct rq *this_rq, struct rq *busiest)
 static inline void double_unlock_balance(struct rq *this_rq, struct rq *busiest)
 	__releases(busiest->lock)
 {
-	if (rq_lockp(this_rq) != rq_lockp(busiest))
+	if (__rq_lockp(this_rq) != __rq_lockp(busiest))
 		raw_spin_rq_unlock(busiest);
-	lock_set_subclass(&rq_lockp(this_rq)->dep_map, 0, _RET_IP_);
+	lock_set_subclass(&__rq_lockp(this_rq)->dep_map, 0, _RET_IP_);
 }
 
 static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
@@ -2412,7 +2429,7 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
 	__releases(rq1->lock)
 	__releases(rq2->lock)
 {
-	if (rq_lockp(rq1) != rq_lockp(rq2))
+	if (__rq_lockp(rq1) != __rq_lockp(rq2))
 		raw_spin_rq_unlock(rq2);
 	else
 		__release(rq2->lock);

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [tip: sched/core] sched: Core-wide rq->lock
  2021-05-07  9:50   ` [PATCH v2 " Peter Zijlstra
@ 2021-05-12 10:28     ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 103+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2021-05-12 10:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel),
	Don Hiatt, Hongyu Ning, Vincent Guittot, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     9edeaea1bc452372718837ed2ba775811baf1ba1
Gitweb:        https://git.kernel.org/tip/9edeaea1bc452372718837ed2ba775811baf1ba1
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 17 Nov 2020 18:19:34 -05:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 12 May 2021 11:43:27 +02:00

sched: Core-wide rq->lock

Introduce the basic infrastructure to have a core wide rq->lock.

This relies on the rq->__lock order being in increasing CPU number
(inside a core). It is also constrained to SMT8 per lockdep (and
SMT256 per preempt_count).

Luckily SMT8 is the max supported SMT count for Linux (Mips, Sparc and
Power are known to have this).
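
The central trick in the raw_spin_rq_lock_nested() / raw_spin_rq_trylock()
hunks below is: acquire whatever lock rq_lockp() currently points at,
then re-check that the pointer did not change while we were waiting on
it, and retry if it did. A generic userspace sketch of that retry loop
(illustrative names, pthread mutexes standing in for raw spinlocks):

	#include <pthread.h>
	#include <stdatomic.h>

	struct obj {
		_Atomic(pthread_mutex_t *) lockp;	/* lock currently guarding obj */
	};

	static void obj_lock(struct obj *o)
	{
		for (;;) {
			pthread_mutex_t *lock = atomic_load(&o->lockp);

			pthread_mutex_lock(lock);
			/* Still the lock guarding the object? Then we hold it.
			 * This relies on the flip side only changing ->lockp
			 * while holding the locks involved, which is what
			 * __sched_core_flip() does with the rq locks. */
			if (lock == atomic_load(&o->lockp))
				return;
			pthread_mutex_unlock(lock);
		}
	}

	static void obj_unlock(struct obj *o)
	{
		pthread_mutex_unlock(atomic_load(&o->lockp));
	}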

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/YJUNfzSgptjX7tG6@hirez.programming.kicks-ass.net
---
 kernel/Kconfig.preempt |   6 +-
 kernel/sched/core.c    | 164 +++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h   |  58 ++++++++++++++-
 3 files changed, 224 insertions(+), 4 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 4160173..ea1e333 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -99,3 +99,9 @@ config PREEMPT_DYNAMIC
 
 	  Interesting if you want the same pre-built kernel should be used for
 	  both Server and Desktop workloads.
+
+config SCHED_CORE
+	bool "Core Scheduling for SMT"
+	default y
+	depends on SCHED_SMT
+
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8bd2f12..384b793 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -84,6 +84,108 @@ unsigned int sysctl_sched_rt_period = 1000000;
 
 __read_mostly int scheduler_running;
 
+#ifdef CONFIG_SCHED_CORE
+
+DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+/*
+ * Magic required such that:
+ *
+ *	raw_spin_rq_lock(rq);
+ *	...
+ *	raw_spin_rq_unlock(rq);
+ *
+ * ends up locking and unlocking the _same_ lock, and all CPUs
+ * always agree on what rq has what lock.
+ *
+ * XXX entirely possible to selectively enable cores, don't bother for now.
+ */
+
+static DEFINE_MUTEX(sched_core_mutex);
+static int sched_core_count;
+static struct cpumask sched_core_mask;
+
+static void __sched_core_flip(bool enabled)
+{
+	int cpu, t, i;
+
+	cpus_read_lock();
+
+	/*
+	 * Toggle the online cores, one by one.
+	 */
+	cpumask_copy(&sched_core_mask, cpu_online_mask);
+	for_each_cpu(cpu, &sched_core_mask) {
+		const struct cpumask *smt_mask = cpu_smt_mask(cpu);
+
+		i = 0;
+		local_irq_disable();
+		for_each_cpu(t, smt_mask) {
+			/* supports up to SMT8 */
+			raw_spin_lock_nested(&cpu_rq(t)->__lock, i++);
+		}
+
+		for_each_cpu(t, smt_mask)
+			cpu_rq(t)->core_enabled = enabled;
+
+		for_each_cpu(t, smt_mask)
+			raw_spin_unlock(&cpu_rq(t)->__lock);
+		local_irq_enable();
+
+		cpumask_andnot(&sched_core_mask, &sched_core_mask, smt_mask);
+	}
+
+	/*
+	 * Toggle the offline CPUs.
+	 */
+	cpumask_copy(&sched_core_mask, cpu_possible_mask);
+	cpumask_andnot(&sched_core_mask, &sched_core_mask, cpu_online_mask);
+
+	for_each_cpu(cpu, &sched_core_mask)
+		cpu_rq(cpu)->core_enabled = enabled;
+
+	cpus_read_unlock();
+}
+
+static void __sched_core_enable(void)
+{
+	// XXX verify there are no cookie tasks (yet)
+
+	static_branch_enable(&__sched_core_enabled);
+	/*
+	 * Ensure all previous instances of raw_spin_rq_*lock() have finished
+	 * and future ones will observe !sched_core_disabled().
+	 */
+	synchronize_rcu();
+	__sched_core_flip(true);
+}
+
+static void __sched_core_disable(void)
+{
+	// XXX verify there are no cookie tasks (left)
+
+	__sched_core_flip(false);
+	static_branch_disable(&__sched_core_enabled);
+}
+
+void sched_core_get(void)
+{
+	mutex_lock(&sched_core_mutex);
+	if (!sched_core_count++)
+		__sched_core_enable();
+	mutex_unlock(&sched_core_mutex);
+}
+
+void sched_core_put(void)
+{
+	mutex_lock(&sched_core_mutex);
+	if (!--sched_core_count)
+		__sched_core_disable();
+	mutex_unlock(&sched_core_mutex);
+}
+
+#endif /* CONFIG_SCHED_CORE */
+
 /*
  * part of the period that we allow rt tasks to run in us.
  * default: 0.95s
@@ -188,16 +290,23 @@ void raw_spin_rq_lock_nested(struct rq *rq, int subclass)
 {
 	raw_spinlock_t *lock;
 
+	/* Matches synchronize_rcu() in __sched_core_enable() */
+	preempt_disable();
 	if (sched_core_disabled()) {
 		raw_spin_lock_nested(&rq->__lock, subclass);
+		/* preempt_count *MUST* be > 1 */
+		preempt_enable_no_resched();
 		return;
 	}
 
 	for (;;) {
 		lock = rq_lockp(rq);
 		raw_spin_lock_nested(lock, subclass);
-		if (likely(lock == rq_lockp(rq)))
+		if (likely(lock == rq_lockp(rq))) {
+			/* preempt_count *MUST* be > 1 */
+			preempt_enable_no_resched();
 			return;
+		}
 		raw_spin_unlock(lock);
 	}
 }
@@ -207,14 +316,21 @@ bool raw_spin_rq_trylock(struct rq *rq)
 	raw_spinlock_t *lock;
 	bool ret;
 
-	if (sched_core_disabled())
-		return raw_spin_trylock(&rq->__lock);
+	/* Matches synchronize_rcu() in __sched_core_enable() */
+	preempt_disable();
+	if (sched_core_disabled()) {
+		ret = raw_spin_trylock(&rq->__lock);
+		preempt_enable();
+		return ret;
+	}
 
 	for (;;) {
 		lock = rq_lockp(rq);
 		ret = raw_spin_trylock(lock);
-		if (!ret || (likely(lock == rq_lockp(rq))))
+		if (!ret || (likely(lock == rq_lockp(rq)))) {
+			preempt_enable();
 			return ret;
+		}
 		raw_spin_unlock(lock);
 	}
 }
@@ -5041,6 +5157,40 @@ restart:
 	BUG();
 }
 
+#ifdef CONFIG_SCHED_CORE
+
+static inline void sched_core_cpu_starting(unsigned int cpu)
+{
+	const struct cpumask *smt_mask = cpu_smt_mask(cpu);
+	struct rq *rq, *core_rq = NULL;
+	int i;
+
+	core_rq = cpu_rq(cpu)->core;
+
+	if (!core_rq) {
+		for_each_cpu(i, smt_mask) {
+			rq = cpu_rq(i);
+			if (rq->core && rq->core == rq)
+				core_rq = rq;
+		}
+
+		if (!core_rq)
+			core_rq = cpu_rq(cpu);
+
+		for_each_cpu(i, smt_mask) {
+			rq = cpu_rq(i);
+
+			WARN_ON_ONCE(rq->core && rq->core != core_rq);
+			rq->core = core_rq;
+		}
+	}
+}
+#else /* !CONFIG_SCHED_CORE */
+
+static inline void sched_core_cpu_starting(unsigned int cpu) {}
+
+#endif /* CONFIG_SCHED_CORE */
+
 /*
  * __schedule() is the main scheduler function.
  *
@@ -8006,6 +8156,7 @@ static void sched_rq_cpu_starting(unsigned int cpu)
 
 int sched_cpu_starting(unsigned int cpu)
 {
+	sched_core_cpu_starting(cpu);
 	sched_rq_cpu_starting(cpu);
 	sched_tick_start(cpu);
 	return 0;
@@ -8290,6 +8441,11 @@ void __init sched_init(void)
 #endif /* CONFIG_SMP */
 		hrtick_rq_init(rq);
 		atomic_set(&rq->nr_iowait, 0);
+
+#ifdef CONFIG_SCHED_CORE
+		rq->core = NULL;
+		rq->core_enabled = 0;
+#endif
 	}
 
 	set_load_weight(&init_task, false);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f8bd5c8..29418b8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1075,6 +1075,12 @@ struct rq {
 #endif
 	unsigned int		push_busy;
 	struct cpu_stop_work	push_work;
+
+#ifdef CONFIG_SCHED_CORE
+	/* per rq */
+	struct rq		*core;
+	unsigned int		core_enabled;
+#endif
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -1113,6 +1119,35 @@ static inline bool is_migration_disabled(struct task_struct *p)
 #endif
 }
 
+#ifdef CONFIG_SCHED_CORE
+
+DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+	return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled;
+}
+
+static inline bool sched_core_disabled(void)
+{
+	return !static_branch_unlikely(&__sched_core_enabled);
+}
+
+static inline raw_spinlock_t *rq_lockp(struct rq *rq)
+{
+	if (sched_core_enabled(rq))
+		return &rq->core->__lock;
+
+	return &rq->__lock;
+}
+
+#else /* !CONFIG_SCHED_CORE */
+
+static inline bool sched_core_enabled(struct rq *rq)
+{
+	return false;
+}
+
 static inline bool sched_core_disabled(void)
 {
 	return true;
@@ -1123,6 +1158,8 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 	return &rq->__lock;
 }
 
+#endif /* CONFIG_SCHED_CORE */
+
 static inline void lockdep_assert_rq_held(struct rq *rq)
 {
 	lockdep_assert_held(rq_lockp(rq));
@@ -2241,6 +2278,27 @@ unsigned long arch_scale_freq_capacity(int cpu)
 
 static inline bool rq_order_less(struct rq *rq1, struct rq *rq2)
 {
+#ifdef CONFIG_SCHED_CORE
+	/*
+	 * In order to not have {0,2},{1,3} turn into an AB-BA,
+	 * order by core-id first and cpu-id second.
+	 *
+	 * Notably:
+	 *
+	 *	double_rq_lock(0,3); will take core-0, core-1 lock
+	 *	double_rq_lock(1,2); will take core-1, core-0 lock
+	 *
+	 * when only cpu-id is considered.
+	 */
+	if (rq1->core->cpu < rq2->core->cpu)
+		return true;
+	if (rq1->core->cpu > rq2->core->cpu)
+		return false;
+
+	/*
+	 * __sched_core_flip() relies on SMT having cpu-id lock order.
+	 */
+#endif
 	return rq1->cpu < rq2->cpu;
 }
 

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [tip: sched/core] sched: Provide raw_spin_rq_*lock*() helpers
  2021-04-22 12:05 ` [PATCH 02/19] sched: Provide raw_spin_rq_*lock*() helpers Peter Zijlstra
@ 2021-05-12 10:28   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 103+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2021-05-12 10:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel),
	Don Hiatt, Hongyu Ning, Vincent Guittot, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     39d371b7c0c299d489041884d005aacc4bba8c15
Gitweb:        https://git.kernel.org/tip/39d371b7c0c299d489041884d005aacc4bba8c15
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 02 Mar 2021 12:13:13 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 12 May 2021 11:43:26 +02:00

sched: Provide raw_spin_rq_*lock*() helpers

In preparation for playing games with rq->lock, add some rq_lock
wrappers.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123308.075967879@infradead.org
---
 kernel/sched/core.c  | 15 +++++++++++++-
 kernel/sched/sched.h | 50 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 65 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 660120d..5568018 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -184,6 +184,21 @@ int sysctl_sched_rt_runtime = 950000;
  *
  */
 
+void raw_spin_rq_lock_nested(struct rq *rq, int subclass)
+{
+	raw_spin_lock_nested(rq_lockp(rq), subclass);
+}
+
+bool raw_spin_rq_trylock(struct rq *rq)
+{
+	return raw_spin_trylock(rq_lockp(rq));
+}
+
+void raw_spin_rq_unlock(struct rq *rq)
+{
+	raw_spin_unlock(rq_lockp(rq));
+}
+
 /*
  * __task_rq_lock - lock the rq @p resides on.
  */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a189bec..f654587 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1113,6 +1113,56 @@ static inline bool is_migration_disabled(struct task_struct *p)
 #endif
 }
 
+static inline raw_spinlock_t *rq_lockp(struct rq *rq)
+{
+	return &rq->lock;
+}
+
+static inline void lockdep_assert_rq_held(struct rq *rq)
+{
+	lockdep_assert_held(rq_lockp(rq));
+}
+
+extern void raw_spin_rq_lock_nested(struct rq *rq, int subclass);
+extern bool raw_spin_rq_trylock(struct rq *rq);
+extern void raw_spin_rq_unlock(struct rq *rq);
+
+static inline void raw_spin_rq_lock(struct rq *rq)
+{
+	raw_spin_rq_lock_nested(rq, 0);
+}
+
+static inline void raw_spin_rq_lock_irq(struct rq *rq)
+{
+	local_irq_disable();
+	raw_spin_rq_lock(rq);
+}
+
+static inline void raw_spin_rq_unlock_irq(struct rq *rq)
+{
+	raw_spin_rq_unlock(rq);
+	local_irq_enable();
+}
+
+static inline unsigned long _raw_spin_rq_lock_irqsave(struct rq *rq)
+{
+	unsigned long flags;
+	local_irq_save(flags);
+	raw_spin_rq_lock(rq);
+	return flags;
+}
+
+static inline void raw_spin_rq_unlock_irqrestore(struct rq *rq, unsigned long flags)
+{
+	raw_spin_rq_unlock(rq);
+	local_irq_restore(flags);
+}
+
+#define raw_spin_rq_lock_irqsave(rq, flags)	\
+do {						\
+	flags = _raw_spin_rq_lock_irqsave(rq);	\
+} while (0)
+
 #ifdef CONFIG_SCHED_SMT
 extern void __update_idle_core(struct rq *rq);
 

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [tip: sched/core] sched: Wrap rq::lock access
  2021-04-22 12:05 ` [PATCH 03/19] sched: Wrap rq::lock access Peter Zijlstra
@ 2021-05-12 10:28   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 103+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2021-05-12 10:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel),
	Don Hiatt, Hongyu Ning, Vincent Guittot, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     5cb9eaa3d274f75539077a28cf01e3563195fa53
Gitweb:        https://git.kernel.org/tip/5cb9eaa3d274f75539077a28cf01e3563195fa53
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 17 Nov 2020 18:19:31 -05:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 12 May 2021 11:43:26 +02:00

sched: Wrap rq::lock access

In preparation for playing games with rq->lock, abstract the thing
using an accessor.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123308.136465446@infradead.org
---
 kernel/sched/core.c     |  70 +++++++++++++-------------
 kernel/sched/cpuacct.c  |  12 ++--
 kernel/sched/deadline.c |  22 ++++----
 kernel/sched/debug.c    |   4 +-
 kernel/sched/fair.c     |  35 ++++++-------
 kernel/sched/idle.c     |   4 +-
 kernel/sched/pelt.h     |   2 +-
 kernel/sched/rt.c       |  16 +++---
 kernel/sched/sched.h    | 105 +++++++++++++++++++--------------------
 kernel/sched/topology.c |   4 +-
 10 files changed, 136 insertions(+), 138 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5568018..5e6f5f5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -211,12 +211,12 @@ struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 
 	for (;;) {
 		rq = task_rq(p);
-		raw_spin_lock(&rq->lock);
+		raw_spin_rq_lock(rq);
 		if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
 			rq_pin_lock(rq, rf);
 			return rq;
 		}
-		raw_spin_unlock(&rq->lock);
+		raw_spin_rq_unlock(rq);
 
 		while (unlikely(task_on_rq_migrating(p)))
 			cpu_relax();
@@ -235,7 +235,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 	for (;;) {
 		raw_spin_lock_irqsave(&p->pi_lock, rf->flags);
 		rq = task_rq(p);
-		raw_spin_lock(&rq->lock);
+		raw_spin_rq_lock(rq);
 		/*
 		 *	move_queued_task()		task_rq_lock()
 		 *
@@ -257,7 +257,7 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 			rq_pin_lock(rq, rf);
 			return rq;
 		}
-		raw_spin_unlock(&rq->lock);
+		raw_spin_rq_unlock(rq);
 		raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
 
 		while (unlikely(task_on_rq_migrating(p)))
@@ -327,7 +327,7 @@ void update_rq_clock(struct rq *rq)
 {
 	s64 delta;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 
 	if (rq->clock_update_flags & RQCF_ACT_SKIP)
 		return;
@@ -625,7 +625,7 @@ void resched_curr(struct rq *rq)
 	struct task_struct *curr = rq->curr;
 	int cpu;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 
 	if (test_tsk_need_resched(curr))
 		return;
@@ -649,10 +649,10 @@ void resched_cpu(int cpu)
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	raw_spin_rq_lock_irqsave(rq, flags);
 	if (cpu_online(cpu) || cpu == smp_processor_id())
 		resched_curr(rq);
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_rq_unlock_irqrestore(rq, flags);
 }
 
 #ifdef CONFIG_SMP
@@ -1151,7 +1151,7 @@ static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
 	struct uclamp_se *uc_se = &p->uclamp[clamp_id];
 	struct uclamp_bucket *bucket;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 
 	/* Update task effective clamp */
 	p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
@@ -1191,7 +1191,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
 	unsigned int bkt_clamp;
 	unsigned int rq_clamp;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 
 	/*
 	 * If sched_uclamp_used was enabled after task @p was enqueued,
@@ -1864,7 +1864,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
 				   struct task_struct *p, int new_cpu)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 
 	deactivate_task(rq, p, DEQUEUE_NOCLOCK);
 	set_task_cpu(p, new_cpu);
@@ -2038,7 +2038,7 @@ int push_cpu_stop(void *arg)
 	struct task_struct *p = arg;
 
 	raw_spin_lock_irq(&p->pi_lock);
-	raw_spin_lock(&rq->lock);
+	raw_spin_rq_lock(rq);
 
 	if (task_rq(p) != rq)
 		goto out_unlock;
@@ -2068,7 +2068,7 @@ int push_cpu_stop(void *arg)
 
 out_unlock:
 	rq->push_busy = false;
-	raw_spin_unlock(&rq->lock);
+	raw_spin_rq_unlock(rq);
 	raw_spin_unlock_irq(&p->pi_lock);
 
 	put_task_struct(p);
@@ -2121,7 +2121,7 @@ __do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask, u32
 		 * Because __kthread_bind() calls this on blocked tasks without
 		 * holding rq->lock.
 		 */
-		lockdep_assert_held(&rq->lock);
+		lockdep_assert_rq_held(rq);
 		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
 	}
 	if (running)
@@ -2462,7 +2462,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	 * task_rq_lock().
 	 */
 	WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
-				      lockdep_is_held(&task_rq(p)->lock)));
+				      lockdep_is_held(rq_lockp(task_rq(p)))));
 #endif
 	/*
 	 * Clearly, migrating tasks to offline CPUs is a fairly daft thing.
@@ -3004,7 +3004,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 {
 	int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 
 	if (p->sched_contributes_to_load)
 		rq->nr_uninterruptible--;
@@ -4015,7 +4015,7 @@ static void do_balance_callbacks(struct rq *rq, struct callback_head *head)
 	void (*func)(struct rq *rq);
 	struct callback_head *next;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 
 	while (head) {
 		func = (void (*)(struct rq *))head->func;
@@ -4038,7 +4038,7 @@ static inline struct callback_head *splice_balance_callbacks(struct rq *rq)
 {
 	struct callback_head *head = rq->balance_callback;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 	if (head)
 		rq->balance_callback = NULL;
 
@@ -4055,9 +4055,9 @@ static inline void balance_callbacks(struct rq *rq, struct callback_head *head)
 	unsigned long flags;
 
 	if (unlikely(head)) {
-		raw_spin_lock_irqsave(&rq->lock, flags);
+		raw_spin_rq_lock_irqsave(rq, flags);
 		do_balance_callbacks(rq, head);
-		raw_spin_unlock_irqrestore(&rq->lock, flags);
+		raw_spin_rq_unlock_irqrestore(rq, flags);
 	}
 }
 
@@ -4088,10 +4088,10 @@ prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf
 	 * do an early lockdep release here:
 	 */
 	rq_unpin_lock(rq, rf);
-	spin_release(&rq->lock.dep_map, _THIS_IP_);
+	spin_release(&rq_lockp(rq)->dep_map, _THIS_IP_);
 #ifdef CONFIG_DEBUG_SPINLOCK
 	/* this is a valid case when another task releases the spinlock */
-	rq->lock.owner = next;
+	rq_lockp(rq)->owner = next;
 #endif
 }
 
@@ -4102,9 +4102,9 @@ static inline void finish_lock_switch(struct rq *rq)
 	 * fix up the runqueue lock - which gets 'carried over' from
 	 * prev into current:
 	 */
-	spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
+	spin_acquire(&rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
 	__balance_callbacks(rq);
-	raw_spin_unlock_irq(&rq->lock);
+	raw_spin_rq_unlock_irq(rq);
 }
 
 /*
@@ -5164,7 +5164,7 @@ static void __sched notrace __schedule(bool preempt)
 
 		rq_unpin_lock(rq, &rf);
 		__balance_callbacks(rq);
-		raw_spin_unlock_irq(&rq->lock);
+		raw_spin_rq_unlock_irq(rq);
 	}
 }
 
@@ -5706,7 +5706,7 @@ out_unlock:
 
 	rq_unpin_lock(rq, &rf);
 	__balance_callbacks(rq);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_rq_unlock(rq);
 
 	preempt_enable();
 }
@@ -7456,7 +7456,7 @@ void init_idle(struct task_struct *idle, int cpu)
 	__sched_fork(0, idle);
 
 	raw_spin_lock_irqsave(&idle->pi_lock, flags);
-	raw_spin_lock(&rq->lock);
+	raw_spin_rq_lock(rq);
 
 	idle->state = TASK_RUNNING;
 	idle->se.exec_start = sched_clock();
@@ -7494,7 +7494,7 @@ void init_idle(struct task_struct *idle, int cpu)
 #ifdef CONFIG_SMP
 	idle->on_cpu = 1;
 #endif
-	raw_spin_unlock(&rq->lock);
+	raw_spin_rq_unlock(rq);
 	raw_spin_unlock_irqrestore(&idle->pi_lock, flags);
 
 	/* Set the preempt count _outside_ the spinlocks! */
@@ -7660,7 +7660,7 @@ static void balance_push(struct rq *rq)
 {
 	struct task_struct *push_task = rq->curr;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 	SCHED_WARN_ON(rq->cpu != smp_processor_id());
 
 	/*
@@ -7698,9 +7698,9 @@ static void balance_push(struct rq *rq)
 		 */
 		if (!rq->nr_running && !rq_has_pinned_tasks(rq) &&
 		    rcuwait_active(&rq->hotplug_wait)) {
-			raw_spin_unlock(&rq->lock);
+			raw_spin_rq_unlock(rq);
 			rcuwait_wake_up(&rq->hotplug_wait);
-			raw_spin_lock(&rq->lock);
+			raw_spin_rq_lock(rq);
 		}
 		return;
 	}
@@ -7710,7 +7710,7 @@ static void balance_push(struct rq *rq)
 	 * Temporarily drop rq->lock such that we can wake-up the stop task.
 	 * Both preemption and IRQs are still disabled.
 	 */
-	raw_spin_unlock(&rq->lock);
+	raw_spin_rq_unlock(rq);
 	stop_one_cpu_nowait(rq->cpu, __balance_push_cpu_stop, push_task,
 			    this_cpu_ptr(&push_work));
 	/*
@@ -7718,7 +7718,7 @@ static void balance_push(struct rq *rq)
 	 * schedule(). The next pick is obviously going to be the stop task
 	 * which kthread_is_per_cpu() and will push this task away.
 	 */
-	raw_spin_lock(&rq->lock);
+	raw_spin_rq_lock(rq);
 }
 
 static void balance_push_set(int cpu, bool on)
@@ -8008,7 +8008,7 @@ static void dump_rq_tasks(struct rq *rq, const char *loglvl)
 	struct task_struct *g, *p;
 	int cpu = cpu_of(rq);
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 
 	printk("%sCPU%d enqueued tasks (%u total):\n", loglvl, cpu, rq->nr_running);
 	for_each_process_thread(g, p) {
@@ -8181,7 +8181,7 @@ void __init sched_init(void)
 		struct rq *rq;
 
 		rq = cpu_rq(i);
-		raw_spin_lock_init(&rq->lock);
+		raw_spin_lock_init(&rq->__lock);
 		rq->nr_running = 0;
 		rq->calc_load_active = 0;
 		rq->calc_load_update = jiffies + LOAD_FREQ;
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 104a1ba..893eece 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -112,7 +112,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 	/*
 	 * Take rq->lock to make 64-bit read safe on 32-bit platforms.
 	 */
-	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_rq_lock_irq(cpu_rq(cpu));
 #endif
 
 	if (index == CPUACCT_STAT_NSTATS) {
@@ -126,7 +126,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 	}
 
 #ifndef CONFIG_64BIT
-	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_rq_unlock_irq(cpu_rq(cpu));
 #endif
 
 	return data;
@@ -141,14 +141,14 @@ static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
 	/*
 	 * Take rq->lock to make 64-bit write safe on 32-bit platforms.
 	 */
-	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_rq_lock_irq(cpu_rq(cpu));
 #endif
 
 	for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
 		cpuusage->usages[i] = val;
 
 #ifndef CONFIG_64BIT
-	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+	raw_spin_rq_unlock_irq(cpu_rq(cpu));
 #endif
 }
 
@@ -253,13 +253,13 @@ static int cpuacct_all_seq_show(struct seq_file *m, void *V)
 			 * Take rq->lock to make 64-bit read safe on 32-bit
 			 * platforms.
 			 */
-			raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+			raw_spin_rq_lock_irq(cpu_rq(cpu));
 #endif
 
 			seq_printf(m, " %llu", cpuusage->usages[index]);
 
 #ifndef CONFIG_64BIT
-			raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+			raw_spin_rq_unlock_irq(cpu_rq(cpu));
 #endif
 		}
 		seq_puts(m, "\n");
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 9a29897..6e99b8b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -157,7 +157,7 @@ void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
 {
 	u64 old = dl_rq->running_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_rq_held(rq_of_dl_rq(dl_rq));
 	dl_rq->running_bw += dl_bw;
 	SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
 	SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
@@ -170,7 +170,7 @@ void __sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
 {
 	u64 old = dl_rq->running_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_rq_held(rq_of_dl_rq(dl_rq));
 	dl_rq->running_bw -= dl_bw;
 	SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
 	if (dl_rq->running_bw > old)
@@ -184,7 +184,7 @@ void __add_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
 {
 	u64 old = dl_rq->this_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_rq_held(rq_of_dl_rq(dl_rq));
 	dl_rq->this_bw += dl_bw;
 	SCHED_WARN_ON(dl_rq->this_bw < old); /* overflow */
 }
@@ -194,7 +194,7 @@ void __sub_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
 {
 	u64 old = dl_rq->this_bw;
 
-	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	lockdep_assert_rq_held(rq_of_dl_rq(dl_rq));
 	dl_rq->this_bw -= dl_bw;
 	SCHED_WARN_ON(dl_rq->this_bw > old); /* underflow */
 	if (dl_rq->this_bw > old)
@@ -987,7 +987,7 @@ static int start_dl_timer(struct task_struct *p)
 	ktime_t now, act;
 	s64 delta;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 
 	/*
 	 * We want the timer to fire at the deadline, but considering
@@ -1097,9 +1097,9 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 		 * If the runqueue is no longer available, migrate the
 		 * task elsewhere. This necessarily changes rq.
 		 */
-		lockdep_unpin_lock(&rq->lock, rf.cookie);
+		lockdep_unpin_lock(rq_lockp(rq), rf.cookie);
 		rq = dl_task_offline_migration(rq, p);
-		rf.cookie = lockdep_pin_lock(&rq->lock);
+		rf.cookie = lockdep_pin_lock(rq_lockp(rq));
 		update_rq_clock(rq);
 
 		/*
@@ -1731,7 +1731,7 @@ static void migrate_task_rq_dl(struct task_struct *p, int new_cpu __maybe_unused
 	 * from try_to_wake_up(). Hence, p->pi_lock is locked, but
 	 * rq->lock is not... So, lock it
 	 */
-	raw_spin_lock(&rq->lock);
+	raw_spin_rq_lock(rq);
 	if (p->dl.dl_non_contending) {
 		sub_running_bw(&p->dl, &rq->dl);
 		p->dl.dl_non_contending = 0;
@@ -1746,7 +1746,7 @@ static void migrate_task_rq_dl(struct task_struct *p, int new_cpu __maybe_unused
 			put_task_struct(p);
 	}
 	sub_rq_bw(&p->dl, &rq->dl);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_rq_unlock(rq);
 }
 
 static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
@@ -2291,10 +2291,10 @@ skip:
 		double_unlock_balance(this_rq, src_rq);
 
 		if (push_task) {
-			raw_spin_unlock(&this_rq->lock);
+			raw_spin_rq_unlock(this_rq);
 			stop_one_cpu_nowait(src_rq->cpu, push_cpu_stop,
 					    push_task, &src_rq->push_work);
-			raw_spin_lock(&this_rq->lock);
+			raw_spin_rq_lock(this_rq);
 		}
 	}
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 9c882f2..3bdee5f 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -576,7 +576,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "exec_clock",
 			SPLIT_NS(cfs_rq->exec_clock));
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	raw_spin_rq_lock_irqsave(rq, flags);
 	if (rb_first_cached(&cfs_rq->tasks_timeline))
 		MIN_vruntime = (__pick_first_entity(cfs_rq))->vruntime;
 	last = __pick_last_entity(cfs_rq);
@@ -584,7 +584,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 		max_vruntime = last->vruntime;
 	min_vruntime = cfs_rq->min_vruntime;
 	rq0_min_vruntime = cpu_rq(0)->cfs.min_vruntime;
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_rq_unlock_irqrestore(rq, flags);
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "MIN_vruntime",
 			SPLIT_NS(MIN_vruntime));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "min_vruntime",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6bdbb7b..e50bd75 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1107,7 +1107,7 @@ struct numa_group {
 static struct numa_group *deref_task_numa_group(struct task_struct *p)
 {
 	return rcu_dereference_check(p->numa_group, p == current ||
-		(lockdep_is_held(&task_rq(p)->lock) && !READ_ONCE(p->on_cpu)));
+		(lockdep_is_held(rq_lockp(task_rq(p))) && !READ_ONCE(p->on_cpu)));
 }
 
 static struct numa_group *deref_curr_numa_group(struct task_struct *p)
@@ -5328,7 +5328,7 @@ static void __maybe_unused update_runtime_enabled(struct rq *rq)
 {
 	struct task_group *tg;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(tg, &task_groups, list) {
@@ -5347,7 +5347,7 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
 {
 	struct task_group *tg;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(tg, &task_groups, list) {
@@ -6891,7 +6891,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 		 * In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old'
 		 * rq->lock and can modify state directly.
 		 */
-		lockdep_assert_held(&task_rq(p)->lock);
+		lockdep_assert_rq_held(task_rq(p));
 		detach_entity_cfs_rq(&p->se);
 
 	} else {
@@ -7518,7 +7518,7 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 {
 	s64 delta;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_rq_held(env->src_rq);
 
 	if (p->sched_class != &fair_sched_class)
 		return 0;
@@ -7616,7 +7616,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 {
 	int tsk_cache_hot;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_rq_held(env->src_rq);
 
 	/*
 	 * We do not migrate tasks that are:
@@ -7705,7 +7705,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
  */
 static void detach_task(struct task_struct *p, struct lb_env *env)
 {
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_rq_held(env->src_rq);
 
 	deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
 	set_task_cpu(p, env->dst_cpu);
@@ -7721,7 +7721,7 @@ static struct task_struct *detach_one_task(struct lb_env *env)
 {
 	struct task_struct *p;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_rq_held(env->src_rq);
 
 	list_for_each_entry_reverse(p,
 			&env->src_rq->cfs_tasks, se.group_node) {
@@ -7757,7 +7757,7 @@ static int detach_tasks(struct lb_env *env)
 	struct task_struct *p;
 	int detached = 0;
 
-	lockdep_assert_held(&env->src_rq->lock);
+	lockdep_assert_rq_held(env->src_rq);
 
 	/*
 	 * Source run queue has been emptied by another CPU, clear
@@ -7887,7 +7887,7 @@ next:
  */
 static void attach_task(struct rq *rq, struct task_struct *p)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 
 	BUG_ON(task_rq(p) != rq);
 	activate_task(rq, p, ENQUEUE_NOCLOCK);
@@ -9798,7 +9798,7 @@ more_balance:
 		if (need_active_balance(&env)) {
 			unsigned long flags;
 
-			raw_spin_lock_irqsave(&busiest->lock, flags);
+			raw_spin_rq_lock_irqsave(busiest, flags);
 
 			/*
 			 * Don't kick the active_load_balance_cpu_stop,
@@ -9806,8 +9806,7 @@ more_balance:
 			 * moved to this_cpu:
 			 */
 			if (!cpumask_test_cpu(this_cpu, busiest->curr->cpus_ptr)) {
-				raw_spin_unlock_irqrestore(&busiest->lock,
-							    flags);
+				raw_spin_rq_unlock_irqrestore(busiest, flags);
 				goto out_one_pinned;
 			}
 
@@ -9824,7 +9823,7 @@ more_balance:
 				busiest->push_cpu = this_cpu;
 				active_balance = 1;
 			}
-			raw_spin_unlock_irqrestore(&busiest->lock, flags);
+			raw_spin_rq_unlock_irqrestore(busiest, flags);
 
 			if (active_balance) {
 				stop_one_cpu_nowait(cpu_of(busiest),
@@ -10649,7 +10648,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 		goto out;
 	}
 
-	raw_spin_unlock(&this_rq->lock);
+	raw_spin_rq_unlock(this_rq);
 
 	update_blocked_averages(this_cpu);
 	rcu_read_lock();
@@ -10688,7 +10687,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	}
 	rcu_read_unlock();
 
-	raw_spin_lock(&this_rq->lock);
+	raw_spin_rq_lock(this_rq);
 
 	if (curr_cost > this_rq->max_idle_balance_cost)
 		this_rq->max_idle_balance_cost = curr_cost;
@@ -11175,9 +11174,9 @@ void unregister_fair_sched_group(struct task_group *tg)
 
 		rq = cpu_rq(cpu);
 
-		raw_spin_lock_irqsave(&rq->lock, flags);
+		raw_spin_rq_lock_irqsave(rq, flags);
 		list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
-		raw_spin_unlock_irqrestore(&rq->lock, flags);
+		raw_spin_rq_unlock_irqrestore(rq, flags);
 	}
 }
 
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 7ca3d3d..0194768 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -455,10 +455,10 @@ struct task_struct *pick_next_task_idle(struct rq *rq)
 static void
 dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags)
 {
-	raw_spin_unlock_irq(&rq->lock);
+	raw_spin_rq_unlock_irq(rq);
 	printk(KERN_ERR "bad: scheduling from the idle thread!\n");
 	dump_stack();
-	raw_spin_lock_irq(&rq->lock);
+	raw_spin_rq_lock_irq(rq);
 }
 
 /*
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 1462846..9ed6d8c 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -141,7 +141,7 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
 
 static inline u64 rq_clock_pelt(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 	assert_clock_updated(rq);
 
 	return rq->clock_pelt - rq->lost_idle_time;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index c286e5b..b3d39c3 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -888,7 +888,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
 		if (skip)
 			continue;
 
-		raw_spin_lock(&rq->lock);
+		raw_spin_rq_lock(rq);
 		update_rq_clock(rq);
 
 		if (rt_rq->rt_time) {
@@ -926,7 +926,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
 
 		if (enqueue)
 			sched_rt_rq_enqueue(rt_rq);
-		raw_spin_unlock(&rq->lock);
+		raw_spin_rq_unlock(rq);
 	}
 
 	if (!throttled && (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF))
@@ -1894,10 +1894,10 @@ retry:
 		 */
 		push_task = get_push_task(rq);
 		if (push_task) {
-			raw_spin_unlock(&rq->lock);
+			raw_spin_rq_unlock(rq);
 			stop_one_cpu_nowait(rq->cpu, push_cpu_stop,
 					    push_task, &rq->push_work);
-			raw_spin_lock(&rq->lock);
+			raw_spin_rq_lock(rq);
 		}
 
 		return 0;
@@ -2122,10 +2122,10 @@ void rto_push_irq_work_func(struct irq_work *work)
 	 * When it gets updated, a check is made if a push is possible.
 	 */
 	if (has_pushable_tasks(rq)) {
-		raw_spin_lock(&rq->lock);
+		raw_spin_rq_lock(rq);
 		while (push_rt_task(rq, true))
 			;
-		raw_spin_unlock(&rq->lock);
+		raw_spin_rq_unlock(rq);
 	}
 
 	raw_spin_lock(&rd->rto_lock);
@@ -2243,10 +2243,10 @@ skip:
 		double_unlock_balance(this_rq, src_rq);
 
 		if (push_task) {
-			raw_spin_unlock(&this_rq->lock);
+			raw_spin_rq_unlock(this_rq);
 			stop_one_cpu_nowait(src_rq->cpu, push_cpu_stop,
 					    push_task, &src_rq->push_work);
-			raw_spin_lock(&this_rq->lock);
+			raw_spin_rq_lock(this_rq);
 		}
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f654587..dbabf28 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -905,7 +905,7 @@ DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
  */
 struct rq {
 	/* runqueue lock: */
-	raw_spinlock_t		lock;
+	raw_spinlock_t		__lock;
 
 	/*
 	 * nr_running and cpu_load should be in the same cacheline because
@@ -1115,7 +1115,7 @@ static inline bool is_migration_disabled(struct task_struct *p)
 
 static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 {
-	return &rq->lock;
+	return &rq->__lock;
 }
 
 static inline void lockdep_assert_rq_held(struct rq *rq)
@@ -1229,7 +1229,7 @@ static inline void assert_clock_updated(struct rq *rq)
 
 static inline u64 rq_clock(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 	assert_clock_updated(rq);
 
 	return rq->clock;
@@ -1237,7 +1237,7 @@ static inline u64 rq_clock(struct rq *rq)
 
 static inline u64 rq_clock_task(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 	assert_clock_updated(rq);
 
 	return rq->clock_task;
@@ -1263,7 +1263,7 @@ static inline u64 rq_clock_thermal(struct rq *rq)
 
 static inline void rq_clock_skip_update(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 	rq->clock_update_flags |= RQCF_REQ_SKIP;
 }
 
@@ -1273,7 +1273,7 @@ static inline void rq_clock_skip_update(struct rq *rq)
  */
 static inline void rq_clock_cancel_skipupdate(struct rq *rq)
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 	rq->clock_update_flags &= ~RQCF_REQ_SKIP;
 }
 
@@ -1304,7 +1304,7 @@ extern struct callback_head balance_push_callback;
  */
 static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
 {
-	rf->cookie = lockdep_pin_lock(&rq->lock);
+	rf->cookie = lockdep_pin_lock(rq_lockp(rq));
 
 #ifdef CONFIG_SCHED_DEBUG
 	rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
@@ -1322,12 +1322,12 @@ static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
 		rf->clock_update_flags = RQCF_UPDATED;
 #endif
 
-	lockdep_unpin_lock(&rq->lock, rf->cookie);
+	lockdep_unpin_lock(rq_lockp(rq), rf->cookie);
 }
 
 static inline void rq_repin_lock(struct rq *rq, struct rq_flags *rf)
 {
-	lockdep_repin_lock(&rq->lock, rf->cookie);
+	lockdep_repin_lock(rq_lockp(rq), rf->cookie);
 
 #ifdef CONFIG_SCHED_DEBUG
 	/*
@@ -1348,7 +1348,7 @@ static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf)
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_rq_unlock(rq);
 }
 
 static inline void
@@ -1357,7 +1357,7 @@ task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 	__releases(p->pi_lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_rq_unlock(rq);
 	raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
 }
 
@@ -1365,7 +1365,7 @@ static inline void
 rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock_irqsave(&rq->lock, rf->flags);
+	raw_spin_rq_lock_irqsave(rq, rf->flags);
 	rq_pin_lock(rq, rf);
 }
 
@@ -1373,7 +1373,7 @@ static inline void
 rq_lock_irq(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock_irq(&rq->lock);
+	raw_spin_rq_lock_irq(rq);
 	rq_pin_lock(rq, rf);
 }
 
@@ -1381,7 +1381,7 @@ static inline void
 rq_lock(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock(&rq->lock);
+	raw_spin_rq_lock(rq);
 	rq_pin_lock(rq, rf);
 }
 
@@ -1389,7 +1389,7 @@ static inline void
 rq_relock(struct rq *rq, struct rq_flags *rf)
 	__acquires(rq->lock)
 {
-	raw_spin_lock(&rq->lock);
+	raw_spin_rq_lock(rq);
 	rq_repin_lock(rq, rf);
 }
 
@@ -1398,7 +1398,7 @@ rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf)
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock_irqrestore(&rq->lock, rf->flags);
+	raw_spin_rq_unlock_irqrestore(rq, rf->flags);
 }
 
 static inline void
@@ -1406,7 +1406,7 @@ rq_unlock_irq(struct rq *rq, struct rq_flags *rf)
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock_irq(&rq->lock);
+	raw_spin_rq_unlock_irq(rq);
 }
 
 static inline void
@@ -1414,7 +1414,7 @@ rq_unlock(struct rq *rq, struct rq_flags *rf)
 	__releases(rq->lock)
 {
 	rq_unpin_lock(rq, rf);
-	raw_spin_unlock(&rq->lock);
+	raw_spin_rq_unlock(rq);
 }
 
 static inline struct rq *
@@ -1479,7 +1479,7 @@ queue_balance_callback(struct rq *rq,
 		       struct callback_head *head,
 		       void (*func)(struct rq *rq))
 {
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 
 	if (unlikely(head->next || rq->balance_callback == &balance_push_callback))
 		return;
@@ -2019,7 +2019,7 @@ static inline struct task_struct *get_push_task(struct rq *rq)
 {
 	struct task_struct *p = rq->curr;
 
-	lockdep_assert_held(&rq->lock);
+	lockdep_assert_rq_held(rq);
 
 	if (rq->push_busy)
 		return NULL;
@@ -2249,7 +2249,7 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
 	__acquires(busiest->lock)
 	__acquires(this_rq->lock)
 {
-	raw_spin_unlock(&this_rq->lock);
+	raw_spin_rq_unlock(this_rq);
 	double_rq_lock(this_rq, busiest);
 
 	return 1;
@@ -2268,20 +2268,22 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
 	__acquires(busiest->lock)
 	__acquires(this_rq->lock)
 {
-	int ret = 0;
-
-	if (unlikely(!raw_spin_trylock(&busiest->lock))) {
-		if (busiest < this_rq) {
-			raw_spin_unlock(&this_rq->lock);
-			raw_spin_lock(&busiest->lock);
-			raw_spin_lock_nested(&this_rq->lock,
-					      SINGLE_DEPTH_NESTING);
-			ret = 1;
-		} else
-			raw_spin_lock_nested(&busiest->lock,
-					      SINGLE_DEPTH_NESTING);
+	if (rq_lockp(this_rq) == rq_lockp(busiest))
+		return 0;
+
+	if (likely(raw_spin_rq_trylock(busiest)))
+		return 0;
+
+	if (rq_lockp(busiest) >= rq_lockp(this_rq)) {
+		raw_spin_rq_lock_nested(busiest, SINGLE_DEPTH_NESTING);
+		return 0;
 	}
-	return ret;
+
+	raw_spin_rq_unlock(this_rq);
+	raw_spin_rq_lock(busiest);
+	raw_spin_rq_lock_nested(this_rq, SINGLE_DEPTH_NESTING);
+
+	return 1;
 }
 
 #endif /* CONFIG_PREEMPTION */
@@ -2291,11 +2293,7 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
  */
 static inline int double_lock_balance(struct rq *this_rq, struct rq *busiest)
 {
-	if (unlikely(!irqs_disabled())) {
-		/* printk() doesn't work well under rq->lock */
-		raw_spin_unlock(&this_rq->lock);
-		BUG_ON(1);
-	}
+	lockdep_assert_irqs_disabled();
 
 	return _double_lock_balance(this_rq, busiest);
 }
@@ -2303,8 +2301,9 @@ static inline int double_lock_balance(struct rq *this_rq, struct rq *busiest)
 static inline void double_unlock_balance(struct rq *this_rq, struct rq *busiest)
 	__releases(busiest->lock)
 {
-	raw_spin_unlock(&busiest->lock);
-	lock_set_subclass(&this_rq->lock.dep_map, 0, _RET_IP_);
+	if (rq_lockp(this_rq) != rq_lockp(busiest))
+		raw_spin_rq_unlock(busiest);
+	lock_set_subclass(&rq_lockp(this_rq)->dep_map, 0, _RET_IP_);
 }
 
 static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
@@ -2345,16 +2344,16 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
 	__acquires(rq2->lock)
 {
 	BUG_ON(!irqs_disabled());
-	if (rq1 == rq2) {
-		raw_spin_lock(&rq1->lock);
+	if (rq_lockp(rq1) == rq_lockp(rq2)) {
+		raw_spin_rq_lock(rq1);
 		__acquire(rq2->lock);	/* Fake it out ;) */
 	} else {
-		if (rq1 < rq2) {
-			raw_spin_lock(&rq1->lock);
-			raw_spin_lock_nested(&rq2->lock, SINGLE_DEPTH_NESTING);
+		if (rq_lockp(rq1) < rq_lockp(rq2)) {
+			raw_spin_rq_lock(rq1);
+			raw_spin_rq_lock_nested(rq2, SINGLE_DEPTH_NESTING);
 		} else {
-			raw_spin_lock(&rq2->lock);
-			raw_spin_lock_nested(&rq1->lock, SINGLE_DEPTH_NESTING);
+			raw_spin_rq_lock(rq2);
+			raw_spin_rq_lock_nested(rq1, SINGLE_DEPTH_NESTING);
 		}
 	}
 }
@@ -2369,9 +2368,9 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
 	__releases(rq1->lock)
 	__releases(rq2->lock)
 {
-	raw_spin_unlock(&rq1->lock);
-	if (rq1 != rq2)
-		raw_spin_unlock(&rq2->lock);
+	raw_spin_rq_unlock(rq1);
+	if (rq_lockp(rq1) != rq_lockp(rq2))
+		raw_spin_rq_unlock(rq2);
 	else
 		__release(rq2->lock);
 }
@@ -2394,7 +2393,7 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
 {
 	BUG_ON(!irqs_disabled());
 	BUG_ON(rq1 != rq2);
-	raw_spin_lock(&rq1->lock);
+	raw_spin_rq_lock(rq1);
 	__acquire(rq2->lock);	/* Fake it out ;) */
 }
 
@@ -2409,7 +2408,7 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
 	__releases(rq2->lock)
 {
 	BUG_ON(rq1 != rq2);
-	raw_spin_unlock(&rq1->lock);
+	raw_spin_rq_unlock(rq1);
 	__release(rq2->lock);
 }
 
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 55a0a24..053115b 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -467,7 +467,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
 	struct root_domain *old_rd = NULL;
 	unsigned long flags;
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	raw_spin_rq_lock_irqsave(rq, flags);
 
 	if (rq->rd) {
 		old_rd = rq->rd;
@@ -493,7 +493,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
 	if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
 		set_rq_online(rq);
 
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	raw_spin_rq_unlock_irqrestore(rq, flags);
 
 	if (old_rd)
 		call_rcu(&old_rd->rcu, free_rootdomain);

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [tip: sched/core] sched/fair: Add a few assertions
  2021-04-22 12:05 ` [PATCH 01/19] sched/fair: Add a few assertions Peter Zijlstra
@ 2021-05-12 10:28   ` tip-bot2 for Peter Zijlstra
  2021-05-13  8:56     ` Ning, Hongyu
  0 siblings, 1 reply; 103+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2021-05-12 10:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel),
	Don Hiatt, Hongyu Ning, Vincent Guittot, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     9099a14708ce1dfecb6002605594a0daa319b555
Gitweb:        https://git.kernel.org/tip/9099a14708ce1dfecb6002605594a0daa319b555
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 17 Nov 2020 18:19:35 -05:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 12 May 2021 11:43:26 +02:00

sched/fair: Add a few assertions

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123308.015639083@infradead.org
---
 kernel/sched/fair.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c209f68..6bdbb7b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6288,6 +6288,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 		task_util = uclamp_task_util(p);
 	}
 
+	/*
+	 * per-cpu select_idle_mask usage
+	 */
+	lockdep_assert_irqs_disabled();
+
 	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
 	    asym_fits_capacity(task_util, target))
 		return target;
@@ -6781,8 +6786,6 @@ unlock:
  * certain conditions an idle sibling CPU if the domain has SD_WAKE_AFFINE set.
  *
  * Returns the target CPU number.
- *
- * preempt must be disabled.
  */
 static int
 select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
@@ -6795,6 +6798,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	/* SD_flags and WF_flags share the first nibble */
 	int sd_flag = wake_flags & 0xF;
 
+	/*
+	 * required for stable ->cpus_allowed
+	 */
+	lockdep_assert_held(&p->pi_lock);
 	if (wake_flags & WF_TTWU) {
 		record_wakee(p);
 

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH 17/19] sched: Inherit task cookie on fork()
  2021-05-12  9:05           ` Peter Zijlstra
@ 2021-05-12 20:20             ` Josh Don
  2021-05-12 21:07               ` Don Hiatt
  0 siblings, 1 reply; 103+ messages in thread
From: Josh Don @ 2021-05-12 20:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Hyser, Joel Fernandes, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Mel Gorman, LKML, Thomas Gleixner

On Wed, May 12, 2021 at 2:05 AM Peter Zijlstra <peterz@infradead.org> wrote:
> Right, I need a Champion that actually cares about cgroups and has
> use-cases to go argue with TJ on this. I've proposed code that I think
> has sane semantics, but I'm not in a position to argue for it, given I
> think a world without cgroups is a better world :-)))

Not sure if Tejun has any thoughts on
http://lkml.kernel.org/r/CABk29NtahuW6UERvRdK5v8My_MfPsoESDKXUjGdvaQcHOJEMvg@mail.gmail.com.

We're looking at using the prctl interface with one of our main
internal users of core scheduling. As an example, suppose we have a
management process that wants to make tasks A and B share a cookie:
- Spawn a new thread m, which then does the following, and exits.
- PR_SCHED_CORE_CREATE for just its own PID
- PR_SCHED_CORE_SHARE_TO A
- PR_SCHED_CORE_SHARE_TO B

That seems to work ok; I'll follow up if there are any pain points
that aren't easily addressed with the prctl interface.
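
As an illustration of that sequence, here is a minimal userspace sketch
(not part of this series; the PIDs are placeholders and the fallback
defines assume the prctl constants introduced by patch 18/19):

    #include <pthread.h>
    #include <stdio.h>
    #include <sys/prctl.h>
    #include <sys/types.h>

    #ifndef PR_SCHED_CORE
    # define PR_SCHED_CORE          62
    # define PR_SCHED_CORE_CREATE   1
    # define PR_SCHED_CORE_SHARE_TO 2
    #endif

    struct share_args { pid_t pid_a, pid_b; };

    /* Thread 'm': create a cookie for itself, push it to A and B, then exit. */
    static void *share_cookie(void *arg)
    {
            struct share_args *sa = arg;

            /* new cookie for this thread only (pidtype 0 == PIDTYPE_PID) */
            if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0, 0, NULL))
                    perror("PR_SCHED_CORE_CREATE");
            if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, sa->pid_a, 0, NULL))
                    perror("PR_SCHED_CORE_SHARE_TO A");
            if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, sa->pid_b, 0, NULL))
                    perror("PR_SCHED_CORE_SHARE_TO B");
            return NULL;
    }

    int main(void)
    {
            /* placeholder PIDs; a real management process would look these up */
            struct share_args sa = { .pid_a = 1234, .pid_b = 1235 };
            pthread_t m;

            pthread_create(&m, NULL, share_cookie, &sa);
            pthread_join(m, NULL);
            return 0;
    }

Build with "gcc -pthread"; A and B end up sharing the cookie created by
thread m, and m's own reference to it goes away when m exits.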

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 17/19] sched: Inherit task cookie on fork()
  2021-05-12 20:20             ` Josh Don
@ 2021-05-12 21:07               ` Don Hiatt
  0 siblings, 0 replies; 103+ messages in thread
From: Don Hiatt @ 2021-05-12 21:07 UTC (permalink / raw)
  To: Josh Don
  Cc: Peter Zijlstra, Chris Hyser, Joel Fernandes, Ingo Molnar,
	Vincent Guittot, Valentin Schneider, Mel Gorman, LKML,
	Thomas Gleixner

On Wed, May 12, 2021 at 1:58 PM Josh Don <joshdon@google.com> wrote:
>
> On Wed, May 12, 2021 at 2:05 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > Right, I need a Champion that actually cares about cgroups and has
> > use-cases to go argue with TJ on this. I've proposed code that I think
> > has sane semantics, but I'm not in a position to argue for it, given I
> > think a world without cgroups is a better world :-)))
>
> Not sure if Tejun has any thoughts on
> http://lkml.kernel.org/r/CABk29NtahuW6UERvRdK5v8My_MfPsoESDKXUjGdvaQcHOJEMvg@mail.gmail.com.
>
> We're looking at using the prctl interface with one of our main
> internal users of core scheduling. As an example, suppose we have a
> management process that wants to make tasks A and B share a cookie:
> - Spawn a new thread m, which then does the following, and exits.
> - PR_SCHED_CORE_CREATE for just its own PID
> - PR_SCHED_CORE_SHARE_TO A
> - PR_SCHED_CORE_SHARE_TO B
>
> That seems to work ok; I'll follow up if there are any pain points
> that aren't easily addressed with the prctl interface.

This is exactly what I'm doing to tag all the processes for qemu. I have a
shell script that creates a cookie for $BASHPID and then does SHARE_TO for
all the processes, and it works just great. :)

don

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [tip: sched/core] sched/fair: Add a few assertions
  2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
@ 2021-05-13  8:56     ` Ning, Hongyu
  0 siblings, 0 replies; 103+ messages in thread
From: Ning, Hongyu @ 2021-05-13  8:56 UTC (permalink / raw)
  To: linux-kernel, linux-tip-commits
  Cc: Peter Zijlstra (Intel), Don Hiatt, Vincent Guittot, x86


On 2021/5/12 18:28, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the sched/core branch of tip:
> 
> Commit-ID:     9099a14708ce1dfecb6002605594a0daa319b555
> Gitweb:        https://git.kernel.org/tip/9099a14708ce1dfecb6002605594a0daa319b555
> Author:        Peter Zijlstra <peterz@infradead.org>
> AuthorDate:    Tue, 17 Nov 2020 18:19:35 -05:00
> Committer:     Peter Zijlstra <peterz@infradead.org>
> CommitterDate: Wed, 12 May 2021 11:43:26 +02:00
> 
> sched/fair: Add a few assertions
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Tested-by: Don Hiatt <dhiatt@digitalocean.com>
> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
> Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
> Link: https://lkml.kernel.org/r/20210422123308.015639083@infradead.org
> ---
>  kernel/sched/fair.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 

Add quick test results based on tip tree sched/core merge commit:

====TEST INFO====
- kernel under test:
	-- tip tree sched/core merge: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=60208dac643e24cbc62317de4e486fdcbbf05215
	-- coresched_v10 kernel source: https://github.com/digitalocean/linux-coresched/commits/coresched/v10-v5.10.y

- test machine setup: 
	CPU(s):              192
	On-line CPU(s) list: 0-191
	Thread(s) per core:  2
	Core(s) per socket:  48
	Socket(s):           2
	NUMA node(s):        4

- performance test workloads: 
	-- A. sysbench cpu (192 threads) + sysbench cpu (192 threads)
	-- B. sysbench cpu (192 threads) + sysbench mysql (192 threads)
	-- C. uperf netperf.xml (192 threads over TCP or UDP protocol separately)
	-- D. will-it-scale context_switch via pipe (192 threads)

- negative test:
	-- A. continuously toggle coresched (enable/disable) thru prctl on task cookies of PGID, during full loading of uperf workload with coresched on
	-- B. continuously toggle smt (on/off) via /sys/devices/system/cpu/smt/control, during full loading of uperf workload with coresched on

====TEST RESULTS====
- performance change key info:
	--workload B: coresched (cs_on), sysbench mysql performance drop around 20% vs coresched_v10
	--workload C, coresched (cs_on), uperf performance increased almost double vs coresched_v10
	--workload C, default (cs_off), uperf performance drop over 25% vs coresched_v10, same issue seen on v5.13-rc1 base (w/o coresched patchset)
	--workload D, coresched (cs_on), wis performance increased almost double vs coresched_v10

- negative test summary:
	no platform hang or kernel panic observed for both test

- performance info of workloads, normalized based on coresched_v10 results
	-- performance workload A:
	Note: 
	* no performance change compared to coresched_v10
+---------------------------------------+------+--------------------------------------+----------------------------------------+-------+-------------------------------+---------------------------------+
|                                       | **   | coresched_tip_merge_base_v5.13-rc1   | coresched_tip_merge_base_v5.13-rc1     | **    | coresched_v10_base_v5.10.11   | coresched_v10_base_v5.10.11     |
+=======================================+======+======================================+========================================+=======+===============================+=================================+
| workload                              | **   | sysbench cpu * 192                   | sysbench cpu * 192                     | **    | sysbench cpu * 192            | sysbench cpu * 192              |
+---------------------------------------+------+--------------------------------------+----------------------------------------+-------+-------------------------------+---------------------------------+
| prctl/cgroup                          | **   | prctl on workload cpu_0              | prctl on workload cpu_1                | **    | cg_sysbench_cpu_0             | cg_sysbench_cpu_1               |
+---------------------------------------+------+--------------------------------------+----------------------------------------+-------+-------------------------------+---------------------------------+
| record_item                           | **   | Tput_avg (events/s)                  | Tput_avg (events/s)                    | **    | Tput_avg (events/s)           | Tput_avg (events/s)             |
+---------------------------------------+------+--------------------------------------+----------------------------------------+-------+-------------------------------+---------------------------------+
| coresched normalized vs coresched_v10 | **   | 0.97                                 | 1.05                                   | **    | 1                             | 1                               |
+---------------------------------------+------+--------------------------------------+----------------------------------------+-------+-------------------------------+---------------------------------+
| default normalized vs coresched_v10   | **   | 1.03                                 | 0.95                                   | **    | 1                             | 1                               |
+---------------------------------------+------+--------------------------------------+----------------------------------------+-------+-------------------------------+---------------------------------+
| smtoff normalized vs coresched_v10    | **   | 0.96                                 | 1.04                                   | **    | 1                             | 1                               |
+---------------------------------------+------+--------------------------------------+----------------------------------------+-------+-------------------------------+---------------------------------+


	-- performance workload B:
	Note: 
	* coresched (cs_on), sysbench mysql performance drop around 20% vs coresched_v10
+---------------------------------------+------+--------------------------------------+----------------------------------------+-------+-------------------------------+---------------------------------+
|                                       | **   | coresched_tip_merge_base_v5.13-rc1   | coresched_tip_merge_base_v5.13-rc1     | **    | coresched_v10_base_v5.10.11   | coresched_v10_base_v5.10.11     |
+=======================================+======+======================================+========================================+=======+===============================+=================================+
| workload                              | **   | sysbench cpu * 192                   | sysbench mysql * 192                   | **    | sysbench cpu * 192            | sysbench mysql * 192            |
+---------------------------------------+------+--------------------------------------+----------------------------------------+-------+-------------------------------+---------------------------------+
| prctl/cgroup                          | **   | prctl on workload cpu_0              | prctl on workload mysql_0              | **    | cg_sysbench_cpu_0             | cg_sysbench_mysql_0             |
+---------------------------------------+------+--------------------------------------+----------------------------------------+-------+-------------------------------+---------------------------------+
| record_item                           | **   | Tput_avg (events/s)                  | Tput_avg (events/s)                    | **    | Tput_avg (events/s)           | Tput_avg (events/s)             |
+---------------------------------------+------+--------------------------------------+----------------------------------------+-------+-------------------------------+---------------------------------+
| coresched normalized vs coresched_v10 | **   | 1.02                                 | 0.81                                   | **    | 1                             | 1                               |
+---------------------------------------+------+--------------------------------------+----------------------------------------+-------+-------------------------------+---------------------------------+
| default normalized vs coresched_v10   | **   | 1.01                                 | 0.94                                   | **    | 1                             | 1                               |
+---------------------------------------+------+--------------------------------------+----------------------------------------+-------+-------------------------------+---------------------------------+
| smtoff normalized vs coresched_v10    | **   | 0.93                                 | 1.18                                   | **    | 1                             | 1                               |
+---------------------------------------+------+--------------------------------------+----------------------------------------+-------+-------------------------------+---------------------------------+


	-- performance workload C:
	Note: 
	* coresched (cs_on), uperf performance increased almost double vs coresched_v10
	* default (cs_off), uperf performance drop over 25% vs coresched_v10, same issue seen on v5.13-rc1 base (w/o coresched patchset)
+---------------------------------------+------+--------------------------------------+----------------------------------------+-------+-------------------------------+---------------------------------+
|                                       | **   | coresched_tip_merge_base_v5.13-rc1   | coresched_tip_merge_base_v5.13-rc1     | **    | coresched_v10_base_v5.10.11   | coresched_v10_base_v5.10.11     |
+=======================================+======+======================================+========================================+=======+===============================+=================================+
| workload                              | **   | uperf netperf TCP * 192              | uperf netperf UDP * 192                | **    | uperf netperf TCP * 192       | uperf netperf UDP * 192         |
+---------------------------------------+------+--------------------------------------+----------------------------------------+-------+-------------------------------+---------------------------------+
| prctl/cgroup                          | **   | prctl on workload uperf              | prctl on workload uperf                | **    | cg_uperf                      | cg_uperf                        |
+---------------------------------------+------+--------------------------------------+----------------------------------------+-------+-------------------------------+---------------------------------+
| record_item                           | **   | Tput_avg (Gb/s)                      | Tput_avg (Gb/s)                        | **    | Tput_avg (Gb/s)               | Tput_avg (Gb/s)                 |
+---------------------------------------+------+--------------------------------------+----------------------------------------+-------+-------------------------------+---------------------------------+
| coresched normalized vs coresched_v10 | **   | 1.83                                 | 1.93                                   | **    | 1                             | 1                               |
+---------------------------------------+------+--------------------------------------+----------------------------------------+-------+-------------------------------+---------------------------------+
| default normalized vs coresched_v10   | **   | 0.75                                 | 0.71                                   | **    | 1                             | 1                               |
+---------------------------------------+------+--------------------------------------+----------------------------------------+-------+-------------------------------+---------------------------------+
| smtoff normalized vs coresched_v10    | **   | 1                                    | 1.06                                   | **    | 1                             | 1                               |
+---------------------------------------+------+--------------------------------------+----------------------------------------+-------+-------------------------------+---------------------------------+


	-- performance workload D:
	Note: 
	* coresched (cs_on), wis performance increased almost double vs coresched_v10
	* default (cs_off) and smtoff, wis performance is better vs coresched_v10
+---------------------------------------+------+--------------------------------------+-------+-------------------------------+
|                                       | **   | coresched_tip_merge_base_v5.13-rc1   | **    | coresched_v10_base_v5.10.11   |
+=======================================+======+======================================+=======+===============================+
| workload                              | **   | will-it-scale  * 192                 | **    | will-it-scale  * 192          |
|                                       |      | (pipe based context_switch)          |       | (pipe based context_switch)   |
+---------------------------------------+------+--------------------------------------+-------+-------------------------------+
| prctl/cgroup                          | **   | prctl on workload wis                | **    | cg_wis                        |
+---------------------------------------+------+--------------------------------------+-------+-------------------------------+
| record_item                           | **   | threads_avg                          | **    | threads_avg                   |
+---------------------------------------+------+--------------------------------------+-------+-------------------------------+
| coresched normalized vs coresched_v10 | **   | 2.01                                 | **    | 1.00                          |
+---------------------------------------+------+--------------------------------------+-------+-------------------------------+
| default normalized vs coresched_v10   | **   | 1.13                                 | **    | 1.00                          |
+---------------------------------------+------+--------------------------------------+-------+-------------------------------+
| smtoff normalized vs coresched_v10    | **   | 1.29                                 | **    | 1.00                          |
+---------------------------------------+------+--------------------------------------+-------+-------------------------------+


	-- notes on record_item:
	* coresched normalized vs coresched_v10: smton, cs enabled, test result normalized by result of coresched_v10 under same config
	* default normalized vs coresched_v10: smton, cs disabled, test result normalized by result of coresched_v10 under same config
	* smtoff normalized vs coresched_v10: smtoff, test result normalized by result of coresched_v10 under same config



-- Hongyu Ning

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 18/19] sched: prctl() core-scheduling interface
  2021-04-22 12:05 ` [PATCH 18/19] sched: prctl() core-scheduling interface Peter Zijlstra
  2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Chris Hyser
@ 2021-06-14 23:36   ` Josh Don
  2021-06-15 11:31     ` Joel Fernandes
  2021-08-05 16:53   ` Eugene Syromiatnikov
  2021-08-17 15:15   ` Eugene Syromiatnikov
  3 siblings, 1 reply; 103+ messages in thread
From: Josh Don @ 2021-06-14 23:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes, Hyser,Chris, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Mel Gorman, linux-kernel, Thomas Gleixner,
	Aubrey Li, Xiangling Kong, Benjamin Segall

On Thu, Apr 22, 2021 at 5:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> From: Chris Hyser <chris.hyser@oracle.com>
>
> This patch provides support for setting and copying core scheduling
> 'task cookies' between threads (PID), processes (TGID), and process
> groups (PGID).

[snip]

Internally, we have lots of trusted processes that don't have a
security need for coresched cookies. However, these processes could
still decide to create cookies for themselves, which will degrade
machine capacity and performance for other jobs on the machine.

Any thoughts on whether it would be desirable to have the ability to
restrict use of SCHED_CORE_CREATE? Perhaps a new SCHED_CORE capability
would be appropriate?

- Josh

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 18/19] sched: prctl() core-scheduling interface
  2021-06-14 23:36   ` [PATCH 18/19] " Josh Don
@ 2021-06-15 11:31     ` Joel Fernandes
  0 siblings, 0 replies; 103+ messages in thread
From: Joel Fernandes @ 2021-06-15 11:31 UTC (permalink / raw)
  To: Josh Don
  Cc: Peter Zijlstra, Hyser,Chris, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Mel Gorman, linux-kernel, Thomas Gleixner,
	Aubrey Li, Xiangling Kong, Benjamin Segall, Vineeth Pillai

On Mon, Jun 14, 2021 at 7:36 PM Josh Don <joshdon@google.com> wrote:
>
> On Thu, Apr 22, 2021 at 5:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > From: Chris Hyser <chris.hyser@oracle.com>
> >
> > This patch provides support for setting and copying core scheduling
> > 'task cookies' between threads (PID), processes (TGID), and process
> > groups (PGID).
>
> [snip]
>
> Internally, we have lots of trusted processes that don't have a
> security need for coresched cookies. However, these processes could
> still decide to create cookies for themselves, which will degrade
> machine capacity and performance for other jobs on the machine.
>
> Any thoughts on whether it would be desirable to have the ability to
> restrict use of SCHED_CORE_CREATE? Perhaps a new SCHED_CORE capability
> would be appropriate?

Hi,
Maybe a capability won't work, because then other users who don't
care about the issue you mention would be required to manage/assign the
capability as well?

How about you use seccomp to filter the prctl based on the PID, and
CREATE command?

-Joel
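
For reference, a minimal sketch of what such a filter could look like
using libseccomp (not part of this series; it assumes libseccomp is
available and that PR_SCHED_CORE/PR_SCHED_CORE_CREATE carry the values
introduced by patch 18/19):

    /* Deny PR_SCHED_CORE_CREATE with EPERM, allow everything else. */
    #include <errno.h>
    #include <seccomp.h>
    #include <stdio.h>
    #include <sys/prctl.h>

    #ifndef PR_SCHED_CORE
    # define PR_SCHED_CORE        62
    #endif
    #ifndef PR_SCHED_CORE_CREATE
    # define PR_SCHED_CORE_CREATE 1
    #endif

    static int deny_core_create(void)
    {
            scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
            int ret = -ENOMEM;

            if (!ctx)
                    return ret;

            /* prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, ...) -> EPERM */
            ret = seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(prctl), 2,
                                   SCMP_A0(SCMP_CMP_EQ, PR_SCHED_CORE),
                                   SCMP_A1(SCMP_CMP_EQ, PR_SCHED_CORE_CREATE));
            if (!ret)
                    ret = seccomp_load(ctx);

            seccomp_release(ctx);
            return ret;
    }

    int main(void)
    {
            /* lets an unprivileged task install the filter */
            if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
                    perror("PR_SET_NO_NEW_PRIVS");
            if (deny_core_create())
                    fprintf(stderr, "failed to install seccomp filter\n");

            /* exec() the trusted workload here; it inherits the filter */
            return 0;
    }

Build with -lseccomp. The filter is inherited across fork()/execve(), so
the trusted job and its children cannot create new cookies, while the
other PR_SCHED_CORE commands are left alone.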

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 18/19] sched: prctl() core-scheduling interface
  2021-04-22 12:05 ` [PATCH 18/19] sched: prctl() core-scheduling interface Peter Zijlstra
  2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Chris Hyser
  2021-06-14 23:36   ` [PATCH 18/19] " Josh Don
@ 2021-08-05 16:53   ` Eugene Syromiatnikov
  2021-08-05 17:00     ` Peter Zijlstra
  2021-08-17 15:15   ` Eugene Syromiatnikov
  3 siblings, 1 reply; 103+ messages in thread
From: Eugene Syromiatnikov @ 2021-08-05 16:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman, linux-kernel, tglx, linux-api, ldv

On Thu, Apr 22, 2021 at 02:05:17PM +0200, Peter Zijlstra wrote:
> API:
> 
>   prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, tgtpid, pidtype, &cookie)
>   prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, tgtpid, pidtype, NULL)
>   prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, tgtpid, pidtype, NULL)
>   prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, srcpid, pidtype, NULL)
> 
> where 'tgtpid/srcpid == 0' implies the current process and pidtype is
> kernel enum pid_type {PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, ...}.

It means that enum pid_type has to be a part of the UAPI now.  Would you
like to address it, or would you rather I send a patch?
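
For reference, the values userspace currently has to mirror are the
first entries of the kernel-internal enum pid_type, roughly:

    enum pid_type {
            PIDTYPE_PID,    /* 0: a single task */
            PIDTYPE_TGID,   /* 1: a thread group (process) */
            PIDTYPE_PGID,   /* 2: a process group */
            PIDTYPE_SID,    /* 3: a session */
            PIDTYPE_MAX,
    };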


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 18/19] sched: prctl() core-scheduling interface
  2021-08-05 16:53   ` Eugene Syromiatnikov
@ 2021-08-05 17:00     ` Peter Zijlstra
  0 siblings, 0 replies; 103+ messages in thread
From: Peter Zijlstra @ 2021-08-05 17:00 UTC (permalink / raw)
  To: Eugene Syromiatnikov
  Cc: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman, linux-kernel, tglx, linux-api, ldv

On Thu, Aug 05, 2021 at 06:53:19PM +0200, Eugene Syromiatnikov wrote:
> On Thu, Apr 22, 2021 at 02:05:17PM +0200, Peter Zijlstra wrote:
> > API:
> > 
> >   prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, tgtpid, pidtype, &cookie)
> >   prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, tgtpid, pidtype, NULL)
> >   prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, tgtpid, pidtype, NULL)
> >   prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, srcpid, pidtype, NULL)
> > 
> > where 'tgtpid/srcpid == 0' implies the current process and pidtype is
> > kernel enum pid_type {PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, ...}.
> 
> It means that enum pid_type has to be a part of the UAPI now.  Would you
> like to address it, or would you rather I send a patch?

Please send a patch; I'm more sparse than usual atm.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 18/19] sched: prctl() core-scheduling interface
  2021-04-22 12:05 ` [PATCH 18/19] sched: prctl() core-scheduling interface Peter Zijlstra
                     ` (2 preceding siblings ...)
  2021-08-05 16:53   ` Eugene Syromiatnikov
@ 2021-08-17 15:15   ` Eugene Syromiatnikov
  2021-08-17 15:52     ` Peter Zijlstra
  3 siblings, 1 reply; 103+ messages in thread
From: Eugene Syromiatnikov @ 2021-08-17 15:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman, linux-kernel, tglx,
	Christian Brauner, ldv

[-- Attachment #1: Type: text/plain, Size: 2822 bytes --]

On Thu, Apr 22, 2021 at 02:05:17PM +0200, Peter Zijlstra wrote:
> From: Chris Hyser <chris.hyser@oracle.com>
> 
> This patch provides support for setting and copying core scheduling
> 'task cookies' between threads (PID), processes (TGID), and process
> groups (PGID).

Hello.

It seems that there is some issue within the scheduler code
that can be triggered via this interface:

    # gcc -std=gnu99 -Wextra -Werror prctl-sched-core-oops-repro.c -o prctl-sched-core-oops-repro
    # ../src/strace -fvq -eprctl,clone,setsid ./prctl-sched-core-oops-repro
    clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f271f875890) = 239820
    [pid 239820] setsid()                   = 239820
    [pid 239820] +++ exited with 0 +++
    --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=239820, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
    Iteration 0 status: 0
    prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 239816, 0x2 /* PIDTYPE_PGID */, NULL) = 0
    clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f271f875890) = 239821
    [pid 239821] setsid()                   = ?
    [pid 239821] +++ killed by SIGKILL +++
    --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=239821, si_uid=0, si_status=SIGKILL, si_utime=0, si_stime=0} ---
    Iteration 1 status: 0x9
    prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 239816, 0x2 /* PIDTYPE_PGID */, NULL) = 0
    +++ exited with 0 +++

kmsg indicates that a NULL pointer dereference has occurred:

    [76195.611570] BUG: kernel NULL pointer dereference, address: 0000000000000000
    ...
    [76195.621771] RIP: 0010:do_raw_spin_trylock+0x5/0x40
    ...
    [76195.640144] Call Trace:
    [76195.640706]  _raw_spin_lock_nested+0x37/0x80
    [76195.641645]  ? raw_spin_rq_lock_nested+0x4b/0x80
    [76195.642693]  raw_spin_rq_lock_nested+0x4b/0x80
    [76195.643669]  online_fair_sched_group+0x39/0x240
    [76195.644663]  sched_autogroup_create_attach+0x9d/0x170
    [76195.645765]  ksys_setsid+0xe6/0x110
    [76195.646533]  __do_sys_setsid+0xa/0x10
    [76195.647358]  do_syscall_64+0x3b/0x90
    [76195.648219]  entry_SYSCALL_64_after_hwframe+0x44/0xae

The full kmsg excerpt and the reproducer code are attached.

There's also an additional "BUG: sleeping function called from invalid
context at include/linux/percpu-rwsem.h:49" message produced (see the
full log in the attached file "prctl-sched-core-oops-bug-dmesg.log")
when the full test case[1] is run, but I haven't been successful
so far in producing a minimal reproducer for it.

[1] https://github.com/strace/strace/commit/a90a5a56d2b76ba3ebd417472a02f40d3d6599d8
    Run with `./bootstrap && ./configure CFLAGS='-g -Og' --enable-gcc-Werror &&
    make check TESTS=prctl-sched-core--pidns-translation.gen`

[-- Attachment #2: prctl-sched-core-oops-repro.c --]
[-- Type: text/x-csrc, Size: 445 bytes --]

#include <stdio.h>
#include <linux/prctl.h>
#include <sys/prctl.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	int pid;
	int pgid = getpgid(0);

	for (size_t i = 0; i < 2; i++) {
		if ((pid = fork())) {
			int status;

			/* parent: reap the child that called setsid() */
			wait(&status);
			printf("Iteration %zu status: %#x\n", i, status);
		} else {
			/* child: move into a new session, then exit */
			setsid();

			return 0;
		}

		/* tag the whole process group with a new core-sched cookie */
		prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, pgid, 0x2 /* PIDTYPE_PGID */, NULL);
	}

	return 0;
}


[-- Attachment #3: prctl-sched-core-oops-dmesg.log --]
[-- Type: text/plain, Size: 4239 bytes --]

[76195.611570] BUG: kernel NULL pointer dereference, address: 0000000000000000
[76195.613059] #PF: supervisor read access in kernel mode
[76195.614174] #PF: error_code(0x0000) - not-present page
[76195.615329] PGD 800000005f27e067 P4D 800000005f27e067 PUD 3f7a3067 PMD 0 
[76195.616801] Oops: 0000 [#67] SMP PTI
[76195.617586] CPU: 2 PID: 239821 Comm: prctl-sched-cor Tainted: G      D W        --------- ---  5.14.0-0.rc5.20210813gitf8e6dfc64f61.46.fc36.x86_64 #1
[76195.620374] Hardware name: HP ProLiant BL480c G1, BIOS I14 10/04/2007
[76195.621771] RIP: 0010:do_raw_spin_trylock+0x5/0x40
[76195.622821] Code: c6 a4 12 5f 9f 48 89 ef e8 c8 fe ff ff eb a9 89 c6 48 89 ef e8 0c f5 ff ff 66 90 eb a9 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <8b> 07 85 c0 75 28 ba 01 00 00 00 f0 0f b1 17 75 1d 65 8b 05 fb 98
[76195.626797] RSP: 0018:ffffa366014abe58 EFLAGS: 00010086
[76195.627936] RAX: 0000000000000001 RBX: 0000000000000004 RCX: 0000000000000000
[76195.629470] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[76195.631048] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
[76195.632585] R10: 0000000000000000 R11: ffff98292b21ad48 R12: 0000000000000018
[76195.634078] R13: 0000000000000000 R14: ffff98292b7ef940 R15: ffff982813938e00
[76195.635621] FS:  00007f271f8755c0(0000) GS:ffff98292b200000(0000) knlGS:0000000000000000
[76195.637354] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[76195.638606] CR2: 0000000000000000 CR3: 000000000fdd0000 CR4: 00000000000006e0
[76195.640144] Call Trace:
[76195.640706]  _raw_spin_lock_nested+0x37/0x80
[76195.641645]  ? raw_spin_rq_lock_nested+0x4b/0x80
[76195.642693]  raw_spin_rq_lock_nested+0x4b/0x80
[76195.643669]  online_fair_sched_group+0x39/0x240
[76195.644663]  sched_autogroup_create_attach+0x9d/0x170
[76195.645765]  ksys_setsid+0xe6/0x110
[76195.646533]  __do_sys_setsid+0xa/0x10
[76195.647358]  do_syscall_64+0x3b/0x90
[76195.648219]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[76195.649401] RIP: 0033:0x7f271f7496cb
[76195.650241] Code: 73 01 c3 48 8b 0d 5d a7 11 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 70 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2d a7 11 00 f7 d8 64 89 01 48
[76195.654522] RSP: 002b:00007ffed2ea9888 EFLAGS: 00000206 ORIG_RAX: 0000000000000070
[76195.656275] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f271f7496cb
[76195.657934] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000001
[76195.659566] RBP: 00007ffed2ea98b0 R08: 0000000000000000 R09: 00007f271f8755c0
[76195.661227] R10: 00007f271f67ca60 R11: 0000000000000206 R12: 00007ffed2ea99e8
[76195.662869] R13: 0000000000401176 R14: 00007f271f8adc00 R15: 0000000000403e18
[76195.664510] Modules linked in: rfkill iTCO_wdt intel_pmc_bxt iTCO_vendor_support kvm_intel gpio_ich kvm snd_pcsp snd_pcm snd_timer ipmi_ssif irqbypass snd soundcore lpc_ich acpi_ipmi e1000e tg3 hpilo ipmi_si bnx2 i5000_edac ipmi_devintf i5k_amb ipmi_msghandler fuse zram ip_tables xfs radeon i2c_algo_bit drm_ttm_helper ttm drm_kms_helper cec drm hpsa serio_raw hpwdt scsi_transport_sas
[76195.672185] CR2: 0000000000000000
[76195.672973] ---[ end trace ab4b5b9489802ac5 ]---
[76195.674050] RIP: 0010:__lock_acquire+0x5c1/0x1e00
[76195.675146] Code: 00 00 83 f8 2f 0f 87 59 05 00 00 3b 05 cc 46 f6 02 41 bd 01 00 00 00 0f 86 03 01 00 00 89 05 ba 46 f6 02 e9 f8 00 00 00 31 c9 <48> 81 3f 20 8e cc a0 41 0f 45 c8 83 fe 01 0f 87 6f fa ff ff 89 f0
[76195.679386] RSP: 0018:ffffa36602d5fd40 EFLAGS: 00010046
[76195.680606] RAX: 0000000000000046 RBX: 0000000000000001 RCX: 0000000000000000
[76195.682239] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
[76195.683881] RBP: 0000000000000018 R08: 0000000000000001 R09: 0000000000000001
[76195.685554] R10: 0000000000000001 R11: 0000000000000000 R12: ffff98285fad33c0
[76195.687197] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[76195.688840] FS:  00007f271f8755c0(0000) GS:ffff98292b200000(0000) knlGS:0000000000000000
[76195.690690] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[76195.692034] CR2: 0000000000000000 CR3: 000000000fdd0000 CR4: 00000000000006e0
[76195.693677] note: prctl-sched-cor[239821] exited with preempt_count 2

[-- Attachment #4: prctl-sched-core-oops-bug-dmesg.log --]
[-- Type: text/plain, Size: 6095 bytes --]

[31148.057893] BUG: kernel NULL pointer dereference, address: 0000000000000000
[31148.059390] #PF: supervisor read access in kernel mode
[31148.060530] #PF: error_code(0x0000) - not-present page
[31148.061601] PGD 800000005f693067 P4D 800000005f693067 PUD 114fe067 PMD 0 
[31148.063082] Oops: 0000 [#21] SMP PTI
[31148.063880] CPU: 0 PID: 214019 Comm: prctl-sched-cor Tainted: G      D W        --------- ---  5.14.0-0.rc5.20210813gitf8e6dfc64f61.46.fc36.x86_64 #1
[31148.067019] Hardware name: HP ProLiant BL480c G1, BIOS I14 10/04/2007
[31148.068536] RIP: 0010:do_raw_spin_trylock+0x5/0x40
[31148.069629] Code: c6 a4 12 5f 9f 48 89 ef e8 c8 fe ff ff eb a9 89 c6 48 89 ef e8 0c f5 ff ff 66 90 eb a9 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <8b> 07 85 c0 75 28 ba 01 00 00 00 f0 0f b1 17 75 1d 65 8b 05 fb 98
[31148.073601] RSP: 0018:ffffa366030ffe58 EFLAGS: 00010086
[31148.074740] RAX: 0000000000000001 RBX: 0000000000000004 RCX: 0000000000000000
[31148.076235] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[31148.077770] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
[31148.079264] R10: 0000000000000000 R11: 000000000121ee36 R12: 0000000000000018
[31148.080752] R13: 0000000000000000 R14: ffff98292b7ef940 R15: ffff98280d75da00
[31148.082295] FS:  00007f1aefadc600(0000) GS:ffff98292ae00000(0000) knlGS:0000000000000000
[31148.084029] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[31148.085282] CR2: 0000000000000000 CR3: 00000000013be000 CR4: 00000000000006f0
[31148.086770] Call Trace:
[31148.087286]  _raw_spin_lock_nested+0x37/0x80
[31148.088212]  ? raw_spin_rq_lock_nested+0x4b/0x80
[31148.089180]  raw_spin_rq_lock_nested+0x4b/0x80
[31148.090189]  online_fair_sched_group+0x39/0x240
[31148.091185]  sched_autogroup_create_attach+0x9d/0x170
[31148.092310]  ksys_setsid+0xe6/0x110
[31148.093073]  __do_sys_setsid+0xa/0x10
[31148.093881]  do_syscall_64+0x3b/0x90
[31148.094678]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[31148.095783] RIP: 0033:0x7f1aef9b06cb
[31148.096579] Code: 73 01 c3 48 8b 0d 5d a7 11 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 70 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2d a7 11 00 f7 d8 64 89 01 48
[31148.100758] RSP: 002b:00007fff4b7eef78 EFLAGS: 00000206 ORIG_RAX: 0000000000000070
[31148.102535] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1aef9b06cb
[31148.104188] RDX: 0000000000000010 RSI: 0000000000406230 RDI: 0000000000000003
[31148.105859] RBP: 00000000ffffffff R08: 0000000000000000 R09: 00007f1aefadc600
[31148.107523] R10: 00007f1aef8e3a60 R11: 0000000000000206 R12: 0000000000000001
[31148.109155] R13: 0000000000401326 R14: 00007f1aefb14c00 R15: 0000000000405e18
[31148.110819] Modules linked in: rfkill iTCO_wdt intel_pmc_bxt iTCO_vendor_support kvm_intel gpio_ich kvm snd_pcsp snd_pcm snd_timer ipmi_ssif irqbypass snd soundcore lpc_ich acpi_ipmi e1000e tg3 hpilo ipmi_si bnx2 i5000_edac ipmi_devintf i5k_amb ipmi_msghandler fuse zram ip_tables xfs radeon i2c_algo_bit drm_ttm_helper ttm drm_kms_helper cec drm hpsa serio_raw hpwdt scsi_transport_sas
[31148.118454] CR2: 0000000000000000
[31148.119246] ---[ end trace ab4b5b9489802a97 ]---
[31148.120379] RIP: 0010:__lock_acquire+0x5c1/0x1e00
[31148.121475] Code: 00 00 83 f8 2f 0f 87 59 05 00 00 3b 05 cc 46 f6 02 41 bd 01 00 00 00 0f 86 03 01 00 00 89 05 ba 46 f6 02 e9 f8 00 00 00 31 c9 <48> 81 3f 20 8e cc a0 41 0f 45 c8 83 fe 01 0f 87 6f fa ff ff 89 f0
[31148.125694] RSP: 0018:ffffa36602d5fd40 EFLAGS: 00010046
[31148.126907] RAX: 0000000000000046 RBX: 0000000000000001 RCX: 0000000000000000
[31148.128546] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
[31148.130195] RBP: 0000000000000018 R08: 0000000000000001 R09: 0000000000000001
[31148.131853] R10: 0000000000000001 R11: 0000000000000000 R12: ffff98285fad33c0
[31148.133493] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[31148.135168] FS:  00007f1aefadc600(0000) GS:ffff98292ae00000(0000) knlGS:0000000000000000
[31148.137018] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[31148.138352] CR2: 0000000000000000 CR3: 00000000013be000 CR4: 00000000000006f0
[31148.139984] note: prctl-sched-cor[214019] exited with preempt_count 2
[31148.141470] BUG: sleeping function called from invalid context at include/linux/percpu-rwsem.h:49
[31148.143494] in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 214019, name: prctl-sched-cor
[31148.145497] INFO: lockdep is turned off.
[31148.146422] irq event stamp: 0
[31148.147195] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[31148.148640] hardirqs last disabled at (0): [<ffffffff9e0e4e9f>] copy_process+0x7cf/0x1fb0
[31148.150511] softirqs last  enabled at (0): [<ffffffff9e0e4e9f>] copy_process+0x7cf/0x1fb0
[31148.152384] softirqs last disabled at (0): [<0000000000000000>] 0x0
[31148.153825] CPU: 0 PID: 214019 Comm: prctl-sched-cor Tainted: G      D W        --------- ---  5.14.0-0.rc5.20210813gitf8e6dfc64f61.46.fc36.x86_64 #1
[31148.156852] Hardware name: HP ProLiant BL480c G1, BIOS I14 10/04/2007
[31148.158338] Call Trace:
[31148.158945]  dump_stack_lvl+0x57/0x72
[31148.159859]  ___might_sleep.cold+0xb6/0xc6
[31148.160822]  exit_signals+0x1c/0x2d0
[31148.161668]  do_exit+0xbf/0xc10
[31148.162419]  ? ksys_setsid+0xe6/0x110
[31148.163228]  rewind_stack_do_exit+0x17/0x20
[31148.164203] RIP: 0033:0x7f1aef9b06cb
[31148.165043] Code: 73 01 c3 48 8b 0d 5d a7 11 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 70 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2d a7 11 00 f7 d8 64 89 01 48
[31148.169275] RSP: 002b:00007fff4b7eef78 EFLAGS: 00000206 ORIG_RAX: 0000000000000070
[31148.171019] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1aef9b06cb
[31148.172700] RDX: 0000000000000010 RSI: 0000000000406230 RDI: 0000000000000003
[31148.174343] RBP: 00000000ffffffff R08: 0000000000000000 R09: 00007f1aefadc600
[31148.175994] R10: 00007f1aef8e3a60 R11: 0000000000000206 R12: 0000000000000001
[31148.177638] R13: 0000000000401326 R14: 00007f1aefb14c00 R15: 0000000000405e18

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 18/19] sched: prctl() core-scheduling interface
  2021-08-17 15:15   ` Eugene Syromiatnikov
@ 2021-08-17 15:52     ` Peter Zijlstra
  2021-08-17 23:17       ` Eugene Syromiatnikov
  0 siblings, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2021-08-17 15:52 UTC (permalink / raw)
  To: Eugene Syromiatnikov
  Cc: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman, linux-kernel, tglx,
	Christian Brauner, ldv

On Tue, Aug 17, 2021 at 05:15:42PM +0200, Eugene Syromiatnikov wrote:
> [76195.611570] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [76195.613059] #PF: supervisor read access in kernel mode
> [76195.614174] #PF: error_code(0x0000) - not-present page
> [76195.615329] PGD 800000005f27e067 P4D 800000005f27e067 PUD 3f7a3067 PMD 0 
> [76195.616801] Oops: 0000 [#67] SMP PTI
> [76195.617586] CPU: 2 PID: 239821 Comm: prctl-sched-cor Tainted: G      D W        --------- ---  5.14.0-0.rc5.20210813gitf8e6dfc64f61.46.fc36.x86_64 #1
> [76195.620374] Hardware name: HP ProLiant BL480c G1, BIOS I14 10/04/2007
> [76195.621771] RIP: 0010:do_raw_spin_trylock+0x5/0x40
> [76195.622821] Code: c6 a4 12 5f 9f 48 89 ef e8 c8 fe ff ff eb a9 89 c6 48 89 ef e8 0c f5 ff ff 66 90 eb a9 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <8b> 07 85 c0 75 28 ba 01 00 00 00 f0 0f b1 17 75 1d 65 8b 05 fb 98
> [76195.626797] RSP: 0018:ffffa366014abe58 EFLAGS: 00010086
> [76195.627936] RAX: 0000000000000001 RBX: 0000000000000004 RCX: 0000000000000000
> [76195.629470] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> [76195.631048] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
> [76195.632585] R10: 0000000000000000 R11: ffff98292b21ad48 R12: 0000000000000018
> [76195.634078] R13: 0000000000000000 R14: ffff98292b7ef940 R15: ffff982813938e00
> [76195.635621] FS:  00007f271f8755c0(0000) GS:ffff98292b200000(0000) knlGS:0000000000000000
> [76195.637354] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [76195.638606] CR2: 0000000000000000 CR3: 000000000fdd0000 CR4: 00000000000006e0
> [76195.640144] Call Trace:
> [76195.640706]  _raw_spin_lock_nested+0x37/0x80
> [76195.641645]  ? raw_spin_rq_lock_nested+0x4b/0x80
> [76195.642693]  raw_spin_rq_lock_nested+0x4b/0x80
> [76195.643669]  online_fair_sched_group+0x39/0x240

Urgh... lemme guess, your HP BIOS is funny and reports more possible
CPUs than you actually have resulting in cpu_possible_mask !=
cpu_online_mask. Alternatively, you booted with nr_cpus= or something
daft like that.

That code does for_each_possible_cpu(i) { rq_lock_irq(cpu_rq(i)); },
which, because of core-sched, needs rq->core set-up, but because these
CPUs have never been online, that's not done and *BOOM*.

Or something like that.. I'll try and have a look tomorrow, I'm in dire
need of sleep.
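
A rough sketch of the pattern in question (not the exact kernel code;
the point being that, with core scheduling enabled, the rq lock helpers
resolve the lock through rq->core, which is NULL for a CPU that was
never brought online):

    int cpu;

    for_each_possible_cpu(cpu) {
        struct rq *rq = cpu_rq(cpu);
        struct rq_flags rf;

        /* with core scheduling on, this takes the lock via rq->core */
        rq_lock_irq(rq, &rf);
        /* ... per-CPU setup, e.g. sched group bring-up ... */
        rq_unlock_irq(rq, &rf);
    }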

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 18/19] sched: prctl() core-scheduling interface
  2021-08-17 15:52     ` Peter Zijlstra
@ 2021-08-17 23:17       ` Eugene Syromiatnikov
  2021-08-19 11:09         ` [PATCH] sched: Fix Core-wide rq->lock for uninitialized CPUs Peter Zijlstra
  0 siblings, 1 reply; 103+ messages in thread
From: Eugene Syromiatnikov @ 2021-08-17 23:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman, linux-kernel, tglx,
	Christian Brauner, ldv

On Tue, Aug 17, 2021 at 05:52:43PM +0200, Peter Zijlstra wrote:
> Urgh... lemme guess, your HP BIOS is funny and reports more possible
> CPUs than you actually have resulting in cpu_possible_mask !=
> cpu_online_mask. Alternatively, you booted with nr_cpus= or something
> daft like that.

Yep, it seems to be the case:

    # cat /sys/devices/system/cpu/possible
    0-7
    # cat /sys/devices/system/cpu/online
    0-3

> 
> That code does for_each_possible_cpu(i) { rq_lock_irq(cpu_rq(i)); },
> which, because of core-sched, needs rq->core set-up, but because these
> CPUs have never been online, that's not done and *BOOM*.
> 
> Or something like that.. I'll try and have a look tomorrow, I'm in dire
> need of sleep.
> 


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH] sched: Fix Core-wide rq->lock for uninitialized CPUs
  2021-08-17 23:17       ` Eugene Syromiatnikov
@ 2021-08-19 11:09         ` Peter Zijlstra
  2021-08-19 15:50           ` Tao Zhou
                             ` (3 more replies)
  0 siblings, 4 replies; 103+ messages in thread
From: Peter Zijlstra @ 2021-08-19 11:09 UTC (permalink / raw)
  To: Eugene Syromiatnikov
  Cc: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman, linux-kernel, tglx,
	Christian Brauner, ldv

On Wed, Aug 18, 2021 at 01:17:34AM +0200, Eugene Syromiatnikov wrote:
> On Tue, Aug 17, 2021 at 05:52:43PM +0200, Peter Zijlstra wrote:
> > Urgh... lemme guess, your HP BIOS is funny and reports more possible
> > CPUs than you actually have resulting in cpu_possible_mask !=
> > cpu_online_mask. Alternatively, you booted with nr_cpus= or something
> > daft like that.
> 
> Yep, it seems to be the case:
> 
>     # cat /sys/devices/system/cpu/possible
>     0-7
>     # cat /sys/devices/system/cpu/online
>     0-3
> 

I think the below should work... can you please verify?

---
Subject: sched: Fix Core-wide rq->lock for uninitialized CPUs

Eugene tripped over the case where rq_lock(), as called in a
for_each_possible_cpu() loop, came apart because rq->core hadn't been
set up yet.

This is a somewhat unusual, but valid case.

Rework things such that rq->core is initialized to point at itself. IOW,
initialize each CPU as a single-threaded Core. CPU online will then join
the new CPU (thread) to an existing Core where needed.

For completeness' sake, have CPU offline fully undo the state so as not
to presume the topology will match the next time it comes online.

Fixes: 9edeaea1bc45 ("sched: Core-wide rq->lock")
Reported-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c  | 143 +++++++++++++++++++++++++++++++++++++++++----------
 kernel/sched/sched.h |   2 +-
 2 files changed, 118 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0f14e09a4f99..21d633971fcf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -237,9 +237,30 @@ static DEFINE_MUTEX(sched_core_mutex);
 static atomic_t sched_core_count;
 static struct cpumask sched_core_mask;
 
+static void sched_core_lock(int cpu, unsigned long *flags)
+{
+	const struct cpumask *smt_mask = cpu_smt_mask(cpu);
+	int t, i = 0;
+
+	local_irq_save(*flags);
+	for_each_cpu(t, smt_mask)
+		raw_spin_lock_nested(&cpu_rq(t)->__lock, i++);
+}
+
+static void sched_core_unlock(int cpu, unsigned long *flags)
+{
+	const struct cpumask *smt_mask = cpu_smt_mask(cpu);
+	int t;
+
+	for_each_cpu(t, smt_mask)
+		raw_spin_unlock(&cpu_rq(t)->__lock);
+	local_irq_restore(*flags);
+}
+
 static void __sched_core_flip(bool enabled)
 {
-	int cpu, t, i;
+	unsigned long flags;
+	int cpu, t;
 
 	cpus_read_lock();
 
@@ -250,19 +271,12 @@ static void __sched_core_flip(bool enabled)
 	for_each_cpu(cpu, &sched_core_mask) {
 		const struct cpumask *smt_mask = cpu_smt_mask(cpu);
 
-		i = 0;
-		local_irq_disable();
-		for_each_cpu(t, smt_mask) {
-			/* supports up to SMT8 */
-			raw_spin_lock_nested(&cpu_rq(t)->__lock, i++);
-		}
+		sched_core_lock(cpu, &flags);
 
 		for_each_cpu(t, smt_mask)
 			cpu_rq(t)->core_enabled = enabled;
 
-		for_each_cpu(t, smt_mask)
-			raw_spin_unlock(&cpu_rq(t)->__lock);
-		local_irq_enable();
+		sched_core_unlock(cpu, &flags);
 
 		cpumask_andnot(&sched_core_mask, &sched_core_mask, smt_mask);
 	}
@@ -5979,35 +5993,109 @@ void queue_core_balance(struct rq *rq)
 	queue_balance_callback(rq, &per_cpu(core_balance_head, rq->cpu), sched_core_balance);
 }
 
-static inline void sched_core_cpu_starting(unsigned int cpu)
+static void sched_core_cpu_starting(unsigned int cpu)
 {
 	const struct cpumask *smt_mask = cpu_smt_mask(cpu);
-	struct rq *rq, *core_rq = NULL;
-	int i;
+	struct rq *rq = cpu_rq(cpu), *core_rq = NULL;
+	unsigned long flags;
+	int t;
 
-	core_rq = cpu_rq(cpu)->core;
+	sched_core_lock(cpu, &flags);
 
-	if (!core_rq) {
-		for_each_cpu(i, smt_mask) {
-			rq = cpu_rq(i);
-			if (rq->core && rq->core == rq)
-				core_rq = rq;
+	WARN_ON_ONCE(rq->core != rq);
+
+	/* if we're the first, we'll be our own leader */
+	if (cpumask_weight(smt_mask) == 1)
+		goto unlock;
+
+	/* find the leader */
+	for_each_cpu(t, smt_mask) {
+		if (t == cpu)
+			continue;
+		rq = cpu_rq(t);
+		if (rq->core == rq) {
+			core_rq = rq;
+			break;
 		}
+	}
 
-		if (!core_rq)
-			core_rq = cpu_rq(cpu);
+	if (WARN_ON_ONCE(!core_rq)) /* whoopsie */
+		goto unlock;
 
-		for_each_cpu(i, smt_mask) {
-			rq = cpu_rq(i);
+	/* install and validate core_rq */
+	for_each_cpu(t, smt_mask) {
+		rq = cpu_rq(t);
 
-			WARN_ON_ONCE(rq->core && rq->core != core_rq);
+		if (t == cpu)
 			rq->core = core_rq;
-		}
+
+		WARN_ON_ONCE(rq->core != core_rq);
 	}
+
+unlock:
+	sched_core_unlock(cpu, &flags);
 }
+
+static void sched_core_cpu_deactivate(unsigned int cpu)
+{
+	const struct cpumask *smt_mask = cpu_smt_mask(cpu);
+	struct rq *rq = cpu_rq(cpu), *core_rq = NULL;
+	unsigned long flags;
+	int t;
+
+	sched_core_lock(cpu, &flags);
+
+	/* if we're the last man standing, nothing to do */
+	if (cpumask_weight(smt_mask) == 1) {
+		WARN_ON_ONCE(rq->core != rq);
+		goto unlock;
+	}
+
+	/* if we're not the leader, nothing to do */
+	if (rq->core != rq)
+		goto unlock;
+
+	/* find a new leader */
+	for_each_cpu(t, smt_mask) {
+		if (t == cpu)
+			continue;
+		core_rq = cpu_rq(t);
+		break;
+	}
+
+	if (WARN_ON_ONCE(!core_rq)) /* impossible */
+		goto unlock;
+
+	/* copy the shared state to the new leader */
+	core_rq->core_task_seq      = rq->core_task_seq;
+	core_rq->core_pick_seq      = rq->core_pick_seq;
+	core_rq->core_cookie        = rq->core_cookie;
+	core_rq->core_forceidle     = rq->core_forceidle;
+	core_rq->core_forceidle_seq = rq->core_forceidle_seq;
+
+	/* install new leader */
+	for_each_cpu(t, smt_mask) {
+		rq = cpu_rq(t);
+		rq->core = core_rq;
+	}
+
+unlock:
+	sched_core_unlock(cpu, &flags);
+}
+
+static inline void sched_core_cpu_dying(unsigned int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	if (rq->core != rq)
+		rq->core = rq;
+}
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline void sched_core_cpu_starting(unsigned int cpu) {}
+static inline void sched_core_cpu_deactivate(unsigned int cpu) {}
+static inline void sched_core_cpu_dying(unsigned int cpu) {}
 
 static struct task_struct *
 pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
@@ -9009,6 +9097,8 @@ int sched_cpu_deactivate(unsigned int cpu)
 	 */
 	if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
 		static_branch_dec_cpuslocked(&sched_smt_present);
+
+	sched_core_cpu_deactivate(cpu);
 #endif
 
 	if (!sched_smp_initialized)
@@ -9113,6 +9203,7 @@ int sched_cpu_dying(unsigned int cpu)
 	calc_load_migrate(rq);
 	update_max_interval();
 	hrtick_clear(rq);
+	sched_core_cpu_dying(cpu);
 	return 0;
 }
 #endif
@@ -9324,7 +9415,7 @@ void __init sched_init(void)
 		atomic_set(&rq->nr_iowait, 0);
 
 #ifdef CONFIG_SCHED_CORE
-		rq->core = NULL;
+		rq->core = rq;
 		rq->core_pick = NULL;
 		rq->core_enabled = 0;
 		rq->core_tree = RB_ROOT;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a9a660c6e08a..e6347c88c467 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1103,7 +1103,7 @@ struct rq {
 	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
 
-	/* shared state */
+	/* shared state -- careful with sched_core_cpu_deactivate() */
 	unsigned int		core_task_seq;
 	unsigned int		core_pick_seq;
 	unsigned long		core_cookie;
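
Putting the hunks above together, a hypothetical SMT2 walk-through
(CPU0 and CPU1 as siblings) of how rq->core now evolves across hotplug:

    sched_init():      rq0->core = rq0, rq1->core = rq1
    CPU0 starting:     alone in its SMT mask, stays its own leader
    CPU1 starting:     finds rq0 with rq0->core == rq0, so rq1->core = rq0
    CPU0 deactivate:   CPU0 is the leader, so rq1 takes over: the shared
                       state is copied and both rq0->core and rq1->core
                       now point at rq1
    CPU0 dying:        rq0->core = rq0 again, ready for a later online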

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH] sched: Fix Core-wide rq->lock for uninitialized CPUs
  2021-08-19 11:09         ` [PATCH] sched: Fix Core-wide rq->lock for uninitialized CPUs Peter Zijlstra
@ 2021-08-19 15:50           ` Tao Zhou
  2021-08-19 16:19           ` Eugene Syromiatnikov
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 103+ messages in thread
From: Tao Zhou @ 2021-08-19 15:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Eugene Syromiatnikov, joel, chris.hyser, joshdon, mingo,
	vincent.guittot, valentin.schneider, mgorman, linux-kernel, tglx,
	Christian Brauner, ldv, tao.zhou

On Thu, Aug 19, 2021 at 01:09:17PM +0200, Peter Zijlstra wrote:

> On Wed, Aug 18, 2021 at 01:17:34AM +0200, Eugene Syromiatnikov wrote:
> > On Tue, Aug 17, 2021 at 05:52:43PM +0200, Peter Zijlstra wrote:
> > > Urgh... lemme guess, your HP BIOS is funny and reports more possible
> > > CPUs than you actually have resulting in cpu_possible_mask !=
> > > cpu_online_mask. Alternatively, you booted with nr_cpus= or something
> > > daft like that.
> > 
> > Yep, it seems to be the case:
> > 
> >     # cat /sys/devices/system/cpu/possible
> >     0-7
> >     # cat /sys/devices/system/cpu/online
> >     0-3
> > 
> 
> I think the below should work... can you please verify?
> 
> ---
> Subject: sched: Fix Core-wide rq->lock for uninitialized CPUs
> 
> Eugene tripped over the case where rq_lock(), as called in a
> for_each_possible_cpu() loop, came apart because rq->core hadn't been
> set up yet.
> 
> This is a somewhat unusual, but valid case.
> 
> Rework things such that rq->core is initialized to point at itself. IOW,
> initialize each CPU as a single-threaded Core. CPU online will then join
> the new CPU (thread) to an existing Core where needed.
> 
> For completeness' sake, have CPU offline fully undo the state so as not
> to presume the topology will match the next time it comes online.
> 
> Fixes: 9edeaea1bc45 ("sched: Core-wide rq->lock")
> Reported-by: Eugene Syromiatnikov <esyr@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/core.c  | 143 +++++++++++++++++++++++++++++++++++++++++----------
>  kernel/sched/sched.h |   2 +-
>  2 files changed, 118 insertions(+), 27 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0f14e09a4f99..21d633971fcf 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -237,9 +237,30 @@ static DEFINE_MUTEX(sched_core_mutex);
>  static atomic_t sched_core_count;
>  static struct cpumask sched_core_mask;
>  
> +static void sched_core_lock(int cpu, unsigned long *flags)
> +{
> +	const struct cpumask *smt_mask = cpu_smt_mask(cpu);
> +	int t, i = 0;
> +
> +	local_irq_save(*flags);
> +	for_each_cpu(t, smt_mask)
> +		raw_spin_lock_nested(&cpu_rq(t)->__lock, i++);
> +}
> +
> +static void sched_core_unlock(int cpu, unsigned long *flags)
> +{
> +	const struct cpumask *smt_mask = cpu_smt_mask(cpu);
> +	int t;
> +
> +	for_each_cpu(t, smt_mask)
> +		raw_spin_unlock(&cpu_rq(t)->__lock);
> +	local_irq_restore(*flags);
> +}
> +
>  static void __sched_core_flip(bool enabled)
>  {
> -	int cpu, t, i;
> +	unsigned long flags;
> +	int cpu, t;
>  
>  	cpus_read_lock();
>  
> @@ -250,19 +271,12 @@ static void __sched_core_flip(bool enabled)
>  	for_each_cpu(cpu, &sched_core_mask) {
>  		const struct cpumask *smt_mask = cpu_smt_mask(cpu);
>  
> -		i = 0;
> -		local_irq_disable();
> -		for_each_cpu(t, smt_mask) {
> -			/* supports up to SMT8 */
> -			raw_spin_lock_nested(&cpu_rq(t)->__lock, i++);
> -		}
> +		sched_core_lock(cpu, &flags);
>  
>  		for_each_cpu(t, smt_mask)
>  			cpu_rq(t)->core_enabled = enabled;
>  
> -		for_each_cpu(t, smt_mask)
> -			raw_spin_unlock(&cpu_rq(t)->__lock);
> -		local_irq_enable();
> +		sched_core_unlock(cpu, &flags);
>  
>  		cpumask_andnot(&sched_core_mask, &sched_core_mask, smt_mask);
>  	}
> @@ -5979,35 +5993,109 @@ void queue_core_balance(struct rq *rq)
>  	queue_balance_callback(rq, &per_cpu(core_balance_head, rq->cpu), sched_core_balance);
>  }
>  
> -static inline void sched_core_cpu_starting(unsigned int cpu)
> +static void sched_core_cpu_starting(unsigned int cpu)
>  {
>  	const struct cpumask *smt_mask = cpu_smt_mask(cpu);
> -	struct rq *rq, *core_rq = NULL;
> -	int i;
> +	struct rq *rq = cpu_rq(cpu), *core_rq = NULL;
> +	unsigned long flags;
> +	int t;
>  
> -	core_rq = cpu_rq(cpu)->core;
> +	sched_core_lock(cpu, &flags);
>  
> -	if (!core_rq) {
> -		for_each_cpu(i, smt_mask) {
> -			rq = cpu_rq(i);
> -			if (rq->core && rq->core == rq)
> -				core_rq = rq;
> +	WARN_ON_ONCE(rq->core != rq);
> +
> +	/* if we're the first, we'll be our own leader */
> +	if (cpumask_weight(smt_mask) == 1)
> +		goto unlock;
> +
> +	/* find the leader */
> +	for_each_cpu(t, smt_mask) {
> +		if (t == cpu)
> +			continue;
> +		rq = cpu_rq(t);
> +		if (rq->core == rq) {
> +			core_rq = rq;
> +			break;
>  		}
> +	}
>  
> -		if (!core_rq)
> -			core_rq = cpu_rq(cpu);
> +	if (WARN_ON_ONCE(!core_rq)) /* whoopsie */
> +		goto unlock;
>  
> -		for_each_cpu(i, smt_mask) {
> -			rq = cpu_rq(i);
> +	/* install and validate core_rq */
> +	for_each_cpu(t, smt_mask) {
> +		rq = cpu_rq(t);
>  
> -			WARN_ON_ONCE(rq->core && rq->core != core_rq);
> +		if (t == cpu)
>  			rq->core = core_rq;
> -		}
> +
> +		WARN_ON_ONCE(rq->core != core_rq);
>  	}

Yoh, Peter!

I have read this quite a few times and stayed with it long enough to
notice something I am not sure about. You are not only fixing the
problem, you are also changing the way an rq's core is selected. The
original code pointed every rq's core at one single rq; here the
core_rq ends up distributed across more than just one rq.

For example (SMT3): rq1 --> rq0, rq0 --> rq2, rq2 --> rq2 (/* whoopsie */).

And the core can also change on deactivate.
I am not sure I have all of this right, but I imagine this is more
efficient, since lock contention is avoided when picking a task.
Again, I do not know much about all of this, just putting down some words.

> +
> +unlock:
> +	sched_core_unlock(cpu, &flags);
>  }
> +
> +static void sched_core_cpu_deactivate(unsigned int cpu)
> +{
> +	const struct cpumask *smt_mask = cpu_smt_mask(cpu);
> +	struct rq *rq = cpu_rq(cpu), *core_rq = NULL;
> +	unsigned long flags;
> +	int t;
> +
> +	sched_core_lock(cpu, &flags);
> +
> +	/* if we're the last man standing, nothing to do */
> +	if (cpumask_weight(smt_mask) == 1) {
> +		WARN_ON_ONCE(rq->core != rq);
> +		goto unlock;
> +	}
> +
> +	/* if we're not the leader, nothing to do */
> +	if (rq->core != rq)
> +		goto unlock;
> +
> +	/* find a new leader */
> +	for_each_cpu(t, smt_mask) {
> +		if (t == cpu)
> +			continue;
> +		core_rq = cpu_rq(t);
> +		break;
> +	}
> +
> +	if (WARN_ON_ONCE(!core_rq)) /* impossible */
> +		goto unlock;
> +
> +	/* copy the shared state to the new leader */
> +	core_rq->core_task_seq      = rq->core_task_seq;
> +	core_rq->core_pick_seq      = rq->core_pick_seq;
> +	core_rq->core_cookie        = rq->core_cookie;
> +	core_rq->core_forceidle     = rq->core_forceidle;
> +	core_rq->core_forceidle_seq = rq->core_forceidle_seq;
> +
> +	/* install new leader */
> +	for_each_cpu(t, smt_mask) {
> +		rq = cpu_rq(t);
> +		rq->core = core_rq;
> +	}
> +
> +unlock:
> +	sched_core_unlock(cpu, &flags);
> +}
> +
> +static inline void sched_core_cpu_dying(unsigned int cpu)
> +{
> +	struct rq *rq = cpu_rq(cpu);
> +
> +	if (rq->core != rq)
> +		rq->core = rq;
> +}
> +
>  #else /* !CONFIG_SCHED_CORE */
>  
>  static inline void sched_core_cpu_starting(unsigned int cpu) {}
> +static inline void sched_core_cpu_deactivate(unsigned int cpu) {}
> +static inline void sched_core_cpu_dying(unsigned int cpu) {}
>  
>  static struct task_struct *
>  pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> @@ -9009,6 +9097,8 @@ int sched_cpu_deactivate(unsigned int cpu)
>  	 */
>  	if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
>  		static_branch_dec_cpuslocked(&sched_smt_present);
> +
> +	sched_core_cpu_deactivate(cpu);
>  #endif
>  
>  	if (!sched_smp_initialized)
> @@ -9113,6 +9203,7 @@ int sched_cpu_dying(unsigned int cpu)
>  	calc_load_migrate(rq);
>  	update_max_interval();
>  	hrtick_clear(rq);
> +	sched_core_cpu_dying(cpu);
>  	return 0;
>  }
>  #endif
> @@ -9324,7 +9415,7 @@ void __init sched_init(void)
>  		atomic_set(&rq->nr_iowait, 0);
>  
>  #ifdef CONFIG_SCHED_CORE
> -		rq->core = NULL;
> +		rq->core = rq;
>  		rq->core_pick = NULL;
>  		rq->core_enabled = 0;
>  		rq->core_tree = RB_ROOT;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index a9a660c6e08a..e6347c88c467 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1103,7 +1103,7 @@ struct rq {
>  	unsigned int		core_sched_seq;
>  	struct rb_root		core_tree;
>  
> -	/* shared state */
> +	/* shared state -- careful with sched_core_cpu_deactivate() */
>  	unsigned int		core_task_seq;
>  	unsigned int		core_pick_seq;
>  	unsigned long		core_cookie;


Thanks,
Tao

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH] sched: Fix Core-wide rq->lock for uninitialized CPUs
  2021-08-19 11:09         ` [PATCH] sched: Fix Core-wide rq->lock for uninitialized CPUs Peter Zijlstra
  2021-08-19 15:50           ` Tao Zhou
@ 2021-08-19 16:19           ` Eugene Syromiatnikov
  2021-08-20  0:18           ` Josh Don
  2021-08-23  9:07           ` [tip: sched/urgent] " tip-bot2 for Peter Zijlstra
  3 siblings, 0 replies; 103+ messages in thread
From: Eugene Syromiatnikov @ 2021-08-19 16:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: joel, chris.hyser, joshdon, mingo, vincent.guittot,
	valentin.schneider, mgorman, linux-kernel, tglx,
	Christian Brauner, ldv

On Thu, Aug 19, 2021 at 01:09:17PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 18, 2021 at 01:17:34AM +0200, Eugene Syromiatnikov wrote:
> > On Tue, Aug 17, 2021 at 05:52:43PM +0200, Peter Zijlstra wrote:
> > > Urgh... lemme guess, your HP BIOS is funny and reports more possible
> > > CPUs than you actually have resulting in cpu_possible_mask !=
> > > cpu_online_mask. Alternatively, you booted with nr_cpus= or something
> > > daft like that.
> > 
> > Yep, it seems to be the case:
> > 
> >     # cat /sys/devices/system/cpu/possible
> >     0-7
> >     # cat /sys/devices/system/cpu/online
> >     0-3
> > 
> 
> I think the below should work... can you please verify?

Yes, it no longer oops'es now, thank you!

    # cat /sys/devices/system/cpu/possible
    0-7
    # cat /sys/devices/system/cpu/online
    0-3
    # ./prctl-sched-core-oops-repro
    Iteration 0 status: 0
    Iteration 1 status: 0
    # ../src/strace -fvq -eprctl,clone,setsid -esignal=none ./prctl-sched-core-oops-repro
    clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f510b1c7890) = 108328
    [pid 108328] setsid()                   = 108328
    [pid 108328] +++ exited with 0 +++
    Iteration 0 status: 0
    prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 108324, 0x2 /* PIDTYPE_PGID */, NULL) = 0
    clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f510b1c7890) = 108329
    [pid 108329] setsid()                   = 108329
    [pid 108329] +++ exited with 0 +++
    Iteration 1 status: 0
    prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 108324, 0x2 /* PIDTYPE_PGID */, NULL) = 0
    +++ exited with 0 +++
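
For reference, a minimal reconstruction of what such a reproducer could
look like, based purely on the strace output above (the actual
prctl-sched-core-oops-repro source is not shown here, and the prctl
constants are spelled out in case the installed uapi headers predate
them):

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <sys/prctl.h>

    #ifndef PR_SCHED_CORE
    # define PR_SCHED_CORE        62
    # define PR_SCHED_CORE_CREATE  1
    #endif
    #define SCOPE_PGID             2   /* PIDTYPE_PGID, as decoded by strace */

    int main(void)
    {
        for (int i = 0; i < 2; i++) {
            pid_t pid = fork();

            if (pid == 0) {         /* child: new session, then exit */
                setsid();
                return 0;
            }

            int status = -1;
            waitpid(pid, &status, 0);
            printf("Iteration %d status: %d\n", i, status);

            /* tag our own process group with a core-scheduling cookie;
             * the traced run passed the parent's pid as the target */
            if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, getpid(),
                      SCOPE_PGID, 0))
                perror("prctl(PR_SCHED_CORE)");
        }
        return 0;
    }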

> ---
> Subject: sched: Fix Core-wide rq->lock for uninitialized CPUs
> 
> Eugene tripped over the case where rq_lock(), as called in a
> for_each_possible_cpu() loop, came apart because rq->core hadn't been
> set up yet.
> 
> This is a somewhat unusual, but valid case.
> 
> Rework things such that rq->core is initialized to point at itself. IOW,
> initialize each CPU as a single-threaded Core. CPU online will then join
> the new CPU (thread) to an existing Core where needed.
> 
> For completeness' sake, have CPU offline fully undo the state so as not
> to presume the topology will match the next time it comes online.
> 
> Fixes: 9edeaea1bc45 ("sched: Core-wide rq->lock")
> Reported-by: Eugene Syromiatnikov <esyr@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Tested-by: Eugene Syromiatnikov <esyr@redhat.com>


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH] sched: Fix Core-wide rq->lock for uninitialized CPUs
  2021-08-19 11:09         ` [PATCH] sched: Fix Core-wide rq->lock for uninitialized CPUs Peter Zijlstra
  2021-08-19 15:50           ` Tao Zhou
  2021-08-19 16:19           ` Eugene Syromiatnikov
@ 2021-08-20  0:18           ` Josh Don
  2021-08-20 10:02             ` Peter Zijlstra
  2021-08-23  9:07           ` [tip: sched/urgent] " tip-bot2 for Peter Zijlstra
  3 siblings, 1 reply; 103+ messages in thread
From: Josh Don @ 2021-08-20  0:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Eugene Syromiatnikov, Joel Fernandes, Hyser,Chris, Ingo Molnar,
	Vincent Guittot, Valentin Schneider, Mel Gorman, linux-kernel,
	Thomas Gleixner, Christian Brauner, ldv

Hi Peter,

On Thu, Aug 19, 2021 at 4:09 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> -       /* shared state */
> +       /* shared state -- careful with sched_core_cpu_deactivate() */

Could also throw these fields into a wrapping struct. Either way seems fine.
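
Roughly what that could look like (illustrative only, not part of the
patch; the struct and field names below are made up), with the copy in
sched_core_cpu_deactivate() collapsing into a single assignment:

    struct rq_core_state {
        unsigned int    core_task_seq;
        unsigned int    core_pick_seq;
        unsigned long   core_cookie;
        /* ... core_forceidle, core_forceidle_seq ... */
    };

    /* in struct rq:                     struct rq_core_state core_state; */
    /* in sched_core_cpu_deactivate():   core_rq->core_state = rq->core_state; */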

Reviewed-by: Josh Don <joshdon@google.com>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH] sched: Fix Core-wide rq->lock for uninitialized CPUs
  2021-08-20  0:18           ` Josh Don
@ 2021-08-20 10:02             ` Peter Zijlstra
  0 siblings, 0 replies; 103+ messages in thread
From: Peter Zijlstra @ 2021-08-20 10:02 UTC (permalink / raw)
  To: Josh Don
  Cc: Eugene Syromiatnikov, Joel Fernandes, Hyser,Chris, Ingo Molnar,
	Vincent Guittot, Valentin Schneider, Mel Gorman, linux-kernel,
	Thomas Gleixner, Christian Brauner, ldv

On Thu, Aug 19, 2021 at 05:18:22PM -0700, Josh Don wrote:
> Hi Peter,
> 
> On Thu, Aug 19, 2021 at 4:09 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > -       /* shared state */
> > +       /* shared state -- careful with sched_core_cpu_deactivate() */
> 
> Could also throw these fields into a wrapping struct. Either way seems fine.
> 
> Reviewed-by: Josh Don <joshdon@google.com>

Yes, I considered that, but didn't want the churn at this time.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [tip: sched/urgent] sched: Fix Core-wide rq->lock for uninitialized CPUs
  2021-08-19 11:09         ` [PATCH] sched: Fix Core-wide rq->lock for uninitialized CPUs Peter Zijlstra
                             ` (2 preceding siblings ...)
  2021-08-20  0:18           ` Josh Don
@ 2021-08-23  9:07           ` tip-bot2 for Peter Zijlstra
  3 siblings, 0 replies; 103+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2021-08-23  9:07 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Eugene Syromiatnikov, Peter Zijlstra (Intel),
	Josh Don, x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     3c474b3239f12fe0b00d7e82481f36a1f31e79ab
Gitweb:        https://git.kernel.org/tip/3c474b3239f12fe0b00d7e82481f36a1f31e79ab
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Thu, 19 Aug 2021 13:09:17 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 20 Aug 2021 12:32:53 +02:00

sched: Fix Core-wide rq->lock for uninitialized CPUs

Eugene tripped over the case where rq_lock(), as called in a
for_each_possible_cpu() loop, came apart because rq->core hadn't been
set up yet.

This is a somewhat unusual, but valid case.

Rework things such that rq->core is initialized to point at itself. IOW,
initialize each CPU as a single-threaded Core. CPU online will then join
the new CPU (thread) to an existing Core where needed.

For completeness' sake, have CPU offline fully undo the state so as not
to presume the topology will match the next time it comes online.

Fixes: 9edeaea1bc45 ("sched: Core-wide rq->lock")
Reported-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josh Don <joshdon@google.com>
Tested-by: Eugene Syromiatnikov <esyr@redhat.com>
Link: https://lkml.kernel.org/r/YR473ZGeKqMs6kw+@hirez.programming.kicks-ass.net
---
 kernel/sched/core.c  | 143 ++++++++++++++++++++++++++++++++++--------
 kernel/sched/sched.h |   2 +-
 2 files changed, 118 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 20ffcc0..f3b27c6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -237,9 +237,30 @@ static DEFINE_MUTEX(sched_core_mutex);
 static atomic_t sched_core_count;
 static struct cpumask sched_core_mask;
 
+static void sched_core_lock(int cpu, unsigned long *flags)
+{
+	const struct cpumask *smt_mask = cpu_smt_mask(cpu);
+	int t, i = 0;
+
+	local_irq_save(*flags);
+	for_each_cpu(t, smt_mask)
+		raw_spin_lock_nested(&cpu_rq(t)->__lock, i++);
+}
+
+static void sched_core_unlock(int cpu, unsigned long *flags)
+{
+	const struct cpumask *smt_mask = cpu_smt_mask(cpu);
+	int t;
+
+	for_each_cpu(t, smt_mask)
+		raw_spin_unlock(&cpu_rq(t)->__lock);
+	local_irq_restore(*flags);
+}
+
 static void __sched_core_flip(bool enabled)
 {
-	int cpu, t, i;
+	unsigned long flags;
+	int cpu, t;
 
 	cpus_read_lock();
 
@@ -250,19 +271,12 @@ static void __sched_core_flip(bool enabled)
 	for_each_cpu(cpu, &sched_core_mask) {
 		const struct cpumask *smt_mask = cpu_smt_mask(cpu);
 
-		i = 0;
-		local_irq_disable();
-		for_each_cpu(t, smt_mask) {
-			/* supports up to SMT8 */
-			raw_spin_lock_nested(&cpu_rq(t)->__lock, i++);
-		}
+		sched_core_lock(cpu, &flags);
 
 		for_each_cpu(t, smt_mask)
 			cpu_rq(t)->core_enabled = enabled;
 
-		for_each_cpu(t, smt_mask)
-			raw_spin_unlock(&cpu_rq(t)->__lock);
-		local_irq_enable();
+		sched_core_unlock(cpu, &flags);
 
 		cpumask_andnot(&sched_core_mask, &sched_core_mask, smt_mask);
 	}
@@ -5736,35 +5750,109 @@ void queue_core_balance(struct rq *rq)
 	queue_balance_callback(rq, &per_cpu(core_balance_head, rq->cpu), sched_core_balance);
 }
 
-static inline void sched_core_cpu_starting(unsigned int cpu)
+static void sched_core_cpu_starting(unsigned int cpu)
 {
 	const struct cpumask *smt_mask = cpu_smt_mask(cpu);
-	struct rq *rq, *core_rq = NULL;
-	int i;
+	struct rq *rq = cpu_rq(cpu), *core_rq = NULL;
+	unsigned long flags;
+	int t;
 
-	core_rq = cpu_rq(cpu)->core;
+	sched_core_lock(cpu, &flags);
 
-	if (!core_rq) {
-		for_each_cpu(i, smt_mask) {
-			rq = cpu_rq(i);
-			if (rq->core && rq->core == rq)
-				core_rq = rq;
+	WARN_ON_ONCE(rq->core != rq);
+
+	/* if we're the first, we'll be our own leader */
+	if (cpumask_weight(smt_mask) == 1)
+		goto unlock;
+
+	/* find the leader */
+	for_each_cpu(t, smt_mask) {
+		if (t == cpu)
+			continue;
+		rq = cpu_rq(t);
+		if (rq->core == rq) {
+			core_rq = rq;
+			break;
 		}
+	}
 
-		if (!core_rq)
-			core_rq = cpu_rq(cpu);
+	if (WARN_ON_ONCE(!core_rq)) /* whoopsie */
+		goto unlock;
 
-		for_each_cpu(i, smt_mask) {
-			rq = cpu_rq(i);
+	/* install and validate core_rq */
+	for_each_cpu(t, smt_mask) {
+		rq = cpu_rq(t);
 
-			WARN_ON_ONCE(rq->core && rq->core != core_rq);
+		if (t == cpu)
 			rq->core = core_rq;
-		}
+
+		WARN_ON_ONCE(rq->core != core_rq);
 	}
+
+unlock:
+	sched_core_unlock(cpu, &flags);
 }
+
+static void sched_core_cpu_deactivate(unsigned int cpu)
+{
+	const struct cpumask *smt_mask = cpu_smt_mask(cpu);
+	struct rq *rq = cpu_rq(cpu), *core_rq = NULL;
+	unsigned long flags;
+	int t;
+
+	sched_core_lock(cpu, &flags);
+
+	/* if we're the last man standing, nothing to do */
+	if (cpumask_weight(smt_mask) == 1) {
+		WARN_ON_ONCE(rq->core != rq);
+		goto unlock;
+	}
+
+	/* if we're not the leader, nothing to do */
+	if (rq->core != rq)
+		goto unlock;
+
+	/* find a new leader */
+	for_each_cpu(t, smt_mask) {
+		if (t == cpu)
+			continue;
+		core_rq = cpu_rq(t);
+		break;
+	}
+
+	if (WARN_ON_ONCE(!core_rq)) /* impossible */
+		goto unlock;
+
+	/* copy the shared state to the new leader */
+	core_rq->core_task_seq      = rq->core_task_seq;
+	core_rq->core_pick_seq      = rq->core_pick_seq;
+	core_rq->core_cookie        = rq->core_cookie;
+	core_rq->core_forceidle     = rq->core_forceidle;
+	core_rq->core_forceidle_seq = rq->core_forceidle_seq;
+
+	/* install new leader */
+	for_each_cpu(t, smt_mask) {
+		rq = cpu_rq(t);
+		rq->core = core_rq;
+	}
+
+unlock:
+	sched_core_unlock(cpu, &flags);
+}
+
+static inline void sched_core_cpu_dying(unsigned int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	if (rq->core != rq)
+		rq->core = rq;
+}
+
 #else /* !CONFIG_SCHED_CORE */
 
 static inline void sched_core_cpu_starting(unsigned int cpu) {}
+static inline void sched_core_cpu_deactivate(unsigned int cpu) {}
+static inline void sched_core_cpu_dying(unsigned int cpu) {}
 
 static struct task_struct *
 pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
@@ -8707,6 +8795,8 @@ int sched_cpu_deactivate(unsigned int cpu)
 	 */
 	if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
 		static_branch_dec_cpuslocked(&sched_smt_present);
+
+	sched_core_cpu_deactivate(cpu);
 #endif
 
 	if (!sched_smp_initialized)
@@ -8811,6 +8901,7 @@ int sched_cpu_dying(unsigned int cpu)
 	calc_load_migrate(rq);
 	update_max_interval();
 	hrtick_clear(rq);
+	sched_core_cpu_dying(cpu);
 	return 0;
 }
 #endif
@@ -9022,7 +9113,7 @@ void __init sched_init(void)
 		atomic_set(&rq->nr_iowait, 0);
 
 #ifdef CONFIG_SCHED_CORE
-		rq->core = NULL;
+		rq->core = rq;
 		rq->core_pick = NULL;
 		rq->core_enabled = 0;
 		rq->core_tree = RB_ROOT;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 14a41a2..da4295f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1093,7 +1093,7 @@ struct rq {
 	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
 
-	/* shared state */
+	/* shared state -- careful with sched_core_cpu_deactivate() */
 	unsigned int		core_task_seq;
 	unsigned int		core_pick_seq;
 	unsigned long		core_cookie;

^ permalink raw reply related	[flat|nested] 103+ messages in thread

end of thread, other threads:[~2021-08-23  9:08 UTC | newest]

Thread overview: 103+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-22 12:04 [PATCH 00/19] sched: Core Scheduling Peter Zijlstra
2021-04-22 12:05 ` [PATCH 01/19] sched/fair: Add a few assertions Peter Zijlstra
2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2021-05-13  8:56     ` Ning, Hongyu
2021-04-22 12:05 ` [PATCH 02/19] sched: Provide raw_spin_rq_*lock*() helpers Peter Zijlstra
2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2021-04-22 12:05 ` [PATCH 03/19] sched: Wrap rq::lock access Peter Zijlstra
2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2021-04-22 12:05 ` [PATCH 04/19] sched: Prepare for Core-wide rq->lock Peter Zijlstra
2021-04-24  1:22   ` Josh Don
2021-04-26  8:31     ` Peter Zijlstra
2021-04-26 22:21       ` Josh Don
2021-04-27 17:10         ` Don Hiatt
2021-04-27 23:35           ` Josh Don
2021-04-28  1:03             ` Aubrey Li
2021-04-28  6:05               ` Aubrey Li
2021-04-28 10:57                 ` Aubrey Li
2021-04-28 16:41                   ` Don Hiatt
2021-04-29 20:48                     ` Josh Don
2021-04-29 21:09                       ` Don Hiatt
2021-04-29 23:22                         ` Josh Don
2021-04-30 16:18                           ` Don Hiatt
2021-04-30  8:26                         ` Aubrey Li
2021-04-28 16:04             ` Don Hiatt
2021-04-27 23:30         ` Josh Don
2021-04-28  9:13           ` Peter Zijlstra
2021-04-28 10:35             ` Aubrey Li
2021-04-28 11:03               ` Peter Zijlstra
2021-04-28 14:18                 ` Paul E. McKenney
2021-04-29 20:11             ` Josh Don
2021-05-03 19:17               ` Peter Zijlstra
2021-04-28  7:13         ` Peter Zijlstra
2021-04-28  6:02   ` Aubrey Li
2021-04-29  8:03   ` Aubrey Li
2021-04-29 20:39     ` Josh Don
2021-04-30  8:20       ` Aubrey Li
2021-04-30  8:48         ` Josh Don
2021-04-30 14:15           ` Aubrey Li
2021-05-04  7:38       ` Peter Zijlstra
2021-05-05 16:20         ` Don Hiatt
2021-05-06 10:25           ` Peter Zijlstra
2021-05-07  9:50   ` [PATCH v2 " Peter Zijlstra
2021-05-08  8:07     ` Aubrey Li
2021-05-12  9:07       ` Peter Zijlstra
2021-04-22 12:05 ` [PATCH 05/19] sched: " Peter Zijlstra
2021-05-07  9:50   ` [PATCH v2 " Peter Zijlstra
2021-05-12 10:28     ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2021-04-22 12:05 ` [PATCH 06/19] sched: Optimize rq_lockp() usage Peter Zijlstra
2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2021-04-22 12:05 ` [PATCH 07/19] sched: Allow sched_core_put() from atomic context Peter Zijlstra
2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2021-04-22 12:05 ` [PATCH 08/19] sched: Introduce sched_class::pick_task() Peter Zijlstra
2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2021-04-22 12:05 ` [PATCH 09/19] sched: Basic tracking of matching tasks Peter Zijlstra
2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2021-04-22 12:05 ` [PATCH 10/19] sched: Add core wide task selection and scheduling Peter Zijlstra
2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2021-04-22 12:05 ` [PATCH 11/19] sched/fair: Fix forced idle sibling starvation corner case Peter Zijlstra
2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Vineeth Pillai
2021-04-22 12:05 ` [PATCH 12/19] sched: Fix priority inversion of cookied task with sibling Peter Zijlstra
2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Joel Fernandes (Google)
2021-04-22 12:05 ` [PATCH 13/19] sched/fair: Snapshot the min_vruntime of CPUs on force idle Peter Zijlstra
2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Joel Fernandes (Google)
2021-04-22 12:05 ` [PATCH 14/19] sched: Trivial forced-newidle balancer Peter Zijlstra
2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2021-04-22 12:05 ` [PATCH 15/19] sched: Migration changes for core scheduling Peter Zijlstra
2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Aubrey Li
2021-04-22 12:05 ` [PATCH 16/19] sched: Trivial core scheduling cookie management Peter Zijlstra
2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2021-04-22 12:05 ` [PATCH 17/19] sched: Inherit task cookie on fork() Peter Zijlstra
2021-05-10 16:06   ` Joel Fernandes
2021-05-10 16:22     ` Chris Hyser
2021-05-10 20:47       ` Joel Fernandes
2021-05-10 21:38         ` Chris Hyser
2021-05-12  9:05           ` Peter Zijlstra
2021-05-12 20:20             ` Josh Don
2021-05-12 21:07               ` Don Hiatt
2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2021-04-22 12:05 ` [PATCH 18/19] sched: prctl() core-scheduling interface Peter Zijlstra
2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Chris Hyser
2021-06-14 23:36   ` [PATCH 18/19] " Josh Don
2021-06-15 11:31     ` Joel Fernandes
2021-08-05 16:53   ` Eugene Syromiatnikov
2021-08-05 17:00     ` Peter Zijlstra
2021-08-17 15:15   ` Eugene Syromiatnikov
2021-08-17 15:52     ` Peter Zijlstra
2021-08-17 23:17       ` Eugene Syromiatnikov
2021-08-19 11:09         ` [PATCH] sched: Fix Core-wide rq->lock for uninitialized CPUs Peter Zijlstra
2021-08-19 15:50           ` Tao Zhou
2021-08-19 16:19           ` Eugene Syromiatnikov
2021-08-20  0:18           ` Josh Don
2021-08-20 10:02             ` Peter Zijlstra
2021-08-23  9:07           ` [tip: sched/urgent] " tip-bot2 for Peter Zijlstra
2021-04-22 12:05 ` [PATCH 19/19] kselftest: Add test for core sched prctl interface Peter Zijlstra
2021-05-12 10:28   ` [tip: sched/core] " tip-bot2 for Chris Hyser
2021-04-22 16:43 ` [PATCH 00/19] sched: Core Scheduling Don Hiatt
2021-04-22 17:29   ` Peter Zijlstra
2021-04-30  6:47 ` Ning, Hongyu
2021-05-06 10:29   ` Peter Zijlstra
2021-05-06 12:53     ` Ning, Hongyu
2021-05-07 18:02 ` Joel Fernandes
2021-05-10 16:16 ` Vincent Guittot
2021-05-11  7:00   ` Vincent Guittot
