All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH -v3 0/9] sched: Fix hot-unplug regression
@ 2021-01-21 10:17 Peter Zijlstra
  2021-01-21 10:17 ` [PATCH -v3 1/9] sched/core: Print out straggler tasks in sched_cpu_dying() Peter Zijlstra
                   ` (10 more replies)
  0 siblings, 11 replies; 25+ messages in thread
From: Peter Zijlstra @ 2021-01-21 10:17 UTC (permalink / raw)
  To: mingo, tglx
  Cc: linux-kernel, jiangshanlai, valentin.schneider, cai,
	vincent.donnefort, decui, paulmck, vincent.guittot, rostedt, tj,
	peterz

Hi,

Some cautious optimism lets me post v3 of these patches. They (knock on wood)
fix the regression introduced by commit:

  1cf12e08bc4d ("sched/hotplug: Consolidate task migration on CPU unplug")

These patches survived overnight runs for both me and Valentin, but I'll let it
run for at least another 12 hours before committing these patches.

New in this version is patch #7.

Much thanks to Valentin for his continued support and debugging efforts.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH -v3 1/9] sched/core: Print out straggler tasks in sched_cpu_dying()
  2021-01-21 10:17 [PATCH -v3 0/9] sched: Fix hot-unplug regression Peter Zijlstra
@ 2021-01-21 10:17 ` Peter Zijlstra
  2021-01-21 10:17 ` [PATCH -v3 2/9] workqueue: Use cpu_possible_mask instead of cpu_active_mask to break affinity Peter Zijlstra
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2021-01-21 10:17 UTC (permalink / raw)
  To: mingo, tglx
  Cc: linux-kernel, jiangshanlai, valentin.schneider, cai,
	vincent.donnefort, decui, paulmck, vincent.guittot, rostedt, tj,
	peterz

From: Valentin Schneider <valentin.schneider@arm.com>

Since commit

  1cf12e08bc4d ("sched/hotplug: Consolidate task migration on CPU unplug")

tasks are expected to move themselves off an outgoing CPU. For most
tasks this will be done automagically via BALANCE_PUSH, but percpu kthreads
will have to cooperate and move themselves away one way or another.

Currently, some percpu kthreads (workqueues being a notable example) do not
cooperate nicely and can end up on an outgoing CPU at the time
sched_cpu_dying() is invoked.

Print the dying rq's tasks to shed some light on the stragglers.
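
For illustration only (the PIDs and names below are made up, and the
WARN() backtrace is elided), the dump produced by the printk()s below
would look roughly like:

  Dying CPU not properly vacated!
  CPU3 enqueued tasks (2 total):
  	pid: 26, name: migration/3
  	pid: 1523, name: kworker/3:2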

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210113183141.11974-1-valentin.schneider@arm.com
---
 kernel/sched/core.c |   24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7580,6 +7580,25 @@ static void calc_load_migrate(struct rq
 		atomic_long_add(delta, &calc_load_tasks);
 }
 
+static void dump_rq_tasks(struct rq *rq, const char *loglvl)
+{
+	struct task_struct *g, *p;
+	int cpu = cpu_of(rq);
+
+	lockdep_assert_held(&rq->lock);
+
+	printk("%sCPU%d enqueued tasks (%u total):\n", loglvl, cpu, rq->nr_running);
+	for_each_process_thread(g, p) {
+		if (task_cpu(p) != cpu)
+			continue;
+
+		if (!task_on_rq_queued(p))
+			continue;
+
+		printk("%s\tpid: %d, name: %s\n", loglvl, p->pid, p->comm);
+	}
+}
+
 int sched_cpu_dying(unsigned int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -7589,7 +7608,10 @@ int sched_cpu_dying(unsigned int cpu)
 	sched_tick_stop(cpu);
 
 	rq_lock_irqsave(rq, &rf);
-	BUG_ON(rq->nr_running != 1 || rq_has_pinned_tasks(rq));
+	if (rq->nr_running != 1 || rq_has_pinned_tasks(rq)) {
+		WARN(true, "Dying CPU not properly vacated!");
+		dump_rq_tasks(rq, KERN_WARNING);
+	}
 	rq_unlock_irqrestore(rq, &rf);
 
 	calc_load_migrate(rq);



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH -v3 2/9] workqueue: Use cpu_possible_mask instead of cpu_active_mask to break affinity
  2021-01-21 10:17 [PATCH -v3 0/9] sched: Fix hot-unplug regression Peter Zijlstra
  2021-01-21 10:17 ` [PATCH -v3 1/9] sched/core: Print out straggler tasks in sched_cpu_dying() Peter Zijlstra
@ 2021-01-21 10:17 ` Peter Zijlstra
  2021-01-21 10:17 ` [PATCH -v3 3/9] sched: Dont run cpu-online with balance_push() enabled Peter Zijlstra
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2021-01-21 10:17 UTC (permalink / raw)
  To: mingo, tglx
  Cc: linux-kernel, jiangshanlai, valentin.schneider, cai,
	vincent.donnefort, decui, paulmck, vincent.guittot, rostedt, tj,
	peterz, Lai Jiangshan

From: Lai Jiangshan <laijs@linux.alibaba.com>

The scheduler won't break affinity for us any more, so we should
"emulate" the behavior it used to provide when it broke affinity for
us, which is changing the cpumask to cpu_possible_mask.

Also, other CPUs might come online later while the worker is still
running with pending work items.  The worker should be allowed to use
those later-onlined CPUs as before and process the work items ASAP.
Using cpu_active_mask here cannot achieve this goal, but using
cpu_possible_mask can.

Fixes: 06249738a41a ("workqueue: Manually break affinity on hotplug")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lkml.kernel.org/r/20210111152638.2417-4-jiangshanlai@gmail.com
---
 kernel/workqueue.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4920,7 +4920,7 @@ static void unbind_workers(int cpu)
 		raw_spin_unlock_irq(&pool->lock);
 
 		for_each_pool_worker(worker, pool)
-			WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, cpu_active_mask) < 0);
+			WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, cpu_possible_mask) < 0);
 
 		mutex_unlock(&wq_pool_attach_mutex);
 



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH -v3 3/9] sched: Dont run cpu-online with balance_push() enabled
  2021-01-21 10:17 [PATCH -v3 0/9] sched: Fix hot-unplug regression Peter Zijlstra
  2021-01-21 10:17 ` [PATCH -v3 1/9] sched/core: Print out straggler tasks in sched_cpu_dying() Peter Zijlstra
  2021-01-21 10:17 ` [PATCH -v3 2/9] workqueue: Use cpu_possible_mask instead of cpu_active_mask to break affinity Peter Zijlstra
@ 2021-01-21 10:17 ` Peter Zijlstra
  2021-01-21 14:00   ` Valentin Schneider
  2021-01-22 17:41   ` [tip: sched/urgent] sched: Don't " tip-bot2 for Peter Zijlstra
  2021-01-21 10:17 ` [PATCH -v3 4/9] kthread: Extract KTHREAD_IS_PER_CPU Peter Zijlstra
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 25+ messages in thread
From: Peter Zijlstra @ 2021-01-21 10:17 UTC (permalink / raw)
  To: mingo, tglx
  Cc: linux-kernel, jiangshanlai, valentin.schneider, cai,
	vincent.donnefort, decui, paulmck, vincent.guittot, rostedt, tj,
	peterz

We don't need to push tasks away when we come online; mark the push
complete right before the CPU dies.

XXX hotplug state machine has trouble with rollback here.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7320,10 +7320,12 @@ static void balance_push_set(int cpu, bo
 	struct rq_flags rf;
 
 	rq_lock_irqsave(rq, &rf);
-	if (on)
+	if (on) {
+		WARN_ON_ONCE(rq->balance_callback);
 		rq->balance_callback = &balance_push_callback;
-	else
+	} else if (rq->balance_callback == &balance_push_callback) {
 		rq->balance_callback = NULL;
+	}
 	rq_unlock_irqrestore(rq, &rf);
 }
 
@@ -7441,6 +7443,10 @@ int sched_cpu_activate(unsigned int cpu)
 	struct rq *rq = cpu_rq(cpu);
 	struct rq_flags rf;
 
+	/*
+	 * Make sure that when the hotplug state machine does a roll-back
+	 * we clear balance_push. Ideally that would happen earlier...
+	 */
 	balance_push_set(cpu, false);
 
 #ifdef CONFIG_SCHED_SMT
@@ -7608,6 +7614,12 @@ int sched_cpu_dying(unsigned int cpu)
 	}
 	rq_unlock_irqrestore(rq, &rf);
 
+	/*
+	 * Should really be after we clear cpu_online(), but we're in
+	 * stop_machine(), so it all works.
+	 */
+	balance_push_set(cpu, false);
+
 	calc_load_migrate(rq);
 	update_max_interval();
 	nohz_balance_exit_idle(rq);



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH -v3 4/9] kthread: Extract KTHREAD_IS_PER_CPU
  2021-01-21 10:17 [PATCH -v3 0/9] sched: Fix hot-unplug regression Peter Zijlstra
                   ` (2 preceding siblings ...)
  2021-01-21 10:17 ` [PATCH -v3 3/9] sched: Dont run cpu-online with balance_push() enabled Peter Zijlstra
@ 2021-01-21 10:17 ` Peter Zijlstra
  2021-01-22 17:41   ` [tip: sched/urgent] " tip-bot2 for Peter Zijlstra
  2021-01-21 10:17 ` [PATCH -v3 5/9] workqueue: Tag bound workers with KTHREAD_IS_PER_CPU Peter Zijlstra
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2021-01-21 10:17 UTC (permalink / raw)
  To: mingo, tglx
  Cc: linux-kernel, jiangshanlai, valentin.schneider, cai,
	vincent.donnefort, decui, paulmck, vincent.guittot, rostedt, tj,
	peterz

There is a need to distinguish genuine per-cpu kthreads from kthreads
that happen to have a single CPU affinity.

Genuine per-cpu kthreads are kthreads that are CPU affine for
correctness; these will obviously have PF_KTHREAD set, but must also
have PF_NO_SETAFFINITY set, lest userspace modify their affinity and
ruin things.

However, these two things are not sufficient: PF_NO_SETAFFINITY is
also set on other tasks that have their affinities controlled through
other means, like for instance workqueues.

Therefore another bit is needed; it turns out kthread_create_on_cpu()
already has such a bit: KTHREAD_IS_PER_CPU, which is used to make
kthread_park()/kthread_unpark() work correctly.

Expose this flag and remove the implicit setting of it from
kthread_create_on_cpu(); the io_uring usage of it seems dubious at
best.
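
As an illustrative sketch only (the helper name below is made up and
not part of this patch), the later scheduler patches combine the
existing PF_KTHREAD flag with the new kthread_is_per_cpu() helper
roughly like so:

  /* Illustrative only: identify a genuine per-cpu kthread. */
  static bool genuine_per_cpu_kthread(struct task_struct *p)
  {
  	/* Not a kernel thread at all. */
  	if (!(p->flags & PF_KTHREAD))
  		return false;

  	/*
  	 * PF_NO_SETAFFINITY alone is not enough, workqueue workers
  	 * have it too; rely on the explicit KTHREAD_IS_PER_CPU bit.
  	 */
  	return kthread_is_per_cpu(p);
  }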

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/kthread.h |    3 +++
 kernel/kthread.c        |   30 ++++++++++++++++++++++++++++--
 kernel/smpboot.c        |    1 +
 3 files changed, 32 insertions(+), 2 deletions(-)

--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -33,6 +33,9 @@ struct task_struct *kthread_create_on_cp
 					  unsigned int cpu,
 					  const char *namefmt);
 
+void kthread_set_per_cpu(struct task_struct *k, int cpu);
+bool kthread_is_per_cpu(struct task_struct *k);
+
 /**
  * kthread_run - create and wake a thread.
  * @threadfn: the function to run until signal_pending(current).
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -493,11 +494,36 @@ struct task_struct *kthread_create_on_cp
 		return p;
 	kthread_bind(p, cpu);
 	/* CPU hotplug need to bind once again when unparking the thread. */
-	set_bit(KTHREAD_IS_PER_CPU, &to_kthread(p)->flags);
 	to_kthread(p)->cpu = cpu;
 	return p;
 }
 
+void kthread_set_per_cpu(struct task_struct *k, int cpu)
+{
+	struct kthread *kthread = to_kthread(k);
+	if (!kthread)
+		return;
+
+	WARN_ON_ONCE(!(k->flags & PF_NO_SETAFFINITY));
+
+	if (cpu < 0) {
+		clear_bit(KTHREAD_IS_PER_CPU, &kthread->flags);
+		return;
+	}
+
+	kthread->cpu = cpu;
+	set_bit(KTHREAD_IS_PER_CPU, &kthread->flags);
+}
+
+bool kthread_is_per_cpu(struct task_struct *k)
+{
+	struct kthread *kthread = to_kthread(k);
+	if (!kthread)
+		return false;
+
+	return test_bit(KTHREAD_IS_PER_CPU, &kthread->flags);
+}
+
 /**
  * kthread_unpark - unpark a thread created by kthread_create().
  * @k:		thread created by kthread_create().
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -188,6 +188,7 @@ __smpboot_create_thread(struct smp_hotpl
 		kfree(td);
 		return PTR_ERR(tsk);
 	}
+	kthread_set_per_cpu(tsk, cpu);
 	/*
 	 * Park the thread so that it could start right on the CPU
 	 * when it is available.



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH -v3 5/9] workqueue: Tag bound workers with KTHREAD_IS_PER_CPU
  2021-01-21 10:17 [PATCH -v3 0/9] sched: Fix hot-unplug regression Peter Zijlstra
                   ` (3 preceding siblings ...)
  2021-01-21 10:17 ` [PATCH -v3 4/9] kthread: Extract KTHREAD_IS_PER_CPU Peter Zijlstra
@ 2021-01-21 10:17 ` Peter Zijlstra
  2021-01-21 14:31   ` Valentin Schneider
  2021-01-22 17:41   ` [tip: sched/urgent] " tip-bot2 for Peter Zijlstra
  2021-01-21 10:17 ` [PATCH -v3 6/9] workqueue: Restrict affinity change to rescuer Peter Zijlstra
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 25+ messages in thread
From: Peter Zijlstra @ 2021-01-21 10:17 UTC (permalink / raw)
  To: mingo, tglx
  Cc: linux-kernel, jiangshanlai, valentin.schneider, cai,
	vincent.donnefort, decui, paulmck, vincent.guittot, rostedt, tj,
	peterz

Mark the per-cpu workqueue workers as KTHREAD_IS_PER_CPU.

Workqueues have unfortunate semantics in that per-cpu workers are not
flushed and parked by default during hotplug; however, a subset does a
manual flush on hotplug and relies hard on them for correctness.

Therefore play silly games..

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/workqueue.c |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1861,6 +1861,8 @@ static void worker_attach_to_pool(struct
 	 */
 	if (pool->flags & POOL_DISASSOCIATED)
 		worker->flags |= WORKER_UNBOUND;
+	else
+		kthread_set_per_cpu(worker->task, pool->cpu);
 
 	list_add_tail(&worker->node, &pool->workers);
 	worker->pool = pool;
@@ -1883,6 +1885,7 @@ static void worker_detach_from_pool(stru
 
 	mutex_lock(&wq_pool_attach_mutex);
 
+	kthread_set_per_cpu(worker->task, -1);
 	list_del(&worker->node);
 	worker->pool = NULL;
 
@@ -4919,8 +4922,10 @@ static void unbind_workers(int cpu)
 
 		raw_spin_unlock_irq(&pool->lock);
 
-		for_each_pool_worker(worker, pool)
+		for_each_pool_worker(worker, pool) {
+			kthread_set_per_cpu(worker->task, -1);
 			WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, cpu_possible_mask) < 0);
+		}
 
 		mutex_unlock(&wq_pool_attach_mutex);
 
@@ -4972,9 +4977,11 @@ static void rebind_workers(struct worker
 	 * of all workers first and then clear UNBOUND.  As we're called
 	 * from CPU_ONLINE, the following shouldn't fail.
 	 */
-	for_each_pool_worker(worker, pool)
+	for_each_pool_worker(worker, pool) {
+		kthread_set_per_cpu(worker->task, pool->cpu);
 		WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
 						  pool->attrs->cpumask) < 0);
+	}
 
 	raw_spin_lock_irq(&pool->lock);
 



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH -v3 6/9] workqueue: Restrict affinity change to rescuer
  2021-01-21 10:17 [PATCH -v3 0/9] sched: Fix hot-unplug regression Peter Zijlstra
                   ` (4 preceding siblings ...)
  2021-01-21 10:17 ` [PATCH -v3 5/9] workqueue: Tag bound workers with KTHREAD_IS_PER_CPU Peter Zijlstra
@ 2021-01-21 10:17 ` Peter Zijlstra
  2021-01-22 17:41   ` [tip: sched/urgent] " tip-bot2 for Peter Zijlstra
  2021-01-21 10:17 ` [PATCH -v3 7/9] sched: Prepare to use balance_push in ttwu() Peter Zijlstra
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2021-01-21 10:17 UTC (permalink / raw)
  To: mingo, tglx
  Cc: linux-kernel, jiangshanlai, valentin.schneider, cai,
	vincent.donnefort, decui, paulmck, vincent.guittot, rostedt, tj,
	peterz

create_worker() will already set the right affinity using
kthread_bind_mask(), which means only the rescuer will need to change
its affinity.

However, during cpu-hot-unplug a regular task is not allowed to run on
an online&&!active CPU, as it would be pushed away quite aggressively.
We need KTHREAD_IS_PER_CPU to survive in that environment.

Therefore set the affinity after getting that magic flag.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/workqueue.c |    9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1849,12 +1849,6 @@ static void worker_attach_to_pool(struct
 	mutex_lock(&wq_pool_attach_mutex);
 
 	/*
-	 * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
-	 * online CPUs.  It'll be re-applied when any of the CPUs come up.
-	 */
-	set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
-
-	/*
 	 * The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains
 	 * stable across this function.  See the comments above the flag
 	 * definition for details.
@@ -1864,6 +1858,9 @@ static void worker_attach_to_pool(struct
 	else
 		kthread_set_per_cpu(worker->task, pool->cpu);
 
+	if (worker->rescue_wq)
+		set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
+
 	list_add_tail(&worker->node, &pool->workers);
 	worker->pool = pool;
 



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH -v3 7/9] sched: Prepare to use balance_push in ttwu()
  2021-01-21 10:17 [PATCH -v3 0/9] sched: Fix hot-unplug regression Peter Zijlstra
                   ` (5 preceding siblings ...)
  2021-01-21 10:17 ` [PATCH -v3 6/9] workqueue: Restrict affinity change to rescuer Peter Zijlstra
@ 2021-01-21 10:17 ` Peter Zijlstra
  2021-01-22 17:41   ` [tip: sched/urgent] " tip-bot2 for Peter Zijlstra
  2021-01-21 10:17 ` [PATCH -v3 8/9] sched: Fix CPU hotplug / tighten is_per_cpu_kthread() Peter Zijlstra
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2021-01-21 10:17 UTC (permalink / raw)
  To: mingo, tglx
  Cc: linux-kernel, jiangshanlai, valentin.schneider, cai,
	vincent.donnefort, decui, paulmck, vincent.guittot, rostedt, tj,
	peterz

In preparation for using the balance_push state in ttwu(), we need it to
provide a reliable and consistent state.

The immediate problem is that rq->balance_callback gets cleared every
schedule() and then re-set in the balance_push_callback() itself. This
is not a reliable signal, so add a variable that stays set during the
entire time.

Also move setting it before the synchronize_rcu() in
sched_cpu_deactivate(), such that we get guaranteed visibility to
ttwu(), which is a preempt-disable region.
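
To make the ordering concrete, an illustrative sketch (simplified, not
part of the patch) of the sequence this relies on; see the
sched_cpu_deactivate() hunk below:

  set_cpu_active(cpu, false);
  balance_push_set(cpu, true);
  synchronize_rcu();
  /*
   * Waits for every ttwu() already inside its preempt-disabled
   * section; any ttwu() starting after this is guaranteed to observe
   * !cpu_active(cpu) and rq->balance_push.
   */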

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c  |   18 +++++++++++++-----
 kernel/sched/sched.h |    1 +
 2 files changed, 14 insertions(+), 5 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7320,6 +7327,7 @@ static void balance_push_set(int cpu, bo
 	struct rq_flags rf;
 
 	rq_lock_irqsave(rq, &rf);
+	rq->balance_push = on;
 	if (on) {
 		WARN_ON_ONCE(rq->balance_callback);
 		rq->balance_callback = &balance_push_callback;
@@ -7489,17 +7497,17 @@ int sched_cpu_deactivate(unsigned int cp
 	int ret;
 
 	set_cpu_active(cpu, false);
+	balance_push_set(cpu, true);
+
 	/*
-	 * We've cleared cpu_active_mask, wait for all preempt-disabled and RCU
-	 * users of this state to go away such that all new such users will
-	 * observe it.
+	 * We've cleared cpu_active_mask / set balance_push, wait for all
+	 * preempt-disabled and RCU users of this state to go away such that
+	 * all new such users will observe it.
 	 *
 	 * Do sync before park smpboot threads to take care the rcu boost case.
 	 */
 	synchronize_rcu();
 
-	balance_push_set(cpu, true);
-
 	rq_lock_irqsave(rq, &rf);
 	if (rq->rd) {
 		update_rq_clock(rq);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -975,6 +975,7 @@ struct rq {
 	unsigned long		cpu_capacity_orig;
 
 	struct callback_head	*balance_callback;
+	unsigned char		balance_push;
 
 	unsigned char		nohz_idle_balance;
 	unsigned char		idle_balance;



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH -v3 8/9] sched: Fix CPU hotplug / tighten is_per_cpu_kthread()
  2021-01-21 10:17 [PATCH -v3 0/9] sched: Fix hot-unplug regression Peter Zijlstra
                   ` (6 preceding siblings ...)
  2021-01-21 10:17 ` [PATCH -v3 7/9] sched: Prepare to use balance_push in ttwu() Peter Zijlstra
@ 2021-01-21 10:17 ` Peter Zijlstra
  2021-01-21 14:01   ` Valentin Schneider
  2021-01-22 17:41   ` [tip: sched/urgent] " tip-bot2 for Peter Zijlstra
  2021-01-21 10:17 ` [PATCH -v3 9/9] sched: Relax the set_cpus_allowed_ptr() semantics Peter Zijlstra
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 25+ messages in thread
From: Peter Zijlstra @ 2021-01-21 10:17 UTC (permalink / raw)
  To: mingo, tglx
  Cc: linux-kernel, jiangshanlai, valentin.schneider, cai,
	vincent.donnefort, decui, paulmck, vincent.guittot, rostedt, tj,
	peterz

Prior to commit 1cf12e08bc4d ("sched/hotplug: Consolidate task
migration on CPU unplug") we'd leave tasks on the dying CPU, break
their affinity and force them off at the very end.

This scheme had to change in order to enable migrate_disable(). One
cannot wait for migrate_disable() to complete while stuck in
stop_machine(). Furthermore, since we need, at the very least, the
idle, hotplug and stop threads at any point before stop_machine(), we
can't break their affinity and/or push them away.

Under the assumption that all per-cpu kthreads are sanely handled by
CPU hotplug, the new code no longer breaks affinity or migrates any of
them (which then includes the critical ones above).

However, there's an important difference between genuine per-cpu
kthreads and kthreads that merely happen to have a single-CPU affinity,
and that difference is now lost. The latter class very much relies on
the forced affinity-breaking and migration semantics previously
provided.

Use the new kthread_is_per_cpu() infrastructure to tighten
is_per_cpu_kthread() and fix the hot-unplug problems stemming from the
change.
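
As an illustrative summary (not part of the patch) of the resulting
is_cpu_allowed() policy for an online && !active CPU:

  task class                        allowed on online && !active?
  ----------                        -----------------------------
  migration-disabled task           yes
  regular user task                 no
  kthread with KTHREAD_IS_PER_CPU   yes
  any other kthread                 yes while the CPU comes online,
                                    no once balance_push is set
                                    (CPU going down)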

Fixes: 1cf12e08bc4d ("sched/hotplug: Consolidate task migration on CPU unplug")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   32 ++++++++++++++++++++++++++++----
 1 file changed, 28 insertions(+), 4 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1796,13 +1796,28 @@ static inline bool rq_has_pinned_tasks(s
  */
 static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 {
+	/* When not in the task's cpumask, no point in looking further. */
 	if (!cpumask_test_cpu(cpu, p->cpus_ptr))
 		return false;
 
-	if (is_per_cpu_kthread(p) || is_migration_disabled(p))
+	/* migrate_disabled() must be allowed to finish. */
+	if (is_migration_disabled(p))
 		return cpu_online(cpu);
 
-	return cpu_active(cpu);
+	/* Non kernel threads are not allowed during either online or offline. */
+	if (!(p->flags & PF_KTHREAD))
+		return cpu_active(cpu);
+
+	/* KTHREAD_IS_PER_CPU is always allowed. */
+	if (kthread_is_per_cpu(p))
+		return cpu_online(cpu);
+
+	/* Regular kernel threads don't get to stay during offline. */
+	if (cpu_rq(cpu)->balance_push)
+		return false;
+
+	/* But are allowed during online. */
+	return cpu_online(cpu);
 }
 
 /*
@@ -3122,6 +3122,13 @@ bool cpus_share_cache(int this_cpu, int
 static inline bool ttwu_queue_cond(int cpu, int wake_flags)
 {
 	/*
+	 * Do not complicate things with the async wake_list while the CPU is
+	 * in hotplug state.
+	 */
+	if (!cpu_active(cpu))
+		return false;
+
+	/*
 	 * If the CPU does not share cache, then queue the task on the
 	 * remote rqs wakelist to avoid accessing remote data.
 	 */
@@ -7283,8 +7298,14 @@ static void balance_push(struct rq *rq)
 	/*
 	 * Both the cpu-hotplug and stop task are in this case and are
 	 * required to complete the hotplug process.
+	 *
+	 * XXX: the idle task does not match kthread_is_per_cpu() due to
+	 * histerical raisins.
 	 */
-	if (is_per_cpu_kthread(push_task) || is_migration_disabled(push_task)) {
+	if (rq->idle == push_task ||
+	    ((push_task->flags & PF_KTHREAD) && kthread_is_per_cpu(push_task)) ||
+	    is_migration_disabled(push_task)) {
+
 		/*
 		 * If this is the idle task on the outgoing CPU try to wake
 		 * up the hotplug control thread which might wait for the
@@ -7316,7 +7337,7 @@ static void balance_push(struct rq *rq)
 	/*
 	 * At this point need_resched() is true and we'll take the loop in
 	 * schedule(). The next pick is obviously going to be the stop task
-	 * which is_per_cpu_kthread() and will push this task away.
+	 * which kthread_is_per_cpu() and will push this task away.
 	 */
 	raw_spin_lock(&rq->lock);
 }
@@ -7504,6 +7525,9 @@ int sched_cpu_deactivate(unsigned int cp
 	 * preempt-disabled and RCU users of this state to go away such that
 	 * all new such users will observe it.
 	 *
+	 * Specifically, we rely on ttwu to no longer target this CPU, see
+	 * ttwu_queue_cond() and is_cpu_allowed().
+	 *
 	 * Do sync before park smpboot threads to take care the rcu boost case.
 	 */
 	synchronize_rcu();



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH -v3 9/9] sched: Relax the set_cpus_allowed_ptr() semantics
  2021-01-21 10:17 [PATCH -v3 0/9] sched: Fix hot-unplug regression Peter Zijlstra
                   ` (7 preceding siblings ...)
  2021-01-21 10:17 ` [PATCH -v3 8/9] sched: Fix CPU hotplug / tighten is_per_cpu_kthread() Peter Zijlstra
@ 2021-01-21 10:17 ` Peter Zijlstra
  2021-01-22 17:41   ` [tip: sched/urgent] " tip-bot2 for Peter Zijlstra
  2021-01-21 14:36 ` [PATCH -v3 0/9] sched: Fix hot-unplug regression Valentin Schneider
  2021-01-21 19:56 ` Paul E. McKenney
  10 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2021-01-21 10:17 UTC (permalink / raw)
  To: mingo, tglx
  Cc: linux-kernel, jiangshanlai, valentin.schneider, cai,
	vincent.donnefort, decui, paulmck, vincent.guittot, rostedt, tj,
	peterz

Now that we have KTHREAD_IS_PER_CPU to denote the critical per-cpu
tasks to retain during CPU offline, we can relax the warning in
set_cpus_allowed_ptr(). Any spurious kthread that wants to get on at
the last minute will get pushed off before it can run.

During CPU online there is no harm, and actual benefit, in allowing
kthreads back on early: it simplifies hotplug code and fixes a number
of outstanding races.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lai jiangshan <jiangshanlai@gmail.com>
---
 kernel/sched/core.c |   21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2342,7 +2342,9 @@ static int __set_cpus_allowed_ptr(struct
 
 	if (p->flags & PF_KTHREAD || is_migration_disabled(p)) {
 		/*
-		 * Kernel threads are allowed on online && !active CPUs.
+		 * Kernel threads are allowed on online && !active CPUs,
+		 * however, during cpu-hot-unplug, even these might get pushed
+		 * away if not KTHREAD_IS_PER_CPU.
 		 *
 		 * Specifically, migration_disabled() tasks must not fail the
 		 * cpumask_any_and_distribute() pick below, esp. so on
@@ -2386,16 +2388,6 @@ static int __set_cpus_allowed_ptr(struct
 
 	__do_set_cpus_allowed(p, new_mask, flags);
 
-	if (p->flags & PF_KTHREAD) {
-		/*
-		 * For kernel threads that do indeed end up on online &&
-		 * !active we want to ensure they are strict per-CPU threads.
-		 */
-		WARN_ON(cpumask_intersects(new_mask, cpu_online_mask) &&
-			!cpumask_intersects(new_mask, cpu_active_mask) &&
-			p->nr_cpus_allowed != 1);
-	}
-
 	return affine_move_task(rq, p, &rf, dest_cpu, flags);
 
 out:
@@ -7518,6 +7510,13 @@ int sched_cpu_deactivate(unsigned int cp
 	int ret;
 
 	set_cpu_active(cpu, false);
+
+	/*
+	 * From this point forward, this CPU will refuse to run any task that
+	 * is not: migrate_disable() or KTHREAD_IS_PER_CPU, and will actively
+	 * push those tasks away until this gets cleared, see
+	 * sched_cpu_dying().
+	 */
 	balance_push_set(cpu, true);
 
 	/*



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -v3 3/9] sched: Dont run cpu-online with balance_push() enabled
  2021-01-21 10:17 ` [PATCH -v3 3/9] sched: Dont run cpu-online with balance_push() enabled Peter Zijlstra
@ 2021-01-21 14:00   ` Valentin Schneider
  2021-01-21 14:14     ` Peter Zijlstra
  2021-01-22 17:41   ` [tip: sched/urgent] sched: Don't " tip-bot2 for Peter Zijlstra
  1 sibling, 1 reply; 25+ messages in thread
From: Valentin Schneider @ 2021-01-21 14:00 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, tglx
  Cc: linux-kernel, jiangshanlai, cai, vincent.donnefort, decui,
	paulmck, vincent.guittot, rostedt, tj, peterz

On 21/01/21 11:17, Peter Zijlstra wrote:
> @@ -7608,6 +7614,12 @@ int sched_cpu_dying(unsigned int cpu)
>       }
>       rq_unlock_irqrestore(rq, &rf);
>
> +	/*
> +	 * Should really be after we clear cpu_online(), but we're in
> +	 * stop_machine(), so it all works.
> +	 */

I believe you noted yourself in some earlier version that this *is* running
with cpu_online(cpu) == false, __cpu_disable() being invoked before the
_DYING .teardown callbacks are run.

> +	balance_push_set(cpu, false);
> +
>       calc_load_migrate(rq);
>       update_max_interval();
>       nohz_balance_exit_idle(rq);

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -v3 8/9] sched: Fix CPU hotplug / tighten is_per_cpu_kthread()
  2021-01-21 10:17 ` [PATCH -v3 8/9] sched: Fix CPU hotplug / tighten is_per_cpu_kthread() Peter Zijlstra
@ 2021-01-21 14:01   ` Valentin Schneider
  2021-01-21 14:18     ` Peter Zijlstra
  2021-01-22 17:41   ` [tip: sched/urgent] " tip-bot2 for Peter Zijlstra
  1 sibling, 1 reply; 25+ messages in thread
From: Valentin Schneider @ 2021-01-21 14:01 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, tglx
  Cc: linux-kernel, jiangshanlai, cai, vincent.donnefort, decui,
	paulmck, vincent.guittot, rostedt, tj, peterz

On 21/01/21 11:17, Peter Zijlstra wrote:
> @@ -7504,6 +7525,9 @@ int sched_cpu_deactivate(unsigned int cp
>        * preempt-disabled and RCU users of this state to go away such that
>        * all new such users will observe it.
>        *
> +	 * Specifically, we rely on ttwu to no longer target this CPU, see
> +	 * ttwu_queue_cond() and is_cpu_allowed().
> +	 *

So the last time ttwu_queue_wakelist() can append a task onto a dying
CPU's wakelist is before sched_cpu_deactivate()'s synchronize_rcu()
returns. 

As discussed on IRC, paranoia would have us issue a

  flush_smp_call_function_from_idle()

upon returning from said sync, but this will require further surgery.

Do we want something like the below in the meantime? Ideally we'd warn on
setting rq->ttwu_pending when !cpu_active(), but as per the above this is
allowed before the synchronize_rcu() returns.

---

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ed6ff94aa68a..4b5b4b02ee64 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7590,6 +7590,7 @@ int sched_cpu_starting(unsigned int cpu)
  */
 int sched_cpu_wait_empty(unsigned int cpu)
 {
+	WARN_ON_ONCE(READ_ONCE(cpu_rq(cpu)->ttwu_pending));
 	balance_hotplug_wait();
 	return 0;
 }

>        * Do sync before park smpboot threads to take care the rcu boost case.
>        */
>       synchronize_rcu();

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH -v3 3/9] sched: Dont run cpu-online with balance_push() enabled
  2021-01-21 14:00   ` Valentin Schneider
@ 2021-01-21 14:14     ` Peter Zijlstra
  0 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2021-01-21 14:14 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: mingo, tglx, linux-kernel, jiangshanlai, cai, vincent.donnefort,
	decui, paulmck, vincent.guittot, rostedt, tj

On Thu, Jan 21, 2021 at 02:00:45PM +0000, Valentin Schneider wrote:
> On 21/01/21 11:17, Peter Zijlstra wrote:
> > @@ -7608,6 +7614,12 @@ int sched_cpu_dying(unsigned int cpu)
> >       }
> >       rq_unlock_irqrestore(rq, &rf);
> >
> > +	/*
> > +	 * Should really be after we clear cpu_online(), but we're in
> > +	 * stop_machine(), so it all works.
> > +	 */
> 
> I believe you noted yourself in some earlier version that this *is* running
> with cpu_online(cpu) == false, __cpu_disable() being invoked before the
> _DYING .teardown callbacks are run.

Argh! lemme go fix that comment.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -v3 8/9] sched: Fix CPU hotplug / tighten is_per_cpu_kthread()
  2021-01-21 14:01   ` Valentin Schneider
@ 2021-01-21 14:18     ` Peter Zijlstra
  2021-01-21 14:36       ` Valentin Schneider
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2021-01-21 14:18 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: mingo, tglx, linux-kernel, jiangshanlai, cai, vincent.donnefort,
	decui, paulmck, vincent.guittot, rostedt, tj

On Thu, Jan 21, 2021 at 02:01:03PM +0000, Valentin Schneider wrote:
> On 21/01/21 11:17, Peter Zijlstra wrote:
> > @@ -7504,6 +7525,9 @@ int sched_cpu_deactivate(unsigned int cp
> >        * preempt-disabled and RCU users of this state to go away such that
> >        * all new such users will observe it.
> >        *
> > +	 * Specifically, we rely on ttwu to no longer target this CPU, see
> > +	 * ttwu_queue_cond() and is_cpu_allowed().
> > +	 *
> 
> So the last time ttwu_queue_wakelist() can append a task onto a dying
> CPU's wakelist is before sched_cpu_deactivate()'s synchronize_rcu()
> returns. 
> 
> As discussed on IRC, paranoia would have us issue a
> 
>   flush_smp_call_function_from_idle()
> 
> upon returning from said sync, but this will require further surgery.

Right, specifically RCU needs a little more help there.

> Do we want something like the below in the meantime? Ideally we'd warn on
> setting rq->ttwu_pending when !cpu_active(), but as per the above this is
> allowed before the synchronize_rcu() returns.

I'm not sure I'm brave enough to add that just now :/

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -v3 5/9] workqueue: Tag bound workers with KTHREAD_IS_PER_CPU
  2021-01-21 10:17 ` [PATCH -v3 5/9] workqueue: Tag bound workers with KTHREAD_IS_PER_CPU Peter Zijlstra
@ 2021-01-21 14:31   ` Valentin Schneider
  2021-01-22 17:41   ` [tip: sched/urgent] " tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 25+ messages in thread
From: Valentin Schneider @ 2021-01-21 14:31 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, tglx
  Cc: linux-kernel, jiangshanlai, cai, vincent.donnefort, decui,
	paulmck, vincent.guittot, rostedt, tj, peterz

On 21/01/21 11:17, Peter Zijlstra wrote:
> @@ -4972,9 +4977,11 @@ static void rebind_workers(struct worker
>  	 * of all workers first and then clear UNBOUND.  As we're called
>  	 * from CPU_ONLINE, the following shouldn't fail.
>  	 */
> -	for_each_pool_worker(worker, pool)
> +	for_each_pool_worker(worker, pool) {
> +		kthread_set_per_cpu(worker->task, pool->cpu);
>  		WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
>  						  pool->attrs->cpumask) < 0);

At the end of this series, is_cpu_allowed() allows kthreads with
KTHREAD_IS_PER_CPU on any online CPU, even if it isn't the designated
kthread->cpu.

I thought there might be a race here, given this gives us a window where a
pcpu kworker has the flag but is still affined to cpus_possible_mask. Now,
given cpus_write_lock(), we can't have a CPU going up while another goes
down. So I think it's actually fine, and I've been chasing ghosts yet again.

> +	}
>  
>  	raw_spin_lock_irq(&pool->lock);
>  

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -v3 8/9] sched: Fix CPU hotplug / tighten is_per_cpu_kthread()
  2021-01-21 14:18     ` Peter Zijlstra
@ 2021-01-21 14:36       ` Valentin Schneider
  0 siblings, 0 replies; 25+ messages in thread
From: Valentin Schneider @ 2021-01-21 14:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, linux-kernel, jiangshanlai, cai, vincent.donnefort,
	decui, paulmck, vincent.guittot, rostedt, tj

On 21/01/21 15:18, Peter Zijlstra wrote:
> On Thu, Jan 21, 2021 at 02:01:03PM +0000, Valentin Schneider wrote:
>> On 21/01/21 11:17, Peter Zijlstra wrote:
>> > @@ -7504,6 +7525,9 @@ int sched_cpu_deactivate(unsigned int cp
>> >        * preempt-disabled and RCU users of this state to go away such that
>> >        * all new such users will observe it.
>> >        *
>> > +	 * Specifically, we rely on ttwu to no longer target this CPU, see
>> > +	 * ttwu_queue_cond() and is_cpu_allowed().
>> > +	 *
>> 
>> So the last time ttwu_queue_wakelist() can append a task onto a dying
>> CPU's wakelist is before sched_cpu_deactivate()'s synchronize_rcu()
>> returns. 
>> 
>> As discussed on IRC, paranoia would have us issue a
>> 
>>   flush_smp_call_function_from_idle()
>> 
>> upon returning from said sync, but this will require further surgery.
>
> Right, specifically RCU needs a little more help there.
>
>> Do we want something like the below in the meantime? Ideally we'd warn on
>> setting rq->ttwu_pending when !cpu_active(), but as per the above this is
>> allowed before the synchronize_rcu() returns.
>
> I'm not sure I'm brave enough to add that just now :/

I get you; I couldn't come up with a better scheme that would give us a bit
more info than the sched_cpu_dying() splat :/

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -v3 0/9] sched: Fix hot-unplug regression
  2021-01-21 10:17 [PATCH -v3 0/9] sched: Fix hot-unplug regression Peter Zijlstra
                   ` (8 preceding siblings ...)
  2021-01-21 10:17 ` [PATCH -v3 9/9] sched: Relax the set_cpus_allowed_ptr() semantics Peter Zijlstra
@ 2021-01-21 14:36 ` Valentin Schneider
  2021-01-21 19:56 ` Paul E. McKenney
  10 siblings, 0 replies; 25+ messages in thread
From: Valentin Schneider @ 2021-01-21 14:36 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, tglx
  Cc: linux-kernel, jiangshanlai, cai, vincent.donnefort, decui,
	paulmck, vincent.guittot, rostedt, tj, peterz

On 21/01/21 11:17, Peter Zijlstra wrote:
> Hi,
>
> Some cautious optimism lets me post v3 of these patches. They (knock on wood)
> fix the regression introduced by commit:
>
>   1cf12e08bc4d ("sched/hotplug: Consolidate task migration on CPU unplug")
>
> These patches survived overnight runs for both me and Valentin, but I'll let it
> run for at least another 12 hours before committing these patches.
>
> New in this version is patch #7.
>

Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>

And, providing the tests keep sailing smoothly:

Tested-by: Valentin Schneider <valentin.schneider@arm.com>

> Much thanks to Valentin for his continued support and debugging efforts.

Happy to ~get headaches~ help!

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -v3 0/9] sched: Fix hot-unplug regression
  2021-01-21 10:17 [PATCH -v3 0/9] sched: Fix hot-unplug regression Peter Zijlstra
                   ` (9 preceding siblings ...)
  2021-01-21 14:36 ` [PATCH -v3 0/9] sched: Fix hot-unplug regression Valentin Schneider
@ 2021-01-21 19:56 ` Paul E. McKenney
  10 siblings, 0 replies; 25+ messages in thread
From: Paul E. McKenney @ 2021-01-21 19:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, linux-kernel, jiangshanlai, valentin.schneider, cai,
	vincent.donnefort, decui, vincent.guittot, rostedt, tj

On Thu, Jan 21, 2021 at 11:17:02AM +0100, Peter Zijlstra wrote:
> Hi,
> 
> Some cautious optimism lets me post v3 of these patches. They (knock on wood)
> fix the regression introduced by commit:
> 
>   1cf12e08bc4d ("sched/hotplug: Consolidate task migration on CPU unplug")
> 
> These patches survived overnight runs for both me and Valentin, but I'll let it
> run for at least another 12 hours before committing these patches.
> 
> New in this version is patch #7.
> 
> Much thanks to Valentin for his continued support and debugging efforts.

Thank you all!!!  I have started testing these on top of -rcu.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [tip: sched/urgent] sched: Relax the set_cpus_allowed_ptr() semantics
  2021-01-21 10:17 ` [PATCH -v3 9/9] sched: Relax the set_cpus_allowed_ptr() semantics Peter Zijlstra
@ 2021-01-22 17:41   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 25+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2021-01-22 17:41 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel),
	Lai jiangshan, Valentin Schneider, x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     741ba80f6f9a4702089c122129f22df9774b3e64
Gitweb:        https://git.kernel.org/tip/741ba80f6f9a4702089c122129f22df9774b3e64
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Sat, 16 Jan 2021 11:56:37 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 22 Jan 2021 15:09:44 +01:00

sched: Relax the set_cpus_allowed_ptr() semantics

Now that we have KTHREAD_IS_PER_CPU to denote the critical per-cpu
tasks to retain during CPU offline, we can relax the warning in
set_cpus_allowed_ptr(). Any spurious kthread that wants to get on at
the last minute will get pushed off before it can run.

During CPU online there is no harm, and actual benefit, in allowing
kthreads back on early: it simplifies hotplug code and fixes a number
of outstanding races.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lai jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210121103507.240724591@infradead.org
---
 kernel/sched/core.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 56b0962..ff74fca 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2342,7 +2342,9 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
 
 	if (p->flags & PF_KTHREAD || is_migration_disabled(p)) {
 		/*
-		 * Kernel threads are allowed on online && !active CPUs.
+		 * Kernel threads are allowed on online && !active CPUs,
+		 * however, during cpu-hot-unplug, even these might get pushed
+		 * away if not KTHREAD_IS_PER_CPU.
 		 *
 		 * Specifically, migration_disabled() tasks must not fail the
 		 * cpumask_any_and_distribute() pick below, esp. so on
@@ -2386,16 +2388,6 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
 
 	__do_set_cpus_allowed(p, new_mask, flags);
 
-	if (p->flags & PF_KTHREAD) {
-		/*
-		 * For kernel threads that do indeed end up on online &&
-		 * !active we want to ensure they are strict per-CPU threads.
-		 */
-		WARN_ON(cpumask_intersects(new_mask, cpu_online_mask) &&
-			!cpumask_intersects(new_mask, cpu_active_mask) &&
-			p->nr_cpus_allowed != 1);
-	}
-
 	return affine_move_task(rq, p, &rf, dest_cpu, flags);
 
 out:
@@ -7518,6 +7510,13 @@ int sched_cpu_deactivate(unsigned int cpu)
 	int ret;
 
 	set_cpu_active(cpu, false);
+
+	/*
+	 * From this point forward, this CPU will refuse to run any task that
+	 * is not: migrate_disable() or KTHREAD_IS_PER_CPU, and will actively
+	 * push those tasks away until this gets cleared, see
+	 * sched_cpu_dying().
+	 */
 	balance_push_set(cpu, true);
 
 	/*

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [tip: sched/urgent] workqueue: Restrict affinity change to rescuer
  2021-01-21 10:17 ` [PATCH -v3 6/9] workqueue: Restrict affinity change to rescuer Peter Zijlstra
@ 2021-01-22 17:41   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 25+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2021-01-22 17:41 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Valentin Schneider, x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     640f17c82460e9724fd256f0a1f5d99e7ff0bda4
Gitweb:        https://git.kernel.org/tip/640f17c82460e9724fd256f0a1f5d99e7ff0bda4
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Fri, 15 Jan 2021 19:08:36 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 22 Jan 2021 15:09:43 +01:00

workqueue: Restrict affinity change to rescuer

create_worker() will already set the right affinity using
kthread_bind_mask(), which means only the rescuer will need to change
its affinity.

However, during cpu-hot-unplug a regular task is not allowed to run on
an online&&!active CPU, as it would be pushed away quite aggressively.
We need KTHREAD_IS_PER_CPU to survive in that environment.

Therefore set the affinity after getting that magic flag.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210121103506.826629830@infradead.org
---
 kernel/workqueue.c |  9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index cce3433..894bb88 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1849,12 +1849,6 @@ static void worker_attach_to_pool(struct worker *worker,
 	mutex_lock(&wq_pool_attach_mutex);
 
 	/*
-	 * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
-	 * online CPUs.  It'll be re-applied when any of the CPUs come up.
-	 */
-	set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
-
-	/*
 	 * The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains
 	 * stable across this function.  See the comments above the flag
 	 * definition for details.
@@ -1864,6 +1858,9 @@ static void worker_attach_to_pool(struct worker *worker,
 	else
 		kthread_set_per_cpu(worker->task, pool->cpu);
 
+	if (worker->rescue_wq)
+		set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
+
 	list_add_tail(&worker->node, &pool->workers);
 	worker->pool = pool;
 

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [tip: sched/urgent] sched: Fix CPU hotplug / tighten is_per_cpu_kthread()
  2021-01-21 10:17 ` [PATCH -v3 8/9] sched: Fix CPU hotplug / tighten is_per_cpu_kthread() Peter Zijlstra
  2021-01-21 14:01   ` Valentin Schneider
@ 2021-01-22 17:41   ` tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 25+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2021-01-22 17:41 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Valentin Schneider, x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     5ba2ffba13a1e24e7b153683e97300f9cc6f605a
Gitweb:        https://git.kernel.org/tip/5ba2ffba13a1e24e7b153683e97300f9cc6f605a
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 12 Jan 2021 11:28:16 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 22 Jan 2021 15:09:44 +01:00

sched: Fix CPU hotplug / tighten is_per_cpu_kthread()

Prior to commit 1cf12e08bc4d ("sched/hotplug: Consolidate task
migration on CPU unplug") we'd leave tasks on the dying CPU, break
their affinity and force them off at the very end.

This scheme had to change in order to enable migrate_disable(). One
cannot wait for migrate_disable() to complete while stuck in
stop_machine(). Furthermore, since we need, at the very least, the
idle, hotplug and stop threads at any point before stop_machine(), we
can't break their affinity and/or push them away.

Under the assumption that all per-cpu kthreads are sanely handled by
CPU hotplug, the new code no longer breaks affinity or migrates any of
them (which then includes the critical ones above).

However, there's an important difference between genuine per-cpu
kthreads and kthreads that merely happen to have a single-CPU affinity,
and that difference is now lost. The latter class very much relies on
the forced affinity-breaking and migration semantics previously
provided.

Use the new kthread_is_per_cpu() infrastructure to tighten
is_per_cpu_kthread() and fix the hot-unplug problems stemming from the
change.

Fixes: 1cf12e08bc4d ("sched/hotplug: Consolidate task migration on CPU unplug")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210121103507.102416009@infradead.org
---
 kernel/sched/core.c | 39 +++++++++++++++++++++++++++++++++++----
 1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 16946b5..56b0962 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1796,13 +1796,28 @@ static inline bool rq_has_pinned_tasks(struct rq *rq)
  */
 static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 {
+	/* When not in the task's cpumask, no point in looking further. */
 	if (!cpumask_test_cpu(cpu, p->cpus_ptr))
 		return false;
 
-	if (is_per_cpu_kthread(p) || is_migration_disabled(p))
+	/* migrate_disabled() must be allowed to finish. */
+	if (is_migration_disabled(p))
 		return cpu_online(cpu);
 
-	return cpu_active(cpu);
+	/* Non kernel threads are not allowed during either online or offline. */
+	if (!(p->flags & PF_KTHREAD))
+		return cpu_active(cpu);
+
+	/* KTHREAD_IS_PER_CPU is always allowed. */
+	if (kthread_is_per_cpu(p))
+		return cpu_online(cpu);
+
+	/* Regular kernel threads don't get to stay during offline. */
+	if (cpu_rq(cpu)->balance_push)
+		return false;
+
+	/* But are allowed during online. */
+	return cpu_online(cpu);
 }
 
 /*
@@ -3122,6 +3137,13 @@ bool cpus_share_cache(int this_cpu, int that_cpu)
 static inline bool ttwu_queue_cond(int cpu, int wake_flags)
 {
 	/*
+	 * Do not complicate things with the async wake_list while the CPU is
+	 * in hotplug state.
+	 */
+	if (!cpu_active(cpu))
+		return false;
+
+	/*
 	 * If the CPU does not share cache, then queue the task on the
 	 * remote rqs wakelist to avoid accessing remote data.
 	 */
@@ -7276,8 +7298,14 @@ static void balance_push(struct rq *rq)
 	/*
 	 * Both the cpu-hotplug and stop task are in this case and are
 	 * required to complete the hotplug process.
+	 *
+	 * XXX: the idle task does not match kthread_is_per_cpu() due to
+	 * histerical raisins.
 	 */
-	if (is_per_cpu_kthread(push_task) || is_migration_disabled(push_task)) {
+	if (rq->idle == push_task ||
+	    ((push_task->flags & PF_KTHREAD) && kthread_is_per_cpu(push_task)) ||
+	    is_migration_disabled(push_task)) {
+
 		/*
 		 * If this is the idle task on the outgoing CPU try to wake
 		 * up the hotplug control thread which might wait for the
@@ -7309,7 +7337,7 @@ static void balance_push(struct rq *rq)
 	/*
 	 * At this point need_resched() is true and we'll take the loop in
 	 * schedule(). The next pick is obviously going to be the stop task
-	 * which is_per_cpu_kthread() and will push this task away.
+	 * which kthread_is_per_cpu() and will push this task away.
 	 */
 	raw_spin_lock(&rq->lock);
 }
@@ -7497,6 +7525,9 @@ int sched_cpu_deactivate(unsigned int cpu)
 	 * preempt-disabled and RCU users of this state to go away such that
 	 * all new such users will observe it.
 	 *
+	 * Specifically, we rely on ttwu to no longer target this CPU, see
+	 * ttwu_queue_cond() and is_cpu_allowed().
+	 *
 	 * Do sync before park smpboot threads to take care the rcu boost case.
 	 */
 	synchronize_rcu();

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [tip: sched/urgent] sched: Prepare to use balance_push in ttwu()
  2021-01-21 10:17 ` [PATCH -v3 7/9] sched: Prepare to use balance_push in ttwu() Peter Zijlstra
@ 2021-01-22 17:41   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 25+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2021-01-22 17:41 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Valentin Schneider, x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     975707f227b07a8212060f94447171d15d7a681b
Gitweb:        https://git.kernel.org/tip/975707f227b07a8212060f94447171d15d7a681b
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Wed, 20 Jan 2021 15:05:41 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 22 Jan 2021 15:09:43 +01:00

sched: Prepare to use balance_push in ttwu()

In preparation for using the balance_push state in ttwu(), we need it to
provide a reliable and consistent state.

The immediate problem is that rq->balance_callback gets cleared every
schedule() and then re-set in the balance_push_callback() itself. This
is not a reliable signal, so add a variable that stays set during the
entire time.

Also move setting it before the synchronize_rcu() in
sched_cpu_deactivate(), such that we get guaranteed visibility to
ttwu(), which is a preempt-disable region.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210121103506.966069627@infradead.org
---
 kernel/sched/core.c  | 11 ++++++-----
 kernel/sched/sched.h |  1 +
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8da0fd7..16946b5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7320,6 +7320,7 @@ static void balance_push_set(int cpu, bool on)
 	struct rq_flags rf;
 
 	rq_lock_irqsave(rq, &rf);
+	rq->balance_push = on;
 	if (on) {
 		WARN_ON_ONCE(rq->balance_callback);
 		rq->balance_callback = &balance_push_callback;
@@ -7489,17 +7490,17 @@ int sched_cpu_deactivate(unsigned int cpu)
 	int ret;
 
 	set_cpu_active(cpu, false);
+	balance_push_set(cpu, true);
+
 	/*
-	 * We've cleared cpu_active_mask, wait for all preempt-disabled and RCU
-	 * users of this state to go away such that all new such users will
-	 * observe it.
+	 * We've cleared cpu_active_mask / set balance_push, wait for all
+	 * preempt-disabled and RCU users of this state to go away such that
+	 * all new such users will observe it.
 	 *
 	 * Do sync before park smpboot threads to take care the rcu boost case.
 	 */
 	synchronize_rcu();
 
-	balance_push_set(cpu, true);
-
 	rq_lock_irqsave(rq, &rf);
 	if (rq->rd) {
 		update_rq_clock(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 12ada79..bb09988 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -975,6 +975,7 @@ struct rq {
 	unsigned long		cpu_capacity_orig;
 
 	struct callback_head	*balance_callback;
+	unsigned char		balance_push;
 
 	unsigned char		nohz_idle_balance;
 	unsigned char		idle_balance;

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [tip: sched/urgent] workqueue: Tag bound workers with KTHREAD_IS_PER_CPU
  2021-01-21 10:17 ` [PATCH -v3 5/9] workqueue: Tag bound workers with KTHREAD_IS_PER_CPU Peter Zijlstra
  2021-01-21 14:31   ` Valentin Schneider
@ 2021-01-22 17:41   ` tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 25+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2021-01-22 17:41 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Valentin Schneider, x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     5c25b5ff89f004c30b04759dc34ace8585a4085f
Gitweb:        https://git.kernel.org/tip/5c25b5ff89f004c30b04759dc34ace8585a4085f
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 12 Jan 2021 11:26:49 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 22 Jan 2021 15:09:42 +01:00

workqueue: Tag bound workers with KTHREAD_IS_PER_CPU

Mark the per-cpu workqueue workers as KTHREAD_IS_PER_CPU.

Workqueues have unfortunate semantics in that per-cpu workers are not
flushed and parked by default during hotplug; however, a subset does a
manual flush on hotplug and relies hard on them for correctness.

Therefore play silly games..

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210121103506.693465814@infradead.org
---
 kernel/workqueue.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 1646331..cce3433 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1861,6 +1861,8 @@ static void worker_attach_to_pool(struct worker *worker,
 	 */
 	if (pool->flags & POOL_DISASSOCIATED)
 		worker->flags |= WORKER_UNBOUND;
+	else
+		kthread_set_per_cpu(worker->task, pool->cpu);
 
 	list_add_tail(&worker->node, &pool->workers);
 	worker->pool = pool;
@@ -1883,6 +1885,7 @@ static void worker_detach_from_pool(struct worker *worker)
 
 	mutex_lock(&wq_pool_attach_mutex);
 
+	kthread_set_per_cpu(worker->task, -1);
 	list_del(&worker->node);
 	worker->pool = NULL;
 
@@ -4919,8 +4922,10 @@ static void unbind_workers(int cpu)
 
 		raw_spin_unlock_irq(&pool->lock);
 
-		for_each_pool_worker(worker, pool)
+		for_each_pool_worker(worker, pool) {
+			kthread_set_per_cpu(worker->task, -1);
 			WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, cpu_possible_mask) < 0);
+		}
 
 		mutex_unlock(&wq_pool_attach_mutex);
 
@@ -4972,9 +4977,11 @@ static void rebind_workers(struct worker_pool *pool)
 	 * of all workers first and then clear UNBOUND.  As we're called
 	 * from CPU_ONLINE, the following shouldn't fail.
 	 */
-	for_each_pool_worker(worker, pool)
+	for_each_pool_worker(worker, pool) {
+		kthread_set_per_cpu(worker->task, pool->cpu);
 		WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
 						  pool->attrs->cpumask) < 0);
+	}
 
 	raw_spin_lock_irq(&pool->lock);
 

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [tip: sched/urgent] kthread: Extract KTHREAD_IS_PER_CPU
  2021-01-21 10:17 ` [PATCH -v3 4/9] kthread: Extract KTHREAD_IS_PER_CPU Peter Zijlstra
@ 2021-01-22 17:41   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 25+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2021-01-22 17:41 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Valentin Schneider, x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     ac687e6e8c26181a33270efd1a2e2241377924b0
Gitweb:        https://git.kernel.org/tip/ac687e6e8c26181a33270efd1a2e2241377924b0
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 12 Jan 2021 11:24:04 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 22 Jan 2021 15:09:42 +01:00

kthread: Extract KTHREAD_IS_PER_CPU

There is a need to distinguish genuine per-cpu kthreads from kthreads
that merely happen to have a single-CPU affinity.

Genuine per-cpu kthreads are kthreads that are CPU-affine for
correctness; these will obviously have PF_KTHREAD set, but must also
have PF_NO_SETAFFINITY set, lest userspace modify their affinity and
ruin things.

However, these two flags are not sufficient: PF_NO_SETAFFINITY is
also set on other tasks whose affinity is controlled through other
means, workqueue workers for instance.

Therefore another bit is needed; it turns out the kthread code already
has such a bit: KTHREAD_IS_PER_CPU, set by kthread_create_on_cpu() and
used to make kthread_park()/kthread_unpark() work correctly.

Expose this flag and remove the implicit setting of it from
kthread_create_on_cpu(); the io_uring usage of it seems dubious at
best.
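
For illustration only (not part of this patch): roughly the kind of
check the rest of this series builds on top of the new helpers. The
function name below is made up and the exact in-tree check differs.

  /* Hypothetical predicate combining the three conditions listed above. */
  static bool looks_like_genuine_percpu_kthread(struct task_struct *p)
  {
          if (!(p->flags & PF_KTHREAD))
                  return false;           /* not a kthread at all */

          if (!(p->flags & PF_NO_SETAFFINITY))
                  return false;           /* userspace may still move it */

          /* A single-CPU affinity alone proves nothing; require the bit. */
          return kthread_is_per_cpu(p);
  }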

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210121103506.557620262@infradead.org
---
 include/linux/kthread.h |  3 +++
 kernel/kthread.c        | 27 ++++++++++++++++++++++++++-
 kernel/smpboot.c        |  1 +
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 65b81e0..2484ed9 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -33,6 +33,9 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
 					  unsigned int cpu,
 					  const char *namefmt);
 
+void kthread_set_per_cpu(struct task_struct *k, int cpu);
+bool kthread_is_per_cpu(struct task_struct *k);
+
 /**
  * kthread_run - create and wake a thread.
  * @threadfn: the function to run until signal_pending(current).
diff --git a/kernel/kthread.c b/kernel/kthread.c
index a5eceec..e0e4a42 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -493,11 +493,36 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
 		return p;
 	kthread_bind(p, cpu);
 	/* CPU hotplug need to bind once again when unparking the thread. */
-	set_bit(KTHREAD_IS_PER_CPU, &to_kthread(p)->flags);
 	to_kthread(p)->cpu = cpu;
 	return p;
 }
 
+void kthread_set_per_cpu(struct task_struct *k, int cpu)
+{
+	struct kthread *kthread = to_kthread(k);
+	if (!kthread)
+		return;
+
+	WARN_ON_ONCE(!(k->flags & PF_NO_SETAFFINITY));
+
+	if (cpu < 0) {
+		clear_bit(KTHREAD_IS_PER_CPU, &kthread->flags);
+		return;
+	}
+
+	kthread->cpu = cpu;
+	set_bit(KTHREAD_IS_PER_CPU, &kthread->flags);
+}
+
+bool kthread_is_per_cpu(struct task_struct *k)
+{
+	struct kthread *kthread = to_kthread(k);
+	if (!kthread)
+		return false;
+
+	return test_bit(KTHREAD_IS_PER_CPU, &kthread->flags);
+}
+
 /**
  * kthread_unpark - unpark a thread created by kthread_create().
  * @k:		thread created by kthread_create().
diff --git a/kernel/smpboot.c b/kernel/smpboot.c
index 2efe1e2..f25208e 100644
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -188,6 +188,7 @@ __smpboot_create_thread(struct smp_hotplug_thread *ht, unsigned int cpu)
 		kfree(td);
 		return PTR_ERR(tsk);
 	}
+	kthread_set_per_cpu(tsk, cpu);
 	/*
 	 * Park the thread so that it could start right on the CPU
 	 * when it is available.

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [tip: sched/urgent] sched: Don't run cpu-online with balance_push() enabled
  2021-01-21 10:17 ` [PATCH -v3 3/9] sched: Dont run cpu-online with balance_push() enabled Peter Zijlstra
  2021-01-21 14:00   ` Valentin Schneider
@ 2021-01-22 17:41   ` tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 25+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2021-01-22 17:41 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Valentin Schneider, x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     22f667c97aadbf481e2cae2d6feabdf431e27b31
Gitweb:        https://git.kernel.org/tip/22f667c97aadbf481e2cae2d6feabdf431e27b31
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Fri, 15 Jan 2021 18:17:45 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 22 Jan 2021 15:09:42 +01:00

sched: Don't run cpu-online with balance_push() enabled

We don't need to push away tasks when we come online; mark the push
complete right before the CPU dies.

XXX hotplug state machine has trouble with rollback here.
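
For illustration only (not kernel code): a minimal userspace model of
the balance_push_set() semantics after the first hunk below. Enabling
warns if another balance callback is already installed; disabling only
clears the callback when it is still the push callback, so a regular
callback queued in the meantime is left alone.

  #include <assert.h>
  #include <stdbool.h>
  #include <stddef.h>

  struct callback_head { struct callback_head *next; };   /* simplified */

  static struct callback_head balance_push_callback;      /* sentinel */
  static struct callback_head *balance_callback;          /* stands in for rq->balance_callback */

  static void model_balance_push_set(bool on)
  {
          if (on) {
                  /* models WARN_ON_ONCE(rq->balance_callback) */
                  assert(balance_callback == NULL);
                  balance_callback = &balance_push_callback;
          } else if (balance_callback == &balance_push_callback) {
                  /* only undo the push callback; leave others alone */
                  balance_callback = NULL;
          }
  }

  int main(void)
  {
          model_balance_push_set(true);   /* sched_cpu_deactivate(): start pushing */
          model_balance_push_set(false);  /* end of sched_cpu_dying(): done pushing */
          model_balance_push_set(false);  /* sched_cpu_activate(): also covers rollback */
          return 0;
  }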

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210121103506.415606087@infradead.org
---
 kernel/sched/core.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 627534f..8da0fd7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7320,10 +7320,12 @@ static void balance_push_set(int cpu, bool on)
 	struct rq_flags rf;
 
 	rq_lock_irqsave(rq, &rf);
-	if (on)
+	if (on) {
+		WARN_ON_ONCE(rq->balance_callback);
 		rq->balance_callback = &balance_push_callback;
-	else
+	} else if (rq->balance_callback == &balance_push_callback) {
 		rq->balance_callback = NULL;
+	}
 	rq_unlock_irqrestore(rq, &rf);
 }
 
@@ -7441,6 +7443,10 @@ int sched_cpu_activate(unsigned int cpu)
 	struct rq *rq = cpu_rq(cpu);
 	struct rq_flags rf;
 
+	/*
+	 * Make sure that when the hotplug state machine does a roll-back
+	 * we clear balance_push. Ideally that would happen earlier...
+	 */
 	balance_push_set(cpu, false);
 
 #ifdef CONFIG_SCHED_SMT
@@ -7608,6 +7614,12 @@ int sched_cpu_dying(unsigned int cpu)
 	}
 	rq_unlock_irqrestore(rq, &rf);
 
+	/*
+	 * Now that the CPU is offline, make sure we're welcome
+	 * to new tasks once we come back up.
+	 */
+	balance_push_set(cpu, false);
+
 	calc_load_migrate(rq);
 	update_max_interval();
 	nohz_balance_exit_idle(rq);

^ permalink raw reply related	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2021-01-22 17:56 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-21 10:17 [PATCH -v3 0/9] sched: Fix hot-unplug regression Peter Zijlstra
2021-01-21 10:17 ` [PATCH -v3 1/9] sched/core: Print out straggler tasks in sched_cpu_dying() Peter Zijlstra
2021-01-21 10:17 ` [PATCH -v3 2/9] workqueue: Use cpu_possible_mask instead of cpu_active_mask to break affinity Peter Zijlstra
2021-01-21 10:17 ` [PATCH -v3 3/9] sched: Dont run cpu-online with balance_push() enabled Peter Zijlstra
2021-01-21 14:00   ` Valentin Schneider
2021-01-21 14:14     ` Peter Zijlstra
2021-01-22 17:41   ` [tip: sched/urgent] sched: Don't " tip-bot2 for Peter Zijlstra
2021-01-21 10:17 ` [PATCH -v3 4/9] kthread: Extract KTHREAD_IS_PER_CPU Peter Zijlstra
2021-01-22 17:41   ` [tip: sched/urgent] " tip-bot2 for Peter Zijlstra
2021-01-21 10:17 ` [PATCH -v3 5/9] workqueue: Tag bound workers with KTHREAD_IS_PER_CPU Peter Zijlstra
2021-01-21 14:31   ` Valentin Schneider
2021-01-22 17:41   ` [tip: sched/urgent] " tip-bot2 for Peter Zijlstra
2021-01-21 10:17 ` [PATCH -v3 6/9] workqueue: Restrict affinity change to rescuer Peter Zijlstra
2021-01-22 17:41   ` [tip: sched/urgent] " tip-bot2 for Peter Zijlstra
2021-01-21 10:17 ` [PATCH -v3 7/9] sched: Prepare to use balance_push in ttwu() Peter Zijlstra
2021-01-22 17:41   ` [tip: sched/urgent] " tip-bot2 for Peter Zijlstra
2021-01-21 10:17 ` [PATCH -v3 8/9] sched: Fix CPU hotplug / tighten is_per_cpu_kthread() Peter Zijlstra
2021-01-21 14:01   ` Valentin Schneider
2021-01-21 14:18     ` Peter Zijlstra
2021-01-21 14:36       ` Valentin Schneider
2021-01-22 17:41   ` [tip: sched/urgent] " tip-bot2 for Peter Zijlstra
2021-01-21 10:17 ` [PATCH -v3 9/9] sched: Relax the set_cpus_allowed_ptr() semantics Peter Zijlstra
2021-01-22 17:41   ` [tip: sched/urgent] " tip-bot2 for Peter Zijlstra
2021-01-21 14:36 ` [PATCH -v3 0/9] sched: Fix hot-unplug regression Valentin Schneider
2021-01-21 19:56 ` Paul E. McKenney
