* [PATCH 0/2] workqueue: Fix migrate_disable hotplug changes vs kworker affinity
@ 2020-12-10 16:38 Valentin Schneider
  2020-12-10 16:38 ` [PATCH 1/2] stop_machine: Add caller debug info to queue_stop_cpus_work Valentin Schneider
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Valentin Schneider @ 2020-12-10 16:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, tglx, mingo, bigeasy, qais.yousef, swood,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vincent.donnefort, tj, ouwen210, Qian Cai

Hi folks,

This should fix the issue reported by Qian at [1]. Dietmar mentioned some
other issue with hotplug & deadline tasks, but that's being investigated by
someone else ATM. 


I would like to mention I suspect this bug comes straight from $hell, as
I've never ever had to fight off so many (mostly unrelated) issues while
looking into it. Distro kernel being mangled, git tree corruption, periods of
heisenbugism...

Cheers,
Valentin

[1]: https://lore.kernel.org/r/ff62e3ee994efb3620177bf7b19fab16f4866845.camel@redhat.com

Valentin Schneider (2):
  stop_machine: Add caller debug info to queue_stop_cpus_work
  workqueue: Fix affinity of kworkers attached during late hotplug

 kernel/stop_machine.c |  1 +
 kernel/workqueue.c    | 24 +++++++++++++++++-------
 2 files changed, 18 insertions(+), 7 deletions(-)

--
2.27.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/2] stop_machine: Add caller debug info to queue_stop_cpus_work
  2020-12-10 16:38 [PATCH 0/2] workqueue: Fix migrate_disable hotplug changes vs kworker affinity Valentin Schneider
@ 2020-12-10 16:38 ` Valentin Schneider
  2021-03-23 15:08   ` [tip: sched/core] " tip-bot2 for Valentin Schneider
  2020-12-10 16:38 ` [PATCH 2/2] workqueue: Fix affinity of kworkers attached during late hotplug Valentin Schneider
  2020-12-11 12:52 ` [PATCH 0/2] workqueue: Fix migrate_disable hotplug changes vs kworker affinity Peter Zijlstra
  2 siblings, 1 reply; 11+ messages in thread
From: Valentin Schneider @ 2020-12-10 16:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, tglx, mingo, bigeasy, qais.yousef, swood,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vincent.donnefort, tj, ouwen210, Qian Cai

Most callsites were covered by commit

  a8b62fd08505 ("stop_machine: Add function and caller debug info")

but this skipped queue_stop_cpus_work(). Add caller debug info to it.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/stop_machine.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 971d8acceaec..cbc30271ea4d 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -409,6 +409,7 @@ static bool queue_stop_cpus_work(const struct cpumask *cpumask,
 		work->fn = fn;
 		work->arg = arg;
 		work->done = done;
+		work->caller = _RET_IP_;
 		if (cpu_stop_queue_work(cpu, work))
 			queued = true;
 	}
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/2] workqueue: Fix affinity of kworkers attached during late hotplug
  2020-12-10 16:38 [PATCH 0/2] workqueue: Fix migrate_disable hotplug changes vs kworker affinity Valentin Schneider
  2020-12-10 16:38 ` [PATCH 1/2] stop_machine: Add caller debug info to queue_stop_cpus_work Valentin Schneider
@ 2020-12-10 16:38 ` Valentin Schneider
  2020-12-11 11:39   ` Vincent Donnefort
  2020-12-11 12:52 ` [PATCH 0/2] workqueue: Fix migrate_disable hotplug changes vs kworker affinity Peter Zijlstra
  2 siblings, 1 reply; 11+ messages in thread
From: Valentin Schneider @ 2020-12-10 16:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: Qian Cai, Peter Zijlstra, tglx, mingo, bigeasy, qais.yousef,
	swood, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, vincent.donnefort, tj, ouwen210

Per-CPU kworkers forcefully migrated away by hotplug via
workqueue_offline_cpu() can end up spawning more kworkers via

  manage_workers() -> maybe_create_worker()

Workers created at this point will be bound using

  pool->attrs->cpumask

which in this case is wrong, as the hotplug state machine already migrated
all pinned kworkers away from this CPU. This ends up triggering the BUG_ON
condition in sched_cpu_dying() (i.e. there's a kworker enqueued on the
dying rq).

Special-case workers being attached to DISASSOCIATED pools and bind them to
cpu_active_mask, as if they had been present when workqueue_offline_cpu()
was invoked.

Link: https://lore.kernel.org/r/ff62e3ee994efb3620177bf7b19fab16f4866845.camel@redhat.com
Fixes: 06249738a41a ("workqueue: Manually break affinity on hotplug")
Reported-by: Qian Cai <cai@redhat.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/workqueue.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9880b6c0e272..fb1418edf85c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1848,19 +1848,29 @@ static void worker_attach_to_pool(struct worker *worker,
 {
 	mutex_lock(&wq_pool_attach_mutex);
 
-	/*
-	 * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
-	 * online CPUs.  It'll be re-applied when any of the CPUs come up.
-	 */
-	set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
-
 	/*
 	 * The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains
 	 * stable across this function.  See the comments above the flag
 	 * definition for details.
+	 *
+	 * Worker might get attached to a pool *after* workqueue_offline_cpu()
+	 * was run - e.g. created by manage_workers() from a kworker which was
+	 * forcefully moved away by hotplug. Kworkers created from this point on
+	 * need to have their affinity changed as if they were present during
+	 * workqueue_offline_cpu().
+	 *
+	 * This will be resolved in rebind_workers().
 	 */
-	if (pool->flags & POOL_DISASSOCIATED)
+	if (pool->flags & POOL_DISASSOCIATED) {
 		worker->flags |= WORKER_UNBOUND;
+		set_cpus_allowed_ptr(worker->task, cpu_active_mask);
+	} else {
+		/*
+		 * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
+		 * online CPUs. It'll be re-applied when any of the CPUs come up.
+		 */
+		set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
+	}
 
 	list_add_tail(&worker->node, &pool->workers);
 	worker->pool = pool;
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] workqueue: Fix affinity of kworkers attached during late hotplug
  2020-12-10 16:38 ` [PATCH 2/2] workqueue: Fix affinity of kworkers attached during late hotplug Valentin Schneider
@ 2020-12-11 11:39   ` Vincent Donnefort
  2020-12-11 12:41     ` Peter Zijlstra
                       ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Vincent Donnefort @ 2020-12-11 11:39 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Qian Cai, Peter Zijlstra, tglx, mingo, bigeasy,
	qais.yousef, swood, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, tj,
	ouwen210

Hi Valentin,

On Thu, Dec 10, 2020 at 04:38:30PM +0000, Valentin Schneider wrote:
> Per-CPU kworkers forcefully migrated away by hotplug via
> workqueue_offline_cpu() can end up spawning more kworkers via
> 
>   manage_workers() -> maybe_create_worker()
> 
> Workers created at this point will be bound using
> 
>   pool->attrs->cpumask
> 
> which in this case is wrong, as the hotplug state machine already migrated
> all pinned kworkers away from this CPU. This ends up triggering the BUG_ON
> condition in sched_cpu_dying() (i.e. there's a kworker enqueued on the
> dying rq).
> 
> Special-case workers being attached to DISASSOCIATED pools and bind them to
> cpu_active_mask, as if they had been present when workqueue_offline_cpu()
> was invoked.
> 
> Link: https://lore.kernel.org/r/ff62e3ee994efb3620177bf7b19fab16f4866845.camel@redhat.com
> Fixes: 06249738a41a ("workqueue: Manually break affinity on hotplug")

Isn't the problem introduced by 1cf12e0 ("sched/hotplug: Consolidate
task migration on CPU unplug") ?

Previously we had:

 AP_WORKQUEUE_ONLINE -> set POOL_DISASSOCIATED
   ...
 TEARDOWN_CPU -> clear CPU in cpu_online_mask
   |
   |-AP_SCHED_STARTING -> migrate_tasks()
   |
  AP_OFFLINE

worker_attach_to_pool() is "protected" by the cpu_online_mask check in
set_cpus_allowed_ptr(). IIUC, now that tasks are migrated before the
cpu_online_mask is actually flipped, there's a window between
CPUHP_AP_SCHED_WAIT_EMPTY and CPUHP_TEARDOWN_CPU where a kworker can wake up
a new one for the hotunplugged pool, and that new kworker wouldn't be caught
by the hotunplug migration.

> Reported-by: Qian Cai <cai@redhat.com>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---
>  kernel/workqueue.c | 24 +++++++++++++++++-------
>  1 file changed, 17 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 9880b6c0e272..fb1418edf85c 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1848,19 +1848,29 @@ static void worker_attach_to_pool(struct worker *worker,
>  {
>  	mutex_lock(&wq_pool_attach_mutex);
>  
> -	/*
> -	 * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
> -	 * online CPUs.  It'll be re-applied when any of the CPUs come up.
> -	 */
> -	set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
> -
>  	/*
>  	 * The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains
>  	 * stable across this function.  See the comments above the flag
>  	 * definition for details.
> +	 *
> +	 * Worker might get attached to a pool *after* workqueue_offline_cpu()
> +	 * was run - e.g. created by manage_workers() from a kworker which was
> +	 * forcefully moved away by hotplug. Kworkers created from this point on
> +	 * need to have their affinity changed as if they were present during
> +	 * workqueue_offline_cpu().
> +	 *
> +	 * This will be resolved in rebind_workers().
>  	 */
> -	if (pool->flags & POOL_DISASSOCIATED)
> +	if (pool->flags & POOL_DISASSOCIATED) {
>  		worker->flags |= WORKER_UNBOUND;
> +		set_cpus_allowed_ptr(worker->task, cpu_active_mask);
> +	} else {
> +		/*
> +		 * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
> +		 * online CPUs. It'll be re-applied when any of the CPUs come up.
> +		 */

Does this comment still stand? IIUC, we should always be in the
POOL_DISASSOCIATED case if the CPU from the cpumask is offline. Unless a
pool->attrs->cpumask can have several CPUs. In that case maybe we should
check for the cpu_active_mask here too?

-- 
Vincent

> +		set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
> +	}
>  
>  	list_add_tail(&worker->node, &pool->workers);
>  	worker->pool = pool;
> -- 
> 2.27.0
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] workqueue: Fix affinity of kworkers attached during late hotplug
  2020-12-11 11:39   ` Vincent Donnefort
@ 2020-12-11 12:41     ` Peter Zijlstra
  2020-12-11 12:51     ` Peter Zijlstra
  2020-12-11 12:51     ` Valentin Schneider
  2 siblings, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2020-12-11 12:41 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: Valentin Schneider, linux-kernel, Qian Cai, tglx, mingo, bigeasy,
	qais.yousef, swood, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, tj,
	ouwen210

On Fri, Dec 11, 2020 at 11:39:21AM +0000, Vincent Donnefort wrote:
> Hi Valentin,
> 
> On Thu, Dec 10, 2020 at 04:38:30PM +0000, Valentin Schneider wrote:
> > Per-CPU kworkers forcefully migrated away by hotplug via
> > workqueue_offline_cpu() can end up spawning more kworkers via
> > 
> >   manage_workers() -> maybe_create_worker()
> > 
> > Workers created at this point will be bound using
> > 
> >   pool->attrs->cpumask
> > 
> > which in this case is wrong, as the hotplug state machine already migrated
> > all pinned kworkers away from this CPU. This ends up triggering the BUG_ON
> > condition in sched_cpu_dying() (i.e. there's a kworker enqueued on the
> > dying rq).
> > 
> > Special-case workers being attached to DISASSOCIATED pools and bind them to
> > cpu_active_mask, as if they had been present when workqueue_offline_cpu()
> > was invoked.
> > 
> > Link: https://lore.kernel.org/r/ff62e3ee994efb3620177bf7b19fab16f4866845.camel@redhat.com
> > Fixes: 06249738a41a ("workqueue: Manually break affinity on hotplug")
> 
> Isn't the problem introduced by 1cf12e0 ("sched/hotplug: Consolidate
> task migration on CPU unplug") ?
> 
> Previously we had:
> 
>  AP_WORKQUEUE_ONLINE -> set POOL_DISASSOCIATED
>    ...
>  TEARDOWN_CPU -> clear CPU in cpu_online_mask
>    |
>    |-AP_SCHED_STARTING -> migrate_tasks()
>    |
>   AP_OFFLINE
> 
> worker_attach_to_pool() is "protected" by the cpu_online_mask check in
> set_cpus_allowed_ptr(). IIUC, now that tasks are migrated before the
> cpu_online_mask is actually flipped, there's a window between
> CPUHP_AP_SCHED_WAIT_EMPTY and CPUHP_TEARDOWN_CPU where a kworker can wake up
> a new one for the hotunplugged pool, and that new kworker wouldn't be caught
> by the hotunplug migration.

Yes, very much so, however the commit Valentin picked was supposed to
preemptively fix this. So we can consider this a fix for the fix.

But I don't mind an alternative or perhaps even second Fixes tag on
this.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] workqueue: Fix affinity of kworkers attached during late hotplug
  2020-12-11 11:39   ` Vincent Donnefort
  2020-12-11 12:41     ` Peter Zijlstra
@ 2020-12-11 12:51     ` Peter Zijlstra
  2020-12-11 12:51     ` Valentin Schneider
  2 siblings, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2020-12-11 12:51 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: Valentin Schneider, linux-kernel, Qian Cai, tglx, mingo, bigeasy,
	qais.yousef, swood, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, tj,
	ouwen210

On Fri, Dec 11, 2020 at 11:39:21AM +0000, Vincent Donnefort wrote:
> On Thu, Dec 10, 2020 at 04:38:30PM +0000, Valentin Schneider wrote:
> > +	if (pool->flags & POOL_DISASSOCIATED) {
> >  		worker->flags |= WORKER_UNBOUND;
> > +		set_cpus_allowed_ptr(worker->task, cpu_active_mask);
> > +	} else {
> > +		/*
> > +		 * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
> > +		 * online CPUs. It'll be re-applied when any of the CPUs come up.
> > +		 */
> 
> Does this comment still stand? IIUC, we should always be in the
> POOL_DISASSOCIATED case if the CPU from the cpumask is offline. Unless a
> pool->attrs->cpumask can have several CPUs. In that case maybe we should
> check for the cpu_active_mask here too?

IIUC it can be a numa mask, and would still be valid in that case.

> > +		set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
> > +	}

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] workqueue: Fix affinity of kworkers attached during late hotplug
  2020-12-11 11:39   ` Vincent Donnefort
  2020-12-11 12:41     ` Peter Zijlstra
  2020-12-11 12:51     ` Peter Zijlstra
@ 2020-12-11 12:51     ` Valentin Schneider
  2020-12-11 13:13       ` Valentin Schneider
  2 siblings, 1 reply; 11+ messages in thread
From: Valentin Schneider @ 2020-12-11 12:51 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: linux-kernel, Qian Cai, Peter Zijlstra, tglx, mingo, bigeasy,
	qais.yousef, swood, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, tj,
	ouwen210


Hi Vincent,

On 11/12/20 11:39, Vincent Donnefort wrote:
> Hi Valentin,
>
> On Thu, Dec 10, 2020 at 04:38:30PM +0000, Valentin Schneider wrote:
>> Fixes: 06249738a41a ("workqueue: Manually break affinity on hotplug")
>
> Isn't the problem introduced by 1cf12e0 ("sched/hotplug: Consolidate
> task migration on CPU unplug") ?
>
> Previously we had:
>
>  AP_WORKQUEUE_ONLINE -> set POOL_DISASSOCIATED
>    ...
>  TEARDOWN_CPU -> clear CPU in cpu_online_mask
>    |
>    |-AP_SCHED_STARTING -> migrate_tasks()
>    |
>   AP_OFFLINE
>
> worker_attach_to_pool() is "protected" by the cpu_online_mask check in
> set_cpus_allowed_ptr(). IIUC, now that tasks are migrated before the
> cpu_online_mask is actually flipped, there's a window between
> CPUHP_AP_SCHED_WAIT_EMPTY and CPUHP_TEARDOWN_CPU where a kworker can wake up
> a new one for the hotunplugged pool, and that new kworker wouldn't be caught
> by the hotunplug migration.
>

You're right, the splat should only happen with that other commit. That
said, this fix complements the one referred to in Fixes:, which is the
"logic" I went for.

>> Reported-by: Qian Cai <cai@redhat.com>
>> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
>> ---
>>  kernel/workqueue.c | 24 +++++++++++++++++-------
>>  1 file changed, 17 insertions(+), 7 deletions(-)
>>
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index 9880b6c0e272..fb1418edf85c 100644
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -1848,19 +1848,29 @@ static void worker_attach_to_pool(struct worker *worker,
>>  {
>>      mutex_lock(&wq_pool_attach_mutex);
>>
>> -	/*
>> -	 * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
>> -	 * online CPUs.  It'll be re-applied when any of the CPUs come up.
>> -	 */
>> -	set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
>> -
>>      /*
>>       * The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains
>>       * stable across this function.  See the comments above the flag
>>       * definition for details.
>> +	 *
>> +	 * Worker might get attached to a pool *after* workqueue_offline_cpu()
>> +	 * was run - e.g. created by manage_workers() from a kworker which was
>> +	 * forcefully moved away by hotplug. Kworkers created from this point on
>> +	 * need to have their affinity changed as if they were present during
>> +	 * workqueue_offline_cpu().
>> +	 *
>> +	 * This will be resolved in rebind_workers().
>>       */
>> -	if (pool->flags & POOL_DISASSOCIATED)
>> +	if (pool->flags & POOL_DISASSOCIATED) {
>>              worker->flags |= WORKER_UNBOUND;
>> +		set_cpus_allowed_ptr(worker->task, cpu_active_mask);
>> +	} else {
>> +		/*
>> +		 * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
>> +		 * online CPUs. It'll be re-applied when any of the CPUs come up.
>> +		 */
>
> Does this comment still stand? IIUC, we should always be in the
> POOL_DISASSOCIATED case if the CPU from the cpumask is offline. Unless a
> pool->attrs->cpumask can have several CPUs.

AIUI that should be the case for unbound pools.

> In that case maybe we should check for the cpu_active_mask here too?

Looking at it again, I think we might need to.

IIUC you can end up with pools bound to a single NUMA node (?). In that
case, say the last CPU of a node is going down, then:

  workqueue_offline_cpu()
    wq_update_unbound_numa()
      alloc_unbound_pwq()
        get_unbound_pool()

would still pick that node, because it doesn't look at the online / active
mask. And at this point, we would affine the kworkers to that node, and
we're back to having kworkers enqueued on a (!active, online) CPU that is
going down...

The annoying thing is we can't just compare attrs->cpumask with
cpu_active_mask, because workqueue_offline_cpu() happens a few steps below
sched_cpu_deactivate() (CPUHP_AP_ACTIVE):

  CPUHP_ONLINE -> CPUHP_AP_ACTIVE # CPU X is !active

  # Some new kworker gets created here
  worker_attach_to_pool()
    !cpumask_subset(attrs->cpumask, cpu_active_mask)
    -> affine worker to active CPUs

  CPUHP_AP_ACTIVE -> CPUHP_ONLINE # CPU X is active
  # Nothing will ever correct the kworker's affinity :(

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/2] workqueue: Fix migrate_disable hotplug changes vs kworker affinity
  2020-12-10 16:38 [PATCH 0/2] workqueue: Fix migrate_disable hotplug changes vs kworker affinity Valentin Schneider
  2020-12-10 16:38 ` [PATCH 1/2] stop_machine: Add caller debug info to queue_stop_cpus_work Valentin Schneider
  2020-12-10 16:38 ` [PATCH 2/2] workqueue: Fix affinity of kworkers attached during late hotplug Valentin Schneider
@ 2020-12-11 12:52 ` Peter Zijlstra
  2 siblings, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2020-12-11 12:52 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, tglx, mingo, bigeasy, qais.yousef, swood,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vincent.donnefort, tj, ouwen210, Qian Cai

On Thu, Dec 10, 2020 at 04:38:28PM +0000, Valentin Schneider wrote:
> Valentin Schneider (2):
>   stop_machine: Add caller debug info to queue_stop_cpus_work
>   workqueue: Fix affinity of kworkers attached during late hotplug
> 
>  kernel/stop_machine.c |  1 +
>  kernel/workqueue.c    | 24 +++++++++++++++++-------
>  2 files changed, 18 insertions(+), 7 deletions(-)

TJ, would you be okay if I take these through the sched tree that
contain the migrate_disable() patches, such that problem and fix stay
together?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] workqueue: Fix affinity of kworkers attached during late hotplug
  2020-12-11 12:51     ` Valentin Schneider
@ 2020-12-11 13:13       ` Valentin Schneider
  2020-12-11 13:16         ` Vincent Donnefort
  0 siblings, 1 reply; 11+ messages in thread
From: Valentin Schneider @ 2020-12-11 13:13 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: linux-kernel, Qian Cai, Peter Zijlstra, tglx, mingo, bigeasy,
	qais.yousef, swood, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, tj,
	ouwen210

On 11/12/20 12:51, Valentin Schneider wrote:
>> In that case maybe we should check for the cpu_active_mask here too?
>
> Looking at it again, I think we might need to.
>
> IIUC you can end up with pools bound to a single NUMA node (?). In that
> case, say the last CPU of a node is going down, then:
>
>   workqueue_offline_cpu()
>     wq_update_unbound_numa()
>       alloc_unbound_pwq()
>         get_unbound_pool()
>
> would still pick that node, because it doesn't look at the online / active
> mask. And at this point, we would affine the
> kworkers to that node, and we're back to having kworkers enqueued on a
> (!active, online) CPU that is going down...

Assuming a node covers at least 2 CPUs, that can't actually happen per
is_cpu_allowed().

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] workqueue: Fix affinity of kworkers attached during late hotplug
  2020-12-11 13:13       ` Valentin Schneider
@ 2020-12-11 13:16         ` Vincent Donnefort
  0 siblings, 0 replies; 11+ messages in thread
From: Vincent Donnefort @ 2020-12-11 13:16 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Qian Cai, Peter Zijlstra, tglx, mingo, bigeasy,
	qais.yousef, swood, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, tj,
	ouwen210

On Fri, Dec 11, 2020 at 01:13:35PM +0000, Valentin Schneider wrote:
> On 11/12/20 12:51, Valentin Schneider wrote:
> >> In that case maybe we should check for the cpu_active_mask here too?
> >
> > Looking at it again, I think we might need to.
> >
> > IIUC you can end up with pools bound to a single NUMA node (?). In that
> > case, say the last CPU of a node is going down, then:
> >
> >   workqueue_offline_cpu()
> >     wq_update_unbound_numa()
> >       alloc_unbound_pwq()
> >         get_unbound_pool()
> >
> > would still pick that node, because it doesn't look at the online / active
> > mask. And at this point, we would affine the
> > kworkers to that node, and we're back to having kworkers enqueued on a
> > (!active, online) CPU that is going down...
> 
> Assuming a node covers at least 2 CPUs, that can't actually happen per
> is_cpu_allowed().

Yes indeed, my bad, no problem here.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [tip: sched/core] stop_machine: Add caller debug info to queue_stop_cpus_work
  2020-12-10 16:38 ` [PATCH 1/2] stop_machine: Add caller debug info to queue_stop_cpus_work Valentin Schneider
@ 2021-03-23 15:08   ` tip-bot2 for Valentin Schneider
  0 siblings, 0 replies; 11+ messages in thread
From: tip-bot2 for Valentin Schneider @ 2021-03-23 15:08 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Valentin Schneider, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     2a2f80ff63bc36a874ed569bcaef932a8fe43514
Gitweb:        https://git.kernel.org/tip/2a2f80ff63bc36a874ed569bcaef932a8fe43514
Author:        Valentin Schneider <valentin.schneider@arm.com>
AuthorDate:    Thu, 10 Dec 2020 16:38:29 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 23 Mar 2021 16:01:58 +01:00

stop_machine: Add caller debug info to queue_stop_cpus_work

Most callsites were covered by commit

  a8b62fd08505 ("stop_machine: Add function and caller debug info")

but this skipped queue_stop_cpus_work(). Add caller debug info to it.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201210163830.21514-2-valentin.schneider@arm.com
---
 kernel/stop_machine.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 971d8ac..cbc3027 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -409,6 +409,7 @@ static bool queue_stop_cpus_work(const struct cpumask *cpumask,
 		work->fn = fn;
 		work->arg = arg;
 		work->done = done;
+		work->caller = _RET_IP_;
 		if (cpu_stop_queue_work(cpu, work))
 			queued = true;
 	}

^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2021-03-23 15:18 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-10 16:38 [PATCH 0/2] workqueue: Fix migrate_disable hotplug changes vs kworker affinity Valentin Schneider
2020-12-10 16:38 ` [PATCH 1/2] stop_machine: Add caller debug info to queue_stop_cpus_work Valentin Schneider
2021-03-23 15:08   ` [tip: sched/core] " tip-bot2 for Valentin Schneider
2020-12-10 16:38 ` [PATCH 2/2] workqueue: Fix affinity of kworkers attached during late hotplug Valentin Schneider
2020-12-11 11:39   ` Vincent Donnefort
2020-12-11 12:41     ` Peter Zijlstra
2020-12-11 12:51     ` Peter Zijlstra
2020-12-11 12:51     ` Valentin Schneider
2020-12-11 13:13       ` Valentin Schneider
2020-12-11 13:16         ` Vincent Donnefort
2020-12-11 12:52 ` [PATCH 0/2] workqueue: Fix migrate_disable hotplug changes vs kworker affinity Peter Zijlstra
