* [PATCH 0/2] workqueue: Fix migrate_disable hotplug changes vs kworker affinity
@ 2020-12-10 16:38 Valentin Schneider
2020-12-10 16:38 ` [PATCH 1/2] stop_machine: Add caller debug info to queue_stop_cpus_work Valentin Schneider
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Valentin Schneider @ 2020-12-10 16:38 UTC (permalink / raw)
To: linux-kernel
Cc: Peter Zijlstra, tglx, mingo, bigeasy, qais.yousef, swood,
juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
mgorman, bristot, vincent.donnefort, tj, ouwen210, Qian Cai
Hi folks,
This should fix the issue reported by Qian at [1]. Dietmar mentioned some
other issue with hotplug & deadline tasks, but that's being investigated by
someone else ATM.
I would like to mention I suspect this bug comes straight from $hell, as
I've never ever had to fight off so many (mostly unrelated) issues while
looking into it. Distro kernel being mangled, git tree corruption, periods of
heisenbugism...
Cheers,
Valentin
[1]: https://lore.kernel.org/r/ff62e3ee994efb3620177bf7b19fab16f4866845.camel@redhat.com
Valentin Schneider (2):
stop_machine: Add caller debug info to queue_stop_cpus_work
workqueue: Fix affinity of kworkers attached during late hotplug
kernel/stop_machine.c | 1 +
kernel/workqueue.c | 24 +++++++++++++++++-------
2 files changed, 18 insertions(+), 7 deletions(-)
--
2.27.0
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 1/2] stop_machine: Add caller debug info to queue_stop_cpus_work
2020-12-10 16:38 [PATCH 0/2] workqueue: Fix migrate_disable hotplug changes vs kworker affinity Valentin Schneider
@ 2020-12-10 16:38 ` Valentin Schneider
2021-03-23 15:08 ` [tip: sched/core] " tip-bot2 for Valentin Schneider
2020-12-10 16:38 ` [PATCH 2/2] workqueue: Fix affinity of kworkers attached during late hotplug Valentin Schneider
2020-12-11 12:52 ` [PATCH 0/2] workqueue: Fix migrate_disable hotplug changes vs kworker affinity Peter Zijlstra
2 siblings, 1 reply; 11+ messages in thread
From: Valentin Schneider @ 2020-12-10 16:38 UTC (permalink / raw)
To: linux-kernel
Cc: Peter Zijlstra, tglx, mingo, bigeasy, qais.yousef, swood,
juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
mgorman, bristot, vincent.donnefort, tj, ouwen210, Qian Cai
Most callsites were covered by commit
a8b62fd08505 ("stop_machine: Add function and caller debug info")
but this skipped queue_stop_cpus_work(). Add caller debug info to it.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
kernel/stop_machine.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 971d8acceaec..cbc30271ea4d 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -409,6 +409,7 @@ static bool queue_stop_cpus_work(const struct cpumask *cpumask,
work->fn = fn;
work->arg = arg;
work->done = done;
+ work->caller = _RET_IP_;
if (cpu_stop_queue_work(cpu, work))
queued = true;
}
--
2.27.0
* [PATCH 2/2] workqueue: Fix affinity of kworkers attached during late hotplug
2020-12-10 16:38 [PATCH 0/2] workqueue: Fix migrate_disable hotplug changes vs kworker affinity Valentin Schneider
2020-12-10 16:38 ` [PATCH 1/2] stop_machine: Add caller debug info to queue_stop_cpus_work Valentin Schneider
@ 2020-12-10 16:38 ` Valentin Schneider
2020-12-11 11:39 ` Vincent Donnefort
2020-12-11 12:52 ` [PATCH 0/2] workqueue: Fix migrate_disable hotplug changes vs kworker affinity Peter Zijlstra
2 siblings, 1 reply; 11+ messages in thread
From: Valentin Schneider @ 2020-12-10 16:38 UTC (permalink / raw)
To: linux-kernel
Cc: Qian Cai, Peter Zijlstra, tglx, mingo, bigeasy, qais.yousef,
swood, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, bristot, vincent.donnefort, tj, ouwen210
Per-CPU kworkers forcefully migrated away by hotplug via
workqueue_offline_cpu() can end up spawning more kworkers via
manage_workers() -> maybe_create_worker()
Workers created at this point will be bound using
pool->attrs->cpumask
which in this case is wrong, as the hotplug state machine already migrated
all pinned kworkers away from this CPU. This ends up triggering the BUG_ON
condition in sched_cpu_dying() (i.e. there's a kworker enqueued on the
dying rq).
Special-case workers being attached to DISASSOCIATED pools and bind them to
cpu_active_mask, mimicking them being present when workqueue_offline_cpu()
was invoked.
Link: https://lore.kernel.org/r/ff62e3ee994efb3620177bf7b19fab16f4866845.camel@redhat.com
Fixes: 06249738a41a ("workqueue: Manually break affinity on hotplug")
Reported-by: Qian Cai <cai@redhat.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
kernel/workqueue.c | 24 +++++++++++++++++-------
1 file changed, 17 insertions(+), 7 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9880b6c0e272..fb1418edf85c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1848,19 +1848,29 @@ static void worker_attach_to_pool(struct worker *worker,
{
mutex_lock(&wq_pool_attach_mutex);
- /*
- * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
- * online CPUs. It'll be re-applied when any of the CPUs come up.
- */
- set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
-
/*
* The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains
* stable across this function. See the comments above the flag
* definition for details.
+ *
+ * Worker might get attached to a pool *after* workqueue_offline_cpu()
+ * was run - e.g. created by manage_workers() from a kworker which was
+ * forcefully moved away by hotplug. Kworkers created from this point on
+ * need to have their affinity changed as if they were present during
+ * workqueue_offline_cpu().
+ *
+ * This will be resolved in rebind_workers().
*/
- if (pool->flags & POOL_DISASSOCIATED)
+ if (pool->flags & POOL_DISASSOCIATED) {
worker->flags |= WORKER_UNBOUND;
+ set_cpus_allowed_ptr(worker->task, cpu_active_mask);
+ } else {
+ /*
+ * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
+ * online CPUs. It'll be re-applied when any of the CPUs come up.
+ */
+ set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
+ }
list_add_tail(&worker->node, &pool->workers);
worker->pool = pool;
--
2.27.0
* Re: [PATCH 2/2] workqueue: Fix affinity of kworkers attached during late hotplug
2020-12-10 16:38 ` [PATCH 2/2] workqueue: Fix affinity of kworkers attached during late hotplug Valentin Schneider
@ 2020-12-11 11:39 ` Vincent Donnefort
2020-12-11 12:41 ` Peter Zijlstra
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Vincent Donnefort @ 2020-12-11 11:39 UTC (permalink / raw)
To: Valentin Schneider
Cc: linux-kernel, Qian Cai, Peter Zijlstra, tglx, mingo, bigeasy,
qais.yousef, swood, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, bristot, tj,
ouwen210
Hi Valentin,
On Thu, Dec 10, 2020 at 04:38:30PM +0000, Valentin Schneider wrote:
> Per-CPU kworkers forcefully migrated away by hotplug via
> workqueue_offline_cpu() can end up spawning more kworkers via
>
> manage_workers() -> maybe_create_worker()
>
> Workers created at this point will be bound using
>
> pool->attrs->cpumask
>
> which in this case is wrong, as the hotplug state machine already migrated
> all pinned kworkers away from this CPU. This ends up triggering the BUG_ON
> condition in sched_cpu_dying() (i.e. there's a kworker enqueued on the
> dying rq).
>
> Special-case workers being attached to DISASSOCIATED pools and bind them to
> cpu_active_mask, mimicking them being present when workqueue_offline_cpu()
> was invoked.
>
> Link: https://lore.kernel.org/r/ff62e3ee994efb3620177bf7b19fab16f4866845.camel@redhat.com
> Fixes: 06249738a41a ("workqueue: Manually break affinity on hotplug")
Isn't the problem introduced by 1cf12e0 ("sched/hotplug: Consolidate
task migration on CPU unplug") ?
Previously we had:
  AP_WORKQUEUE_ONLINE -> set POOL_DISASSOCIATED
  ...
  TEARDOWN_CPU -> clear CPU in cpu_online_mask
    |
    |-AP_SCHED_STARTING -> migrate_tasks()
    |
  AP_OFFLINE
worker_attach_to_pool() is "protected" by the cpu_online_mask in
set_cpus_allowed_ptr(). IIUC, now that the tasks are migrated before the
cpu_online_mask is actually flipped, there's a window, between
CPUHP_AP_SCHED_WAIT_EMPTY and CPUHP_TEARDOWN_CPU, where a kworker can wake up
a new one for the hotunplugged pool that wouldn't be caught by the
hotunplug migration.
> Reported-by: Qian Cai <cai@redhat.com>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---
> kernel/workqueue.c | 24 +++++++++++++++++-------
> 1 file changed, 17 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 9880b6c0e272..fb1418edf85c 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1848,19 +1848,29 @@ static void worker_attach_to_pool(struct worker *worker,
> {
> mutex_lock(&wq_pool_attach_mutex);
>
> - /*
> - * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
> - * online CPUs. It'll be re-applied when any of the CPUs come up.
> - */
> - set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
> -
> /*
> * The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains
> * stable across this function. See the comments above the flag
> * definition for details.
> + *
> + * Worker might get attached to a pool *after* workqueue_offline_cpu()
> + * was run - e.g. created by manage_workers() from a kworker which was
> + * forcefully moved away by hotplug. Kworkers created from this point on
> + * need to have their affinity changed as if they were present during
> + * workqueue_offline_cpu().
> + *
> + * This will be resolved in rebind_workers().
> */
> - if (pool->flags & POOL_DISASSOCIATED)
> + if (pool->flags & POOL_DISASSOCIATED) {
> worker->flags |= WORKER_UNBOUND;
> + set_cpus_allowed_ptr(worker->task, cpu_active_mask);
> + } else {
> + /*
> + * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
> + * online CPUs. It'll be re-applied when any of the CPUs come up.
> + */
Does this comment still stand? IIUC, we should always be in the
POOL_DISASSOCIATED case if the CPU from cpumask is offline. Unless a
pool->attrs->cpumask can have several CPUs. In that case, maybe we should
also check cpu_active_mask here?
--
Vincent
> + set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
> + }
>
> list_add_tail(&worker->node, &pool->workers);
> worker->pool = pool;
> --
> 2.27.0
>
* Re: [PATCH 2/2] workqueue: Fix affinity of kworkers attached during late hotplug
2020-12-11 11:39 ` Vincent Donnefort
@ 2020-12-11 12:41 ` Peter Zijlstra
2020-12-11 12:51 ` Peter Zijlstra
2020-12-11 12:51 ` Valentin Schneider
2 siblings, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2020-12-11 12:41 UTC (permalink / raw)
To: Vincent Donnefort
Cc: Valentin Schneider, linux-kernel, Qian Cai, tglx, mingo, bigeasy,
qais.yousef, swood, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, bristot, tj,
ouwen210
On Fri, Dec 11, 2020 at 11:39:21AM +0000, Vincent Donnefort wrote:
> Hi Valentin,
>
> On Thu, Dec 10, 2020 at 04:38:30PM +0000, Valentin Schneider wrote:
> > Per-CPU kworkers forcefully migrated away by hotplug via
> > workqueue_offline_cpu() can end up spawning more kworkers via
> >
> > manage_workers() -> maybe_create_worker()
> >
> > Workers created at this point will be bound using
> >
> > pool->attrs->cpumask
> >
> > which in this case is wrong, as the hotplug state machine already migrated
> > all pinned kworkers away from this CPU. This ends up triggering the BUG_ON
> > condition in sched_cpu_dying() (i.e. there's a kworker enqueued on the
> > dying rq).
> >
> > Special-case workers being attached to DISASSOCIATED pools and bind them to
> > cpu_active_mask, mimicking them being present when workqueue_offline_cpu()
> > was invoked.
> >
> > Link: https://lore.kernel.org/r/ff62e3ee994efb3620177bf7b19fab16f4866845.camel@redhat.com
> > Fixes: 06249738a41a ("workqueue: Manually break affinity on hotplug")
>
> Isn't the problem introduced by 1cf12e0 ("sched/hotplug: Consolidate
> task migration on CPU unplug") ?
>
> Previously we had:
>
> AP_WORKQUEUE_ONLINE -> set POOL_DISASSOCIATED
> ...
> TEARDOWN_CPU -> clear CPU in cpu_online_mask
> |
> |-AP_SCHED_STARTING -> migrate_tasks()
> |
> AP_OFFLINE
>
> worker_attach_to_pool() is "protected" by the cpu_online_mask in
> set_cpus_allowed_ptr(). IIUC, now that the tasks are migrated before the
> cpu_online_mask is actually flipped, there's a window, between
> CPUHP_AP_SCHED_WAIT_EMPTY and CPUHP_TEARDOWN_CPU, where a kworker can wake up
> a new one for the hotunplugged pool that wouldn't be caught by the
> hotunplug migration.
Yes, very much so, however the commit Valentin picked was supposed to
preemptively fix this. So we can consider this a fix for the fix.
But I don't mind an alternative or perhaps even second Fixes tag on
this.
* Re: [PATCH 2/2] workqueue: Fix affinity of kworkers attached during late hotplug
2020-12-11 11:39 ` Vincent Donnefort
2020-12-11 12:41 ` Peter Zijlstra
@ 2020-12-11 12:51 ` Peter Zijlstra
2020-12-11 12:51 ` Valentin Schneider
2 siblings, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2020-12-11 12:51 UTC (permalink / raw)
To: Vincent Donnefort
Cc: Valentin Schneider, linux-kernel, Qian Cai, tglx, mingo, bigeasy,
qais.yousef, swood, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, bristot, tj,
ouwen210
On Fri, Dec 11, 2020 at 11:39:21AM +0000, Vincent Donnefort wrote:
> On Thu, Dec 10, 2020 at 04:38:30PM +0000, Valentin Schneider wrote:
> > + if (pool->flags & POOL_DISASSOCIATED) {
> > worker->flags |= WORKER_UNBOUND;
> > + set_cpus_allowed_ptr(worker->task, cpu_active_mask);
> > + } else {
> > + /*
> > + * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
> > + * online CPUs. It'll be re-applied when any of the CPUs come up.
> > + */
>
> Does this comment still stand? IIUC, we should always be in the
> POOL_DISASSOCIATED case if the CPU from cpumask is offline. Unless a
> pool->attrs->cpumask can have several CPUs. In that case maybe we should check
> for the cpu_active_mask here too?
IIUC it can be a numa mask, and would still be valid in that case.
> > + set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
> > + }
* Re: [PATCH 2/2] workqueue: Fix affinity of kworkers attached during late hotplug
2020-12-11 11:39 ` Vincent Donnefort
2020-12-11 12:41 ` Peter Zijlstra
2020-12-11 12:51 ` Peter Zijlstra
@ 2020-12-11 12:51 ` Valentin Schneider
2020-12-11 13:13 ` Valentin Schneider
2 siblings, 1 reply; 11+ messages in thread
From: Valentin Schneider @ 2020-12-11 12:51 UTC (permalink / raw)
To: Vincent Donnefort
Cc: linux-kernel, Qian Cai, Peter Zijlstra, tglx, mingo, bigeasy,
qais.yousef, swood, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, bristot, tj,
ouwen210
Hi Vincent,
On 11/12/20 11:39, Vincent Donnefort wrote:
> Hi Valentin,
>
> On Thu, Dec 10, 2020 at 04:38:30PM +0000, Valentin Schneider wrote:
>> Fixes: 06249738a41a ("workqueue: Manually break affinity on hotplug")
>
> Isn't the problem introduced by 1cf12e0 ("sched/hotplug: Consolidate
> task migration on CPU unplug") ?
>
> Previously we had:
>
> AP_WORKQUEUE_ONLINE -> set POOL_DISASSOCIATED
> ...
> TEARDOWN_CPU -> clear CPU in cpu_online_mask
> |
> |-AP_SCHED_STARTING -> migrate_tasks()
> |
> AP_OFFLINE
>
> worker_attach_to_pool() is "protected" by the cpu_online_mask in
> set_cpus_allowed_ptr(). IIUC, now that the tasks are migrated before the
> cpu_online_mask is actually flipped, there's a window, between
> CPUHP_AP_SCHED_WAIT_EMPTY and CPUHP_TEARDOWN_CPU, where a kworker can wake up
> a new one for the hotunplugged pool that wouldn't be caught by the
> hotunplug migration.
>
You're right, the splat should only happen with that other commit. That
said, this fix complements the one referred to in Fixes:, which is the
"logic" I went for.
>> Reported-by: Qian Cai <cai@redhat.com>
>> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
>> ---
>> kernel/workqueue.c | 24 +++++++++++++++++-------
>> 1 file changed, 17 insertions(+), 7 deletions(-)
>>
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index 9880b6c0e272..fb1418edf85c 100644
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -1848,19 +1848,29 @@ static void worker_attach_to_pool(struct worker *worker,
>> {
>> mutex_lock(&wq_pool_attach_mutex);
>>
>> - /*
>> - * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
>> - * online CPUs. It'll be re-applied when any of the CPUs come up.
>> - */
>> - set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
>> -
>> /*
>> * The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains
>> * stable across this function. See the comments above the flag
>> * definition for details.
>> + *
>> + * Worker might get attached to a pool *after* workqueue_offline_cpu()
>> + * was run - e.g. created by manage_workers() from a kworker which was
>> + * forcefully moved away by hotplug. Kworkers created from this point on
>> + * need to have their affinity changed as if they were present during
>> + * workqueue_offline_cpu().
>> + *
>> + * This will be resolved in rebind_workers().
>> */
>> - if (pool->flags & POOL_DISASSOCIATED)
>> + if (pool->flags & POOL_DISASSOCIATED) {
>> worker->flags |= WORKER_UNBOUND;
>> + set_cpus_allowed_ptr(worker->task, cpu_active_mask);
>> + } else {
>> + /*
>> + * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
>> + * online CPUs. It'll be re-applied when any of the CPUs come up.
>> + */
>
> Does this comment still stand? IIUC, we should always be in the
> POOL_DISASSOCIATED case if the CPU from cpumask is offline. Unless a
> pool->attrs->cpumask can have several CPUs.
AIUI that should be the case for unbound pools.
> In that case maybe we should check for the cpu_active_mask here too?
Looking at it again, I think we might need to.
IIUC you can end up with pools bound to a single NUMA node (?). In that
case, say the last CPU of a node is going down, then:
workqueue_offline_cpu()
wq_update_unbound_numa()
alloc_unbound_pwq()
get_unbound_pool()
would still pick that node, because it doesn't look at the online / active
mask. And at this point, we would affine the kworkers to that node, and
we're back to having kworkers enqueued on a (!active, online) CPU that is
going down...
The annoying thing is we can't just compare attrs->cpumask with
cpu_active_mask, because workqueue_offline_cpu() happens a few steps below
sched_cpu_deactivate() (CPUHP_AP_ACTIVE):
  CPUHP_ONLINE -> CPUHP_AP_ACTIVE	# CPU X is !active
    # Some new kworker gets created here
    worker_attach_to_pool()
      !cpumask_subset(attrs->cpumask, cpu_active_mask)
      -> affine worker to active CPUs
  CPUHP_AP_ACTIVE -> CPUHP_ONLINE	# CPU X is active
    # Nothing will ever correct the kworker's affinity :(
* Re: [PATCH 0/2] workqueue: Fix migrate_disable hotplug changes vs kworker affinity
2020-12-10 16:38 [PATCH 0/2] workqueue: Fix migrate_disable hotplug changes vs kworker affinity Valentin Schneider
2020-12-10 16:38 ` [PATCH 1/2] stop_machine: Add caller debug info to queue_stop_cpus_work Valentin Schneider
2020-12-10 16:38 ` [PATCH 2/2] workqueue: Fix affinity of kworkers attached during late hotplug Valentin Schneider
@ 2020-12-11 12:52 ` Peter Zijlstra
2 siblings, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2020-12-11 12:52 UTC (permalink / raw)
To: Valentin Schneider
Cc: linux-kernel, tglx, mingo, bigeasy, qais.yousef, swood,
juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
mgorman, bristot, vincent.donnefort, tj, ouwen210, Qian Cai
On Thu, Dec 10, 2020 at 04:38:28PM +0000, Valentin Schneider wrote:
> Valentin Schneider (2):
> stop_machine: Add caller debug info to queue_stop_cpus_work
> workqueue: Fix affinity of kworkers attached during late hotplug
>
> kernel/stop_machine.c | 1 +
> kernel/workqueue.c | 24 +++++++++++++++++-------
> 2 files changed, 18 insertions(+), 7 deletions(-)
TJ, would you be okay if I take these through the sched tree that
contain the migrate_disable() patches, such that problem and fix stay
together?
* Re: [PATCH 2/2] workqueue: Fix affinity of kworkers attached during late hotplug
2020-12-11 12:51 ` Valentin Schneider
@ 2020-12-11 13:13 ` Valentin Schneider
2020-12-11 13:16 ` Vincent Donnefort
0 siblings, 1 reply; 11+ messages in thread
From: Valentin Schneider @ 2020-12-11 13:13 UTC (permalink / raw)
To: Vincent Donnefort
Cc: linux-kernel, Qian Cai, Peter Zijlstra, tglx, mingo, bigeasy,
qais.yousef, swood, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, bristot, tj,
ouwen210
On 11/12/20 12:51, Valentin Schneider wrote:
>> In that case maybe we should check for the cpu_active_mask here too ?
>
> Looking at it again, I think we might need to.
>
> IIUC you can end up with pools bound to a single NUMA node (?). In that
> case, say the last CPU of a node is going down, then:
>
> workqueue_offline_cpu()
> wq_update_unbound_numa()
> alloc_unbound_pwq()
> get_unbound_pool()
>
> would still pick that node, because it doesn't look at the online / active
> mask. And at this point, we would affine the
> kworkers to that node, and we're back to having kworkers enqueued on a
> (!active, online) CPU that is going down...
Assuming a node covers at least 2 CPUs, that can't actually happen per
is_cpu_allowed().
* Re: [PATCH 2/2] workqueue: Fix affinity of kworkers attached during late hotplug
2020-12-11 13:13 ` Valentin Schneider
@ 2020-12-11 13:16 ` Vincent Donnefort
0 siblings, 0 replies; 11+ messages in thread
From: Vincent Donnefort @ 2020-12-11 13:16 UTC (permalink / raw)
To: Valentin Schneider
Cc: linux-kernel, Qian Cai, Peter Zijlstra, tglx, mingo, bigeasy,
qais.yousef, swood, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, bristot, tj,
ouwen210
On Fri, Dec 11, 2020 at 01:13:35PM +0000, Valentin Schneider wrote:
> On 11/12/20 12:51, Valentin Schneider wrote:
> >> In that case maybe we should check for the cpu_active_mask here too ?
> >
> > Looking at it again, I think we might need to.
> >
> > IIUC you can end up with pools bound to a single NUMA node (?). In that
> > case, say the last CPU of a node is going down, then:
> >
> > workqueue_offline_cpu()
> > wq_update_unbound_numa()
> > alloc_unbound_pwq()
> > get_unbound_pool()
> >
> > would still pick that node, because it doesn't look at the online / active
> > mask. And at this point, we would affine the
> > kworkers to that node, and we're back to having kworkers enqueued on a
> > (!active, online) CPU that is going down...
>
> Assuming a node covers at least 2 CPUs, that can't actually happen per
> is_cpu_allowed().
Yes indeed, my bad, no problem here.
* [tip: sched/core] stop_machine: Add caller debug info to queue_stop_cpus_work
2020-12-10 16:38 ` [PATCH 1/2] stop_machine: Add caller debug info to queue_stop_cpus_work Valentin Schneider
@ 2021-03-23 15:08 ` tip-bot2 for Valentin Schneider
0 siblings, 0 replies; 11+ messages in thread
From: tip-bot2 for Valentin Schneider @ 2021-03-23 15:08 UTC (permalink / raw)
To: linux-tip-commits
Cc: Valentin Schneider, Peter Zijlstra (Intel), x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 2a2f80ff63bc36a874ed569bcaef932a8fe43514
Gitweb: https://git.kernel.org/tip/2a2f80ff63bc36a874ed569bcaef932a8fe43514
Author: Valentin Schneider <valentin.schneider@arm.com>
AuthorDate: Thu, 10 Dec 2020 16:38:29
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 23 Mar 2021 16:01:58 +01:00
stop_machine: Add caller debug info to queue_stop_cpus_work
Most callsites were covered by commit
a8b62fd08505 ("stop_machine: Add function and caller debug info")
but this skipped queue_stop_cpus_work(). Add caller debug info to it.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201210163830.21514-2-valentin.schneider@arm.com
---
kernel/stop_machine.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 971d8ac..cbc3027 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -409,6 +409,7 @@ static bool queue_stop_cpus_work(const struct cpumask *cpumask,
work->fn = fn;
work->arg = arg;
work->done = done;
+ work->caller = _RET_IP_;
if (cpu_stop_queue_work(cpu, work))
queued = true;
}