* [PATCH] workqueue: Make create_worker() safe against spurious wakeups
From: Petr Mladek @ 2022-06-22 14:08 UTC
To: Tejun Heo; +Cc: Lai Jiangshan, Michal Hocko, linux-kernel, Petr Mladek

A system crashed with the following BUG() report:

[115147.050484] BUG: kernel NULL pointer dereference, address: 0000000000000000
[115147.050488] #PF: supervisor write access in kernel mode
[115147.050489] #PF: error_code(0x0002) - not-present page
[115147.050490] PGD 0 P4D 0
[115147.050494] Oops: 0002 [#1] PREEMPT_RT SMP NOPTI
[115147.050498] CPU: 1 PID: 16213 Comm: kthreadd Kdump: loaded Tainted: G O X 5.3.18-2-rt #1 SLE15-SP2 (unreleased)
[115147.050510] RIP: 0010:_raw_spin_lock_irq+0x14/0x30
[115147.050513] Code: 89 c6 e8 5f 7a 9b ff 66 90 c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 fa 65 ff 05 fb 53 6c 55 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 01 c3 89 c6 e8 2e 7a 9b ff 66 90 c3 90 90 90 90 90
[115147.050514] RSP: 0018:ffffb0f68822fed8 EFLAGS: 00010046
[115147.050515] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[115147.050516] RDX: 0000000000000001 RSI: 0000000000000002 RDI: 0000000000000000
[115147.050517] RBP: ffff9ca73af40a40 R08: 0000000000000001 R09: 0000000000027340
[115147.050519] R10: ffffb0f68822fe70 R11: 00000000000000a9 R12: ffffb0f688067dc0
[115147.050520] R13: ffff9ca77e9a8000 R14: ffff9ca7634ca780 R15: ffff9ca7634ca780
[115147.050521] FS: 0000000000000000(0000) GS:ffff9ca77fb00000(0000) knlGS:0000000000000000
[115147.050523] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[115147.050524] CR2: 00000000000000b8 CR3: 000000004472e000 CR4: 00000000003406e0
[115147.050524] Call Trace:
[115147.050533]  worker_thread+0xb4/0x3c0
[115147.050538]  ? process_one_work+0x4a0/0x4a0
[115147.050540]  kthread+0x152/0x170
[115147.050542]  ? kthread_park+0xa0/0xa0
[115147.050544]  ret_from_fork+0x35/0x40

Further debugging showed that the worker thread was woken before
worker_attach_to_pool() finished in create_worker().

Any kthread is supposed to stay in TASK_UNINTERRUPTIBLE sleep until it
is explicitly woken. But a spurious wakeup might break this expectation.

As a result, worker_thread() might read worker->pool before it was set
in create_worker() by worker_attach_to_pool(). Also manage_workers()
might want to create yet another worker before worker->pool->nr_workers
is updated. It is a kind of chicken & egg problem.

Synchronize these operations using the completion API.

Note that worker->pool might then be read without wq_pool_attach_mutex.
A normal worker always belongs to the same pool.

Also note that rescuer_thread() does not need this because all needed
values are set before the kthread is created. It is connected with a
particular workqueue. It is attached to different pools when needed.

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 kernel/workqueue.c          | 13 ++++++++++++-
 kernel/workqueue_internal.h |  1 +
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 1ea50f6be843..9586b0797145 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1939,6 +1939,7 @@ static struct worker *create_worker(struct worker_pool *pool)
 		goto fail;
 
 	worker->id = id;
+	init_completion(&worker->ready_to_start);
 
 	if (pool->cpu >= 0)
 		snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
@@ -1961,6 +1962,7 @@ static struct worker *create_worker(struct worker_pool *pool)
 	raw_spin_lock_irq(&pool->lock);
 	worker->pool->nr_workers++;
 	worker_enter_idle(worker);
+	complete(&worker->ready_to_start);
 	wake_up_process(worker->task);
 	raw_spin_unlock_irq(&pool->lock);
 
@@ -2378,10 +2380,19 @@ static void set_pf_worker(bool val)
 static int worker_thread(void *__worker)
 {
 	struct worker *worker = __worker;
-	struct worker_pool *pool = worker->pool;
+	struct worker_pool *pool;
 
 	/* tell the scheduler that this is a workqueue worker */
 	set_pf_worker(true);
+
+	/*
+	 * Wait until the worker is ready to get started. It must be attached
+	 * to a pool first. This is needed because of spurious wakeups.
+	 */
+	while (wait_for_completion_interruptible(&worker->ready_to_start))
+		;
+	pool = worker->pool;
+
 woke_up:
 	raw_spin_lock_irq(&pool->lock);
 
diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
index e00b1204a8e9..ffca2783c8bf 100644
--- a/kernel/workqueue_internal.h
+++ b/kernel/workqueue_internal.h
@@ -41,6 +41,7 @@ struct worker {
 						/* L: for rescuers */
 	struct list_head	node;		/* A: anchored at pool->workers */
 						/* A: runs through worker->node */
+	struct completion	ready_to_start;	/* None: handle spurious wakeup */
 	unsigned long		last_active;	/* L: last active timestamp */
 	unsigned int		flags;		/* X: flags */
 
-- 
2.35.3

^ permalink raw reply	[flat|nested] 52+ messages in thread
* Re: [PATCH] workqueue: Make create_worker() safe against spurious wakeups
From: Petr Mladek @ 2022-06-23 7:00 UTC
To: Tejun Heo; +Cc: Lai Jiangshan, Michal Hocko, linux-kernel

On Wed 2022-06-22 16:08:53, Petr Mladek wrote:
> A system crashed with the following BUG() report:
>
> [115147.050484] BUG: kernel NULL pointer dereference, address: 0000000000000000
...
> @@ -2378,10 +2380,19 @@ static void set_pf_worker(bool val)
>  static int worker_thread(void *__worker)
>  {
>  	struct worker *worker = __worker;
> -	struct worker_pool *pool = worker->pool;
> +	struct worker_pool *pool;
>  
>  	/* tell the scheduler that this is a workqueue worker */
>  	set_pf_worker(true);
> +
> +	/*
> +	 * Wait until the worker is ready to get started. It must be attached
> +	 * to a pool first. This is needed because of spurious wakeups.
> +	 */
> +	while (wait_for_completion_interruptible(&worker->ready_to_start))

I have realized that this is wrong. I used the _interruptible() variant
because we do not know how long it would need to sleep, and a long sleep
in TASK_UNINTERRUPTIBLE might trigger the hung task warning.

But kthread_bind_mask() requires that the kthread be in the
TASK_UNINTERRUPTIBLE state.

Note that even a freshly created kthread sleeps in TASK_UNINTERRUPTIBLE,
see kthread() in kernel/kthread.c. But it does not trigger the hung task
warning because there is a special check for this case, see
check_hung_task():

	/*
	 * When a freshly created task is scheduled once, changes its state to
	 * TASK_UNINTERRUPTIBLE without having ever been switched out once, it
	 * musn't be checked.
	 */
	if (unlikely(!switch_count))
		return;

The following seems to work well:

	while (!wait_for_completion_timeout(&worker->ready_to_start, 5 * HZ))
		;

That said, I am not super happy with this solution. IMHO, it is a flaw in
the kthread() API. But I do not see a reasonable way to fix it, except
for adding a special API for waking up a freshly created kthread for the
first time. That would require updating all users, and I am not sure how
common this problem is and whether it would be worth it.

And I used the completion API to avoid tricky code using the low-level
scheduler API.

> +	;
> +	pool = worker->pool;
> +
>  woke_up:
>  	raw_spin_lock_irq(&pool->lock);
>

I am going to wait with v2 until I get some feedback for this approach.

Best Regards,
Petr
* Re: [PATCH] workqueue: Make create_worker() safe against spurious wakeups
From: Michal Hocko @ 2022-06-23 7:14 UTC
To: Petr Mladek; +Cc: Tejun Heo, Lai Jiangshan, linux-kernel

On Thu 23-06-22 09:00:12, Petr Mladek wrote:
[...]
> The following seems to work well:
>
> 	while (!wait_for_completion_timeout(&worker->ready_to_start, 5 * HZ))
> 		;

This is really ugly and I do not really see why this is needed. Should
there be such an overloaded system that create_worker cannot finish the
initialization before khungtask triggers, then we should be notified. I
do not see how papering this over would be helpful.

That being said, I really think that this should be a plain
wait_for_completion().

With that changed, feel free to add
Reviewed-by: Michal Hocko <mhocko@suse.com>

and thanks for helping to debug this and prepare the fix. I was not
really aware of the spurious wakeup scenario. Btw. it would be great to
describe that in more detail as well for future reference.

Thanks!
-- 
Michal Hocko
SUSE Labs
* Re: [PATCH] workqueue: Make create_worker() safe against spurious wakeups
From: Hillf Danton @ 2022-06-26 4:19 UTC
To: Petr Mladek; +Cc: Tejun Heo, Thomas Gleixner, linux-mm, Michal Hocko, linux-kernel

On Thu, 23 Jun 2022 09:00:12 +0200 Petr Mladek wrote:
[...]
> That said, I am not super happy with this solution. IMHO, it is a flaw
> in the kthread() API. But I do not see a reasonable way how to fix
> this except for adding a special API how to wake up the freshly
> created kthread for the first time.
[...]
> I am going to wait with v2 until I get some feedback for this approach.

Hey Petr,

Apart from the completion method you proposed, another option is to try
the park road: the WQ worker is created parked and then unparked in
addition to the normal wakeup, to cut the chance for it to go to work
before the office is cleaned up.

Only for thoughts now.

Hillf

+++ b/kernel/workqueue.c
@@ -1946,7 +1946,7 @@ static struct worker *create_worker(stru
 	else
 		snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
 
-	worker->task = kthread_create_on_node(worker_thread, worker, pool->node,
+	worker->task = park_kthread_create_on_node(worker_thread, worker, pool->node,
 					      "kworker/%s", id_buf);
 	if (IS_ERR(worker->task))
 		goto fail;
@@ -1964,6 +1964,7 @@ static struct worker *create_worker(stru
 	wake_up_process(worker->task);
 	raw_spin_unlock_irq(&pool->lock);
 
+	kthread_unpark(worker->task);
 	return worker;
 
 fail:
+++ b/kernel/kthread.c
@@ -46,6 +46,7 @@ struct kthread_create_info
 	struct task_struct *result;
 	struct completion *done;
+	int park;
 
 	struct list_head list;
 };
@@ -350,6 +351,8 @@ static int kthread(void *_create)
 	self->threadfn = threadfn;
 	self->data = data;
+	if (create->park)
+		set_bit(KTHREAD_SHOULD_PARK, &self->flags);
 	/*
 	 * The new thread inherited kthreadd's priority and CPU mask. Reset
 	 * back to default in case they have been changed.
@@ -414,7 +417,8 @@ static __printf(4, 0)
 struct task_struct *__kthread_create_on_node(int (*threadfn)(void *data),
 					     void *data, int node,
 					     const char namefmt[],
-					     va_list args)
+					     va_list args,
+					     int park)
 {
 	DECLARE_COMPLETION_ONSTACK(done);
 	struct task_struct *task;
@@ -427,6 +431,7 @@ struct task_struct *__kthread_create_on_
 	create->data = data;
 	create->node = node;
 	create->done = &done;
+	create->park = park;
 
 	spin_lock(&kthread_create_lock);
 	list_add_tail(&create->list, &kthread_create_list);
@@ -509,13 +514,28 @@ struct task_struct *kthread_create_on_no
 	va_list args;
 
 	va_start(args, namefmt);
-	task = __kthread_create_on_node(threadfn, data, node, namefmt, args);
+	task = __kthread_create_on_node(threadfn, data, node, namefmt, args, 0);
 	va_end(args);
 
 	return task;
 }
 EXPORT_SYMBOL(kthread_create_on_node);
 
+struct task_struct *park_kthread_create_on_node(int (*threadfn)(void *data),
+						void *data, int node,
+						const char namefmt[],
+						...)
+{
+	struct task_struct *task;
+	va_list args;
+
+	va_start(args, namefmt);
+	task = __kthread_create_on_node(threadfn, data, node, namefmt, args, 1);
+	va_end(args);
+
+	return task;
+}
+
 static void __kthread_bind_mask(struct task_struct *p, const struct cpumask *mask, unsigned int state)
 {
 	unsigned long flags;
@@ -852,7 +872,7 @@ __kthread_create_worker(int cpu, unsigne
 	node = cpu_to_node(cpu);
 
 	task = __kthread_create_on_node(kthread_worker_fn, worker,
-					node, namefmt, args);
+					node, namefmt, args, 0);
 	if (IS_ERR(task))
 		goto fail_task;
* re. Spurious wakeup on a newly created kthread
From: Tejun Heo @ 2022-06-25 5:00 UTC
To: Petr Mladek; +Cc: Lai Jiangshan, Michal Hocko, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Linus Torvalds, Oleg Nesterov, Eric W. Biederman

Hello,

cc'ing a random assortment of people who touched kernel/kthread.c and
others who would know better.

So, Petr debugged a NULL deref in workqueue code to a spurious wakeup on
a newly created kthread. The abbreviated patch description follows. The
original message is at
http://lkml.kernel.org/r/20220622140853.31383-1-pmladek@suse.com

On Wed, Jun 22, 2022 at 04:08:53PM +0200, Petr Mladek wrote:
> A system crashed with the following BUG() report:
>
> [115147.050484] BUG: kernel NULL pointer dereference, address: 0000000000000000
...
> [115147.050524] Call Trace:
> [115147.050533]  worker_thread+0xb4/0x3c0
> [115147.050540]  kthread+0x152/0x170
> [115147.050544]  ret_from_fork+0x35/0x40
>
> Further debugging showed that the worker thread was woken
> before worker_attach_to_pool() finished in create_worker().
>
> Any kthread is supposed to stay in TASK_UNINTERRUPTIBLE sleep
> until it is explicitly woken. But a spurious wakeup might
> break this expectation.
>
> As a result, worker_thread() might read worker->pool before
> it was set in create_worker() by worker_attach_to_pool().
> Also manage_workers() might want to create yet another worker
> before worker->pool->nr_workers is updated. It is a kind of
> chicken & egg problem.

tl;dr is that the worker creation code expects a newly created worker
kthread to sit tight until the creator finishes setting up stuff and
sends the initial wakeup. However, something which wasn't identified in
the report (Petr, it'd be great if you could find out who did the
wakeup) wakes up the new kthread before the creation path is done with
init, which causes the new kthread to try to deref a NULL pointer.

Petr fixed the problem by adding an extra handshake step so that the new
kthread explicitly waits for the creation path, which is fine, but the
picture isn't making sense to me.

* Are spurious wakeups allowed? The way that we do set_current_state()
  in every iteration in wait_event() seems to suggest that we expect
  someone to spuriously flip the task state to RUNNING.

* However, if we're to expect spurious wakeups for anybody anytime, why
  does a newly created kthread bother with schedule_preempt_disabled()
  in kernel/kthread.c::kthread() at all? It can't guarantee anything and
  all it does is mask subtle bugs.

What am I missing here?

Thanks.

-- 
tejun
* Re: re. Spurious wakeup on a newly created kthread
From: Linus Torvalds @ 2022-06-25 17:01 UTC
To: Tejun Heo; +Cc: Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov, Eric W. Biederman

On Fri, Jun 24, 2022 at 10:00 PM Tejun Heo <tj@kernel.org> wrote:
>
> So, Petr debugged a NULL deref in workqueue code to a spurious wakeup
> on a newly created kthread.

What? No. That patch can't be right for several reasons.

What we call "spurious wakeups" exist, but they are about wakeups that
happen from being on a _previous_ wait-queue, and having already been
removed from it.

They aren't "really" spurious, they are just asynchronous enough (and
thus unexpected) that you basically should never have a "sleep on
wait-queue" without then looping and re-testing the condition.

There is no _truly_ spurious wakeup. You were always woken up for a
reason, it's just that there are more reasons than the entirely obvious
ones.

For example, the reason that quoted patch cannot be right is that this
code pattern:

	while (wait_for_completion_interruptible(&worker->ready_to_start))
		;

is not valid kernel code. EVER. There is absolutely no way that can be
correct.

Either that code can take a signal, or it cannot. If it can take a
signal, it had better react to said signal. If it cannot, it must not
use an interruptible sleep - since now that loop turned into a
kernel-side busy-loop.

So NAK on this kind of crazy "I don't know what happened, so I'll just
add *more* bugs to the code" voodoo programming.

And no, we don't "fix" that by then adding a timeout.

Stop this "add random code" thing.

If you cannot be woken up before X happens, then you should:

 - not go to sleep before X happens

 - not add yourself to any wait-queues before X happens

 - not expose your process to others before X happens

The solution is *not* to add some completion with a random "wait for
this before waking".

I think the problem here is much more fundamental: you expect a new
thread to not wake up until you've told it to.

We do have that infrastructure in the kernel: when you create a new
thread, you can do various setup, and the thread won't actually run
until you do "wake_up_new_task()" on it.

However, that's not how kernel_thread() (or kernel_clone()) works. Those
will call wake_up_new_task(p) for you, and as such a new kernel thread
will immediately start running.

So I think the expectations here are entirely wrong. I think
create_worker() is fundamentally buggy in how it does that

	/* start the newly created worker */
	..
	wake_up_process(worker->task);

because that wake_up_process() is already much too late. The process got
woken up already, because it was created by create_kthread() ->
kernel_thread() -> kernel_clone(), which does that wake_up_new_task()
and it starts running.

If you want to do thread setup *before* the thread is running, it needs
to be done before that wake_up_new_task().

That's very possible. Look at what create_io_thread() does, for example:
it never calls wake_up_new_task() at all, and leaves that to the caller,
which has done the extra setup.

Or the kernel_clone_args could have an "init" function that gets called
before the wake_up_new_task() is done. Or a number of other solutions.

But no, we're not randomly adding some new completion because people
were confused and thought they were waking things up when it was already
awake from before.

	Linus
* Re: re. Spurious wakeup on a newly created kthread 2022-06-25 17:01 ` Linus Torvalds @ 2022-06-25 17:36 ` Eric W. Biederman 2022-06-25 18:25 ` Linus Torvalds 2022-06-26 1:58 ` Tejun Heo 2022-06-28 9:51 ` Petr Mladek 2 siblings, 1 reply; 52+ messages in thread From: Eric W. Biederman @ 2022-06-25 17:36 UTC (permalink / raw) To: Linus Torvalds Cc: Tejun Heo, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov Linus Torvalds <torvalds@linux-foundation.org> writes: > On Fri, Jun 24, 2022 at 10:00 PM Tejun Heo <tj@kernel.org> wrote: >> >> So, Petr debugged a NULL deref in workqueue code to a spurious wakeup >> on a newly created kthread. > > What? No. That patch can't be right for several reasons. > > What we call "spurious wakeups" exist, but they are about wakeups that > happen from being on a _previous_ wait-queue, and having already been > removed from it. > > They aren't "really" spurious, they are just asynchronous enough (and > thus unexpected) that you basically should never have a "sleep on > wait-queue" without then looping and re-testing the condition. > > There is no _truly_ spurious wakeup. You were always woken up for a > reason, it's just that there are more reasons than the entirely > obvious ones. > > For example, the reason that quoted patch cannot be right is that this > code pattern: > > while (wait_for_completion_interruptible(&worker->ready_to_start)) > ; > > is not valid kernel code. EVER. There is absolutely no way that can be correct. > > Either that code can take a signal, or it cannot. If it can take a > signal, it had better react to said signal. If it cannot, it must not > use an interruptble sleep - since now that loop turned into a > kernel-side busy-loop. > > So NAK on this kind of crazy "I don't know what happened, so I'll just > add *more* bugs to the code" voodoo programming. > > And no, we don't "fix" that by then adding a timeout. 
> > Stop this "add random code" thing. > > If you cannot be woken up before X happens, then you should: > > - don't go to sleep before X happens > > - don't add yourself to any wait-queues before X happens > > - don't expose your process to others before X happens > > The solution is *not* to add some completion with random "wait for > this before waking". > > I think the problem here is much more fundamental: you expect a new > thread to not wake up until you've told it to. > > We do have that infrastructure in the kernel: when you create a new > thread, you can do various setup, and the thread won't actually run > until you do "wake_up_new_task()" on it. > > However, that's not how kernel_thread() (or kernel_clone()) works. > Those will call wake_up_new_task(p) for you, and as such a new kernel > thread will immediately start running. > > So I think the expectations here are entirely wrong. I think > create_worker() is fundamentally buggy, in how it does that > > /* start the newly created worker */ > .. > wake_up_process(worker->task); > > because that wake_up_process() is already much too late. The process > got woken up already, because it was created by create_kthread() -> > kernel_thread() -> kernel_clone(), which does that wake_up_new_task() > and it starts running. > > If you want to do thread setup *before* the thread is running, it > needs to be done before that wake_up_new_task(). > > That's very possible. Look at what create_io_thread() does, for > example: it never calls wake_up_new_task() at all, and leaves that > to the caller, which has done the extra setup. > > Or the kernel_clone_args could have an "init" function that gets called > before the wake_up_new_task() is done. Or a number of other > solutions. > > But no, we're not randomly adding some new completion because people > were confused and thought they were waking things up when it was > already awake from before. Ugh. 
The use of kthread_bind_mask is buggy and kthread_bind_mask and kthread_bind look buggy by definition. As far as I know the kind of games that involve wait_task_inactive are only safe in special stop states which TASK_UNINTERRUPTIBLE is not. I don't know if we have any current code in the kernel that does this but I have seen code proposed for suspend/resume that would wake up everything not in a special stop state, and have every kernel task figure out if it should be sleeping or not after such a resume. This implies that kthread_park needs to be used for this logic to work at all. So the only legitimate user of the __kthread_bind and __kthread_bind_mask appears to be kthread_unpark. Let me suggest someone create a new variant of kthread_create that takes all of the parameters the workqueue code wants to set. Hopefully that will be enough that we can use it to remove the handful of buggy cases that call kthread_bind_mask and kthread_bind. Eric ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: re. Spurious wakeup on a newly created kthread 2022-06-25 17:36 ` Eric W. Biederman @ 2022-06-25 18:25 ` Linus Torvalds 2022-06-25 18:43 ` Linus Torvalds 2022-06-26 0:26 ` Eric W. Biederman 0 siblings, 2 replies; 52+ messages in thread From: Linus Torvalds @ 2022-06-25 18:25 UTC (permalink / raw) To: Eric W. Biederman Cc: Tejun Heo, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov On Sat, Jun 25, 2022 at 10:36 AM Eric W. Biederman <ebiederm@xmission.com> wrote: > > Let me suggest someone create a new variant of kthread_create that takes > all of the parameters the workqueue code wants to set. I suspect the real issue is that the kthread code simply shouldn't use the kernel_thread() helper at all. That helper is literally designed for "start a thread, run this thing". That's what it *does*. And that's not at all what the kthread code wants. It wants to set affinity masks, it wants to create a name for the thread, it wants to do all those other things. That code really wants to just do copy_process(). Or maybe it really should just use create_io_thread(), which has a much better interface, except it has that one oddity in that it sets the flag that does this: if (args->io_thread) { /* * Mark us an IO worker, and block any signal that isn't * fatal or STOP */ p->flags |= PF_IO_WORKER; siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP)); } that then has special semantics. Now, those special semantics may actually be exactly what kthreads wants, but they are important: /* * PF_IO_WORKER threads will catch and exit on fatal signals * themselves. They have cleanup that must be performed, so * we cannot call do_exit() on their behalf. */ which is about that "what happens for fatal signals" thing. Linus ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: re. Spurious wakeup on a newly created kthread 2022-06-25 18:25 ` Linus Torvalds @ 2022-06-25 18:43 ` Linus Torvalds 2022-06-25 23:28 ` Eric W. Biederman 2022-06-28 14:16 ` Christian Brauner 2022-06-26 0:26 ` Eric W. Biederman 1 sibling, 2 replies; 52+ messages in thread From: Linus Torvalds @ 2022-06-25 18:43 UTC (permalink / raw) To: Eric W. Biederman, Christian Brauner Cc: Tejun Heo, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov On Sat, Jun 25, 2022 at 11:25 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > And that's not at all what the kthread code wants. It wants to set > affinity masks, it wants to create a name for the thread, it wants to > do all those other things. > > That code really wants to just do copy_process(). Honestly, I think kernel/kthread.c should be almost rewritten from scratch. I do not understand why it does all those odd keventd games at all, and why kthread_create_info exists in the first place. Why does kthread_create() not just create the thread directly itself, and instead does that odd queue it onto a work function? Some of that goes back to before the git history, and very little of it seems to make any sense. It's as if the code is meant to be able to run from interrupt context, but that can't be it: it's literally doing a GFP_KERNEL kmalloc, it's doing spin-locks without irq safety etc. So why is it calling kthreadd_task() to create the thread? Purely for some crazy odd "make that the parent" reason? I dunno. The code is odd, unexplained, looks buggy, and most of the reasons are probably entirely historical. I'm adding Christian to this thread too, since I get the feeling that it really should be more tightly integrated with copy_process(), and that Christian might have comments. 
Christian, see some context in the thread here: https://lore.kernel.org/all/CAHk-=wiC7rj1o7vTnYUPfD7YxAu09MZiZbahHqvLm9+Cgg1dFw@mail.gmail.com/ for some of this. Linus ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: re. Spurious wakeup on a newly created kthread 2022-06-25 18:43 ` Linus Torvalds @ 2022-06-25 23:28 ` Eric W. Biederman 2022-06-25 23:41 ` Eric W. Biederman 2022-06-25 23:43 ` Linus Torvalds 2022-06-28 14:16 ` Christian Brauner 1 sibling, 2 replies; 52+ messages in thread From: Eric W. Biederman @ 2022-06-25 23:28 UTC (permalink / raw) To: Linus Torvalds Cc: Christian Brauner, Tejun Heo, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov Linus Torvalds <torvalds@linux-foundation.org> writes: > On Sat, Jun 25, 2022 at 11:25 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> And that's not at all what the kthread code wants. It wants to set >> affinity masks, it wants to create a name for the thread, it wants to >> do all those other things. >> >> That code really wants to just do copy_process(). > > Honestly, I think kernel/kthread.c should be almost rewritten from scratch. > > I do not understand why it does all those odd keventd games at all, > and why kthread_create_info exists in the first place. I presume you mean kthreadd games? > Why does kthread_create() not just create the thread directly itself, > and instead does that odd queue it onto a work function? > > Some of that goes back to before the git history, and very little of > it seems to make any sense. It's as if the code is meant to be able to > run from interrupt context, but that can't be it: it's literally doing > a GFP_KERNEL kmalloc, it's doing spin-locks without irq safety etc. > > So why is it calling kthreadd_task() to create the thread? Purely for > some crazy odd "make that the parent" reason? > > I dunno. The code is odd, unexplained, looks buggy, and most of the > reasons are probably entirely historical. I can explain why kthreadd exists and why it creates the threads. 
Very long ago in the context of random userspace processes people would use kernel_thread to create threads and a helper function that I think was called something like kernel_daemonize to scrub the userspace bits off. It was an unending source of problems as the scrub was never complete nor correct. So with the introduction of kthreadd the kernel threads were moved out of the userspace process tree, and userspace stopped being able to influence the kernel threads. AKA instead of doing the equivalent of a suid exec the code started doing the equivalent of sshing into the local box. We *need* to preserve that kind of separation. I want to say that all that is required is that copy_process copies from kthreadd. Unfortunately that means that it needs to be kthreadd doing the work, as copy_process always copies from current. It would take quite a bit of work to untangle that mess. It does appear possible to write a parallel function to copy_process that is used only for creating kernel threads, and can streamline itself because it knows it is creating kernel threads. Short of that the code needs to keep routing through kthreadd. Using create_io_thread or a dedicated wrapper around copy_process certainly looks like it could simplify some of kthread creation. > I'm adding Christian to this thread too, since I get the feeling that > it really should be more tightly integrated with copy_process(), and > that Christian might have comments. > > Christian, see some context in the thread here: > > https://lore.kernel.org/all/CAHk-=wiC7rj1o7vTnYUPfD7YxAu09MZiZbahHqvLm9+Cgg1dFw@mail.gmail.com/ > > for some of this. > > Linus Eric ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: re. Spurious wakeup on a newly created kthread 2022-06-25 23:28 ` Eric W. Biederman @ 2022-06-25 23:41 ` Eric W. Biederman 2022-06-25 23:43 ` Linus Torvalds 1 sibling, 0 replies; 52+ messages in thread From: Eric W. Biederman @ 2022-06-25 23:41 UTC (permalink / raw) To: Linus Torvalds Cc: Christian Brauner, Tejun Heo, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov "Eric W. Biederman" <ebiederm@xmission.com> writes: > Linus Torvalds <torvalds@linux-foundation.org> writes: > >> On Sat, Jun 25, 2022 at 11:25 AM Linus Torvalds >> <torvalds@linux-foundation.org> wrote: >>> >>> And that's not at all what the kthread code wants. It wants to set >>> affinity masks, it wants to create a name for the thread, it wants to >>> do all those other things. >>> >>> That code really wants to just do copy_process(). >> >> Honestly, I think kernel/kthread.c should be almost rewritten from scratch. >> >> I do not understand why it does all those odd keventd games at all, >> and why kthread_create_info exists in the first place. > > I presume you mean kthreadd games? > >> Why does kthread_create() not just create the thread directly itself, >> and instead does that odd queue it onto a work function? >> >> Some of that goes back to before the git history, and very little of >> it seems to make any sense. It's as if the code is meant to be able to >> run from interrupt context, but that can't be it: it's literally doing >> a GFP_KERNEL kmalloc, it's doing spin-locks without irq safety etc. >> >> So why is it calling kthreadd_task() to create the thread? Purely for >> some crazy odd "make that the parent" reason? >> >> I dunno. The code is odd, unexplained, looks buggy, and most of the >> reasons are probably entirely historical. > > I can explain why kthreadd exists and why it creates the threads. 
> > Very long ago in the context of random userspace processes people would > use kernel_thread to create threads and a helper function that I think > was called something like kernel_daemonize to scrub the userspace bits > off. > > It was an unending source of problems as the scrub was never complete > nor correct. > > So with the introduction of kthreadd the kernel threads were moved > out of the userspace process tree, and userspace stopped being able to > influence the kernel threads. > > AKA instead of doing the equivalent of a suid exec the code started > doing the equivalent of sshing into the local box. > > We *need* to preserve that kind of separation. > > I want to say that all that is required is that copy_process copies > from kthreadd. Unfortunately that means that it needs to be kthreadd > doing the work, as copy_process always copies from current. It > would take quite a bit of work to untangle that mess. > > It does appear possible to write a parallel function to copy_process > that is used only for creating kernel threads, and can streamline itself > because it knows it is creating kernel threads. > > Short of that the code needs to keep routing through kthreadd. > > Using create_io_thread or a dedicated wrapper around copy_process > certainly looks like it could simplify some of kthread creation. Hmm. Looking at kthread() I completely agree that kernel_thread() has the wrong set of semantics and we really could benefit from never waking the fledgling kernel thread in the first place. Eric ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: re. Spurious wakeup on a newly created kthread 2022-06-25 23:28 ` Eric W. Biederman 2022-06-25 23:41 ` Eric W. Biederman @ 2022-06-25 23:43 ` Linus Torvalds 2022-06-25 23:48 ` Linus Torvalds 2022-06-26 0:21 ` re. Spurious wakeup on a newly created kthread Eric W. Biederman 1 sibling, 2 replies; 52+ messages in thread From: Linus Torvalds @ 2022-06-25 23:43 UTC (permalink / raw) To: Eric W. Biederman Cc: Christian Brauner, Tejun Heo, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov On Sat, Jun 25, 2022 at 4:28 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > I presume you mean kthreadd games? Yeah, sorry. > So with the introduction of kthreadd the kernel threads were moved > out of the userspace process tree, and userspace stopped being able to > influence the kernel threads. Ahh. So essentially it's indeed just basically the parenting issue. > I want to say that all that is required is that copy_process copies > from kthreadd. Unfortunately that means that it needs to be kthreadd > doing the work, as copy_process always copies from current. It > would take quite a bit of work to untangle that mess. Hmm. Yeah, it would probably be painful to replace 'current' with something else, due to locking etc. Using "current" has some fundamental advantages wrt using another thread pointer that may or may not be active at the same time. That said, there *is* another thread that has very similar "we don't need locking, because it's always around and it doesn't change": init_task. I wonder if we could basically get rid of every use of 'current' in kernel/fork.c with just a task pointer that is passed in, and then for kernel threads pass in 'init_task'. I'd hate to duplicate copy_process() entirely, when we probably could just consider it just a parenting issue, and say that you can have either 'current' or 'init_task' as the parent. 
And that 'init_task' would basically be the 'clean slate for kthreads'. Christian, comments? Linus ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: re. Spurious wakeup on a newly created kthread 2022-06-25 23:43 ` Linus Torvalds @ 2022-06-25 23:48 ` Linus Torvalds 2022-06-26 0:19 ` Eric W. Biederman 2022-06-26 19:14 ` [PATCH 0/3] kthread: Stop using TASK_UNINTERRUPTIBLE Eric W. Biederman 2022-06-26 0:21 ` re. Spurious wakeup on a newly created kthread Eric W. Biederman 1 sibling, 2 replies; 52+ messages in thread From: Linus Torvalds @ 2022-06-25 23:48 UTC (permalink / raw) To: Eric W. Biederman Cc: Christian Brauner, Tejun Heo, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov On Sat, Jun 25, 2022 at 4:43 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > I wonder if we could basically get rid of every use of 'current' in > kernel/fork.c with just a task pointer that is passed in, and then for > kernel threads pass in 'init_task'. That might even help code generation. Instead of doing the 'look up current' all the time, just having the parent task as an argument might actually simplify things. We historically used to do those kinds of things exactly because it helps generate better code (particularly with inline functions, and things like 'exit' that calls many different things), but have mostly avoided it because 'current' may generate some small asm snippet all the time, but it clarifies the "it can't be a random thread" and locking issues. But if it's always 'current or init_task', we don't have many locking issues. Of course, there might be some reason not to want to use init_task because it is _so_ special, and so parenting to something silly like kthreadd ends up being simpler. But.. Anyway, the whole "don't wake up thread until it's all done" is a separate and independent issue from the "odd use of kthreadd" issue. Linus ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: re. Spurious wakeup on a newly created kthread 2022-06-25 23:48 ` Linus Torvalds @ 2022-06-26 0:19 ` Eric W. Biederman 2022-06-27 0:01 ` Wedson Almeida Filho 2022-06-26 19:14 ` [PATCH 0/3] kthread: Stop using TASK_UNINTERRUPTIBLE Eric W. Biederman 1 sibling, 1 reply; 52+ messages in thread From: Eric W. Biederman @ 2022-06-26 0:19 UTC (permalink / raw) To: Linus Torvalds Cc: Christian Brauner, Tejun Heo, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov Linus Torvalds <torvalds@linux-foundation.org> writes: > Anyway, the whole "don't wake up thread until it's all done" is a > separate and independent issue from the "odd use of kthreadd" issue. Yes. "don't wake up a kthread until it's all done" is a much easier change that will simplify things tremendously. Further it is necessary for Peter Zijlstra's rewrite of the kernel freezer. As anything that isn't a special stop state (which TASK_UNINTERRUPTIBLE is not) will receive a spurious wake up when thawed out. Eric ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: re. Spurious wakeup on a newly created kthread 2022-06-26 0:19 ` Eric W. Biederman @ 2022-06-27 0:01 ` Wedson Almeida Filho 2022-06-27 7:11 ` Peter Zijlstra 0 siblings, 1 reply; 52+ messages in thread From: Wedson Almeida Filho @ 2022-06-27 0:01 UTC (permalink / raw) To: Eric W. Biederman Cc: Linus Torvalds, Christian Brauner, Tejun Heo, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov On Sat, Jun 24, 2022 at 07:19:27PM -0500, Eric W. Biederman wrote: > > Further it is necessary for Peter Zijlstra's rewrite of the kernel > freezer. As anything that isn't a special stop state (which > TASK_UNINTERRUPTIBLE is not) will receive a spurious wake up when > thawed out. Do you know if the current (i.e., prior to the rewrite) kernel freezer also sends spurious wakeups when thawing tasks? I'm trying to understand if this source of spurious wakeups will be a new one or already exists (and might therefore explain what's described in the original email). ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: re. Spurious wakeup on a newly created kthread 2022-06-27 0:01 ` Wedson Almeida Filho @ 2022-06-27 7:11 ` Peter Zijlstra 2022-06-27 18:23 ` Wedson Almeida Filho 0 siblings, 1 reply; 52+ messages in thread From: Peter Zijlstra @ 2022-06-27 7:11 UTC (permalink / raw) To: Wedson Almeida Filho Cc: Eric W. Biederman, Linus Torvalds, Christian Brauner, Tejun Heo, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov On Mon, Jun 27, 2022 at 12:01:04AM +0000, Wedson Almeida Filho wrote: > On Sat, Jun 24, 2022 at 07:19:27PM -0500, Eric W. Biederman wrote: > > > > Further it is necessary for Peter Zijlstra's rewrite of the kernel > > freezer. As anything that isn't a special stop state (which > > TASK_UNINTERRUPTIBLE is not) will receive a spurious wake up when > > thawed out. > > Do you know if the current (i.e., prior to the rewrite) kernel freezer > also sends spurious wakeups when thawing tasks? Current freezer can thaw at random points in time, even before SMP bringup if you're unlucky. And yes, I think it can induce 'spurious' wakeups as well. But really; like Linus already said upstream, every wait loop *MUST* already be able to deal with spurious wakeups. This is why pretty much every wait primitive we have looks like: for (;;) { set_current_state(state); if (cond) break; schedule(); } __set_current_state(TASK_RUNNING); Which is immune to random wake-ups since it needs @cond to make progress. *NEVER* rely on just the wakeup itself for progress, that's buggy as heck in lots of ways. There are a few exceptions, but they all require special wait states and much carefulness. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: re. Spurious wakeup on a newly created kthread 2022-06-27 7:11 ` Peter Zijlstra @ 2022-06-27 18:23 ` Wedson Almeida Filho 2022-06-27 18:45 ` Linus Torvalds 0 siblings, 1 reply; 52+ messages in thread From: Wedson Almeida Filho @ 2022-06-27 18:23 UTC (permalink / raw) To: Peter Zijlstra Cc: Eric W. Biederman, Linus Torvalds, Christian Brauner, Tejun Heo, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov On Mon, Jun 27, 2022 at 09:11:10AM +0200, Peter Zijlstra wrote: > Current freezer can thaw at random points in time, even before SMP > bringup if you're unlucky. And yes, I think it can induce 'spurious' > wakeups as well. Thanks, Peter, that's good to know! > But really; like Linus already said upstream, every wait loop *MUST* > already be able to deal with spurious wakeups. This is why pretty much > every wait primitive we have looks like: > > for (;;) { > set_current_state(state); > if (cond) > break; > schedule(); > } > __set_current_state(TASK_RUNNING); > > Which is immune to random wake-ups since it needs @cond to make progress. > *NEVER* rely on just the wakeup itself for progress, that's buggy as > heck in lots of ways. Yes, I wonder how many more instances of this kind of bug we have lurking around given that this one in core kernel code appears to have been around for at least 17 years. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: re. Spurious wakeup on a newly created kthread 2022-06-27 18:23 ` Wedson Almeida Filho @ 2022-06-27 18:45 ` Linus Torvalds 0 siblings, 0 replies; 52+ messages in thread From: Linus Torvalds @ 2022-06-27 18:45 UTC (permalink / raw) To: Wedson Almeida Filho Cc: Peter Zijlstra, Eric W. Biederman, Christian Brauner, Tejun Heo, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov On Mon, Jun 27, 2022 at 11:23 AM Wedson Almeida Filho <wedsonaf@google.com> wrote: > > Yes, I wonder how many more instances of this kind of bug we have > lurking around given that this one in core kernel code appears to have > been around for at least 17 years. The good news is that this "explicit wait loop" pattern is actually starting to be pretty rare. New code seldom uses it, because "wait_event()" and friends are just *so* much easier to use, and will magically expand to that kind of loop properly. So the explicit loop with add_wait_queue and friends is very traditional and our original wait model (well, I think I did have "sleep_on()" *very* early, but I fixed that broken model quickly since it fundamentally doesn't work for multiple events), but it's not actually very common any more. And the "just call schedule, wait for wakeup" should be very rare indeed. It has never been a valid pattern, so that kthread code is just plain strange, and I wouldn't expect to see it commonly elsewhere. It's simply not how things have ever been supposed to be done. Linus ^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH 0/3] kthread: Stop using TASK_UNINTERRUPTIBLE 2022-06-25 23:48 ` Linus Torvalds 2022-06-26 0:19 ` Eric W. Biederman @ 2022-06-26 19:14 ` Eric W. Biederman 2022-06-26 19:15 ` [PATCH 1/3] kthread: Remove the flags argument from kernel_thread Eric W. Biederman ` (2 more replies) 1 sibling, 3 replies; 52+ messages in thread From: Eric W. Biederman @ 2022-06-26 19:14 UTC (permalink / raw) To: Linus Torvalds Cc: Christian Brauner, Tejun Heo, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov Being silly I figured I would poke my nose in and see how much work it is to never wake up kthreads until we are ready to use them. This is my first draft at that and something that can hopefully shape the conversation on how we want to fix things a little bit. The big thing that needs to happen that I haven't implemented is that kthread_run and kthread_run_on_cpu need to be uninlined and moved into kthread.c. This will allow them to call wake_up_new_task even from modular code. The handful of drivers that are using kthread_create_on_node by extension need to be modified to use kthread_run or kthread_run_on_cpu. Eric W. 
Biederman (3): kthread: Remove the flags argument from kernel_thread kthread: Replace kernel_thread with new_kthread kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE) arch/arm/common/bL_switcher.c | 2 +- arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 4 +- drivers/block/mtip32xx/mtip32xx.c | 2 +- drivers/firmware/psci/psci_checker.c | 2 +- drivers/firmware/stratix10-svc.c | 4 +- drivers/scsi/bnx2fc/bnx2fc_fcoe.c | 2 +- drivers/scsi/bnx2i/bnx2i_init.c | 2 +- drivers/scsi/qedi/qedi_main.c | 2 +- include/linux/kthread.h | 4 +- include/linux/sched/task.h | 2 +- init/main.c | 6 +- kernel/bpf/cpumap.c | 2 +- kernel/dma/map_benchmark.c | 2 +- kernel/fork.c | 5 +- kernel/kthread.c | 120 +++++++++++++++--------------- kernel/smpboot.c | 1 + kernel/workqueue.c | 2 +- net/core/pktgen.c | 2 +- net/sunrpc/svc.c | 2 +- 19 files changed, 82 insertions(+), 86 deletions(-) Eric ^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH 1/3] kthread: Remove the flags argument from kernel_thread 2022-06-26 19:14 ` [PATCH 0/3] kthread: Stop using TASK_UNINTERRUPTIBLE Eric W. Biederman @ 2022-06-26 19:15 ` Eric W. Biederman 2022-06-26 21:20 ` Linus Torvalds 2022-06-26 19:16 ` [PATCH 2/3] kthread: Replace kernel_thread with new_kthread Eric W. Biederman 2022-06-26 19:16 ` [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE) Eric W. Biederman 2 siblings, 1 reply; 52+ messages in thread From: Eric W. Biederman @ 2022-06-26 19:15 UTC (permalink / raw) To: Linus Torvalds Cc: Christian Brauner, Tejun Heo, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov There are only two callers of kernel_thread remaining. The caller in init/main.c that creates kthreadd, and the caller in kernel/kthread.c. Both callers pass CLONE_FS|CLONE_FILES. The argument SIGCHLD causes the thread to exit with the ordinary process SIGCHLD semantics. As kthreadd never exists it simply does not matter what kind of exit it has. So for simplicity make it look like everything else and use SIGCHLD. Signed-off-by: "Eric W. 
Biederman" <ebiederm@xmission.com> --- include/linux/sched/task.h | 2 +- init/main.c | 2 +- kernel/fork.c | 3 ++- kernel/kthread.c | 2 +- 4 files changed, 5 insertions(+), 4 deletions(-) diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index 505aaf9fe477..d95930e220da 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -91,7 +91,7 @@ extern pid_t kernel_clone(struct kernel_clone_args *kargs); struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node); struct task_struct *fork_idle(int); struct mm_struct *copy_init_mm(void); -extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags); +extern pid_t kernel_thread(int (*fn)(void *), void *arg); extern pid_t user_mode_thread(int (*fn)(void *), void *arg, unsigned long flags); extern long kernel_wait4(pid_t, int __user *, int, struct rusage *); int kernel_wait(pid_t pid, int *stat); diff --git a/init/main.c b/init/main.c index 0ee39cdcfcac..211d38db0d16 100644 --- a/init/main.c +++ b/init/main.c @@ -701,7 +701,7 @@ noinline void __ref rest_init(void) rcu_read_unlock(); numa_default_policy(); - pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES); + pid = kernel_thread(kthreadd, NULL); rcu_read_lock(); kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns); rcu_read_unlock(); diff --git a/kernel/fork.c b/kernel/fork.c index 9d44f2d46c69..65909ded0ea7 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2694,8 +2694,9 @@ pid_t kernel_clone(struct kernel_clone_args *args) /* * Create a kernel thread. 
*/ -pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags) +pid_t kernel_thread(int (*fn)(void *), void *arg) { + unsigned long flags = CLONE_FS | CLONE_FILES | SIGCHLD; struct kernel_clone_args args = { .flags = ((lower_32_bits(flags) | CLONE_VM | CLONE_UNTRACED) & ~CSIGNAL), diff --git a/kernel/kthread.c b/kernel/kthread.c index 544fd4097406..c0505e6b7142 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -396,7 +396,7 @@ static void create_kthread(struct kthread_create_info *create) current->pref_node_fork = create->node; #endif /* We want our own signal handler (we take no signals by default). */ - pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD); + pid = kernel_thread(kthread, create); if (pid < 0) { /* If user was SIGKILLed, I release the structure. */ struct completion *done = xchg(&create->done, NULL); -- 2.35.3 ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH 1/3] kthread: Remove the flags argument from kernel_thread 2022-06-26 19:15 ` [PATCH 1/3] kthread: Remove the flags argument from kernel_thread Eric W. Biederman @ 2022-06-26 21:20 ` Linus Torvalds 0 siblings, 0 replies; 52+ messages in thread From: Linus Torvalds @ 2022-06-26 21:20 UTC (permalink / raw) To: Eric W. Biederman Cc: Christian Brauner, Tejun Heo, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov On Sun, Jun 26, 2022 at 12:15 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > As kthreadd never exists it simply does not matter what kind of exit > it has. So for simplicity make it look like everything else and use > SIGCHLD. That "never exists" should be "never exits" ;) But: > +pid_t kernel_thread(int (*fn)(void *), void *arg) > { > + unsigned long flags = CLONE_FS | CLONE_FILES | SIGCHLD; > struct kernel_clone_args args = { > .flags = ((lower_32_bits(flags) | CLONE_VM | > CLONE_UNTRACED) & ~CSIGNAL), Please just get rid of that 'flags' thing, and the lower_32_bits() games and write this all as pid_t kernel_thread(int (*fn)(void *), void *arg) { struct kernel_clone_args args = { .flags = CLONE_FS | CLONE_FILES | CLONE_VM | CLONE_UNTRACED, .exit_signal = SIGCHLD, ... which does that whole thing much more clearly. Linus ^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH 2/3] kthread: Replace kernel_thread with new_kthread 2022-06-26 19:14 ` [PATCH 0/3] kthread: Stop using TASK_UNINTERRUPTIBLE Eric W. Biederman 2022-06-26 19:15 ` [PATCH 1/3] kthread: Remove the flags argument from kernel_thread Eric W. Biederman @ 2022-06-26 19:16 ` Eric W. Biederman 2022-06-26 19:16 ` [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE) Eric W. Biederman 2 siblings, 0 replies; 52+ messages in thread From: Eric W. Biederman @ 2022-06-26 19:16 UTC (permalink / raw) To: Linus Torvalds Cc: Christian Brauner, Tejun Heo, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov It is desirable to be able to perform all of the kthread setup before the kernel thread is woken for the first time. To make that possible, replace kernel_thread with new_kthread, which does all of the same work except that it does not call wake_up_new_task. Replace the two uses of kernel_thread with new_kthread and a call to wake_up_new_task. Signed-off-by: "Eric W.
Biederman" <ebiederm@xmission.com> --- include/linux/sched/task.h | 2 +- init/main.c | 6 ++---- kernel/fork.c | 4 ++-- kernel/kthread.c | 10 ++++++---- 4 files changed, 11 insertions(+), 11 deletions(-) diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index d95930e220da..c4c7a0118553 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -91,7 +91,7 @@ extern pid_t kernel_clone(struct kernel_clone_args *kargs); struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node); struct task_struct *fork_idle(int); struct mm_struct *copy_init_mm(void); -extern pid_t kernel_thread(int (*fn)(void *), void *arg); +extern struct task_struct *new_kthread(int (*fn)(void *), void *arg, int node); extern pid_t user_mode_thread(int (*fn)(void *), void *arg, unsigned long flags); extern long kernel_wait4(pid_t, int __user *, int, struct rusage *); int kernel_wait(pid_t pid, int *stat); diff --git a/init/main.c b/init/main.c index 211d38db0d16..b437581f8001 100644 --- a/init/main.c +++ b/init/main.c @@ -701,10 +701,8 @@ noinline void __ref rest_init(void) rcu_read_unlock(); numa_default_policy(); - pid = kernel_thread(kthreadd, NULL); - rcu_read_lock(); - kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns); - rcu_read_unlock(); + kthreadd_task = new_kthread(kthreadd, NULL, NUMA_NO_NODE); + wake_up_new_task(kthreadd_task); /* * Enable might_sleep() and smp_processor_id() checks. diff --git a/kernel/fork.c b/kernel/fork.c index 65909ded0ea7..794d9f9c78bc 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2694,7 +2694,7 @@ pid_t kernel_clone(struct kernel_clone_args *args) /* * Create a kernel thread. 
*/ -pid_t kernel_thread(int (*fn)(void *), void *arg) +struct task_struct *new_kthread(int (*fn)(void *), void *arg, int node) { unsigned long flags = CLONE_FS | CLONE_FILES | SIGCHLD; struct kernel_clone_args args = { @@ -2706,7 +2706,7 @@ pid_t kernel_thread(int (*fn)(void *), void *arg) .kthread = 1, }; - return kernel_clone(&args); + return copy_process(NULL, 0, node, &args); } /* diff --git a/kernel/kthread.c b/kernel/kthread.c index c0505e6b7142..8529f6b1582b 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -390,14 +390,14 @@ int tsk_fork_get_node(struct task_struct *tsk) static void create_kthread(struct kthread_create_info *create) { - int pid; + struct task_struct *new; #ifdef CONFIG_NUMA current->pref_node_fork = create->node; #endif /* We want our own signal handler (we take no signals by default). */ - pid = kernel_thread(kthread, create); - if (pid < 0) { + new = new_kthread(kthread, create, NUMA_NO_NODE); + if (IS_ERR(new)) { /* If user was SIGKILLed, I release the structure. */ struct completion *done = xchg(&create->done, NULL); @@ -405,8 +405,10 @@ static void create_kthread(struct kthread_create_info *create) kfree(create); return; } - create->result = ERR_PTR(pid); + create->result = ERR_CAST(new); complete(done); + } else { + wake_up_new_task(new); } } -- 2.35.3 ^ permalink raw reply related [flat|nested] 52+ messages in thread
* [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE) 2022-06-26 19:14 ` [PATCH 0/3] kthread: Stop using TASK_UNINTERRUPTIBLE Eric W. Biederman 2022-06-26 19:15 ` [PATCH 1/3] kthread: Remove the flags argument from kernel_thread Eric W. Biederman 2022-06-26 19:16 ` [PATCH 2/3] kthread: Replace kernel_thread with new_kthread Eric W. Biederman @ 2022-06-26 19:16 ` Eric W. Biederman 2022-06-26 19:59 ` Linus Torvalds ` (2 more replies) 2 siblings, 3 replies; 52+ messages in thread From: Eric W. Biederman @ 2022-06-26 19:16 UTC (permalink / raw) To: Linus Torvalds Cc: Christian Brauner, Tejun Heo, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov Instead leave the task as a new unscheduled task and require the caller to call wake_up_new_task. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> --- arch/arm/common/bL_switcher.c | 2 +- arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 4 +- drivers/block/mtip32xx/mtip32xx.c | 2 +- drivers/firmware/psci/psci_checker.c | 2 +- drivers/firmware/stratix10-svc.c | 4 +- drivers/scsi/bnx2fc/bnx2fc_fcoe.c | 2 +- drivers/scsi/bnx2i/bnx2i_init.c | 2 +- drivers/scsi/qedi/qedi_main.c | 2 +- include/linux/kthread.h | 4 +- kernel/bpf/cpumap.c | 2 +- kernel/dma/map_benchmark.c | 2 +- kernel/kthread.c | 114 ++++++++++------------ kernel/smpboot.c | 1 + kernel/workqueue.c | 2 +- net/core/pktgen.c | 2 +- net/sunrpc/svc.c | 2 +- 16 files changed, 72 insertions(+), 77 deletions(-) diff --git a/arch/arm/common/bL_switcher.c b/arch/arm/common/bL_switcher.c index 9a9aa53547a6..15c9841c3c15 100644 --- a/arch/arm/common/bL_switcher.c +++ b/arch/arm/common/bL_switcher.c @@ -311,7 +311,7 @@ static struct task_struct *bL_switcher_thread_create(int cpu, void *arg) cpu_to_node(cpu), "kswitcher_%d", cpu); if (!IS_ERR(task)) { kthread_bind(task, cpu); - wake_up_process(task); + wake_up_new_task(task); } else pr_err("%s failed for CPU %d\n", 
__func__, cpu); return task; diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c index db813f819ad6..ba09f5cf1773 100644 --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c @@ -1206,7 +1206,7 @@ static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel) goto out; } kthread_bind(thread, cpu); - wake_up_process(thread); + wake_up_new_task(thread); ret = wait_event_interruptible(plr->lock_thread_wq, plr->thread_done == 1); @@ -1304,7 +1304,7 @@ int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp) } kthread_bind(thread, plr->cpu); - wake_up_process(thread); + wake_up_new_task(thread); ret = wait_event_interruptible(plr->lock_thread_wq, plr->thread_done == 1); diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c index 27386a572ba4..b2f5b30a27aa 100644 --- a/drivers/block/mtip32xx/mtip32xx.c +++ b/drivers/block/mtip32xx/mtip32xx.c @@ -3649,7 +3649,7 @@ static int mtip_block_initialize(struct driver_data *dd) rv = -EFAULT; goto kthread_run_error; } - wake_up_process(dd->mtip_svc_handler); + wake_up_new_task(dd->mtip_svc_handler); if (wait_for_rebuild == MTIP_FTL_REBUILD_MAGIC) rv = wait_for_rebuild; diff --git a/drivers/firmware/psci/psci_checker.c b/drivers/firmware/psci/psci_checker.c index 116eb465cdb4..52fcd122e2b6 100644 --- a/drivers/firmware/psci/psci_checker.c +++ b/drivers/firmware/psci/psci_checker.c @@ -418,7 +418,7 @@ static int suspend_tests(void) * wait for the completion of suspend_threads_started. 
*/ for (i = 0; i < nb_threads; ++i) - wake_up_process(threads[i]); + wake_up_new_task(threads[i]); complete_all(&suspend_threads_started); wait_for_completion(&suspend_threads_done); diff --git a/drivers/firmware/stratix10-svc.c b/drivers/firmware/stratix10-svc.c index 14663f671323..64d07aaa68bf 100644 --- a/drivers/firmware/stratix10-svc.c +++ b/drivers/firmware/stratix10-svc.c @@ -581,7 +581,7 @@ static int svc_get_sh_memory(struct platform_device *pdev, return -EINVAL; } - wake_up_process(sh_memory_task); + wake_up_new_task(sh_memory_task); if (!wait_for_completion_timeout(&sh_memory->sync_complete, 10 * HZ)) { dev_err(dev, @@ -834,7 +834,7 @@ int stratix10_svc_send(struct stratix10_svc_chan *chan, void *msg) return -EINVAL; } kthread_bind(chan->ctrl->task, cpu); - wake_up_process(chan->ctrl->task); + wake_up_new_task(chan->ctrl->task); } pr_debug("%s: sent P-va=%p, P-com=%x, P-size=%u\n", __func__, diff --git a/drivers/scsi/bnx2fc/bnx2fc_fcoe.c b/drivers/scsi/bnx2fc/bnx2fc_fcoe.c index 05ddbb9bb7d8..1b3af0e01d67 100644 --- a/drivers/scsi/bnx2fc/bnx2fc_fcoe.c +++ b/drivers/scsi/bnx2fc/bnx2fc_fcoe.c @@ -2622,7 +2622,7 @@ static int bnx2fc_cpu_online(unsigned int cpu) /* bind thread to the cpu */ kthread_bind(thread, cpu); p->iothread = thread; - wake_up_process(thread); + wake_up_new_task(thread); return 0; } diff --git a/drivers/scsi/bnx2i/bnx2i_init.c b/drivers/scsi/bnx2i/bnx2i_init.c index 2b3f0c10478e..cd4976bb86fc 100644 --- a/drivers/scsi/bnx2i/bnx2i_init.c +++ b/drivers/scsi/bnx2i/bnx2i_init.c @@ -424,7 +424,7 @@ static int bnx2i_cpu_online(unsigned int cpu) /* bind thread to the cpu */ kthread_bind(thread, cpu); p->iothread = thread; - wake_up_process(thread); + wake_up_new_task(thread); return 0; } diff --git a/drivers/scsi/qedi/qedi_main.c b/drivers/scsi/qedi/qedi_main.c index 83ffba7f51da..97ced4f12d6e 100644 --- a/drivers/scsi/qedi/qedi_main.c +++ b/drivers/scsi/qedi/qedi_main.c @@ -1967,7 +1967,7 @@ static int qedi_cpu_online(unsigned int cpu) 
kthread_bind(thread, cpu); p->iothread = thread; - wake_up_process(thread); + wake_up_new_task(thread); return 0; } diff --git a/include/linux/kthread.h b/include/linux/kthread.h index 30e5bec81d2b..fa5e24df84bf 100644 --- a/include/linux/kthread.h +++ b/include/linux/kthread.h @@ -53,7 +53,7 @@ bool kthread_is_per_cpu(struct task_struct *k); struct task_struct *__k \ = kthread_create(threadfn, data, namefmt, ## __VA_ARGS__); \ if (!IS_ERR(__k)) \ - wake_up_process(__k); \ + wake_up_new_task(__k); \ __k; \ }) @@ -77,7 +77,7 @@ kthread_run_on_cpu(int (*threadfn)(void *data), void *data, p = kthread_create_on_cpu(threadfn, data, cpu, namefmt); if (!IS_ERR(p)) - wake_up_process(p); + wake_up_new_task(p); return p; } diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c index f4860ac756cd..e72bbb766f01 100644 --- a/kernel/bpf/cpumap.c +++ b/kernel/bpf/cpumap.c @@ -475,7 +475,7 @@ __cpu_map_entry_alloc(struct bpf_map *map, struct bpf_cpumap_val *value, /* Make sure kthread runs on a single CPU */ kthread_bind(rcpu->kthread, cpu); - wake_up_process(rcpu->kthread); + wake_up_new_task(rcpu->kthread); return rcpu; diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c index 0520a8f4fb1d..b28e33c07c92 100644 --- a/kernel/dma/map_benchmark.c +++ b/kernel/dma/map_benchmark.c @@ -134,7 +134,7 @@ static int do_map_benchmark(struct map_benchmark_data *map) for (i = 0; i < threads; i++) { get_task_struct(tsk[i]); - wake_up_process(tsk[i]); + wake_up_new_task(tsk[i]); } msleep_interruptible(map->bparam.seconds * 1000); diff --git a/kernel/kthread.c b/kernel/kthread.c index 8529f6b1582b..1769b5118694 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -329,51 +329,12 @@ EXPORT_SYMBOL(kthread_complete_and_exit); static int kthread(void *_create) { - static const struct sched_param param = { .sched_priority = 0 }; - /* Copy data: it's on kthread's stack */ - struct kthread_create_info *create = _create; - int (*threadfn)(void *data) = create->threadfn; - void *data 
= create->data; - struct completion *done; - struct kthread *self; - int ret; - - self = to_kthread(current); + struct kthread *self = to_kthread(current); + int ret = -EINTR; - /* If user was SIGKILLed, I release the structure. */ - done = xchg(&create->done, NULL); - if (!done) { - kfree(create); - kthread_exit(-EINTR); - } - - self->threadfn = threadfn; - self->data = data; - - /* - * The new thread inherited kthreadd's priority and CPU mask. Reset - * back to default in case they have been changed. - */ - sched_setscheduler_nocheck(current, SCHED_NORMAL, &param); - set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_TYPE_KTHREAD)); - - /* OK, tell user we're spawned, wait for stop or wakeup */ - __set_current_state(TASK_UNINTERRUPTIBLE); - create->result = current; - /* - * Thread is going to call schedule(), do not preempt it, - * or the creator may spend more time in wait_task_inactive(). - */ - preempt_disable(); - complete(done); - schedule_preempt_disabled(); - preempt_enable(); - - ret = -EINTR; if (!test_bit(KTHREAD_SHOULD_STOP, &self->flags)) { cgroup_kthread_ready(); - __kthread_parkme(self); - ret = threadfn(data); + ret = self->threadfn(self->data); } kthread_exit(ret); } @@ -391,25 +352,41 @@ int tsk_fork_get_node(struct task_struct *tsk) static void create_kthread(struct kthread_create_info *create) { struct task_struct *new; + struct completion *done; #ifdef CONFIG_NUMA current->pref_node_fork = create->node; #endif /* We want our own signal handler (we take no signals by default). */ new = new_kthread(kthread, create, NUMA_NO_NODE); + create->result = new; + /* Claim the completion */ + done = xchg(&create->done, NULL); if (IS_ERR(new)) { - /* If user was SIGKILLed, I release the structure.
*/ - struct completion *done = xchg(&create->done, NULL); + if (done) + complete(done); + } else if (done) { + static const struct sched_param param = { .sched_priority = 0 }; + struct kthread *kthread = to_kthread(new); - if (!done) { - kfree(create); - return; - } - create->result = ERR_CAST(new); - complete(done); - } else { - wake_up_new_task(new); + kthread->threadfn = create->threadfn; + kthread->data = create->data; + + /* + * The new thread inherited kthreadd's priority and CPU mask. Reset + * back to default in case they have been changed. + */ + sched_setscheduler_nocheck(new, SCHED_NORMAL, &param); + set_cpus_allowed_ptr(new, housekeeping_cpumask(HK_TYPE_KTHREAD)); + + /* OK, tell user we're spawned, wait for stop or wakeup */ + //wake_up_new_task(new); } + /* If user was SIGKILLed, release the structure. */ + if (!done) + kfree(create); + else + complete(done); } static __printf(4, 0) @@ -518,11 +495,11 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data), } EXPORT_SYMBOL(kthread_create_on_node); -static void __kthread_bind_mask(struct task_struct *p, const struct cpumask *mask, unsigned int state) +static void kthread_bind_mask_parked(struct task_struct *p, const struct cpumask *mask) { unsigned long flags; - if (!wait_task_inactive(p, state)) { + if (!wait_task_inactive(p, TASK_PARKED)) { WARN_ON(1); return; } @@ -534,14 +511,31 @@ static void __kthread_bind_mask(struct task_struct *p, const struct cpumask *mas raw_spin_unlock_irqrestore(&p->pi_lock, flags); } -static void __kthread_bind(struct task_struct *p, unsigned int cpu, unsigned int state) +static void kthread_bind_mask_new(struct task_struct *new, const struct cpumask *mask) +{ + unsigned long flags; + + /* + * FIXME: verify that p is a new task that + * has not yet been passed through + * wake_up_new_task + */ + + /* It's safe because new has never been scheduled.
*/ + raw_spin_lock_irqsave(&new->pi_lock, flags); + do_set_cpus_allowed(new, mask); + new->flags |= PF_NO_SETAFFINITY; + raw_spin_unlock_irqrestore(&new->pi_lock, flags); +} + +static void __kthread_bind(struct task_struct *p, unsigned int cpu) { - __kthread_bind_mask(p, cpumask_of(cpu), state); + kthread_bind_mask_new(p, cpumask_of(cpu)); } void kthread_bind_mask(struct task_struct *p, const struct cpumask *mask) { - __kthread_bind_mask(p, mask, TASK_UNINTERRUPTIBLE); + kthread_bind_mask_new(p, mask); } /** @@ -555,7 +549,7 @@ void kthread_bind_mask(struct task_struct *p, const struct cpumask *mask) */ void kthread_bind(struct task_struct *p, unsigned int cpu) { - __kthread_bind(p, cpu, TASK_UNINTERRUPTIBLE); + __kthread_bind(p, cpu); } EXPORT_SYMBOL(kthread_bind); @@ -629,7 +623,7 @@ void kthread_unpark(struct task_struct *k) * The binding was lost and we need to set it again. */ if (test_bit(KTHREAD_IS_PER_CPU, &kthread->flags)) - __kthread_bind(k, kthread->cpu, TASK_PARKED); + kthread_bind_mask_parked(k, cpumask_of(kthread->cpu)); clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags); /* @@ -863,7 +857,7 @@ __kthread_create_worker(int cpu, unsigned int flags, worker->flags = flags; worker->task = task; - wake_up_process(task); + wake_up_new_task(task); return worker; fail_task: diff --git a/kernel/smpboot.c b/kernel/smpboot.c index b9f54544e749..79b8d05164d6 100644 --- a/kernel/smpboot.c +++ b/kernel/smpboot.c @@ -192,6 +192,7 @@ __smpboot_create_thread(struct smp_hotplug_thread *ht, unsigned int cpu) * Park the thread so that it could start right on the CPU * when it is available. 
*/ + wake_up_new_task(tsk); kthread_park(tsk); get_task_struct(tsk); *per_cpu_ptr(ht->store, cpu) = tsk; diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 1ea50f6be843..2d16a933f452 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1961,7 +1961,7 @@ static struct worker *create_worker(struct worker_pool *pool) raw_spin_lock_irq(&pool->lock); worker->pool->nr_workers++; worker_enter_idle(worker); - wake_up_process(worker->task); + wake_up_new_task(worker->task); raw_spin_unlock_irq(&pool->lock); return worker; diff --git a/net/core/pktgen.c b/net/core/pktgen.c index 84b62cd7bc57..40efd9b678fa 100644 --- a/net/core/pktgen.c +++ b/net/core/pktgen.c @@ -3864,7 +3864,7 @@ static int __net_init pktgen_create_thread(int cpu, struct pktgen_net *pn) t->net = pn; get_task_struct(p); - wake_up_process(p); + wake_up_new_task(p); wait_for_completion(&t->start_done); return 0; diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c index 7c9a0d0b1230..addbba323b9c 100644 --- a/net/sunrpc/svc.c +++ b/net/sunrpc/svc.c @@ -769,7 +769,7 @@ svc_start_kthreads(struct svc_serv *serv, struct svc_pool *pool, int nrservs) svc_pool_map_set_cpumask(task, chosen_pool->sp_id); svc_sock_update_bufs(serv); - wake_up_process(task); + wake_up_new_task(task); } while (nrservs > 0); return 0; -- 2.35.3 ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE) 2022-06-26 19:16 ` [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE) Eric W. Biederman @ 2022-06-26 19:59 ` Linus Torvalds 2022-06-26 20:23 ` Tejun Heo 2022-06-26 22:14 ` kernel test robot 2022-06-26 22:34 ` kernel test robot 2 siblings, 1 reply; 52+ messages in thread From: Linus Torvalds @ 2022-06-26 19:59 UTC (permalink / raw) To: Eric W. Biederman Cc: Christian Brauner, Tejun Heo, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov On Sun, Jun 26, 2022 at 12:16 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > Instead leave the task as a new unscheduled task and require the > caller to call wake_up_new_task. So I think this is somewhat error-prone, and we should probably abstract things out a bit more. Almost every single case that does this does this for one single reason: it wants to run setup code before the new kthread is actually activated. And I think *that* should be the change - add a "setup()" function pointer to the whole kthread infrastructure. Allow it to return an error, which will then just kill the new thread again without ever even starting it up. I'd really prefer to avoid having random drivers and subsystems know about the *very* magical "wake_up_new_task()" thing. Yes, it's a real thing, but it's a thing that normal code should not ever use. The whole "wake_up_process()" model for kthread creation was wrong. But moving existing users of a bad interface to using the even more special "wake_up_new_task()" thing is not the solution, I feel. Also, since you're very much in this area - can you also please look into getting rid of that horrible 'done' completion pointer entirely? The xchg games on that thing are horrendous and very non-intuitive. Linus ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE) 2022-06-26 19:59 ` Linus Torvalds @ 2022-06-26 20:23 ` Tejun Heo 2022-06-26 20:55 ` Linus Torvalds 2022-06-27 7:22 ` Peter Zijlstra 0 siblings, 2 replies; 52+ messages in thread From: Tejun Heo @ 2022-06-26 20:23 UTC (permalink / raw) To: Linus Torvalds Cc: Eric W. Biederman, Christian Brauner, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov Hello, On Sun, Jun 26, 2022 at 12:59:09PM -0700, Linus Torvalds wrote: > And I think *that* should be the change - add a "setup()" function > pointer to the whole kthread infrastructure. Allow it to return an > error, which will then just kill the new thread again without ever > even starting it up. > > I'd really prefer to avoid having random drivers and subsystems know > about the *very* magical "wake_up_new_task()" thing. Yes, it's a real > thing, but it's a thing that normal code should not ever use. > > The whole "wake_up_process()" model for kthread creation was wrong. > But moving existing users of a bad interface to using the even more > special "wake_up_new_task()" thing is not the solution, I feel. This is a bit of bike-shedding but there are inherent downsides to a callback-based interface in terms of write/readability. Now you have to move the init code out of line, and if the context that needs to be passed doesn't fit in a single pointer, you gotta define a struct to carry it, adding to the boilerplate. Having separate create and run steps makes sense as a pattern and wake_up_new_task() is less error prone in one sense - if the caller doesn't call it, the new kthread actually won't start running making the bug obvious. The fact that the function blindly trusts the caller to be handing in an actual new task is scary and we likely don't wanna proliferate its use across the code base.
Would it be a reasonable tradeoff to have a kthread wrapper - kthread_start() or whatever - which ensures that it is actually called once on a new task? That way, we can keep the init code inline and bugs on both sides (not starting and starting incorrectly) are obvious. Thanks. -- tejun ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE) 2022-06-26 20:23 ` Tejun Heo @ 2022-06-26 20:55 ` Linus Torvalds 2022-06-27 7:22 ` Peter Zijlstra 1 sibling, 0 replies; 52+ messages in thread From: Linus Torvalds @ 2022-06-26 20:55 UTC (permalink / raw) To: Tejun Heo Cc: Eric W. Biederman, Christian Brauner, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov On Sun, Jun 26, 2022 at 1:23 PM Tejun Heo <tj@kernel.org> wrote: > > Would it be a reasonable tradeoff to have a kthread wrapper - > kthread_start() or whatever - which ensures that it is actually called > once on a new task? That way, we can keep the init code inline and > bugs on both sides (not starting and starting incorrectly) are > obvious. Yeah, that sounds reasonable to me. Linus ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE) 2022-06-26 20:23 ` Tejun Heo 2022-06-26 20:55 ` Linus Torvalds @ 2022-06-27 7:22 ` Peter Zijlstra 2022-06-27 8:11 ` Tejun Heo 1 sibling, 1 reply; 52+ messages in thread From: Peter Zijlstra @ 2022-06-27 7:22 UTC (permalink / raw) To: Tejun Heo Cc: Linus Torvalds, Eric W. Biederman, Christian Brauner, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov Bit off a tangent.. On Mon, Jun 27, 2022 at 05:23:00AM +0900, Tejun Heo wrote: > This is a bit of bike-shedding but there are inherent downsides to > callback based interface in terms of write/readability. Now you have > to move the init code out of line, and if the context that needs to be > passing doesn't fit in a single pointer, you gotta define a struct to > carry it adding to the boilerplate. Yes, I so wish C had reasonable lambda expressions :/ But even if we could come up with a semi sensible syntax and implementation, it would still be many *many* years before we could actually use it in-kernel :-( ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE) 2022-06-27 7:22 ` Peter Zijlstra @ 2022-06-27 8:11 ` Tejun Heo 2022-06-27 18:04 ` Wedson Almeida Filho 0 siblings, 1 reply; 52+ messages in thread From: Tejun Heo @ 2022-06-27 8:11 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Eric W. Biederman, Christian Brauner, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov On Mon, Jun 27, 2022 at 09:22:37AM +0200, Peter Zijlstra wrote: > Yes, I so wish C had reasonable lambda expressions :/ But even if we > could come up with a semi sensible syntax and implementation, it would > still be many *many* years before we could actually use it in-kernel :-( Yeah, I have a hard time imagining this happening in C but maybe we'll get pretty good closure support through rust-in-kernel if that works out. That'd be pretty sweet even though we might not be able to use it everywhere. Thanks. -- tejun ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE) 2022-06-27 8:11 ` Tejun Heo @ 2022-06-27 18:04 ` Wedson Almeida Filho 2022-06-27 22:06 ` Peter Zijlstra 0 siblings, 1 reply; 52+ messages in thread From: Wedson Almeida Filho @ 2022-06-27 18:04 UTC (permalink / raw) To: Tejun Heo Cc: Peter Zijlstra, Linus Torvalds, Eric W. Biederman, Christian Brauner, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov On Mon, Jun 27, 2022 at 05:11:36PM +0900, Tejun Heo wrote: > Yeah, I have a hard time imagining this happening in C but maybe we'll > get pretty good closure support through rust-in-kernel if that works > out. That'd be pretty sweet even though we might not be able to use it > everywhere. While Rust does support closures and it would work just fine here, I think in this case its type system allows for better ergonomics and flexibility without them, for example: // the pr_info! part is a closure for the body of the thread. Could // also be replaced with a function. let new_thread = task::new_paused(|| pr_info!("Hello world\n"))?; // Do whatever initialisation one wants to do using new_thread. Only // functions that _can_ be used on a new kthread would be available // (e.g., wake_up_process() wouldn't). new_thread.start(); // new_thread isn't accessible anymore. The compiler fails compilation // if one attempts to use it again, for example, to call start() // again. The type returned by task::new_paused() wouldn't be copyable, so we can guarantee that start() is called at most once. It would have a Drop implementation (destructor) that puts the task, which means that we could use the question mark operator for error handling between new_paused() & start() (or really any kind of early-return technique) and all error paths would properly clean the new task up without any goto mess.
It also means that if one forgets to call start(), not only will the thread never start, it will also be freed (i.e., no leaks). If the caller wants to keep a reference to the task, they would do something like the following (instead of calling new_thread.start()): let task = new_thread.start_and_get(); Then `task` could be used as any task. For example, wake_up() would be available, but not wake_up_new_task(). It also has automatic handling of refcounting such that we are guaranteed to never have a dangling pointer to the task. Lastly, all the checks I mentioned above happen at compile time, so there is absolutely zero cost at runtime. Anyway, sorry for the digression. I thought this would be a good opportunity to talk about some of the possibilities in API design and enforcement that Rust affords us since this kind of design was the topic under discussion and Rust was brought up by someone else :) Cheers, -Wedson ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE) 2022-06-27 18:04 ` Wedson Almeida Filho @ 2022-06-27 22:06 ` Peter Zijlstra 2022-06-27 22:34 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 52+ messages in thread From: Peter Zijlstra @ 2022-06-27 22:06 UTC (permalink / raw) To: Wedson Almeida Filho Cc: Tejun Heo, Linus Torvalds, Eric W. Biederman, Christian Brauner, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov On Mon, Jun 27, 2022 at 06:04:12PM +0000, Wedson Almeida Filho wrote: > let new_thread = task::new_paused(|| pr_info!("Hello world\n"))?; I'm still having a really hard time with this Rust stuff, the above looks like a syntax error and random characters to me :/ ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE) 2022-06-27 22:06 ` Peter Zijlstra @ 2022-06-27 22:34 ` Linus Torvalds 2022-06-27 22:45 ` Wedson Almeida Filho 2022-06-28 0:32 ` Wedson Almeida Filho 2 siblings, 0 replies; 52+ messages in thread From: Linus Torvalds @ 2022-06-27 22:34 UTC (permalink / raw) To: Peter Zijlstra Cc: Wedson Almeida Filho, Tejun Heo, Eric W. Biederman, Christian Brauner, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov On Mon, Jun 27, 2022 at 3:07 PM Peter Zijlstra <peterz@infradead.org> wrote: > > On Mon, Jun 27, 2022 at 06:04:12PM +0000, Wedson Almeida Filho wrote: > > > let new_thread = task::new_paused(|| pr_info!("Hello world\n"))?; > > I'm still having a really hard time with this Rust stuff, the above > looks like a syntax error and random characters to me :/ Heh. The '!' for macros is probably my least favorite part of Rust syntax, it just makes macros look so unintegrated. Not at all the kind of "you can use a macro instead of a function" thing, because macros always have that '!' thing. And yeah, the pipe characters used by closures sure make that particular line look extra magical. The question mark is a "do if ok, return if error". Think of it like a "try" thing for exception catching. But yeah, all the special characters does make me think of perl. I haven't really gotten the hang of reading rust without a google window open to figure things out, but I think that's just a "you have to get used to it". Or, alternatively, you have to just ignore the rust parts. As I mentioned at OSS NA last week - it's not like most people can read our MM code either - even when you know C, some of that code is pretty incomprehensible unless you know how it all works. If people can be productive kernel developers without understanding the MM layer, I'm sure people can be kernel developers without understanding rust.. 
Linus ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE) 2022-06-27 22:06 ` Peter Zijlstra 2022-06-27 22:34 ` Linus Torvalds @ 2022-06-27 22:45 ` Wedson Almeida Filho 2022-06-28 0:32 ` Wedson Almeida Filho 0 siblings, 0 replies; 52+ messages in thread From: Wedson Almeida Filho @ 2022-06-27 22:45 UTC (permalink / raw) To: Peter Zijlstra Cc: Tejun Heo, Linus Torvalds, Eric W. Biederman, Christian Brauner, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov On Tue, Jun 28, 2022 at 12:06:29AM +0200, Peter Zijlstra wrote: > On Mon, Jun 27, 2022 at 06:04:12PM +0000, Wedson Almeida Filho wrote: > > > let new_thread = task::new_paused(|| pr_info!("Hello world\n"))?; > > I'm still having a really hard time with this Rust stuff, the above > looks like a syntax error and random characters to me :/ Fair enough, there are a few things going on there. The code above could be expanded to the following: fn thread_func() { pr_info!("Hello world\n"); } let retval = task::new_paused(thread_func); if retval.is_err() { return retval.unwrap_err(); } let new_thread = retval.unwrap(); I hope the above is clearer. The question-mark operator is just syntax-sugar for the error handling above, the pipes indicate a closure, the full syntax is: |args| -> ret_type { statement1; statement2; etc. } -- the line you quote has the compiler infer the type (so it's omitted), there are no args (so there is nothing between the pipes), and when there's a single statement that is an expression whose type matches the return type, the curly braces can thus be omitted, so we end up with "|| x" as the closure. I'm confident people can get used to this. It's in the rare cases when one has to think about, say, subtyping, invariance, covariance, and contravariance that things may get hairy (or exciting for the more math-inclined). ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE) 2022-06-27 22:06 ` Peter Zijlstra 2022-06-27 22:34 ` Linus Torvalds 2022-06-27 22:45 ` Wedson Almeida Filho @ 2022-06-28 0:32 ` Wedson Almeida Filho 2022-06-28 7:58 ` Peter Zijlstra 2 siblings, 1 reply; 52+ messages in thread From: Wedson Almeida Filho @ 2022-06-28 0:32 UTC (permalink / raw) To: Peter Zijlstra Cc: Tejun Heo, Linus Torvalds, Eric W. Biederman, Christian Brauner, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov On Tue, Jun 28, 2022 at 12:06:29AM +0200, Peter Zijlstra wrote: > On Mon, Jun 27, 2022 at 06:04:12PM +0000, Wedson Almeida Filho wrote: > > > let new_thread = task::new_paused(|| pr_info!("Hello world\n"))?; > > I'm still having a really hard time with this Rust stuff, the above > looks like a syntax error and random characters to me :/ Peter, I meant to ask in my previous email: setting aside the syntax for a moment, do you have an opinion on the sort of things that Rust allows us to enforce at compile time (as exemplified in the new_paused() fragment)? ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE) 2022-06-28 0:32 ` Wedson Almeida Filho @ 2022-06-28 7:58 ` Peter Zijlstra 2022-06-30 0:57 ` Wedson Almeida Filho 0 siblings, 1 reply; 52+ messages in thread From: Peter Zijlstra @ 2022-06-28 7:58 UTC (permalink / raw) To: Wedson Almeida Filho Cc: Tejun Heo, Linus Torvalds, Eric W. Biederman, Christian Brauner, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov On Tue, Jun 28, 2022 at 12:32:33AM +0000, Wedson Almeida Filho wrote: > On Tue, Jun 28, 2022 at 12:06:29AM +0200, Peter Zijlstra wrote: > > On Mon, Jun 27, 2022 at 06:04:12PM +0000, Wedson Almeida Filho wrote: > > > > > let new_thread = task::new_paused(|| pr_info!("Hello world\n"))?; > > > > I'm still having a really hard time with this Rust stuff, the above > > looks like a syntax error and random characters to me :/ > > Peter, I meant to ask in my previous email: setting aside the syntax for > a moment, do you have an opinion on the sort of things that Rust allows > us to enforce at compile time (as exemplified in the new_paused() > fragment)? So I used to do quite a lot of C++ in a previous life; I think I'm more or less familiar with a lot of the things Rust offers, except it is a lot stricter. C++ allows you to do the right thing, but also allows you to take your own foot off (a bit like C, except you can make an even bigger mess of things), where Rust tries really hard to protect the foot. The one thing I dread is compile times, C++ is bad, but given Rust has to do even more compile time enforcement it'll suck worse. And I'm already not using clang because it's so much worse than gcc. 
I've just not had *any* time to actually look at Rust in any detail :/ But given I'm the kind of idiot that does tree-wide cleanups just because it's the right thing, I'm bound to run into it sooner rather than later, and then I'll curse my way through having to learn it just to get crap done I expect ... Anyway; from what I understand Rust is a fair way away from core code. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE) 2022-06-28 7:58 ` Peter Zijlstra @ 2022-06-30 0:57 ` Wedson Almeida Filho 0 siblings, 0 replies; 52+ messages in thread From: Wedson Almeida Filho @ 2022-06-30 0:57 UTC (permalink / raw) To: Peter Zijlstra Cc: Tejun Heo, Linus Torvalds, Eric W. Biederman, Christian Brauner, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov On Tue, Jun 28, 2022 at 09:58:43AM +0200, Peter Zijlstra wrote: > On Tue, Jun 28, 2022 at 12:32:33AM +0000, Wedson Almeida Filho wrote: > > Peter, I meant to ask in my previous email: setting aside the syntax for > > a moment, do you have an opinion on the sort of things that Rust allows > > us to enforce at compile time (as exemplified in the new_paused() > > fragment)? > > So I used to do quite a lot of C++ in a previous life; I think I'm more > or less familiar with a lot of the things Rust offers, except it is a > lot stricter. C++ allows you to do the right thing, but also allows you > to take your own foot off (a bit like C, except you can make an even > bigger mess of things), where Rust tries really hard to protect the > foot. That's a fair assessment. I'd just like to emphasise one aspect: if I shoot myself in the foot with Rust, it's either a bug in the compiler or because I violated a precondition of an unsafe block. > The one thing I dread is compile times, C++ is bad, but given Rust has > to do even more compile time enforcement it'll suck worse. And I'm > already not using clang because it's so much worse than gcc. Yes, it's definitely going to be slower, but I think this is a good tradeoff: I'd rather have checks performed when compiling than when running the kernel. One thing that may speed things up is to disable the borrow checker when compiling a known-good revision.
I don't think rustc allows this but rust-gcc doesn't implement borrow checking at all AFAIU, so we could presumably ask them to add a flag to keep it disabled (when/if they decide to implement it) in such cases. > I've just not had *any* time to actually look at Rust in any detail :/ > > But given I'm the kind of idiot that does tree-wide cleanups just > because it's the right thing, I'm bound to run into it sooner rather > than later, and then I'll curse my way through having to learn it just > to get crap done I expect ... Looking forward to this :) Jokes aside, when the day comes, I'm happy to discuss anything that doesn't seem to make sense. > Anyway; from what I understand Rust is a fair way away from core code. Indeed. However, we do want to expose core APIs (like paused thread creation) to developers writing Rust code, so while the implementation will remain in C, we'll implement the _API_ for a bunch of core code. Cheers, -Wedson ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE) 2022-06-26 19:16 ` [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE) Eric W. Biederman 2022-06-26 19:59 ` Linus Torvalds @ 2022-06-26 22:14 ` kernel test robot 2022-06-26 22:34 ` kernel test robot 2 siblings, 0 replies; 52+ messages in thread From: kernel test robot @ 2022-06-26 22:14 UTC (permalink / raw) To: Eric W. Biederman, Linus Torvalds Cc: kbuild-all, LKML, Christian Brauner, Tejun Heo, Petr Mladek, Lai Jiangshan, Michal Hocko, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Linux Memory Management List, Oleg Nesterov Hi "Eric, Thank you for the patch! Yet something to improve: [auto build test ERROR on mkp-scsi/for-next] [also build test ERROR on jejb-scsi/for-next linus/master v5.19-rc3] [cannot apply to linux/master next-20220624] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch] url: https://github.com/intel-lab-lkp/linux/commits/Eric-W-Biederman/kthread-Remove-the-flags-argument-from-kernel_thread/20220627-041557 base: https://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next config: x86_64-randconfig-a013 compiler: gcc-11 (Debian 11.3.0-3) 11.3.0 reproduce (this is a W=1 build): # https://github.com/intel-lab-lkp/linux/commit/950f4922fa2aef3d2e68b4b2ab728c6945830991 git remote add linux-review https://github.com/intel-lab-lkp/linux git fetch --no-tags linux-review Eric-W-Biederman/kthread-Remove-the-flags-argument-from-kernel_thread/20220627-041557 git checkout 950f4922fa2aef3d2e68b4b2ab728c6945830991 # save the config file mkdir build_dir && cp config build_dir/.config make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash If you fix the issue, kindly add following tag where applicable Reported-by: kernel test robot <lkp@intel.com> All errors (new ones prefixed by >>, old ones prefixed by <<): >> ERROR: 
modpost: "wake_up_new_task" [kernel/rcu/rcutorture.ko] undefined! -- 0-DAY CI Kernel Test Service https://01.org/lkp ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE) 2022-06-26 19:16 ` [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE) Eric W. Biederman 2022-06-26 19:59 ` Linus Torvalds 2022-06-26 22:14 ` kernel test robot @ 2022-06-26 22:34 ` kernel test robot 2 siblings, 0 replies; 52+ messages in thread From: kernel test robot @ 2022-06-26 22:34 UTC (permalink / raw) To: Eric W. Biederman, Linus Torvalds Cc: llvm, kbuild-all, LKML, Christian Brauner, Tejun Heo, Petr Mladek, Lai Jiangshan, Michal Hocko, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Linux Memory Management List, Oleg Nesterov Hi "Eric, Thank you for the patch! Yet something to improve: [auto build test ERROR on mkp-scsi/for-next] [also build test ERROR on jejb-scsi/for-next linus/master v5.19-rc3] [cannot apply to linux/master next-20220624] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch] url: https://github.com/intel-lab-lkp/linux/commits/Eric-W-Biederman/kthread-Remove-the-flags-argument-from-kernel_thread/20220627-041557 base: https://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next config: x86_64-randconfig-a012 compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project b0d6dd3905db145853c7c744ac92d49b00b1fa20) reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # https://github.com/intel-lab-lkp/linux/commit/950f4922fa2aef3d2e68b4b2ab728c6945830991 git remote add linux-review https://github.com/intel-lab-lkp/linux git fetch --no-tags linux-review Eric-W-Biederman/kthread-Remove-the-flags-argument-from-kernel_thread/20220627-041557 git checkout 950f4922fa2aef3d2e68b4b2ab728c6945830991 # save the config file mkdir build_dir && cp config build_dir/.config COMPILER_INSTALL_PATH=$HOME/0day 
COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash If you fix the issue, kindly add following tag where applicable Reported-by: kernel test robot <lkp@intel.com> All errors (new ones prefixed by >>, old ones prefixed by <<): ERROR: modpost: "wake_up_new_task" [kernel/rcu/rcutorture.ko] undefined! ERROR: modpost: "wake_up_new_task" [fs/nfs/nfsv4.ko] undefined! >> ERROR: modpost: "wake_up_new_task" [fs/ext4/ext4.ko] undefined! >> ERROR: modpost: "wake_up_new_task" [fs/jbd2/jbd2.ko] undefined! >> ERROR: modpost: "wake_up_new_task" [fs/ecryptfs/ecryptfs.ko] undefined! ERROR: modpost: "wake_up_new_task" [fs/cifs/cifs.ko] undefined! >> ERROR: modpost: "wake_up_new_task" [fs/nilfs2/nilfs2.ko] undefined! >> ERROR: modpost: "wake_up_new_task" [fs/gfs2/gfs2.ko] undefined! >> ERROR: modpost: "wake_up_new_task" [fs/f2fs/f2fs.ko] undefined! >> ERROR: modpost: "wake_up_new_task" [drivers/char/ipmi/ipmi_si.ko] undefined! WARNING: modpost: suppressed 30 unresolved symbol warnings because there were too many) -- 0-DAY CI Kernel Test Service https://01.org/lkp ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: re. Spurious wakeup on a newly created kthread 2022-06-25 23:43 ` Linus Torvalds 2022-06-25 23:48 ` Linus Torvalds @ 2022-06-26 0:21 ` Eric W. Biederman 1 sibling, 0 replies; 52+ messages in thread From: Eric W. Biederman @ 2022-06-26 0:21 UTC (permalink / raw) To: Linus Torvalds Cc: Christian Brauner, Tejun Heo, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov Linus Torvalds <torvalds@linux-foundation.org> writes: > On Sat, Jun 25, 2022 at 4:28 PM Eric W. Biederman <ebiederm@xmission.com> wrote: >> >> I presume you mean kthreadd games? > > Yeah, sorry. > >> So with the introduction of kthreadd the kernel threads were moved >> out of the userspace process tree, and userspace stopped being able to >> influence the kernel threads. > > Ahh. So essentially it's indeed just basically the parenting issue. That is one way to look at it. The way I described it at the time was: > commit 73c279927f89561ecb45b2dfdf9314bafcfd9f67 > Author: Eric W. Biederman <ebiederm@xmission.com> > Date: Wed May 9 02:34:32 2007 -0700 > > kthread: don't depend on work queues > > Currently there is a circular reference between work queue initialization > and kthread initialization. This prevents the kthread infrastructure from > initializing until after work queues have been initialized. > > We want the properties of tasks created with kthread_create to be as close > as possible to the init_task and to not be contaminated by user processes. > The later we start our kthreadd that creates these tasks the harder it is > to avoid contamination from user processes and the more of a mess we have > to clean up because the defaults have changed on us. > > So this patch modifies the kthread support to not use work queues but to > instead use a simple list of structures, and to have kthreadd start from > init_task immediately after our kernel thread that execs /sbin/init. 
> > By being a true child of init_task we only have to change those process > settings that we want to have different from init_task, such as our process > name, the cpus that are allowed, blocking all signals and setting SIGCHLD > to SIG_IGN so that all of our children are reaped automatically. > > By being a true child of init_task we also naturally get our ppid set to 0 > and do not wind up as a child of PID == 1. Ensuring that tasks generated > by kthread_create will not slow down the functioning of the wait family of > functions. > > [akpm@linux-foundation.org: use interruptible sleeps] > Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> > Cc: Oleg Nesterov <oleg@tv-sign.ru> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Eric ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: re. Spurious wakeup on a newly created kthread 2022-06-25 18:43 ` Linus Torvalds 2022-06-25 23:28 ` Eric W. Biederman @ 2022-06-28 14:16 ` Christian Brauner 1 sibling, 0 replies; 52+ messages in thread From: Christian Brauner @ 2022-06-28 14:16 UTC (permalink / raw) To: Linus Torvalds Cc: Eric W. Biederman, Tejun Heo, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov On Sat, Jun 25, 2022 at 11:43:15AM -0700, Linus Torvalds wrote: > On Sat, Jun 25, 2022 at 11:25 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > And that's not at all what the kthread code wants. It wants to set > > affinity masks, it wants to create a name for the thread, it wants to > > do all those other things. > > > > That code really wants to just do copy_process(). > > Honestly, I think kernel/kthread.c should be almost rewritten from scratch. > > I do not understand why it does all those odd keventd games at all, > and why kthread_create_info exists in the first place. > > Why does kthread_create() not just create the thread directly itself, > and instead does that odd queue it onto a work function? > > Some of that goes back to before the git history, and very little of > it seems to make any sense. It's as if the code is meant to be able to > run from interrupt context, but that can't be it: it's literally doing > a GFP_KERNEL kmalloc, it's doing spin-locks without irq safety etc. > > So why is it calling kthreadd_task() to create the thread? Purely for > some crazy odd "make that the parent" reason? > > I dunno. The code is odd, unexplained, looks buggy, and most of the > reasons are probably entirely historical. > > I'm adding Christian to this thread too, since I get the feeling that > it really should be more tightly integrated with copy_process(), and > that Christian might have comments.
> Christian, see some context in the thread here: > > https://lore.kernel.org/all/CAHk-=wiC7rj1o7vTnYUPfD7YxAu09MZiZbahHqvLm9+Cgg1dFw@mail.gmail.com/ > > for some of this. Sorry, I was at LSS last week. I honestly didn't touch the code back then because it seemed almost entirely unrelated to regular task creation apart from kernel_thread() that I added. I didn't feel comfortable changing a lot of stuff there. Iirc, just a few months ago io_uring still made use of the kthread infrastructure and I think that made the limits of the interface more obvious. Now we will soon have two users that create a version of kernel generated threads with properties of another process (io_uring and [1]). In my head, the kthread infra should be able to support generation of pure kernel threads as well as the creation of user workers instead of adding specialized interfaces to do this. The fact that it doesn't is a limitation of the interface that imho shows it hasn't grown to adapt to the new use-cases we have. And imho we'll see more of those. In this context it's really worth looking at [1] because to some extent it duplicates bits we have for the kthread infra whereas I still think the kthread infra should support both, possibly exposing two APIs: one to return pure kernel threads and the other to return struct user_worker or similar. Idk, it might just be a heat-stroke talking... I don't feel comfortable making strong assertions about the original implementation of kthreads. I wasn't around and there might be historical context I'm missing. One issue that Tejun also mentioned later in the thread and that we run into is that we have a pattern where we create a kthread and then trust the caller to handle/activate the new task. This is more problematic once we start supporting something like [1] where that's exposed to a driver. (Ideally creation of such a task would generate a unique callback - I think Peter suggested something like this? - that could only be used on that task...)
[1]: https://lore.kernel.org/lkml/20220620011357.10646-1-michael.christie@oracle.com ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: re. Spurious wakeup on a newly created kthread 2022-06-25 18:25 ` Linus Torvalds 2022-06-25 18:43 ` Linus Torvalds @ 2022-06-26 0:26 ` Eric W. Biederman 1 sibling, 0 replies; 52+ messages in thread From: Eric W. Biederman @ 2022-06-26 0:26 UTC (permalink / raw) To: Linus Torvalds Cc: Tejun Heo, Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov Linus Torvalds <torvalds@linux-foundation.org> writes: > On Sat, Jun 25, 2022 at 10:36 AM Eric W. Biederman > <ebiederm@xmission.com> wrote: >> >> Let me suggest someone create a new variant of kthread_create that takes >> all of the parameters the workqueue code wants to set. > > I suspect the real issue is that the kthread code simply > shouldn't use the kernel_thread() helper at all. > > That helper is literally designed for "start a thread, run this thing". > > That's what it *does*. > > And that's not at all what the kthread code wants. It wants to set > affinity masks, it wants to create a name for the thread, it wants to > do all those other things. > > That code really wants to just do copy_process(). > > Or maybe it really should just use create_io_thread(), which has a > much better interface, except it has that one oddity in that it sets > the flag that does this: > > if (args->io_thread) { > /* > * Mark us an IO worker, and block any signal that isn't > * fatal or STOP > */ > p->flags |= PF_IO_WORKER; > siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP)); > } > > that then has special semantics. It is worth pointing out that kthreads also have special signal semantics, and they are different. In particular, kthreads ignore all signals by default. The io_threads are much more like userspace threads: the userspace process handles signals, and it is just the io_thread that blocks the signals so they will go to real userspace processes. Eric ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: re. Spurious wakeup on a newly created kthread 2022-06-25 17:01 ` Linus Torvalds 2022-06-25 17:36 ` Eric W. Biederman @ 2022-06-26 1:58 ` Tejun Heo 2022-06-26 2:53 ` Linus Torvalds 2022-06-28 9:51 ` Petr Mladek 2 siblings, 1 reply; 52+ messages in thread From: Tejun Heo @ 2022-06-26 1:58 UTC (permalink / raw) To: Linus Torvalds Cc: Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov, Eric W. Biederman Hello, On Sat, Jun 25, 2022 at 10:01:35AM -0700, Linus Torvalds wrote: > On Fri, Jun 24, 2022 at 10:00 PM Tejun Heo <tj@kernel.org> wrote: > > > > So, Petr debugged a NULL deref in workqueue code to a spurious wakeup > > on a newly created kthread. > > What? No. That patch can't be right for several reasons. > > What we call "spurious wakeups" exist, but they are about wakeups that > happen from being on a _previous_ wait-queue, and having already been > removed from it. > > They aren't "really" spurious, they are just asynchronous enough (and > thus unexpected) that you basically should never have a "sleep on > wait-queue" without then looping and re-testing the condition. Can you elaborate on this a bit? At least for the standard wait_event-ish wait blocks, the waiter always does finish_wait() before leaving a wait. finish_wait() does a lockless check on wq_entry->entry and may or may not grab wq_head->lock. When it does, it's fully synchronized against the waker. Even when it doesn't, while the lack of memory ordering on the finish_wait() side may let things slide a bit, I can't see how it can slide after the set_current_state in the next wait block. I'm probably missing something. Is it about bespoke wait mechanisms? Can you give a concrete example of an async wakeup scenario? > There is no _truly_ spurious wakeup. You were always woken up for a > reason, it's just that there are more reasons than the entirely > obvious ones.
So, the deferred wakeups from earlier waits are one. Can you give some other examples? This is something which has always bothered me, and I couldn't find explanations which aren't hand-wavy on my own. It'd be really great to have clarity. > For example, the reason that quoted patch cannot be right is that this > code pattern: > > while (wait_for_completion_interruptible(&worker->ready_to_start)) > ; > > is not valid kernel code. EVER. There is absolutely no way that can be correct. > > Either that code can take a signal, or it cannot. If it can take a > signal, it had better react to said signal. If it cannot, it must not > use an interruptible sleep - since now that loop turned into a > kernel-side busy-loop. > > So NAK on this kind of crazy "I don't know what happened, so I'll just > add *more* bugs to the code" voodoo programming. > > And no, we don't "fix" that by then adding a timeout. Yeah, I should've been more explicit on this. Michal already pointed out that it doesn't make sense to loop over interruptible timed sleeps and it should use one plain uninterruptible sleep, so this part isn't in question. ... > I think the problem here is much more fundamental: you expect a new > thread to not wake up until you've told it to. > > We do have that infrastructure in the kernel: when you create a new > thread, you can do various setup, and the thread won't actually run > until you do "wake_up_new_task()" on it. That's because that's the only thing which ignores TASK_NEW, right? > However, that's not how kernel_thread() (or kernel_clone()) works. > Those will call wake_up_new_task(p) for you, and as such a new kernel > thread will immediately start running. > > So I think the expectations here are entirely wrong. I think > create_worker() is fundamentally buggy, in how it does that > > /* start the newly created worker */ > .. > wake_up_process(worker->task); > > because that wake_up_process() is already much too late.
The process > got woken up already, because it was created by create_kthread() -> > kernel_thread() -> kernel_clone, which does that wake_up_new_task() > and it starts running. A couple things still aren't clear for me. * If there are no true spurious wakeups, where did the racing wakeup come from? The task just got created w/ TASK_NEW and woken up once with wake_up_new_task(). It hasn't been on any wait queue or advertised itself to anything. * If there are spurious wakeups, why is kthread() scheduling after signaling creation completion in the first place? As I wrote before, all it would do is masking these bugs. If we can't gurantee that the kthread will stay blocked, shouldn't we just remove the schedule_preempt_disabled() call in kthread()? Thanks. -- tejun ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: re. Spurious wakeup on a newly created kthread 2022-06-26 1:58 ` Tejun Heo @ 2022-06-26 2:53 ` Linus Torvalds 2022-06-26 6:09 ` Tejun Heo 2022-06-27 12:04 ` Michal Hocko 0 siblings, 2 replies; 52+ messages in thread From: Linus Torvalds @ 2022-06-26 2:53 UTC (permalink / raw) To: Tejun Heo Cc: Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov, Eric W. Biederman On Sat, Jun 25, 2022 at 6:58 PM Tejun Heo <tj@kernel.org> wrote: > > > > They aren't "really" spurious, they are just asynchronous enough (and > > thus unexpected) that you basically should never have a "sleep on > > wait-queue" without then looping and re-testing the condition. > > Can you elaborate on this a bit? At least for the standard > wait_event-ish wait blocks, the waiter always does finish_wait() > before leaving a wait. Correct. As long as *all* you ever use is a standard wait_event thing, you are always serialized by the spinlock on the wait queue. However. Very few processes only use standard wait_event things. There are a number of places that use "wake_up_process()" which doesn't use a wait-queue, but a wake-up directed right at a particular task. Is that the _common_ pattern? It's not *uncommon*. It's maybe not the strictly normal wait-queue one, but if you grep for "wake_up_process()" you will find quite a lot of them. And yes, many of those uses too will use strictly serialized locking: in that both the waker and the sleeper (ie the target of the wake_up_process()) will have serialized the target thread pointer with some lock, exactly like the normal wait-queues serialize using the wait-queue lock. But several do *not* use locking, and instead rely on the thread_struct being RCU-free'd. In fact, I think kthread_create() itself is such an example, with that wake_up_process(kthreadd_task); just doing a blind directed wakeup with no locking whatsoever, just a straight wake_up.
And the important thing to notice is that if you have even just *one* such user, that kthreadd_task will basically get randomly "spuriously" woken up while it is waiting for something else. Now, the kthreadd_task task is kind of specialized, and that particular wake_up_process() doesn't affect anybody else, so think of that just as an example. But now imagine any of the other unlocked directed wake_up_process() users. Imagine them targeting regular user processes that may be doing regular system calls. Imagine that wake_up_process() being just done under an RCU read lock, and maybe the process had already woken up *AND* had already gone on to do entirely other things! IOW, maybe you are now a thread that does a perfectly normal locked wait_queue thing - but because the *previous* system call did something that did less strictly locked things, there may be another CPU that is still in the process of waking you up from that previous system call. So now your "strictly locked" waitqueue usage sees a "spurious" wakeup, because a previous not-so-strictly-locked usage had been delayed by interrupts on another CPU. And yes, those "wake up process under RCU read lock" really do exist. There are literally things that take a reference to a task struct, add it to some RCU-safe list, and then do the wakeup without locking. > I'm probably missing something. Is it about bespoke wait mechanisms? > Can you give a concrete example of an async wakeup scenario? Grep for wake_up_process() and just look for them. Side note: signals end up doing effectively the same thing if you're doing interruptible waits, of course, but people are probably more *aware* of signals when they use TASK_INTERRUPTIBLE. But these RCU-delayed wakeups can wake up even TASK_UNINTERRUPTIBLE calls. Anyway, it's easy enough to deal with: use the "event" macros, and you'll never have to worry about it.
But if you use explicit wait-queues and manual scheduling (rather than wait_event() and friends), you need to be aware that when you go to sleep, the fact that you woke up is *not* a guarantee that the wakeup came from the wait queue. So you need to always do that in a loop. The wait_event code will do that loop for you, but if you do manual wait-queues you are required to do the looping yourself. > So, the deferred wakeups from earlier waits are one. Can you give some > other examples? This is something which has always bothered me and I > couldn't find explanations which aren't hand-wavy on my own. It'd be > really great to have clarity. There's a second class of "spurious" waits that aren't really spurious at all, and aren't even deferred, and are simply due to "there were multiple things you waited for, but you didn't even *think* about it". The obvious one is for poll/select, but there people are aware of it. The less obvious one is for taking a page fault or being put to sleep and being woken up again while actively *inside* a wait loop, and already with a wait-queue entry added. For a historical example of *both* of those kinds of situations, take a look at n_tty_read() in drivers/tty/n_tty.c, and in particular, notice how the wait-loop looks something like this: add_wait_queue(&tty->read_wait, &wait); while (nr) { .. } ... remove_wait_queue(&tty->read_wait, &wait); and *inside* the loop, you have both (a) locking: down_read(&tty->termios_rwsem); (b) historically (but not any more) user space accesses and both of those kinds mean that you actually have overlapping lifetimes of wait-queues for the same process. That's very much how wait-queues were always designed to work - it solves a lot of race conditions (ie the traditional "sleep_on()" model is broken garbage), but it means that each individual user doesn't necessarily understand that other active wait-queues can wake up a process *while* the code thinks about another wait-queue entirely. 
IOW, when you are in the page fault code, and you are waiting for the filesystem IO to complete, you may *also* be on that kind of "tty->read_wait" waitqueue, and a character coming in on that tty will wake you up. The *filesystem* code doesn't care about the tty wakeup, so it's very much an example of a "spurious" wake event as far as the filesystem code is concerned. Similarly, when you are locking a semaphore, the only wait-queue activity you care about at the time of the lock is the one coming from the unlockers, but you may get "spurious" wakeups from other things that the process is *also* interested in. Again, none of these are *really* spurious. They are real wakeup events. It's just that within the *local* code they look spurious, because the locking code, the disk IO code, whatever the code is doesn't know or care about all the other things that process is involved in. And once again, it's not that different from "hey, signals can wake you up at pretty much any time", but people *think* that because they are doing a non-interruptible wait it is somehow "exclusive", and don't think about all the other things that that process has been involved in. > * If there are no true spurious wakeups, where did the racing wakeup > come from? The task just got created w/ TASK_NEW and woken up once > with wake_up_new_task(). It hasn't been on any wait queue or > advertised itself to anything. I don't think it was ever a spurious wakeup at all. The create_worker() code does: worker->task = kthread_create_on_node(.. .. worker_attach_to_pool(worker, pool); .. wake_up_process(worker->task); and thinks that the wake_up_process() happens after the worker_attach_to_pool(). But I don't see that at all. The reality seems to be that the wake_up_process() is a complete no-op, because the task was already woken up by kthread_create_on_node(). But I dunno. I think the whole argument of Any kthread is supposed to stay in TASK_UNINTERRUPTIBLE sleep until it is explicitly woken.
was completely broken to begin with, and then the belief that there's some spurious wakeup came from that.

But hey, that "wake_up_process(worker->task)" itself does seem spurious.

Anyway, I think this whole "spurious wakeup" is a complete red herring. They don't "really" exist, but any code that expects some exclusive wakeup tends to be broken code, and that kind of thinking is dangerous. So you should *ALWAYS* write your code as if things can get random spurious wakeups at any time. Because with various CPU hotplug events etc, it really might happen.

But I don't think that's what's going on here. I think the workqueue code is just confused, and should have initialized "worker->pool" much earlier. Because as things are now, when worker_thread() starts running, and does that

	static int worker_thread(void *__worker)
	{
		struct worker *worker = __worker;
		struct worker_pool *pool = worker->pool;

thing, that can happen *immediately* after that

	kthread_create_on_node(worker_thread, worker,

happens. It just so happens that *normally* the create_worker() code ends up finishing setup before the new worker has actually finished scheduling..

No?

Linus

^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: re. Spurious wakeup on a newly created kthread 2022-06-26 2:53 ` Linus Torvalds @ 2022-06-26 6:09 ` Tejun Heo 2022-06-27 12:04 ` Michal Hocko 1 sibling, 0 replies; 52+ messages in thread From: Tejun Heo @ 2022-06-26 6:09 UTC (permalink / raw) To: Linus Torvalds Cc: Petr Mladek, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov, Eric W. Biederman

Hello,

On Sat, Jun 25, 2022 at 07:53:34PM -0700, Linus Torvalds wrote:
...
> Is that the _common_ pattern? It's not *uncommon*. It's maybe not the
> strictly normal wait-queue one, but if you grep for
> "wake_up_process()" you will find quite a lot of them.
...
> > I'm probably missing something. Is it about bespoke wait mechanisms?
> > Can you give a concrete example of an async wakeup scenario?
>
> Grep for wake_up_process() and just look for them.

Right, for kthreads, a custom work list + directed wakeup at the known handler task is a pattern seen across the code base, and that does affect all other waits the kthread would do.

...
> So you need to always do that in a loop. The wait_event code will do
> that loop for you, but if you do manual wait-queues you are required
> to do the looping yourself.

The reason why this bothered me sometimes is that w/ simple kthread use cases, there are places where all the legitimate wakeup sources are clearly known. In those cases, the fact that I can't think of a case where the looping would be needed creates a subtle nagging feeling when writing them.

...
> Again, none of these are *really* spurious. They are real wakeup
> events. It's just that within the *local* code they look spurious,
> because the locking code, the disk IO code, whatever the code is
> doesn't know or care about all the other things that process is
> involved in.

I see. Yeah, waiting on multiple sources which may not be known to each wait logic makes sense.

...
> But I don't think that's what's going on here.
> I think the workqueue
> code is just confused, and should have initialized "worker->pool" much
> earlier. Because as things are now, when worker_thread() starts
> running, and does that
>
> static int worker_thread(void *__worker)
> {
>         struct worker *worker = __worker;
>         struct worker_pool *pool = worker->pool;
>
> thing, that can happen *immediately* after that
>
> kthread_create_on_node(worker_thread, worker,
>
> happens. It just so happens that *normally* the create_worker() code
> ends up finishing setup before the new worker has actually finished
> scheduling..
>
> No?

I'm not sure. Putting aside the always-loop-with-cond-check principle for the time being, I can't yet think of a scenario where the interlocking would break down unless there really is a wakeup which is coming from an unrelated source.

Just experimented with commenting out that wakeup in create_worker(). Simply commenting it out doesn't break anything, but if I wait for PF_WQ_WORKER to be set by the new worker thread, it doesn't happen. ie. the initial wakeup is spurious because there is a later wakeup which comes when a work item is actually dispatched to the new worker. But the newly created kworker won't start executing its callback function without at least one extra wakeup, wherever that may be coming from.

After the initial TASK_NEW handshake, the new kthread notifies kthread_create_info->done and then goes into an UNINTERRUPTIBLE sleep and won't start executing the callback function unless somebody wakes it up. This being a brand new task, it's a pretty sterile wakeup environment. So, there actually has to be an outside wakeup source here.

If we say that anyone should expect external wakeups that it has no direct control over, the kthread_bind interface is broken too cuz that's making exactly the same assumption - that the task is dormant at that point till the owner sends a wakeup and thus its internal state can be manipulated safely.
Petr, regardless of how this eventually gets resolved, I'm really curious where the wakeup that you saw came from. Would it be possible for you to find out? Thanks. -- tejun ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: re. Spurious wakeup on a newly created kthread 2022-06-26 2:53 ` Linus Torvalds 2022-06-26 6:09 ` Tejun Heo @ 2022-06-27 12:04 ` Michal Hocko 1 sibling, 0 replies; 52+ messages in thread From: Michal Hocko @ 2022-06-27 12:04 UTC (permalink / raw) To: Linus Torvalds Cc: Tejun Heo, Petr Mladek, Lai Jiangshan, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov, Eric W. Biederman

On Sat 25-06-22 19:53:34, Linus Torvalds wrote:
> On Sat, Jun 25, 2022 at 6:58 PM Tejun Heo <tj@kernel.org> wrote:
[...]
> > * If there are no true spurious wakeups, where did the racing wakeup
> > come from? The task just got created w/ TASK_NEW and woken up once
> > with wake_up_new_task(). It hasn't been on any wait queue or
> > advertised itself to anything.
>
> I don't think it was ever a spurious wakeup at all.
>
> The create_worker() code does:
>
> worker->task = kthread_create_on_node(..
> ..
> worker_attach_to_pool(worker, pool);
> ..
> wake_up_process(worker->task);
>
> and thinks that the wake_up_process() happens after the worker_attach_to_pool().
>
> But I don't see that at all.
>
> The reality seems to be that the wake_up_process() is a complete
> no-op, because the task was already woken up by
> kthread_create_on_node().

Just for the record, the newly created thread is not running our thread function at this stage. It is rather subtle and took me some time to decipher, but kthread_create_on_node() will create and wake up a kernel thread running the kthread() function:

	[...]
	/*
	 * Thread is going to call schedule(), do not preempt it,
	 * or the creator may spend more time in wait_task_inactive().
	 */
	preempt_disable();
	complete(done);
	schedule_preempt_disabled();
	preempt_enable();

	ret = -EINTR;
	if (!test_bit(KTHREAD_SHOULD_STOP, &self->flags)) {
		cgroup_kthread_ready();
		__kthread_parkme(self);
		ret = threadfn(data);
	}

so the newly created thread will go to sleep before calling the threadfn (worker_thread() here).
Somebody other than create_worker() must have woken it up. I couldn't find out who that was (see my other email with some notes from the crash dump).

I do agree that a simple schedule without checking for a condition is dubious, fragile and wrong. If anything, wait_for_completion would be less confusing and more targeted waiting. Petr has added that completion into worker_thread to address this specific case, but a better fix would be to address all callers, because who knows how many of those are similarly broken.

I also do agree that this whole scheme is rather convoluted and that having an init() callback executed before threadfn would be much easier to follow.

-- Michal Hocko SUSE Labs

^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: re. Spurious wakeup on a newly created kthread 2022-06-25 17:01 ` Linus Torvalds 2022-06-25 17:36 ` Eric W. Biederman 2022-06-26 1:58 ` Tejun Heo @ 2022-06-28 9:51 ` Petr Mladek 2022-06-28 10:07 ` Tejun Heo 2 siblings, 1 reply; 52+ messages in thread From: Petr Mladek @ 2022-06-28 9:51 UTC (permalink / raw) To: Linus Torvalds Cc: Tejun Heo, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov, Eric W. Biederman

On Sat 2022-06-25 10:01:35, Linus Torvalds wrote:
> On Fri, Jun 24, 2022 at 10:00 PM Tejun Heo <tj@kernel.org> wrote:
> For example, the reason that quoted patch cannot be right is that this
> code pattern:
>
> while (wait_for_completion_interruptible(&worker->ready_to_start))
>         ;
>
> is not valid kernel code. EVER. There is absolutely no way that can be correct.
>
> Either that code can take a signal, or it cannot. If it can take a
> signal, it had better react to said signal. If it cannot, it must not
> use an interruptible sleep - since now that loop turned into a
> kernel-side busy-loop.

I agree that the code is ugly and does not make sense.

JFYI, I did it this way because I wanted to calm down the hung task detector. kthreads ignore signals, so I did not think about this aspect. I thought I saw this trick somewhere.

Do not ask me why I wanted to calm down the hung task detector. I guess that I felt overloaded and wished for a calm day or something like this. I agree that it does not make much sense.

Best Regards, Petr

^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: re. Spurious wakeup on a newly created kthread 2022-06-28 9:51 ` Petr Mladek @ 2022-06-28 10:07 ` Tejun Heo 0 siblings, 0 replies; 52+ messages in thread From: Tejun Heo @ 2022-06-28 10:07 UTC (permalink / raw) To: Petr Mladek Cc: Linus Torvalds, Lai Jiangshan, Michal Hocko, Linux Kernel Mailing List, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Oleg Nesterov, Eric W. Biederman

On Tue, Jun 28, 2022 at 11:51:37AM +0200, Petr Mladek wrote:
> I guess that I felt overloaded and wished for a calm day
> or something like this. I agree that it does not make
> much sense.

Hey, we all get overwhelmed sometimes. I wish you a calm day!

-- tejun

^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: re. Spurious wakeup on a newly created kthread 2022-06-25 5:00 ` re. Spurious wakeup on a newly created kthread Tejun Heo 2022-06-25 17:01 ` Linus Torvalds @ 2022-06-27 8:07 ` Michal Hocko 2022-06-27 8:21 ` Tejun Heo 2022-06-28 15:08 ` Petr Mladek 1 sibling, 2 replies; 52+ messages in thread From: Michal Hocko @ 2022-06-27 8:07 UTC (permalink / raw) To: Tejun Heo Cc: Petr Mladek, Lai Jiangshan, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Linus Torvalds, Oleg Nesterov, Eric W. Biederman

On Sat 25-06-22 14:00:10, Tejun Heo wrote:
[...]
> tl;dr is that the worker creation code expects a newly created worker
> kthread to sit tight until the creator finishes setting up stuff and
> sends the initial wakeup. However, something, which wasn't identified
> in the report (Petr, it'd be great if you can find out who did the
> wakeup), wakes up the new kthread before the creation path is done
> with init which causes the new kthread to try to deref a NULL pointer.

One thing that hasn't been really mentioned explicitly but might be really important: 5.3.18-2-rt is the RT patchset backported to a 5.3 based SLES kernel. We do not know whether the upstream RT patchset is affected the same way, and due to the nature of the customer we cannot really have this tested on that kernel.

I have tried to dig that out from the crash dump but no luck unfortunately.
I was able to find the kthread pointer on a few stacks:

crash> search -t ffff9ca77e9a8000
PID: 2      TASK: ffff9ca700381d80  CPU: 0   COMMAND: "kthreadd"
ffffb0f68001fd78: ffff9ca77e9a8000
ffffb0f68001fe38: ffff9ca77e9a8000

PID: 842    TASK: ffff9ca75e6349c0  CPU: 0   COMMAND: "sysmon"
ffffb0f6808e7c50: ffff9ca77e9a8000
ffffb0f6808e7da8: ffff9ca77e9a8000

PID: 2035   TASK: ffff9ca71e252c40  CPU: 1   COMMAND: "vReqJob"
ffffb0f68300fba8: ffff9ca77e9a8000
ffffb0f68300fbf0: ffff9ca77e9a8000

PID: 16127  TASK: ffff9ca77e9abb00  CPU: 0   COMMAND: "kworker/0:0H"
ffffb0f688067dd0: ffff9ca77e9a8000

PID: 16213  TASK: ffff9ca77e9a8000  CPU: 1   COMMAND: "kworker/0:2H"
ffffb0f68822fba0: ffff9ca77e9a8000
ffffb0f68822fca0: ffff9ca77e9a8000
ffffb0f68822fd70: ffff9ca77e9a8000
ffffb0f68822fe38: ffff9ca77e9a8000
ffffb0f68822fef8: ffff9ca77e9a8000

kthreadd is not all that interesting because that one has created the thread.

sysmon is

crash> bt ffff9ca75e6349c0
PID: 842 TASK: ffff9ca75e6349c0 CPU: 0 COMMAND: "sysmon"
 #0 [ffffb0f6808e7be8] __schedule at ffffffffaa94bef1
 #1 [ffffb0f6808e7c78] preempt_schedule_common at ffffffffaa94c5d1
 #2 [ffffb0f6808e7c80] ___preempt_schedule at ffffffffaa201bb6
 #3 [ffffb0f6808e7cd8] rt_mutex_postunlock at ffffffffaa309f6b
 #4 [ffffb0f6808e7ce8] rt_mutex_futex_unlock at ffffffffaa94ea7c
 #5 [ffffb0f6808e7d30] rt_spin_unlock at ffffffffaa950ace
 #6 [ffffb0f6808e7d38] proc_pid_status at ffffffffaa4c9ff4
 #7 [ffffb0f6808e7dd8] proc_single_show at ffffffffaa4c3801
 #8 [ffffb0f6808e7e10] seq_read at ffffffffaa47c1b8
 #9 [ffffb0f6808e7e70] vfs_read at ffffffffaa455fd9
 #10 [ffffb0f6808e7ea0] ksys_read at ffffffffaa456364

so it is collecting /proc/<pid>/status data and releasing the sighand lock.
crash> bt ffff9ca71e252c40
PID: 2035 TASK: ffff9ca71e252c40 CPU: 1 COMMAND: "vReqJob"
 #0 [ffffb0f68300fbb0] __schedule at ffffffffaa94bef1
 #1 [ffffb0f68300fc40] schedule at ffffffffaa94c4e6
 #2 [ffffb0f68300fc50] futex_wait_queue_me at ffffffffaa33e2c0
 #3 [ffffb0f68300fc88] futex_wait at ffffffffaa33f199
 #4 [ffffb0f68300fd98] do_futex at ffffffffaa341062
 #5 [ffffb0f68300fe70] __x64_sys_futex at ffffffffaa341a7e
 #6 [ffffb0f68300fee0] do_syscall_64 at ffffffffaa2025f0

is scheduled into the kthread, which is expected because the newly created kernel thread should be running kthread(), where it sleeps before calling the thread function.

kworker/0:2H is our new kthread worker.

crash> bt ffff9ca77e9abb00
 #5 [ffffb0f688067c90] delay_tsc at ffffffffaa9479fc
 #6 [ffffb0f688067c90] wait_for_xmitr at ffffffffaa6c1e45
 #7 [ffffb0f688067cb0] serial8250_console_putchar at ffffffffaa6c1ee6
 #8 [ffffb0f688067cc8] uart_console_write at ffffffffaa6bb095
 #9 [ffffb0f688067cf0] serial8250_console_write at ffffffffaa6c5962
 #10 [ffffb0f688067d70] console_unlock at ffffffffaa30d795
 #11 [ffffb0f688067db0] vprintk_emit at ffffffffaa30fc11
 #12 [ffffb0f688067e00] printk at ffffffffaa3103c0
 #13 [ffffb0f688067e68] __kthread_bind_mask at ffffffffaa2d74df
 #14 [ffffb0f688067e90] create_worker at ffffffffaa2cfc13
 #15 [ffffb0f688067ed8] worker_thread at ffffffffaa2d1cbe
 #16 [ffffb0f688067f10] kthread at ffffffffaa2d6da2

is the delayed thread (kworker/0:0H) which is creating the worker and is currently stuck somewhere in the printk (console) code.

So if somebody has woken up our thread from inside kthread() then it doesn't have that pointer on the stack, and I couldn't find it elsewhere either. Maybe somebody has an idea where to look.

Btw. I still haven't caught up with the rest of the email thread and today will be quite busy for me. Anyway, if somebody has ideas about further post mortem debugging then I am more than happy to help out.

-- Michal Hocko SUSE Labs

^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: re. Spurious wakeup on a newly created kthread 2022-06-27 8:07 ` Michal Hocko @ 2022-06-27 8:21 ` Tejun Heo 2022-06-27 10:18 ` Michal Hocko 0 siblings, 1 reply; 52+ messages in thread From: Tejun Heo @ 2022-06-27 8:21 UTC (permalink / raw) To: Michal Hocko Cc: Petr Mladek, Lai Jiangshan, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Linus Torvalds, Oleg Nesterov, Eric W. Biederman

Hello, Michal.

On Mon, Jun 27, 2022 at 10:07:22AM +0200, Michal Hocko wrote:
> So if somebody has woken up our thread from inside kthread() then it
> doesn't have that pointer on the stack and I couldn't find it elsewhere
> either. Maybe somebody has an idea where to look at.

One way could be bpftrace'ing or printk'ing __wake_up_common() and friends to dump a backtrace if it's trying to wake a kthread whose comm starts with kworker/ and doesn't have ((struct worker *)kthread_data(task))->pool set.

Thanks.

-- tejun

^ permalink raw reply [flat|nested] 52+ messages in thread
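Tejun's tracing suggestion might look roughly like the following bpftrace sketch (untested, and purely illustrative: it assumes a kernel with BTF so the task_struct cast resolves, probes try_to_wake_up() rather than __wake_up_common(), and skips the ->pool filter for simplicity; map and variable names are made up):

```
// Hedged sketch: record which tasks wake kworkers, with waker stacks.
kprobe:try_to_wake_up
{
	$p = (struct task_struct *)arg0;
	if (strncmp($p->comm, "kworker/", 8) == 0) {
		@wakers[comm, kstack] = count();
	}
}
```

On exit, bpftrace prints the `@wakers` map, which would show each waker's comm and kernel stack with a hit count; correlating that with the crash timestamp is the part that still needs a live system, which is exactly what Michal says he no longer has.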
* Re: re. Spurious wakeup on a newly created kthread 2022-06-27 8:21 ` Tejun Heo @ 2022-06-27 10:18 ` Michal Hocko 0 siblings, 0 replies; 52+ messages in thread From: Michal Hocko @ 2022-06-27 10:18 UTC (permalink / raw) To: Tejun Heo Cc: Petr Mladek, Lai Jiangshan, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Linus Torvalds, Oleg Nesterov, Eric W. Biederman

On Mon 27-06-22 17:21:12, Tejun Heo wrote:
> Hello, Michal.
>
> On Mon, Jun 27, 2022 at 10:07:22AM +0200, Michal Hocko wrote:
> > So if somebody has woken up our thread from inside kthread() then it
> > doesn't have that pointer on the stack and I couldn't find it elsewhere
> > either. Maybe somebody has an idea where to look at.
>
> One way could be bpftrace'ing or printk'ing __wake_up_common() and
> friends to dump a backtrace if it's trying to wake a kthread whose comm
> starts with kworker/ and doesn't have ((struct worker
> *)kthread_data(task))->pool set.

I am afraid I won't have a chance for runtime debugging. All I have at this stage is a crash dump.

-- Michal Hocko SUSE Labs

^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: re. Spurious wakeup on a newly created kthread 2022-06-27 8:07 ` Michal Hocko 2022-06-27 8:21 ` Tejun Heo @ 2022-06-28 15:08 ` Petr Mladek 1 sibling, 0 replies; 52+ messages in thread From: Petr Mladek @ 2022-06-28 15:08 UTC (permalink / raw) To: Michal Hocko Cc: Tejun Heo, Lai Jiangshan, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton, Linus Torvalds, Oleg Nesterov, Eric W. Biederman

On Mon 2022-06-27 10:07:22, Michal Hocko wrote:
> On Sat 25-06-22 14:00:10, Tejun Heo wrote:
> [...]
> > tl;dr is that the worker creation code expects a newly created worker
> > kthread to sit tight until the creator finishes setting up stuff and
> > sends the initial wakeup. However, something, which wasn't identified
> > in the report (Petr, it'd be great if you can find out who did the
> > wakeup), wakes up the new kthread before the creation path is done
> > with init which causes the new kthread to try to deref a NULL pointer.
>
> One thing that hasn't been really mentioned explicitly but might be
> really important. 5.3.18-2-rt is the RT patchset backported to a 5.3
> based SLES kernel. We do not know whether the upstream RT patchset is
> affected the same way and due to the nature of the customer we cannot really
> have this tested on that kernel.
>
> I have tried to dig that out from the crash dump but no luck
> unfortunately. I was able to find the kthread pointer on a few stacks:

I tried to dig more but I am not sure if it explains the problem.

The newly created worker was woken from the "sysmon" process because it was on the wake_q_sleeper queue for its own &task->sighand->siglock. It should not cause problems if the worker was really blocked and waiting for this spin lock.

It looks to me like ptrace might in theory cause the problematic wakeup. Note that this is a SUSE 5.3 kernel with the backported RT patchset. See below for more details.

Michal found this process by searching for ffff9ca77e9a8000 on process stacks.
> sysmon is
>
> crash> bt ffff9ca75e6349c0
> PID: 842 TASK: ffff9ca75e6349c0 CPU: 0 COMMAND: "sysmon"
>  #0 [ffffb0f6808e7be8] __schedule at ffffffffaa94bef1
>  #1 [ffffb0f6808e7c78] preempt_schedule_common at ffffffffaa94c5d1
>  #2 [ffffb0f6808e7c80] ___preempt_schedule at ffffffffaa201bb6
>  #3 [ffffb0f6808e7cd8] rt_mutex_postunlock at ffffffffaa309f6b
>  #4 [ffffb0f6808e7ce8] rt_mutex_futex_unlock at ffffffffaa94ea7c
>  #5 [ffffb0f6808e7d30] rt_spin_unlock at ffffffffaa950ace
>  #6 [ffffb0f6808e7d38] proc_pid_status at ffffffffaa4c9ff4
>  #7 [ffffb0f6808e7dd8] proc_single_show at ffffffffaa4c3801
>  #8 [ffffb0f6808e7e10] seq_read at ffffffffaa47c1b8
>  #9 [ffffb0f6808e7e70] vfs_read at ffffffffaa455fd9
>  #10 [ffffb0f6808e7ea0] ksys_read at ffffffffaa456364
>
> so it is collecting /proc/<pid>/status data and releasing the sighand lock.

I found the pointer to the task struct actually twice there:

1. Direct pointer to the task struct:

The stack of the process "sysmon" is here:

 #5 [ffffb0f6808e7d30] rt_spin_unlock+0x3e at ffffffffaa950ace
    ffffb0f6808e7d38: proc_pid_status+1636
 #6 [ffffb0f6808e7d38] proc_pid_status+0x664 at ffffffffaa4c9ff4
    ffffb0f6808e7d40: init_user_ns init_pid_ns
    ffffb0f6808e7d50: 0000000000000000 0000000000000000
    ffffb0f6808e7d60: 00003f550019fca0 ffff9ca700000000
    ffffb0f6808e7d70: 0000000000000000 0000000000000000
    ffffb0f6808e7d80: 0000000000000000 0000000000000000
    ffffb0f6808e7d90: ffffffffffffffff 0000000000000000
    ffffb0f6808e7da0: 2c93096daf3bc600 [ffff9ca77e9a8000:task_struct]
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ffffb0f6808e7db0: [ffff9ca7602f6408:proc_inode_cache] [ffff9ca77e934908:seq_file]
    ffffb0f6808e7dc0: [ffff9ca71bd7ad80:pid] init_pid_ns
    ffffb0f6808e7dd0: [ffff9ca77e934908:seq_file] proc_single_show+81

So, the pointer is on an actively used and valid stack. It seems it is the *task_struct that was passed to proc_pid_status(). Therefore, proc_pid_status() is showing the status of the newly created worker that got woken prematurely.
According to the assembly, it is the code: + proc_pid_status() + task_sig() + unlock_task_sighand() + spin_unlock_irqrestore(&task->sighand->siglock) + rt_spin_unlock() 2. Indirect pointer to task struct: The process "sysmon" is interrupted in rt_mutex_postunlock(). It looks in this kernel like: void rt_mutex_postunlock(struct wake_q_head *wake_q, struct wake_q_head *wake_sleeper_q) { wake_up_q(wake_q); wake_up_q_sleeper(wake_sleeper_q); /* Pairs with preempt_disable() in rt_mutex_slowunlock() */ preempt_enable(); } Note that pointers to wake_q and wake_sleeper_q are passed as parameters. They point to the related fields in task-struct: crash> struct -x -o task_struct | grep wake_q [0x860] struct wake_q_node wake_q; [0x868] struct wake_q_node wake_q_sleeper; The relevant part of the stack of "sysmon" is: crash> bt -FFxs 842 [...] #2 [ffffb0f6808e7c80] ___preempt_schedule+0x16 at ffffffffaa201bb6 ffffb0f6808e7c88: [ffff9ca764ee90bd:kmalloc-4k] ffffb0f6808e7c20 ffffb0f6808e7c98: 0000000000027340 fair_sched_class ffffb0f6808e7ca8: 0000000000000003 0000000000000000 ffffb0f6808e7cb8: 0000000000000000 0000000000000202 ffffb0f6808e7cc8: [ffff9ca77e9a8028:task_struct] ffffb0f6808e7d08 ffffb0f6808e7cd8: rt_mutex_postunlock+43 #3 [ffffb0f6808e7cd8] rt_mutex_postunlock+0x2b at ffffffffaa309f6b ffffb0f6808e7ce0: [ffff9ca762400000:sighand_cache] rt_mutex_futex_unlock+172 #4 [ffffb0f6808e7ce8] rt_mutex_futex_unlock+0xac at ffffffffaa94ea7c ffffb0f6808e7cf0: 0000000000000246 0000000000000001 ffffb0f6808e7d00: ffffb0f6808e7cf8 [ffff9ca77e9a8868:task_struct] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ffffb0f6808e7d10: [ffff9ca77e9a8868:task_struct] 2c93096daf3bc600 ffffb0f6808e7d20: [ffff9ca77e934908:seq_file] 0000000000000000 ffffb0f6808e7d30: rt_spin_unlock+62 and if we look at the task struct via the offset to wake_q_sleeper: crash> struct task_struct -l 0x868 ffff9ca77e9a8868 | grep -e pid pid = 16213, So, the newly created kthread, PID 16213, was on the wake_q_sleeper queue related to its 
own &task->sighand->siglock.

Doubts:

Everything should be good if the kthread was really blocked and waiting for &task->sighand->siglock. Is there any other reason why the kthread could end up on the wake_q_sleeper queue even when it did not take &task->sighand->siglock directly?

+ RT-specific hacks to avoid priority inversion of processes?
+ ptrace?

Or is it possible that the following code takes task->sighand->siglock on an RT kernel?

static int kthread(void *_create)
{
[...]
	/* OK, tell user we're spawned, wait for stop or wakeup */
	__set_current_state(TASK_UNINTERRUPTIBLE);
	create->result = current;
	/*
	 * Thread is going to call schedule(), do not preempt it,
	 * or the creator may spend more time in wait_task_inactive().
	 */
	preempt_disable();
	complete(done);
	schedule_preempt_disabled();
[...]
}

Some more info:

The last running times are:

crash> ps -l
[115147050548884] [RU]  PID: 16127  TASK: ffff9ca77e9abb00  CPU: 0   COMMAND: "kworker/0:0H"
[115147050481426] [IN]  PID: 1829   TASK: ffff9ca71f581d80  CPU: 0   COMMAND: "vReqJob"
[115147050467716] [IN]  PID: 704    TASK: ffff9ca7624e0ec0  CPU: 0   COMMAND: "irq/127-eth0-Tx"
[115147050456263] [RU]  PID: 16213  TASK: ffff9ca77e9a8000  CPU: 1   COMMAND: "kworker/0:2H"
[115147050394683] [IN]  PID: 2035   TASK: ffff9ca71e252c40  CPU: 1   COMMAND: "vReqJob"
[115147050370507] [IN]  PID: 2146   TASK: ffff9ca71dcc6740  CPU: 1   COMMAND: "vReqJob"
[115147050363317] [RU]  PID: 842    TASK: ffff9ca75e6349c0  CPU: 0   COMMAND: "sysmon"
[115147050334723] [IN]  PID: 1875   TASK: ffff9ca71f6c1d80  CPU: 1   COMMAND: "vReqJob"
[115147050287774] [IN]  PID: 705    TASK: ffff9ca7624e0000  CPU: 1   COMMAND: "irq/128-eth0-Tx"
[115147050279924] [RU]  PID: 2      TASK: ffff9ca700381d80  CPU: 0   COMMAND: "kthreadd"

I am not sure what the vReqJob tasks do. The new worker was scheduled on CPU1 right after vReqJob, PID 2035. In theory, that process might have triggered the rescheduling as well.
But I do not see it: crash> bt 2035 PID: 2035 TASK: ffff9ca71e252c40 CPU: 1 COMMAND: "vReqJob" #0 [ffffb0f68300fbb0] __schedule at ffffffffaa94bef1 #1 [ffffb0f68300fc40] schedule at ffffffffaa94c4e6 #2 [ffffb0f68300fc50] futex_wait_queue_me at ffffffffaa33e2c0 #3 [ffffb0f68300fc88] futex_wait at ffffffffaa33f199 #4 [ffffb0f68300fd98] do_futex at ffffffffaa341062 #5 [ffffb0f68300fe70] __x64_sys_futex at ffffffffaa341a7e #6 [ffffb0f68300fee0] do_syscall_64 at ffffffffaa2025f0 #7 [ffffb0f68300ff50] entry_SYSCALL_64_after_hwframe at ffffffffaaa0008c RIP: 00007f7317e0487d RSP: 00007f7208777cd0 RFLAGS: 00000246 RAX: ffffffffffffffda RBX: 00007f72f00209d0 RCX: 00007f7317e0487d RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007f72f00209fc RBP: 0000000000000000 R8: 0000000000000000 R9: 00007f72f800eef0 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f72f0020998 R13: 000000001df92377 R14: 0000000000000000 R15: 00007f72f00209fc ORIG_RAX: 00000000000000ca CS: 0033 SS: 002b From the stack, I do not think that it is related: #3 [ffffb0f68300fc88] futex_wait+0xd9 at ffffffffaa33f199 ffffb0f68300fc90: 00007f72f00209fc ffffffff00000000 ffffb0f68300fca0: ffff9ca77e4eec00 0000000000000001 ffffb0f68300fcb0: 4063c3da32ce9400 ffffb0f68300fe68 ffffb0f68300fcc0: ffffb0f68300fe40 ffffb0f68300fe68 ffffb0f68300fcd0: ffffb0f68300fe40 [ffff9ca762a0ec00:sock_inode_cache] ffffb0f68300fce0: 0000000000004000 ___sys_sendmsg+153 ffffb0f68300fcf0: 0000000000000064 ffffb0f68300fcf8 ffffb0f68300fd00: ffffb0f68300fcf8 ffff9ca77e4eec38 ffffb0f68300fd10: ffffb0f682befd08 [ffff9ca71e252c40:task_struct] ffffb0f68300fd20: ffff9ca77e4eec08 [ffff9ca764b2b400:mm_struct] ffffb0f68300fd30: 00007f72f0020000 00000000000009fc ffffb0f68300fd40: 0000000000000000 0000000000000000 ffffb0f68300fd50: 0000000000000000 00000000ffffffff ffffb0f68300fd60: 4063c3da32ce9400 0000000000000000 ffffb0f68300fd70: 0000000000000000 0000000000000000 ffffb0f68300fd80: 0000000000000000 00007f72f800eef0 ffffb0f68300fd90: 
00007f72f00209fc do_futex+322 #4 [ffffb0f68300fd98] do_futex+0x142 at ffffffffaa341062 ffffb0f68300fda0: __switch_to_asm+52 __switch_to_asm+64 ffffb0f68300fdb0: __switch_to_asm+52 __switch_to_asm+64 ffffb0f68300fdc0: __switch_to_asm+52 __switch_to_asm+64 ffffb0f68300fdd0: __switch_to_asm+52 __switch_to_asm+64 ffffb0f68300fde0: audit_filter_rules.constprop.22+583 [ffff9ca7632b1e40:cred_jar] ffffb0f68300fdf0: __switch_to_asm+52 [ffff9ca71e252c40:task_struct] ffffb0f68300fe00: __switch_to_asm+52 0000000132ce9400 ffffb0f68300fe10: ffffb0f68300fe6c ffffb0f68300fe60 ffffb0f68300fe20: 0000000000004000 0000000000000000 ffffb0f68300fe30: auditd_test_task+56 4063c3da32ce9400 ffffb0f68300fe40: 0000000000000080 0000000000000000 ffffb0f68300fe50: 00007f72f00209fc 0000000000000000 ffffb0f68300fe60: 00007f72f800eef0 0000000000000000 ffffb0f68300fe70: __x64_sys_futex+94 #5 [ffffb0f68300fe70] __x64_sys_futex+0x5e at ffffffffaa341a7e ffffb0f68300fe78: 00007f72ffffffff 0000000000000000 ffffb0f68300fe88: auditd_test_task+56 [ffff9ca71e246400:kmalloc-1k] ffffb0f68300fe98: 00000000000000ca __audit_syscall_entry+235 ffffb0f68300fea8: 4063c3da32ce9400 00000000000000ca ffffb0f68300feb8: ffffb0f68300ff58 00000000000000ca ffffb0f68300fec8: 0000000000000000 0000000000000000 ffffb0f68300fed8: 0000000000000000 do_syscall_64+128 Best Regards, Petr ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] workqueue: Make create_worker() safe against spurious wakeups 2022-06-22 14:08 [PATCH] workqueue: Make create_worker() safe against spurious wakeups Petr Mladek 2022-06-23 7:00 ` Petr Mladek 2022-06-25 5:00 ` re. Spurious wakeup on a newly created kthread Tejun Heo @ 2022-08-04 8:57 ` Lai Jiangshan 2022-08-04 10:19 ` Lai Jiangshan 2 siblings, 1 reply; 52+ messages in thread From: Lai Jiangshan @ 2022-08-04 8:57 UTC (permalink / raw) To: Petr Mladek; +Cc: Tejun Heo, Michal Hocko, LKML On Wed, Jun 22, 2022 at 10:10 PM Petr Mladek <pmladek@suse.com> wrote: > > A system crashed with the following BUG() report: > > [115147.050484] BUG: kernel NULL pointer dereference, address: 0000000000000000 > [115147.050488] #PF: supervisor write access in kernel mode > [115147.050489] #PF: error_code(0x0002) - not-present page > [115147.050490] PGD 0 P4D 0 > [115147.050494] Oops: 0002 [#1] PREEMPT_RT SMP NOPTI > [115147.050498] CPU: 1 PID: 16213 Comm: kthreadd Kdump: loaded Tainted: G O X 5.3.18-2-rt #1 SLE15-SP2 (unreleased) > [115147.050510] RIP: 0010:_raw_spin_lock_irq+0x14/0x30 > [115147.050513] Code: 89 c6 e8 5f 7a 9b ff 66 90 c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 fa 65 ff 05 fb 53 6c 55 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 01 c3 89 c6 e8 2e 7a 9b ff 66 90 c3 90 90 90 90 90 > [115147.050514] RSP: 0018:ffffb0f68822fed8 EFLAGS: 00010046 > [115147.050515] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 > [115147.050516] RDX: 0000000000000001 RSI: 0000000000000002 RDI: 0000000000000000 > [115147.050517] RBP: ffff9ca73af40a40 R08: 0000000000000001 R09: 0000000000027340 > [115147.050519] R10: ffffb0f68822fe70 R11: 00000000000000a9 R12: ffffb0f688067dc0 > [115147.050520] R13: ffff9ca77e9a8000 R14: ffff9ca7634ca780 R15: ffff9ca7634ca780 > [115147.050521] FS: 0000000000000000(0000) GS:ffff9ca77fb00000(0000) knlGS:0000000000000000 > [115147.050523] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [115147.050524] CR2: 00000000000000b8 
CR3: 000000004472e000 CR4: 00000000003406e0 > [115147.050524] Call Trace: > [115147.050533] worker_thread+0xb4/0x3c0 > [115147.050538] ? process_one_work+0x4a0/0x4a0 > [115147.050540] kthread+0x152/0x170 > [115147.050542] ? kthread_park+0xa0/0xa0 > [115147.050544] ret_from_fork+0x35/0x40 > > Further debugging shown that the worker thread was woken > before worker_attach_to_pool() finished in create_worker(). > > Any kthread is supposed to stay in TASK_UNINTERRUPTIBLE sleep > until it is explicitly woken. But a spurious wakeup might > break this expectation. > > As a result, worker_thread() might read worker->pool before > it was set in worker create_worker() by worker_attach_to_pool(). > Also manage_workers() might want to create yet another worker > before worker->pool->nr_workers is updated. It is a kind off > a chicken & egg problem. > > Synchronize these operations using a completion API. > > Note that worker->pool might be then read without wq_pool_attach_mutex. > Normal worker always belongs to the same pool. > > Also note that rescuer_thread() does not need this because all > needed values are set before the kthreads is created. It is > connected with a particular workqueue. It is attached to different > pools when needed. 
>
> Signed-off-by: Petr Mladek <pmladek@suse.com>
> ---
>  kernel/workqueue.c          | 13 ++++++++++++-
>  kernel/workqueue_internal.h |  1 +
>  2 files changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 1ea50f6be843..9586b0797145 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1939,6 +1939,7 @@ static struct worker *create_worker(struct worker_pool *pool)
>  		goto fail;
>
>  	worker->id = id;
> +	init_completion(&worker->ready_to_start);
>
>  	if (pool->cpu >= 0)
>  		snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
> @@ -1961,6 +1962,7 @@ static struct worker *create_worker(struct worker_pool *pool)
>  	raw_spin_lock_irq(&pool->lock);
>  	worker->pool->nr_workers++;
>  	worker_enter_idle(worker);
> +	complete(&worker->ready_to_start);
>  	wake_up_process(worker->task);
>  	raw_spin_unlock_irq(&pool->lock);
>
> @@ -2378,10 +2380,19 @@ static void set_pf_worker(bool val)
>  static int worker_thread(void *__worker)
>  {
>  	struct worker *worker = __worker;
> -	struct worker_pool *pool = worker->pool;
> +	struct worker_pool *pool;
>
>  	/* tell the scheduler that this is a workqueue worker */
>  	set_pf_worker(true);
> +
> +	/*
> +	 * Wait until the worker is ready to get started. It must be attached
> +	 * to a pool first. This is needed because of spurious wakeups.
> +	 */
> +	while (wait_for_completion_interruptible(&worker->ready_to_start))
> +		;

There might be some wakeups from wake_up_worker() since it is in
the idle list. These wakeups will be "spurious wakeups" in the view
of the completion subsystem.
> +	pool = worker->pool;
> +
>  woke_up:
>  	raw_spin_lock_irq(&pool->lock);
>
> diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
> index e00b1204a8e9..ffca2783c8bf 100644
> --- a/kernel/workqueue_internal.h
> +++ b/kernel/workqueue_internal.h
> @@ -41,6 +41,7 @@ struct worker {
>  						/* L: for rescuers */
>  	struct list_head	node;		/* A: anchored at pool->workers */
>  						/* A: runs through worker->node */
> +	struct completion	ready_to_start;	/* None: handle spurious wakeup */
>
>  	unsigned long		last_active;	/* L: last active timestamp */
>  	unsigned int		flags;		/* X: flags */
> --
> 2.35.3
>

^ permalink raw reply	[flat|nested] 52+ messages in thread
* Re: [PATCH] workqueue: Make create_worker() safe against spurious wakeups
  2022-08-04  8:57 ` [PATCH] workqueue: Make create_worker() safe against spurious wakeups Lai Jiangshan
@ 2022-08-04 10:19 ` Lai Jiangshan
  0 siblings, 0 replies; 52+ messages in thread
From: Lai Jiangshan @ 2022-08-04 10:19 UTC (permalink / raw)
To: Petr Mladek; +Cc: Tejun Heo, Michal Hocko, LKML

On Thu, Aug 4, 2022 at 4:57 PM Lai Jiangshan <jiangshanlai@gmail.com> wrote:
>
> There might be some wakeups from wake_up_worker() since it is in
> the idle list. These wakeups will be "spurious wakeups" in the view
> of the completion subsystem.
>

Oh, sorry, I was wrong.

"complete(&worker->ready_to_start);" and "worker_enter_idle(worker);"
are in the same pool lock region. There are no "spurious wakeups"
from "wake_up_worker()" as I have wrongly described.

^ permalink raw reply	[flat|nested] 52+ messages in thread
end of thread, other threads:[~2022-08-04 10:19 UTC | newest]

Thread overview: 52+ messages
2022-06-22 14:08 [PATCH] workqueue: Make create_worker() safe against spurious wakeups Petr Mladek
2022-06-23  7:00 ` Petr Mladek
2022-06-23  7:14 ` Michal Hocko
2022-06-26  4:19 ` Hillf Danton
2022-06-25  5:00 ` re. Spurious wakeup on a newly created kthread Tejun Heo
2022-06-25 17:01 ` Linus Torvalds
2022-06-25 17:36 ` Eric W. Biederman
2022-06-25 18:25 ` Linus Torvalds
2022-06-25 18:43 ` Linus Torvalds
2022-06-25 23:28 ` Eric W. Biederman
2022-06-25 23:41 ` Eric W. Biederman
2022-06-25 23:43 ` Linus Torvalds
2022-06-25 23:48 ` Linus Torvalds
2022-06-26  0:19 ` Eric W. Biederman
2022-06-27  0:01 ` Wedson Almeida Filho
2022-06-27  7:11 ` Peter Zijlstra
2022-06-27 18:23 ` Wedson Almeida Filho
2022-06-27 18:45 ` Linus Torvalds
2022-06-26 19:14 ` [PATCH 0/3] kthread: Stop using TASK_UNINTERRUPTIBLE Eric W. Biederman
2022-06-26 19:15 ` [PATCH 1/3] kthread: Remove the flags argument from kernel_thread Eric W. Biederman
2022-06-26 21:20 ` Linus Torvalds
2022-06-26 19:16 ` [PATCH 2/3] kthread: Replace kernel_thread with new_kthread Eric W. Biederman
2022-06-26 19:16 ` [PATCH 3/3] kthread: Stop abusing TASK_UNINTERRUPTIBLE (INCOMPLETE) Eric W. Biederman
2022-06-26 19:59 ` Linus Torvalds
2022-06-26 20:23 ` Tejun Heo
2022-06-26 20:55 ` Linus Torvalds
2022-06-27  7:22 ` Peter Zijlstra
2022-06-27  8:11 ` Tejun Heo
2022-06-27 18:04 ` Wedson Almeida Filho
2022-06-27 22:06 ` Peter Zijlstra
2022-06-27 22:34 ` Linus Torvalds
2022-06-27 22:45 ` Wedson Almeida Filho
2022-06-28  0:32 ` Wedson Almeida Filho
2022-06-28  7:58 ` Peter Zijlstra
2022-06-30  0:57 ` Wedson Almeida Filho
2022-06-26 22:14 ` kernel test robot
2022-06-26 22:34 ` kernel test robot
2022-06-26  0:21 ` re. Spurious wakeup on a newly created kthread Eric W. Biederman
2022-06-28 14:16 ` Christian Brauner
2022-06-26  0:26 ` Eric W. Biederman
2022-06-26  1:58 ` Tejun Heo
2022-06-26  2:53 ` Linus Torvalds
2022-06-26  6:09 ` Tejun Heo
2022-06-27 12:04 ` Michal Hocko
2022-06-28  9:51 ` Petr Mladek
2022-06-28 10:07 ` Tejun Heo
2022-06-27  8:07 ` Michal Hocko
2022-06-27  8:21 ` Tejun Heo
2022-06-27 10:18 ` Michal Hocko
2022-06-28 15:08 ` Petr Mladek
2022-08-04  8:57 ` [PATCH] workqueue: Make create_worker() safe against spurious wakeups Lai Jiangshan
2022-08-04 10:19 ` Lai Jiangshan