* [PATCH 1/2] nvme-tcp: fix possible hang caused during ctrl deletion
@ 2022-09-28 6:23 Sagi Grimberg
From: Sagi Grimberg @ 2022-09-28 6:23 UTC
To: linux-nvme
Cc: Christoph Hellwig, Keith Busch, Chaitanya Kulkarni, Jonathan Nicklin
When we delete a controller, we execute the following:
1. nvme_stop_ctrl() - stop some work elements that may be
inflight or scheduled (specifically also .stop_ctrl
which cancels ctrl error recovery work)
2. nvme_remove_namespaces() - which first flushes scan_work
to avoid competing ns addition/removal
3. continue to tear down the controller (this ordering is sketched below)
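A simplified sketch of this ordering (illustrative only; the real code
lives in nvme_do_delete_ctrl() and the transport drivers):

	/* Sketch of the deletion flow described above, not the literal code. */
	static void delete_ctrl_sketch(struct nvme_ctrl *ctrl)
	{
		nvme_stop_ctrl(ctrl);		/* (1) also invokes ctrl->ops->stop_ctrl() */
		nvme_remove_namespaces(ctrl);	/* (2) begins with flush_work(&ctrl->scan_work) */
		/* (3) transport-specific queue and controller teardown follows */
	}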
However, if err_work was scheduled to run in (1), it is designed to
cancel any inflight I/O, particularly I/O originating from the ns
scan_work in (2); but because it is cancelled in .stop_ctrl(), we can
prevent forward progress of (2), as ns scanning blocks on I/O
that will never be cancelled.
The race is:
1. transport layer error observed -> err_work is scheduled
2. scan_work executes, discovers a ns, generates I/O to it
3. nvme_stop_ctrl() -> .stop_ctrl() -> cancel_work_sync(err_work)
- err_work never executes
4. nvme_remove_namespaces() -> flush_work(scan_work)
--> deadlock, because scan_work is blocked on I/O that err_work was
supposed to cancel, but err_work itself was cancelled before it could
execute (see stack trace [1]).
Fix this by flushing err_work instead of cancelling it, to force it
to execute and cancel all inflight I/O.
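For context, the relevant workqueue semantics (a minimal sketch, not
the driver code itself): cancel_work_sync() dequeues a pending item so
its handler may never run at all, whereas flush_work() waits for a
pending or running item to finish executing:

	/* err_work must actually run so it can cancel inflight I/O. */
	cancel_work_sync(&work);	/* pending handler is dequeued and never runs */
	flush_work(&work);		/* pending/running handler runs to completion */

connect_work, by contrast, can still be cancelled with
cancel_delayed_work_sync(), presumably because nothing blocks on a
pending reconnect attempt actually executing.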
[1]:
--
Call Trace:
<TASK>
__schedule+0x390/0x910
? scan_shadow_nodes+0x40/0x40
schedule+0x55/0xe0
io_schedule+0x16/0x40
do_read_cache_page+0x55d/0x850
? __page_cache_alloc+0x90/0x90
read_cache_page+0x12/0x20
read_part_sector+0x3f/0x110
amiga_partition+0x3d/0x3e0
? osf_partition+0x33/0x220
? put_partition+0x90/0x90
bdev_disk_changed+0x1fe/0x4d0
blkdev_get_whole+0x7b/0x90
blkdev_get_by_dev+0xda/0x2d0
device_add_disk+0x356/0x3b0
nvme_mpath_set_live+0x13c/0x1a0 [nvme_core]
? nvme_parse_ana_log+0xae/0x1a0 [nvme_core]
nvme_update_ns_ana_state+0x3a/0x40 [nvme_core]
nvme_mpath_add_disk+0x120/0x160 [nvme_core]
nvme_alloc_ns+0x594/0xa00 [nvme_core]
nvme_validate_or_alloc_ns+0xb9/0x1a0 [nvme_core]
? __nvme_submit_sync_cmd+0x1d2/0x210 [nvme_core]
nvme_scan_work+0x281/0x410 [nvme_core]
process_one_work+0x1be/0x380
worker_thread+0x37/0x3b0
? process_one_work+0x380/0x380
kthread+0x12d/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x1f/0x30
</TASK>
INFO: task nvme:6725 blocked for more than 491 seconds.
Not tainted 5.15.65-f0.el7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:nvme state:D
stack: 0 pid: 6725 ppid: 1761 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x390/0x910
? sched_clock+0x9/0x10
schedule+0x55/0xe0
schedule_timeout+0x24b/0x2e0
? try_to_wake_up+0x358/0x510
? finish_task_switch+0x88/0x2c0
wait_for_completion+0xa5/0x110
__flush_work+0x144/0x210
? worker_attach_to_pool+0xc0/0xc0
flush_work+0x10/0x20
nvme_remove_namespaces+0x41/0xf0 [nvme_core]
nvme_do_delete_ctrl+0x47/0x66 [nvme_core]
nvme_sysfs_delete.cold.96+0x8/0xd [nvme_core]
dev_attr_store+0x14/0x30
sysfs_kf_write+0x38/0x50
kernfs_fop_write_iter+0x146/0x1d0
new_sync_write+0x114/0x1b0
? intel_pmu_handle_irq+0xe0/0x420
vfs_write+0x18d/0x270
ksys_write+0x61/0xe0
__x64_sys_write+0x1a/0x20
do_syscall_64+0x37/0x90
entry_SYSCALL_64_after_hwframe+0x61/0xcb
--
Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver")
Reported-by: Jonathan Nicklin <jnicklin@blockbridge.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
drivers/nvme/host/tcp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index d5871fd6f769..2524b5304bfb 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2237,7 +2237,7 @@ static void nvme_reset_ctrl_work(struct work_struct *work)
static void nvme_tcp_stop_ctrl(struct nvme_ctrl *ctrl)
{
- cancel_work_sync(&to_tcp_ctrl(ctrl)->err_work);
+ flush_work(&to_tcp_ctrl(ctrl)->err_work);
cancel_delayed_work_sync(&to_tcp_ctrl(ctrl)->connect_work);
}
--
2.34.1
* [PATCH 2/2] nvme-rdma: fix possible hang caused during ctrl deletion
From: Sagi Grimberg @ 2022-09-28 6:23 UTC
To: linux-nvme
Cc: Christoph Hellwig, Keith Busch, Chaitanya Kulkarni, Jonathan Nicklin
When we delete a controller, we execute the following:
1. nvme_stop_ctrl() - stop some work elements that may be
inflight or scheduled (specifically also .stop_ctrl
which cancels ctrl error recovery work)
2. nvme_remove_namespaces() - which first flushes scan_work
to avoid competing ns addition/removal
3. continue to tear down the controller
However, if err_work was scheduled to run in (1), it is designed to
cancel any inflight I/O, particularly I/O originating from the ns
scan_work in (2); but because it is cancelled in .stop_ctrl(), we can
prevent forward progress of (2), as ns scanning blocks on I/O
that will never be cancelled.
The race is:
1. transport layer error observed -> err_work is scheduled
2. scan_work executes, discovers a ns, generates I/O to it
3. nvme_stop_ctrl() -> .stop_ctrl() -> cancel_work_sync(err_work)
- err_work never executes
4. nvme_remove_namespaces() -> flush_work(scan_work)
--> deadlock, because scan_work is blocked on I/O that err_work was
supposed to cancel, but err_work itself was cancelled before it could
execute.
Fix this by flushing err_work instead of cancelling it, to force it
to execute and cancel all inflight I/O.
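Conceptually, the error recovery work is what actually fails inflight
requests; a heavily simplified sketch (the real handler is
nvme_rdma_error_recovery_work(), with more steps and state checks):

	static void error_recovery_sketch(struct nvme_rdma_ctrl *ctrl)
	{
		/* tearing down the queues cancels all inflight I/O */
		nvme_rdma_teardown_io_queues(ctrl, false);
		nvme_rdma_teardown_admin_queue(ctrl, false);
		/* then reconnect or give up, depending on controller state */
	}

So cancel_work_sync() on err_work can leave that cancellation step
unexecuted forever, which is exactly the hang described above.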
Fixes: b435ecea2a4d ("nvme: Add .stop_ctrl to nvme ctrl ops")
Fixes: f6c8e432cb04 ("nvme: flush namespace scanning work just before removing namespaces")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
drivers/nvme/host/rdma.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 3100643be299..8e52d2362fa1 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1049,7 +1049,7 @@ static void nvme_rdma_stop_ctrl(struct nvme_ctrl *nctrl)
{
struct nvme_rdma_ctrl *ctrl = to_rdma_ctrl(nctrl);
- cancel_work_sync(&ctrl->err_work);
+ flush_work(&ctrl->err_work);
cancel_delayed_work_sync(&ctrl->reconnect_work);
}
--
2.34.1
* Re: [PATCH 1/2] nvme-tcp: fix possible hang caused during ctrl deletion
From: Sagi Grimberg @ 2022-09-29 18:43 UTC
To: Jonathan Nicklin
Cc: Christoph Hellwig, Keith Busch, Chaitanya Kulkarni, linux-nvme
On 9/28/22 09:23, Sagi Grimberg wrote:
> When we delete a controller, we execute the following:
> 1. nvme_stop_ctrl() - stop some work elements that may be
> inflight or scheduled (specifically also .stop_ctrl
> which cancels ctrl error recovery work)
> 2. nvme_remove_namespaces() - which first flushes scan_work
> to avoid competing ns addition/removal
> 3. continue to tear down the controller
>
> However, if err_work was scheduled to run in (1), it is designed to
> cancel any inflight I/O, particularly I/O originating from the ns
> scan_work in (2); but because it is cancelled in .stop_ctrl(), we can
> prevent forward progress of (2), as ns scanning blocks on I/O
> that will never be cancelled.
>
> The race is:
> 1. transport layer error observed -> err_work is scheduled
> 2. scan_work executes, discovers a ns, generates I/O to it
> 3. nvme_stop_ctrl() -> .stop_ctrl() -> cancel_work_sync(err_work)
> - err_work never executes
> 4. nvme_remove_namespaces() -> flush_work(scan_work)
> --> deadlock, because scan_work is blocked on I/O that err_work was
> supposed to cancel, but err_work itself was cancelled before it could
> execute (see stack trace [1]).
>
> Fix this by flushing err_work instead of cancelling it, to force it
> to execute and cancel all inflight I/O.
>
> [1]:
> --
> Call Trace:
> <TASK>
> __schedule+0x390/0x910
> ? scan_shadow_nodes+0x40/0x40
> schedule+0x55/0xe0
> io_schedule+0x16/0x40
> do_read_cache_page+0x55d/0x850
> ? __page_cache_alloc+0x90/0x90
> read_cache_page+0x12/0x20
> read_part_sector+0x3f/0x110
> amiga_partition+0x3d/0x3e0
> ? osf_partition+0x33/0x220
> ? put_partition+0x90/0x90
> bdev_disk_changed+0x1fe/0x4d0
> blkdev_get_whole+0x7b/0x90
> blkdev_get_by_dev+0xda/0x2d0
> device_add_disk+0x356/0x3b0
> nvme_mpath_set_live+0x13c/0x1a0 [nvme_core]
> ? nvme_parse_ana_log+0xae/0x1a0 [nvme_core]
> nvme_update_ns_ana_state+0x3a/0x40 [nvme_core]
> nvme_mpath_add_disk+0x120/0x160 [nvme_core]
> nvme_alloc_ns+0x594/0xa00 [nvme_core]
> nvme_validate_or_alloc_ns+0xb9/0x1a0 [nvme_core]
> ? __nvme_submit_sync_cmd+0x1d2/0x210 [nvme_core]
> nvme_scan_work+0x281/0x410 [nvme_core]
> process_one_work+0x1be/0x380
> worker_thread+0x37/0x3b0
> ? process_one_work+0x380/0x380
> kthread+0x12d/0x150
> ? set_kthread_struct+0x50/0x50
> ret_from_fork+0x1f/0x30
> </TASK>
> INFO: task nvme:6725 blocked for more than 491 seconds.
> Not tainted 5.15.65-f0.el7.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:nvme state:D
> stack: 0 pid: 6725 ppid: 1761 flags:0x00004000
> Call Trace:
> <TASK>
> __schedule+0x390/0x910
> ? sched_clock+0x9/0x10
> schedule+0x55/0xe0
> schedule_timeout+0x24b/0x2e0
> ? try_to_wake_up+0x358/0x510
> ? finish_task_switch+0x88/0x2c0
> wait_for_completion+0xa5/0x110
> __flush_work+0x144/0x210
> ? worker_attach_to_pool+0xc0/0xc0
> flush_work+0x10/0x20
> nvme_remove_namespaces+0x41/0xf0 [nvme_core]
> nvme_do_delete_ctrl+0x47/0x66 [nvme_core]
> nvme_sysfs_delete.cold.96+0x8/0xd [nvme_core]
> dev_attr_store+0x14/0x30
> sysfs_kf_write+0x38/0x50
> kernfs_fop_write_iter+0x146/0x1d0
> new_sync_write+0x114/0x1b0
> ? intel_pmu_handle_irq+0xe0/0x420
> vfs_write+0x18d/0x270
> ksys_write+0x61/0xe0
> __x64_sys_write+0x1a/0x20
> do_syscall_64+0x37/0x90
> entry_SYSCALL_64_after_hwframe+0x61/0xcb
> --
>
> Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver")
> Reported-by: Jonathan Nicklin <jnicklin@blockbridge.com>
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Jonathan, can I get your Tested-by tag on this?
* Re: [PATCH 1/2] nvme-tcp: fix possible hang caused during ctrl deletion
From: Jonathan Nicklin @ 2022-09-29 18:45 UTC
To: Sagi Grimberg
Cc: linux-nvme, Christoph Hellwig, Keith Busch, Chaitanya Kulkarni
Tested-by: Jonathan Nicklin <jnicklin@blockbridge.com>
On Wed, Sep 28, 2022 at 2:23 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>
> When we delete a controller, we execute the following:
> 1. nvme_stop_ctrl() - stop some work elements that may be
> inflight or scheduled (specifically also .stop_ctrl
> which cancels ctrl error recovery work)
> 2. nvme_remove_namespaces() - which first flushes scan_work
> to avoid competing ns addition/removal
> 3. continue to tear down the controller
>
> However, if err_work was scheduled to run in (1), it is designed to
> cancel any inflight I/O, particularly I/O originating from the ns
> scan_work in (2); but because it is cancelled in .stop_ctrl(), we can
> prevent forward progress of (2), as ns scanning blocks on I/O
> that will never be cancelled.
>
> The race is:
> 1. transport layer error observed -> err_work is scheduled
> 2. scan_work executes, discovers a ns, generates I/O to it
> 3. nvme_stop_ctrl() -> .stop_ctrl() -> cancel_work_sync(err_work)
> - err_work never executes
> 4. nvme_remove_namespaces() -> flush_work(scan_work)
> --> deadlock, because scan_work is blocked on I/O that err_work was
> supposed to cancel, but err_work itself was cancelled before it could
> execute (see stack trace [1]).
>
> Fix this by flushing err_work instead of cancelling it, to force it
> to execute and cancel all inflight I/O.
>
> [1]:
> --
> Call Trace:
> <TASK>
> __schedule+0x390/0x910
> ? scan_shadow_nodes+0x40/0x40
> schedule+0x55/0xe0
> io_schedule+0x16/0x40
> do_read_cache_page+0x55d/0x850
> ? __page_cache_alloc+0x90/0x90
> read_cache_page+0x12/0x20
> read_part_sector+0x3f/0x110
> amiga_partition+0x3d/0x3e0
> ? osf_partition+0x33/0x220
> ? put_partition+0x90/0x90
> bdev_disk_changed+0x1fe/0x4d0
> blkdev_get_whole+0x7b/0x90
> blkdev_get_by_dev+0xda/0x2d0
> device_add_disk+0x356/0x3b0
> nvme_mpath_set_live+0x13c/0x1a0 [nvme_core]
> ? nvme_parse_ana_log+0xae/0x1a0 [nvme_core]
> nvme_update_ns_ana_state+0x3a/0x40 [nvme_core]
> nvme_mpath_add_disk+0x120/0x160 [nvme_core]
> nvme_alloc_ns+0x594/0xa00 [nvme_core]
> nvme_validate_or_alloc_ns+0xb9/0x1a0 [nvme_core]
> ? __nvme_submit_sync_cmd+0x1d2/0x210 [nvme_core]
> nvme_scan_work+0x281/0x410 [nvme_core]
> process_one_work+0x1be/0x380
> worker_thread+0x37/0x3b0
> ? process_one_work+0x380/0x380
> kthread+0x12d/0x150
> ? set_kthread_struct+0x50/0x50
> ret_from_fork+0x1f/0x30
> </TASK>
> INFO: task nvme:6725 blocked for more than 491 seconds.
> Not tainted 5.15.65-f0.el7.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:nvme state:D
> stack: 0 pid: 6725 ppid: 1761 flags:0x00004000
> Call Trace:
> <TASK>
> __schedule+0x390/0x910
> ? sched_clock+0x9/0x10
> schedule+0x55/0xe0
> schedule_timeout+0x24b/0x2e0
> ? try_to_wake_up+0x358/0x510
> ? finish_task_switch+0x88/0x2c0
> wait_for_completion+0xa5/0x110
> __flush_work+0x144/0x210
> ? worker_attach_to_pool+0xc0/0xc0
> flush_work+0x10/0x20
> nvme_remove_namespaces+0x41/0xf0 [nvme_core]
> nvme_do_delete_ctrl+0x47/0x66 [nvme_core]
> nvme_sysfs_delete.cold.96+0x8/0xd [nvme_core]
> dev_attr_store+0x14/0x30
> sysfs_kf_write+0x38/0x50
> kernfs_fop_write_iter+0x146/0x1d0
> new_sync_write+0x114/0x1b0
> ? intel_pmu_handle_irq+0xe0/0x420
> vfs_write+0x18d/0x270
> ksys_write+0x61/0xe0
> __x64_sys_write+0x1a/0x20
> do_syscall_64+0x37/0x90
> entry_SYSCALL_64_after_hwframe+0x61/0xcb
> --
>
> Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver")
> Reported-by: Jonathan Nicklin <jnicklin@blockbridge.com>
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
> drivers/nvme/host/tcp.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index d5871fd6f769..2524b5304bfb 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -2237,7 +2237,7 @@ static void nvme_reset_ctrl_work(struct work_struct *work)
>
> static void nvme_tcp_stop_ctrl(struct nvme_ctrl *ctrl)
> {
> - cancel_work_sync(&to_tcp_ctrl(ctrl)->err_work);
> + flush_work(&to_tcp_ctrl(ctrl)->err_work);
> cancel_delayed_work_sync(&to_tcp_ctrl(ctrl)->connect_work);
> }
>
> --
> 2.34.1
>
* Re: [PATCH 1/2] nvme-tcp: fix possible hang caused during ctrl deletion
From: Christoph Hellwig @ 2022-10-10 8:29 UTC
To: Sagi Grimberg
Cc: linux-nvme, Christoph Hellwig, Keith Busch, Chaitanya Kulkarni,
Jonathan Nicklin
Thanks, applied to nvme-6.1.