* Deadlock on failure to read NVMe namespace
From: Hannes Reinecke @ 2021-10-19 11:22 UTC
To: Christoph Hellwig
Cc: linux-nvme, Keith Busch, Sagi Grimberg

Hi all,

during testing my test target (https://github.com/hreinecke/nofuse) I've
managed to run into a deadlock (with nvme-5.16):

c481:~ # dmesg
[102885.483587] nvme nvme0: sqsize 128 > ctrl maxcmd 32, clamping down
[102885.483711] nvme nvme0: creating 2 I/O queues.
[102885.484130] nvme nvme0: mapped 2/0/0 default/read/poll queues.
[102885.485527] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress:uuid:62f37f51-0cc7-46d5-9865-4de22e81bd9d", addr 127.0.0.1:8009
[102885.487351] nvme nvme0: request 0x0 genctr mismatch (got 0x0 expected 0x1)
[102885.487354] nvme nvme0: got bad c2hdata.command_id 0x0 on queue 2
[102885.487356] nvme nvme0: receive failed: -2
[102885.487357] nvme nvme0: starting error recovery
[102885.487444] block nvme0n1: no usable path - requeuing I/O
[102885.502051] nvme nvme0: Reconnecting in 10 seconds...
[102893.291877] nvme nvme0: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress:uuid:62f37f51-0cc7-46d5-9865-4de22e81bd9d"
[102895.535149] nvme nvme0: queue_size 128 > ctrl sqsize 32, clamping down
[102895.535214] nvme nvme0: creating 2 I/O queues.
[102895.535943] nvme nvme0: mapped 2/0/0 default/read/poll queues.
[102895.536187] nvme nvme0: Failed reconnect attempt 1
[103700.406060] INFO: task kworker/u4:0:14965 blocked for more than 491 seconds.
[103700.406068]       Tainted: G            E     5.15.0-rc3-54-default+ #862
[103700.406071] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[103700.406073] task:kworker/u4:0    state:D stack:    0 pid:14965 ppid:     2 flags:0x00004000
[103700.406117] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[103700.406143] Call Trace:
[103700.406146]  __schedule+0x302/0x13a0
[103700.406157]  ? block_read_full_page+0x21e/0x3b0
[103700.406164]  ? blkdev_direct_IO+0x4a0/0x4a0
[103700.406172]  schedule+0x3a/0xa0
[103700.406177]  io_schedule+0x12/0x40
[103700.406182]  do_read_cache_page+0x49b/0x790

481:~ # cat /proc/15761/stack
[<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core]
[<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core]
[<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core]
[<0>] nvme_sysfs_delete+0x42/0x60 [nvme_core]
[<0>] kernfs_fop_write_iter+0x12f/0x1a0
[<0>] new_sync_write+0x122/0x1b0
[<0>] vfs_write+0x1eb/0x250
[<0>] ksys_write+0xa1/0xe0
[<0>] do_syscall_64+0x3a/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae

c481:~ # cat /proc/14965/stack
[<0>] do_read_cache_page+0x49b/0x790
[<0>] read_part_sector+0x39/0xe0
[<0>] read_lba+0xf9/0x1d0
[<0>] efi_partition+0xf1/0x7f0
[<0>] bdev_disk_changed+0x1ee/0x550
[<0>] blkdev_get_whole+0x81/0x90
[<0>] blkdev_get_by_dev+0x128/0x2e0
[<0>] device_add_disk+0x377/0x3c0
[<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
[<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
[<0>] nvme_alloc_ns+0x417/0x950 [nvme_core]
[<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
[<0>] nvme_scan_work+0x168/0x310 [nvme_core]
[<0>] process_one_work+0x231/0x420
[<0>] worker_thread+0x2d/0x3f0
[<0>] kthread+0x11a/0x140
[<0>] ret_from_fork+0x22/0x30

My theory here is that the partition scanning code just calls into the
pagecache, which doesn't set a timeout for any I/O operation.
As this is done under the scan_mutex we cannot clear the active paths,
and consequently we hang.

Cheers,

Hannes
--
Dr. Hannes Reinecke                    Kernel Storage Architect
hare@suse.de                                  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
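
A condensed sketch of how the two stacks above interlock, assuming the
5.15-era call paths named in the traces (paraphrased call chains, not
actual kernel code):

/*
 * Task A: nvme_scan_work (kworker, pid 14965)
 *
 *   nvme_scan_work()
 *     mutex_lock(&ctrl->scan_lock)          // holds scan_mutex
 *     nvme_validate_or_alloc_ns()
 *       nvme_alloc_ns()
 *         nvme_mpath_add_disk()
 *           nvme_mpath_set_live()
 *             device_add_disk()             // opens the disk, scans partitions
 *               efi_partition()
 *                 read_part_sector()        // pagecache read with no timeout;
 *                                           // blocks until the I/O completes
 *
 * Task B: nvme disconnect (pid 15761)
 *
 *   nvme_do_delete_ctrl()
 *     nvme_remove_namespaces()
 *       nvme_mpath_clear_ctrl_paths()
 *         mutex_lock(&ctrl->scan_lock)      // waits for task A
 *
 * Task A's read can only finish once its I/O completes or is failed over,
 * but the path handling task B wants to do first needs scan_lock, which
 * task A holds for the whole partition scan.
 */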
* Re: Deadlock on failure to read NVMe namespace
From: Sagi Grimberg @ 2021-10-19 12:31 UTC
To: Hannes Reinecke, Christoph Hellwig
Cc: linux-nvme, Keith Busch

> Hi all,
>
> during testing my test target (https://github.com/hreinecke/nofuse) I've
> managed to run into a deadlock (with nvme-5.16):
>
> c481:~ # dmesg
> [102885.483587] nvme nvme0: sqsize 128 > ctrl maxcmd 32, clamping down
> [102885.483711] nvme nvme0: creating 2 I/O queues.
> [102885.484130] nvme nvme0: mapped 2/0/0 default/read/poll queues.
> [102885.485527] nvme nvme0: new ctrl: NQN
> "nqn.2014-08.org.nvmexpress:uuid:62f37f51-0cc7-46d5-9865-4de22e81bd9d",
> addr 127.0.0.1:8009
> [102885.487351] nvme nvme0: request 0x0 genctr mismatch (got 0x0
> expected 0x1)
> [102885.487354] nvme nvme0: got bad c2hdata.command_id 0x0 on queue 2

Nice! The genctr is catching bugs already...

> [102885.487356] nvme nvme0: receive failed: -2
> [102885.487357] nvme nvme0: starting error recovery
> [102885.487444] block nvme0n1: no usable path - requeuing I/O
> [102885.502051] nvme nvme0: Reconnecting in 10 seconds...
> [102893.291877] nvme nvme0: Removing ctrl: NQN

Did you trigger this removal? It's 2 seconds before the reconnect
attempt was supposed to run.

> "nqn.2014-08.org.nvmexpress:uuid:62f37f51-0cc7-46d5-9865-4de22e81bd9d"
> [102895.535149] nvme nvme0: queue_size 128 > ctrl sqsize 32, clamping down
> [102895.535214] nvme nvme0: creating 2 I/O queues.
> [102895.535943] nvme nvme0: mapped 2/0/0 default/read/poll queues.
> [102895.536187] nvme nvme0: Failed reconnect attempt 1
[...]
> My theory here is that the partition scanning code just calls into the
> pagecache, which doesn't set a timeout for any I/O operation.
> As this is done under the scan_mutex we cannot clear the active paths,
> and consequently we hang.

But the controller removal should have cancelled all inflight
commands...

Maybe we're missing unfreeze? Hannes, can you try this one?
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index e29c47114739..783fde36d2ba 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1974,8 +1974,11 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
 	nvme_sync_io_queues(ctrl);
 	nvme_tcp_stop_io_queues(ctrl);
 	nvme_cancel_tagset(ctrl);
-	if (remove)
+	if (remove) {
 		nvme_start_queues(ctrl);
+		nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT);
+		nvme_unfreeze(ctrl);
+	}
 	nvme_tcp_destroy_io_queues(ctrl, remove);
 }
--
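
For context, the three helpers used in the hunk above iterate the
controller's namespaces and operate on each per-namespace request queue;
roughly (paraphrasing the 5.15-era drivers/nvme/host/core.c, details may
differ):

void nvme_start_queues(struct nvme_ctrl *ctrl)
{
	struct nvme_ns *ns;

	down_read(&ctrl->namespaces_rwsem);
	list_for_each_entry(ns, &ctrl->namespaces, list)
		blk_mq_unquiesce_queue(ns->queue);	/* let queued requests dispatch again */
	up_read(&ctrl->namespaces_rwsem);
}

int nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout)
{
	struct nvme_ns *ns;

	down_read(&ctrl->namespaces_rwsem);
	list_for_each_entry(ns, &ctrl->namespaces, list) {
		/* bounded wait for each queue's usage counter to drain */
		timeout = blk_mq_freeze_queue_wait_timeout(ns->queue, timeout);
		if (timeout <= 0)
			break;
	}
	up_read(&ctrl->namespaces_rwsem);
	return timeout;
}

void nvme_unfreeze(struct nvme_ctrl *ctrl)
{
	struct nvme_ns *ns;

	down_read(&ctrl->namespaces_rwsem);
	list_for_each_entry(ns, &ctrl->namespaces, list)
		blk_mq_unfreeze_queue(ns->queue);	/* drop the freeze reference */
	up_read(&ctrl->namespaces_rwsem);
}

Note that, as sketched, these act on the per-namespace request queues,
not on bios parked on the multipath node's requeue list.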
* Re: Deadlock on failure to read NVMe namespace
From: Hannes Reinecke @ 2021-10-19 12:40 UTC
To: Sagi Grimberg, Christoph Hellwig
Cc: linux-nvme, Keith Busch

On 10/19/21 2:31 PM, Sagi Grimberg wrote:
[...]
>> My theory here is that the partition scanning code just calls into the
>> pagecache, which doesn't set a timeout for any I/O operation.
>> As this is done under the scan_mutex we cannot clear the active paths,
>> and consequently we hang.
>
> But the controller removal should have cancelled all inflight
> commands...
>
> Maybe we're missing unfreeze? Hannes, can you try this one?
> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index e29c47114739..783fde36d2ba 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1974,8 +1974,11 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
>  	nvme_sync_io_queues(ctrl);
>  	nvme_tcp_stop_io_queues(ctrl);
>  	nvme_cancel_tagset(ctrl);
> -	if (remove)
> +	if (remove) {
>  		nvme_start_queues(ctrl);
> +		nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT);
> +		nvme_unfreeze(ctrl);
> +	}
>  	nvme_tcp_destroy_io_queues(ctrl, remove);
>  }
> --

Nope. Same problem.

I managed to make the problem go away (for some definitions of 'go
away') with this patch:

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index fb96e900dd3a..30d1154eb611 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -141,12 +141,12 @@ void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;

-	mutex_lock(&ctrl->scan_lock);
 	down_read(&ctrl->namespaces_rwsem);
 	list_for_each_entry(ns, &ctrl->namespaces, list)
 		if (nvme_mpath_clear_current_path(ns))
 			kblockd_schedule_work(&ns->head->requeue_work);
 	up_read(&ctrl->namespaces_rwsem);
+	mutex_lock(&ctrl->scan_lock);
 	mutex_unlock(&ctrl->scan_lock);
 }

But I'd be the first to agree that this really is hackish.

Cheers,

Hannes
--
Dr. Hannes Reinecke                    Kernel Storage Architect
hare@suse.de                                  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
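
The mutex_lock()/mutex_unlock() pair with an empty critical section at
the end of the patched function above is a "wait for the current holder"
barrier; a minimal illustration of the idiom (illustrative only, with a
hypothetical helper name, not driver code):

/*
 * Taking and immediately releasing the mutex only waits for whoever
 * currently holds it (here: a scan_work in progress) to drop it; it
 * does not stop a new scan from starting right afterwards, nor does it
 * order anything beyond that single hand-off.
 */
static void wait_for_current_scan(struct nvme_ctrl *ctrl)
{
	mutex_lock(&ctrl->scan_lock);
	mutex_unlock(&ctrl->scan_lock);
}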
* Re: Deadlock on failure to read NVMe namespace
From: Sagi Grimberg @ 2021-10-19 14:27 UTC
To: Hannes Reinecke, Christoph Hellwig
Cc: linux-nvme, Keith Busch, Anton Eidelman

[...]
>>> My theory here is that the partition scanning code just calls into
>>> the pagecache, which doesn't set a timeout for any I/O operation.
>>> As this is done under the scan_mutex we cannot clear the active
>>> paths, and consequently we hang.
>>
>> But the controller removal should have cancelled all inflight
>> commands...
>>
>> Maybe we're missing unfreeze? Hannes, can you try this one?
>> [...]
> Nope. Same problem.
>
> I managed to make the problem go away (for some definitions of 'go
> away') with this patch:
>
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index fb96e900dd3a..30d1154eb611 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -141,12 +141,12 @@ void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl)
>  {
>  	struct nvme_ns *ns;
>
> -	mutex_lock(&ctrl->scan_lock);
>  	down_read(&ctrl->namespaces_rwsem);
>  	list_for_each_entry(ns, &ctrl->namespaces, list)
>  		if (nvme_mpath_clear_current_path(ns))
>  			kblockd_schedule_work(&ns->head->requeue_work);
>  	up_read(&ctrl->namespaces_rwsem);
> +	mutex_lock(&ctrl->scan_lock);
>  	mutex_unlock(&ctrl->scan_lock);
>  }
>
> But I'd be the first to agree that this really is hackish.

Yea, that doesn't solve anything.

I think this sequence is familiar and was addressed by a fix from Anton
(CC'd) which still has some pending review comments.

Can you lookup and try:
[PATCH] nvme/mpath: fix hang when disk goes live over reconnect
* Re: Deadlock on failure to read NVMe namespace
From: Hannes Reinecke @ 2021-10-19 15:04 UTC
To: Sagi Grimberg, Christoph Hellwig
Cc: linux-nvme, Keith Busch, Anton Eidelman

On 10/19/21 4:27 PM, Sagi Grimberg wrote:
[...]
> Yea, that doesn't solve anything.
>
> I think this sequence is familiar and was addressed by a fix from Anton
> (CC'd) which still has some pending review comments.
>
> Can you lookup and try:
> [PATCH] nvme/mpath: fix hang when disk goes live over reconnect

I actually had looked at it, but then decided it's (trying to) address
a different issue.

Thing is, I see it getting stuck even _before_ disconnect happens;
we're stuck with the _initial_ scan:

[ 3920.880552]  device_add_disk+0x377/0x3c0
[ 3920.880558]  nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
[ 3920.880574]  nvme_mpath_add_disk+0x150/0x160 [nvme_core]
[ 3920.880588]  ? device_add_disk+0x27d/0x3c0
[ 3920.880593]  nvme_alloc_ns+0x417/0x950 [nvme_core]
[ 3920.880606]  nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
[ 3920.880618]  ? __nvme_submit_sync_cmd+0x19b/0x210 [nvme_core]
[ 3920.880631]  nvme_scan_work+0x168/0x310 [nvme_core]

and it can't make progress, as the I/O is _NOT_ failed but rather causes
a reconnect:

[ 1364.442390] nvme nvme0: request 0x0 genctr mismatch (got 0x0 expected 0x2)
[ 1364.442408] nvme nvme0: got bad c2hdata.command_id 0x0 on queue 2
[ 1364.442411] nvme nvme0: receive failed: -2
[ 1364.442414] nvme nvme0: starting error recovery
[ 1364.442502] block nvme0n1: no usable path - requeuing I/O

so we don't have a usable path and will requeue the I/O.
But upon reconnect we will retry _the same_ I/O and get into the same
state, so the I/O will be stuck until disconnect happens.
But disconnect does

nvme_do_delete_ctrl()
  -> nvme_remove_namespaces()
    -> nvme_mpath_clear_ctrl_paths()
      -> mutex_lock(&scan_mutex)

and hangs.

Everything would be solved if we had _aborted_ the invalid I/O instead
of forcing a reconnect; aborting is, incidentally, a far better way to
handle TCP PDU sequence errors than the current way of dropping the
connection. Dropping the connection would only help if this were a
transport artifact, but I've yet to see a fabric which eats individual
bits of a frame; a programming error is far more likely, in which case
an I/O error would be a far better response.

But I guess I'll need to raise it at the fmds call...

Cheers,

Hannes
--
Dr. Hannes Reinecke                    Kernel Storage Architect
hare@suse.de                                  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
* Re: Deadlock on failure to read NVMe namespace
From: Sagi Grimberg @ 2021-10-19 15:09 UTC
To: Hannes Reinecke, Christoph Hellwig
Cc: linux-nvme, Keith Busch, Anton Eidelman

On 10/19/21 6:04 PM, Hannes Reinecke wrote:
[...]
> so we don't have a usable path and will requeue the I/O.
> But upon reconnect we will retry _the same_ I/O and get into the same
> state, so the I/O will be stuck until disconnect happens.
> But disconnect does
>
> nvme_do_delete_ctrl()
>   -> nvme_remove_namespaces()
>     -> nvme_mpath_clear_ctrl_paths()
>       -> mutex_lock(&scan_mutex)
>
> and hangs.
>
> Everything would be solved if we had _aborted_ the invalid I/O instead
> of forcing a reconnect;

But you don't know which I/O to abort, you got a bad command_id.

> aborting is, incidentally, a far better way to handle TCP PDU sequence
> errors than the current way of dropping the connection.

This is not a transport error. A digest error would be a transport
error; this is a controller logic error.

> Dropping the connection would only help if this were a transport
> artifact, but I've yet to see a fabric which eats individual bits of a
> frame; a programming error is far more likely, in which case an I/O
> error would be a far better response.

Again, which I/O?
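
For reference, the "genctr mismatch" check that fires here comes from the
command_id encoding added in the 5.15 cycle; roughly (paraphrased from
drivers/nvme/host/nvme.h, bit layout from memory):

/*
 * command_id layout:  | 4-bit genctr | 12-bit blk-mq tag |
 *
 * The generation counter is bumped each time a tag is reused, so a
 * completion carrying a stale or garbage command_id fails the genctr
 * check instead of being matched to whatever request currently owns
 * that tag.  That is also why a bad command_id cannot simply be mapped
 * back to "the" request to abort: the host has no trustworthy way to
 * tell which request, if any, the bogus completion was meant for.
 */
#define nvme_genctr_mask(gen)		((gen) & 0xf)
#define nvme_cid_install_genctr(gen)	(nvme_genctr_mask(gen) << 12)
#define nvme_genctr_from_cid(cid)	(((cid) & 0xf000) >> 12)
#define nvme_tag_from_cid(cid)		((cid) & 0xfff)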
* Re: Deadlock on failure to read NVMe namespace
From: Hannes Reinecke @ 2021-10-19 15:49 UTC
To: Sagi Grimberg, Christoph Hellwig
Cc: linux-nvme, Keith Busch, Anton Eidelman

On 10/19/21 5:09 PM, Sagi Grimberg wrote:
[...]
>> Everything would be solved if we had _aborted_ the invalid I/O instead
>> of forcing a reconnect;
>
> But you don't know which I/O to abort, you got a bad command_id.
>
Yeah, you are right. I stand corrected.

Cheers,

Hannes
--
Dr. Hannes Reinecke                    Kernel Storage Architect
hare@suse.de                                  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
* Re: Deadlock on failure to read NVMe namespace
From: Sagi Grimberg @ 2021-10-19 15:06 UTC
To: Hannes Reinecke, Christoph Hellwig
Cc: linux-nvme, Keith Busch, Anton Eidelman

On 10/19/21 5:27 PM, Sagi Grimberg wrote:
[...]
>>>> c481:~ # cat /proc/14965/stack
>>>> [<0>] do_read_cache_page+0x49b/0x790
>>>> [<0>] read_part_sector+0x39/0xe0
>>>> [<0>] read_lba+0xf9/0x1d0
>>>> [<0>] efi_partition+0xf1/0x7f0
>>>> [<0>] bdev_disk_changed+0x1ee/0x550
>>>> [<0>] blkdev_get_whole+0x81/0x90
>>>> [<0>] blkdev_get_by_dev+0x128/0x2e0
>>>> [<0>] device_add_disk+0x377/0x3c0
>>>> [<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
>>>> [<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
>>>> [<0>] nvme_alloc_ns+0x417/0x950 [nvme_core]
>>>> [<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
>>>> [<0>] nvme_scan_work+0x168/0x310 [nvme_core]
...
> I think this sequence is familiar and was addressed by a fix from Anton
> (CC'd) which still has some pending review comments.
>
> Can you lookup and try:
> [PATCH] nvme/mpath: fix hang when disk goes live over reconnect

Actually, I see the trace is coming from nvme_alloc_ns, not the ANA
update path, so that is unlikely to address the issue.

Looking at nvme_mpath_clear_ctrl_paths, I don't think it should take
the scan_lock anymore. IIRC the reason why it needed the scan_lock in
the first place was the fact that ctrl->namespaces was added to and
then sorted in scan_work (taking the namespaces_rwsem twice).

But now that ctrl->namespaces is always sorted, and accessed with
namespaces_rwsem, I think that scan_lock is no longer needed here and
namespaces_rwsem is sufficient.

Thoughts?
* Re: Deadlock on failure to read NVMe namespace
From: Hannes Reinecke @ 2021-10-19 15:41 UTC
To: Sagi Grimberg, Christoph Hellwig
Cc: linux-nvme, Keith Busch, Anton Eidelman

On 10/19/21 5:06 PM, Sagi Grimberg wrote:
[...]
> Looking at nvme_mpath_clear_ctrl_paths, I don't think it should take
> the scan_lock anymore. IIRC the reason why it needed the scan_lock in
> the first place was the fact that ctrl->namespaces was added to and
> then sorted in scan_work (taking the namespaces_rwsem twice).
>
> But now that ctrl->namespaces is always sorted, and accessed with
> namespaces_rwsem, I think that scan_lock is no longer needed here and
> namespaces_rwsem is sufficient.
>
... which was precisely what my initial patch did.
While it worked in the sense that the 'nvme disconnect' completed, we
did not terminate the outstanding I/O, as no current path is set and
hence this:

	down_read(&ctrl->namespaces_rwsem);
	list_for_each_entry(ns, &ctrl->namespaces, list)
		if (nvme_mpath_clear_current_path(ns))
			kblockd_schedule_work(&ns->head->requeue_work);
	up_read(&ctrl->namespaces_rwsem);

doesn't do anything; in particular it does _not_ flush the requeue work.

Cheers,

Hannes
--
Dr. Hannes Reinecke                    Kernel Storage Architect
hare@suse.de                                  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
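
For context, the ns_head requeue_work being kicked here does little more
than resubmit the parked bios; roughly (paraphrasing the 5.15-era
drivers/nvme/host/multipath.c, not verbatim):

static void nvme_requeue_work(struct work_struct *work)
{
	struct nvme_ns_head *head =
		container_of(work, struct nvme_ns_head, requeue_work);
	struct bio *bio, *next;

	spin_lock_irq(&head->requeue_lock);
	next = bio_list_get(&head->requeue_list);
	spin_unlock_irq(&head->requeue_lock);

	while ((bio = next) != NULL) {
		next = bio->bi_next;
		bio->bi_next = NULL;
		/*
		 * Resubmit through the multipath node; without a usable
		 * path the bio just ends up back on the requeue list.
		 */
		submit_bio_noacct(bio);
	}
}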
* Re: Deadlock on failure to read NVMe namespace
From: Sagi Grimberg @ 2021-10-19 20:13 UTC
To: Hannes Reinecke, Christoph Hellwig
Cc: linux-nvme, Keith Busch, Anton Eidelman

[...]
>> But now that ctrl->namespaces is always sorted, and accessed with
>> namespaces_rwsem, I think that scan_lock is no longer needed here and
>> namespaces_rwsem is sufficient.
>>
> ... which was precisely what my initial patch did.
> While it worked in the sense that the 'nvme disconnect' completed, we
> did not terminate the outstanding I/O, as no current path is set and
> hence this:
>
> 	down_read(&ctrl->namespaces_rwsem);
> 	list_for_each_entry(ns, &ctrl->namespaces, list)
> 		if (nvme_mpath_clear_current_path(ns))
> 			kblockd_schedule_work(&ns->head->requeue_work);
> 	up_read(&ctrl->namespaces_rwsem);
>
> doesn't do anything; in particular it does _not_ flush the requeue work.

So you get I/O hung? Why does it need to flush it?
End of thread (newest message: 2021-10-19 20:13 UTC)

Thread overview: 10+ messages
2021-10-19 11:22 Deadlock on failure to read NVMe namespace -- Hannes Reinecke
2021-10-19 12:31 ` Sagi Grimberg
2021-10-19 12:40   ` Hannes Reinecke
2021-10-19 14:27     ` Sagi Grimberg
2021-10-19 15:04       ` Hannes Reinecke
2021-10-19 15:09         ` Sagi Grimberg
2021-10-19 15:49           ` Hannes Reinecke
2021-10-19 15:06       ` Sagi Grimberg
2021-10-19 15:41         ` Hannes Reinecke
2021-10-19 20:13           ` Sagi Grimberg