* Deadlock on failure to read NVMe namespace
@ 2021-10-19 11:22 Hannes Reinecke
2021-10-19 12:31 ` Sagi Grimberg
0 siblings, 1 reply; 10+ messages in thread
From: Hannes Reinecke @ 2021-10-19 11:22 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-nvme, Keith Busch, Sagi Grimberg
Hi all,
during testing my test target (https://github.com/hreinecke/nofuse) I've
managed to run into a deadlock (with nvme-5.16):
c481:~ # dmesg
[102885.483587] nvme nvme0: sqsize 128 > ctrl maxcmd 32, clamping down
[102885.483711] nvme nvme0: creating 2 I/O queues.
[102885.484130] nvme nvme0: mapped 2/0/0 default/read/poll queues.
[102885.485527] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress:uuid:62f37f51-0cc7-46d5-9865-4de22e81bd9d", addr 127.0.0.1:8009
[102885.487351] nvme nvme0: request 0x0 genctr mismatch (got 0x0 expected 0x1)
[102885.487354] nvme nvme0: got bad c2hdata.command_id 0x0 on queue 2
[102885.487356] nvme nvme0: receive failed: -2
[102885.487357] nvme nvme0: starting error recovery
[102885.487444] block nvme0n1: no usable path - requeuing I/O
[102885.502051] nvme nvme0: Reconnecting in 10 seconds...
[102893.291877] nvme nvme0: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress:uuid:62f37f51-0cc7-46d5-9865-4de22e81bd9d"
[102895.535149] nvme nvme0: queue_size 128 > ctrl sqsize 32, clamping down
[102895.535214] nvme nvme0: creating 2 I/O queues.
[102895.535943] nvme nvme0: mapped 2/0/0 default/read/poll queues.
[102895.536187] nvme nvme0: Failed reconnect attempt 1
[103700.406060] INFO: task kworker/u4:0:14965 blocked for more than 491 seconds.
[103700.406068] Tainted: G E 5.15.0-rc3-54-default+ #862
[103700.406071] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[103700.406073] task:kworker/u4:0 state:D stack: 0 pid:14965 ppid: 2 flags:0x00004000
[103700.406117] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[103700.406143] Call Trace:
[103700.406146] __schedule+0x302/0x13a0
[103700.406157] ? block_read_full_page+0x21e/0x3b0
[103700.406164] ? blkdev_direct_IO+0x4a0/0x4a0
[103700.406172] schedule+0x3a/0xa0
[103700.406177] io_schedule+0x12/0x40
[103700.406182] do_read_cache_page+0x49b/0x790
c481:~ # cat /proc/15761/stack
[<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core]
[<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core]
[<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core]
[<0>] nvme_sysfs_delete+0x42/0x60 [nvme_core]
[<0>] kernfs_fop_write_iter+0x12f/0x1a0
[<0>] new_sync_write+0x122/0x1b0
[<0>] vfs_write+0x1eb/0x250
[<0>] ksys_write+0xa1/0xe0
[<0>] do_syscall_64+0x3a/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
c481:~ # cat /proc/14965/stack
[<0>] do_read_cache_page+0x49b/0x790
[<0>] read_part_sector+0x39/0xe0
[<0>] read_lba+0xf9/0x1d0
[<0>] efi_partition+0xf1/0x7f0
[<0>] bdev_disk_changed+0x1ee/0x550
[<0>] blkdev_get_whole+0x81/0x90
[<0>] blkdev_get_by_dev+0x128/0x2e0
[<0>] device_add_disk+0x377/0x3c0
[<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
[<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
[<0>] nvme_alloc_ns+0x417/0x950 [nvme_core]
[<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
[<0>] nvme_scan_work+0x168/0x310 [nvme_core]
[<0>] process_one_work+0x231/0x420
[<0>] worker_thread+0x2d/0x3f0
[<0>] kthread+0x11a/0x140
[<0>] ret_from_fork+0x22/0x30
My theory here is that the partition scanning code just calls into the
pagecache, which doesn't set a timeout for any I/O operation.
As this is done under the scan_mutex we cannot clear the active paths,
and consequently we hang.
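For what it's worth, the lock dependency can be modeled in user space. A minimal pthreads sketch follows; every name here is illustrative only (this is not the kernel code, just the shape of the hang): scan work holds scan_mutex across an unbounded page-cache read, and the delete path then blocks forever trying to take the same mutex.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of the hang; names mirror but do not reproduce
 * the kernel symbols. */
static pthread_mutex_t scan_mutex = PTHREAD_MUTEX_INITIALIZER;
static bool io_will_complete; /* false while no usable path exists */

/* nvme_scan_work analogue: takes scan_mutex, then waits on I/O with
 * no timeout (do_read_cache_page never returns); the mutex is never
 * released while the read is outstanding. */
static bool scan_work_blocks_forever(void)
{
    pthread_mutex_lock(&scan_mutex);
    bool stuck = !io_will_complete;
    if (!stuck)
        pthread_mutex_unlock(&scan_mutex);
    return stuck;
}

/* nvme_mpath_clear_ctrl_paths analogue: must take scan_mutex first,
 * so it can never run while the scan is stuck -- deadlock. */
static bool delete_ctrl_would_hang(void)
{
    if (pthread_mutex_trylock(&scan_mutex) != 0)
        return true; /* scan work still owns the mutex */
    pthread_mutex_unlock(&scan_mutex);
    return false;
}
```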
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Deadlock on failure to read NVMe namespace
2021-10-19 11:22 Deadlock on failure to read NVMe namespace Hannes Reinecke
@ 2021-10-19 12:31 ` Sagi Grimberg
2021-10-19 12:40 ` Hannes Reinecke
0 siblings, 1 reply; 10+ messages in thread
From: Sagi Grimberg @ 2021-10-19 12:31 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig; +Cc: linux-nvme, Keith Busch
> Hi all,
>
> during testing my test target (https://github.com/hreinecke/nofuse) I've
> managed to run into a deadlock (with nvme-5.16):
>
> c481:~ # dmesg
> [102885.483587] nvme nvme0: sqsize 128 > ctrl maxcmd 32, clamping down
> [102885.483711] nvme nvme0: creating 2 I/O queues.
> [102885.484130] nvme nvme0: mapped 2/0/0 default/read/poll queues.
> [102885.485527] nvme nvme0: new ctrl: NQN
> "nqn.2014-08.org.nvmexpress:uuid:62f37f51-0cc7-46d5-9865-4de22e81bd9d",
> addr 127.0.0.1:8009
> [102885.487351] nvme nvme0: request 0x0 genctr mismatch (got 0x0
> expected 0x1)
> [102885.487354] nvme nvme0: got bad c2hdata.command_id 0x0 on queue 2
Nice! The genctr is catching bugs already...
> [102885.487356] nvme nvme0: receive failed: -2
> [102885.487357] nvme nvme0: starting error recovery
> [102885.487444] block nvme0n1: no usable path - requeuing I/O
> [102885.502051] nvme nvme0: Reconnecting in 10 seconds...
> [102893.291877] nvme nvme0: Removing ctrl: NQN
Did you trigger this removal? It's 2 seconds before
the reconnect attempt was supposed to run.
> "nqn.2014-08.org.nvmexpress:uuid:62f37f51-0cc7-46d5-9865-4de22e81bd9d"
> [102895.535149] nvme nvme0: queue_size 128 > ctrl sqsize 32, clamping down
> [102895.535214] nvme nvme0: creating 2 I/O queues.
> [102895.535943] nvme nvme0: mapped 2/0/0 default/read/poll queues.
> [102895.536187] nvme nvme0: Failed reconnect attempt 1
> [103700.406060] INFO: task kworker/u4:0:14965 blocked for more than 491
> seconds.
> [103700.406068] Tainted: G E 5.15.0-rc3-54-default+
> #862
> [103700.406071] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [103700.406073] task:kworker/u4:0 state:D stack: 0 pid:14965 ppid:
> 2 flags:0x00004000
> [103700.406117] Workqueue: nvme-wq nvme_scan_work [nvme_core]
> [103700.406143] Call Trace:
> [103700.406146] __schedule+0x302/0x13a0
> [103700.406157] ? block_read_full_page+0x21e/0x3b0
> [103700.406164] ? blkdev_direct_IO+0x4a0/0x4a0
> [103700.406172] schedule+0x3a/0xa0
> [103700.406177] io_schedule+0x12/0x40
> [103700.406182] do_read_cache_page+0x49b/0x790
>
> 481:~ # cat /proc/15761/stack
> [<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core]
> [<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core]
> [<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core]
> [<0>] nvme_sysfs_delete+0x42/0x60 [nvme_core]
> [<0>] kernfs_fop_write_iter+0x12f/0x1a0
> [<0>] new_sync_write+0x122/0x1b0
> [<0>] vfs_write+0x1eb/0x250
> [<0>] ksys_write+0xa1/0xe0
> [<0>] do_syscall_64+0x3a/0x80
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
> c481:~ # cat /proc/14965/stack
> [<0>] do_read_cache_page+0x49b/0x790
> [<0>] read_part_sector+0x39/0xe0
> [<0>] read_lba+0xf9/0x1d0
> [<0>] efi_partition+0xf1/0x7f0
> [<0>] bdev_disk_changed+0x1ee/0x550
> [<0>] blkdev_get_whole+0x81/0x90
> [<0>] blkdev_get_by_dev+0x128/0x2e0
> [<0>] device_add_disk+0x377/0x3c0
> [<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
> [<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
> [<0>] nvme_alloc_ns+0x417/0x950 [nvme_core]
> [<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
> [<0>] nvme_scan_work+0x168/0x310 [nvme_core]
> [<0>] process_one_work+0x231/0x420
> [<0>] worker_thread+0x2d/0x3f0
> [<0>] kthread+0x11a/0x140
> [<0>] ret_from_fork+0x22/0x30
>
> My theory here is that the partition scanning code just calls into the
> pagecache, which doesn't set a timeout for any I/O operation.
> As this is done under the scan_mutex we cannot clear the active paths,
> and consequently we hang.
But the controller removal should have cancelled all inflight
commands...
Maybe we're missing unfreeze? Hannes, can you try this one?
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index e29c47114739..783fde36d2ba 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1974,8 +1974,11 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
nvme_sync_io_queues(ctrl);
nvme_tcp_stop_io_queues(ctrl);
nvme_cancel_tagset(ctrl);
- if (remove)
+ if (remove) {
nvme_start_queues(ctrl);
+ nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT);
+ nvme_unfreeze(ctrl);
+ }
nvme_tcp_destroy_io_queues(ctrl, remove);
}
--
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: Deadlock on failure to read NVMe namespace
2021-10-19 12:31 ` Sagi Grimberg
@ 2021-10-19 12:40 ` Hannes Reinecke
2021-10-19 14:27 ` Sagi Grimberg
0 siblings, 1 reply; 10+ messages in thread
From: Hannes Reinecke @ 2021-10-19 12:40 UTC (permalink / raw)
To: Sagi Grimberg, Christoph Hellwig; +Cc: linux-nvme, Keith Busch
On 10/19/21 2:31 PM, Sagi Grimberg wrote:
>
>> Hi all,
>>
>> during testing my test target (https://github.com/hreinecke/nofuse)
>> I've managed to run into a deadlock (with nvme-5.16):
>>
>> c481:~ # dmesg
>> [102885.483587] nvme nvme0: sqsize 128 > ctrl maxcmd 32, clamping down
>> [102885.483711] nvme nvme0: creating 2 I/O queues.
>> [102885.484130] nvme nvme0: mapped 2/0/0 default/read/poll queues.
>> [102885.485527] nvme nvme0: new ctrl: NQN
>> "nqn.2014-08.org.nvmexpress:uuid:62f37f51-0cc7-46d5-9865-4de22e81bd9d", addr
>> 127.0.0.1:8009
>> [102885.487351] nvme nvme0: request 0x0 genctr mismatch (got 0x0
>> expected 0x1)
>> [102885.487354] nvme nvme0: got bad c2hdata.command_id 0x0 on queue 2
>
> Nice! the genctr is catching bugs already...
>
>> [102885.487356] nvme nvme0: receive failed: -2
>> [102885.487357] nvme nvme0: starting error recovery
>> [102885.487444] block nvme0n1: no usable path - requeuing I/O
>> [102885.502051] nvme nvme0: Reconnecting in 10 seconds...
>> [102893.291877] nvme nvme0: Removing ctrl: NQN
>
> Did you trigger this removal? its 2 seconds before
> the reconnect attempt was supposed to run.
>
>> "nqn.2014-08.org.nvmexpress:uuid:62f37f51-0cc7-46d5-9865-4de22e81bd9d"
>> [102895.535149] nvme nvme0: queue_size 128 > ctrl sqsize 32, clamping
>> down
>> [102895.535214] nvme nvme0: creating 2 I/O queues.
>> [102895.535943] nvme nvme0: mapped 2/0/0 default/read/poll queues.
>> [102895.536187] nvme nvme0: Failed reconnect attempt 1
>> [103700.406060] INFO: task kworker/u4:0:14965 blocked for more than
>> 491 seconds.
>> [103700.406068] Tainted: G E
>> 5.15.0-rc3-54-default+ #862
>> [103700.406071] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [103700.406073] task:kworker/u4:0 state:D stack: 0 pid:14965
>> ppid: 2 flags:0x00004000
>> [103700.406117] Workqueue: nvme-wq nvme_scan_work [nvme_core]
>> [103700.406143] Call Trace:
>> [103700.406146] __schedule+0x302/0x13a0
>> [103700.406157] ? block_read_full_page+0x21e/0x3b0
>> [103700.406164] ? blkdev_direct_IO+0x4a0/0x4a0
>> [103700.406172] schedule+0x3a/0xa0
>> [103700.406177] io_schedule+0x12/0x40
>> [103700.406182] do_read_cache_page+0x49b/0x790
>>
>> 481:~ # cat /proc/15761/stack
>> [<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core]
>> [<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core]
>> [<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core]
>> [<0>] nvme_sysfs_delete+0x42/0x60 [nvme_core]
>> [<0>] kernfs_fop_write_iter+0x12f/0x1a0
>> [<0>] new_sync_write+0x122/0x1b0
>> [<0>] vfs_write+0x1eb/0x250
>> [<0>] ksys_write+0xa1/0xe0
>> [<0>] do_syscall_64+0x3a/0x80
>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
>> c481:~ # cat /proc/14965/stack
>> [<0>] do_read_cache_page+0x49b/0x790
>> [<0>] read_part_sector+0x39/0xe0
>> [<0>] read_lba+0xf9/0x1d0
>> [<0>] efi_partition+0xf1/0x7f0
>> [<0>] bdev_disk_changed+0x1ee/0x550
>> [<0>] blkdev_get_whole+0x81/0x90
>> [<0>] blkdev_get_by_dev+0x128/0x2e0
>> [<0>] device_add_disk+0x377/0x3c0
>> [<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
>> [<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
>> [<0>] nvme_alloc_ns+0x417/0x950 [nvme_core]
>> [<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
>> [<0>] nvme_scan_work+0x168/0x310 [nvme_core]
>> [<0>] process_one_work+0x231/0x420
>> [<0>] worker_thread+0x2d/0x3f0
>> [<0>] kthread+0x11a/0x140
>> [<0>] ret_from_fork+0x22/0x30
>>
>> My theory here is that the partition scanning code just calls into the
>> pagecache, which doesn't set a timeout for any I/O operation.
>> As this is done under the scan_mutex we cannot clear the active paths,
>> and consequently we hang.
>
> But the controller removal should have cancelled all inflight
> commands...
>
> Maybe we're missing unfreeze? Hannes, can you try this one?
> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index e29c47114739..783fde36d2ba 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1974,8 +1974,11 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
> nvme_sync_io_queues(ctrl);
> nvme_tcp_stop_io_queues(ctrl);
> nvme_cancel_tagset(ctrl);
> - if (remove)
> + if (remove) {
> nvme_start_queues(ctrl);
> + nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT);
> + nvme_unfreeze(ctrl);
> + }
> nvme_tcp_destroy_io_queues(ctrl, remove);
> }
> --
Nope. Same problem.
I managed to make the problem go away (for some definitions of 'go
away') with this patch:
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index fb96e900dd3a..30d1154eb611 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -141,12 +141,12 @@ void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl)
{
struct nvme_ns *ns;
- mutex_lock(&ctrl->scan_lock);
down_read(&ctrl->namespaces_rwsem);
list_for_each_entry(ns, &ctrl->namespaces, list)
if (nvme_mpath_clear_current_path(ns))
kblockd_schedule_work(&ns->head->requeue_work);
up_read(&ctrl->namespaces_rwsem);
+ mutex_lock(&ctrl->scan_lock);
mutex_unlock(&ctrl->scan_lock);
}
But I'd be the first to agree that this really is hackish.
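The trick the patch leans on is the lock-then-immediately-unlock barrier: the paths are cleared under namespaces_rwsem only, and scan_lock is acquired purely so the function cannot return while a scan is still inside its critical section. A user-space sketch of that idiom (pthreads; purely illustrative, not the kernel code):

```c
#include <pthread.h>

static pthread_mutex_t scan_lock = PTHREAD_MUTEX_INITIALIZER;

static int clear_calls;
static void count_clear(void) { clear_calls++; } /* test stand-in */

/* Clear paths first (analogous to the loop under namespaces_rwsem),
 * then take and immediately drop scan_lock: the lock/unlock pair acts
 * only as a barrier that waits out any in-flight scan. */
static void clear_paths_then_wait_for_scan(void (*clear_paths)(void))
{
    clear_paths();
    pthread_mutex_lock(&scan_lock);   /* waits for a running holder */
    pthread_mutex_unlock(&scan_lock); /* released right away */
}
```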
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: Deadlock on failure to read NVMe namespace
2021-10-19 12:40 ` Hannes Reinecke
@ 2021-10-19 14:27 ` Sagi Grimberg
2021-10-19 15:04 ` Hannes Reinecke
2021-10-19 15:06 ` Sagi Grimberg
0 siblings, 2 replies; 10+ messages in thread
From: Sagi Grimberg @ 2021-10-19 14:27 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig
Cc: linux-nvme, Keith Busch, Anton Eidelman
>>> 481:~ # cat /proc/15761/stack
>>> [<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core]
>>> [<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core]
>>> [<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core]
>>> [<0>] nvme_sysfs_delete+0x42/0x60 [nvme_core]
>>> [<0>] kernfs_fop_write_iter+0x12f/0x1a0
>>> [<0>] new_sync_write+0x122/0x1b0
>>> [<0>] vfs_write+0x1eb/0x250
>>> [<0>] ksys_write+0xa1/0xe0
>>> [<0>] do_syscall_64+0x3a/0x80
>>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
>>> c481:~ # cat /proc/14965/stack
>>> [<0>] do_read_cache_page+0x49b/0x790
>>> [<0>] read_part_sector+0x39/0xe0
>>> [<0>] read_lba+0xf9/0x1d0
>>> [<0>] efi_partition+0xf1/0x7f0
>>> [<0>] bdev_disk_changed+0x1ee/0x550
>>> [<0>] blkdev_get_whole+0x81/0x90
>>> [<0>] blkdev_get_by_dev+0x128/0x2e0
>>> [<0>] device_add_disk+0x377/0x3c0
>>> [<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
>>> [<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
>>> [<0>] nvme_alloc_ns+0x417/0x950 [nvme_core]
>>> [<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
>>> [<0>] nvme_scan_work+0x168/0x310 [nvme_core]
>>> [<0>] process_one_work+0x231/0x420
>>> [<0>] worker_thread+0x2d/0x3f0
>>> [<0>] kthread+0x11a/0x140
>>> [<0>] ret_from_fork+0x22/0x30
>>>
>>> My theory here is that the partition scanning code just calls into
>>> the pagecache, which doesn't set a timeout for any I/O operation.
>>> As this is done under the scan_mutex we cannot clear the active
>>> paths, and consequently we hang.
>>
>> But the controller removal should have cancelled all inflight
>> commands...
>>
>> Maybe we're missing unfreeze? Hannes, can you try this one?
>> --
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index e29c47114739..783fde36d2ba 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -1974,8 +1974,11 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
>> nvme_sync_io_queues(ctrl);
>> nvme_tcp_stop_io_queues(ctrl);
>> nvme_cancel_tagset(ctrl);
>> - if (remove)
>> + if (remove) {
>> nvme_start_queues(ctrl);
>> + nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT);
>> + nvme_unfreeze(ctrl);
>> + }
>> nvme_tcp_destroy_io_queues(ctrl, remove);
>> }
>> --
> Nope. Same problem.
>
> I managed to make the problem go away (for some definitions of 'go
> away') with this patch:
>
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index fb96e900dd3a..30d1154eb611 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -141,12 +141,12 @@ void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl)
> {
> struct nvme_ns *ns;
>
> - mutex_lock(&ctrl->scan_lock);
> down_read(&ctrl->namespaces_rwsem);
> list_for_each_entry(ns, &ctrl->namespaces, list)
> if (nvme_mpath_clear_current_path(ns))
> kblockd_schedule_work(&ns->head->requeue_work);
> up_read(&ctrl->namespaces_rwsem);
> + mutex_lock(&ctrl->scan_lock);
> mutex_unlock(&ctrl->scan_lock);
> }
>
> But I'd be the first to agree that this really is hackish.
Yea, that doesn't solve anything.
I think this sequence is familiar and was addressed by a fix from Anton
(CC'd) which still has some pending review comments.
Can you look up and try:
[PATCH] nvme/mpath: fix hang when disk goes live over reconnect
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Deadlock on failure to read NVMe namespace
2021-10-19 14:27 ` Sagi Grimberg
@ 2021-10-19 15:04 ` Hannes Reinecke
2021-10-19 15:09 ` Sagi Grimberg
2021-10-19 15:06 ` Sagi Grimberg
1 sibling, 1 reply; 10+ messages in thread
From: Hannes Reinecke @ 2021-10-19 15:04 UTC (permalink / raw)
To: Sagi Grimberg, Christoph Hellwig; +Cc: linux-nvme, Keith Busch, Anton Eidelman
On 10/19/21 4:27 PM, Sagi Grimberg wrote:
>
>>>> 481:~ # cat /proc/15761/stack
>>>> [<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core]
>>>> [<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core]
>>>> [<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core]
>>>> [<0>] nvme_sysfs_delete+0x42/0x60 [nvme_core]
>>>> [<0>] kernfs_fop_write_iter+0x12f/0x1a0
>>>> [<0>] new_sync_write+0x122/0x1b0
>>>> [<0>] vfs_write+0x1eb/0x250
>>>> [<0>] ksys_write+0xa1/0xe0
>>>> [<0>] do_syscall_64+0x3a/0x80
>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>> c481:~ # cat /proc/14965/stack
>>>> [<0>] do_read_cache_page+0x49b/0x790
>>>> [<0>] read_part_sector+0x39/0xe0
>>>> [<0>] read_lba+0xf9/0x1d0
>>>> [<0>] efi_partition+0xf1/0x7f0
>>>> [<0>] bdev_disk_changed+0x1ee/0x550
>>>> [<0>] blkdev_get_whole+0x81/0x90
>>>> [<0>] blkdev_get_by_dev+0x128/0x2e0
>>>> [<0>] device_add_disk+0x377/0x3c0
>>>> [<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
>>>> [<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
>>>> [<0>] nvme_alloc_ns+0x417/0x950 [nvme_core]
>>>> [<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
>>>> [<0>] nvme_scan_work+0x168/0x310 [nvme_core]
>>>> [<0>] process_one_work+0x231/0x420
>>>> [<0>] worker_thread+0x2d/0x3f0
>>>> [<0>] kthread+0x11a/0x140
>>>> [<0>] ret_from_fork+0x22/0x30
>>>>
>>>> My theory here is that the partition scanning code just calls into
>>>> the pagecache, which doesn't set a timeout for any I/O operation.
>>>> As this is done under the scan_mutex we cannot clear the active
>>>> paths, and consequently we hang.
>>>
>>> But the controller removal should have cancelled all inflight
>>> commands...
>>>
>>> Maybe we're missing unfreeze? Hannes, can you try this one?
>>> --
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index e29c47114739..783fde36d2ba 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -1974,8 +1974,11 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
>>> nvme_sync_io_queues(ctrl);
>>> nvme_tcp_stop_io_queues(ctrl);
>>> nvme_cancel_tagset(ctrl);
>>> - if (remove)
>>> + if (remove) {
>>> nvme_start_queues(ctrl);
>>> + nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT);
>>> + nvme_unfreeze(ctrl);
>>> + }
>>> nvme_tcp_destroy_io_queues(ctrl, remove);
>>> }
>>> --
>> Nope. Same problem.
>>
>> I managed to make the problem go away (for some definitions of 'go
>> away') with this patch:
>>
>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>> index fb96e900dd3a..30d1154eb611 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -141,12 +141,12 @@ void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl)
>> {
>> struct nvme_ns *ns;
>>
>> - mutex_lock(&ctrl->scan_lock);
>> down_read(&ctrl->namespaces_rwsem);
>> list_for_each_entry(ns, &ctrl->namespaces, list)
>> if (nvme_mpath_clear_current_path(ns))
>> kblockd_schedule_work(&ns->head->requeue_work);
>> up_read(&ctrl->namespaces_rwsem);
>> + mutex_lock(&ctrl->scan_lock);
>> mutex_unlock(&ctrl->scan_lock);
>> }
>>
>> But I'd be the first to agree that this really is hackish.
>
> Yea, that doesn't solve anything.
>
> I think this sequence is familiar and was addressed by a fix from Anton
> (CC'd) which still has some pending review comments.
>
> Can you lookup and try:
> [PATCH] nvme/mpath: fix hang when disk goes live over reconnect
I actually had looked at it, but then decided it's (trying) to address a
different issue.
Thing is, I'll see it being stuck even _before_ disconnect happens;
we're stuck with the _initial_ scan:
[ 3920.880552] device_add_disk+0x377/0x3c0
[ 3920.880558] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
[ 3920.880574] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
[ 3920.880588] ? device_add_disk+0x27d/0x3c0
[ 3920.880593] nvme_alloc_ns+0x417/0x950 [nvme_core]
[ 3920.880606] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
[ 3920.880618] ? __nvme_submit_sync_cmd+0x19b/0x210 [nvme_core]
[ 3920.880631] nvme_scan_work+0x168/0x310 [nvme_core]
and can't make progress as the I/O is _NOT_ failed, but rather causing a
reconnect:
[ 1364.442390] nvme nvme0: request 0x0 genctr mismatch (got 0x0 expected 0x2)
[ 1364.442408] nvme nvme0: got bad c2hdata.command_id 0x0 on queue 2
[ 1364.442411] nvme nvme0: receive failed: -2
[ 1364.442414] nvme nvme0: starting error recovery
[ 1364.442502] block nvme0n1: no usable path - requeuing I/O
so we don't have a usable path, and will requeue I/O.
But upon reconnect we will retry _the same_ I/O, and get into the same
state.
So the I/O will be stuck until disconnect happens.
But disconnect does a
nvme_do_delete_ctrl()
-> nvme_remove_namespaces()
-> nvme_mpath_clear_ctrl_paths()
-> mutex_lock(&scan_mutex)
and hangs.
Everything would be solved if we had _aborted_ the invalid I/O instead
of forcing a reconnect; which, incidentally, is a far better way to
handle TCP PDU sequence errors than the current way of dropping the
connection. Which would only help if it were a transport artifact, but
I've yet to see a fabric which eats individual bits of a frame;
a programming error is far more likely, in which case an I/O error would be
a far better response.
But I guess I'll need to raise it at the fmds call...
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Deadlock on failure to read NVMe namespace
2021-10-19 14:27 ` Sagi Grimberg
2021-10-19 15:04 ` Hannes Reinecke
@ 2021-10-19 15:06 ` Sagi Grimberg
2021-10-19 15:41 ` Hannes Reinecke
1 sibling, 1 reply; 10+ messages in thread
From: Sagi Grimberg @ 2021-10-19 15:06 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig
Cc: linux-nvme, Keith Busch, Anton Eidelman
On 10/19/21 5:27 PM, Sagi Grimberg wrote:
>
>>>> 481:~ # cat /proc/15761/stack
>>>> [<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core]
>>>> [<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core]
>>>> [<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core]
>>>> [<0>] nvme_sysfs_delete+0x42/0x60 [nvme_core]
>>>> [<0>] kernfs_fop_write_iter+0x12f/0x1a0
>>>> [<0>] new_sync_write+0x122/0x1b0
>>>> [<0>] vfs_write+0x1eb/0x250
>>>> [<0>] ksys_write+0xa1/0xe0
>>>> [<0>] do_syscall_64+0x3a/0x80
>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>> c481:~ # cat /proc/14965/stack
>>>> [<0>] do_read_cache_page+0x49b/0x790
>>>> [<0>] read_part_sector+0x39/0xe0
>>>> [<0>] read_lba+0xf9/0x1d0
>>>> [<0>] efi_partition+0xf1/0x7f0
>>>> [<0>] bdev_disk_changed+0x1ee/0x550
>>>> [<0>] blkdev_get_whole+0x81/0x90
>>>> [<0>] blkdev_get_by_dev+0x128/0x2e0
>>>> [<0>] device_add_disk+0x377/0x3c0
>>>> [<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
>>>> [<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
>>>> [<0>] nvme_alloc_ns+0x417/0x950 [nvme_core]
>>>> [<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
>>>> [<0>] nvme_scan_work+0x168/0x310 [nvme_core]
>>>> [<0>] process_one_work+0x231/0x420
>>>> [<0>] worker_thread+0x2d/0x3f0
>>>> [<0>] kthread+0x11a/0x140
>>>> [<0>] ret_from_fork+0x22/0x30
...
> I think this sequence is familiar and was addressed by a fix from Anton
> (CC'd) which still has some pending review comments.
>
> Can you lookup and try:
> [PATCH] nvme/mpath: fix hang when disk goes live over reconnect
Actually, I see the trace is going from nvme_alloc_ns, not the ANA
update path, so that is unlikely to address the issue.
Looking at nvme_mpath_clear_ctrl_paths, I don't think it should
take the scan_lock anymore. IIRC the reason why it needed the
scan_lock in the first place was the fact that ctrl->namespaces
was added and then sorted in scan_work (taking the namespaces_rwsem
twice).
But now that ctrl->namespaces is always sorted, and accessed with
namespaces_rwsem, I think that scan_lock is no longer needed
here and namespaces_rwsem is sufficient.
Thoughts?
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Deadlock on failure to read NVMe namespace
2021-10-19 15:04 ` Hannes Reinecke
@ 2021-10-19 15:09 ` Sagi Grimberg
2021-10-19 15:49 ` Hannes Reinecke
0 siblings, 1 reply; 10+ messages in thread
From: Sagi Grimberg @ 2021-10-19 15:09 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig
Cc: linux-nvme, Keith Busch, Anton Eidelman
On 10/19/21 6:04 PM, Hannes Reinecke wrote:
> On 10/19/21 4:27 PM, Sagi Grimberg wrote:
>>
>>>>> 481:~ # cat /proc/15761/stack
>>>>> [<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core]
>>>>> [<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core]
>>>>> [<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core]
>>>>> [<0>] nvme_sysfs_delete+0x42/0x60 [nvme_core]
>>>>> [<0>] kernfs_fop_write_iter+0x12f/0x1a0
>>>>> [<0>] new_sync_write+0x122/0x1b0
>>>>> [<0>] vfs_write+0x1eb/0x250
>>>>> [<0>] ksys_write+0xa1/0xe0
>>>>> [<0>] do_syscall_64+0x3a/0x80
>>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>> c481:~ # cat /proc/14965/stack
>>>>> [<0>] do_read_cache_page+0x49b/0x790
>>>>> [<0>] read_part_sector+0x39/0xe0
>>>>> [<0>] read_lba+0xf9/0x1d0
>>>>> [<0>] efi_partition+0xf1/0x7f0
>>>>> [<0>] bdev_disk_changed+0x1ee/0x550
>>>>> [<0>] blkdev_get_whole+0x81/0x90
>>>>> [<0>] blkdev_get_by_dev+0x128/0x2e0
>>>>> [<0>] device_add_disk+0x377/0x3c0
>>>>> [<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
>>>>> [<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
>>>>> [<0>] nvme_alloc_ns+0x417/0x950 [nvme_core]
>>>>> [<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
>>>>> [<0>] nvme_scan_work+0x168/0x310 [nvme_core]
>>>>> [<0>] process_one_work+0x231/0x420
>>>>> [<0>] worker_thread+0x2d/0x3f0
>>>>> [<0>] kthread+0x11a/0x140
>>>>> [<0>] ret_from_fork+0x22/0x30
>>>>>
>>>>> My theory here is that the partition scanning code just calls into
>>>>> the pagecache, which doesn't set a timeout for any I/O operation.
>>>>> As this is done under the scan_mutex we cannot clear the active
>>>>> paths, and consequently we hang.
>>>>
>>>> But the controller removal should have cancelled all inflight
>>>> commands...
>>>>
>>>> Maybe we're missing unfreeze? Hannes, can you try this one?
>>>> --
>>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>>> index e29c47114739..783fde36d2ba 100644
>>>> --- a/drivers/nvme/host/tcp.c
>>>> +++ b/drivers/nvme/host/tcp.c
>>>> @@ -1974,8 +1974,11 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
>>>> nvme_sync_io_queues(ctrl);
>>>> nvme_tcp_stop_io_queues(ctrl);
>>>> nvme_cancel_tagset(ctrl);
>>>> - if (remove)
>>>> + if (remove) {
>>>> nvme_start_queues(ctrl);
>>>> + nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT);
>>>> + nvme_unfreeze(ctrl);
>>>> + }
>>>> nvme_tcp_destroy_io_queues(ctrl, remove);
>>>> }
>>>> --
>>> Nope. Same problem.
>>>
>>> I managed to make the problem go away (for some definitions of 'go
>>> away') with this patch:
>>>
>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>>> index fb96e900dd3a..30d1154eb611 100644
>>> --- a/drivers/nvme/host/multipath.c
>>> +++ b/drivers/nvme/host/multipath.c
>>> @@ -141,12 +141,12 @@ void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl)
>>> {
>>> struct nvme_ns *ns;
>>>
>>> - mutex_lock(&ctrl->scan_lock);
>>> down_read(&ctrl->namespaces_rwsem);
>>> list_for_each_entry(ns, &ctrl->namespaces, list)
>>> if (nvme_mpath_clear_current_path(ns))
>>> kblockd_schedule_work(&ns->head->requeue_work);
>>> up_read(&ctrl->namespaces_rwsem);
>>> + mutex_lock(&ctrl->scan_lock);
>>> mutex_unlock(&ctrl->scan_lock);
>>> }
>>>
>>> But I'd be the first to agree that this really is hackish.
>>
>> Yea, that doesn't solve anything.
>>
>> I think this sequence is familiar and was addressed by a fix from Anton
>> (CC'd) which still has some pending review comments.
>>
>> Can you lookup and try:
>> [PATCH] nvme/mpath: fix hang when disk goes live over reconnect
>
> I actually had looked at it, but then decided it's (trying) to address a
> different issue.
>
> Thing is, I'll see it being stuck even _before_ disconnect happens;
> we're stuck with the _initial_ scan:
>
> [ 3920.880552] device_add_disk+0x377/0x3c0
> [ 3920.880558] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
> [ 3920.880574] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
> [ 3920.880588] ? device_add_disk+0x27d/0x3c0
> [ 3920.880593] nvme_alloc_ns+0x417/0x950 [nvme_core]
> [ 3920.880606] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
> [ 3920.880618] ? __nvme_submit_sync_cmd+0x19b/0x210 [nvme_core]
> [ 3920.880631] nvme_scan_work+0x168/0x310 [nvme_core]
>
> and can't make progress as the I/O is _NOT_ failed, but rather causing a
> reconnect:
> [ 1364.442390] nvme nvme0: request 0x0 genctr mismatch (got 0x0 expected 0x2)
> [ 1364.442408] nvme nvme0: got bad c2hdata.command_id 0x0 on queue 2
> [ 1364.442411] nvme nvme0: receive failed: -2
> [ 1364.442414] nvme nvme0: starting error recovery
> [ 1364.442502] block nvme0n1: no usable path - requeuing I/O
>
> so we don't have a usable path, and will requeue I/O.
> But upon reconnect we will retry _the same_ I/O, and get into the same
> state.
>
> So the I/O will be stuck until disconnect happens.
> But disconnect does a
> nvme_do_delete_ctrl()
> -> nvme_remove_namespaces()
> -> nvme_mpath_clear_ctrl_paths()
> -> mutex_lock(&scan_mutex)
>
> and hangs.
>
> Everything would be solved if we had _aborted_ the invalid I/O instead
> of forcing a reconnect;
But you don't know which I/O to abort, you got a bad command_id.
> which, incidentally, is a far better way to
> handle TCP PDU sequence errors than the current way of dropping the
> connection.
This is not a transport error; a digest error would be a transport
error. This is a controller logic error.
> Which would only help if it were a transport artifact, but
> I've yet to see a fabric which eats individual bits of a frame;
> a programming error is far more likely, in which case an I/O error would be
> a far better response.
Again, which I/O?
* Re: Deadlock on failure to read NVMe namespace
2021-10-19 15:06 ` Sagi Grimberg
@ 2021-10-19 15:41 ` Hannes Reinecke
2021-10-19 20:13 ` Sagi Grimberg
0 siblings, 1 reply; 10+ messages in thread
From: Hannes Reinecke @ 2021-10-19 15:41 UTC (permalink / raw)
To: Sagi Grimberg, Christoph Hellwig; +Cc: linux-nvme, Keith Busch, Anton Eidelman
On 10/19/21 5:06 PM, Sagi Grimberg wrote:
>
>
> On 10/19/21 5:27 PM, Sagi Grimberg wrote:
>>
>>>>> c481:~ # cat /proc/15761/stack
>>>>> [<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core]
>>>>> [<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core]
>>>>> [<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core]
>>>>> [<0>] nvme_sysfs_delete+0x42/0x60 [nvme_core]
>>>>> [<0>] kernfs_fop_write_iter+0x12f/0x1a0
>>>>> [<0>] new_sync_write+0x122/0x1b0
>>>>> [<0>] vfs_write+0x1eb/0x250
>>>>> [<0>] ksys_write+0xa1/0xe0
>>>>> [<0>] do_syscall_64+0x3a/0x80
>>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>> c481:~ # cat /proc/14965/stack
>>>>> [<0>] do_read_cache_page+0x49b/0x790
>>>>> [<0>] read_part_sector+0x39/0xe0
>>>>> [<0>] read_lba+0xf9/0x1d0
>>>>> [<0>] efi_partition+0xf1/0x7f0
>>>>> [<0>] bdev_disk_changed+0x1ee/0x550
>>>>> [<0>] blkdev_get_whole+0x81/0x90
>>>>> [<0>] blkdev_get_by_dev+0x128/0x2e0
>>>>> [<0>] device_add_disk+0x377/0x3c0
>>>>> [<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
>>>>> [<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
>>>>> [<0>] nvme_alloc_ns+0x417/0x950 [nvme_core]
>>>>> [<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
>>>>> [<0>] nvme_scan_work+0x168/0x310 [nvme_core]
>>>>> [<0>] process_one_work+0x231/0x420
>>>>> [<0>] worker_thread+0x2d/0x3f0
>>>>> [<0>] kthread+0x11a/0x140
>>>>> [<0>] ret_from_fork+0x22/0x30
>
> ...
>
>> I think this sequence is familiar and was addressed by a fix from Anton
>> (CC'd) which still has some pending review comments.
>>
>> Can you lookup and try:
>> [PATCH] nvme/mpath: fix hang when disk goes live over reconnect
>
> Actually, I see the trace is going from nvme_alloc_ns, not the ANA
> update path, so that is unlikely to address the issue.
>
> Looking at nvme_mpath_clear_ctrl_paths, I don't think it should
> take the scan_lock anymore. IIRC the reason why it needed the
> scan_lock in the first place was the fact that ctrl->namespaces
> was added and then sorted in scan_work (taking the namespaces_rwsem
> twice).
>
> But now that ctrl->namespaces is always sorted, and accessed with
> namespaces_rwsem, I think that scan_lock is no longer needed
> here and namespaces_rwsem is sufficient.
>
... which was precisely what my initial patch did.
While it worked in the sense that 'nvme disconnect' completed, we did not
terminate the outstanding I/O, as no current path is set and hence this:
down_read(&ctrl->namespaces_rwsem);
list_for_each_entry(ns, &ctrl->namespaces, list)
if (nvme_mpath_clear_current_path(ns))
kblockd_schedule_work(&ns->head->requeue_work);
up_read(&ctrl->namespaces_rwsem);
doesn't do anything, in particular does _not_ flush the requeue work.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
* Re: Deadlock on failure to read NVMe namespace
2021-10-19 15:09 ` Sagi Grimberg
@ 2021-10-19 15:49 ` Hannes Reinecke
0 siblings, 0 replies; 10+ messages in thread
From: Hannes Reinecke @ 2021-10-19 15:49 UTC (permalink / raw)
To: Sagi Grimberg, Christoph Hellwig; +Cc: linux-nvme, Keith Busch, Anton Eidelman
On 10/19/21 5:09 PM, Sagi Grimberg wrote:
>
>
> On 10/19/21 6:04 PM, Hannes Reinecke wrote:
>> On 10/19/21 4:27 PM, Sagi Grimberg wrote:
>>>
>>>>>> c481:~ # cat /proc/15761/stack
>>>>>> [<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core]
>>>>>> [<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core]
>>>>>> [<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core]
>>>>>> [<0>] nvme_sysfs_delete+0x42/0x60 [nvme_core]
>>>>>> [<0>] kernfs_fop_write_iter+0x12f/0x1a0
>>>>>> [<0>] new_sync_write+0x122/0x1b0
>>>>>> [<0>] vfs_write+0x1eb/0x250
>>>>>> [<0>] ksys_write+0xa1/0xe0
>>>>>> [<0>] do_syscall_64+0x3a/0x80
>>>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>>> c481:~ # cat /proc/14965/stack
>>>>>> [<0>] do_read_cache_page+0x49b/0x790
>>>>>> [<0>] read_part_sector+0x39/0xe0
>>>>>> [<0>] read_lba+0xf9/0x1d0
>>>>>> [<0>] efi_partition+0xf1/0x7f0
>>>>>> [<0>] bdev_disk_changed+0x1ee/0x550
>>>>>> [<0>] blkdev_get_whole+0x81/0x90
>>>>>> [<0>] blkdev_get_by_dev+0x128/0x2e0
>>>>>> [<0>] device_add_disk+0x377/0x3c0
>>>>>> [<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
>>>>>> [<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
>>>>>> [<0>] nvme_alloc_ns+0x417/0x950 [nvme_core]
>>>>>> [<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
>>>>>> [<0>] nvme_scan_work+0x168/0x310 [nvme_core]
>>>>>> [<0>] process_one_work+0x231/0x420
>>>>>> [<0>] worker_thread+0x2d/0x3f0
>>>>>> [<0>] kthread+0x11a/0x140
>>>>>> [<0>] ret_from_fork+0x22/0x30
>>>>>>
>>>>>> My theory here is that the partition scanning code just calls into
>>>>>> the pagecache, which doesn't set a timeout for any I/O operation.
>>>>>> As this is done under the scan_mutex we cannot clear the active
>>>>>> paths, and consequently we hang.
>>>>>
>>>>> But the controller removal should have cancelled all inflight
>>>>> commands...
>>>>>
>>>>> Maybe we're missing unfreeze? Hannes, can you try this one?
>>>>> --
>>>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>>>> index e29c47114739..783fde36d2ba 100644
>>>>> --- a/drivers/nvme/host/tcp.c
>>>>> +++ b/drivers/nvme/host/tcp.c
>>>>> @@ -1974,8 +1974,11 @@ static void
>>>>> nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
>>>>> nvme_sync_io_queues(ctrl);
>>>>> nvme_tcp_stop_io_queues(ctrl);
>>>>> nvme_cancel_tagset(ctrl);
>>>>> - if (remove)
>>>>> + if (remove) {
>>>>> nvme_start_queues(ctrl);
>>>>> + nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT);
>>>>> + nvme_unfreeze(ctrl);
>>>>> + }
>>>>> nvme_tcp_destroy_io_queues(ctrl, remove);
>>>>> }
>>>>> --
>>>> Nope. Same problem.
>>>>
>>>> I managed to make the problem go away (for some definitions of 'go
>>>> away') with this patch:
>>>>
>>>> diff --git a/drivers/nvme/host/multipath.c
>>>> b/drivers/nvme/host/multipath.c
>>>> index fb96e900dd3a..30d1154eb611 100644
>>>> --- a/drivers/nvme/host/multipath.c
>>>> +++ b/drivers/nvme/host/multipath.c
>>>> @@ -141,12 +141,12 @@ void nvme_mpath_clear_ctrl_paths(struct
>>>> nvme_ctrl *ctrl)
>>>> {
>>>> struct nvme_ns *ns;
>>>>
>>>> - mutex_lock(&ctrl->scan_lock);
>>>> down_read(&ctrl->namespaces_rwsem);
>>>> list_for_each_entry(ns, &ctrl->namespaces, list)
>>>> if (nvme_mpath_clear_current_path(ns))
>>>>
>>>> kblockd_schedule_work(&ns->head->requeue_work);
>>>> up_read(&ctrl->namespaces_rwsem);
>>>> + mutex_lock(&ctrl->scan_lock);
>>>> mutex_unlock(&ctrl->scan_lock);
>>>> }
>>>>
>>>> But I'd be the first to agree that this really is hackish.
>>>
>>> Yea, that doesn't solve anything.
>>>
>>> I think this sequence is familiar and was addressed by a fix from Anton
>>> (CC'd) which still has some pending review comments.
>>>
>>> Can you lookup and try:
>>> [PATCH] nvme/mpath: fix hang when disk goes live over reconnect
>>
>> I actually had looked at it, but then decided it's (trying) to address
>> a different issue.
>>
>> Thing is, I'll see it being stuck even _before_ disconnect happens;
>> we're stuck with the _initial_ scan:
>>
>> [ 3920.880552] device_add_disk+0x377/0x3c0
>> [ 3920.880558] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
>> [ 3920.880574] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
>> [ 3920.880588] ? device_add_disk+0x27d/0x3c0
>> [ 3920.880593] nvme_alloc_ns+0x417/0x950 [nvme_core]
>> [ 3920.880606] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
>> [ 3920.880618] ? __nvme_submit_sync_cmd+0x19b/0x210 [nvme_core]
>> [ 3920.880631] nvme_scan_work+0x168/0x310 [nvme_core]
>>
>> and can't make progress, as the I/O is _NOT_ failed but rather causes
>> a reconnect:
>> [ 1364.442390] nvme nvme0: request 0x0 genctr mismatch (got 0x0
>> expected 0x2)
>> [ 1364.442408] nvme nvme0: got bad c2hdata.command_id 0x0 on queue 2
>> [ 1364.442411] nvme nvme0: receive failed: -2
>> [ 1364.442414] nvme nvme0: starting error recovery
>> [ 1364.442502] block nvme0n1: no usable path - requeuing I/O
>>
>> so we don't have a usable path, and will requeue I/O.
>> But upon reconnect we will retry _the same_ I/O, and get into the same
>> state.
>>
>> So the I/O will be stuck until disconnect happens.
>> But disconnect does a
>> nvme_do_delete_ctrl()
>> -> nvme_remove_namespaces()
>> -> nvme_mpath_clear_ctrl_paths()
>> -> mutex_lock(&scan_mutex)
>>
>> and hangs.
>>
>> Everything would be solved if we had _aborted_ the invalid I/O instead
>> of forcing a reconnect;
>
> But you don't know which I/O to abort, you got a bad command_id.
>
Yeah, you are right. I stand corrected.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
* Re: Deadlock on failure to read NVMe namespace
2021-10-19 15:41 ` Hannes Reinecke
@ 2021-10-19 20:13 ` Sagi Grimberg
0 siblings, 0 replies; 10+ messages in thread
From: Sagi Grimberg @ 2021-10-19 20:13 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig
Cc: linux-nvme, Keith Busch, Anton Eidelman
>>>>>> c481:~ # cat /proc/15761/stack
>>>>>> [<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core]
>>>>>> [<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core]
>>>>>> [<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core]
>>>>>> [<0>] nvme_sysfs_delete+0x42/0x60 [nvme_core]
>>>>>> [<0>] kernfs_fop_write_iter+0x12f/0x1a0
>>>>>> [<0>] new_sync_write+0x122/0x1b0
>>>>>> [<0>] vfs_write+0x1eb/0x250
>>>>>> [<0>] ksys_write+0xa1/0xe0
>>>>>> [<0>] do_syscall_64+0x3a/0x80
>>>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>>> c481:~ # cat /proc/14965/stack
>>>>>> [<0>] do_read_cache_page+0x49b/0x790
>>>>>> [<0>] read_part_sector+0x39/0xe0
>>>>>> [<0>] read_lba+0xf9/0x1d0
>>>>>> [<0>] efi_partition+0xf1/0x7f0
>>>>>> [<0>] bdev_disk_changed+0x1ee/0x550
>>>>>> [<0>] blkdev_get_whole+0x81/0x90
>>>>>> [<0>] blkdev_get_by_dev+0x128/0x2e0
>>>>>> [<0>] device_add_disk+0x377/0x3c0
>>>>>> [<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
>>>>>> [<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
>>>>>> [<0>] nvme_alloc_ns+0x417/0x950 [nvme_core]
>>>>>> [<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
>>>>>> [<0>] nvme_scan_work+0x168/0x310 [nvme_core]
>>>>>> [<0>] process_one_work+0x231/0x420
>>>>>> [<0>] worker_thread+0x2d/0x3f0
>>>>>> [<0>] kthread+0x11a/0x140
>>>>>> [<0>] ret_from_fork+0x22/0x30
>>
>> ...
>>
>>> I think this sequence is familiar and was addressed by a fix from Anton
>>> (CC'd) which still has some pending review comments.
>>>
>>> Can you lookup and try:
>>> [PATCH] nvme/mpath: fix hang when disk goes live over reconnect
>>
>> Actually, I see the trace is going from nvme_alloc_ns, not the ANA
>> update path, so that is unlikely to address the issue.
>>
>> Looking at nvme_mpath_clear_ctrl_paths, I don't think it should
>> take the scan_lock anymore. IIRC the reason why it needed the
>> scan_lock in the first place was the fact that ctrl->namespaces
>> was added and then sorted in scan_work (taking the namespaces_rwsem
>> twice).
>>
>> But now that ctrl->namespaces is always sorted, and accessed with
>> namespaces_rwsem, I think that scan_lock is no longer needed
>> here and namespaces_rwsem is sufficient.
>>
> ... which was precisely what my initial patch did.
> While it worked in the sense that 'nvme disconnect' completed, we did not
> terminate the outstanding I/O, as no current path is set and hence this:
>
> down_read(&ctrl->namespaces_rwsem);
> list_for_each_entry(ns, &ctrl->namespaces, list)
> if (nvme_mpath_clear_current_path(ns))
> kblockd_schedule_work(&ns->head->requeue_work);
> up_read(&ctrl->namespaces_rwsem);
>
> doesn't do anything, in particular does _not_ flush the requeue work.
So you get hung I/O? Why does it need to flush it?
end of thread (newest: 2021-10-19 20:13 UTC)
Thread overview: 10+ messages
2021-10-19 11:22 Deadlock on failure to read NVMe namespace Hannes Reinecke
2021-10-19 12:31 ` Sagi Grimberg
2021-10-19 12:40 ` Hannes Reinecke
2021-10-19 14:27 ` Sagi Grimberg
2021-10-19 15:04 ` Hannes Reinecke
2021-10-19 15:09 ` Sagi Grimberg
2021-10-19 15:49 ` Hannes Reinecke
2021-10-19 15:06 ` Sagi Grimberg
2021-10-19 15:41 ` Hannes Reinecke
2021-10-19 20:13 ` Sagi Grimberg