* Deadlock on failure to read NVMe namespace
@ 2021-10-19 11:22 Hannes Reinecke
  2021-10-19 12:31 ` Sagi Grimberg
  0 siblings, 1 reply; 10+ messages in thread
From: Hannes Reinecke @ 2021-10-19 11:22 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-nvme, Keith Busch, Sagi Grimberg

Hi all,

during testing my test target (https://github.com/hreinecke/nofuse) I've 
managed to run into a deadlock (with nvme-5.16):

c481:~ # dmesg
[102885.483587] nvme nvme0: sqsize 128 > ctrl maxcmd 32, clamping down
[102885.483711] nvme nvme0: creating 2 I/O queues.
[102885.484130] nvme nvme0: mapped 2/0/0 default/read/poll queues.
[102885.485527] nvme nvme0: new ctrl: NQN 
"nqn.2014-08.org.nvmexpress:uuid:62f37f51-0cc7-46d5-9865-4de22e81bd9d", 
addr 127.0.0.1:8009
[102885.487351] nvme nvme0: request 0x0 genctr mismatch (got 0x0 
expected 0x1)
[102885.487354] nvme nvme0: got bad c2hdata.command_id 0x0 on queue 2
[102885.487356] nvme nvme0: receive failed:  -2
[102885.487357] nvme nvme0: starting error recovery
[102885.487444] block nvme0n1: no usable path - requeuing I/O
[102885.502051] nvme nvme0: Reconnecting in 10 seconds...
[102893.291877] nvme nvme0: Removing ctrl: NQN 
"nqn.2014-08.org.nvmexpress:uuid:62f37f51-0cc7-46d5-9865-4de22e81bd9d"
[102895.535149] nvme nvme0: queue_size 128 > ctrl sqsize 32, clamping down
[102895.535214] nvme nvme0: creating 2 I/O queues.
[102895.535943] nvme nvme0: mapped 2/0/0 default/read/poll queues.
[102895.536187] nvme nvme0: Failed reconnect attempt 1
[103700.406060] INFO: task kworker/u4:0:14965 blocked for more than 491 
seconds.
[103700.406068]       Tainted: G            E     5.15.0-rc3-54-default+ 
#862
[103700.406071] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[103700.406073] task:kworker/u4:0    state:D stack:    0 pid:14965 ppid: 
     2 flags:0x00004000
[103700.406117] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[103700.406143] Call Trace:
[103700.406146]  __schedule+0x302/0x13a0
[103700.406157]  ? block_read_full_page+0x21e/0x3b0
[103700.406164]  ? blkdev_direct_IO+0x4a0/0x4a0
[103700.406172]  schedule+0x3a/0xa0
[103700.406177]  io_schedule+0x12/0x40
[103700.406182]  do_read_cache_page+0x49b/0x790

481:~ # cat /proc/15761/stack
[<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core]
[<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core]
[<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core]
[<0>] nvme_sysfs_delete+0x42/0x60 [nvme_core]
[<0>] kernfs_fop_write_iter+0x12f/0x1a0
[<0>] new_sync_write+0x122/0x1b0
[<0>] vfs_write+0x1eb/0x250
[<0>] ksys_write+0xa1/0xe0
[<0>] do_syscall_64+0x3a/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
c481:~ # cat /proc/14965/stack
[<0>] do_read_cache_page+0x49b/0x790
[<0>] read_part_sector+0x39/0xe0
[<0>] read_lba+0xf9/0x1d0
[<0>] efi_partition+0xf1/0x7f0
[<0>] bdev_disk_changed+0x1ee/0x550
[<0>] blkdev_get_whole+0x81/0x90
[<0>] blkdev_get_by_dev+0x128/0x2e0
[<0>] device_add_disk+0x377/0x3c0
[<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
[<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
[<0>] nvme_alloc_ns+0x417/0x950 [nvme_core]
[<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
[<0>] nvme_scan_work+0x168/0x310 [nvme_core]
[<0>] process_one_work+0x231/0x420
[<0>] worker_thread+0x2d/0x3f0
[<0>] kthread+0x11a/0x140
[<0>] ret_from_fork+0x22/0x30

My theory here is that the partition scanning code just calls into the 
pagecache, which doesn't set a timeout for any I/O operation.
As this is done under the scan_mutex we cannot clear the active paths, 
and consequently we hang.
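
(To make the suspected pattern concrete: below is a minimal userspace 
sketch with pthreads. It is purely illustrative, not the actual nvme 
code paths, and all names in it are made up. One thread blocks on "I/O" 
while holding a lock, and the only thread that could complete that I/O 
needs the same lock first.)

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t scan_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t io_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t io_done_cv = PTHREAD_COND_INITIALIZER;
static bool io_done;

/* models scan work: holds scan_lock across the partition scan */
static void *scan_work(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&scan_lock);
        printf("scan: holding scan_lock, waiting for the partition read\n");
        pthread_mutex_lock(&io_lock);
        while (!io_done)                /* like do_read_cache_page(): no timeout */
                pthread_cond_wait(&io_done_cv, &io_lock);
        pthread_mutex_unlock(&io_lock);
        pthread_mutex_unlock(&scan_lock);
        return NULL;
}

/* models deletion: needs scan_lock before it can fail/complete the I/O */
static void *delete_ctrl(void *arg)
{
        (void)arg;
        sleep(1);
        printf("delete: want scan_lock to clear paths and fail the I/O\n");
        pthread_mutex_lock(&scan_lock); /* blocks forever: scan holds it */
        pthread_mutex_lock(&io_lock);
        io_done = true;
        pthread_cond_signal(&io_done_cv);
        pthread_mutex_unlock(&io_lock);
        pthread_mutex_unlock(&scan_lock);
        return NULL;
}

int main(void)
{
        pthread_t s, d;

        pthread_create(&s, NULL, scan_work, NULL);
        pthread_create(&d, NULL, delete_ctrl, NULL);
        sleep(3);
        printf("still stuck: same shape as the hang above\n");
        return 0;       /* exit without joining the deadlocked threads */
}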

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



* Re: Deadlock on failure to read NVMe namespace
  2021-10-19 11:22 Deadlock on failure to read NVMe namespace Hannes Reinecke
@ 2021-10-19 12:31 ` Sagi Grimberg
  2021-10-19 12:40   ` Hannes Reinecke
  0 siblings, 1 reply; 10+ messages in thread
From: Sagi Grimberg @ 2021-10-19 12:31 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: linux-nvme, Keith Busch


> Hi all,
> 
> during testing my test target (https://github.com/hreinecke/nofuse) I've 
> managed to run into a deadlock (with nvme-5.16):
> 
> c481:~ # dmesg
> [102885.483587] nvme nvme0: sqsize 128 > ctrl maxcmd 32, clamping down
> [102885.483711] nvme nvme0: creating 2 I/O queues.
> [102885.484130] nvme nvme0: mapped 2/0/0 default/read/poll queues.
> [102885.485527] nvme nvme0: new ctrl: NQN 
> "nqn.2014-08.org.nvmexpress:uuid:62f37f51-0cc7-46d5-9865-4de22e81bd9d", 
> addr 127.0.0.1:8009
> [102885.487351] nvme nvme0: request 0x0 genctr mismatch (got 0x0 
> expected 0x1)
> [102885.487354] nvme nvme0: got bad c2hdata.command_id 0x0 on queue 2

Nice! The genctr is catching bugs already...
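
(For anyone following along: the command_id now carries a small 
generation counter in its upper bits next to the block-layer tag, and 
completions are checked against it, which is what produces the "genctr 
mismatch" message above. A rough sketch of the idea, with invented 
constants rather than the kernel's actual bit layout:)

#include <stdint.h>
#include <stdio.h>

/* Illustrative only: bit widths and names are invented for this sketch. */
#define TAG_BITS 10
#define TAG_MASK ((1u << TAG_BITS) - 1)

static uint16_t make_cid(uint16_t tag, uint8_t genctr)
{
        return (uint16_t)(((genctr & 0x3f) << TAG_BITS) | (tag & TAG_MASK));
}

static int cid_genctr_ok(uint16_t cid, uint8_t expected_genctr)
{
        return (cid >> TAG_BITS) == expected_genctr;
}

int main(void)
{
        /* tag 0 reused; the host is now on generation 1 */
        uint16_t stale = make_cid(0, 0);
        uint16_t fresh = make_cid(0, 1);

        printf("stale cid accepted? %d\n", cid_genctr_ok(stale, 1)); /* 0 */
        printf("fresh cid accepted? %d\n", cid_genctr_ok(fresh, 1)); /* 1 */
        return 0;
}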

> [102885.487356] nvme nvme0: receive failed:  -2
> [102885.487357] nvme nvme0: starting error recovery
> [102885.487444] block nvme0n1: no usable path - requeuing I/O
> [102885.502051] nvme nvme0: Reconnecting in 10 seconds...
> [102893.291877] nvme nvme0: Removing ctrl: NQN 

Did you trigger this removal? It's 2 seconds before
the reconnect attempt was supposed to run.

> "nqn.2014-08.org.nvmexpress:uuid:62f37f51-0cc7-46d5-9865-4de22e81bd9d"
> [102895.535149] nvme nvme0: queue_size 128 > ctrl sqsize 32, clamping down
> [102895.535214] nvme nvme0: creating 2 I/O queues.
> [102895.535943] nvme nvme0: mapped 2/0/0 default/read/poll queues.
> [102895.536187] nvme nvme0: Failed reconnect attempt 1
> [103700.406060] INFO: task kworker/u4:0:14965 blocked for more than 491 
> seconds.
> [103700.406068]       Tainted: G            E     5.15.0-rc3-54-default+ 
> #862
> [103700.406071] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> disables this message.
> [103700.406073] task:kworker/u4:0    state:D stack:    0 pid:14965 ppid: 
>      2 flags:0x00004000
> [103700.406117] Workqueue: nvme-wq nvme_scan_work [nvme_core]
> [103700.406143] Call Trace:
> [103700.406146]  __schedule+0x302/0x13a0
> [103700.406157]  ? block_read_full_page+0x21e/0x3b0
> [103700.406164]  ? blkdev_direct_IO+0x4a0/0x4a0
> [103700.406172]  schedule+0x3a/0xa0
> [103700.406177]  io_schedule+0x12/0x40
> [103700.406182]  do_read_cache_page+0x49b/0x790
> 
> 481:~ # cat /proc/15761/stack
> [<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core]
> [<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core]
> [<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core]
> [<0>] nvme_sysfs_delete+0x42/0x60 [nvme_core]
> [<0>] kernfs_fop_write_iter+0x12f/0x1a0
> [<0>] new_sync_write+0x122/0x1b0
> [<0>] vfs_write+0x1eb/0x250
> [<0>] ksys_write+0xa1/0xe0
> [<0>] do_syscall_64+0x3a/0x80
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
> c481:~ # cat /proc/14965/stack
> [<0>] do_read_cache_page+0x49b/0x790
> [<0>] read_part_sector+0x39/0xe0
> [<0>] read_lba+0xf9/0x1d0
> [<0>] efi_partition+0xf1/0x7f0
> [<0>] bdev_disk_changed+0x1ee/0x550
> [<0>] blkdev_get_whole+0x81/0x90
> [<0>] blkdev_get_by_dev+0x128/0x2e0
> [<0>] device_add_disk+0x377/0x3c0
> [<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
> [<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
> [<0>] nvme_alloc_ns+0x417/0x950 [nvme_core]
> [<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
> [<0>] nvme_scan_work+0x168/0x310 [nvme_core]
> [<0>] process_one_work+0x231/0x420
> [<0>] worker_thread+0x2d/0x3f0
> [<0>] kthread+0x11a/0x140
> [<0>] ret_from_fork+0x22/0x30
> 
> My theory here is that the partition scanning code just calls into the 
> pagecache, which doesn't set a timeout for any I/O operation.
> As this is done under the scan_mutex we cannot clear the active paths, 
> and consequently we hang.

But the controller removal should have cancelled all inflight
commands...

Maybe we're missing unfreeze? Hannes, can you try this one?
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index e29c47114739..783fde36d2ba 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1974,8 +1974,11 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
         nvme_sync_io_queues(ctrl);
         nvme_tcp_stop_io_queues(ctrl);
         nvme_cancel_tagset(ctrl);
-       if (remove)
+       if (remove) {
                 nvme_start_queues(ctrl);
+               nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT);
+               nvme_unfreeze(ctrl);
+       }
         nvme_tcp_destroy_io_queues(ctrl, remove);
  }
--



* Re: Deadlock on failure to read NVMe namespace
  2021-10-19 12:31 ` Sagi Grimberg
@ 2021-10-19 12:40   ` Hannes Reinecke
  2021-10-19 14:27     ` Sagi Grimberg
  0 siblings, 1 reply; 10+ messages in thread
From: Hannes Reinecke @ 2021-10-19 12:40 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig; +Cc: linux-nvme, Keith Busch

On 10/19/21 2:31 PM, Sagi Grimberg wrote:
> 
>> Hi all,
>>
>> during testing my test target (https://github.com/hreinecke/nofuse) 
>> I've managed to run into a deadlock (with nvme-5.16):
>>
>> c481:~ # dmesg
>> [102885.483587] nvme nvme0: sqsize 128 > ctrl maxcmd 32, clamping down
>> [102885.483711] nvme nvme0: creating 2 I/O queues.
>> [102885.484130] nvme nvme0: mapped 2/0/0 default/read/poll queues.
>> [102885.485527] nvme nvme0: new ctrl: NQN 
>> "nqn.2014-08.org.nvmexpress:uuid:62f37f51-0cc7-46d5-9865-4de22e81bd9d", addr 
>> 127.0.0.1:8009
>> [102885.487351] nvme nvme0: request 0x0 genctr mismatch (got 0x0 
>> expected 0x1)
>> [102885.487354] nvme nvme0: got bad c2hdata.command_id 0x0 on queue 2
> 
> Nice! The genctr is catching bugs already...
> 
>> [102885.487356] nvme nvme0: receive failed:  -2
>> [102885.487357] nvme nvme0: starting error recovery
>> [102885.487444] block nvme0n1: no usable path - requeuing I/O
>> [102885.502051] nvme nvme0: Reconnecting in 10 seconds...
>> [102893.291877] nvme nvme0: Removing ctrl: NQN 
> 
> Did you trigger this removal? It's 2 seconds before
> the reconnect attempt was supposed to run.
> 
>> "nqn.2014-08.org.nvmexpress:uuid:62f37f51-0cc7-46d5-9865-4de22e81bd9d"
>> [102895.535149] nvme nvme0: queue_size 128 > ctrl sqsize 32, clamping 
>> down
>> [102895.535214] nvme nvme0: creating 2 I/O queues.
>> [102895.535943] nvme nvme0: mapped 2/0/0 default/read/poll queues.
>> [102895.536187] nvme nvme0: Failed reconnect attempt 1
>> [103700.406060] INFO: task kworker/u4:0:14965 blocked for more than 
>> 491 seconds.
>> [103700.406068]       Tainted: G            E     
>> 5.15.0-rc3-54-default+ #862
>> [103700.406071] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
>> disables this message.
>> [103700.406073] task:kworker/u4:0    state:D stack:    0 pid:14965 
>> ppid:      2 flags:0x00004000
>> [103700.406117] Workqueue: nvme-wq nvme_scan_work [nvme_core]
>> [103700.406143] Call Trace:
>> [103700.406146]  __schedule+0x302/0x13a0
>> [103700.406157]  ? block_read_full_page+0x21e/0x3b0
>> [103700.406164]  ? blkdev_direct_IO+0x4a0/0x4a0
>> [103700.406172]  schedule+0x3a/0xa0
>> [103700.406177]  io_schedule+0x12/0x40
>> [103700.406182]  do_read_cache_page+0x49b/0x790
>>
>> 481:~ # cat /proc/15761/stack
>> [<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core]
>> [<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core]
>> [<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core]
>> [<0>] nvme_sysfs_delete+0x42/0x60 [nvme_core]
>> [<0>] kernfs_fop_write_iter+0x12f/0x1a0
>> [<0>] new_sync_write+0x122/0x1b0
>> [<0>] vfs_write+0x1eb/0x250
>> [<0>] ksys_write+0xa1/0xe0
>> [<0>] do_syscall_64+0x3a/0x80
>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
>> c481:~ # cat /proc/14965/stack
>> [<0>] do_read_cache_page+0x49b/0x790
>> [<0>] read_part_sector+0x39/0xe0
>> [<0>] read_lba+0xf9/0x1d0
>> [<0>] efi_partition+0xf1/0x7f0
>> [<0>] bdev_disk_changed+0x1ee/0x550
>> [<0>] blkdev_get_whole+0x81/0x90
>> [<0>] blkdev_get_by_dev+0x128/0x2e0
>> [<0>] device_add_disk+0x377/0x3c0
>> [<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
>> [<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
>> [<0>] nvme_alloc_ns+0x417/0x950 [nvme_core]
>> [<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
>> [<0>] nvme_scan_work+0x168/0x310 [nvme_core]
>> [<0>] process_one_work+0x231/0x420
>> [<0>] worker_thread+0x2d/0x3f0
>> [<0>] kthread+0x11a/0x140
>> [<0>] ret_from_fork+0x22/0x30
>>
>> My theory here is that the partition scanning code just calls into the 
>> pagecache, which doesn't set a timeout for any I/O operation.
>> As this is done under the scan_mutex we cannot clear the active paths, 
>> and consequently we hang.
> 
> But the controller removal should have cancelled all inflight
> commands...
> 
> Maybe we're missing unfreeze? Hannes, can you try this one?
> -- 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index e29c47114739..783fde36d2ba 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1974,8 +1974,11 @@ static void nvme_tcp_teardown_io_queues(struct 
> nvme_ctrl *ctrl,
>          nvme_sync_io_queues(ctrl);
>          nvme_tcp_stop_io_queues(ctrl);
>          nvme_cancel_tagset(ctrl);
> -       if (remove)
> +       if (remove) {
>                  nvme_start_queues(ctrl);
> +               nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT);
> +               nvme_unfreeze(ctrl);
> +       }
>          nvme_tcp_destroy_io_queues(ctrl, remove);
>   }
> -- 
Nope. Same problem.

I managed to make the problem go away (for some definitions of 'go 
away') with this patch:

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index fb96e900dd3a..30d1154eb611 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -141,12 +141,12 @@ void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl)
  {
         struct nvme_ns *ns;

-       mutex_lock(&ctrl->scan_lock);
         down_read(&ctrl->namespaces_rwsem);
         list_for_each_entry(ns, &ctrl->namespaces, list)
                 if (nvme_mpath_clear_current_path(ns))
                         kblockd_schedule_work(&ns->head->requeue_work);
         up_read(&ctrl->namespaces_rwsem);
+       mutex_lock(&ctrl->scan_lock);
         mutex_unlock(&ctrl->scan_lock);
  }

But I'd be the first to agree that this really is hackish.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



* Re: Deadlock on failure to read NVMe namespace
  2021-10-19 12:40   ` Hannes Reinecke
@ 2021-10-19 14:27     ` Sagi Grimberg
  2021-10-19 15:04       ` Hannes Reinecke
  2021-10-19 15:06       ` Sagi Grimberg
  0 siblings, 2 replies; 10+ messages in thread
From: Sagi Grimberg @ 2021-10-19 14:27 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: linux-nvme, Keith Busch, Anton Eidelman


>>> 481:~ # cat /proc/15761/stack
>>> [<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core]
>>> [<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core]
>>> [<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core]
>>> [<0>] nvme_sysfs_delete+0x42/0x60 [nvme_core]
>>> [<0>] kernfs_fop_write_iter+0x12f/0x1a0
>>> [<0>] new_sync_write+0x122/0x1b0
>>> [<0>] vfs_write+0x1eb/0x250
>>> [<0>] ksys_write+0xa1/0xe0
>>> [<0>] do_syscall_64+0x3a/0x80
>>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
>>> c481:~ # cat /proc/14965/stack
>>> [<0>] do_read_cache_page+0x49b/0x790
>>> [<0>] read_part_sector+0x39/0xe0
>>> [<0>] read_lba+0xf9/0x1d0
>>> [<0>] efi_partition+0xf1/0x7f0
>>> [<0>] bdev_disk_changed+0x1ee/0x550
>>> [<0>] blkdev_get_whole+0x81/0x90
>>> [<0>] blkdev_get_by_dev+0x128/0x2e0
>>> [<0>] device_add_disk+0x377/0x3c0
>>> [<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
>>> [<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
>>> [<0>] nvme_alloc_ns+0x417/0x950 [nvme_core]
>>> [<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
>>> [<0>] nvme_scan_work+0x168/0x310 [nvme_core]
>>> [<0>] process_one_work+0x231/0x420
>>> [<0>] worker_thread+0x2d/0x3f0
>>> [<0>] kthread+0x11a/0x140
>>> [<0>] ret_from_fork+0x22/0x30
>>>
>>> My theory here is that the partition scanning code just calls into 
>>> the pagecache, which doesn't set a timeout for any I/O operation.
>>> As this is done under the scan_mutex we cannot clear the active 
>>> paths, and consequently we hang.
>>
>> But the controller removal should have cancelled all inflight
>> commands...
>>
>> Maybe we're missing unfreeze? Hannes, can you try this one?
>> -- 
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index e29c47114739..783fde36d2ba 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -1974,8 +1974,11 @@ static void nvme_tcp_teardown_io_queues(struct 
>> nvme_ctrl *ctrl,
>>          nvme_sync_io_queues(ctrl);
>>          nvme_tcp_stop_io_queues(ctrl);
>>          nvme_cancel_tagset(ctrl);
>> -       if (remove)
>> +       if (remove) {
>>                  nvme_start_queues(ctrl);
>> +               nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT);
>> +               nvme_unfreeze(ctrl);
>> +       }
>>          nvme_tcp_destroy_io_queues(ctrl, remove);
>>   }
>> -- 
> Nope. Same problem.
> 
> I managed to make the problem go away (for some definitions of 'go 
> away') with this patch:
> 
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index fb96e900dd3a..30d1154eb611 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -141,12 +141,12 @@ void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl 
> *ctrl)
>   {
>          struct nvme_ns *ns;
> 
> -       mutex_lock(&ctrl->scan_lock);
>          down_read(&ctrl->namespaces_rwsem);
>          list_for_each_entry(ns, &ctrl->namespaces, list)
>                  if (nvme_mpath_clear_current_path(ns))
>                          kblockd_schedule_work(&ns->head->requeue_work);
>          up_read(&ctrl->namespaces_rwsem);
> +       mutex_lock(&ctrl->scan_lock);
>          mutex_unlock(&ctrl->scan_lock);
>   }
> 
> But I'd be the first to agree that this really is hackish.

Yea, that doesn't solve anything.

I think this sequence is familiar and was addressed by a fix from Anton
(CC'd) which still has some pending review comments.

Can you lookup and try:
[PATCH] nvme/mpath: fix hang when disk goes live over reconnect



* Re: Deadlock on failure to read NVMe namespace
  2021-10-19 14:27     ` Sagi Grimberg
@ 2021-10-19 15:04       ` Hannes Reinecke
  2021-10-19 15:09         ` Sagi Grimberg
  2021-10-19 15:06       ` Sagi Grimberg
  1 sibling, 1 reply; 10+ messages in thread
From: Hannes Reinecke @ 2021-10-19 15:04 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig; +Cc: linux-nvme, Keith Busch, Anton Eidelman

On 10/19/21 4:27 PM, Sagi Grimberg wrote:
> 
>>>> 481:~ # cat /proc/15761/stack
>>>> [<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core]
>>>> [<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core]
>>>> [<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core]
>>>> [<0>] nvme_sysfs_delete+0x42/0x60 [nvme_core]
>>>> [<0>] kernfs_fop_write_iter+0x12f/0x1a0
>>>> [<0>] new_sync_write+0x122/0x1b0
>>>> [<0>] vfs_write+0x1eb/0x250
>>>> [<0>] ksys_write+0xa1/0xe0
>>>> [<0>] do_syscall_64+0x3a/0x80
>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>> c481:~ # cat /proc/14965/stack
>>>> [<0>] do_read_cache_page+0x49b/0x790
>>>> [<0>] read_part_sector+0x39/0xe0
>>>> [<0>] read_lba+0xf9/0x1d0
>>>> [<0>] efi_partition+0xf1/0x7f0
>>>> [<0>] bdev_disk_changed+0x1ee/0x550
>>>> [<0>] blkdev_get_whole+0x81/0x90
>>>> [<0>] blkdev_get_by_dev+0x128/0x2e0
>>>> [<0>] device_add_disk+0x377/0x3c0
>>>> [<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
>>>> [<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
>>>> [<0>] nvme_alloc_ns+0x417/0x950 [nvme_core]
>>>> [<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
>>>> [<0>] nvme_scan_work+0x168/0x310 [nvme_core]
>>>> [<0>] process_one_work+0x231/0x420
>>>> [<0>] worker_thread+0x2d/0x3f0
>>>> [<0>] kthread+0x11a/0x140
>>>> [<0>] ret_from_fork+0x22/0x30
>>>>
>>>> My theory here is that the partition scanning code just calls into 
>>>> the pagecache, which doesn't set a timeout for any I/O operation.
>>>> As this is done under the scan_mutex we cannot clear the active 
>>>> paths, and consequently we hang.
>>>
>>> But the controller removal should have cancelled all inflight
>>> commands...
>>>
>>> Maybe we're missing unfreeze? Hannes, can you try this one?
>>> -- 
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index e29c47114739..783fde36d2ba 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -1974,8 +1974,11 @@ static void nvme_tcp_teardown_io_queues(struct 
>>> nvme_ctrl *ctrl,
>>>          nvme_sync_io_queues(ctrl);
>>>          nvme_tcp_stop_io_queues(ctrl);
>>>          nvme_cancel_tagset(ctrl);
>>> -       if (remove)
>>> +       if (remove) {
>>>                  nvme_start_queues(ctrl);
>>> +               nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT);
>>> +               nvme_unfreeze(ctrl);
>>> +       }
>>>          nvme_tcp_destroy_io_queues(ctrl, remove);
>>>   }
>>> -- 
>> Nope. Same problem.
>>
>> I managed to make the problem go away (for some definitions of 'go 
>> away') with this patch:
>>
>> diff --git a/drivers/nvme/host/multipath.c 
>> b/drivers/nvme/host/multipath.c
>> index fb96e900dd3a..30d1154eb611 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -141,12 +141,12 @@ void nvme_mpath_clear_ctrl_paths(struct 
>> nvme_ctrl *ctrl)
>>   {
>>          struct nvme_ns *ns;
>>
>> -       mutex_lock(&ctrl->scan_lock);
>>          down_read(&ctrl->namespaces_rwsem);
>>          list_for_each_entry(ns, &ctrl->namespaces, list)
>>                  if (nvme_mpath_clear_current_path(ns))
>>                          kblockd_schedule_work(&ns->head->requeue_work);
>>          up_read(&ctrl->namespaces_rwsem);
>> +       mutex_lock(&ctrl->scan_lock);
>>          mutex_unlock(&ctrl->scan_lock);
>>   }
>>
>> But I'd be the first to agree that this really is hackish.
> 
> Yea, that doesn't solve anything.
> 
> I think this sequence is familiar and was addressed by a fix from Anton
> (CC'd) which still has some pending review comments.
> 
> Can you lookup and try:
> [PATCH] nvme/mpath: fix hang when disk goes live over reconnect

I actually had looked at it, but then decided it's (trying) to address a 
different issue.

Thing is, I'll see it being stuck even _before_ disconnect happens; 
we're stuck with the _initial_ scan:

[ 3920.880552]  device_add_disk+0x377/0x3c0
[ 3920.880558]  nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
[ 3920.880574]  nvme_mpath_add_disk+0x150/0x160 [nvme_core]
[ 3920.880588]  ? device_add_disk+0x27d/0x3c0
[ 3920.880593]  nvme_alloc_ns+0x417/0x950 [nvme_core]
[ 3920.880606]  nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
[ 3920.880618]  ? __nvme_submit_sync_cmd+0x19b/0x210 [nvme_core]
[ 3920.880631]  nvme_scan_work+0x168/0x310 [nvme_core]

and can't make progress, as the I/O is _NOT_ failed but instead triggers 
a reconnect:
[ 1364.442390] nvme nvme0: request 0x0 genctr mismatch (got 0x0 expected 
0x2)
[ 1364.442408] nvme nvme0: got bad c2hdata.command_id 0x0 on queue 2
[ 1364.442411] nvme nvme0: receive failed:  -2
[ 1364.442414] nvme nvme0: starting error recovery
[ 1364.442502] block nvme0n1: no usable path - requeuing I/O

so we don't have a usable path, and will requeue I/O.
But upon reconnect we will retry _the same_ I/O, and get into the same 
state.

So the I/O will be stuck until disconnect happens.
But disconnect does a
nvme_do_delete_ctrl()
-> nvme_remove_namespaces()
    -> nvme_mpath_clear_ctrl_paths()
       -> mutex_lock(&scan_mutex)

and hangs.

Everything would be solved if we had _aborted_ the invalid I/O instead 
of forcing a reconnect; which, incidentally, is a far better way to 
handle TCP PDU sequence errors than the current way of dropping the 
connection. Which would only help if it were a transport artifact, but 
I've yet to see a fabric which eats individual bits of a frame;
a programming error is far more likely, in which case an I/O error would be 
a far better response.

But I guess I'll need to raise it at the fmds call...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



* Re: Deadlock on failure to read NVMe namespace
  2021-10-19 14:27     ` Sagi Grimberg
  2021-10-19 15:04       ` Hannes Reinecke
@ 2021-10-19 15:06       ` Sagi Grimberg
  2021-10-19 15:41         ` Hannes Reinecke
  1 sibling, 1 reply; 10+ messages in thread
From: Sagi Grimberg @ 2021-10-19 15:06 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: linux-nvme, Keith Busch, Anton Eidelman



On 10/19/21 5:27 PM, Sagi Grimberg wrote:
> 
>>>> 481:~ # cat /proc/15761/stack
>>>> [<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core]
>>>> [<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core]
>>>> [<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core]
>>>> [<0>] nvme_sysfs_delete+0x42/0x60 [nvme_core]
>>>> [<0>] kernfs_fop_write_iter+0x12f/0x1a0
>>>> [<0>] new_sync_write+0x122/0x1b0
>>>> [<0>] vfs_write+0x1eb/0x250
>>>> [<0>] ksys_write+0xa1/0xe0
>>>> [<0>] do_syscall_64+0x3a/0x80
>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>> c481:~ # cat /proc/14965/stack
>>>> [<0>] do_read_cache_page+0x49b/0x790
>>>> [<0>] read_part_sector+0x39/0xe0
>>>> [<0>] read_lba+0xf9/0x1d0
>>>> [<0>] efi_partition+0xf1/0x7f0
>>>> [<0>] bdev_disk_changed+0x1ee/0x550
>>>> [<0>] blkdev_get_whole+0x81/0x90
>>>> [<0>] blkdev_get_by_dev+0x128/0x2e0
>>>> [<0>] device_add_disk+0x377/0x3c0
>>>> [<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
>>>> [<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
>>>> [<0>] nvme_alloc_ns+0x417/0x950 [nvme_core]
>>>> [<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
>>>> [<0>] nvme_scan_work+0x168/0x310 [nvme_core]
>>>> [<0>] process_one_work+0x231/0x420
>>>> [<0>] worker_thread+0x2d/0x3f0
>>>> [<0>] kthread+0x11a/0x140
>>>> [<0>] ret_from_fork+0x22/0x30

...

> I think this sequence is familiar and was addressed by a fix from Anton
> (CC'd) which still has some pending review comments.
> 
> Can you lookup and try:
> [PATCH] nvme/mpath: fix hang when disk goes live over reconnect

Actually, I see the trace is coming from nvme_alloc_ns, not the ANA 
update path, so that is unlikely to address the issue.

Looking at nvme_mpath_clear_ctrl_paths, I don't think it should
take the scan_lock anymore. IIRC the reason why it needed the
scan_lock in the first place was the fact that ctrl->namespaces
was added and then sorted in scan_work (taking the namespaces_rwsem
twice).

But now that ctrl->namespaces is always sorted, and accessed with
namespaces_rwsem, I think that scan_lock is no longer needed
here and namespaces_rwsem is sufficient.
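
Concretely that would leave just the existing loop, i.e. the hunk from 
Hannes' earlier patch with the scan_lock dropped entirely (untested 
sketch):

void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl)
{
        struct nvme_ns *ns;

        down_read(&ctrl->namespaces_rwsem);
        list_for_each_entry(ns, &ctrl->namespaces, list)
                if (nvme_mpath_clear_current_path(ns))
                        kblockd_schedule_work(&ns->head->requeue_work);
        up_read(&ctrl->namespaces_rwsem);
}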

Thoughts?



* Re: Deadlock on failure to read NVMe namespace
  2021-10-19 15:04       ` Hannes Reinecke
@ 2021-10-19 15:09         ` Sagi Grimberg
  2021-10-19 15:49           ` Hannes Reinecke
  0 siblings, 1 reply; 10+ messages in thread
From: Sagi Grimberg @ 2021-10-19 15:09 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: linux-nvme, Keith Busch, Anton Eidelman



On 10/19/21 6:04 PM, Hannes Reinecke wrote:
> On 10/19/21 4:27 PM, Sagi Grimberg wrote:
>>
>>>>> 481:~ # cat /proc/15761/stack
>>>>> [<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core]
>>>>> [<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core]
>>>>> [<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core]
>>>>> [<0>] nvme_sysfs_delete+0x42/0x60 [nvme_core]
>>>>> [<0>] kernfs_fop_write_iter+0x12f/0x1a0
>>>>> [<0>] new_sync_write+0x122/0x1b0
>>>>> [<0>] vfs_write+0x1eb/0x250
>>>>> [<0>] ksys_write+0xa1/0xe0
>>>>> [<0>] do_syscall_64+0x3a/0x80
>>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>> c481:~ # cat /proc/14965/stack
>>>>> [<0>] do_read_cache_page+0x49b/0x790
>>>>> [<0>] read_part_sector+0x39/0xe0
>>>>> [<0>] read_lba+0xf9/0x1d0
>>>>> [<0>] efi_partition+0xf1/0x7f0
>>>>> [<0>] bdev_disk_changed+0x1ee/0x550
>>>>> [<0>] blkdev_get_whole+0x81/0x90
>>>>> [<0>] blkdev_get_by_dev+0x128/0x2e0
>>>>> [<0>] device_add_disk+0x377/0x3c0
>>>>> [<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
>>>>> [<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
>>>>> [<0>] nvme_alloc_ns+0x417/0x950 [nvme_core]
>>>>> [<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
>>>>> [<0>] nvme_scan_work+0x168/0x310 [nvme_core]
>>>>> [<0>] process_one_work+0x231/0x420
>>>>> [<0>] worker_thread+0x2d/0x3f0
>>>>> [<0>] kthread+0x11a/0x140
>>>>> [<0>] ret_from_fork+0x22/0x30
>>>>>
>>>>> My theory here is that the partition scanning code just calls into 
>>>>> the pagecache, which doesn't set a timeout for any I/O operation.
>>>>> As this is done under the scan_mutex we cannot clear the active 
>>>>> paths, and consequently we hang.
>>>>
>>>> But the controller removal should have cancelled all inflight
>>>> commands...
>>>>
>>>> Maybe we're missing unfreeze? Hannes, can you try this one?
>>>> -- 
>>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>>> index e29c47114739..783fde36d2ba 100644
>>>> --- a/drivers/nvme/host/tcp.c
>>>> +++ b/drivers/nvme/host/tcp.c
>>>> @@ -1974,8 +1974,11 @@ static void 
>>>> nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
>>>>          nvme_sync_io_queues(ctrl);
>>>>          nvme_tcp_stop_io_queues(ctrl);
>>>>          nvme_cancel_tagset(ctrl);
>>>> -       if (remove)
>>>> +       if (remove) {
>>>>                  nvme_start_queues(ctrl);
>>>> +               nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT);
>>>> +               nvme_unfreeze(ctrl);
>>>> +       }
>>>>          nvme_tcp_destroy_io_queues(ctrl, remove);
>>>>   }
>>>> -- 
>>> Nope. Same problem.
>>>
>>> I managed to make the problem go away (for some definitions of 'go 
>>> away') with this patch:
>>>
>>> diff --git a/drivers/nvme/host/multipath.c 
>>> b/drivers/nvme/host/multipath.c
>>> index fb96e900dd3a..30d1154eb611 100644
>>> --- a/drivers/nvme/host/multipath.c
>>> +++ b/drivers/nvme/host/multipath.c
>>> @@ -141,12 +141,12 @@ void nvme_mpath_clear_ctrl_paths(struct 
>>> nvme_ctrl *ctrl)
>>>   {
>>>          struct nvme_ns *ns;
>>>
>>> -       mutex_lock(&ctrl->scan_lock);
>>>          down_read(&ctrl->namespaces_rwsem);
>>>          list_for_each_entry(ns, &ctrl->namespaces, list)
>>>                  if (nvme_mpath_clear_current_path(ns))
>>>                          kblockd_schedule_work(&ns->head->requeue_work);
>>>          up_read(&ctrl->namespaces_rwsem);
>>> +       mutex_lock(&ctrl->scan_lock);
>>>          mutex_unlock(&ctrl->scan_lock);
>>>   }
>>>
>>> But I'd be the first to agree that this really is hackish.
>>
>> Yea, that doesn't solve anything.
>>
>> I think this sequence is familiar and was addressed by a fix from Anton
>> (CC'd) which still has some pending review comments.
>>
>> Can you lookup and try:
>> [PATCH] nvme/mpath: fix hang when disk goes live over reconnect
> 
> I actually had looked at it, but then decided it's (trying) to address a 
> different issue.
> 
> Thing is, I'll see it being stuck even _before_ disconnect happens; 
> we're stuck with the _initial_ scan:
> 
> [ 3920.880552]  device_add_disk+0x377/0x3c0
> [ 3920.880558]  nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
> [ 3920.880574]  nvme_mpath_add_disk+0x150/0x160 [nvme_core]
> [ 3920.880588]  ? device_add_disk+0x27d/0x3c0
> [ 3920.880593]  nvme_alloc_ns+0x417/0x950 [nvme_core]
> [ 3920.880606]  nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
> [ 3920.880618]  ? __nvme_submit_sync_cmd+0x19b/0x210 [nvme_core]
> [ 3920.880631]  nvme_scan_work+0x168/0x310 [nvme_core]
> 
> and can't make progress, as the I/O is _NOT_ failed but instead triggers 
> a reconnect:
> [ 1364.442390] nvme nvme0: request 0x0 genctr mismatch (got 0x0 expected 
> 0x2)
> [ 1364.442408] nvme nvme0: got bad c2hdata.command_id 0x0 on queue 2
> [ 1364.442411] nvme nvme0: receive failed:  -2
> [ 1364.442414] nvme nvme0: starting error recovery
> [ 1364.442502] block nvme0n1: no usable path - requeuing I/O
> 
> so we don't have a usable path, and will requeue I/O.
> But upon reconnect we will retry _the same_ I/O, and get into the same 
> state.
> 
> So the I/O will be stuck until disconnect happens.
> But disconnect does a
> nvme_do_delete_ctrl()
> -> nvme_remove_namespaces()
>     -> nvme_mpath_clear_ctrl_paths()
>        -> mutex_lock(&scan_mutex)
> 
> and hangs.
> 
> Everything would be solved if we had _aborted_ the invalid I/O instead 
> of forcing a reconnect;

But you don't know which I/O to abort, you got a bad command_id.

> which, incidentally, is a far better way to
> handle TCP PDU sequence errors than the current way of dropping the 
> connection.

This is not a transport error, a digest error would be a transport
error, this is a controller logic error.

> Which would only help if it were a transport artifact, but 
> I've yet to see a fabric which eats individual bits of a frame;
> a programming error is far more likely, in which case an I/O error would be 
> a far better response.

Again, which I/O?



* Re: Deadlock on failure to read NVMe namespace
  2021-10-19 15:06       ` Sagi Grimberg
@ 2021-10-19 15:41         ` Hannes Reinecke
  2021-10-19 20:13           ` Sagi Grimberg
  0 siblings, 1 reply; 10+ messages in thread
From: Hannes Reinecke @ 2021-10-19 15:41 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig; +Cc: linux-nvme, Keith Busch, Anton Eidelman

On 10/19/21 5:06 PM, Sagi Grimberg wrote:
> 
> 
> On 10/19/21 5:27 PM, Sagi Grimberg wrote:
>>
>>>>> 481:~ # cat /proc/15761/stack
>>>>> [<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core]
>>>>> [<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core]
>>>>> [<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core]
>>>>> [<0>] nvme_sysfs_delete+0x42/0x60 [nvme_core]
>>>>> [<0>] kernfs_fop_write_iter+0x12f/0x1a0
>>>>> [<0>] new_sync_write+0x122/0x1b0
>>>>> [<0>] vfs_write+0x1eb/0x250
>>>>> [<0>] ksys_write+0xa1/0xe0
>>>>> [<0>] do_syscall_64+0x3a/0x80
>>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>> c481:~ # cat /proc/14965/stack
>>>>> [<0>] do_read_cache_page+0x49b/0x790
>>>>> [<0>] read_part_sector+0x39/0xe0
>>>>> [<0>] read_lba+0xf9/0x1d0
>>>>> [<0>] efi_partition+0xf1/0x7f0
>>>>> [<0>] bdev_disk_changed+0x1ee/0x550
>>>>> [<0>] blkdev_get_whole+0x81/0x90
>>>>> [<0>] blkdev_get_by_dev+0x128/0x2e0
>>>>> [<0>] device_add_disk+0x377/0x3c0
>>>>> [<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
>>>>> [<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
>>>>> [<0>] nvme_alloc_ns+0x417/0x950 [nvme_core]
>>>>> [<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
>>>>> [<0>] nvme_scan_work+0x168/0x310 [nvme_core]
>>>>> [<0>] process_one_work+0x231/0x420
>>>>> [<0>] worker_thread+0x2d/0x3f0
>>>>> [<0>] kthread+0x11a/0x140
>>>>> [<0>] ret_from_fork+0x22/0x30
> 
> ...
> 
>> I think this sequence is familiar and was addressed by a fix from Anton
>> (CC'd) which still has some pending review comments.
>>
>> Can you lookup and try:
>> [PATCH] nvme/mpath: fix hang when disk goes live over reconnect
> 
> Actually, I see the trace is coming from nvme_alloc_ns, not the ANA
> update path, so that is unlikely to address the issue.
> 
> Looking at nvme_mpath_clear_ctrl_paths, I don't think it should
> take the scan_lock anymore. IIRC the reason why it needed the
> scan_lock in the first place was the fact that ctrl->namespaces
> was added and then sorted in scan_work (taking the namespaces_rwsem
> twice).
> 
> But now that ctrl->namespaces is always sorted, and accessed with
> namespaces_rwsem, I think that scan_lock is no longer needed
> here and namespaces_rwsem is sufficient.
> 
... which was precisely what my initial patch did.
While it worked in the sense that 'nvme disconnect' completed, we did not 
terminate the outstanding I/O, as no current path is set and hence this:

	down_read(&ctrl->namespaces_rwsem);
	list_for_each_entry(ns, &ctrl->namespaces, list)
		if (nvme_mpath_clear_current_path(ns))
			kblockd_schedule_work(&ns->head->requeue_work);
	up_read(&ctrl->namespaces_rwsem);

doesn't do anything, in particular does _not_ flush the requeue work.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



* Re: Deadlock on failure to read NVMe namespace
  2021-10-19 15:09         ` Sagi Grimberg
@ 2021-10-19 15:49           ` Hannes Reinecke
  0 siblings, 0 replies; 10+ messages in thread
From: Hannes Reinecke @ 2021-10-19 15:49 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig; +Cc: linux-nvme, Keith Busch, Anton Eidelman

On 10/19/21 5:09 PM, Sagi Grimberg wrote:
> 
> 
> On 10/19/21 6:04 PM, Hannes Reinecke wrote:
>> On 10/19/21 4:27 PM, Sagi Grimberg wrote:
>>>
>>>>>> 481:~ # cat /proc/15761/stack
>>>>>> [<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core]
>>>>>> [<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core]
>>>>>> [<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core]
>>>>>> [<0>] nvme_sysfs_delete+0x42/0x60 [nvme_core]
>>>>>> [<0>] kernfs_fop_write_iter+0x12f/0x1a0
>>>>>> [<0>] new_sync_write+0x122/0x1b0
>>>>>> [<0>] vfs_write+0x1eb/0x250
>>>>>> [<0>] ksys_write+0xa1/0xe0
>>>>>> [<0>] do_syscall_64+0x3a/0x80
>>>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>>> c481:~ # cat /proc/14965/stack
>>>>>> [<0>] do_read_cache_page+0x49b/0x790
>>>>>> [<0>] read_part_sector+0x39/0xe0
>>>>>> [<0>] read_lba+0xf9/0x1d0
>>>>>> [<0>] efi_partition+0xf1/0x7f0
>>>>>> [<0>] bdev_disk_changed+0x1ee/0x550
>>>>>> [<0>] blkdev_get_whole+0x81/0x90
>>>>>> [<0>] blkdev_get_by_dev+0x128/0x2e0
>>>>>> [<0>] device_add_disk+0x377/0x3c0
>>>>>> [<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
>>>>>> [<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
>>>>>> [<0>] nvme_alloc_ns+0x417/0x950 [nvme_core]
>>>>>> [<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
>>>>>> [<0>] nvme_scan_work+0x168/0x310 [nvme_core]
>>>>>> [<0>] process_one_work+0x231/0x420
>>>>>> [<0>] worker_thread+0x2d/0x3f0
>>>>>> [<0>] kthread+0x11a/0x140
>>>>>> [<0>] ret_from_fork+0x22/0x30
>>>>>>
>>>>>> My theory here is that the partition scanning code just calls into 
>>>>>> the pagecache, which doesn't set a timeout for any I/O operation.
>>>>>> As this is done under the scan_mutex we cannot clear the active 
>>>>>> paths, and consequently we hang.
>>>>>
>>>>> But the controller removal should have cancelled all inflight
>>>>> commands...
>>>>>
>>>>> Maybe we're missing unfreeze? Hannes, can you try this one?
>>>>> -- 
>>>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>>>> index e29c47114739..783fde36d2ba 100644
>>>>> --- a/drivers/nvme/host/tcp.c
>>>>> +++ b/drivers/nvme/host/tcp.c
>>>>> @@ -1974,8 +1974,11 @@ static void 
>>>>> nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
>>>>>          nvme_sync_io_queues(ctrl);
>>>>>          nvme_tcp_stop_io_queues(ctrl);
>>>>>          nvme_cancel_tagset(ctrl);
>>>>> -       if (remove)
>>>>> +       if (remove) {
>>>>>                  nvme_start_queues(ctrl);
>>>>> +               nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT);
>>>>> +               nvme_unfreeze(ctrl);
>>>>> +       }
>>>>>          nvme_tcp_destroy_io_queues(ctrl, remove);
>>>>>   }
>>>>> -- 
>>>> Nope. Same problem.
>>>>
>>>> I managed to make the problem go away (for some definitions of 'go 
>>>> away') with this patch:
>>>>
>>>> diff --git a/drivers/nvme/host/multipath.c 
>>>> b/drivers/nvme/host/multipath.c
>>>> index fb96e900dd3a..30d1154eb611 100644
>>>> --- a/drivers/nvme/host/multipath.c
>>>> +++ b/drivers/nvme/host/multipath.c
>>>> @@ -141,12 +141,12 @@ void nvme_mpath_clear_ctrl_paths(struct 
>>>> nvme_ctrl *ctrl)
>>>>   {
>>>>          struct nvme_ns *ns;
>>>>
>>>> -       mutex_lock(&ctrl->scan_lock);
>>>>          down_read(&ctrl->namespaces_rwsem);
>>>>          list_for_each_entry(ns, &ctrl->namespaces, list)
>>>>                  if (nvme_mpath_clear_current_path(ns))
>>>>                          
>>>> kblockd_schedule_work(&ns->head->requeue_work);
>>>>          up_read(&ctrl->namespaces_rwsem);
>>>> +       mutex_lock(&ctrl->scan_lock);
>>>>          mutex_unlock(&ctrl->scan_lock);
>>>>   }
>>>>
>>>> But I'd be the first to agree that this really is hackish.
>>>
>>> Yea, that doesn't solve anything.
>>>
>>> I think this sequence is familiar and was addressed by a fix from Anton
>>> (CC'd) which still has some pending review comments.
>>>
>>> Can you lookup and try:
>>> [PATCH] nvme/mpath: fix hang when disk goes live over reconnect
>>
>> I actually had looked at it, but then decided it's (trying) to address 
>> a different issue.
>>
>> Thing is, I'll see it being stuck even _before_ disconnect happens; 
>> we're stuck with the _initial_ scan:
>>
>> [ 3920.880552]  device_add_disk+0x377/0x3c0
>> [ 3920.880558]  nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
>> [ 3920.880574]  nvme_mpath_add_disk+0x150/0x160 [nvme_core]
>> [ 3920.880588]  ? device_add_disk+0x27d/0x3c0
>> [ 3920.880593]  nvme_alloc_ns+0x417/0x950 [nvme_core]
>> [ 3920.880606]  nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
>> [ 3920.880618]  ? __nvme_submit_sync_cmd+0x19b/0x210 [nvme_core]
>> [ 3920.880631]  nvme_scan_work+0x168/0x310 [nvme_core]
>>
>> and can't make progress, as the I/O is _NOT_ failed but instead triggers 
>> a reconnect:
>> [ 1364.442390] nvme nvme0: request 0x0 genctr mismatch (got 0x0 
>> expected 0x2)
>> [ 1364.442408] nvme nvme0: got bad c2hdata.command_id 0x0 on queue 2
>> [ 1364.442411] nvme nvme0: receive failed:  -2
>> [ 1364.442414] nvme nvme0: starting error recovery
>> [ 1364.442502] block nvme0n1: no usable path - requeuing I/O
>>
>> so we don't have a usable path, and will requeue I/O.
>> But upon reconnect we will retry _the same_ I/O, and get into the same 
>> state.
>>
>> So the I/O will be stuck until disconnect happens.
>> But disconnect does a
>> nvme_do_delete_ctrl()
>> -> nvme_remove_namespaces()
>>     -> nvme_mpath_clear_ctrl_paths()
>>        -> mutex_lock(&scan_mutex)
>>
>> and hangs.
>>
>> Everything would be solved if we had _aborted_ the invalid I/O instead 
>> of forcing a reconnect;
> 
> But you don't know which I/O to abort, you got a bad command_id.
> 

Yeah, you are right. I stand corrected.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



* Re: Deadlock on failure to read NVMe namespace
  2021-10-19 15:41         ` Hannes Reinecke
@ 2021-10-19 20:13           ` Sagi Grimberg
  0 siblings, 0 replies; 10+ messages in thread
From: Sagi Grimberg @ 2021-10-19 20:13 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig
  Cc: linux-nvme, Keith Busch, Anton Eidelman


>>>>>> 481:~ # cat /proc/15761/stack
>>>>>> [<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core]
>>>>>> [<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core]
>>>>>> [<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core]
>>>>>> [<0>] nvme_sysfs_delete+0x42/0x60 [nvme_core]
>>>>>> [<0>] kernfs_fop_write_iter+0x12f/0x1a0
>>>>>> [<0>] new_sync_write+0x122/0x1b0
>>>>>> [<0>] vfs_write+0x1eb/0x250
>>>>>> [<0>] ksys_write+0xa1/0xe0
>>>>>> [<0>] do_syscall_64+0x3a/0x80
>>>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>>> c481:~ # cat /proc/14965/stack
>>>>>> [<0>] do_read_cache_page+0x49b/0x790
>>>>>> [<0>] read_part_sector+0x39/0xe0
>>>>>> [<0>] read_lba+0xf9/0x1d0
>>>>>> [<0>] efi_partition+0xf1/0x7f0
>>>>>> [<0>] bdev_disk_changed+0x1ee/0x550
>>>>>> [<0>] blkdev_get_whole+0x81/0x90
>>>>>> [<0>] blkdev_get_by_dev+0x128/0x2e0
>>>>>> [<0>] device_add_disk+0x377/0x3c0
>>>>>> [<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
>>>>>> [<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
>>>>>> [<0>] nvme_alloc_ns+0x417/0x950 [nvme_core]
>>>>>> [<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
>>>>>> [<0>] nvme_scan_work+0x168/0x310 [nvme_core]
>>>>>> [<0>] process_one_work+0x231/0x420
>>>>>> [<0>] worker_thread+0x2d/0x3f0
>>>>>> [<0>] kthread+0x11a/0x140
>>>>>> [<0>] ret_from_fork+0x22/0x30
>>
>> ...
>>
>>> I think this sequence is familiar and was addressed by a fix from Anton
>>> (CC'd) which still has some pending review comments.
>>>
>>> Can you lookup and try:
>>> [PATCH] nvme/mpath: fix hang when disk goes live over reconnect
>>
>> Actually, I see the trace is coming from nvme_alloc_ns, not the ANA
>> update path, so that is unlikely to address the issue.
>>
>> Looking at nvme_mpath_clear_ctrl_paths, I don't think it should
>> take the scan_lock anymore. IIRC the reason why it needed the
>> scan_lock in the first place was the fact that ctrl->namespaces
>> was added and then sorted in scan_work (taking the namespaces_rwsem
>> twice).
>>
>> But now that ctrl->namespaces is always sorted, and accessed with
>> namespaces_rwsem, I think that scan_lock is no longer needed
>> here and namespaces_rwsem is sufficient.
>>
> ... which was precisely what my initial patch did.
> While it worked in the sense the 'nvme disconnect' completed, we did not 
> terminate the outstanding I/O as no current path is set and hence this:
> 
>      down_read(&ctrl->namespaces_rwsem);
>      list_for_each_entry(ns, &ctrl->namespaces, list)
>          if (nvme_mpath_clear_current_path(ns))
>              kblockd_schedule_work(&ns->head->requeue_work);
>      up_read(&ctrl->namespaces_rwsem);
> 
> doesn't do anything, in particular does _not_ flush the requeue work.

So you get hung I/O? Why does it need to flush it?

