From: Hannes Reinecke <hare@suse.de>
To: Sagi Grimberg <sagi@grimberg.me>, Christoph Hellwig <hch@lst.de>
Cc: "linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	Keith Busch <kbusch@kernel.org>,
	Anton Eidelman <anton@lightbitslabs.com>
Subject: Re: Deadlock on failure to read NVMe namespace
Date: Tue, 19 Oct 2021 17:04:42 +0200
Message-ID: <ed1d5970-60a8-88bb-c0d7-0c3e8eafbf7c@suse.de>
In-Reply-To: <f5a43c7e-608b-5191-dd73-11c6c8dc1953@grimberg.me>

On 10/19/21 4:27 PM, Sagi Grimberg wrote:
> 
>>>> 481:~ # cat /proc/15761/stack
>>>> [<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core]
>>>> [<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core]
>>>> [<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core]
>>>> [<0>] nvme_sysfs_delete+0x42/0x60 [nvme_core]
>>>> [<0>] kernfs_fop_write_iter+0x12f/0x1a0
>>>> [<0>] new_sync_write+0x122/0x1b0
>>>> [<0>] vfs_write+0x1eb/0x250
>>>> [<0>] ksys_write+0xa1/0xe0
>>>> [<0>] do_syscall_64+0x3a/0x80
>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>> c481:~ # cat /proc/14965/stack
>>>> [<0>] do_read_cache_page+0x49b/0x790
>>>> [<0>] read_part_sector+0x39/0xe0
>>>> [<0>] read_lba+0xf9/0x1d0
>>>> [<0>] efi_partition+0xf1/0x7f0
>>>> [<0>] bdev_disk_changed+0x1ee/0x550
>>>> [<0>] blkdev_get_whole+0x81/0x90
>>>> [<0>] blkdev_get_by_dev+0x128/0x2e0
>>>> [<0>] device_add_disk+0x377/0x3c0
>>>> [<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
>>>> [<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
>>>> [<0>] nvme_alloc_ns+0x417/0x950 [nvme_core]
>>>> [<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
>>>> [<0>] nvme_scan_work+0x168/0x310 [nvme_core]
>>>> [<0>] process_one_work+0x231/0x420
>>>> [<0>] worker_thread+0x2d/0x3f0
>>>> [<0>] kthread+0x11a/0x140
>>>> [<0>] ret_from_fork+0x22/0x30
>>>>
>>>> My theory here is that the partition scanning code just calls into 
>>>> the pagecache, which doesn't set a timeout for any I/O operation.
>>>> As this is done under the scan_mutex we cannot clear the active 
>>>> paths, and consequently we hang.
>>>
>>> But the controller removal should have cancelled all inflight
>>> commands...
>>>
>>> Maybe we're missing unfreeze? Hannes, can you try this one?
>>> -- 
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index e29c47114739..783fde36d2ba 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -1974,8 +1974,11 @@ static void nvme_tcp_teardown_io_queues(struct 
>>> nvme_ctrl *ctrl,
>>>          nvme_sync_io_queues(ctrl);
>>>          nvme_tcp_stop_io_queues(ctrl);
>>>          nvme_cancel_tagset(ctrl);
>>> -       if (remove)
>>> +       if (remove) {
>>>                  nvme_start_queues(ctrl);
>>> +               nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT);
>>> +               nvme_unfreeze(ctrl);
>>> +       }
>>>          nvme_tcp_destroy_io_queues(ctrl, remove);
>>>   }
>>> -- 
>> Nope. Same problem.
>>
>> I managed to make the problem go away (for some definitions of 'go 
>> away') with this patch:
>>
>> diff --git a/drivers/nvme/host/multipath.c 
>> b/drivers/nvme/host/multipath.c
>> index fb96e900dd3a..30d1154eb611 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -141,12 +141,12 @@ void nvme_mpath_clear_ctrl_paths(struct 
>> nvme_ctrl *ctrl)
>>   {
>>          struct nvme_ns *ns;
>>
>> -       mutex_lock(&ctrl->scan_lock);
>>          down_read(&ctrl->namespaces_rwsem);
>>          list_for_each_entry(ns, &ctrl->namespaces, list)
>>                  if (nvme_mpath_clear_current_path(ns))
>>                          kblockd_schedule_work(&ns->head->requeue_work);
>>          up_read(&ctrl->namespaces_rwsem);
>> +       mutex_lock(&ctrl->scan_lock);
>>          mutex_unlock(&ctrl->scan_lock);
>>   }
>>
>> But I'd be the first to agree that this really is hackish.
> 
> Yea, that doesn't solve anything.
> 
> I think this sequence is familiar and was addressed by a fix from Anton
> (CC'd) which still has some pending review comments.
> 
> Can you lookup and try:
> [PATCH] nvme/mpath: fix hang when disk goes live over reconnect

I had actually looked at it, but decided that it's (trying to) address 
a different issue.

Thing is, I see it getting stuck even _before_ any disconnect happens; 
we're stuck in the _initial_ scan:

[ 3920.880552]  device_add_disk+0x377/0x3c0
[ 3920.880558]  nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
[ 3920.880574]  nvme_mpath_add_disk+0x150/0x160 [nvme_core]
[ 3920.880588]  ? device_add_disk+0x27d/0x3c0
[ 3920.880593]  nvme_alloc_ns+0x417/0x950 [nvme_core]
[ 3920.880606]  nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
[ 3920.880618]  ? __nvme_submit_sync_cmd+0x19b/0x210 [nvme_core]
[ 3920.880631]  nvme_scan_work+0x168/0x310 [nvme_core]

and we can't make progress as the I/O is _NOT_ failed, but rather 
causes a reconnect:
[ 1364.442390] nvme nvme0: request 0x0 genctr mismatch (got 0x0 expected 0x2)
[ 1364.442408] nvme nvme0: got bad c2hdata.command_id 0x0 on queue 2
[ 1364.442411] nvme nvme0: receive failed:  -2
[ 1364.442414] nvme nvme0: starting error recovery
[ 1364.442502] block nvme0n1: no usable path - requeuing I/O

so we don't have a usable path, and will requeue I/O.
But upon reconnect we will retry _the same_ I/O, and get into the same 
state.

So the I/O will be stuck until disconnect happens.
But disconnect does a
nvme_do_delete_ctrl()
-> nvme_remove_namespaces()
    -> nvme_mpath_clear_ctrl_paths()
       -> mutex_lock(&scan_mutex)

and hangs.
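
For clarity, here is a minimal, self-contained userspace model of that 
wait cycle (pthread mutexes standing in for scan_mutex and the 
never-completing partition-scan read; purely illustrative, not the 
kernel code -- running it simply hangs, which is the point):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t scan_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t io_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t io_done = PTHREAD_COND_INITIALIZER;
static int io_completed;	/* only the delete path could ever set this */

/* "nvme_scan_work": alloc namespace, add disk, scan partitions */
static void *scan_worker(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&scan_mutex);
	printf("scan: holding scan_mutex, waiting for partition-scan I/O\n");
	pthread_mutex_lock(&io_lock);
	while (!io_completed)	/* no timeout in the page-cache read path */
		pthread_cond_wait(&io_done, &io_lock);
	pthread_mutex_unlock(&io_lock);
	pthread_mutex_unlock(&scan_mutex);	/* never reached */
	return NULL;
}

/* "nvme_do_delete_ctrl" -> "nvme_mpath_clear_ctrl_paths" */
static void *delete_ctrl(void *arg)
{
	(void)arg;
	sleep(1);	/* let the scan worker grab scan_mutex first */
	printf("delete: need scan_mutex to clear paths / fail the I/O\n");
	pthread_mutex_lock(&scan_mutex);	/* blocks forever */
	/* only here could we complete the I/O the scanner is waiting for */
	pthread_mutex_lock(&io_lock);
	io_completed = 1;
	pthread_cond_broadcast(&io_done);
	pthread_mutex_unlock(&io_lock);
	pthread_mutex_unlock(&scan_mutex);
	return NULL;
}

int main(void)
{
	pthread_t scan, del;

	pthread_create(&scan, NULL, scan_worker, NULL);
	pthread_create(&del, NULL, delete_ctrl, NULL);
	pthread_join(scan, NULL);	/* never returns: circular wait */
	pthread_join(del, NULL);
	return 0;
}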

Everything would be solved if we had _aborted_ the invalid I/O instead 
of forcing a reconnect; that would, incidentally, be a far better way 
of handling TCP PDU sequence errors than the current approach of 
dropping the connection. Dropping the connection only helps if the 
error were a transport artifact, but I've yet to see a fabric which 
eats individual bits of a frame; a programming error is far more 
likely, in which case an I/O error would be the far better response.
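
To make the argument concrete, a toy sketch of the policy difference 
(hypothetical names and structures, not the actual nvme-tcp receive 
path; it only models the choice between per-command abort and 
connection-level recovery):

#include <stdbool.h>
#include <stdio.h>

struct c2h_pdu { unsigned int command_id; };
struct cmd     { unsigned int command_id; int status; }; /* 0 = ok, -5 = EIO */

static void start_error_recovery(void)
{
	/* today: tear down the association, reconnect, retry the same I/O */
	printf("sequence error -> reconnect; command stays requeued\n");
}

static void fail_command(struct cmd *cmd)
{
	/* proposed: complete just this command with an error */
	cmd->status = -5;
	printf("sequence error -> command 0x%x completed with -EIO\n",
	       cmd->command_id);
}

static void handle_c2h_data(struct c2h_pdu *pdu, struct cmd *cmd, bool abort_io)
{
	if (pdu->command_id == cmd->command_id) {
		cmd->status = 0;	/* normal completion */
		return;
	}
	/* PDU sequence error, e.g. genctr / command_id mismatch */
	if (abort_io)
		fail_command(cmd);	/* scan sees the error and finishes */
	else
		start_error_recovery();	/* the retry loop described above */
}

int main(void)
{
	struct c2h_pdu bad = { .command_id = 0x0 };
	struct cmd     cmd = { .command_id = 0x2 };

	handle_c2h_data(&bad, &cmd, false);	/* current behaviour */
	handle_c2h_data(&bad, &cmd, true);	/* proposed behaviour */
	return 0;
}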

But I guess I'll need to raise it at the fmds call...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


