From: Sagi Grimberg <sagi@grimberg.me>
To: Keith Busch <kbusch@kernel.org>
Cc: linux-nvme@lists.infradead.org, Christoph Hellwig <hch@lst.de>,
Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com>
Subject: Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
Date: Tue, 16 Mar 2021 16:51:28 -0700
Message-ID: <59f7a030-ea33-5c31-3c18-197c5a12e982@grimberg.me>
In-Reply-To: <20210316204204.GA23332@redsun51.ssa.fujisawa.hgst.com>
>>> These patches are correct on their own, as they fix a controller reset
>>> regression.
>>>
>>> When we reset/teardown a controller, we must freeze and quiesce the namespaces'
>>> request queues to make sure that we safely stop inflight I/O submissions.
>>> Freeze is mandatory because if our hctx map changed between reconnects,
>>> blk_mq_update_nr_hw_queues will immediately attempt to freeze the queue, and
>>> if it still has pending submissions (that are still quiesced) it will hang.
>>> This is what the above patches fixed.
>>>
>>> However, by freezing the namespaces' request queues and only unfreezing them
>>> when we successfully reconnect, inflight submissions that are running
>>> concurrently can now block on grabbing the nshead srcu until either we
>>> successfully reconnect or ctrl_loss_tmo expires (or the user explicitly disconnects).
>>>
>>> This caused a deadlock [1] when a different controller (different path on the
>>> same subsystem) became live (i.e. optimized/non-optimized). This is because
>>> nvme_mpath_set_live needs to synchronize the nshead srcu before requeueing I/O
>>> in order to make sure that current_path is visible to future (re)submissions.
>>> However the srcu lock is taken by a blocked submission on a frozen request
>>> queue, and we have a deadlock.
>>>
>>> For multipath, we obviously cannot allow that as we want to failover I/O asap.
>>> However for non-mpath, we do not want to fail I/O (at least until controller
>>> FASTFAIL expires, and that is disabled by default btw).
>>>
>>> This creates a non-symmetric behavior of how the driver should behave in the
>>> presence or absence of multipath.
>>>
>>> [1]:
>>> Workqueue: nvme-wq nvme_tcp_reconnect_ctrl_work [nvme_tcp]
>>> Call Trace:
>>> __schedule+0x293/0x730
>>> schedule+0x33/0xa0
>>> schedule_timeout+0x1d3/0x2f0
>>> wait_for_completion+0xba/0x140
>>> __synchronize_srcu.part.21+0x91/0xc0
>>> synchronize_srcu_expedited+0x27/0x30
>>> synchronize_srcu+0xce/0xe0
>>> nvme_mpath_set_live+0x64/0x130 [nvme_core]
>>> nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
>>> nvme_update_ana_state+0xcd/0xe0 [nvme_core]
>>> nvme_parse_ana_log+0xa1/0x180 [nvme_core]
>>> nvme_read_ana_log+0x76/0x100 [nvme_core]
>>> nvme_mpath_init+0x122/0x180 [nvme_core]
>>> nvme_init_identify+0x80e/0xe20 [nvme_core]
>>> nvme_tcp_setup_ctrl+0x359/0x660 [nvme_tcp]
>>> nvme_tcp_reconnect_ctrl_work+0x24/0x70 [nvme_tcp]
>>>
>>>
>>> In order to fix this, we recognize the different behavior a driver needs to take
>>> in error recovery for mpath and non-mpath setups, and expose this
>>> awareness with a new helper, nvme_ctrl_is_mpath, which is used to decide what
>>> needs to be done.
>>
>> Christoph, Keith,
>>
>> Any thoughts on this? The RFC part is getting the transport driver to
>> behave differently for mpath vs. non-mpath.
>
> Will it work if nvme mpath used request NOWAIT flag for its submit_bio()
> call, and add the bio to the requeue_list if blk_queue_enter() fails? I
> think that looks like another way to resolve the deadlock, but we need
> the block layer to return a failed status to the original caller.
But who would kick the requeue list? And that would make performance stink
when we are near tag exhaustion...
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
Thread overview: 30+ messages
2021-03-15 22:27 [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs Sagi Grimberg
2021-03-15 22:27 ` [PATCH 1/3] nvme: introduce nvme_ctrl_is_mpath helper Sagi Grimberg
2021-03-15 22:27 ` [PATCH 2/3] nvme-tcp: fix possible hang when trying to set a live path during I/O Sagi Grimberg
2021-03-15 22:27 ` [PATCH 3/3] nvme-rdma: " Sagi Grimberg
2021-03-16 3:24 ` [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs Chao Leng
2021-03-16 5:04 ` Sagi Grimberg
2021-03-16 6:18 ` Chao Leng
2021-03-16 6:25 ` Sagi Grimberg
2021-03-16 20:07 ` Sagi Grimberg
2021-03-16 20:42 ` Keith Busch
2021-03-16 23:51 ` Sagi Grimberg [this message]
2021-03-17 2:55 ` Chao Leng
2021-03-17 6:59 ` Christoph Hellwig
2021-03-17 7:59 ` Chao Leng
2021-03-17 18:43 ` Sagi Grimberg
2021-03-18 1:51 ` Chao Leng
2021-03-18 4:45 ` Christoph Hellwig
2021-03-18 18:46 ` Sagi Grimberg
2021-03-18 19:16 ` Keith Busch
2021-03-18 19:31 ` Sagi Grimberg
2021-03-18 21:52 ` Keith Busch
2021-03-18 22:45 ` Sagi Grimberg
2021-03-19 14:05 ` Christoph Hellwig
2021-03-19 17:28 ` Christoph Hellwig
2021-03-19 19:07 ` Keith Busch
2021-03-19 19:34 ` Sagi Grimberg
2021-03-20 6:11 ` Christoph Hellwig
2021-03-21 6:49 ` Sagi Grimberg
2021-03-22 6:34 ` Christoph Hellwig
2021-03-17 8:16 ` Sagi Grimberg