linux-nvme.lists.infradead.org archive mirror
* [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
@ 2021-03-15 22:27 Sagi Grimberg
  2021-03-15 22:27 ` [PATCH 1/3] nvme: introduce nvme_ctrl_is_mpath helper Sagi Grimberg
                   ` (4 more replies)
  0 siblings, 5 replies; 30+ messages in thread
From: Sagi Grimberg @ 2021-03-15 22:27 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, Chaitanya Kulkarni

The below patches caused a regression in a multipath setup:
Fixes: 9f98772ba307 ("nvme-rdma: fix controller reset hang during traffic")
Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic")

These patches on their own are correct because they fixed a controller reset
regression.

When we reset/teardown a controller, we must freeze and quiesce the namespaces
request queues to make sure that we safely stop inflight I/O submissions.
Freeze is mandatory because if our hctx map changed between reconnects,
blk_mq_update_nr_hw_queues will immediately attempt to freeze the queue, and
if it still has pending submissions (that are still quiesced) it will hang.
This is what the above patches fixed.

However, by freezing the namespaces request queues, and only unfreezing them
when we successfully reconnect, inflight submissions that are running
concurrently can now block grabbing the nshead srcu until either we
successfully reconnect, ctrl_loss_tmo expires, or the user explicitly
disconnects.

This caused a deadlock [1] when a different controller (different path on the
same subsystem) became live (i.e. optimized/non-optimized). This is because
nvme_mpath_set_live needs to synchronize the nshead srcu before requeueing I/O
in order to make sure that current_path is visible to future (re)submissions.
However the srcu lock is taken by a blocked submission on a frozen request
queue, and we have a deadlock.
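
To make the two sides of the deadlock concrete, here is a rough sketch of the
paths involved (simplified for illustration, not the exact upstream code):

  /* side A: mpath bio submission, holds the nshead srcu read lock while
   * entering the bottom namespace request queue (which is frozen) */
  int srcu_idx = srcu_read_lock(&head->srcu);
  struct nvme_ns *ns = nvme_find_path(head);

  if (ns)
          /* blk_queue_enter() blocks here on the frozen queue,
           * with the srcu read lock still held */
          submit_bio_noacct(bio);
  srcu_read_unlock(&head->srcu, srcu_idx);

  /* side B: ANA update for another controller of the same subsystem,
   * from nvme_mpath_set_live() */
  synchronize_srcu(&ns->head->srcu);      /* waits forever for side A */
  kblockd_schedule_work(&ns->head->requeue_work);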

For multipath, we obviously cannot allow that as we want to fail over I/O asap.
However for non-mpath, we do not want to fail I/O (at least until the
controller FAILFAST timer, fast_io_fail_tmo, expires, and that is disabled by
default btw).

This creates an asymmetry in how the driver should behave in the
presence or absence of multipath.

[1]:
Workqueue: nvme-wq nvme_tcp_reconnect_ctrl_work [nvme_tcp]
Call Trace:
 __schedule+0x293/0x730
 schedule+0x33/0xa0
 schedule_timeout+0x1d3/0x2f0
 wait_for_completion+0xba/0x140
 __synchronize_srcu.part.21+0x91/0xc0
 synchronize_srcu_expedited+0x27/0x30
 synchronize_srcu+0xce/0xe0
 nvme_mpath_set_live+0x64/0x130 [nvme_core]
 nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
 nvme_update_ana_state+0xcd/0xe0 [nvme_core]
 nvme_parse_ana_log+0xa1/0x180 [nvme_core]
 nvme_read_ana_log+0x76/0x100 [nvme_core]
 nvme_mpath_init+0x122/0x180 [nvme_core]
 nvme_init_identify+0x80e/0xe20 [nvme_core]
 nvme_tcp_setup_ctrl+0x359/0x660 [nvme_tcp]
 nvme_tcp_reconnect_ctrl_work+0x24/0x70 [nvme_tcp]


In order to fix this, we recognize that the driver needs to behave differently
in error recovery scenarios for mpath vs. non-mpath setups, expose this
awareness with a new helper, nvme_ctrl_is_mpath, and use it to decide what
needs to be done.

Sagi Grimberg (3):
  nvme: introduce nvme_ctrl_is_mpath helper
  nvme-tcp: fix possible hang when trying to set a live path during I/O
  nvme-rdma: fix possible hang when trying to set a live path during I/O

 drivers/nvme/host/multipath.c |  5 +++--
 drivers/nvme/host/nvme.h      | 15 +++++++++++++++
 drivers/nvme/host/rdma.c      | 29 +++++++++++++++++------------
 drivers/nvme/host/tcp.c       | 30 +++++++++++++++++-------------
 4 files changed, 52 insertions(+), 27 deletions(-)

-- 
2.27.0


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 1/3] nvme: introduce nvme_ctrl_is_mpath helper
  2021-03-15 22:27 [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs Sagi Grimberg
@ 2021-03-15 22:27 ` Sagi Grimberg
  2021-03-15 22:27 ` [PATCH 2/3] nvme-tcp: fix possible hang when trying to set a live path during I/O Sagi Grimberg
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 30+ messages in thread
From: Sagi Grimberg @ 2021-03-15 22:27 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, Chaitanya Kulkarni

Given that the semantics of our error recovery and I/O failover logic are
different for multipath vs. non-multipath controllers, transports
need a helper to distinguish how to react upon error recovery
or controller resets (i.e. fail I/O or keep the queues
frozen/quiesced).
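
With such a helper, the transport teardown/reconnect path can do roughly the
following (illustrative sketch of what the next two patches implement for
tcp and rdma):

	if (nvme_ctrl_is_mpath(ctrl)) {
		/* mpath: let pending submissions enter and fail over now */
		nvme_start_queues(ctrl);
		nvme_wait_freeze(ctrl);
		nvme_unfreeze(ctrl);
	}
	/* non-mpath: queues stay frozen/quiesced until a successful reconnect */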

Fixes: 9f98772ba307 ("nvme-rdma: fix controller reset hang during traffic")
Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic")
[sagi: added the fixes tag so this can also go to stable with its respective
fixes]
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/multipath.c |  5 +++--
 drivers/nvme/host/nvme.h      | 15 +++++++++++++++
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index a1d476e1ac02..5c67a5e96738 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -8,10 +8,11 @@
 #include <trace/events/block.h>
 #include "nvme.h"
 
-static bool multipath = true;
+bool multipath = true;
 module_param(multipath, bool, 0444);
 MODULE_PARM_DESC(multipath,
 	"turn on native support for multiple controllers per subsystem");
+EXPORT_SYMBOL_GPL(multipath);
 
 void nvme_mpath_unfreeze(struct nvme_subsystem *subsys)
 {
@@ -372,7 +373,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
 	 * We also do this for private namespaces as the namespace sharing data could
 	 * change after a rescan.
 	 */
-	if (!(ctrl->subsys->cmic & NVME_CTRL_CMIC_MULTI_CTRL) || !multipath)
+	if (!nvme_ctrl_is_mpath(ctrl))
 		return 0;
 
 	q = blk_alloc_queue(ctrl->numa_node);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 07b34175c6ce..4f7c9970e3fc 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -840,4 +840,19 @@ struct nvme_ctrl *nvme_ctrl_from_file(struct file *file);
 struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned nsid);
 void nvme_put_ns(struct nvme_ns *ns);
 
+extern bool multipath;
+#ifdef CONFIG_NVME_MULTIPATH
+static inline bool nvme_ctrl_is_mpath(struct nvme_ctrl *ctrl)
+{
+	return !(!(ctrl->subsys->cmic & NVME_CTRL_CMIC_MULTI_CTRL) || !multipath);
+
+}
+#else
+static inline bool nvme_ctrl_is_mpath(struct nvme_ctrl *ctrl)
+{
+	return false;
+
+}
+#endif
+
 #endif /* _NVME_H */
-- 
2.27.0


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 2/3] nvme-tcp: fix possible hang when trying to set a live path during I/O
  2021-03-15 22:27 [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs Sagi Grimberg
  2021-03-15 22:27 ` [PATCH 1/3] nvme: introduce nvme_ctrl_is_mpath helper Sagi Grimberg
@ 2021-03-15 22:27 ` Sagi Grimberg
  2021-03-15 22:27 ` [PATCH 3/3] nvme-rdma: " Sagi Grimberg
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 30+ messages in thread
From: Sagi Grimberg @ 2021-03-15 22:27 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, Chaitanya Kulkarni

When we tear down a controller we first freeze the queue to prevent
request submissions, and quiesce the queue to prevent request queueing,
and we only unfreeze/unquiesce when we successfully reconnect the
controller.

In case we attempt to set a live path (optimized/non-optimized) and
update the current_path reference, we first need to wait for any
ongoing dispatches (synchronize the head->srcu).

However, bio submissions _can_ block as the underlying controller queue
is frozen, which creates the deadlock below [1]. So it is clear that
the namespaces request queues must be unfrozen and unquiesced asap when
we tear down the controller.

However, when we are not in a multipath environment (!multipath, or cmic
indicates the ns isn't shared) we don't want to fail-fast the I/O, hence we
must keep the namespaces request queues frozen and quiesced and only
release them when the controller successfully reconnects (though FAILFAST
may fail the I/O sooner).

[1]:
Workqueue: nvme-wq nvme_tcp_reconnect_ctrl_work [nvme_tcp]
Call Trace:
 __schedule+0x293/0x730
 schedule+0x33/0xa0
 schedule_timeout+0x1d3/0x2f0
 wait_for_completion+0xba/0x140
 __synchronize_srcu.part.21+0x91/0xc0
 synchronize_srcu_expedited+0x27/0x30
 synchronize_srcu+0xce/0xe0
 nvme_mpath_set_live+0x64/0x130 [nvme_core]
 nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
 nvme_update_ana_state+0xcd/0xe0 [nvme_core]
 nvme_parse_ana_log+0xa1/0x180 [nvme_core]
 nvme_read_ana_log+0x76/0x100 [nvme_core]
 nvme_mpath_init+0x122/0x180 [nvme_core]
 nvme_init_identify+0x80e/0xe20 [nvme_core]
 nvme_tcp_setup_ctrl+0x359/0x660 [nvme_tcp]
 nvme_tcp_reconnect_ctrl_work+0x24/0x70 [nvme_tcp]

Fix this by consulting the newly introduced nvme_ctrl_is_mpath and
unquiescing/unfreezing the namespaces request queues accordingly (in
the teardown for mpath, and after a successful reconnect for non-mpath).

Also, we no longer need the explicit nvme_start_queues in the error
recovery work.

Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/tcp.c | 30 +++++++++++++++++-------------
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index a0f00cb8f9f3..b81649d0c12c 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1803,19 +1803,22 @@ static int nvme_tcp_configure_io_queues(struct nvme_ctrl *ctrl, bool new)
 		goto out_cleanup_connect_q;
 
 	if (!new) {
-		nvme_start_queues(ctrl);
-		if (!nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT)) {
-			/*
-			 * If we timed out waiting for freeze we are likely to
-			 * be stuck.  Fail the controller initialization just
-			 * to be safe.
-			 */
-			ret = -ENODEV;
-			goto out_wait_freeze_timed_out;
+		if (!nvme_ctrl_is_mpath(ctrl)) {
+			nvme_start_queues(ctrl);
+			if (!nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT)) {
+				/*
+				 * If we timed out waiting for freeze we are
+				 * likely to be stuck.  Fail the controller
+				 * initialization just to be safe.
+				 */
+				ret = -ENODEV;
+				goto out_wait_freeze_timed_out;
+			}
 		}
 		blk_mq_update_nr_hw_queues(ctrl->tagset,
 			ctrl->queue_count - 1);
-		nvme_unfreeze(ctrl);
+		if (!nvme_ctrl_is_mpath(ctrl))
+			nvme_unfreeze(ctrl);
 	}
 
 	return 0;
@@ -1934,8 +1937,11 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
 	nvme_sync_io_queues(ctrl);
 	nvme_tcp_stop_io_queues(ctrl);
 	nvme_cancel_tagset(ctrl);
-	if (remove)
+	if (nvme_ctrl_is_mpath(ctrl)) {
 		nvme_start_queues(ctrl);
+		nvme_wait_freeze(ctrl);
+		nvme_unfreeze(ctrl);
+	}
 	nvme_tcp_destroy_io_queues(ctrl, remove);
 }
 
@@ -2056,8 +2062,6 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
 
 	nvme_stop_keep_alive(ctrl);
 	nvme_tcp_teardown_io_queues(ctrl, false);
-	/* unquiesce to fail fast pending requests */
-	nvme_start_queues(ctrl);
 	nvme_tcp_teardown_admin_queue(ctrl, false);
 	blk_mq_unquiesce_queue(ctrl->admin_q);
 
-- 
2.27.0


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 3/3] nvme-rdma: fix possible hang when trying to set a live path during I/O
  2021-03-15 22:27 [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs Sagi Grimberg
  2021-03-15 22:27 ` [PATCH 1/3] nvme: introduce nvme_ctrl_is_mpath helper Sagi Grimberg
  2021-03-15 22:27 ` [PATCH 2/3] nvme-tcp: fix possible hang when trying to set a live path during I/O Sagi Grimberg
@ 2021-03-15 22:27 ` Sagi Grimberg
  2021-03-16  3:24 ` [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs Chao Leng
  2021-03-16 20:07 ` Sagi Grimberg
  4 siblings, 0 replies; 30+ messages in thread
From: Sagi Grimberg @ 2021-03-15 22:27 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, Chaitanya Kulkarni

When we tear down a controller we first freeze the queue to prevent
request submissions, and quiesce the queue to prevent request queueing,
and we only unfreeze/unquiesce when we successfully reconnect the
controller.

In case we attempt to set a live path (optimized/non-optimized) and
update the current_path reference, we first need to wait for any
ongoing dispatches (synchronize the head->srcu).

However, bio submissions _can_ block as the underlying controller queue
is frozen, which creates the deadlock below [1]. So it is clear that
the namespaces request queues must be unfrozen and unquiesced asap when
we tear down the controller.

However, when we are not in a multipath environment (!multipath, or cmic
indicates the ns isn't shared) we don't want to fail-fast the I/O, hence we
must keep the namespaces request queues frozen and quiesced and only
release them when the controller successfully reconnects (though FAILFAST
may fail the I/O sooner).

[1] (happened in nvme-tcp, but works the same in nvme-rdma):
Workqueue: nvme-wq nvme_tcp_reconnect_ctrl_work [nvme_tcp]
Call Trace:
 __schedule+0x293/0x730
 schedule+0x33/0xa0
 schedule_timeout+0x1d3/0x2f0
 wait_for_completion+0xba/0x140
 __synchronize_srcu.part.21+0x91/0xc0
 synchronize_srcu_expedited+0x27/0x30
 synchronize_srcu+0xce/0xe0
 nvme_mpath_set_live+0x64/0x130 [nvme_core]
 nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
 nvme_update_ana_state+0xcd/0xe0 [nvme_core]
 nvme_parse_ana_log+0xa1/0x180 [nvme_core]
 nvme_read_ana_log+0x76/0x100 [nvme_core]
 nvme_mpath_init+0x122/0x180 [nvme_core]
 nvme_init_identify+0x80e/0xe20 [nvme_core]
 nvme_tcp_setup_ctrl+0x359/0x660 [nvme_tcp]
 nvme_tcp_reconnect_ctrl_work+0x24/0x70 [nvme_tcp]

Fix this by consulting the newly introduced nvme_ctrl_is_mpath and
unquiescing/unfreezing the namespaces request queues accordingly (in
the teardown for mpath, and after a successful reconnect for non-mpath).

Also, we no longer need the explicit nvme_start_queues in the error
recovery work.

Fixes: 9f98772ba307 ("nvme-rdma: fix controller reset hang during traffic")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/rdma.c | 29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index be905d4fdb47..43e8608ad5b7 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -989,19 +989,22 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
 		goto out_cleanup_connect_q;
 
 	if (!new) {
-		nvme_start_queues(&ctrl->ctrl);
-		if (!nvme_wait_freeze_timeout(&ctrl->ctrl, NVME_IO_TIMEOUT)) {
-			/*
-			 * If we timed out waiting for freeze we are likely to
-			 * be stuck.  Fail the controller initialization just
-			 * to be safe.
-			 */
-			ret = -ENODEV;
-			goto out_wait_freeze_timed_out;
+		if (!nvme_ctrl_is_mpath(&ctrl->ctrl)) {
+			nvme_start_queues(&ctrl->ctrl);
+			if (!nvme_wait_freeze_timeout(&ctrl->ctrl, NVME_IO_TIMEOUT)) {
+				/*
+				 * If we timed out waiting for freeze we are likely to
+				 * be stuck.  Fail the controller initialization just
+				 * to be safe.
+				 */
+				ret = -ENODEV;
+				goto out_wait_freeze_timed_out;
+			}
 		}
 		blk_mq_update_nr_hw_queues(ctrl->ctrl.tagset,
 			ctrl->ctrl.queue_count - 1);
-		nvme_unfreeze(&ctrl->ctrl);
+		if (!nvme_ctrl_is_mpath(&ctrl->ctrl))
+			nvme_unfreeze(&ctrl->ctrl);
 	}
 
 	return 0;
@@ -1043,8 +1046,11 @@ static void nvme_rdma_teardown_io_queues(struct nvme_rdma_ctrl *ctrl,
 		nvme_sync_io_queues(&ctrl->ctrl);
 		nvme_rdma_stop_io_queues(ctrl);
 		nvme_cancel_tagset(&ctrl->ctrl);
-		if (remove)
+		if (nvme_ctrl_is_mpath(&ctrl->ctrl)) {
 			nvme_start_queues(&ctrl->ctrl);
+			nvme_wait_freeze(&ctrl->ctrl);
+			nvme_unfreeze(&ctrl->ctrl);
+		}
 		nvme_rdma_destroy_io_queues(ctrl, remove);
 	}
 }
@@ -1191,7 +1197,6 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
 
 	nvme_stop_keep_alive(&ctrl->ctrl);
 	nvme_rdma_teardown_io_queues(ctrl, false);
-	nvme_start_queues(&ctrl->ctrl);
 	nvme_rdma_teardown_admin_queue(ctrl, false);
 	blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
 
-- 
2.27.0


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
  2021-03-15 22:27 [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs Sagi Grimberg
                   ` (2 preceding siblings ...)
  2021-03-15 22:27 ` [PATCH 3/3] nvme-rdma: " Sagi Grimberg
@ 2021-03-16  3:24 ` Chao Leng
  2021-03-16  5:04   ` Sagi Grimberg
  2021-03-16 20:07 ` Sagi Grimberg
  4 siblings, 1 reply; 30+ messages in thread
From: Chao Leng @ 2021-03-16  3:24 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch,
	Chaitanya Kulkarni

Does the problem exist on the latest version?

We also found similar deadlocks in older versions.
However, with the latest code, it does not block grabbing the nshead srcu
when the ctrl is frozen.
Related patches:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/block/blk-core.c?id=fe2008640ae36e3920cf41507a84fb5d3227435a
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5a6c35f9af416114588298aa7a90b15bbed15a41
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/block/blk-core.c?id=ed00aabd5eb9fb44d6aff1173234a2e911b9fead
I am not sure they are the same problem.


On 2021/3/16 6:27, Sagi Grimberg wrote:
> The below patches caused a regression in a multipath setup:
> Fixes: 9f98772ba307 ("nvme-rdma: fix controller reset hang during traffic")
> Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic")
> 
> These patches on their own are correct because they fixed a controller reset
> regression.
> 
> When we reset/teardown a controller, we must freeze and quiesce the namespaces
> request queues to make sure that we safely stop inflight I/O submissions.
> Freeze is mandatory because if our hctx map changed between reconnects,
> blk_mq_update_nr_hw_queues will immediately attempt to freeze the queue, and
> if it still has pending submissions (that are still quiesced) it will hang.
> This is what the above patches fixed.
> 
> However, by freezing the namespaces request queues, and only unfreezing them
> when we successfully reconnect, inflight submissions that are running
> concurrently can now block grabbing the nshead srcu until either we successfully
> reconnect or ctrl_loss_tmo expired (or the user explicitly disconnected).
> 
> This caused a deadlock [1] when a different controller (different path on the
> same subsystem) became live (i.e. optimized/non-optimized). This is because
> nvme_mpath_set_live needs to synchronize the nshead srcu before requeueing I/O
> in order to make sure that current_path is visible to future (re)submisions.
> However the srcu lock is taken by a blocked submission on a frozen request
> queue, and we have a deadlock.
> 
> For multipath, we obviously cannot allow that as we want to failover I/O asap.
> However for non-mpath, we do not want to fail I/O (at least until controller
> FASTFAIL expires, and that is disabled by default btw).
> 
> This creates a non-symmetric behavior of how the driver should behave in the
> presence or absence of multipath.
> 
> [1]:
> Workqueue: nvme-wq nvme_tcp_reconnect_ctrl_work [nvme_tcp]
> Call Trace:
>   __schedule+0x293/0x730
>   schedule+0x33/0xa0
>   schedule_timeout+0x1d3/0x2f0
>   wait_for_completion+0xba/0x140
>   __synchronize_srcu.part.21+0x91/0xc0
>   synchronize_srcu_expedited+0x27/0x30
>   synchronize_srcu+0xce/0xe0
>   nvme_mpath_set_live+0x64/0x130 [nvme_core]
>   nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
>   nvme_update_ana_state+0xcd/0xe0 [nvme_core]
>   nvme_parse_ana_log+0xa1/0x180 [nvme_core]
>   nvme_read_ana_log+0x76/0x100 [nvme_core]
>   nvme_mpath_init+0x122/0x180 [nvme_core]
>   nvme_init_identify+0x80e/0xe20 [nvme_core]
>   nvme_tcp_setup_ctrl+0x359/0x660 [nvme_tcp]
>   nvme_tcp_reconnect_ctrl_work+0x24/0x70 [nvme_tcp]
> 
> 
> In order to fix this, we recognize the different behavior a driver needs to take
> in error recovery scenarios for mpath and non-mpath scenarios and expose this
> awareness with a new helper nvme_ctrl_is_mpath and use that to know what needs
> to be done.
> 
> Sagi Grimberg (3):
>    nvme: introduce nvme_ctrl_is_mpath helper
>    nvme-tcp: fix possible hang when trying to set a live path during I/O
>    nvme-rdma: fix possible hang when trying to set a live path during I/O
> 
>   drivers/nvme/host/multipath.c |  5 +++--
>   drivers/nvme/host/nvme.h      | 15 +++++++++++++++
>   drivers/nvme/host/rdma.c      | 29 +++++++++++++++++------------
>   drivers/nvme/host/tcp.c       | 30 +++++++++++++++++-------------
>   4 files changed, 52 insertions(+), 27 deletions(-)
> 

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
  2021-03-16  3:24 ` [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs Chao Leng
@ 2021-03-16  5:04   ` Sagi Grimberg
  2021-03-16  6:18     ` Chao Leng
  0 siblings, 1 reply; 30+ messages in thread
From: Sagi Grimberg @ 2021-03-16  5:04 UTC (permalink / raw)
  To: Chao Leng, linux-nvme, Christoph Hellwig, Keith Busch,
	Chaitanya Kulkarni


> Does the problem exist on the latest version?

This was seen on 5.4 stable, not upstream, but nothing prevents
this from happening in upstream code.

> 
> We also found Similar deadlocks in the older version.
> However, with the latest code, it do not block grabbing the nshead srcu
> when ctrl is freezed.
> related patches:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/block/blk-core.c?id=fe2008640ae36e3920cf41507a84fb5d3227435a 
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5a6c35f9af416114588298aa7a90b15bbed15a41 
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/block/blk-core.c?id=ed00aabd5eb9fb44d6aff1173234a2e911b9fead 
> 
> I am not sure they are the same problem.

Its not the same problem.

When we teardown the io queues, we freeze the namespaces request queues.
This means that concurrent mpath submit_bio calls can now block with
the srcu lock taken.

When another path calls nvme_mpath_set_live, it needs to wait for
the srcu to sync before kicking the requeue work (to make sure
the updated current_path is visible).

And this is where the hang is, the only thing that will free it
is if the offending controller reconnects (and unfreezes the queue)
or disconnects (automatically or manually). Both can take
a very long time or even forever in some cases.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
  2021-03-16  5:04   ` Sagi Grimberg
@ 2021-03-16  6:18     ` Chao Leng
  2021-03-16  6:25       ` Sagi Grimberg
  0 siblings, 1 reply; 30+ messages in thread
From: Chao Leng @ 2021-03-16  6:18 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch,
	Chaitanya Kulkarni



On 2021/3/16 13:04, Sagi Grimberg wrote:
> 
>> Does the problem exist on the latest version?
> 
> This was seen on 5.4 stable, not upstream but nothing prevents
> this from happening in upstream code.
> 
>>
>> We also found Similar deadlocks in the older version.
>> However, with the latest code, it do not block grabbing the nshead srcu
>> when ctrl is freezed.
>> related patches:
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/block/blk-core.c?id=fe2008640ae36e3920cf41507a84fb5d3227435a
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5a6c35f9af416114588298aa7a90b15bbed15a41
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/block/blk-core.c?id=ed00aabd5eb9fb44d6aff1173234a2e911b9fead
>> I am not sure they are the same problem.
> 
> Its not the same problem.
> 
> When we teardown the io queues, we freeze the namespaces request queues.
> This means that concurrent mpath submit_bio calls can now block with
> the srcu lock taken.
What is the call trace of ->submit_bio()?
The requeue work or normal submit bio?
> 
> When another path calls nvme_mpath_set_live, it needs to wait for
> the srcu to sync before kicking the requeue work (to make sure
> the updated current_path is visible).
> 
> And this is where the hang is, the only thing that will free it
> is if the offending controller reconnects (and unfreeze the queue)
> or it will disconnect (automatically or manually). Both can take
> a very long time or even forever in some cases.
> .

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
  2021-03-16  6:18     ` Chao Leng
@ 2021-03-16  6:25       ` Sagi Grimberg
  0 siblings, 0 replies; 30+ messages in thread
From: Sagi Grimberg @ 2021-03-16  6:25 UTC (permalink / raw)
  To: Chao Leng, linux-nvme, Christoph Hellwig, Keith Busch,
	Chaitanya Kulkarni


>>> We also found Similar deadlocks in the older version.
>>> However, with the latest code, it do not block grabbing the nshead srcu
>>> when ctrl is freezed.
>>> related patches:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/block/blk-core.c?id=fe2008640ae36e3920cf41507a84fb5d3227435a 
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5a6c35f9af416114588298aa7a90b15bbed15a41 
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/block/blk-core.c?id=ed00aabd5eb9fb44d6aff1173234a2e911b9fead 
>>>
>>> I am not sure they are the same problem.
>>
>> Its not the same problem.
>>
>> When we teardown the io queues, we freeze the namespaces request queues.
>> This means that concurrent mpath submit_bio calls can now block with
>> the srcu lock taken.
> What is the call trace of ->submit_bio()?
> The requeue work or normal submit bio?

Both.

submit_bio_noacct will try to get the queue->q_usage_counter
(blk_queue_enter), fail because the queue is frozen, and then
block until the queue unfreeze wakes it up to try again...
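
For reference, a simplified sketch of what blk_queue_enter() does (pm and rcu
details omitted, not the exact upstream code):

	int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
	{
		while (!percpu_ref_tryget_live(&q->q_usage_counter)) {
			if (flags & BLK_MQ_REQ_NOWAIT)
				return -EBUSY;
			/* queue is frozen: sleep until it is unfrozen */
			wait_event(q->mq_freeze_wq,
				   !q->mq_freeze_depth || blk_queue_dying(q));
			if (blk_queue_dying(q))
				return -ENODEV;
		}
		return 0;
	}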

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
  2021-03-15 22:27 [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs Sagi Grimberg
                   ` (3 preceding siblings ...)
  2021-03-16  3:24 ` [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs Chao Leng
@ 2021-03-16 20:07 ` Sagi Grimberg
  2021-03-16 20:42   ` Keith Busch
  4 siblings, 1 reply; 30+ messages in thread
From: Sagi Grimberg @ 2021-03-16 20:07 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, Chaitanya Kulkarni

> The below patches caused a regression in a multipath setup:
> Fixes: 9f98772ba307 ("nvme-rdma: fix controller reset hang during traffic")
> Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic")
> 
> These patches on their own are correct because they fixed a controller reset
> regression.
> 
> When we reset/teardown a controller, we must freeze and quiesce the namespaces
> request queues to make sure that we safely stop inflight I/O submissions.
> Freeze is mandatory because if our hctx map changed between reconnects,
> blk_mq_update_nr_hw_queues will immediately attempt to freeze the queue, and
> if it still has pending submissions (that are still quiesced) it will hang.
> This is what the above patches fixed.
> 
> However, by freezing the namespaces request queues, and only unfreezing them
> when we successfully reconnect, inflight submissions that are running
> concurrently can now block grabbing the nshead srcu until either we successfully
> reconnect or ctrl_loss_tmo expired (or the user explicitly disconnected).
> 
> This caused a deadlock [1] when a different controller (different path on the
> same subsystem) became live (i.e. optimized/non-optimized). This is because
> nvme_mpath_set_live needs to synchronize the nshead srcu before requeueing I/O
> in order to make sure that current_path is visible to future (re)submisions.
> However the srcu lock is taken by a blocked submission on a frozen request
> queue, and we have a deadlock.
> 
> For multipath, we obviously cannot allow that as we want to failover I/O asap.
> However for non-mpath, we do not want to fail I/O (at least until controller
> FASTFAIL expires, and that is disabled by default btw).
> 
> This creates a non-symmetric behavior of how the driver should behave in the
> presence or absence of multipath.
> 
> [1]:
> Workqueue: nvme-wq nvme_tcp_reconnect_ctrl_work [nvme_tcp]
> Call Trace:
>   __schedule+0x293/0x730
>   schedule+0x33/0xa0
>   schedule_timeout+0x1d3/0x2f0
>   wait_for_completion+0xba/0x140
>   __synchronize_srcu.part.21+0x91/0xc0
>   synchronize_srcu_expedited+0x27/0x30
>   synchronize_srcu+0xce/0xe0
>   nvme_mpath_set_live+0x64/0x130 [nvme_core]
>   nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
>   nvme_update_ana_state+0xcd/0xe0 [nvme_core]
>   nvme_parse_ana_log+0xa1/0x180 [nvme_core]
>   nvme_read_ana_log+0x76/0x100 [nvme_core]
>   nvme_mpath_init+0x122/0x180 [nvme_core]
>   nvme_init_identify+0x80e/0xe20 [nvme_core]
>   nvme_tcp_setup_ctrl+0x359/0x660 [nvme_tcp]
>   nvme_tcp_reconnect_ctrl_work+0x24/0x70 [nvme_tcp]
> 
> 
> In order to fix this, we recognize the different behavior a driver needs to take
> in error recovery scenarios for mpath and non-mpath scenarios and expose this
> awareness with a new helper nvme_ctrl_is_mpath and use that to know what needs
> to be done.

Christoph, Keith,

Any thoughts on this? The RFC part is getting the transport driver to
behave differently for mpath vs. non-mpath.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
  2021-03-16 20:07 ` Sagi Grimberg
@ 2021-03-16 20:42   ` Keith Busch
  2021-03-16 23:51     ` Sagi Grimberg
  0 siblings, 1 reply; 30+ messages in thread
From: Keith Busch @ 2021-03-16 20:42 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, Christoph Hellwig, Chaitanya Kulkarni

On Tue, Mar 16, 2021 at 01:07:12PM -0700, Sagi Grimberg wrote:
> > The below patches caused a regression in a multipath setup:
> > Fixes: 9f98772ba307 ("nvme-rdma: fix controller reset hang during traffic")
> > Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic")
> > 
> > These patches on their own are correct because they fixed a controller reset
> > regression.
> > 
> > When we reset/teardown a controller, we must freeze and quiesce the namespaces
> > request queues to make sure that we safely stop inflight I/O submissions.
> > Freeze is mandatory because if our hctx map changed between reconnects,
> > blk_mq_update_nr_hw_queues will immediately attempt to freeze the queue, and
> > if it still has pending submissions (that are still quiesced) it will hang.
> > This is what the above patches fixed.
> > 
> > However, by freezing the namespaces request queues, and only unfreezing them
> > when we successfully reconnect, inflight submissions that are running
> > concurrently can now block grabbing the nshead srcu until either we successfully
> > reconnect or ctrl_loss_tmo expired (or the user explicitly disconnected).
> > 
> > This caused a deadlock [1] when a different controller (different path on the
> > same subsystem) became live (i.e. optimized/non-optimized). This is because
> > nvme_mpath_set_live needs to synchronize the nshead srcu before requeueing I/O
> > in order to make sure that current_path is visible to future (re)submisions.
> > However the srcu lock is taken by a blocked submission on a frozen request
> > queue, and we have a deadlock.
> > 
> > For multipath, we obviously cannot allow that as we want to failover I/O asap.
> > However for non-mpath, we do not want to fail I/O (at least until controller
> > FASTFAIL expires, and that is disabled by default btw).
> > 
> > This creates a non-symmetric behavior of how the driver should behave in the
> > presence or absence of multipath.
> > 
> > [1]:
> > Workqueue: nvme-wq nvme_tcp_reconnect_ctrl_work [nvme_tcp]
> > Call Trace:
> >   __schedule+0x293/0x730
> >   schedule+0x33/0xa0
> >   schedule_timeout+0x1d3/0x2f0
> >   wait_for_completion+0xba/0x140
> >   __synchronize_srcu.part.21+0x91/0xc0
> >   synchronize_srcu_expedited+0x27/0x30
> >   synchronize_srcu+0xce/0xe0
> >   nvme_mpath_set_live+0x64/0x130 [nvme_core]
> >   nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
> >   nvme_update_ana_state+0xcd/0xe0 [nvme_core]
> >   nvme_parse_ana_log+0xa1/0x180 [nvme_core]
> >   nvme_read_ana_log+0x76/0x100 [nvme_core]
> >   nvme_mpath_init+0x122/0x180 [nvme_core]
> >   nvme_init_identify+0x80e/0xe20 [nvme_core]
> >   nvme_tcp_setup_ctrl+0x359/0x660 [nvme_tcp]
> >   nvme_tcp_reconnect_ctrl_work+0x24/0x70 [nvme_tcp]
> > 
> > 
> > In order to fix this, we recognize the different behavior a driver needs to take
> > in error recovery scenarios for mpath and non-mpath scenarios and expose this
> > awareness with a new helper nvme_ctrl_is_mpath and use that to know what needs
> > to be done.
> 
> Christoph, Keith,
> 
> Any thoughts on this? The RFC part is getting the transport driver to
> behave differently for mpath vs. non-mpath.

Will it work if nvme mpath used request NOWAIT flag for its submit_bio()
call, and add the bio to the requeue_list if blk_queue_enter() fails? I
think that looks like another way to resolve the deadlock, but we need
the block layer to return a failed status to the original caller.
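
Something along these lines, as a purely hypothetical sketch
(submit_bio_noacct_nowait() below does not exist; getting the failure back to
the submitter is exactly the missing piece mentioned above):

	/* in nvme_ns_head_submit_bio(), hypothetical */
	bio->bi_opf |= REQ_NOWAIT;
	if (submit_bio_noacct_nowait(bio) == -EAGAIN) {
		/* bottom queue is frozen: park the bio for resubmission */
		spin_lock_irq(&head->requeue_lock);
		bio_list_add(&head->requeue_list, bio);
		spin_unlock_irq(&head->requeue_lock);
	}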

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
  2021-03-16 20:42   ` Keith Busch
@ 2021-03-16 23:51     ` Sagi Grimberg
  2021-03-17  2:55       ` Chao Leng
  0 siblings, 1 reply; 30+ messages in thread
From: Sagi Grimberg @ 2021-03-16 23:51 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, Christoph Hellwig, Chaitanya Kulkarni


>>> These patches on their own are correct because they fixed a controller reset
>>> regression.
>>>
>>> When we reset/teardown a controller, we must freeze and quiesce the namespaces
>>> request queues to make sure that we safely stop inflight I/O submissions.
>>> Freeze is mandatory because if our hctx map changed between reconnects,
>>> blk_mq_update_nr_hw_queues will immediately attempt to freeze the queue, and
>>> if it still has pending submissions (that are still quiesced) it will hang.
>>> This is what the above patches fixed.
>>>
>>> However, by freezing the namespaces request queues, and only unfreezing them
>>> when we successfully reconnect, inflight submissions that are running
>>> concurrently can now block grabbing the nshead srcu until either we successfully
>>> reconnect or ctrl_loss_tmo expired (or the user explicitly disconnected).
>>>
>>> This caused a deadlock [1] when a different controller (different path on the
>>> same subsystem) became live (i.e. optimized/non-optimized). This is because
>>> nvme_mpath_set_live needs to synchronize the nshead srcu before requeueing I/O
>>> in order to make sure that current_path is visible to future (re)submisions.
>>> However the srcu lock is taken by a blocked submission on a frozen request
>>> queue, and we have a deadlock.
>>>
>>> For multipath, we obviously cannot allow that as we want to failover I/O asap.
>>> However for non-mpath, we do not want to fail I/O (at least until controller
>>> FASTFAIL expires, and that is disabled by default btw).
>>>
>>> This creates a non-symmetric behavior of how the driver should behave in the
>>> presence or absence of multipath.
>>>
>>> [1]:
>>> Workqueue: nvme-wq nvme_tcp_reconnect_ctrl_work [nvme_tcp]
>>> Call Trace:
>>>    __schedule+0x293/0x730
>>>    schedule+0x33/0xa0
>>>    schedule_timeout+0x1d3/0x2f0
>>>    wait_for_completion+0xba/0x140
>>>    __synchronize_srcu.part.21+0x91/0xc0
>>>    synchronize_srcu_expedited+0x27/0x30
>>>    synchronize_srcu+0xce/0xe0
>>>    nvme_mpath_set_live+0x64/0x130 [nvme_core]
>>>    nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
>>>    nvme_update_ana_state+0xcd/0xe0 [nvme_core]
>>>    nvme_parse_ana_log+0xa1/0x180 [nvme_core]
>>>    nvme_read_ana_log+0x76/0x100 [nvme_core]
>>>    nvme_mpath_init+0x122/0x180 [nvme_core]
>>>    nvme_init_identify+0x80e/0xe20 [nvme_core]
>>>    nvme_tcp_setup_ctrl+0x359/0x660 [nvme_tcp]
>>>    nvme_tcp_reconnect_ctrl_work+0x24/0x70 [nvme_tcp]
>>>
>>>
>>> In order to fix this, we recognize the different behavior a driver needs to take
>>> in error recovery scenarios for mpath and non-mpath scenarios and expose this
>>> awareness with a new helper nvme_ctrl_is_mpath and use that to know what needs
>>> to be done.
>>
>> Christoph, Keith,
>>
>> Any thoughts on this? The RFC part is getting the transport driver to
>> behave differently for mpath vs. non-mpath.
> 
> Will it work if nvme mpath used request NOWAIT flag for its submit_bio()
> call, and add the bio to the requeue_list if blk_queue_enter() fails? I
> think that looks like another way to resolve the deadlock, but we need
> the block layer to return a failed status to the original caller.

But who would kick the requeue list? and that would make 
near-tag-exhaust performance stink...

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
  2021-03-16 23:51     ` Sagi Grimberg
@ 2021-03-17  2:55       ` Chao Leng
  2021-03-17  6:59         ` Christoph Hellwig
  0 siblings, 1 reply; 30+ messages in thread
From: Chao Leng @ 2021-03-17  2:55 UTC (permalink / raw)
  To: Sagi Grimberg, Keith Busch
  Cc: linux-nvme, Christoph Hellwig, Chaitanya Kulkarni



On 2021/3/17 7:51, Sagi Grimberg wrote:
> 
>>>> These patches on their own are correct because they fixed a controller reset
>>>> regression.
>>>>
>>>> When we reset/teardown a controller, we must freeze and quiesce the namespaces
>>>> request queues to make sure that we safely stop inflight I/O submissions.
>>>> Freeze is mandatory because if our hctx map changed between reconnects,
>>>> blk_mq_update_nr_hw_queues will immediately attempt to freeze the queue, and
>>>> if it still has pending submissions (that are still quiesced) it will hang.
>>>> This is what the above patches fixed.
>>>>
>>>> However, by freezing the namespaces request queues, and only unfreezing them
>>>> when we successfully reconnect, inflight submissions that are running
>>>> concurrently can now block grabbing the nshead srcu until either we successfully
>>>> reconnect or ctrl_loss_tmo expired (or the user explicitly disconnected).
>>>>
>>>> This caused a deadlock [1] when a different controller (different path on the
>>>> same subsystem) became live (i.e. optimized/non-optimized). This is because
>>>> nvme_mpath_set_live needs to synchronize the nshead srcu before requeueing I/O
>>>> in order to make sure that current_path is visible to future (re)submisions.
>>>> However the srcu lock is taken by a blocked submission on a frozen request
>>>> queue, and we have a deadlock.
>>>>
>>>> For multipath, we obviously cannot allow that as we want to failover I/O asap.
>>>> However for non-mpath, we do not want to fail I/O (at least until controller
>>>> FASTFAIL expires, and that is disabled by default btw).
>>>>
>>>> This creates a non-symmetric behavior of how the driver should behave in the
>>>> presence or absence of multipath.
>>>>
>>>> [1]:
>>>> Workqueue: nvme-wq nvme_tcp_reconnect_ctrl_work [nvme_tcp]
>>>> Call Trace:
>>>>    __schedule+0x293/0x730
>>>>    schedule+0x33/0xa0
>>>>    schedule_timeout+0x1d3/0x2f0
>>>>    wait_for_completion+0xba/0x140
>>>>    __synchronize_srcu.part.21+0x91/0xc0
>>>>    synchronize_srcu_expedited+0x27/0x30
>>>>    synchronize_srcu+0xce/0xe0
>>>>    nvme_mpath_set_live+0x64/0x130 [nvme_core]
>>>>    nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
>>>>    nvme_update_ana_state+0xcd/0xe0 [nvme_core]
>>>>    nvme_parse_ana_log+0xa1/0x180 [nvme_core]
>>>>    nvme_read_ana_log+0x76/0x100 [nvme_core]
>>>>    nvme_mpath_init+0x122/0x180 [nvme_core]
>>>>    nvme_init_identify+0x80e/0xe20 [nvme_core]
>>>>    nvme_tcp_setup_ctrl+0x359/0x660 [nvme_tcp]
>>>>    nvme_tcp_reconnect_ctrl_work+0x24/0x70 [nvme_tcp]
>>>>
>>>>
>>>> In order to fix this, we recognize the different behavior a driver needs to take
>>>> in error recovery scenarios for mpath and non-mpath scenarios and expose this
>>>> awareness with a new helper nvme_ctrl_is_mpath and use that to know what needs
>>>> to be done.
>>>
>>> Christoph, Keith,
>>>
>>> Any thoughts on this? The RFC part is getting the transport driver to
>>> behave differently for mpath vs. non-mpath.
>>
>> Will it work if nvme mpath used request NOWAIT flag for its submit_bio()
>> call, and add the bio to the requeue_list if blk_queue_enter() fails? I
>> think that looks like another way to resolve the deadlock, but we need
>> the block layer to return a failed status to the original caller.
> 
> But who would kick the requeue list? and that would make near-tag-exhaust performance stink...
moving nvme_start_freeze from nvme_rdma_teardown_io_queues to nvme_rdma_configure_io_queues can fix it.
It can also avoid I/O hang long time if reconnection failed.
> 
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
> .

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
  2021-03-17  2:55       ` Chao Leng
@ 2021-03-17  6:59         ` Christoph Hellwig
  2021-03-17  7:59           ` Chao Leng
  2021-03-17  8:16           ` Sagi Grimberg
  0 siblings, 2 replies; 30+ messages in thread
From: Christoph Hellwig @ 2021-03-17  6:59 UTC (permalink / raw)
  To: Chao Leng
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Christoph Hellwig,
	Chaitanya Kulkarni

On Wed, Mar 17, 2021 at 10:55:57AM +0800, Chao Leng wrote:
>>> Will it work if nvme mpath used request NOWAIT flag for its submit_bio()
>>> call, and add the bio to the requeue_list if blk_queue_enter() fails? I
>>> think that looks like another way to resolve the deadlock, but we need
>>> the block layer to return a failed status to the original caller.

Yes, I think BLK_MQ_REQ_NOWAIT makes total sense here.  dm-mpath also
uses it for its request allocation for similar reasons.

>>
>> But who would kick the requeue list? and that would make near-tag-exhaust performance stink...

The multipath code would have to kick the list.  We could also try to
split into two flags, one that affects blk_queue_enter and one that
affects the tag allocation.

> moving nvme_start_freeze from nvme_rdma_teardown_io_queues to nvme_rdma_configure_io_queues can fix it.
> It can also avoid I/O hang long time if reconnection failed.

Can you explain how we'd still ensure that no new commands get queued
during teardown using that scheme?

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
  2021-03-17  6:59         ` Christoph Hellwig
@ 2021-03-17  7:59           ` Chao Leng
  2021-03-17 18:43             ` Sagi Grimberg
  2021-03-17  8:16           ` Sagi Grimberg
  1 sibling, 1 reply; 30+ messages in thread
From: Chao Leng @ 2021-03-17  7:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Chaitanya Kulkarni



On 2021/3/17 14:59, Christoph Hellwig wrote:
> On Wed, Mar 17, 2021 at 10:55:57AM +0800, Chao Leng wrote:
>>>> Will it work if nvme mpath used request NOWAIT flag for its submit_bio()
>>>> call, and add the bio to the requeue_list if blk_queue_enter() fails? I
>>>> think that looks like another way to resolve the deadlock, but we need
>>>> the block layer to return a failed status to the original caller.
> 
> Yes, I think BLK_MQ_REQ_NOWAIT makes total sense here.  dm-mpath also
> uses it for its request allocation for similar reasons.
> 
>>>
>>> But who would kick the requeue list? and that would make near-tag-exhaust performance stink...
> 
> The multipath code would have to kick the list.  We could also try to
> split into two flags, one that affects blk_queue_enter and one that
> affects the tag allocation.
> 
>> moving nvme_start_freeze from nvme_rdma_teardown_io_queues to nvme_rdma_configure_io_queues can fix it.
>> It can also avoid I/O hang long time if reconnection failed.
> 
> Can you explain how we'd still ensure that no new commands get queued
> during teardown using that scheme?
1. Teardown will cancel all inflight requests, and then multipath will clear the path.
2. Then we may freeze the controller.
3. nvme_ns_head_submit_bio cannot find the reconnecting controller as a valid path, so it is safe.
> .
> 

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
  2021-03-17  6:59         ` Christoph Hellwig
  2021-03-17  7:59           ` Chao Leng
@ 2021-03-17  8:16           ` Sagi Grimberg
  1 sibling, 0 replies; 30+ messages in thread
From: Sagi Grimberg @ 2021-03-17  8:16 UTC (permalink / raw)
  To: Christoph Hellwig, Chao Leng; +Cc: Keith Busch, linux-nvme, Chaitanya Kulkarni


>>>> Will it work if nvme mpath used request NOWAIT flag for its submit_bio()
>>>> call, and add the bio to the requeue_list if blk_queue_enter() fails? I
>>>> think that looks like another way to resolve the deadlock, but we need
>>>> the block layer to return a failed status to the original caller.
> 
> Yes, I think BLK_MQ_REQ_NOWAIT makes total sense here.

BTW, the specific hang reported is not blocking on tag allocation, but
rather on blk_queue_enter blocking on a frozen queue.

> dm-mpath also uses it for its request allocation for similar reasons.

That is the rq based dm, and I think it is because dm_mq_queue_rq is
non-blocking. Care to explain what is similar to nvme-mpath?

I don't see how bio based dm cares about any of this...

>>> But who would kick the requeue list? and that would make near-tag-exhaust performance stink...
> 
> The multipath code would have to kick the list.

When? Not following your thoughts...

You are suggesting that we call submit_bio that will fail, put it on the
requeue_list and then what? blindly kick the requeue list? try to see if
there is an alternate path and then kick the list? for every bio that
comes in?

> We could also try to
> split into two flags, one that affects blk_queue_enter and one that
> affects the tag allocation.

If this is something that can work reliably then it's a better approach, plus
we can probably kill the srcu as well. But I don't see how this would
work, unfortunately.

>> moving nvme_start_freeze from nvme_rdma_teardown_io_queues to nvme_rdma_configure_io_queues can fix it.
>> It can also avoid I/O hang long time if reconnection failed.
> 
> Can you explain how we'd still ensure that no new commands get queued
> during teardown using that scheme?

quiescing the queue would prevent new submissions from coming down to
the driver, but I don't see how this move can help here...

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
  2021-03-17  7:59           ` Chao Leng
@ 2021-03-17 18:43             ` Sagi Grimberg
  2021-03-18  1:51               ` Chao Leng
  0 siblings, 1 reply; 30+ messages in thread
From: Sagi Grimberg @ 2021-03-17 18:43 UTC (permalink / raw)
  To: Chao Leng, Christoph Hellwig; +Cc: Keith Busch, linux-nvme, Chaitanya Kulkarni


>>>>> Will it work if nvme mpath used request NOWAIT flag for its 
>>>>> submit_bio()
>>>>> call, and add the bio to the requeue_list if blk_queue_enter() 
>>>>> fails? I
>>>>> think that looks like another way to resolve the deadlock, but we need
>>>>> the block layer to return a failed status to the original caller.
>>
>> Yes, I think BLK_MQ_REQ_NOWAIT makes total sense here.  dm-mpath also
>> uses it for its request allocation for similar reasons.
>>
>>>>
>>>> But who would kick the requeue list? and that would make 
>>>> near-tag-exhaust performance stink...
>>
>> The multipath code would have to kick the list.  We could also try to
>> split into two flags, one that affects blk_queue_enter and one that
>> affects the tag allocation.
>>
>>> moving nvme_start_freeze from nvme_rdma_teardown_io_queues to 
>>> nvme_rdma_configure_io_queues can fix it.
>>> It can also avoid I/O hang long time if reconnection failed.
>>
>> Can you explain how we'd still ensure that no new commands get queued
>> during teardown using that scheme?
> 1. tear down will cancel all inflight requests, and then multipath will 
> clear the path.
> 2. and then we may freeze the controler.
> 3. nvme_ns_head_submit_bio can not find the reconnection controller as 
> valid path, so it is safe.

In non-mpath (which unfortunately is a valid use-case), there is no
failover, and we cannot freeze the queue after we stopped (and/or
started) the queues because then nvme_fail_nonready_command() constantly
returns BLK_STS_RESOURCE (just causing a re-submission over and over
again) and the freeze will never complete (the commands are still
inflight from the queue->q_usage_counter perspective).
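
For reference, nvme_fail_nonready_command() is roughly the following
(simplified, exact details vary between kernel versions):

	blk_status_t nvme_fail_nonready_command(struct nvme_ctrl *ctrl,
						struct request *rq)
	{
		if (ctrl->state != NVME_CTRL_DELETING_NOIO &&
		    ctrl->state != NVME_CTRL_DEAD &&
		    !blk_noretry_request(rq) &&
		    !(rq->cmd_flags & REQ_NVME_MPATH))
			/* non-mpath: requeued over and over while not ready */
			return BLK_STS_RESOURCE;
		/* mpath/failfast: complete with a path error so it can fail over */
		return nvme_host_path_error(rq);
	}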

So I think we should still start queue freeze before we quiesce
the queues.

I still don't see how the mpath NOWAIT suggestion works either...

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
  2021-03-17 18:43             ` Sagi Grimberg
@ 2021-03-18  1:51               ` Chao Leng
  2021-03-18  4:45                 ` Christoph Hellwig
  2021-03-18 18:46                 ` Sagi Grimberg
  0 siblings, 2 replies; 30+ messages in thread
From: Chao Leng @ 2021-03-18  1:51 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Chaitanya Kulkarni



On 2021/3/18 2:43, Sagi Grimberg wrote:
> 
>>>>>> Will it work if nvme mpath used request NOWAIT flag for its submit_bio()
>>>>>> call, and add the bio to the requeue_list if blk_queue_enter() fails? I
>>>>>> think that looks like another way to resolve the deadlock, but we need
>>>>>> the block layer to return a failed status to the original caller.
>>>
>>> Yes, I think BLK_MQ_REQ_NOWAIT makes total sense here.  dm-mpath also
>>> uses it for its request allocation for similar reasons.
>>>
>>>>>
>>>>> But who would kick the requeue list? and that would make near-tag-exhaust performance stink...
>>>
>>> The multipath code would have to kick the list.  We could also try to
>>> split into two flags, one that affects blk_queue_enter and one that
>>> affects the tag allocation.
>>>
>>>> moving nvme_start_freeze from nvme_rdma_teardown_io_queues to nvme_rdma_configure_io_queues can fix it.
>>>> It can also avoid I/O hang long time if reconnection failed.
>>>
>>> Can you explain how we'd still ensure that no new commands get queued
>>> during teardown using that scheme?
>> 1. tear down will cancel all inflight requests, and then multipath will clear the path.
>> 2. and then we may freeze the controler.
>> 3. nvme_ns_head_submit_bio can not find the reconnection controller as valid path, so it is safe.
> 
> In non-mpath (which unfortunately is a valid use-case), there is no
> failover, and we cannot freeze the queue after we stopped (and/or
> started) the queues because then fail_non_ready_command() constantly return BLK_STS_RESOURCE (just causing a re-submission over and over
> again) and the freeze will never complete (the commands are still
> inflight from the queue->g_usage_counter perspective).
If a request sets the REQ_FAILFAST_xxx flags, it will hang for a long time if the reconnection fails.
This is not expected.
Also, if the controller is not live and the controller is frozen, fast_io_fail_tmo will not work.
This is also not expected.
So I think freezing the controller while reconnecting is not a good idea.
Retrying again and again is really not good behavior, but it is at least better than requests hanging for a long time.
> 
> So I think we should still start queue freeze before we quiesce
> the queues.
We should unquiesce and unfreeze the queues when reconnecting, otherwise fast_io_fail_tmo will not work.
> 
> I still don't see how the mpath NOWAIT suggestion works either...
mpath will queue the request to another live path or requeue the request (if there is no usable path), so it will not wait.
> .
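
For context, the requeue fallback already in nvme_ns_head_submit_bio looks
roughly like this (simplified):

	if (likely(ns)) {
		submit_bio_noacct(bio);
	} else if (nvme_available_path(head)) {
		/* no usable path right now: park the bio, the requeue_work
		 * will resubmit it once a path becomes live */
		spin_lock_irq(&head->requeue_lock);
		bio_list_add(&head->requeue_list, bio);
		spin_unlock_irq(&head->requeue_lock);
	} else {
		/* no path will ever become available again */
		bio_io_error(bio);
	}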

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
  2021-03-18  1:51               ` Chao Leng
@ 2021-03-18  4:45                 ` Christoph Hellwig
  2021-03-18 18:46                 ` Sagi Grimberg
  1 sibling, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2021-03-18  4:45 UTC (permalink / raw)
  To: Chao Leng
  Cc: Sagi Grimberg, Christoph Hellwig, Keith Busch, linux-nvme,
	Chaitanya Kulkarni

On Thu, Mar 18, 2021 at 09:51:14AM +0800, Chao Leng wrote:
>>>> The multipath code would have to kick the list.  We could also try to
>>>> split into two flags, one that affects blk_queue_enter and one that
>>>> affects the tag allocation.
>>>>
>>>>> moving nvme_start_freeze from nvme_rdma_teardown_io_queues to nvme_rdma_configure_io_queues can fix it.
>>>>> It can also avoid I/O hang long time if reconnection failed.
>>>>
>>>> Can you explain how we'd still ensure that no new commands get queued
>>>> during teardown using that scheme?
>>> 1. tear down will cancel all inflight requests, and then multipath will clear the path.
>>> 2. and then we may freeze the controler.
>>> 3. nvme_ns_head_submit_bio can not find the reconnection controller as valid path, so it is safe.
>>
>> In non-mpath (which unfortunately is a valid use-case), there is no
>> failover, and we cannot freeze the queue after we stopped (and/or
>> started) the queues, because then fail_non_ready_command() constantly
>> returns BLK_STS_RESOURCE (just causing a re-submission over and over
>> again) and the freeze will never complete (the commands are still
>> inflight from the queue->q_usage_counter perspective).
> If the request sets the REQ_FAILFAST_xxx flags, it will hang for a long time if reconnection fails.
> This is not expected.
> Also, if the controller is not live and the controller is frozen, fast_io_fail_tmo will not work.
> This is also not expected.
> So I think freezing the controller while reconnecting is not a good idea.
> Retrying again and again is really not good behavior, but it is at least better than requests hanging for a long time.

Well, it is pretty clear that REQ_FAILFAST_* (and I'm still confused
about the three variants of that) should not block in blk_queue_enter,
and we should make sure nvme-multipath triggers that.  Let me think
of a good way to refactor blk_queue_enter first to make that the least
painful.

>> So I think we should still start queue freeze before we quiesce
>> the queues.
> We should unquiesce and unfreeze the queues when reconnecting, otherwise fast_io_fail_tmo will not work.
>>
>> I still don't see how the mpath NOWAIT suggestion works either...
> mpath will queue the request to another live path, or requeue the request (if there is no usable path), so it will not wait.
>> .

Yes.

* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
  2021-03-18  1:51               ` Chao Leng
  2021-03-18  4:45                 ` Christoph Hellwig
@ 2021-03-18 18:46                 ` Sagi Grimberg
  2021-03-18 19:16                   ` Keith Busch
  1 sibling, 1 reply; 30+ messages in thread
From: Sagi Grimberg @ 2021-03-18 18:46 UTC (permalink / raw)
  To: Chao Leng, Christoph Hellwig; +Cc: Keith Busch, linux-nvme, Chaitanya Kulkarni


>>>>>>> Will it work if nvme mpath used the NOWAIT flag for its submit_bio()
>>>>>>> call, and added the bio to the requeue_list if blk_queue_enter()
>>>>>>> fails? I
>>>>>>> think that looks like another way to resolve the deadlock, but we 
>>>>>>> need
>>>>>>> the block layer to return a failed status to the original caller.
>>>>
>>>> Yes, I think BLK_MQ_REQ_NOWAIT makes total sense here.  dm-mpath also
>>>> uses it for its request allocation for similar reasons.
>>>>
>>>>>>
>>>>>> But who would kick the requeue list? and that would make 
>>>>>> near-tag-exhaust performance stink...
>>>>
>>>> The multipath code would have to kick the list.  We could also try to
>>>> split into two flags, one that affects blk_queue_enter and one that
>>>> affects the tag allocation.
>>>>
>>>>> moving nvme_start_freeze from nvme_rdma_teardown_io_queues to 
>>>>> nvme_rdma_configure_io_queues can fix it.
>>>>> It can also avoid I/O hanging for a long time if reconnection fails.
>>>>
>>>> Can you explain how we'd still ensure that no new commands get queued
>>>> during teardown using that scheme?
>>> 1. Teardown will cancel all inflight requests, and then multipath
>>> will clear the path.
>>> 2. And then we may freeze the controller.
>>> 3. nvme_ns_head_submit_bio cannot find the reconnecting controller
>>> as a valid path, so it is safe.
>>
>> In non-mpath (which unfortunately is a valid use-case), there is no
>> failover, and we cannot freeze the queue after we stopped (and/or
>> started) the queues, because then fail_non_ready_command() constantly
>> returns BLK_STS_RESOURCE (just causing a re-submission over and over
>> again) and the freeze will never complete (the commands are still
>> inflight from the queue->q_usage_counter perspective).
> If the request sets the REQ_FAILFAST_xxx flags, it will hang for a long
> time if reconnection fails.
> This is not expected.
> Also, if the controller is not live and the controller is frozen,
> fast_io_fail_tmo will not work.
> This is also not expected.

No arguments that the queue needs to unfreeze asap for mpath; that
is exactly what the patch does. The only unnatural part is the
non-mpath case, where if we unfreeze the queue before we reconnect,
I/Os will fail, which is why we should also respect fast_io_fail_tmo.

The main issue here is that there are two behaviors that we
should maintain based on whether it is mpath or non-mpath...

> So I think freezing the controller while reconnecting is not a good idea.

As said, for mpath it's for sure not, but for non-mpath that matches
the expected behavior.

> Retrying again and again is really not good behavior, but it is at least
> better than requests hanging for a long time.

I am not sure I understand how that is even supposed to work, TBH.

>> So I think we should still start queue freeze before we quiesce
>> the queues.
> We should unquiesce and unfreeze the queues when reconnecting, otherwise 
> fast_io_fail_tmo will not work.
>>
>> I still don't see how the mpath NOWAIT suggestion works either...
> mpath will queue the request to another live path, or requeue the
> request (if there is no usable path), so it will not wait.

Placing the request on the requeue_list is fine, but the question is
when to kick the requeue_work; nothing guarantees that an alternate path
exists, or will within a sane period. So constantly requeue+kick sounds
like a really bad practice to me.
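
For context, the requeue machinery being discussed looks roughly like this
(paraphrased and trimmed, so a sketch rather than verbatim source): every
kick simply pushes the parked bios back through the head disk, which
re-runs path selection and, if there is still no usable path, parks them
again.

--
static void nvme_requeue_work(struct work_struct *work)
{
	struct nvme_ns_head *head =
		container_of(work, struct nvme_ns_head, requeue_work);
	struct bio *bio, *next;

	spin_lock_irq(&head->requeue_lock);
	next = bio_list_get(&head->requeue_list);
	spin_unlock_irq(&head->requeue_lock);

	while ((bio = next) != NULL) {
		next = bio->bi_next;
		bio->bi_next = NULL;

		/* Reset to the mpath node so resubmission picks a path again. */
		bio_set_dev(bio, head->disk->part0);
		submit_bio_noacct(bio);
	}
}
--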

* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
  2021-03-18 18:46                 ` Sagi Grimberg
@ 2021-03-18 19:16                   ` Keith Busch
  2021-03-18 19:31                     ` Sagi Grimberg
  0 siblings, 1 reply; 30+ messages in thread
From: Keith Busch @ 2021-03-18 19:16 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Chao Leng, Christoph Hellwig, linux-nvme, Chaitanya Kulkarni

On Thu, Mar 18, 2021 at 11:46:08AM -0700, Sagi Grimberg wrote:
> 
> Placing the request on the requeue_list is fine, but the question is
> when to kick the requeue_work; nothing guarantees that an alternate path
> exists, or will within a sane period. So constantly requeue+kick sounds like
> a really bad practice to me.

nvme_mpath_set_live(), where you reported the deadlock, kicks the
requeue_list. The difference that NOWAIT provides is that
nvme_mpath_set_live's synchronize_srcu() is no longer blocked forever
because the .submit_bio() isn't waiting for entry on a frozen queue, so
now it's free to schedule the dispatch.

There's probably an optimization to kick it sooner if there's a viable
alternate path, but that could be a follow-on.

If there's no immediate viable path, then the requests would remain on
the requeue list. That currently happens as long as there's a potential
controller in a reset or connecting state.
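
The relevant tail of that function, paraphrased (registration and path
bookkeeping elided, so a sketch rather than verbatim source): the
synchronize_srcu() that was blocking is immediately followed by the
requeue kick.

--
static void nvme_mpath_set_live(struct nvme_ns *ns)
{
	struct nvme_ns_head *head = ns->head;

	/* ... gendisk registration and current_path refresh elided ... */

	synchronize_srcu(&head->srcu);			/* waits out readers */
	kblockd_schedule_work(&head->requeue_work);	/* kicks the requeue_list */
}
--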

* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
  2021-03-18 19:16                   ` Keith Busch
@ 2021-03-18 19:31                     ` Sagi Grimberg
  2021-03-18 21:52                       ` Keith Busch
  0 siblings, 1 reply; 30+ messages in thread
From: Sagi Grimberg @ 2021-03-18 19:31 UTC (permalink / raw)
  To: Keith Busch; +Cc: Chao Leng, Christoph Hellwig, linux-nvme, Chaitanya Kulkarni


>> Placing the request on the requeue_list is fine, but the question is
>> when to kick the requeue_work; nothing guarantees that an alternate path
>> exists, or will within a sane period. So constantly requeue+kick sounds like
>> a really bad practice to me.
> 
> nvme_mpath_set_live(), where you reported the deadlock, kicks the
> requeue_list. The difference that NOWAIT provides is that
> nvme_mpath_set_live's synchronize_srcu() is no longer blocked forever
> because the .submit_bio() isn't waiting for entry on a frozen queue, so
> now it's free to schedule the dispatch.
> 
> There's probably an optimization to kick it sooner if there's a viable
> alternate path, but that could be a follow-on.

That would be mandatory I think, otherwise this would introduce
a regression...

> If there's no immediate viable path, then the requests would remain on
> the requeue list. That currently happens as long as there's a potential
> controller in a reset or connecting state.

Well, it's also worth keeping in mind that now we'll need to clone the bio,
because we need to override bi_end_io, which adds some overhead
in the data path. Unless we make submit_bio return a status, which
is a much bigger change in scope I would expect...
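
An illustrative sketch of that per-bio cost (all helper and bio_set names
here are hypothetical; only bio_clone_fast()/bio_endio()/bio_put() are
existing block layer APIs): observing the enter failure from
submit_bio_noacct() would mean cloning each bio and intercepting its
completion.

--
/* Hypothetical: complete the original bio, or park it on the requeue
 * list when the clone bounced off a frozen queue with BLK_STS_AGAIN. */
static void nvme_ns_head_clone_end_io(struct bio *clone)
{
	struct bio *parent = clone->bi_private;

	if (clone->bi_status == BLK_STS_AGAIN) {
		nvme_ns_head_requeue_bio(parent);	/* hypothetical helper */
	} else {
		parent->bi_status = clone->bi_status;
		bio_endio(parent);
	}
	bio_put(clone);
}

/* Hypothetical: clone on every submission just to own bi_end_io. */
static void nvme_ns_head_submit_clone(struct nvme_ns *ns, struct bio *bio)
{
	struct bio *clone = bio_clone_fast(bio, GFP_NOIO, &nvme_mpath_bio_set);

	clone->bi_end_io = nvme_ns_head_clone_end_io;
	clone->bi_private = bio;
	bio_set_dev(clone, ns->disk->part0);
	clone->bi_opf |= REQ_NVME_MPATH | REQ_NOWAIT;
	submit_bio_noacct(clone);
}
--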

* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
  2021-03-18 19:31                     ` Sagi Grimberg
@ 2021-03-18 21:52                       ` Keith Busch
  2021-03-18 22:45                         ` Sagi Grimberg
  2021-03-19 14:05                         ` Christoph Hellwig
  0 siblings, 2 replies; 30+ messages in thread
From: Keith Busch @ 2021-03-18 21:52 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Chao Leng, Christoph Hellwig, linux-nvme, Chaitanya Kulkarni

On Thu, Mar 18, 2021 at 12:31:35PM -0700, Sagi Grimberg wrote:
> 
> > > Placing the request on the requeue_list is fine, but the question is
> > > when to kick the requeue_work; nothing guarantees that an alternate path
> > > exists, or will within a sane period. So constantly requeue+kick sounds like
> > > a really bad practice to me.
> > 
> > nvme_mpath_set_live(), where you reported the deadlock, kicks the
> > requeue_list. The difference that NOWAIT provides is that
> > nvme_mpath_set_live's synchronize_srcu() is no longer blocked forever
> > because the .submit_bio() isn't waiting for entry on a frozen queue, so
> > now it's free to schedule the dispatch.
> > 
> > There's probably an optimization to kick it sooner if there's a viable
> > alternate path, but that could be a follow-on.
> 
> That would be mandatory I think, otherwise this would introduce
> a regression...
>
> > If there's no immediate viable path, then the requests would remain on
> > the requeue list. That currently happens as long as there's a potential
> > controller in a reset or connecting state.
> 
> Well, it's also worth keeping in mind that now we'll need to clone the bio,
> because we need to override bi_end_io, which adds some overhead
> in the data path. Unless we make submit_bio return a status, which
> is a much bigger change in scope I would expect...

Having submit_bio() return the enter status was where I was going with
this, but the recursive handling makes this more complicated than I
initially thought.

If you use the NOWAIT flag today with a freezing queue, the IO will end
with BLK_STS_AGAIN and punt retry handling to the application. I'm
guessing you don't want that to happen, so a little more is required for
this idea.

Since it's an error path, perhaps a block operations callback is okay?
Something like this compile tested patch?

---
diff --git a/block/blk-core.c b/block/blk-core.c
index fc60ff208497..423b89005a28 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -475,6 +475,16 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 	}
 }
 
+static inline void bio_enter_error(struct bio *bio)
+{
+	struct gendisk *disk = bio->bi_bdev->bd_disk;
+
+	if (disk->fops->enter_err)
+		disk->fops->enter_err(bio);
+	else
+		bio_wouldblock_error(bio);
+}
+
 static inline int bio_queue_enter(struct bio *bio)
 {
 	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
@@ -484,7 +494,7 @@ static inline int bio_queue_enter(struct bio *bio)
 	ret = blk_queue_enter(q, nowait ? BLK_MQ_REQ_NOWAIT : 0);
 	if (unlikely(ret)) {
 		if (nowait && !blk_queue_dying(q))
-			bio_wouldblock_error(bio);
+			bio_enter_error(bio);
 		else
 			bio_io_error(bio);
 	}
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index edf19bbb904f..2c27eeaa83b0 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2366,9 +2366,24 @@ static void nvme_ns_head_release(struct gendisk *disk, fmode_t mode)
 	nvme_put_ns_head(disk->private_data);
 }
 
+void nvme_ns_head_enter_err(struct bio *bio)
+{
+	struct nvme_ns_head *head = bio->bi_bdev->bd_disk->private_data;
+
+	if (nvme_available_path(head)) {
+		spin_lock_irq(&head->requeue_lock);
+		bio_list_add(&head->requeue_list, bio);
+		spin_unlock_irq(&head->requeue_lock);
+	} else {
+		bio->bi_status = BLK_STS_IOERR;
+		bio_endio(bio);
+	}
+}
+
 const struct block_device_operations nvme_ns_head_ops = {
 	.owner		= THIS_MODULE,
 	.submit_bio	= nvme_ns_head_submit_bio,
+	.enter_err	= nvme_ns_head_enter_err,
 	.open		= nvme_ns_head_open,
 	.release	= nvme_ns_head_release,
 	.ioctl		= nvme_ioctl,
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index a1d476e1ac02..47595bb09032 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -274,7 +274,7 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
 	return ns;
 }
 
-static bool nvme_available_path(struct nvme_ns_head *head)
+bool nvme_available_path(struct nvme_ns_head *head)
 {
 	struct nvme_ns *ns;
 
@@ -313,7 +313,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
 	ns = nvme_find_path(head);
 	if (likely(ns)) {
 		bio_set_dev(bio, ns->disk->part0);
-		bio->bi_opf |= REQ_NVME_MPATH;
+		bio->bi_opf |= REQ_NVME_MPATH | REQ_NOWAIT;
 		trace_block_bio_remap(bio, disk_devt(ns->head->disk),
 				      bio->bi_iter.bi_sector);
 		ret = submit_bio_noacct(bio);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 815c032a190e..5dbd6baebd70 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -677,6 +677,7 @@ bool nvme_mpath_clear_current_path(struct nvme_ns *ns);
 void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl);
 struct nvme_ns *nvme_find_path(struct nvme_ns_head *head);
 blk_qc_t nvme_ns_head_submit_bio(struct bio *bio);
+bool nvme_available_path(struct nvme_ns_head *head);
 
 static inline void nvme_mpath_check_last_path(struct nvme_ns *ns)
 {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index bc6bc8383b43..b5ae1aa292c1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1862,6 +1862,7 @@ static inline void blk_ksm_unregister(struct request_queue *q) { }
 
 struct block_device_operations {
 	blk_qc_t (*submit_bio) (struct bio *bio);
+	void (*enter_err) (struct bio *bio);
 	int (*open) (struct block_device *, fmode_t);
 	void (*release) (struct gendisk *, fmode_t);
 	int (*rw_page)(struct block_device *, sector_t, struct page *, unsigned int);
--

* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
  2021-03-18 21:52                       ` Keith Busch
@ 2021-03-18 22:45                         ` Sagi Grimberg
  2021-03-19 14:05                         ` Christoph Hellwig
  1 sibling, 0 replies; 30+ messages in thread
From: Sagi Grimberg @ 2021-03-18 22:45 UTC (permalink / raw)
  To: Keith Busch; +Cc: Chao Leng, Christoph Hellwig, linux-nvme, Chaitanya Kulkarni


>>>> Placing the request on the requeue_list is fine, but the question is
>>>> when to kick the requeue_work; nothing guarantees that an alternate path
>>>> exists, or will within a sane period. So constantly requeue+kick sounds like
>>>> a really bad practice to me.
>>>
>>> nvme_mpath_set_live(), where you reported the deadlock, kicks the
>>> requeue_list. The difference that NOWAIT provides is that
>>> nvme_mpath_set_live's synchronize_srcu() is no longer blocked forever
>>> because the .submit_bio() isn't waiting for entry on a frozen queue, so
>>> now it's free to schedule the dispatch.
>>>
>>> There's probably an optimization to kick it sooner if there's a viable
>>> alternate path, but that could be a follow-on.
>>
>> That would be mandatory I think, otherwise this would introduce
>> a regression...
>>
>>> If there's no immediate viable path, then the requests would remain on
>>> the requeue list. That currently happens as long as there's a potential
>>> controller in a reset or connecting state.
>>
>> Well, it's also worth keeping in mind that now we'll need to clone the bio,
>> because we need to override bi_end_io, which adds some overhead
>> in the data path. Unless we make submit_bio return a status, which
>> is a much bigger change in scope I would expect...
> 
> Having submit_bio() return the enter status was where I was going with
> this, but the recursive handling makes this more complicated than I
> initially thought.
> 
> If you use the NOWAIT flag today with a freezing queue, the IO will end
> with BLK_STS_AGAIN and punt retry handling to the application. I'm
> guessing you don't want that to happen, so a little more is required for
> this idea.
> 
> Since it's an error path, perhaps a block operations callback is okay?
> Something like this compile tested patch?

Maybe... I don't see any apparent reason why it would not work...

> ---
> diff --git a/block/blk-core.c b/block/blk-core.c
> index fc60ff208497..423b89005a28 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -475,6 +475,16 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
>   	}
>   }
>   
> +static inline void bio_enter_error(struct bio *bio)
> +{
> +	struct gendisk *disk = bio->bi_bdev->bd_disk;
> +
> +	if (disk->fops->enter_err)
> +		disk->fops->enter_err(bio);
> +	else
> +		bio_wouldblock_error(bio);
> +}
> +
>   static inline int bio_queue_enter(struct bio *bio)
>   {
>   	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> @@ -484,7 +494,7 @@ static inline int bio_queue_enter(struct bio *bio)
>   	ret = blk_queue_enter(q, nowait ? BLK_MQ_REQ_NOWAIT : 0);
>   	if (unlikely(ret)) {
>   		if (nowait && !blk_queue_dying(q))
> -			bio_wouldblock_error(bio);
> +			bio_enter_error(bio);
>   		else
>   			bio_io_error(bio);
>   	}
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index edf19bbb904f..2c27eeaa83b0 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -2366,9 +2366,24 @@ static void nvme_ns_head_release(struct gendisk *disk, fmode_t mode)
>   	nvme_put_ns_head(disk->private_data);
>   }
>   
> +void nvme_ns_head_enter_err(struct bio *bio)
> +{
> +	struct nvme_ns_head *head = bio->bi_bdev->bd_disk->private_data;
> +
> +	if (nvme_available_path(head)) {
> +		spin_lock_irq(&head->requeue_lock);
> +		bio_list_add(&head->requeue_list, bio);
> +		spin_unlock_irq(&head->requeue_lock);
> +	} else {
> +		bio->bi_status = BLK_STS_IOERR;
> +		bio_endio(bio);
> +	}
> +}

Nice, you can take the error path in nvme_ns_head_submit_bio
and use it there too:
--
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 5c67a5e96738..8d0ef83f545c 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -318,17 +318,8 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
                 trace_block_bio_remap(bio, disk_devt(ns->head->disk),
                                       bio->bi_iter.bi_sector);
                 ret = submit_bio_noacct(bio);
-       } else if (nvme_available_path(head)) {
-               dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");
-
-               spin_lock_irq(&head->requeue_lock);
-               bio_list_add(&head->requeue_list, bio);
-               spin_unlock_irq(&head->requeue_lock);
         } else {
-               dev_warn_ratelimited(dev, "no available path - failing I/O\n");
-
-               bio->bi_status = BLK_STS_IOERR;
-               bio_endio(bio);
+               nvme_ns_head_enter_err(bio);
         }

         srcu_read_unlock(&head->srcu, srcu_idx);
--

And move the prints as well for some logging..

> +
>   const struct block_device_operations nvme_ns_head_ops = {
>   	.owner		= THIS_MODULE,
>   	.submit_bio	= nvme_ns_head_submit_bio,
> +	.enter_err	= nvme_ns_head_enter_err,
>   	.open		= nvme_ns_head_open,
>   	.release	= nvme_ns_head_release,
>   	.ioctl		= nvme_ioctl,
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index a1d476e1ac02..47595bb09032 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -274,7 +274,7 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
>   	return ns;
>   }
>   
> -static bool nvme_available_path(struct nvme_ns_head *head)
> +bool nvme_available_path(struct nvme_ns_head *head)
>   {
>   	struct nvme_ns *ns;
>   
> @@ -313,7 +313,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
>   	ns = nvme_find_path(head);
>   	if (likely(ns)) {
>   		bio_set_dev(bio, ns->disk->part0);
> -		bio->bi_opf |= REQ_NVME_MPATH;
> +		bio->bi_opf |= REQ_NVME_MPATH | REQ_NOWAIT;
>   		trace_block_bio_remap(bio, disk_devt(ns->head->disk),
>   				      bio->bi_iter.bi_sector);
>   		ret = submit_bio_noacct(bio);
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 815c032a190e..5dbd6baebd70 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -677,6 +677,7 @@ bool nvme_mpath_clear_current_path(struct nvme_ns *ns);
>   void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl);
>   struct nvme_ns *nvme_find_path(struct nvme_ns_head *head);
>   blk_qc_t nvme_ns_head_submit_bio(struct bio *bio);
> +bool nvme_available_path(struct nvme_ns_head *head);
>   
>   static inline void nvme_mpath_check_last_path(struct nvme_ns *ns)
>   {
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index bc6bc8383b43..b5ae1aa292c1 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1862,6 +1862,7 @@ static inline void blk_ksm_unregister(struct request_queue *q) { }
>   
>   struct block_device_operations {
>   	blk_qc_t (*submit_bio) (struct bio *bio);
> +	void (*enter_err) (struct bio *bio);
>   	int (*open) (struct block_device *, fmode_t);
>   	void (*release) (struct gendisk *, fmode_t);
>   	int (*rw_page)(struct block_device *, sector_t, struct page *, unsigned int);
> --
> 

* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
  2021-03-18 21:52                       ` Keith Busch
  2021-03-18 22:45                         ` Sagi Grimberg
@ 2021-03-19 14:05                         ` Christoph Hellwig
  2021-03-19 17:28                           ` Christoph Hellwig
  1 sibling, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2021-03-19 14:05 UTC (permalink / raw)
  To: Keith Busch
  Cc: Sagi Grimberg, Chao Leng, Christoph Hellwig, linux-nvme,
	Chaitanya Kulkarni

On Fri, Mar 19, 2021 at 06:52:56AM +0900, Keith Busch wrote:
> Having submit_bio() return the enter status was where I was going with
> this, but the recursive handling makes this more complicated than I
> initially thought.

Note that the recursion handling is not really required for
nvme-multipath.  I have some plans to actually kill it off entirely
for blk-mq submissions, which needs work on the bounce buffering
and bio splitting code, but should not be too hard.

> If you use the NOWAIT flag today with a freezing queue, the IO will end
> with BLK_STS_AGAIN and punt retry handling to the application. I'm
> guessing you don't want that to happen, so a little more is required for
> this idea.

We really should not use NOWAIT but a flag that only escapes the
freeze protection.  I think REQ_FAILFAST_DRIVER should probably be changed
to trigger that, but even if not we could add a new flag.

> Since it's an error path, perhaps a block operations callback is okay?
> Something like this compile tested patch?

We really should not need an indirection.  And more importantly I don't
think the consuming driver cares, it really is the submitting one.
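
One possible reading of the flag idea above, sketched against the
bio_queue_enter() shown in the earlier patch in this thread (treating
REQ_FAILFAST_DRIVER like REQ_NOWAIT for the freeze check is an assumption
of the sketch, not an agreed interface):

--
static inline int bio_queue_enter(struct bio *bio)
{
	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
	/* Sketch: driver-level failfast submitters skip the freeze wait
	 * instead of sleeping on a frozen queue. */
	bool nowait = bio->bi_opf & (REQ_NOWAIT | REQ_FAILFAST_DRIVER);
	int ret;

	ret = blk_queue_enter(q, nowait ? BLK_MQ_REQ_NOWAIT : 0);
	if (unlikely(ret)) {
		if (nowait && !blk_queue_dying(q))
			bio_wouldblock_error(bio);
		else
			bio_io_error(bio);
	}
	return ret;
}
--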

* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
  2021-03-19 14:05                         ` Christoph Hellwig
@ 2021-03-19 17:28                           ` Christoph Hellwig
  2021-03-19 19:07                             ` Keith Busch
  2021-03-19 19:34                             ` Sagi Grimberg
  0 siblings, 2 replies; 30+ messages in thread
From: Christoph Hellwig @ 2021-03-19 17:28 UTC (permalink / raw)
  To: Keith Busch
  Cc: Sagi Grimberg, Chao Leng, Christoph Hellwig, linux-nvme,
	Chaitanya Kulkarni

What about something like this?

diff --git a/block/blk-core.c b/block/blk-core.c
index fc60ff20849738..4344f3c9058282 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -792,7 +792,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
 	return BLK_STS_OK;
 }
 
-static noinline_for_stack bool submit_bio_checks(struct bio *bio)
+noinline_for_stack bool submit_bio_checks(struct bio *bio)
 {
 	struct block_device *bdev = bio->bi_bdev;
 	struct request_queue *q = bdev->bd_disk->queue;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d4d7c1caa43966..4ff85692843b49 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2286,6 +2286,43 @@ blk_qc_t blk_mq_submit_bio(struct bio *bio)
 	return BLK_QC_T_NONE;
 }
 
+/**
+ * blk_mq_submit_bio_direct - hand a bio directly to the driver for I/O
+ * @bio:  The bio describing the location in memory and on the device.
+ *
+ * This function behaves similarly to submit_bio_noacct(), but never waits
+ * for the queue to be unfrozen; instead it returns false and lets the caller
+ * deal with the fallout.  It also does not protect against recursion and thus
+ * must only be used if the called driver is known to be blk-mq based.
+ */
+bool blk_mq_submit_bio_direct(struct bio *bio, blk_qc_t *qc)
+{
+	struct gendisk *disk = bio->bi_bdev->bd_disk;
+	struct request_queue *q = disk->queue;
+
+	if (WARN_ON_ONCE(!current->bio_list) ||
+	    WARN_ON_ONCE(disk->fops->submit_bio)) {
+		bio_io_error(bio);
+		goto fail;
+	}
+	if (!submit_bio_checks(bio))
+		goto fail;
+
+	if (unlikely(blk_queue_enter(q, BLK_MQ_REQ_NOWAIT)))
+		return false;
+	if (!blk_crypto_bio_prep(&bio))
+		goto fail_queue_exit;
+	*qc = blk_mq_submit_bio(bio);
+	return true;
+
+fail_queue_exit:
+	blk_queue_exit(disk->queue);
+fail:
+	*qc = BLK_QC_T_NONE;
+	return true;
+}
+EXPORT_SYMBOL_GPL(blk_mq_submit_bio_direct);
+
 void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		     unsigned int hctx_idx)
 {
diff --git a/block/blk.h b/block/blk.h
index 3b53e44b967e4e..c4c66b2a9ffb19 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -221,6 +221,7 @@ ssize_t part_timeout_show(struct device *, struct device_attribute *, char *);
 ssize_t part_timeout_store(struct device *, struct device_attribute *,
 				const char *, size_t);
 
+bool submit_bio_checks(struct bio *bio);
 void __blk_queue_split(struct bio **bio, unsigned int *nr_segs);
 int ll_back_merge_fn(struct request *req, struct bio *bio,
 		unsigned int nr_segs);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index a1d476e1ac020f..92adebfaf86fd1 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -309,6 +309,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
 	 */
 	blk_queue_split(&bio);
 
+retry:
 	srcu_idx = srcu_read_lock(&head->srcu);
 	ns = nvme_find_path(head);
 	if (likely(ns)) {
@@ -316,7 +317,12 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
 		bio->bi_opf |= REQ_NVME_MPATH;
 		trace_block_bio_remap(bio, disk_devt(ns->head->disk),
 				      bio->bi_iter.bi_sector);
-		ret = submit_bio_noacct(bio);
+
+		if (!blk_mq_submit_bio_direct(bio, &ret)) {
+			nvme_mpath_clear_current_path(ns);
+			srcu_read_unlock(&head->srcu, srcu_idx);
+			goto retry;
+		}
 	} else if (nvme_available_path(head)) {
 		dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2c473c9b899089..6804f397106ada 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -615,6 +615,7 @@ static inline void blk_rq_bio_prep(struct request *rq, struct bio *bio,
 }
 
 blk_qc_t blk_mq_submit_bio(struct bio *bio);
+bool blk_mq_submit_bio_direct(struct bio *bio, blk_qc_t *qc);
 void blk_mq_hctx_set_fq_lock_class(struct blk_mq_hw_ctx *hctx,
 		struct lock_class_key *key);
 

* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
  2021-03-19 17:28                           ` Christoph Hellwig
@ 2021-03-19 19:07                             ` Keith Busch
  2021-03-19 19:34                             ` Sagi Grimberg
  1 sibling, 0 replies; 30+ messages in thread
From: Keith Busch @ 2021-03-19 19:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Chao Leng, linux-nvme, Chaitanya Kulkarni

On Fri, Mar 19, 2021 at 06:28:17PM +0100, Christoph Hellwig wrote:
> What about something like this?

Yes, this looks good.
 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index fc60ff20849738..4344f3c9058282 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -792,7 +792,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
>  	return BLK_STS_OK;
>  }
>  
> -static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> +noinline_for_stack bool submit_bio_checks(struct bio *bio)
>  {
>  	struct block_device *bdev = bio->bi_bdev;
>  	struct request_queue *q = bdev->bd_disk->queue;
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d4d7c1caa43966..4ff85692843b49 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2286,6 +2286,43 @@ blk_qc_t blk_mq_submit_bio(struct bio *bio)
>  	return BLK_QC_T_NONE;
>  }
>  
> +/**
> + * blk_mq_submit_bio_direct - hand a bio directly to the driver for I/O
> + * @bio:  The bio describing the location in memory and on the device.
> + *
> + * This function behaves similarly to submit_bio_noacct(), but never waits
> + * for the queue to be unfrozen; instead it returns false and lets the caller
> + * deal with the fallout.  It also does not protect against recursion and thus
> + * must only be used if the called driver is known to be blk-mq based.
> + */
> +bool blk_mq_submit_bio_direct(struct bio *bio, blk_qc_t *qc)
> +{
> +	struct gendisk *disk = bio->bi_bdev->bd_disk;
> +	struct request_queue *q = disk->queue;
> +
> +	if (WARN_ON_ONCE(!current->bio_list) ||
> +	    WARN_ON_ONCE(disk->fops->submit_bio)) {
> +		bio_io_error(bio);
> +		goto fail;
> +	}
> +	if (!submit_bio_checks(bio))
> +		goto fail;
> +
> +	if (unlikely(blk_queue_enter(q, BLK_MQ_REQ_NOWAIT)))
> +		return false;
> +	if (!blk_crypto_bio_prep(&bio))
> +		goto fail_queue_exit;
> +	*qc = blk_mq_submit_bio(bio);
> +	return true;
> +
> +fail_queue_exit:
> +	blk_queue_exit(disk->queue);
> +fail:
> +	*qc = BLK_QC_T_NONE;
> +	return true;
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_submit_bio_direct);
> +
>  void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>  		     unsigned int hctx_idx)
>  {
> diff --git a/block/blk.h b/block/blk.h
> index 3b53e44b967e4e..c4c66b2a9ffb19 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -221,6 +221,7 @@ ssize_t part_timeout_show(struct device *, struct device_attribute *, char *);
>  ssize_t part_timeout_store(struct device *, struct device_attribute *,
>  				const char *, size_t);
>  
> +bool submit_bio_checks(struct bio *bio);
>  void __blk_queue_split(struct bio **bio, unsigned int *nr_segs);
>  int ll_back_merge_fn(struct request *req, struct bio *bio,
>  		unsigned int nr_segs);
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index a1d476e1ac020f..92adebfaf86fd1 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -309,6 +309,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
>  	 */
>  	blk_queue_split(&bio);
>  
> +retry:
>  	srcu_idx = srcu_read_lock(&head->srcu);
>  	ns = nvme_find_path(head);
>  	if (likely(ns)) {
> @@ -316,7 +317,12 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
>  		bio->bi_opf |= REQ_NVME_MPATH;
>  		trace_block_bio_remap(bio, disk_devt(ns->head->disk),
>  				      bio->bi_iter.bi_sector);
> -		ret = submit_bio_noacct(bio);
> +
> +		if (!blk_mq_submit_bio_direct(bio, &ret)) {
> +			nvme_mpath_clear_current_path(ns);
> +			srcu_read_unlock(&head->srcu, srcu_idx);
> +			goto retry;
> +		}
>  	} else if (nvme_available_path(head)) {
>  		dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");
>  
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 2c473c9b899089..6804f397106ada 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -615,6 +615,7 @@ static inline void blk_rq_bio_prep(struct request *rq, struct bio *bio,
>  }
>  
>  blk_qc_t blk_mq_submit_bio(struct bio *bio);
> +bool blk_mq_submit_bio_direct(struct bio *bio, blk_qc_t *qc);
>  void blk_mq_hctx_set_fq_lock_class(struct blk_mq_hw_ctx *hctx,
>  		struct lock_class_key *key);
>  

* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
  2021-03-19 17:28                           ` Christoph Hellwig
  2021-03-19 19:07                             ` Keith Busch
@ 2021-03-19 19:34                             ` Sagi Grimberg
  2021-03-20  6:11                               ` Christoph Hellwig
  1 sibling, 1 reply; 30+ messages in thread
From: Sagi Grimberg @ 2021-03-19 19:34 UTC (permalink / raw)
  To: Christoph Hellwig, Keith Busch; +Cc: Chao Leng, linux-nvme, Chaitanya Kulkarni


> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index a1d476e1ac020f..92adebfaf86fd1 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -309,6 +309,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
>   	 */
>   	blk_queue_split(&bio);
>   
> +retry:
>   	srcu_idx = srcu_read_lock(&head->srcu);
>   	ns = nvme_find_path(head);
>   	if (likely(ns)) {
> @@ -316,7 +317,12 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
>   		bio->bi_opf |= REQ_NVME_MPATH;
>   		trace_block_bio_remap(bio, disk_devt(ns->head->disk),
>   				      bio->bi_iter.bi_sector);
> -		ret = submit_bio_noacct(bio);
> +
> +		if (!blk_mq_submit_bio_direct(bio, &ret)) {
> +			nvme_mpath_clear_current_path(ns);
> +			srcu_read_unlock(&head->srcu, srcu_idx);

It's a bit unusual to see mutation of data protected by srcu while
under the srcu_read_lock; can that be problematic somehow?

* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
  2021-03-19 19:34                             ` Sagi Grimberg
@ 2021-03-20  6:11                               ` Christoph Hellwig
  2021-03-21  6:49                                 ` Sagi Grimberg
  0 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2021-03-20  6:11 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Keith Busch, Chao Leng, linux-nvme,
	Chaitanya Kulkarni

On Fri, Mar 19, 2021 at 12:34:25PM -0700, Sagi Grimberg wrote:
>
>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>> index a1d476e1ac020f..92adebfaf86fd1 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -309,6 +309,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
>>   	 */
>>   	blk_queue_split(&bio);
>>   +retry:
>>   	srcu_idx = srcu_read_lock(&head->srcu);
>>   	ns = nvme_find_path(head);
>>   	if (likely(ns)) {
>> @@ -316,7 +317,12 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
>>   		bio->bi_opf |= REQ_NVME_MPATH;
>>   		trace_block_bio_remap(bio, disk_devt(ns->head->disk),
>>   				      bio->bi_iter.bi_sector);
>> -		ret = submit_bio_noacct(bio);
>> +
>> +		if (!blk_mq_submit_bio_direct(bio, &ret)) {
>> +			nvme_mpath_clear_current_path(ns);
>> +			srcu_read_unlock(&head->srcu, srcu_idx);
>
> It's a bit unusual to see mutation of data protected by srcu while
> under the srcu_read_lock; can that be problematic somehow?

Hmm.  I don't think head->srcu is intended to protect the current path.
We also call nvme_mpath_clear_current_path from nvme_complete_rq or
nvme_ns_remove, which has no locking at all.  The srcu protection is
for head->list, but leaks into the current path due to the __rcu
annotations.
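
For reference, that helper, paraphrased from the same era (trimmed, so a
sketch rather than verbatim source): it only touches the __rcu-annotated
current_path[] slots and takes no lock at all, consistent with the point
above.

--
bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
{
	struct nvme_ns_head *head = ns->head;
	bool changed = false;
	int node;

	if (!head)
		goto out;

	for_each_node(node) {
		if (ns == rcu_access_pointer(head->current_path[node])) {
			rcu_assign_pointer(head->current_path[node], NULL);
			changed = true;
		}
	}
out:
	return changed;
}
--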

* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
  2021-03-20  6:11                               ` Christoph Hellwig
@ 2021-03-21  6:49                                 ` Sagi Grimberg
  2021-03-22  6:34                                   ` Christoph Hellwig
  0 siblings, 1 reply; 30+ messages in thread
From: Sagi Grimberg @ 2021-03-21  6:49 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, Chao Leng, linux-nvme, Chaitanya Kulkarni


>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>>> index a1d476e1ac020f..92adebfaf86fd1 100644
>>> --- a/drivers/nvme/host/multipath.c
>>> +++ b/drivers/nvme/host/multipath.c
>>> @@ -309,6 +309,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
>>>    	 */
>>>    	blk_queue_split(&bio);
>>>    +retry:
>>>    	srcu_idx = srcu_read_lock(&head->srcu);
>>>    	ns = nvme_find_path(head);
>>>    	if (likely(ns)) {
>>> @@ -316,7 +317,12 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
>>>    		bio->bi_opf |= REQ_NVME_MPATH;
>>>    		trace_block_bio_remap(bio, disk_devt(ns->head->disk),
>>>    				      bio->bi_iter.bi_sector);
>>> -		ret = submit_bio_noacct(bio);
>>> +
>>> +		if (!blk_mq_submit_bio_direct(bio, &ret)) {
>>> +			nvme_mpath_clear_current_path(ns);
>>> +			srcu_read_unlock(&head->srcu, srcu_idx);
>>
>> It's a bit unusual to see mutation of data protected by srcu while
>> under the srcu_read_lock; can that be problematic somehow?
> 
> Hmm.  I don't think head->srcu is intended to protect the current path.
> We also call nvme_mpath_clear_current_path from nvme_complete_rq or
> nvme_ns_remove, which has no locking at all.  The srcu protection is
> for head->list, but leaks into the current path due to the __rcu
> annotations.

OK, care to send a formal patch that I can give a test drive?

Also, given that this issue goes back to stable 5.4 and 5.10, we will
need to take care of those too. We should make sure to annotate the
Fixes tags on this patch, and probably also understand how we can create
a version of this that applies cleanly (for sure there are some extra
dependencies).

* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
  2021-03-21  6:49                                 ` Sagi Grimberg
@ 2021-03-22  6:34                                   ` Christoph Hellwig
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2021-03-22  6:34 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Keith Busch, Chao Leng, linux-nvme,
	Chaitanya Kulkarni

On Sat, Mar 20, 2021 at 11:49:35PM -0700, Sagi Grimberg wrote:
>> Hmm.  I don't think head->srcu is intended to protect the current path.
>> We also call nvme_mpath_clear_current_path from nvme_complete_rq or
>> nvme_ns_remove, which has no locking at all.  The srcu protection is
>> for head->list, but leaks into the current path due to the __rcu
>> annotations.
>
> OK, care to send a formal patch that I can give a test drive?

I'll write a changelog.  But as-is this should be testable.

Thread overview: 30+ messages
2021-03-15 22:27 [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs Sagi Grimberg
2021-03-15 22:27 ` [PATCH 1/3] nvme: introduce nvme_ctrl_is_mpath helper Sagi Grimberg
2021-03-15 22:27 ` [PATCH 2/3] nvme-tcp: fix possible hang when trying to set a live path during I/O Sagi Grimberg
2021-03-15 22:27 ` [PATCH 3/3] nvme-rdma: " Sagi Grimberg
2021-03-16  3:24 ` [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs Chao Leng
2021-03-16  5:04   ` Sagi Grimberg
2021-03-16  6:18     ` Chao Leng
2021-03-16  6:25       ` Sagi Grimberg
2021-03-16 20:07 ` Sagi Grimberg
2021-03-16 20:42   ` Keith Busch
2021-03-16 23:51     ` Sagi Grimberg
2021-03-17  2:55       ` Chao Leng
2021-03-17  6:59         ` Christoph Hellwig
2021-03-17  7:59           ` Chao Leng
2021-03-17 18:43             ` Sagi Grimberg
2021-03-18  1:51               ` Chao Leng
2021-03-18  4:45                 ` Christoph Hellwig
2021-03-18 18:46                 ` Sagi Grimberg
2021-03-18 19:16                   ` Keith Busch
2021-03-18 19:31                     ` Sagi Grimberg
2021-03-18 21:52                       ` Keith Busch
2021-03-18 22:45                         ` Sagi Grimberg
2021-03-19 14:05                         ` Christoph Hellwig
2021-03-19 17:28                           ` Christoph Hellwig
2021-03-19 19:07                             ` Keith Busch
2021-03-19 19:34                             ` Sagi Grimberg
2021-03-20  6:11                               ` Christoph Hellwig
2021-03-21  6:49                                 ` Sagi Grimberg
2021-03-22  6:34                                   ` Christoph Hellwig
2021-03-17  8:16           ` Sagi Grimberg
