All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] nvme fabircs: fix io hang in error recovery vs. removal
@ 2023-07-11  9:40 Ming Lei
  2023-07-11  9:40 ` [PATCH V2 1/3] nvme: fix possible hang when removing a controller during error recovery Ming Lei
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Ming Lei @ 2023-07-11  9:40 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg, Keith Busch, linux-nvme
  Cc: Yi Zhang, Chunguang Xu, Ming Lei

Hello,

The 1st patch fixes io hang when queues are left as quiesced.

The other 2 patches fixes io hang when queues are left as frozen.

All are triggered in error recovery vs. removal.

Ming Lei (3):
  nvme: fix possible hang when removing a controller during error
    recovery
  nvme-tcp: fix potential unbalanced freeze & unfreeze
  nvme-rdma: fix potential unbalanced freeze & unfreeze

 drivers/nvme/host/core.c | 10 +++++++---
 drivers/nvme/host/rdma.c |  3 ++-
 drivers/nvme/host/tcp.c  |  3 ++-
 3 files changed, 11 insertions(+), 5 deletions(-)

-- 
2.40.1



^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH V2 1/3] nvme: fix possible hang when removing a controller during error recovery
  2023-07-11  9:40 [PATCH 0/3] nvme fabircs: fix io hang in error recovery vs. removal Ming Lei
@ 2023-07-11  9:40 ` Ming Lei
  2023-07-11  9:40 ` [PATCH V2 2/3] nvme-tcp: fix potential unbalanced freeze & unfreeze Ming Lei
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Ming Lei @ 2023-07-11  9:40 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg, Keith Busch, linux-nvme
  Cc: Yi Zhang, Chunguang Xu, Ming Lei, stable

Error recovery can be interrupted by controller removal, then the
controller is left as quiesced, and IO hang can be caused.

Fix the issue by unquiescing controller unconditionally when removing
namespaces.

This way is reasonable and safe given forward progress can be made
when removing namespaces.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reported-by: Chunguang Xu <brookxu.cn@gmail.com>
Closes: https://lore.kernel.org/linux-nvme/cover.1685350577.git.chunguang.xu@shopee.com/
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/core.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 47d7ba2827ff..98fa8315bc65 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3903,6 +3903,12 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
 	 */
 	nvme_mpath_clear_ctrl_paths(ctrl);
 
+	/*
+	 * Unquiesce io queues so any pending IO won't hang, especially
+	 * those submitted from scan work
+	 */
+	nvme_unquiesce_io_queues(ctrl);
+
 	/* prevent racing with ns scanning */
 	flush_work(&ctrl->scan_work);
 
@@ -3912,10 +3918,8 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
 	 * removing the namespaces' disks; fail all the queues now to avoid
 	 * potentially having to clean up the failed sync later.
 	 */
-	if (ctrl->state == NVME_CTRL_DEAD) {
+	if (ctrl->state == NVME_CTRL_DEAD)
 		nvme_mark_namespaces_dead(ctrl);
-		nvme_unquiesce_io_queues(ctrl);
-	}
 
 	/* this is a no-op when called from the controller reset handler */
 	nvme_change_ctrl_state(ctrl, NVME_CTRL_DELETING_NOIO);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH V2 2/3] nvme-tcp: fix potential unbalanced freeze & unfreeze
  2023-07-11  9:40 [PATCH 0/3] nvme fabircs: fix io hang in error recovery vs. removal Ming Lei
  2023-07-11  9:40 ` [PATCH V2 1/3] nvme: fix possible hang when removing a controller during error recovery Ming Lei
@ 2023-07-11  9:40 ` Ming Lei
  2023-07-11 10:38   ` Sagi Grimberg
  2023-07-21  1:58   ` Yi Zhang
  2023-07-11  9:40 ` [PATCH V2 3/3] nvme-rdma: " Ming Lei
  2023-07-21  7:59 ` [PATCH 0/3] nvme fabircs: fix io hang in error recovery vs. removal Keith Busch
  3 siblings, 2 replies; 9+ messages in thread
From: Ming Lei @ 2023-07-11  9:40 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg, Keith Busch, linux-nvme
  Cc: Yi Zhang, Chunguang Xu, Ming Lei, stable

Move start_freeze into nvme_tcp_configure_io_queues(), and there is
at least two benefits:

1) fix unbalanced freeze and unfreeze, since re-connection work may
fail or be broken by removal

2) IO during error recovery can be failfast quickly because nvme fabrics
unquiesces queues after teardown.

One side-effect is that !mpath request may timeout during connecting
because of queue topo change, but that looks not one big deal:

1) same problem exists with current code base

2) compared with !mpath, mpath use case is dominant

Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic")
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/tcp.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 3e7dd6f91832..fb24cd8ac46c 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1868,6 +1868,7 @@ static int nvme_tcp_configure_io_queues(struct nvme_ctrl *ctrl, bool new)
 		goto out_cleanup_connect_q;
 
 	if (!new) {
+		nvme_start_freeze(ctrl);
 		nvme_unquiesce_io_queues(ctrl);
 		if (!nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT)) {
 			/*
@@ -1876,6 +1877,7 @@ static int nvme_tcp_configure_io_queues(struct nvme_ctrl *ctrl, bool new)
 			 * to be safe.
 			 */
 			ret = -ENODEV;
+			nvme_unfreeze(ctrl);
 			goto out_wait_freeze_timed_out;
 		}
 		blk_mq_update_nr_hw_queues(ctrl->tagset,
@@ -1980,7 +1982,6 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
 	if (ctrl->queue_count <= 1)
 		return;
 	nvme_quiesce_admin_queue(ctrl);
-	nvme_start_freeze(ctrl);
 	nvme_quiesce_io_queues(ctrl);
 	nvme_sync_io_queues(ctrl);
 	nvme_tcp_stop_io_queues(ctrl);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH V2 3/3] nvme-rdma: fix potential unbalanced freeze & unfreeze
  2023-07-11  9:40 [PATCH 0/3] nvme fabircs: fix io hang in error recovery vs. removal Ming Lei
  2023-07-11  9:40 ` [PATCH V2 1/3] nvme: fix possible hang when removing a controller during error recovery Ming Lei
  2023-07-11  9:40 ` [PATCH V2 2/3] nvme-tcp: fix potential unbalanced freeze & unfreeze Ming Lei
@ 2023-07-11  9:40 ` Ming Lei
  2023-07-11 10:38   ` Sagi Grimberg
  2023-07-21  1:58   ` Yi Zhang
  2023-07-21  7:59 ` [PATCH 0/3] nvme fabircs: fix io hang in error recovery vs. removal Keith Busch
  3 siblings, 2 replies; 9+ messages in thread
From: Ming Lei @ 2023-07-11  9:40 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg, Keith Busch, linux-nvme
  Cc: Yi Zhang, Chunguang Xu, Ming Lei, stable

Move start_freeze into nvme_rdma_configure_io_queues(), and there is
at least two benefits:

1) fix unbalanced freeze and unfreeze, since re-connection work may
fail or be broken by removal

2) IO during error recovery can be failfast quickly because nvme fabrics
unquiesces queues after teardown.

One side-effect is that !mpath request may timeout during connecting
because of queue topo change, but that looks not one big deal:

1) same problem exists with current code base

2) compared with !mpath, mpath use case is dominant

Fixes: 9f98772ba307 ("nvme-rdma: fix controller reset hang during traffic")
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/rdma.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index d433b2ec07a6..337a624a537c 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -883,6 +883,7 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
 		goto out_cleanup_tagset;
 
 	if (!new) {
+		nvme_start_freeze(&ctrl->ctrl);
 		nvme_unquiesce_io_queues(&ctrl->ctrl);
 		if (!nvme_wait_freeze_timeout(&ctrl->ctrl, NVME_IO_TIMEOUT)) {
 			/*
@@ -891,6 +892,7 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
 			 * to be safe.
 			 */
 			ret = -ENODEV;
+			nvme_unfreeze(&ctrl->ctrl);
 			goto out_wait_freeze_timed_out;
 		}
 		blk_mq_update_nr_hw_queues(ctrl->ctrl.tagset,
@@ -940,7 +942,6 @@ static void nvme_rdma_teardown_io_queues(struct nvme_rdma_ctrl *ctrl,
 		bool remove)
 {
 	if (ctrl->ctrl.queue_count > 1) {
-		nvme_start_freeze(&ctrl->ctrl);
 		nvme_quiesce_io_queues(&ctrl->ctrl);
 		nvme_sync_io_queues(&ctrl->ctrl);
 		nvme_rdma_stop_io_queues(ctrl);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH V2 2/3] nvme-tcp: fix potential unbalanced freeze & unfreeze
  2023-07-11  9:40 ` [PATCH V2 2/3] nvme-tcp: fix potential unbalanced freeze & unfreeze Ming Lei
@ 2023-07-11 10:38   ` Sagi Grimberg
  2023-07-21  1:58   ` Yi Zhang
  1 sibling, 0 replies; 9+ messages in thread
From: Sagi Grimberg @ 2023-07-11 10:38 UTC (permalink / raw)
  To: Ming Lei, Christoph Hellwig, Keith Busch, linux-nvme
  Cc: Yi Zhang, Chunguang Xu, stable

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH V2 3/3] nvme-rdma: fix potential unbalanced freeze & unfreeze
  2023-07-11  9:40 ` [PATCH V2 3/3] nvme-rdma: " Ming Lei
@ 2023-07-11 10:38   ` Sagi Grimberg
  2023-07-21  1:58   ` Yi Zhang
  1 sibling, 0 replies; 9+ messages in thread
From: Sagi Grimberg @ 2023-07-11 10:38 UTC (permalink / raw)
  To: Ming Lei, Christoph Hellwig, Keith Busch, linux-nvme
  Cc: Yi Zhang, Chunguang Xu, stable

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH V2 2/3] nvme-tcp: fix potential unbalanced freeze & unfreeze
  2023-07-11  9:40 ` [PATCH V2 2/3] nvme-tcp: fix potential unbalanced freeze & unfreeze Ming Lei
  2023-07-11 10:38   ` Sagi Grimberg
@ 2023-07-21  1:58   ` Yi Zhang
  1 sibling, 0 replies; 9+ messages in thread
From: Yi Zhang @ 2023-07-21  1:58 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Sagi Grimberg, Keith Busch, linux-nvme,
	Chunguang Xu, stable

Verified it with the nvme/tcp scenario, Thanks Ming
Tested-by: Yi Zhang <yi.zhang@redhat.com>

On Tue, Jul 11, 2023 at 5:41 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> Move start_freeze into nvme_tcp_configure_io_queues(), and there is
> at least two benefits:
>
> 1) fix unbalanced freeze and unfreeze, since re-connection work may
> fail or be broken by removal
>
> 2) IO during error recovery can be failfast quickly because nvme fabrics
> unquiesces queues after teardown.
>
> One side-effect is that !mpath request may timeout during connecting
> because of queue topo change, but that looks not one big deal:
>
> 1) same problem exists with current code base
>
> 2) compared with !mpath, mpath use case is dominant
>
> Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic")
> Cc: stable@vger.kernel.org
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/nvme/host/tcp.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 3e7dd6f91832..fb24cd8ac46c 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1868,6 +1868,7 @@ static int nvme_tcp_configure_io_queues(struct nvme_ctrl *ctrl, bool new)
>                 goto out_cleanup_connect_q;
>
>         if (!new) {
> +               nvme_start_freeze(ctrl);
>                 nvme_unquiesce_io_queues(ctrl);
>                 if (!nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT)) {
>                         /*
> @@ -1876,6 +1877,7 @@ static int nvme_tcp_configure_io_queues(struct nvme_ctrl *ctrl, bool new)
>                          * to be safe.
>                          */
>                         ret = -ENODEV;
> +                       nvme_unfreeze(ctrl);
>                         goto out_wait_freeze_timed_out;
>                 }
>                 blk_mq_update_nr_hw_queues(ctrl->tagset,
> @@ -1980,7 +1982,6 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
>         if (ctrl->queue_count <= 1)
>                 return;
>         nvme_quiesce_admin_queue(ctrl);
> -       nvme_start_freeze(ctrl);
>         nvme_quiesce_io_queues(ctrl);
>         nvme_sync_io_queues(ctrl);
>         nvme_tcp_stop_io_queues(ctrl);
> --
> 2.40.1
>


-- 
Best Regards,
  Yi Zhang


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH V2 3/3] nvme-rdma: fix potential unbalanced freeze & unfreeze
  2023-07-11  9:40 ` [PATCH V2 3/3] nvme-rdma: " Ming Lei
  2023-07-11 10:38   ` Sagi Grimberg
@ 2023-07-21  1:58   ` Yi Zhang
  1 sibling, 0 replies; 9+ messages in thread
From: Yi Zhang @ 2023-07-21  1:58 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Sagi Grimberg, Keith Busch, linux-nvme,
	Chunguang Xu, stable

Verified it with the nvme/rdma scenario, Thanks Ming
Tested-by: Yi Zhang <yi.zhang@redhat.com>

On Tue, Jul 11, 2023 at 5:41 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> Move start_freeze into nvme_rdma_configure_io_queues(), and there is
> at least two benefits:
>
> 1) fix unbalanced freeze and unfreeze, since re-connection work may
> fail or be broken by removal
>
> 2) IO during error recovery can be failfast quickly because nvme fabrics
> unquiesces queues after teardown.
>
> One side-effect is that !mpath request may timeout during connecting
> because of queue topo change, but that looks not one big deal:
>
> 1) same problem exists with current code base
>
> 2) compared with !mpath, mpath use case is dominant
>
> Fixes: 9f98772ba307 ("nvme-rdma: fix controller reset hang during traffic")
> Cc: stable@vger.kernel.org
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/nvme/host/rdma.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index d433b2ec07a6..337a624a537c 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -883,6 +883,7 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
>                 goto out_cleanup_tagset;
>
>         if (!new) {
> +               nvme_start_freeze(&ctrl->ctrl);
>                 nvme_unquiesce_io_queues(&ctrl->ctrl);
>                 if (!nvme_wait_freeze_timeout(&ctrl->ctrl, NVME_IO_TIMEOUT)) {
>                         /*
> @@ -891,6 +892,7 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
>                          * to be safe.
>                          */
>                         ret = -ENODEV;
> +                       nvme_unfreeze(&ctrl->ctrl);
>                         goto out_wait_freeze_timed_out;
>                 }
>                 blk_mq_update_nr_hw_queues(ctrl->ctrl.tagset,
> @@ -940,7 +942,6 @@ static void nvme_rdma_teardown_io_queues(struct nvme_rdma_ctrl *ctrl,
>                 bool remove)
>  {
>         if (ctrl->ctrl.queue_count > 1) {
> -               nvme_start_freeze(&ctrl->ctrl);
>                 nvme_quiesce_io_queues(&ctrl->ctrl);
>                 nvme_sync_io_queues(&ctrl->ctrl);
>                 nvme_rdma_stop_io_queues(ctrl);
> --
> 2.40.1
>


-- 
Best Regards,
  Yi Zhang


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 0/3] nvme fabircs: fix io hang in error recovery vs. removal
  2023-07-11  9:40 [PATCH 0/3] nvme fabircs: fix io hang in error recovery vs. removal Ming Lei
                   ` (2 preceding siblings ...)
  2023-07-11  9:40 ` [PATCH V2 3/3] nvme-rdma: " Ming Lei
@ 2023-07-21  7:59 ` Keith Busch
  3 siblings, 0 replies; 9+ messages in thread
From: Keith Busch @ 2023-07-21  7:59 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Sagi Grimberg, linux-nvme, Yi Zhang, Chunguang Xu

Series applied for nvme-6.5. Thanks!


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2023-07-21  8:00 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-11  9:40 [PATCH 0/3] nvme fabircs: fix io hang in error recovery vs. removal Ming Lei
2023-07-11  9:40 ` [PATCH V2 1/3] nvme: fix possible hang when removing a controller during error recovery Ming Lei
2023-07-11  9:40 ` [PATCH V2 2/3] nvme-tcp: fix potential unbalanced freeze & unfreeze Ming Lei
2023-07-11 10:38   ` Sagi Grimberg
2023-07-21  1:58   ` Yi Zhang
2023-07-11  9:40 ` [PATCH V2 3/3] nvme-rdma: " Ming Lei
2023-07-11 10:38   ` Sagi Grimberg
2023-07-21  1:58   ` Yi Zhang
2023-07-21  7:59 ` [PATCH 0/3] nvme fabircs: fix io hang in error recovery vs. removal Keith Busch

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.