* [PATCH 0/6] fix possible controller reset hangs in nvme-tcp/nvme-rdma
@ 2020-08-03  6:58 Sagi Grimberg
  2020-08-03  6:58 ` [PATCH 1/6] nvme-fabrics: allow to queue requests for live queues Sagi Grimberg
                   ` (5 more replies)
  0 siblings, 6 replies; 21+ messages in thread
From: Sagi Grimberg @ 2020-08-03  6:58 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, James Smart

When a controller reset runs during I/O we may hang if the controller
suddenly becomes unresponsive during the reset and/or the reconnection
stages. This is because the timeout handler did not fail inflight
commands properly and could not abort the controller reset sequence
when the controller became unresponsive (hence we could never recover,
even if the controller became responsive again later).

This set fixes nvme-tcp and nvme-rdma for exactly the same scenarios.

Patch 1 allows commands to be queued to a live queue, preventing
commands from mistakenly being requeued forever while we are either
resetting or connecting to a controller.

Patches 2, 4 and 6 address the case where the controller stops
responding while we are in the middle of the connection establishment
stage (tcp and rdma).

Patches 3 and 5 rework the timeout handler to fail commands (and allow
them to either be requeued or failed) in case the controller is not
responsive while we are in the middle of a reset (teardown) or
establishment (connect) sequence.
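
Below is a rough, standalone sketch of the decision flow the reworked
timeout handlers in patches 3 and 5 end up following. It loosely mirrors
the rdma variant (the tcp handler differs slightly in the CONNECTING
case); the enum names, helper and main() are illustrative assumptions
for the sake of a compilable example, not driver code.
--
#include <stdbool.h>
#include <stdio.h>

enum ctrl_state { ST_RESETTING, ST_CONNECTING, ST_LIVE };
enum eh_ret { EH_DONE, EH_RESET_TIMER };

static enum eh_ret timeout_decision(enum ctrl_state state,
				    bool admin_queue, bool reserved)
{
	switch (state) {
	case ST_RESETTING:
		/* admin commands would block the teardown sequence: fail now */
		if (admin_queue)
			return EH_DONE;
		/* teardown handles the rest before we enter CONNECTING */
		return EH_RESET_TIMER;
	case ST_CONNECTING:
		/* connect (reserved) or admin commands block controller setup */
		if (reserved || admin_queue)
			return EH_DONE;
		/* fall through */
	default:
		/* anything else: trigger error recovery and re-arm the timer */
		return EH_RESET_TIMER;
	}
}

int main(void)
{
	/* admin command timing out during teardown -> completed immediately */
	printf("%d\n", timeout_decision(ST_RESETTING, true, false));
	/* I/O timing out on a live controller -> recovery + timer re-armed */
	printf("%d\n", timeout_decision(ST_LIVE, false, false));
	return 0;
}
--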

James, please have a look at patch 1; it relates to the discussions
we had recently. We still keep a guard for admin commands, but that
will be addressed in a follow-up set.

Sagi Grimberg (6):
  nvme-fabrics: allow to queue requests for live queues
  nvme: have nvme_wait_freeze_timeout return if it timed out
  nvme-tcp: fix timeout handler
  nvme-tcp: fix reset hang if controller died in the middle of a reset
  nvme-rdma: fix timeout handler
  nvme-rdma: fix reset hang if controller died in the middle of a reset

 drivers/nvme/host/core.c    |  3 +-
 drivers/nvme/host/fabrics.c | 13 +++---
 drivers/nvme/host/nvme.h    |  2 +-
 drivers/nvme/host/rdma.c    | 78 ++++++++++++++++++++++++++---------
 drivers/nvme/host/tcp.c     | 81 +++++++++++++++++++++++++++----------
 5 files changed, 130 insertions(+), 47 deletions(-)

-- 
2.25.1



* [PATCH 1/6] nvme-fabrics: allow to queue requests for live queues
  2020-08-03  6:58 [PATCH 0/6] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
@ 2020-08-03  6:58 ` Sagi Grimberg
  2020-08-03  6:58 ` [PATCH 2/6] nvme: have nvme_wait_freeze_timeout return if it timed out Sagi Grimberg
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 21+ messages in thread
From: Sagi Grimberg @ 2020-08-03  6:58 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, James Smart

Right now we are failing requests based on the controller
state (which is checked inline in nvmf_check_ready); however,
we should definitely accept requests if the queue is live.

When entering controller reset, we transition the controller
into NVME_CTRL_RESETTING, and then return BLK_STS_RESOURCE for
non-mpath requests (those that don't have blk_noretry_request set).

This is also the case for NVME_REQ_USERCMD requests, for the wrong
reason: there shouldn't be any reason for us to reject this I/O
during a controller reset. We do want to prevent user passthru
commands on the admin queue, because we need the controller to fully
initialize before we let user passthru admin commands be issued.

In a non-mpath setup, this means that the requests will simply
be requeued over and over forever, never allowing the q_usage_counter
to drop its final reference, causing the controller reset to hang
if it runs concurrently with heavy I/O.

While we are at it, remove the redundant NVME_CTRL_NEW case, which
should never see any I/O as it must first transition to
NVME_CTRL_CONNECTING.
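
To make the resulting policy easier to follow end to end (including
the fall-back to the queue-live check at the tail of __nvmf_check_ready,
which is not visible in the hunk below), here is a compilable toy model
of it. The struct and field names are illustrative assumptions, not the
fabrics code:
--
#include <stdbool.h>
#include <stdio.h>

enum ctrl_state { ST_CONNECTING, ST_RESETTING, ST_LIVE };

struct req_model {
	bool queue_live;	/* target queue already (re)connected */
	bool on_admin_q;	/* submitted on the admin queue */
	bool user_passthru;	/* NVME_REQ_USERCMD-style request */
	bool is_connect;	/* fabrics connect command */
};

/* true: start the request now; false: not ready (requeue or fail it) */
static bool check_ready(enum ctrl_state state, const struct req_model *rq)
{
	if (state == ST_LIVE)
		return true;
	/* user passthru on the admin queue must wait for full controller init */
	if (rq->on_admin_q && rq->user_passthru)
		return false;
	/* while connecting, only the fabrics connect command itself may pass */
	if (state == ST_CONNECTING && rq->is_connect)
		return true;
	/* everything else is accepted only if the target queue is live */
	return rq->queue_live;
}

int main(void)
{
	struct req_model connect = { .is_connect = true };
	struct req_model user_admin = { .on_admin_q = true, .user_passthru = true };

	printf("%d\n", check_ready(ST_CONNECTING, &connect));   /* 1: allowed */
	printf("%d\n", check_ready(ST_RESETTING, &user_admin)); /* 0: not ready */
	return 0;
}
--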

Fixes: 35897b920c8a ("nvme-fabrics: fix and refine state checks in __nvmf_check_ready")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/fabrics.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 4ec4829d6233..8575724734e0 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -565,10 +565,14 @@ bool __nvmf_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
 	struct nvme_request *req = nvme_req(rq);
 
 	/*
-	 * If we are in some state of setup or teardown only allow
-	 * internally generated commands.
+	 * currently we have a problem sending passthru commands
+	 * on the admin_q if the controller is not LIVE because we can't
+	 * make sure that they are going out after the admin connect,
+	 * controller enable and/or other commands in the initialization
+	 * sequence. until the controller will be LIVE, fail with
+	 * BLK_STS_RESOURCE so that they will be rescheduled.
 	 */
-	if (!blk_rq_is_passthrough(rq) || (req->flags & NVME_REQ_USERCMD))
+	if (rq->q == ctrl->admin_q && (req->flags & NVME_REQ_USERCMD))
 		return false;
 
 	/*
@@ -576,9 +580,8 @@ bool __nvmf_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
 	 * which is require to set the queue live in the appropinquate states.
 	 */
 	switch (ctrl->state) {
-	case NVME_CTRL_NEW:
 	case NVME_CTRL_CONNECTING:
-		if (nvme_is_fabrics(req->cmd) &&
+		if (blk_rq_is_passthrough(rq) && nvme_is_fabrics(req->cmd) &&
 		    req->cmd->fabrics.fctype == nvme_fabrics_type_connect)
 			return true;
 		break;
-- 
2.25.1



* [PATCH 2/6] nvme: have nvme_wait_freeze_timeout return if it timed out
  2020-08-03  6:58 [PATCH 0/6] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
  2020-08-03  6:58 ` [PATCH 1/6] nvme-fabrics: allow to queue requests for live queues Sagi Grimberg
@ 2020-08-03  6:58 ` Sagi Grimberg
  2020-08-03  6:58 ` [PATCH 3/6] nvme-tcp: fix timeout handler Sagi Grimberg
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 21+ messages in thread
From: Sagi Grimberg @ 2020-08-03  6:58 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, James Smart

Return the remaining timeout so that callers can detect whether the
freeze wait completed or timed out, and act accordingly.
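
For context, a minimal usage sketch of the new return value, mirroring
what patches 4 and 6 in this series do in the transport reset paths
(the intended consumer, not a new API):
--
	nvme_start_queues(ctrl);
	if (!nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT)) {
		/* freeze did not complete in time: likely stuck, bail out
		 * and let error handling reconnect or remove the controller
		 */
		ret = -ENODEV;
		goto out_wait_freeze_timed_out;
	}
	blk_mq_update_nr_hw_queues(ctrl->tagset, ctrl->queue_count - 1);
	nvme_unfreeze(ctrl);
--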

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/core.c | 3 ++-
 drivers/nvme/host/nvme.h | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index e8cc728dee46..fd99ec0118a3 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4519,7 +4519,7 @@ void nvme_unfreeze(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_unfreeze);
 
-void nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout)
+int nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout)
 {
 	struct nvme_ns *ns;
 
@@ -4530,6 +4530,7 @@ void nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout)
 			break;
 	}
 	up_read(&ctrl->namespaces_rwsem);
+	return timeout;
 }
 EXPORT_SYMBOL_GPL(nvme_wait_freeze_timeout);
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index c5c1bac797aa..e82744de3cd3 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -574,7 +574,7 @@ void nvme_kill_queues(struct nvme_ctrl *ctrl);
 void nvme_sync_queues(struct nvme_ctrl *ctrl);
 void nvme_unfreeze(struct nvme_ctrl *ctrl);
 void nvme_wait_freeze(struct nvme_ctrl *ctrl);
-void nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout);
+int nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout);
 void nvme_start_freeze(struct nvme_ctrl *ctrl);
 
 #define NVME_QID_ANY -1
-- 
2.25.1



* [PATCH 3/6] nvme-tcp: fix timeout handler
  2020-08-03  6:58 [PATCH 0/6] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
  2020-08-03  6:58 ` [PATCH 1/6] nvme-fabrics: allow to queue requests for live queues Sagi Grimberg
  2020-08-03  6:58 ` [PATCH 2/6] nvme: have nvme_wait_freeze_timeout return if it timed out Sagi Grimberg
@ 2020-08-03  6:58 ` Sagi Grimberg
  2020-08-03  6:58 ` [PATCH 4/6] nvme-tcp: fix reset hang if controller died in the middle of a reset Sagi Grimberg
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 21+ messages in thread
From: Sagi Grimberg @ 2020-08-03  6:58 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, James Smart

Currently we check if the controller state is not LIVE, and
we directly fail the command under the assumption that this
is the connect command or an admin command within the
controller initialization sequence.

This is wrong: we need to check whether the request risks
blocking the controller setup/teardown sequence if it is not
completed, and only then fail it.

Moreover, we currently tear down the entire controller in the timeout
handler, which does more than we really want and causes us to freeze
the queues again. We only need to fence the command's queue and the
error recovery teardown so nothing else can complete the timed-out
command.

The logic should be:
- RESETTING: fail only fabrics/admin commands, otherwise controller
  teardown will block; for everything else, reset the timer and come
  back again.
- CONNECTING: if this is a connect (or an admin command), fail it
  right away (to unblock controller initialization); otherwise treat
  it like anything else.
- otherwise: trigger error recovery and reset the timer (the error
  handler will take care of completing or delaying it).

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/tcp.c | 70 +++++++++++++++++++++++++++++------------
 1 file changed, 50 insertions(+), 20 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 62fbaecdc960..ddacfeaa2543 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -464,6 +464,7 @@ static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
 	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
 		return;
 
+	dev_warn(ctrl->device, "starting error recovery\n");
 	queue_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work);
 }
 
@@ -2149,40 +2150,69 @@ static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
 	nvme_tcp_queue_request(&ctrl->async_req, true, true);
 }
 
+static void nvme_tcp_complete_timed_out(struct request *rq)
+{
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+	struct nvme_ctrl *ctrl = &req->queue->ctrl->ctrl;
+
+	/* fence other contexts that may complete the command */
+	flush_work(&to_tcp_ctrl(ctrl)->err_work);
+	nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
+	if (blk_mq_request_completed(rq))
+		return;
+	nvme_req(rq)->flags |= NVME_REQ_CANCELLED;
+	nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
+	blk_mq_complete_request(rq);
+}
+
 static enum blk_eh_timer_return
 nvme_tcp_timeout(struct request *rq, bool reserved)
 {
 	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
-	struct nvme_tcp_ctrl *ctrl = req->queue->ctrl;
+	struct nvme_ctrl *ctrl = &req->queue->ctrl->ctrl;
 	struct nvme_tcp_cmd_pdu *pdu = req->pdu;
 
-	/*
-	 * Restart the timer if a controller reset is already scheduled. Any
-	 * timed out commands would be handled before entering the connecting
-	 * state.
-	 */
-	if (ctrl->ctrl.state == NVME_CTRL_RESETTING)
-		return BLK_EH_RESET_TIMER;
-
-	dev_warn(ctrl->ctrl.device,
+	dev_warn(ctrl->device,
 		"queue %d: timeout request %#x type %d\n",
 		nvme_tcp_queue_id(req->queue), rq->tag, pdu->hdr.type);
 
-	if (ctrl->ctrl.state != NVME_CTRL_LIVE) {
+	switch (ctrl->state) {
+	case NVME_CTRL_RESETTING:
+		if (!nvme_tcp_queue_id(req->queue)) {
+			/*
+			 * if we are in teardown we must complete immediately
+			 * because we may block the teardown sequence (e.g.
+			 * nvme_disable_ctrl timed out).
+			 */
+			nvme_tcp_complete_timed_out(rq);
+			return BLK_EH_DONE;
+		}
+		/*
+		 * Restart the timer if a controller reset is already scheduled.
+		 * Any timed out commands would be handled before entering the
+		 * connecting state.
+		 */
+		return BLK_EH_RESET_TIMER;
+	case NVME_CTRL_CONNECTING:
 		/*
-		 * Teardown immediately if controller times out while starting
-		 * or we are already started error recovery. all outstanding
-		 * requests are completed on shutdown, so we return BLK_EH_DONE.
+		 * if we are connecting we must complete immediately
+		 * because we may block controller setup sequence
+		 * - connect requests
+		 * - initialization admin requests
+		 * - I/O requests that entered when unquiesced and
+		 *   the controller stopped responding
 		 */
-		flush_work(&ctrl->err_work);
-		nvme_tcp_teardown_io_queues(&ctrl->ctrl, false);
-		nvme_tcp_teardown_admin_queue(&ctrl->ctrl, false);
+		nvme_tcp_complete_timed_out(rq);
 		return BLK_EH_DONE;
+	default:
+		/*
+		 * every other state should trigger the error recovery
+		 * which will be handled by the flow and controller state
+		 * machine
+		 */
+		nvme_tcp_error_recovery(ctrl);
 	}
 
-	dev_warn(ctrl->ctrl.device, "starting error recovery\n");
-	nvme_tcp_error_recovery(&ctrl->ctrl);
-
 	return BLK_EH_RESET_TIMER;
 }
 
-- 
2.25.1



* [PATCH 4/6] nvme-tcp: fix reset hang if controller died in the middle of a reset
  2020-08-03  6:58 [PATCH 0/6] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
                   ` (2 preceding siblings ...)
  2020-08-03  6:58 ` [PATCH 3/6] nvme-tcp: fix timeout handler Sagi Grimberg
@ 2020-08-03  6:58 ` Sagi Grimberg
  2020-08-03  6:58 ` [PATCH 5/6] nvme-rdma: fix timeout handler Sagi Grimberg
  2020-08-03  6:58 ` [PATCH 6/6] nvme-rdma: fix reset hang if controller died in the middle of a reset Sagi Grimberg
  5 siblings, 0 replies; 21+ messages in thread
From: Sagi Grimberg @ 2020-08-03  6:58 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, James Smart

If the controller becomes unresponsive in the middle of a reset, we
will hang because we are waiting for the freeze to complete, but that
cannot happen since we have inflight commands holding the
q_usage_counter, and we can't blindly fail requests that time out.

So wait for the queue freeze with a timeout, and if it does not
complete before we unfreeze, fail and have the error handling decide
how to proceed (either schedule a reconnect or remove the controller).

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/tcp.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index ddacfeaa2543..a0a5d8baddf5 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1783,7 +1783,13 @@ static int nvme_tcp_configure_io_queues(struct nvme_ctrl *ctrl, bool new)
 
 	if (!new) {
 		nvme_start_queues(ctrl);
-		nvme_wait_freeze(ctrl);
+		if (!nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT)) {
+			/* if we timed out waiting for freeze we are
+			 * likely stuck, fail just to be safe
+			 */
+			ret = -ENODEV;
+			goto out_wait_freeze_timed_out;
+		}
 		blk_mq_update_nr_hw_queues(ctrl->tagset,
 			ctrl->queue_count - 1);
 		nvme_unfreeze(ctrl);
@@ -1791,6 +1797,9 @@ static int nvme_tcp_configure_io_queues(struct nvme_ctrl *ctrl, bool new)
 
 	return 0;
 
+out_wait_freeze_timed_out:
+	nvme_stop_queues(ctrl);
+	nvme_tcp_stop_io_queues(ctrl);
 out_cleanup_connect_q:
 	if (new)
 		blk_cleanup_queue(ctrl->connect_q);
-- 
2.25.1



* [PATCH 5/6] nvme-rdma: fix timeout handler
  2020-08-03  6:58 [PATCH 0/6] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
                   ` (3 preceding siblings ...)
  2020-08-03  6:58 ` [PATCH 4/6] nvme-tcp: fix reset hang if controller died in the middle of a reset Sagi Grimberg
@ 2020-08-03  6:58 ` Sagi Grimberg
  2020-08-03 10:25   ` Chao Leng
  2020-08-06 19:52   ` David Milburn
  2020-08-03  6:58 ` [PATCH 6/6] nvme-rdma: fix reset hang if controller died in the middle of a reset Sagi Grimberg
  5 siblings, 2 replies; 21+ messages in thread
From: Sagi Grimberg @ 2020-08-03  6:58 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, James Smart

Currently we check if the controller state is not LIVE, and
we directly fail the command under the assumption that this
is the connect command or an admin command within the
controller initialization sequence.

This is wrong: we need to check whether the request risks
blocking the controller setup/teardown sequence if it is not
completed, and only then fail it.

The logic should be:
- RESETTING: fail only fabrics/admin commands, otherwise controller
  teardown will block; for everything else, reset the timer and come
  back again.
- CONNECTING: if this is a connect (or an admin command), fail it
  right away (to unblock controller initialization); otherwise treat
  it like anything else.
- otherwise: trigger error recovery and reset the timer (the error
  handler will take care of completing or delaying it).

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/rdma.c | 67 +++++++++++++++++++++++++++++-----------
 1 file changed, 49 insertions(+), 18 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 44c76ffbb264..a58c6deaf691 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1180,6 +1180,7 @@ static void nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
 	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
 		return;
 
+	dev_warn(ctrl->ctrl.device, "starting error recovery\n");
 	queue_work(nvme_reset_wq, &ctrl->err_work);
 }
 
@@ -1946,6 +1947,22 @@ static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
 	return 0;
 }
 
+static void nvme_rdma_complete_timed_out(struct request *rq)
+{
+	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
+	struct nvme_rdma_queue *queue = req->queue;
+	struct nvme_rdma_ctrl *ctrl = queue->ctrl;
+
+	/* fence other contexts that may complete the command */
+	flush_work(&ctrl->err_work);
+	nvme_rdma_stop_queue(queue);
+	if (blk_mq_request_completed(rq))
+		return;
+	nvme_req(rq)->flags |= NVME_REQ_CANCELLED;
+	nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
+	blk_mq_complete_request(rq);
+}
+
 static enum blk_eh_timer_return
 nvme_rdma_timeout(struct request *rq, bool reserved)
 {
@@ -1956,29 +1973,43 @@ nvme_rdma_timeout(struct request *rq, bool reserved)
 	dev_warn(ctrl->ctrl.device, "I/O %d QID %d timeout\n",
 		 rq->tag, nvme_rdma_queue_idx(queue));
 
-	/*
-	 * Restart the timer if a controller reset is already scheduled. Any
-	 * timed out commands would be handled before entering the connecting
-	 * state.
-	 */
-	if (ctrl->ctrl.state == NVME_CTRL_RESETTING)
+	switch (ctrl->ctrl.state) {
+	case NVME_CTRL_RESETTING:
+		if (!nvme_rdma_queue_idx(queue)) {
+			/*
+			 * if we are in teardown we must complete immediately
+			 * because we may block the teardown sequence (e.g.
+			 * nvme_disable_ctrl timed out).
+			 */
+			nvme_rdma_complete_timed_out(rq);
+			return BLK_EH_DONE;
+		}
+		/*
+		 * Restart the timer if a controller reset is already scheduled.
+		 * Any timed out commands would be handled before entering the
+		 * connecting state.
+		 */
 		return BLK_EH_RESET_TIMER;
-
-	if (ctrl->ctrl.state != NVME_CTRL_LIVE) {
+	case NVME_CTRL_CONNECTING:
+		if (reserved || !nvme_rdma_queue_idx(queue)) {
+			/*
+			 * if we are connecting we must complete immediately
+			 * connect (reserved) or admin requests because we may
+			 * block controller setup sequence.
+			 */
+			nvme_rdma_complete_timed_out(rq);
+			return BLK_EH_DONE;
+		}
+		/* fallthru */
+	default:
 		/*
-		 * Teardown immediately if controller times out while starting
-		 * or we are already started error recovery. all outstanding
-		 * requests are completed on shutdown, so we return BLK_EH_DONE.
+		 * every other state should trigger the error recovery
+		 * which will be handled by the flow and controller state
+		 * machine
 		 */
-		flush_work(&ctrl->err_work);
-		nvme_rdma_teardown_io_queues(ctrl, false);
-		nvme_rdma_teardown_admin_queue(ctrl, false);
-		return BLK_EH_DONE;
+		nvme_rdma_error_recovery(ctrl);
 	}
 
-	dev_warn(ctrl->ctrl.device, "starting error recovery\n");
-	nvme_rdma_error_recovery(ctrl);
-
 	return BLK_EH_RESET_TIMER;
 }
 
-- 
2.25.1



* [PATCH 6/6] nvme-rdma: fix reset hang if controller died in the middle of a reset
  2020-08-03  6:58 [PATCH 0/6] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
                   ` (4 preceding siblings ...)
  2020-08-03  6:58 ` [PATCH 5/6] nvme-rdma: fix timeout handler Sagi Grimberg
@ 2020-08-03  6:58 ` Sagi Grimberg
  5 siblings, 0 replies; 21+ messages in thread
From: Sagi Grimberg @ 2020-08-03  6:58 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, James Smart

If the controller becomes unresponsive in the middle of a reset, we
will hang because we are waiting for the freeze to complete, but that
cannot happen since we have inflight commands holding the
q_usage_counter, and we can't blindly fail requests that time out.

So wait for the queue freeze with a timeout, and if it does not
complete before we unfreeze, fail and have the error handling decide
how to proceed (either schedule a reconnect or remove the controller).

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/rdma.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index a58c6deaf691..96fa3185d123 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -975,7 +975,13 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
 
 	if (!new) {
 		nvme_start_queues(&ctrl->ctrl);
-		nvme_wait_freeze(&ctrl->ctrl);
+		if (!nvme_wait_freeze_timeout(&ctrl->ctrl, NVME_IO_TIMEOUT)) {
+			/* if we timed out waiting for freeze we are
+			 * likely stuck, fail just to be safe
+			 */
+			ret = -ENODEV;
+			goto out_wait_freeze_timed_out;
+		}
 		blk_mq_update_nr_hw_queues(ctrl->ctrl.tagset,
 			ctrl->ctrl.queue_count - 1);
 		nvme_unfreeze(&ctrl->ctrl);
@@ -983,6 +989,9 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
 
 	return 0;
 
+out_wait_freeze_timed_out:
+	nvme_stop_queues(&ctrl->ctrl);
+	nvme_rdma_stop_io_queues(ctrl);
 out_cleanup_connect_q:
 	if (new)
 		blk_cleanup_queue(ctrl->ctrl.connect_q);
-- 
2.25.1



* Re: [PATCH 5/6] nvme-rdma: fix timeout handler
  2020-08-03  6:58 ` [PATCH 5/6] nvme-rdma: fix timeout handler Sagi Grimberg
@ 2020-08-03 10:25   ` Chao Leng
  2020-08-03 15:03     ` Sagi Grimberg
  2020-08-06 19:52   ` David Milburn
  1 sibling, 1 reply; 21+ messages in thread
From: Chao Leng @ 2020-08-03 10:25 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch, James Smart



On 2020/8/3 14:58, Sagi Grimberg wrote:
> Currently we check if the controller state != LIVE, and
> we directly fail the command under the assumption that this
> is the connect command or an admin command within the
> controller initialization sequence.
> 
> This is wrong, we need to check if the request risking
> controller setup/teardown blocking if not completed and
> only then fail.
> 
> The logic should be:
> - RESETTING, only fail fabrics/admin commands otherwise
>    controller teardown will block. otherwise reset the timer
>    and come back again.
> - CONNECTING, if this is a connect (or an admin command), we fail
>    right away (unblock controller initialization), otherwise we
>    treat it like anything else.
> - otherwise trigger error recovery and reset the timer (the
>    error handler will take care of completing/delaying it).
> 
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>   drivers/nvme/host/rdma.c | 67 +++++++++++++++++++++++++++++-----------
>   1 file changed, 49 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 44c76ffbb264..a58c6deaf691 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -1180,6 +1180,7 @@ static void nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
>   	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
>   		return;
>   
> +	dev_warn(ctrl->ctrl.device, "starting error recovery\n");
>   	queue_work(nvme_reset_wq, &ctrl->err_work);
>   }
>   
> @@ -1946,6 +1947,22 @@ static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
>   	return 0;
>   }
>   
> +static void nvme_rdma_complete_timed_out(struct request *rq)
> +{
> +	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
> +	struct nvme_rdma_queue *queue = req->queue;
> +	struct nvme_rdma_ctrl *ctrl = queue->ctrl;
> +
> +	/* fence other contexts that may complete the command */
> +	flush_work(&ctrl->err_work);
> +	nvme_rdma_stop_queue(queue);
This may race with error recovery and cause problems, because
nvme_rdma_stop_queue may return while the queue is not actually stopped
yet; it may still be in the process of being stopped by error recovery.
> +	if (blk_mq_request_completed(rq))
> +		return;
> +	nvme_req(rq)->flags |= NVME_REQ_CANCELLED;
> +	nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
> +	blk_mq_complete_request(rq);
> +}
> +
>   static enum blk_eh_timer_return
>   nvme_rdma_timeout(struct request *rq, bool reserved)
>   {
> @@ -1956,29 +1973,43 @@ nvme_rdma_timeout(struct request *rq, bool reserved)
>   	dev_warn(ctrl->ctrl.device, "I/O %d QID %d timeout\n",
>   		 rq->tag, nvme_rdma_queue_idx(queue));
>   
> -	/*
> -	 * Restart the timer if a controller reset is already scheduled. Any
> -	 * timed out commands would be handled before entering the connecting
> -	 * state.
> -	 */
> -	if (ctrl->ctrl.state == NVME_CTRL_RESETTING)
> +	switch (ctrl->ctrl.state) {
> +	case NVME_CTRL_RESETTING:
> +		if (!nvme_rdma_queue_idx(queue)) {
> +			/*
> +			 * if we are in teardown we must complete immediately
> +			 * because we may block the teardown sequence (e.g.
> +			 * nvme_disable_ctrl timed out).
> +			 */
> +			nvme_rdma_complete_timed_out(rq);
> +			return BLK_EH_DONE;
> +		}
> +		/*
> +		 * Restart the timer if a controller reset is already scheduled.
> +		 * Any timed out commands would be handled before entering the
> +		 * connecting state.
> +		 */
>   		return BLK_EH_RESET_TIMER;
> -
> -	if (ctrl->ctrl.state != NVME_CTRL_LIVE) {
> +	case NVME_CTRL_CONNECTING:
> +		if (reserved || !nvme_rdma_queue_idx(queue)) {
> +			/*
> +			 * if we are connecting we must complete immediately
> +			 * connect (reserved) or admin requests because we may
> +			 * block controller setup sequence.
> +			 */
> +			nvme_rdma_complete_timed_out(rq);
> +			return BLK_EH_DONE;
> +		}
> +		/* fallthru */
> +	default:
>   		/*
> -		 * Teardown immediately if controller times out while starting
> -		 * or we are already started error recovery. all outstanding
> -		 * requests are completed on shutdown, so we return BLK_EH_DONE.
> +		 * every other state should trigger the error recovery
> +		 * which will be handled by the flow and controller state
> +		 * machine
>   		 */
> -		flush_work(&ctrl->err_work);
> -		nvme_rdma_teardown_io_queues(ctrl, false);
> -		nvme_rdma_teardown_admin_queue(ctrl, false);
> -		return BLK_EH_DONE;
> +		nvme_rdma_error_recovery(ctrl);
>   	}
>   
> -	dev_warn(ctrl->ctrl.device, "starting error recovery\n");
> -	nvme_rdma_error_recovery(ctrl);
> -
>   	return BLK_EH_RESET_TIMER;
>   }
>   
> 


* Re: [PATCH 5/6] nvme-rdma: fix timeout handler
  2020-08-03 10:25   ` Chao Leng
@ 2020-08-03 15:03     ` Sagi Grimberg
  2020-08-04  1:49       ` Chao Leng
  0 siblings, 1 reply; 21+ messages in thread
From: Sagi Grimberg @ 2020-08-03 15:03 UTC (permalink / raw)
  To: Chao Leng, linux-nvme, Christoph Hellwig, Keith Busch, James Smart


>> @@ -1946,6 +1947,22 @@ static int nvme_rdma_cm_handler(struct 
>> rdma_cm_id *cm_id,
>>       return 0;
>>   }
>> +static void nvme_rdma_complete_timed_out(struct request *rq)
>> +{
>> +    struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
>> +    struct nvme_rdma_queue *queue = req->queue;
>> +    struct nvme_rdma_ctrl *ctrl = queue->ctrl;
>> +
>> +    /* fence other contexts that may complete the command */
>> +    flush_work(&ctrl->err_work);
>> +    nvme_rdma_stop_queue(queue);
> There maybe concurrent with error recovery, may cause abnormal because
> nvme_rdma_stop_queue will return but the queue is not stoped,
> maybe is stopping by the error recovery.

The err_work flush is used to fence; once we have stopped the queue, it
should be safe to complete the command from the timeout handler.


* Re: [PATCH 5/6] nvme-rdma: fix timeout handler
  2020-08-03 15:03     ` Sagi Grimberg
@ 2020-08-04  1:49       ` Chao Leng
  2020-08-04 15:36         ` Sagi Grimberg
  0 siblings, 1 reply; 21+ messages in thread
From: Chao Leng @ 2020-08-04  1:49 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch, James Smart



On 2020/8/3 23:03, Sagi Grimberg wrote:
> 
>>> @@ -1946,6 +1947,22 @@ static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
>>>       return 0;
>>>   }
>>> +static void nvme_rdma_complete_timed_out(struct request *rq)
>>> +{
>>> +    struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
>>> +    struct nvme_rdma_queue *queue = req->queue;
>>> +    struct nvme_rdma_ctrl *ctrl = queue->ctrl;
>>> +
>>> +    /* fence other contexts that may complete the command */
>>> +    flush_work(&ctrl->err_work);
>>> +    nvme_rdma_stop_queue(queue);
>> There maybe concurrent with error recovery, may cause abnormal because
>> nvme_rdma_stop_queue will return but the queue is not stoped,
>> maybe is stopping by the error recovery.
> 
> err_work flush used to fence, once we did queue stop, it should be safe
> to complete the command from the timeout handler.

Flushing the work can only prevent nvme_rdma_timeout itself from
triggering error recovery, or reduce the probability of racing with
error recovery triggered from other paths, but it cannot eliminate the
race. If nvme_rdma_cm_handler or another path calls
nvme_rdma_error_recovery, a hard interrupt may occur between the state
change and the queue_work, and the timeout may fire right then, so the
flush_work cannot prevent the concurrency.
Like this:

static void nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
{
	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
		return;
--------------------------------
We may be interrupted by a hard interrupt here, and the timeout path
may flush the (not yet queued) work at this point. Error recovery and
nvme_rdma_complete_timed_out may then race to stop the queue, which
means error recovery may cancel the request, or
nvme_rdma_complete_timed_out may complete the request, while the queue
is not actually stopped. This leads to incorrect behavior.
--------------------------------
	queue_work(nvme_reset_wq, &ctrl->err_work);
}

In addition, although the probability of occurrence is very low, the
reset work and nvme_rdma_complete_timed_out may also race to stop the
queue, which may also cause problems.


* Re: [PATCH 5/6] nvme-rdma: fix timeout handler
  2020-08-04  1:49       ` Chao Leng
@ 2020-08-04 15:36         ` Sagi Grimberg
  2020-08-05  1:07           ` Chao Leng
  0 siblings, 1 reply; 21+ messages in thread
From: Sagi Grimberg @ 2020-08-04 15:36 UTC (permalink / raw)
  To: Chao Leng, linux-nvme, Christoph Hellwig, Keith Busch, James Smart


>>>> +static void nvme_rdma_complete_timed_out(struct request *rq)
>>>> +{
>>>> +    struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
>>>> +    struct nvme_rdma_queue *queue = req->queue;
>>>> +    struct nvme_rdma_ctrl *ctrl = queue->ctrl;
>>>> +
>>>> +    /* fence other contexts that may complete the command */
>>>> +    flush_work(&ctrl->err_work);
>>>> +    nvme_rdma_stop_queue(queue);
>>> There maybe concurrent with error recovery, may cause abnormal because
>>> nvme_rdma_stop_queue will return but the queue is not stoped,
>>> maybe is stopping by the error recovery.
>>
>> err_work flush used to fence, once we did queue stop, it should be safe
>> to complete the command from the timeout handler.
> 
> Flush work just can avoid trigger error recovery by nvme_rdma_timeout or
> reduce concurrent probalibity trigger error recovery by other progress,
> but can not avoid.

The point is that we can complete the command because err_work
was flushed and the queue was stopped, which means we shouldn't have
any context completing the request.

> if nvme_rdma_cm_handler or other progress call
> nvme_rdma_error_recovery, between change state to queue_work may
> interrupt by hard interrupt, and then timeout happen, thus flush work
> can not avoid concurrent.
> Like this:
> 
> static void nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
> {
>      if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
>          return;
> --------------------------------

If we are in RESETTING/CONNECTING state already, this won't do anything.

> may interrupt by hard interrupt, and then timeout progress flush work
> at this time. Thus error recovery and nvme_rdma_complete_timed_out may
> concurrent to stop queue. will cause: error recovery may cancel request
> or nvme_rdma_complete_timed_out may complete request, but the queue may
> not be stoped. Thus will cause abnormal.

We should be fine and safe to complete the I/O.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/6] nvme-rdma: fix timeout handler
  2020-08-04 15:36         ` Sagi Grimberg
@ 2020-08-05  1:07           ` Chao Leng
  2020-08-05  1:12             ` Sagi Grimberg
  0 siblings, 1 reply; 21+ messages in thread
From: Chao Leng @ 2020-08-05  1:07 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch, James Smart



On 2020/8/4 23:36, Sagi Grimberg wrote:
> 
>>>>> +static void nvme_rdma_complete_timed_out(struct request *rq)
>>>>> +{
>>>>> +    struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
>>>>> +    struct nvme_rdma_queue *queue = req->queue;
>>>>> +    struct nvme_rdma_ctrl *ctrl = queue->ctrl;
>>>>> +
>>>>> +    /* fence other contexts that may complete the command */
>>>>> +    flush_work(&ctrl->err_work);
>>>>> +    nvme_rdma_stop_queue(queue);
>>>> There maybe concurrent with error recovery, may cause abnormal because
>>>> nvme_rdma_stop_queue will return but the queue is not stoped,
>>>> maybe is stopping by the error recovery.
>>>
>>> err_work flush used to fence, once we did queue stop, it should be safe
>>> to complete the command from the timeout handler.
>>
>> Flush work just can avoid trigger error recovery by nvme_rdma_timeout or
>> reduce concurrent probalibity trigger error recovery by other progress,
>> but can not avoid.
> 
> The point is that we can complete the command because err_work
> was flushed and the queue was stopped, which means we shouldn't have
> any context completing the request.
> 
>> if nvme_rdma_cm_handler or other progress call
>> nvme_rdma_error_recovery, between change state to queue_work may
>> interrupt by hard interrupt, and then timeout happen, thus flush work
>> can not avoid concurrent.
>> Like this:
>>
>> static void nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
>> {
>>      if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
>>          return;
>> --------------------------------
> 
> If we are in RESETTING/CONNECTING state already, this won't do anything.
> 
>> may interrupt by hard interrupt, and then timeout progress flush work
>> at this time. Thus error recovery and nvme_rdma_complete_timed_out may
>> concurrent to stop queue. will cause: error recovery may cancel request
>> or nvme_rdma_complete_timed_out may complete request, but the queue may
>> not be stoped. Thus will cause abnormal.
> 
> We should be fine and safe to complete the I/O.

Completing the request in nvme_rdma_timeout, or cancelling it in
nvme_rdma_error_recovery_work or nvme_rdma_reset_ctrl_work, is not safe,
because the queue may not really be stopped; only the
NVME_RDMA_Q_ALLOCATED flag may have been cleared for the queue. Thus one
request may be handled concurrently by two contexts, which is not
allowed.


* Re: [PATCH 5/6] nvme-rdma: fix timeout handler
  2020-08-05  1:07           ` Chao Leng
@ 2020-08-05  1:12             ` Sagi Grimberg
  2020-08-05  6:27               ` Chao Leng
  0 siblings, 1 reply; 21+ messages in thread
From: Sagi Grimberg @ 2020-08-05  1:12 UTC (permalink / raw)
  To: Chao Leng, linux-nvme, Christoph Hellwig, Keith Busch, James Smart


>>> may interrupt by hard interrupt, and then timeout progress flush work
>>> at this time. Thus error recovery and nvme_rdma_complete_timed_out may
>>> concurrent to stop queue. will cause: error recovery may cancel request
>>> or nvme_rdma_complete_timed_out may complete request, but the queue may
>>> not be stoped. Thus will cause abnormal.
>>
>> We should be fine and safe to complete the I/O.
> 
> Complete request in nvme_rdma_timeout or cancel request in
> nvme_rdma_error_recovery_work or nvme_rdma_reset_ctrl_work is not safe.
> Because the queue may be not really stoped, it may just cleard the flag:
> NVME_RDMA_Q_ALLOCATED for the queue. Thus one request may concurrent
> treat by two progress, it is not allowed.

The request being timed out cannot be completed after the queue is
stopped; that is the point of nvme_rdma_stop_queue. If it is only
ALLOCATED, we have not yet connected, hence there is zero chance for
any command to complete.


* Re: [PATCH 5/6] nvme-rdma: fix timeout handler
  2020-08-05  1:12             ` Sagi Grimberg
@ 2020-08-05  6:27               ` Chao Leng
  2020-08-05  7:00                 ` Sagi Grimberg
  0 siblings, 1 reply; 21+ messages in thread
From: Chao Leng @ 2020-08-05  6:27 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch, James Smart



On 2020/8/5 9:12, Sagi Grimberg wrote:
> 
>>>> may interrupt by hard interrupt, and then timeout progress flush work
>>>> at this time. Thus error recovery and nvme_rdma_complete_timed_out may
>>>> concurrent to stop queue. will cause: error recovery may cancel request
>>>> or nvme_rdma_complete_timed_out may complete request, but the queue may
>>>> not be stoped. Thus will cause abnormal.
>>>
>>> We should be fine and safe to complete the I/O.
>>
>> Complete request in nvme_rdma_timeout or cancel request in
>> nvme_rdma_error_recovery_work or nvme_rdma_reset_ctrl_work is not safe.
>> Because the queue may be not really stoped, it may just cleard the flag:
>> NVME_RDMA_Q_ALLOCATED for the queue. Thus one request may concurrent
>> treat by two progress, it is not allowed.
> 
> The request being timed out cannot be completed after the queue is
> stopped, that is the point of nvme_rdma_stop_queue. if it is only
> ALLOCATED, we did not yet connect hence there is zero chance for
> any command to complete.
The request may already be completed before the queue is stopped: its
completion sits in the CQ but has not yet been processed by software.
If the nvme_rdma_stop_queue calls race, for example: error recovery runs
first, clears the NVME_RDMA_Q_LIVE flag, and then waits to drain the CQ.
At the same time, nvme_rdma_stop_queue called from nvme_rdma_timeout
returns immediately, and the timeout path may then call
blk_mq_complete_request, while error recovery drains the CQ at the same
time and may process the same request.


* Re: [PATCH 5/6] nvme-rdma: fix timeout handler
  2020-08-05  6:27               ` Chao Leng
@ 2020-08-05  7:00                 ` Sagi Grimberg
  2020-08-05  7:14                   ` Chao Leng
  0 siblings, 1 reply; 21+ messages in thread
From: Sagi Grimberg @ 2020-08-05  7:00 UTC (permalink / raw)
  To: Chao Leng, linux-nvme, Christoph Hellwig, Keith Busch, James Smart


>>>>> may interrupt by hard interrupt, and then timeout progress flush work
>>>>> at this time. Thus error recovery and nvme_rdma_complete_timed_out may
>>>>> concurrent to stop queue. will cause: error recovery may cancel 
>>>>> request
>>>>> or nvme_rdma_complete_timed_out may complete request, but the queue 
>>>>> may
>>>>> not be stoped. Thus will cause abnormal.
>>>>
>>>> We should be fine and safe to complete the I/O.
>>>
>>> Complete request in nvme_rdma_timeout or cancel request in
>>> nvme_rdma_error_recovery_work or nvme_rdma_reset_ctrl_work is not safe.
>>> Because the queue may be not really stoped, it may just cleard the flag:
>>> NVME_RDMA_Q_ALLOCATED for the queue. Thus one request may concurrent
>>> treat by two progress, it is not allowed.
>>
>> The request being timed out cannot be completed after the queue is
>> stopped, that is the point of nvme_rdma_stop_queue. if it is only
>> ALLOCATED, we did not yet connect hence there is zero chance for
>> any command to complete.
> The request may already complete before stop queue, it is in the cq, but
> is not treated by software.

Not possible, ib_drain_cq completion guarantees that all cqes were
reaped and handled by SW.

> If nvme_rdma_stop_queue concurrent

Before we complete we make sure the queue is stopped (and drained and
reaped).

, for
> example:
> The error recovery run first, it will clear the flag:NVME_RDMA_Q_LIVE,
> and then wait drain cq. At the same time nvme_rdma_timeout
> call nvme_rdma_stop_queue will return immediately, and then may call
> blk_mq_complete_request. but error recovery may drain cq at the same
> time, and may also treat the same request.

We flush the err_work before running nvme_rdma_stop_queue exactly
because of that. Your example cannot happen.


* Re: [PATCH 5/6] nvme-rdma: fix timeout handler
  2020-08-05  7:00                 ` Sagi Grimberg
@ 2020-08-05  7:14                   ` Chao Leng
  2020-08-05  7:19                     ` Sagi Grimberg
  0 siblings, 1 reply; 21+ messages in thread
From: Chao Leng @ 2020-08-05  7:14 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch, James Smart



On 2020/8/5 15:00, Sagi Grimberg wrote:
> 
>>>>>> may interrupt by hard interrupt, and then timeout progress flush work
>>>>>> at this time. Thus error recovery and nvme_rdma_complete_timed_out may
>>>>>> concurrent to stop queue. will cause: error recovery may cancel request
>>>>>> or nvme_rdma_complete_timed_out may complete request, but the queue may
>>>>>> not be stoped. Thus will cause abnormal.
>>>>>
>>>>> We should be fine and safe to complete the I/O.
>>>>
>>>> Complete request in nvme_rdma_timeout or cancel request in
>>>> nvme_rdma_error_recovery_work or nvme_rdma_reset_ctrl_work is not safe.
>>>> Because the queue may be not really stoped, it may just cleard the flag:
>>>> NVME_RDMA_Q_ALLOCATED for the queue. Thus one request may concurrent
>>>> treat by two progress, it is not allowed.
>>>
>>> The request being timed out cannot be completed after the queue is
>>> stopped, that is the point of nvme_rdma_stop_queue. if it is only
>>> ALLOCATED, we did not yet connect hence there is zero chance for
>>> any command to complete.
>> The request may already complete before stop queue, it is in the cq, but
>> is not treated by software.
> 
> Not possible, ib_drain_cq completion guarantees that all cqes were
> reaped and handled by SW.
> 
>> If nvme_rdma_stop_queue concurrent
> 
> Before we complete we make sure the queue is stopped (and drained and
> reaped).
> 
> , for
>> example:
>> The error recovery run first, it will clear the flag:NVME_RDMA_Q_LIVE,
>> and then wait drain cq. At the same time nvme_rdma_timeout
>> call nvme_rdma_stop_queue will return immediately, and then may call
>> blk_mq_complete_request. but error recovery may drain cq at the same
>> time, and may also treat the same request.
> 
> We flush the err_work before running nvme_rdma_stop_queue exactly
> because of that. your example cannot happen.
Flushing the work is not safe; see my previous email.


* Re: [PATCH 5/6] nvme-rdma: fix timeout handler
  2020-08-05  7:14                   ` Chao Leng
@ 2020-08-05  7:19                     ` Sagi Grimberg
  2020-08-05  7:35                       ` Chao Leng
  0 siblings, 1 reply; 21+ messages in thread
From: Sagi Grimberg @ 2020-08-05  7:19 UTC (permalink / raw)
  To: Chao Leng, linux-nvme, Christoph Hellwig, Keith Busch, James Smart


>>>> The request being timed out cannot be completed after the queue is
>>>> stopped, that is the point of nvme_rdma_stop_queue. if it is only
>>>> ALLOCATED, we did not yet connect hence there is zero chance for
>>>> any command to complete.
>>> The request may already complete before stop queue, it is in the cq, but
>>> is not treated by software.
>>
>> Not possible, ib_drain_cq completion guarantees that all cqes were
>> reaped and handled by SW.
>>
>>> If nvme_rdma_stop_queue concurrent
>>
>> Before we complete we make sure the queue is stopped (and drained and
>> reaped).
>>
>> , for
>>> example:
>>> The error recovery run first, it will clear the flag:NVME_RDMA_Q_LIVE,
>>> and then wait drain cq. At the same time nvme_rdma_timeout
>>> call nvme_rdma_stop_queue will return immediately, and then may call
>>> blk_mq_complete_request. but error recovery may drain cq at the same
>>> time, and may also treat the same request.
>>
>> We flush the err_work before running nvme_rdma_stop_queue exactly
>> because of that. your example cannot happen.
> Flush work is not safe. See my previous email.

How is it not safe? When flush_work returns, the work is guaranteed
to have finished execution, and we only do that in the
RESETTING/CONNECTING states, which means that it has either already
started or already finished.


* Re: [PATCH 5/6] nvme-rdma: fix timeout handler
  2020-08-05  7:19                     ` Sagi Grimberg
@ 2020-08-05  7:35                       ` Chao Leng
  2020-08-05  8:17                         ` Sagi Grimberg
  0 siblings, 1 reply; 21+ messages in thread
From: Chao Leng @ 2020-08-05  7:35 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch, James Smart



On 2020/8/5 15:19, Sagi Grimberg wrote:
> 
>>>>> The request being timed out cannot be completed after the queue is
>>>>> stopped, that is the point of nvme_rdma_stop_queue. if it is only
>>>>> ALLOCATED, we did not yet connect hence there is zero chance for
>>>>> any command to complete.
>>>> The request may already complete before stop queue, it is in the cq, but
>>>> is not treated by software.
>>>
>>> Not possible, ib_drain_cq completion guarantees that all cqes were
>>> reaped and handled by SW.
>>>
>>>> If nvme_rdma_stop_queue concurrent
>>>
>>> Before we complete we make sure the queue is stopped (and drained and
>>> reaped).
>>>
>>> , for
>>>> example:
>>>> The error recovery run first, it will clear the flag:NVME_RDMA_Q_LIVE,
>>>> and then wait drain cq. At the same time nvme_rdma_timeout
>>>> call nvme_rdma_stop_queue will return immediately, and then may call
>>>> blk_mq_complete_request. but error recovery may drain cq at the same
>>>> time, and may also treat the same request.
>>>
>>> We flush the err_work before running nvme_rdma_stop_queue exactly
>>> because of that. your example cannot happen.
>> Flush work is not safe. See my previous email.
> 
> How is it not safe? when flush_work returns, the work is guaranteed
> to have finished execution, and we only do that for states
> RESETTING/CONNECTING which means that it either has already started
> or already finished.

Even though the state is NVME_CTRL_RESETTING, it does not mean the work
has already been queued (started) or finished. There is a window between
the state change and the queue_work.

Like this:
static void nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
{
     if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
         return;
--------------------------------
We may be interrupted by a hard interrupt here, and the timeout path
may flush the (not yet queued) work at this point. Error recovery and
nvme_rdma_complete_timed_out may then race to stop the queue, which
means error recovery may cancel the request, or
nvme_rdma_complete_timed_out may complete the request, while the queue
is not actually stopped. This leads to incorrect behavior.
--------------------------------
     queue_work(nvme_reset_wq, &ctrl->err_work);
}

In addition, although the probability of occurrence is very low, the
reset work and nvme_rdma_complete_timed_out may also race to stop the
queue, which may also cause problems.


* Re: [PATCH 5/6] nvme-rdma: fix timeout handler
  2020-08-05  7:35                       ` Chao Leng
@ 2020-08-05  8:17                         ` Sagi Grimberg
  0 siblings, 0 replies; 21+ messages in thread
From: Sagi Grimberg @ 2020-08-05  8:17 UTC (permalink / raw)
  To: Chao Leng, linux-nvme, Christoph Hellwig, Keith Busch, James Smart


>> How is it not safe? when flush_work returns, the work is guaranteed
>> to have finished execution, and we only do that for states
>> RESETTING/CONNECTING which means that it either has already started
>> or already finished.
> 
> Though the state is NVME_CTRL_RESETTING, but it does not mean the work
> is already queued(started) or finished. There is a hole between Change 
> state
> and queue work.
> 
> Like this:
> static void nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
> {
>      if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
>          return;
> --------------------------------
> may interrupt by hard interrupt, and then timeout progress flush work
> at this time. Thus error recovery and nvme_rdma_complete_timed_out may
> concurrent to stop queue. will cause: error recovery may cancel request
> or nvme_rdma_complete_timed_out may complete request, but the queue may
> not be stoped. Thus will cause abnormal.
> --------------------------------
>      queue_work(nvme_reset_wq, &ctrl->err_work);
> }
> 
> Another, although the probability of occurrence is very low, reset work
> and nvme_rdma_complete_timed_out may also concurrent to stop queue, may
> also cause abnormal.

I see your point.

We can serialize ctrl teardown with a lock (similar to
dev->shutdown_lock that we have in pci).

Something like:
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 96fa3185d123..8c8f7492cab4 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1168,11 +1168,13 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
         struct nvme_rdma_ctrl *ctrl = container_of(work,
                         struct nvme_rdma_ctrl, err_work);

+       mutex_lock(&ctrl->shutdown_lock);
         nvme_stop_keep_alive(&ctrl->ctrl);
         nvme_rdma_teardown_io_queues(ctrl, false);
         nvme_start_queues(&ctrl->ctrl);
         nvme_rdma_teardown_admin_queue(ctrl, false);
         blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
+       mutex_unlock(&ctrl->shutdown_lock);

         if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
                 /* state change failure is ok if we started ctrl delete */
@@ -1964,7 +1966,9 @@ static void nvme_rdma_complete_timed_out(struct request *rq)

         /* fence other contexts that may complete the command */
         flush_work(&ctrl->err_work);
+       mutex_lock(&ctrl->shutdown_lock);
         nvme_rdma_stop_queue(queue);
+       mutex_unlock(&ctrl->shutdown_lock);
         if (blk_mq_request_completed(rq))
                 return;
         nvme_req(rq)->flags |= NVME_REQ_CANCELLED;
@@ -2226,6 +2230,7 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
         cancel_work_sync(&ctrl->err_work);
         cancel_delayed_work_sync(&ctrl->reconnect_work);

+       mutex_lock(&ctrl->shutdown_lock);
         nvme_rdma_teardown_io_queues(ctrl, shutdown);
         blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
         if (shutdown)
@@ -2233,6 +2238,7 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
         else
                 nvme_disable_ctrl(&ctrl->ctrl);
         nvme_rdma_teardown_admin_queue(ctrl, shutdown);
+       mutex_unlock(&ctrl->shutdown_lock);
  }

  static void nvme_rdma_delete_ctrl(struct nvme_ctrl *ctrl)
--


* Re: [PATCH 5/6] nvme-rdma: fix timeout handler
  2020-08-03  6:58 ` [PATCH 5/6] nvme-rdma: fix timeout handler Sagi Grimberg
  2020-08-03 10:25   ` Chao Leng
@ 2020-08-06 19:52   ` David Milburn
  2020-08-06 20:11     ` Sagi Grimberg
  1 sibling, 1 reply; 21+ messages in thread
From: David Milburn @ 2020-08-06 19:52 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch, James Smart


Hi Sagi,

On 08/03/2020 01:58 AM, Sagi Grimberg wrote:
> Currently we check if the controller state != LIVE, and
> we directly fail the command under the assumption that this
> is the connect command or an admin command within the
> controller initialization sequence.
> 
> This is wrong, we need to check if the request risking
> controller setup/teardown blocking if not completed and
> only then fail.
> 
> The logic should be:
> - RESETTING, only fail fabrics/admin commands otherwise
>    controller teardown will block. otherwise reset the timer
>    and come back again.
> - CONNECTING, if this is a connect (or an admin command), we fail
>    right away (unblock controller initialization), otherwise we
>    treat it like anything else.
> - otherwise trigger error recovery and reset the timer (the
>    error handler will take care of completing/delaying it).
> 
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>   drivers/nvme/host/rdma.c | 67 +++++++++++++++++++++++++++++-----------
>   1 file changed, 49 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 44c76ffbb264..a58c6deaf691 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -1180,6 +1180,7 @@ static void nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
>   	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
>   		return;
>   
> +	dev_warn(ctrl->ctrl.device, "starting error recovery\n");
>   	queue_work(nvme_reset_wq, &ctrl->err_work);
>   }
>   
> @@ -1946,6 +1947,22 @@ static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
>   	return 0;
>   }
>   
> +static void nvme_rdma_complete_timed_out(struct request *rq)
> +{
> +	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
> +	struct nvme_rdma_queue *queue = req->queue;
> +	struct nvme_rdma_ctrl *ctrl = queue->ctrl;
> +
> +	/* fence other contexts that may complete the command */
> +	flush_work(&ctrl->err_work);
> +	nvme_rdma_stop_queue(queue);
> +	if (blk_mq_request_completed(rq))
> +		return;
> +	nvme_req(rq)->flags |= NVME_REQ_CANCELLED;
> +	nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
> +	blk_mq_complete_request(rq);


If keep_alive times out, is it possible we try and call
blk_mq_free_request() twice for the same request?

blk_mq_complete_request
  nvme_rdma_complete_rq
   blk_mq_end_request
    __blk_mq_end_request
     rq->end_io(rq, error) - nvme_keep_alive_end_io
      blk_mq_free_request
       __blk_mq_free_request
         rq->mq_hctx = NULL;
.
.
.
return BLK_EH_DONE to blk_mq_rq_timed_out

And then, before returning from blk_mq_check_expired(), we go
back down the chain:

rq->end_io(rq, 0)
  nvme_keep_alive_end_io
   blk_mq_free_request
    atomic_dec(&hctx->nr_active)

since rq->mq_hctx is now NULL, crash in blk_mq_free_request

Thanks,
David



> +}
> +
>   static enum blk_eh_timer_return
>   nvme_rdma_timeout(struct request *rq, bool reserved)
>   {
> @@ -1956,29 +1973,43 @@ nvme_rdma_timeout(struct request *rq, bool reserved)
>   	dev_warn(ctrl->ctrl.device, "I/O %d QID %d timeout\n",
>   		 rq->tag, nvme_rdma_queue_idx(queue));
>   
> -	/*
> -	 * Restart the timer if a controller reset is already scheduled. Any
> -	 * timed out commands would be handled before entering the connecting
> -	 * state.
> -	 */
> -	if (ctrl->ctrl.state == NVME_CTRL_RESETTING)
> +	switch (ctrl->ctrl.state) {
> +	case NVME_CTRL_RESETTING:
> +		if (!nvme_rdma_queue_idx(queue)) {
> +			/*
> +			 * if we are in teardown we must complete immediately
> +			 * because we may block the teardown sequence (e.g.
> +			 * nvme_disable_ctrl timed out).
> +			 */
> +			nvme_rdma_complete_timed_out(rq);
> +			return BLK_EH_DONE;
> +		}
> +		/*
> +		 * Restart the timer if a controller reset is already scheduled.
> +		 * Any timed out commands would be handled before entering the
> +		 * connecting state.
> +		 */
>   		return BLK_EH_RESET_TIMER;
> -
> -	if (ctrl->ctrl.state != NVME_CTRL_LIVE) {
> +	case NVME_CTRL_CONNECTING:
> +		if (reserved || !nvme_rdma_queue_idx(queue)) {
> +			/*
> +			 * if we are connecting we must complete immediately
> +			 * connect (reserved) or admin requests because we may
> +			 * block controller setup sequence.
> +			 */
> +			nvme_rdma_complete_timed_out(rq);
> +			return BLK_EH_DONE;
> +		}
> +		/* fallthru */
> +	default:
>   		/*
> -		 * Teardown immediately if controller times out while starting
> -		 * or we are already started error recovery. all outstanding
> -		 * requests are completed on shutdown, so we return BLK_EH_DONE.
> +		 * every other state should trigger the error recovery
> +		 * which will be handled by the flow and controller state
> +		 * machine
>   		 */
> -		flush_work(&ctrl->err_work);
> -		nvme_rdma_teardown_io_queues(ctrl, false);
> -		nvme_rdma_teardown_admin_queue(ctrl, false);
> -		return BLK_EH_DONE;
> +		nvme_rdma_error_recovery(ctrl);
>   	}
>   
> -	dev_warn(ctrl->ctrl.device, "starting error recovery\n");
> -	nvme_rdma_error_recovery(ctrl);
> -
>   	return BLK_EH_RESET_TIMER;
>   }
>   
> 


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/6] nvme-rdma: fix timeout handler
  2020-08-06 19:52   ` David Milburn
@ 2020-08-06 20:11     ` Sagi Grimberg
  0 siblings, 0 replies; 21+ messages in thread
From: Sagi Grimberg @ 2020-08-06 20:11 UTC (permalink / raw)
  To: David Milburn, linux-nvme, Christoph Hellwig, Keith Busch, James Smart


>> @@ -1946,6 +1947,22 @@ static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
>>       return 0;
>>   }
>> +static void nvme_rdma_complete_timed_out(struct request *rq)
>> +{
>> +    struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
>> +    struct nvme_rdma_queue *queue = req->queue;
>> +    struct nvme_rdma_ctrl *ctrl = queue->ctrl;
>> +
>> +    /* fence other contexts that may complete the command */
>> +    flush_work(&ctrl->err_work);
>> +    nvme_rdma_stop_queue(queue);
>> +    if (blk_mq_request_completed(rq))
>> +        return;
>> +    nvme_req(rq)->flags |= NVME_REQ_CANCELLED;
>> +    nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
>> +    blk_mq_complete_request(rq);
> 
> 
> If keep_alive times out, is it possible we try and call
> blk_mq_free_request() twice for the same request?
> 
> blk_mq_complete_request
>   nvme_rdma_complete_rq
>    blk_mq_end_request
>     __blk_mq_end_request
>      rq->end_io(rq, error) - nvme_keep_alive_end_io
>       blk_mq_free_request
>        __blk_mq_free_request
>          rq->mq_hctx = NULL;
> .
> .
> .
> return BLK_EH_DONE to blk_mq_rq_timed_out
> 
> And then, before returning from blk_mq_check_expired(), we go
> back down the chain:
> 
> rq->end_io(rq, 0)
>   nvme_keep_alive_end_io
>    blk_mq_free_request
>     atomic_dec(&hctx->nr_active)
> 
> since rq->mq_hctx is now NULL, crash in blk_mq_free_request

But the keep alive request is not a flush request...

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2020-08-06 20:11 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-03  6:58 [PATCH 0/6] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
2020-08-03  6:58 ` [PATCH 1/6] nvme-fabrics: allow to queue requests for live queues Sagi Grimberg
2020-08-03  6:58 ` [PATCH 2/6] nvme: have nvme_wait_freeze_timeout return if it timed out Sagi Grimberg
2020-08-03  6:58 ` [PATCH 3/6] nvme-tcp: fix timeout handler Sagi Grimberg
2020-08-03  6:58 ` [PATCH 4/6] nvme-tcp: fix reset hang if controller died in the middle of a reset Sagi Grimberg
2020-08-03  6:58 ` [PATCH 5/6] nvme-rdma: fix timeout handler Sagi Grimberg
2020-08-03 10:25   ` Chao Leng
2020-08-03 15:03     ` Sagi Grimberg
2020-08-04  1:49       ` Chao Leng
2020-08-04 15:36         ` Sagi Grimberg
2020-08-05  1:07           ` Chao Leng
2020-08-05  1:12             ` Sagi Grimberg
2020-08-05  6:27               ` Chao Leng
2020-08-05  7:00                 ` Sagi Grimberg
2020-08-05  7:14                   ` Chao Leng
2020-08-05  7:19                     ` Sagi Grimberg
2020-08-05  7:35                       ` Chao Leng
2020-08-05  8:17                         ` Sagi Grimberg
2020-08-06 19:52   ` David Milburn
2020-08-06 20:11     ` Sagi Grimberg
2020-08-03  6:58 ` [PATCH 6/6] nvme-rdma: fix reset hang if controller died in the middle of a reset Sagi Grimberg
