linux-nvme.lists.infradead.org archive mirror
* [PATCH v2 0/8] fix possible controller reset hangs in nvme-tcp/nvme-rdma
@ 2020-08-06 19:11 Sagi Grimberg
  2020-08-06 19:11 ` [PATCH v2 1/8] nvme-fabrics: allow to queue requests for live queues Sagi Grimberg
                   ` (8 more replies)
  0 siblings, 9 replies; 30+ messages in thread
From: Sagi Grimberg @ 2020-08-06 19:11 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, James Smart

When a controller reset runs during I/O we may hang if the controller
suddenly becomes unresponsive during the reset and/or the reconnection
stages. This is because the timeout handler did not fail inflight
commands properly, and also could not abort the controller reset
sequence when the controller became unresponsive (hence we could never
recover, even if the controller later became responsive again).

This set fixes nvme-tcp and nvme-rdma for exactly the same scenarios.

Patch 1 stops commands from being rejected on a live queue; such
rejections caused commands to be mistakenly requeued forever while we
are either resetting or connecting to a controller.

Patch 2 lets consumers know whether the freeze completed or the
timeout elapsed; this is used in patches 5 and 8, roughly as sketched
below.
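
A condensed sketch of how the return value gets consumed (this mirrors
the nvme-tcp hunk in patch 5; the nvme-rdma hunk in patch 8 is the same
modulo the ctrl dereferences):

	if (!new) {
		nvme_start_queues(ctrl);
		if (!nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT)) {
			/* freeze did not complete in time, bail out */
			ret = -ENODEV;
			goto out_wait_freeze_timed_out;
		}
		blk_mq_update_nr_hw_queues(ctrl->tagset,
			ctrl->queue_count - 1);
		nvme_unfreeze(ctrl);
	}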

Patches 3,4,6,7 fix the timeout handlers in nvme-tcp and nvme-rdma
respectively, so that they correctly and safely fail requests that are
part of a serial (blocking) initialization or teardown sequence.

Patches 5,8 address the case when a controller stops responding when
we are in the middle of a connection establishment stage (tcp and rdma).

James, please have a look as well to see what needs to be addressed
for fc.

Changes from v1:
- added patches 3,6 to protect against possible (but rare) double
  completions for timed out requests.

Sagi Grimberg (8):
  nvme-fabrics: allow to queue requests for live queues
  nvme: have nvme_wait_freeze_timeout return if it timed out
  nvme-tcp: serialize controller teardown double completion
  nvme-tcp: fix timeout handler
  nvme-tcp: fix reset hang if controller died in the middle of a reset
  nvme-rdma: serialize controller teardown sequences
  nvme-rdma: fix timeout handler
  nvme-rdma: fix reset hang if controller died in the middle of a reset

 drivers/nvme/host/core.c    |  3 +-
 drivers/nvme/host/fabrics.c | 13 ++++--
 drivers/nvme/host/nvme.h    |  2 +-
 drivers/nvme/host/rdma.c    | 85 +++++++++++++++++++++++++--------
 drivers/nvme/host/tcp.c     | 93 ++++++++++++++++++++++++++++---------
 5 files changed, 147 insertions(+), 49 deletions(-)

-- 
2.25.1



* [PATCH v2 1/8] nvme-fabrics: allow to queue requests for live queues
  2020-08-06 19:11 [PATCH v2 0/8] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
@ 2020-08-06 19:11 ` Sagi Grimberg
  2020-08-14  6:44   ` Christoph Hellwig
  2020-08-06 19:11 ` [PATCH v2 2/8] nvme: have nvme_wait_freeze_timeout return if it timed out Sagi Grimberg
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 30+ messages in thread
From: Sagi Grimberg @ 2020-08-06 19:11 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, James Smart

Right now we are failing requests based on the controller
state (which is checked inline in nvmf_check_ready) however
we should definitely accept requests if the queue is live.

When entering controller reset, we transition the controller
into NVME_CTRL_RESETTING, and then return BLK_STS_RESOURCE for
non-mpath requests (have blk_noretry_request set).

This is also the case for NVME_REQ_USERCMD for the wrong reason.
There shouldn't be any reason for us to reject this I/O in a
controller reset. We do want to prevent passthru commands on
the admin queue because we need the controller to fully initialize
first before we let user passthru admin commands to be issued.

In a non-mpath setup, this means that the requests will simply
be requeued over and over forever not allowing the q_usage_counter
to drop its final reference, causing controller reset to hang
if running concurrently with heavy I/O.

While we are at it, remove the redundant NVME_CTRL_NEW case, which
should never see any I/O as it must first transition to
NVME_CTRL_CONNECTING.

Fixes: 35897b920c8a ("nvme-fabrics: fix and refine state checks in __nvmf_check_ready")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/fabrics.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 4ec4829d6233..8575724734e0 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -565,10 +565,14 @@ bool __nvmf_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
 	struct nvme_request *req = nvme_req(rq);
 
 	/*
-	 * If we are in some state of setup or teardown only allow
-	 * internally generated commands.
+	 * currently we have a problem sending passthru commands
+	 * on the admin_q if the controller is not LIVE because we can't
+	 * make sure that they are going out after the admin connect,
+	 * controller enable and/or other commands in the initialization
+	 * sequence. until the controller will be LIVE, fail with
+	 * BLK_STS_RESOURCE so that they will be rescheduled.
 	 */
-	if (!blk_rq_is_passthrough(rq) || (req->flags & NVME_REQ_USERCMD))
+	if (rq->q == ctrl->admin_q && (req->flags & NVME_REQ_USERCMD))
 		return false;
 
 	/*
@@ -576,9 +580,8 @@ bool __nvmf_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
 	 * which is require to set the queue live in the appropinquate states.
 	 */
 	switch (ctrl->state) {
-	case NVME_CTRL_NEW:
 	case NVME_CTRL_CONNECTING:
-		if (nvme_is_fabrics(req->cmd) &&
+		if (blk_rq_is_passthrough(rq) && nvme_is_fabrics(req->cmd) &&
 		    req->cmd->fabrics.fctype == nvme_fabrics_type_connect)
 			return true;
 		break;
-- 
2.25.1



* [PATCH v2 2/8] nvme: have nvme_wait_freeze_timeout return if it timed out
  2020-08-06 19:11 [PATCH v2 0/8] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
  2020-08-06 19:11 ` [PATCH v2 1/8] nvme-fabrics: allow to queue requests for live queues Sagi Grimberg
@ 2020-08-06 19:11 ` Sagi Grimberg
  2020-08-14  6:45   ` Christoph Hellwig
  2020-08-06 19:11 ` [PATCH v2 3/8] nvme-tcp: serialize controller teardown double completion Sagi Grimberg
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 30+ messages in thread
From: Sagi Grimberg @ 2020-08-06 19:11 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, James Smart

users can detect if the wait has completed or not

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/core.c | 3 ++-
 drivers/nvme/host/nvme.h | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index e8cc728dee46..fd99ec0118a3 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4519,7 +4519,7 @@ void nvme_unfreeze(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_unfreeze);
 
-void nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout)
+int nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout)
 {
 	struct nvme_ns *ns;
 
@@ -4530,6 +4530,7 @@ void nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout)
 			break;
 	}
 	up_read(&ctrl->namespaces_rwsem);
+	return timeout;
 }
 EXPORT_SYMBOL_GPL(nvme_wait_freeze_timeout);
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index c5c1bac797aa..e82744de3cd3 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -574,7 +574,7 @@ void nvme_kill_queues(struct nvme_ctrl *ctrl);
 void nvme_sync_queues(struct nvme_ctrl *ctrl);
 void nvme_unfreeze(struct nvme_ctrl *ctrl);
 void nvme_wait_freeze(struct nvme_ctrl *ctrl);
-void nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout);
+int nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout);
 void nvme_start_freeze(struct nvme_ctrl *ctrl);
 
 #define NVME_QID_ANY -1
-- 
2.25.1



* [PATCH v2 3/8] nvme-tcp: serialize controller teardown double completion
  2020-08-06 19:11 [PATCH v2 0/8] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
  2020-08-06 19:11 ` [PATCH v2 1/8] nvme-fabrics: allow to queue requests for live queues Sagi Grimberg
  2020-08-06 19:11 ` [PATCH v2 2/8] nvme: have nvme_wait_freeze_timeout return if it timed out Sagi Grimberg
@ 2020-08-06 19:11 ` Sagi Grimberg
  2020-08-06 19:11 ` [PATCH v2 4/8] nvme-tcp: fix timeout handler Sagi Grimberg
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 30+ messages in thread
From: Sagi Grimberg @ 2020-08-06 19:11 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, James Smart

In the timeout handler we may need to complete a request because the
request that timed out may be an I/O that is a part of a serial sequence
of controller teardown or initialization. In order to complete the
request, we need to fence any other context that may compete with us
and complete the request that is timing out.

In this case, we could have a potential double completion in case
a hard-irq or a different competing context triggered error recovery
and is running inflight request cancellation concurrently with the
timeout handler.

Protect using a ctrl teardown_lock to serialize contexts that may
complete a cancelled request due to error recovery or a reset.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/tcp.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 62fbaecdc960..80af1b287cd0 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -124,6 +124,7 @@ struct nvme_tcp_ctrl {
 	struct sockaddr_storage src_addr;
 	struct nvme_ctrl	ctrl;
 
+	struct mutex		teardown_lock;
 	struct work_struct	err_work;
 	struct delayed_work	connect_work;
 	struct nvme_tcp_request async_req;
@@ -1527,7 +1528,6 @@ static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
 
 	if (!test_and_clear_bit(NVME_TCP_Q_LIVE, &queue->flags))
 		return;
-
 	__nvme_tcp_stop_queue(queue);
 }
 
@@ -1875,6 +1875,7 @@ static int nvme_tcp_configure_admin_queue(struct nvme_ctrl *ctrl, bool new)
 static void nvme_tcp_teardown_admin_queue(struct nvme_ctrl *ctrl,
 		bool remove)
 {
+	mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
 	blk_mq_quiesce_queue(ctrl->admin_q);
 	nvme_tcp_stop_queue(ctrl, 0);
 	if (ctrl->admin_tagset) {
@@ -1885,13 +1886,16 @@ static void nvme_tcp_teardown_admin_queue(struct nvme_ctrl *ctrl,
 	if (remove)
 		blk_mq_unquiesce_queue(ctrl->admin_q);
 	nvme_tcp_destroy_admin_queue(ctrl, remove);
+	mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
 }
 
 static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
 		bool remove)
 {
+	mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
 	if (ctrl->queue_count <= 1)
-		return;
+		goto out;
+	blk_mq_quiesce_queue(ctrl->admin_q);
 	nvme_start_freeze(ctrl);
 	nvme_stop_queues(ctrl);
 	nvme_tcp_stop_io_queues(ctrl);
@@ -1903,6 +1907,8 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
 	if (remove)
 		nvme_start_queues(ctrl);
 	nvme_tcp_destroy_io_queues(ctrl, remove);
+out:
+	mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
 }
 
 static void nvme_tcp_reconnect_or_remove(struct nvme_ctrl *ctrl)
@@ -2423,6 +2429,7 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
 			nvme_tcp_reconnect_ctrl_work);
 	INIT_WORK(&ctrl->err_work, nvme_tcp_error_recovery_work);
 	INIT_WORK(&ctrl->ctrl.reset_work, nvme_reset_ctrl_work);
+	mutex_init(&ctrl->teardown_lock);
 
 	if (!(opts->mask & NVMF_OPT_TRSVCID)) {
 		opts->trsvcid =
-- 
2.25.1



* [PATCH v2 4/8] nvme-tcp: fix timeout handler
  2020-08-06 19:11 [PATCH v2 0/8] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
                   ` (2 preceding siblings ...)
  2020-08-06 19:11 ` [PATCH v2 3/8] nvme-tcp: serialize controller teardown double completion Sagi Grimberg
@ 2020-08-06 19:11 ` Sagi Grimberg
  2020-08-06 19:11 ` [PATCH v2 5/8] nvme-tcp: fix reset hang if controller died in the middle of a reset Sagi Grimberg
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 30+ messages in thread
From: Sagi Grimberg @ 2020-08-06 19:11 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, James Smart

Currently we check if the controller state != LIVE, and
we directly fail the command under the assumption that this
is the connect command or an admin command within the
controller initialization sequence.

This is wrong: we need to check whether the request, if not
completed, risks blocking the controller setup/teardown sequence, and
only then fail it.

Moreover, we tear down the entire controller in the timeout
handler, which does more than we really want and causes us to freeze
the queues again. We only need to fence the I/O queue and the error
recovery teardown.

The logic should be:
- RESETTING: fail only fabrics/admin commands, otherwise the
  controller teardown will block; for anything else, reset the timer
  and come back again.
- CONNECTING: if this is a connect (or an admin command), fail it
  right away to unblock controller initialization; otherwise treat
  it like anything else.
- otherwise: trigger error recovery and reset the timer (the
  error handler will take care of completing/delaying it).

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/tcp.c | 71 +++++++++++++++++++++++++++++------------
 1 file changed, 51 insertions(+), 20 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 80af1b287cd0..d472fb5e79fc 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -465,6 +465,7 @@ static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
 	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
 		return;
 
+	dev_warn(ctrl->device, "starting error recovery\n");
 	queue_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work);
 }
 
@@ -2155,40 +2156,70 @@ static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
 	nvme_tcp_queue_request(&ctrl->async_req, true, true);
 }
 
+static void nvme_tcp_complete_timed_out(struct request *rq)
+{
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+	struct nvme_ctrl *ctrl = &req->queue->ctrl->ctrl;
+
+	/* fence other contexts that may complete the command */
+	mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
+	nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
+	if (blk_mq_request_completed(rq))
+		goto out;
+	nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
+	blk_mq_complete_request(rq);
+out:
+	mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
+}
+
 static enum blk_eh_timer_return
 nvme_tcp_timeout(struct request *rq, bool reserved)
 {
 	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
-	struct nvme_tcp_ctrl *ctrl = req->queue->ctrl;
+	struct nvme_ctrl *ctrl = &req->queue->ctrl->ctrl;
 	struct nvme_tcp_cmd_pdu *pdu = req->pdu;
 
-	/*
-	 * Restart the timer if a controller reset is already scheduled. Any
-	 * timed out commands would be handled before entering the connecting
-	 * state.
-	 */
-	if (ctrl->ctrl.state == NVME_CTRL_RESETTING)
-		return BLK_EH_RESET_TIMER;
-
-	dev_warn(ctrl->ctrl.device,
+	dev_warn(ctrl->device,
 		"queue %d: timeout request %#x type %d\n",
 		nvme_tcp_queue_id(req->queue), rq->tag, pdu->hdr.type);
 
-	if (ctrl->ctrl.state != NVME_CTRL_LIVE) {
+	switch (ctrl->state) {
+	case NVME_CTRL_RESETTING:
+		if (!nvme_tcp_queue_id(req->queue)) {
+			/*
+			 * if we are in teardown we must complete immediately
+			 * because we may block the teardown sequence (e.g.
+			 * nvme_disable_ctrl timed out).
+			 */
+			nvme_tcp_complete_timed_out(rq);
+			return BLK_EH_DONE;
+		}
 		/*
-		 * Teardown immediately if controller times out while starting
-		 * or we are already started error recovery. all outstanding
-		 * requests are completed on shutdown, so we return BLK_EH_DONE.
+		 * Restart the timer if a controller reset is already scheduled.
+		 * Any timed out commands would be handled before entering the
+		 * connecting state.
 		 */
-		flush_work(&ctrl->err_work);
-		nvme_tcp_teardown_io_queues(&ctrl->ctrl, false);
-		nvme_tcp_teardown_admin_queue(&ctrl->ctrl, false);
+		return BLK_EH_RESET_TIMER;
+	case NVME_CTRL_CONNECTING:
+		/*
+		 * if we are connecting we must complete immediately
+		 * because we may block controller setup sequence
+		 * - connect requests
+		 * - initialization admin requests
+		 * - I/O requests that entered when unquiesced and
+		 *   the controller stopped responding
+		 */
+		nvme_tcp_complete_timed_out(rq);
 		return BLK_EH_DONE;
+	default:
+		/*
+		 * every other state should trigger the error recovery
+		 * which will be handled by the flow and controller state
+		 * machine
+		 */
+		nvme_tcp_error_recovery(ctrl);
 	}
 
-	dev_warn(ctrl->ctrl.device, "starting error recovery\n");
-	nvme_tcp_error_recovery(&ctrl->ctrl);
-
 	return BLK_EH_RESET_TIMER;
 }
 
-- 
2.25.1



* [PATCH v2 5/8] nvme-tcp: fix reset hang if controller died in the middle of a reset
  2020-08-06 19:11 [PATCH v2 0/8] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
                   ` (3 preceding siblings ...)
  2020-08-06 19:11 ` [PATCH v2 4/8] nvme-tcp: fix timeout handler Sagi Grimberg
@ 2020-08-06 19:11 ` Sagi Grimberg
  2020-08-06 19:11 ` [PATCH v2 6/8] nvme-rdma: serialize controller teardown sequences Sagi Grimberg
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 30+ messages in thread
From: Sagi Grimberg @ 2020-08-06 19:11 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, James Smart

If the controller becomes unresponsive in the middle of a reset, we
will hang because we are waiting for the freeze to complete, but that
cannot happen since there are inflight commands holding the
q_usage_counter, and we can't blindly fail requests that time out.

So give the freeze wait a timeout, and if it does not complete in time,
fail and have the error handling take care of how to proceed (either
schedule a reconnect or remove the controller).

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/tcp.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index d472fb5e79fc..489a3350b9d3 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1783,7 +1783,13 @@ static int nvme_tcp_configure_io_queues(struct nvme_ctrl *ctrl, bool new)
 
 	if (!new) {
 		nvme_start_queues(ctrl);
-		nvme_wait_freeze(ctrl);
+		if (!nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT)) {
+			/* if we timed out waiting for freeze we are
+			 * likely stuck, fail just to be safe
+			 */
+			ret = -ENODEV;
+			goto out_wait_freeze_timed_out;
+		}
 		blk_mq_update_nr_hw_queues(ctrl->tagset,
 			ctrl->queue_count - 1);
 		nvme_unfreeze(ctrl);
@@ -1791,6 +1797,9 @@ static int nvme_tcp_configure_io_queues(struct nvme_ctrl *ctrl, bool new)
 
 	return 0;
 
+out_wait_freeze_timed_out:
+	nvme_stop_queues(ctrl);
+	nvme_tcp_stop_io_queues(ctrl);
 out_cleanup_connect_q:
 	if (new)
 		blk_cleanup_queue(ctrl->connect_q);
-- 
2.25.1



* [PATCH v2 6/8] nvme-rdma: serialize controller teardown sequences
  2020-08-06 19:11 [PATCH v2 0/8] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
                   ` (4 preceding siblings ...)
  2020-08-06 19:11 ` [PATCH v2 5/8] nvme-tcp: fix reset hang if controller died in the middle of a reset Sagi Grimberg
@ 2020-08-06 19:11 ` Sagi Grimberg
  2020-08-14  6:45   ` Christoph Hellwig
  2020-08-14 21:12   ` James Smart
  2020-08-06 19:11 ` [PATCH v2 7/8] nvme-rdma: fix timeout handler Sagi Grimberg
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 30+ messages in thread
From: Sagi Grimberg @ 2020-08-06 19:11 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, James Smart

In the timeout handler we may need to complete a request because the
request that timed out may be an I/O that is a part of a serial sequence
of controller teardown or initialization. In order to complete the
request, we need to fence any other context that may compete with us
and complete the request that is timing out.

In this case, we could have a potential double completion in case
a hard-irq or a different competing context triggered error recovery
and is running inflight request cancellation concurrently with the
timeout handler.

Protect using a ctrl teardown_lock to serialize contexts that may
complete a cancelled request due to error recovery or a reset.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/rdma.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 44c76ffbb264..abc318737f35 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -122,6 +122,7 @@ struct nvme_rdma_ctrl {
 	struct sockaddr_storage src_addr;
 
 	struct nvme_ctrl	ctrl;
+	struct mutex		teardown_lock;
 	bool			use_inline_data;
 	u32			io_queues[HCTX_MAX_TYPES];
 };
@@ -997,6 +998,7 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
 static void nvme_rdma_teardown_admin_queue(struct nvme_rdma_ctrl *ctrl,
 		bool remove)
 {
+	mutex_lock(&ctrl->teardown_lock);
 	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
 	nvme_rdma_stop_queue(&ctrl->queues[0]);
 	if (ctrl->ctrl.admin_tagset) {
@@ -1007,11 +1009,13 @@ static void nvme_rdma_teardown_admin_queue(struct nvme_rdma_ctrl *ctrl,
 	if (remove)
 		blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
 	nvme_rdma_destroy_admin_queue(ctrl, remove);
+	mutex_unlock(&ctrl->teardown_lock);
 }
 
 static void nvme_rdma_teardown_io_queues(struct nvme_rdma_ctrl *ctrl,
 		bool remove)
 {
+	mutex_lock(&ctrl->teardown_lock);
 	if (ctrl->ctrl.queue_count > 1) {
 		nvme_start_freeze(&ctrl->ctrl);
 		nvme_stop_queues(&ctrl->ctrl);
@@ -1025,6 +1029,7 @@ static void nvme_rdma_teardown_io_queues(struct nvme_rdma_ctrl *ctrl,
 			nvme_start_queues(&ctrl->ctrl);
 		nvme_rdma_destroy_io_queues(ctrl, remove);
 	}
+	mutex_unlock(&ctrl->teardown_lock);
 }
 
 static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl)
@@ -2278,6 +2283,7 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
 		return ERR_PTR(-ENOMEM);
 	ctrl->ctrl.opts = opts;
 	INIT_LIST_HEAD(&ctrl->list);
+	mutex_init(&ctrl->teardown_lock);
 
 	if (!(opts->mask & NVMF_OPT_TRSVCID)) {
 		opts->trsvcid =
-- 
2.25.1



* [PATCH v2 7/8] nvme-rdma: fix timeout handler
  2020-08-06 19:11 [PATCH v2 0/8] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
                   ` (5 preceding siblings ...)
  2020-08-06 19:11 ` [PATCH v2 6/8] nvme-rdma: serialize controller teardown sequences Sagi Grimberg
@ 2020-08-06 19:11 ` Sagi Grimberg
  2020-08-14  6:52   ` Christoph Hellwig
  2020-08-14 23:27   ` James Smart
  2020-08-06 19:11 ` [PATCH v2 8/8] nvme-rdma: fix reset hang if controller died in the middle of a reset Sagi Grimberg
  2020-08-11 22:16 ` [PATCH v2 0/8] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
  8 siblings, 2 replies; 30+ messages in thread
From: Sagi Grimberg @ 2020-08-06 19:11 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, James Smart

Currently we check if the controller state != LIVE, and
we directly fail the command under the assumption that this
is the connect command or an admin command within the
controller initialization sequence.

This is wrong: we need to check whether the request, if not
completed, risks blocking the controller setup/teardown sequence, and
only then fail it.

The logic should be:
- RESETTING: fail only fabrics/admin commands, otherwise the
  controller teardown will block; for anything else, reset the timer
  and come back again.
- CONNECTING: if this is a connect (or an admin command), fail it
  right away to unblock controller initialization; otherwise treat
  it like anything else.
- otherwise: trigger error recovery and reset the timer (the
  error handler will take care of completing/delaying it).

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/rdma.c | 68 +++++++++++++++++++++++++++++-----------
 1 file changed, 50 insertions(+), 18 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index abc318737f35..30b401fcc06a 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1185,6 +1185,7 @@ static void nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
 	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
 		return;
 
+	dev_warn(ctrl->ctrl.device, "starting error recovery\n");
 	queue_work(nvme_reset_wq, &ctrl->err_work);
 }
 
@@ -1951,6 +1952,23 @@ static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
 	return 0;
 }
 
+static void nvme_rdma_complete_timed_out(struct request *rq)
+{
+	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
+	struct nvme_rdma_queue *queue = req->queue;
+	struct nvme_rdma_ctrl *ctrl = queue->ctrl;
+
+	/* fence other contexts that may complete the command */
+	mutex_lock(&ctrl->teardown_lock);
+	nvme_rdma_stop_queue(queue);
+	if (blk_mq_request_completed(rq))
+		goto out;
+	nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
+	blk_mq_complete_request(rq);
+out:
+	mutex_unlock(&ctrl->teardown_lock);
+}
+
 static enum blk_eh_timer_return
 nvme_rdma_timeout(struct request *rq, bool reserved)
 {
@@ -1961,29 +1979,43 @@ nvme_rdma_timeout(struct request *rq, bool reserved)
 	dev_warn(ctrl->ctrl.device, "I/O %d QID %d timeout\n",
 		 rq->tag, nvme_rdma_queue_idx(queue));
 
-	/*
-	 * Restart the timer if a controller reset is already scheduled. Any
-	 * timed out commands would be handled before entering the connecting
-	 * state.
-	 */
-	if (ctrl->ctrl.state == NVME_CTRL_RESETTING)
+	switch (ctrl->ctrl.state) {
+	case NVME_CTRL_RESETTING:
+		if (!nvme_rdma_queue_idx(queue)) {
+			/*
+			 * if we are in teardown we must complete immediately
+			 * because we may block the teardown sequence (e.g.
+			 * nvme_disable_ctrl timed out).
+			 */
+			nvme_rdma_complete_timed_out(rq);
+			return BLK_EH_DONE;
+		}
+		/*
+		 * Restart the timer if a controller reset is already scheduled.
+		 * Any timed out commands would be handled before entering the
+		 * connecting state.
+		 */
 		return BLK_EH_RESET_TIMER;
-
-	if (ctrl->ctrl.state != NVME_CTRL_LIVE) {
+	case NVME_CTRL_CONNECTING:
+		if (reserved || !nvme_rdma_queue_idx(queue)) {
+			/*
+			 * if we are connecting we must complete immediately
+			 * connect (reserved) or admin requests because we may
+			 * block controller setup sequence.
+			 */
+			nvme_rdma_complete_timed_out(rq);
+			return BLK_EH_DONE;
+		}
+		/* fallthru */
+	default:
 		/*
-		 * Teardown immediately if controller times out while starting
-		 * or we are already started error recovery. all outstanding
-		 * requests are completed on shutdown, so we return BLK_EH_DONE.
+		 * every other state should trigger the error recovery
+		 * which will be handled by the flow and controller state
+		 * machine
 		 */
-		flush_work(&ctrl->err_work);
-		nvme_rdma_teardown_io_queues(ctrl, false);
-		nvme_rdma_teardown_admin_queue(ctrl, false);
-		return BLK_EH_DONE;
+		nvme_rdma_error_recovery(ctrl);
 	}
 
-	dev_warn(ctrl->ctrl.device, "starting error recovery\n");
-	nvme_rdma_error_recovery(ctrl);
-
 	return BLK_EH_RESET_TIMER;
 }
 
-- 
2.25.1



* [PATCH v2 8/8] nvme-rdma: fix reset hang if controller died in the middle of a reset
  2020-08-06 19:11 [PATCH v2 0/8] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
                   ` (6 preceding siblings ...)
  2020-08-06 19:11 ` [PATCH v2 7/8] nvme-rdma: fix timeout handler Sagi Grimberg
@ 2020-08-06 19:11 ` Sagi Grimberg
  2020-08-14  6:53   ` Christoph Hellwig
  2020-08-11 22:16 ` [PATCH v2 0/8] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
  8 siblings, 1 reply; 30+ messages in thread
From: Sagi Grimberg @ 2020-08-06 19:11 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, James Smart

If the controller becomes unresponsive in the middle of a reset, we
will hang because we are waiting for the freeze to complete, but that
cannot happen since there are inflight commands holding the
q_usage_counter, and we can't blindly fail requests that time out.

So give the freeze wait a timeout, and if it does not complete in time,
fail and have the error handling take care of how to proceed (either
schedule a reconnect or remove the controller).

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/rdma.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 30b401fcc06a..4ca53b864636 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -976,7 +976,13 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
 
 	if (!new) {
 		nvme_start_queues(&ctrl->ctrl);
-		nvme_wait_freeze(&ctrl->ctrl);
+		if (!nvme_wait_freeze_timeout(&ctrl->ctrl, NVME_IO_TIMEOUT)) {
+			/* if we timed out waiting for freeze we are
+			 * likely stuck, fail just to be safe
+			 */
+			ret = -ENODEV;
+			goto out_wait_freeze_timed_out;
+		}
 		blk_mq_update_nr_hw_queues(ctrl->ctrl.tagset,
 			ctrl->ctrl.queue_count - 1);
 		nvme_unfreeze(&ctrl->ctrl);
@@ -984,6 +990,9 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
 
 	return 0;
 
+out_wait_freeze_timed_out:
+	nvme_stop_queues(&ctrl->ctrl);
+	nvme_rdma_stop_io_queues(ctrl);
 out_cleanup_connect_q:
 	if (new)
 		blk_cleanup_queue(ctrl->ctrl.connect_q);
-- 
2.25.1



* Re: [PATCH v2 0/8] fix possible controller reset hangs in nvme-tcp/nvme-rdma
  2020-08-06 19:11 [PATCH v2 0/8] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
                   ` (7 preceding siblings ...)
  2020-08-06 19:11 ` [PATCH v2 8/8] nvme-rdma: fix reset hang if controller died in the middle of a reset Sagi Grimberg
@ 2020-08-11 22:16 ` Sagi Grimberg
  2020-08-13 15:39   ` Christoph Hellwig
  8 siblings, 1 reply; 30+ messages in thread
From: Sagi Grimberg @ 2020-08-11 22:16 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, James Smart

Ping...


* Re: [PATCH v2 0/8] fix possible controller reset hangs in nvme-tcp/nvme-rdma
  2020-08-11 22:16 ` [PATCH v2 0/8] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
@ 2020-08-13 15:39   ` Christoph Hellwig
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2020-08-13 15:39 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, Christoph Hellwig, linux-nvme, James Smart

On Tue, Aug 11, 2020 at 03:16:00PM -0700, Sagi Grimberg wrote:
> Ping...

I'll try to get to it, just way overloaded after returning from the
vacation (who would have guessed..)


* Re: [PATCH v2 1/8] nvme-fabrics: allow to queue requests for live queues
  2020-08-06 19:11 ` [PATCH v2 1/8] nvme-fabrics: allow to queue requests for live queues Sagi Grimberg
@ 2020-08-14  6:44   ` Christoph Hellwig
  2020-08-14  7:08     ` Sagi Grimberg
  0 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2020-08-14  6:44 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, Christoph Hellwig, linux-nvme, James Smart

On Thu, Aug 06, 2020 at 12:11:20PM -0700, Sagi Grimberg wrote:
> Right now we are failing requests based on the controller
> state (which is checked inline in nvmf_check_ready) however
> we should definitely accept requests if the queue is live.
> 
> When entering controller reset, we transition the controller
> into NVME_CTRL_RESETTING, and then return BLK_STS_RESOURCE for
> non-mpath requests (have blk_noretry_request set).
> 
> This is also the case for NVME_REQ_USERCMD for the wrong reason.
> There shouldn't be any reason for us to reject this I/O in a
> controller reset. We do want to prevent passthru commands on
> the admin queue because we need the controller to fully initialize
> first before we let user passthru admin commands to be issued.
> 
> In a non-mpath setup, this means that the requests will simply
> be requeued over and over forever not allowing the q_usage_counter
> to drop its final reference, causing controller reset to hang
> if running concurrently with heavy I/O.

Which will still happen with the admin queue user passthrough
commands with this patch, so I don't think it actually solves anything,
it just reduces the exposure a bit.

> 
> While we are at it, remove the redundant NVME_CTRL_NEW case, which
> should never see any I/O as it must first transition to
> NVME_CTRL_CONNECTING.

That probably should be a separate patch.

> -		if (nvme_is_fabrics(req->cmd) &&
> +		if (blk_rq_is_passthrough(rq) && nvme_is_fabrics(req->cmd) &&

And this (make sure we don't access garbage in ->cmd for non-passthrough)
should probably be a separate fix as well.


* Re: [PATCH v2 2/8] nvme: have nvme_wait_freeze_timeout return if it timed out
  2020-08-06 19:11 ` [PATCH v2 2/8] nvme: have nvme_wait_freeze_timeout return if it timed out Sagi Grimberg
@ 2020-08-14  6:45   ` Christoph Hellwig
  2020-08-14  7:09     ` Sagi Grimberg
  0 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2020-08-14  6:45 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, Christoph Hellwig, linux-nvme, James Smart

On Thu, Aug 06, 2020 at 12:11:21PM -0700, Sagi Grimberg wrote:
> users can detect if the wait has completed or not

Please capitalize the first word in a sentence and end it with a
punctuation.


* Re: [PATCH v2 6/8] nvme-rdma: serialize controller teardown sequences
  2020-08-06 19:11 ` [PATCH v2 6/8] nvme-rdma: serialize controller teardown sequences Sagi Grimberg
@ 2020-08-14  6:45   ` Christoph Hellwig
  2020-08-14 21:12   ` James Smart
  1 sibling, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2020-08-14  6:45 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, Christoph Hellwig, linux-nvme, James Smart

On Thu, Aug 06, 2020 at 12:11:25PM -0700, Sagi Grimberg wrote:
> In the timeout handler we may need to complete a request because the
> request that timed out may be an I/O that is a part of a serial sequence
> of controller teardown or initialization. In order to complete the
> request, we need to fence any other context that may compete with us
> and complete the request that is timing out.
> 
> In this case, we could have a potential double completion in case
> a hard-irq or a different competing context triggered error recovery
> and is running inflight request cancellation concurrently with the
> timeout handler.
> 
> Protect using a ctrl teardown_lock to serialize contexts that may
> complete a cancelled request due to error recovery or a reset.
> 
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>

Looks sensible:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v2 7/8] nvme-rdma: fix timeout handler
  2020-08-06 19:11 ` [PATCH v2 7/8] nvme-rdma: fix timeout handler Sagi Grimberg
@ 2020-08-14  6:52   ` Christoph Hellwig
  2020-08-14  7:14     ` Sagi Grimberg
  2020-08-14 23:27   ` James Smart
  1 sibling, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2020-08-14  6:52 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, Christoph Hellwig, linux-nvme, James Smart

On Thu, Aug 06, 2020 at 12:11:26PM -0700, Sagi Grimberg wrote:
> Currently we check if the controller state != LIVE, and
> we directly fail the command under the assumption that this
> is the connect command or an admin command within the
> controller initialization sequence.
> 
> This is wrong: we need to check whether the request, if not
> completed, risks blocking the controller setup/teardown sequence, and
> only then fail it.

FYI: you can use up to 73 characters in the commit log..

> +++ b/drivers/nvme/host/rdma.c
> @@ -1185,6 +1185,7 @@ static void nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
>  	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
>  		return;
>  
> +	dev_warn(ctrl->ctrl.device, "starting error recovery\n");

Should this really be a warning?  I'd turn this down to _info.

> +static void nvme_rdma_complete_timed_out(struct request *rq)
> +{
> +	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
> +	struct nvme_rdma_queue *queue = req->queue;
> +	struct nvme_rdma_ctrl *ctrl = queue->ctrl;
> +
> +	/* fence other contexts that may complete the command */
> +	mutex_lock(&ctrl->teardown_lock);
> +	nvme_rdma_stop_queue(queue);
> +	if (blk_mq_request_completed(rq))
> +		goto out;
> +	nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
> +	blk_mq_complete_request(rq);
> +out:

Nit: I'd probably avoid the goto here for a slightly simpler flow.

>  {
> @@ -1961,29 +1979,43 @@ nvme_rdma_timeout(struct request *rq, bool reserved)
>  	dev_warn(ctrl->ctrl.device, "I/O %d QID %d timeout\n",
>  		 rq->tag, nvme_rdma_queue_idx(queue));
>  
> +	switch (ctrl->ctrl.state) {
> +	case NVME_CTRL_RESETTING:
> +		if (!nvme_rdma_queue_idx(queue)) {
> +			/*
> +			 * if we are in teardown we must complete immediately
> +			 * because we may block the teardown sequence (e.g.
> +			 * nvme_disable_ctrl timed out).
> +			 */

Please start the setence with an upper case character.

> +			nvme_rdma_complete_timed_out(rq);
> +			return BLK_EH_DONE;
> +		}
> +		/*
> +		 * Restart the timer if a controller reset is already scheduled.
> +		 * Any timed out commands would be handled before entering the
> +		 * connecting state.
> +		 */
>  		return BLK_EH_RESET_TIMER;
> +	case NVME_CTRL_CONNECTING:
> +		if (reserved || !nvme_rdma_queue_idx(queue)) {
> +			/*
> +			 * if we are connecting we must complete immediately
> +			 * connect (reserved) or admin requests because we may
> +			 * block controller setup sequence.
> +			 */
> +			nvme_rdma_complete_timed_out(rq);
> +			return BLK_EH_DONE;

A goto to share the immediate completion branch would be nice.  I wonder
if we should also do it for the reserved case during shutdown even if
that should never happen and entirely share the code, though:

	switch (ctrl->ctrl.state) {
	case NVME_CTRL_RESETTING:
	case NVME_CTRL_CONNECTING:
		/*
		 * If we are resetting or connecting, we must complete
		 * connect (reserved) or admin requests immediately, because
		 * they may block the controller setup or teardown sequence.
		 */
		if (reserved || !nvme_rdma_queue_idx(queue)) {
			nvme_rdma_complete_timed_out(rq);
			return BLK_EH_DONE;
		}
		break;
	default:
		break;
	}

	nvme_rdma_error_recovery(ctrl);
 	return BLK_EH_RESET_TIMER;
}


* Re: [PATCH v2 8/8] nvme-rdma: fix reset hang if controller died in the middle of a reset
  2020-08-06 19:11 ` [PATCH v2 8/8] nvme-rdma: fix reset hang if controller died in the middle of a reset Sagi Grimberg
@ 2020-08-14  6:53   ` Christoph Hellwig
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2020-08-14  6:53 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, Christoph Hellwig, linux-nvme, James Smart

On Thu, Aug 06, 2020 at 12:11:27PM -0700, Sagi Grimberg wrote:
> If the controller becomes unresponsive in the middle of a reset, we
> will hang because we are waiting for the freeze to complete, but that
> cannot happen since there are inflight commands holding the
> q_usage_counter, and we can't blindly fail requests that time out.
> 
> So give the freeze wait a timeout, and if it does not complete in time,
> fail and have the error handling take care of how to proceed (either
> schedule a reconnect or remove the controller).
> 
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>  drivers/nvme/host/rdma.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 30b401fcc06a..4ca53b864636 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -976,7 +976,13 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
>  
>  	if (!new) {
>  		nvme_start_queues(&ctrl->ctrl);
> -		nvme_wait_freeze(&ctrl->ctrl);
> +		if (!nvme_wait_freeze_timeout(&ctrl->ctrl, NVME_IO_TIMEOUT)) {
> +			/* if we timed out waiting for freeze we are
> +			 * likely stuck, fail just to be safe
> +			 */

			/*
			 * If we timed out waiting for freeze we are likely to
			 * be stuck.  Fail the controller initialization just
			 * to be safe.
			 */


* Re: [PATCH v2 1/8] nvme-fabrics: allow to queue requests for live queues
  2020-08-14  6:44   ` Christoph Hellwig
@ 2020-08-14  7:08     ` Sagi Grimberg
  2020-08-14  7:22       ` Christoph Hellwig
  0 siblings, 1 reply; 30+ messages in thread
From: Sagi Grimberg @ 2020-08-14  7:08 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, James Smart, linux-nvme


>> Right now we are failing requests based on the controller
>> state (which is checked inline in nvmf_check_ready) however
>> we should definitely accept requests if the queue is live.
>>
>> When entering controller reset, we transition the controller
>> into NVME_CTRL_RESETTING, and then return BLK_STS_RESOURCE for
>> non-mpath requests (have blk_noretry_request set).
>>
>> This is also the case for NVME_REQ_USERCMD for the wrong reason.
>> There shouldn't be any reason for us to reject this I/O in a
>> controller reset. We do want to prevent passthru commands on
>> the admin queue because we need the controller to fully initialize
>> first before we let user passthru admin commands to be issued.
>>
>> In a non-mpath setup, this means that the requests will simply
>> be requeued over and over forever not allowing the q_usage_counter
>> to drop its final reference, causing controller reset to hang
>> if running concurrently with heavy I/O.
> 
> Which will still happen with the admin queue user passthrough
> commands with this patch, so I don't think it actually solves anything,
> it just reduces the exposure a bit.

The original version of the patch removed that as well, but James
indicated that it's still needed because we have no way to make sure
the admin (re)connect will be the first request when we unquiesce.

So I kept that one around and will fix it later, and yes, this is a
niche corner case compared to user I/O.

>> While we are at it, remove the redundant NVME_CTRL_NEW case, which
>> should never see any I/O as it must first transition to
>> NVME_CTRL_CONNECTING.
> 
> That probably should be a separate patch.

OK.

>> -		if (nvme_is_fabrics(req->cmd) &&
>> +		if (blk_rq_is_passthrough(rq) && nvme_is_fabrics(req->cmd) &&
> 
> And this (make sure we don't access garbage in ->cmd for non-passthrough)
> should probably be a separate fix as well.

No, the check was in the upper condition and this reference relied on
it, so if I separate this part there is no justification for the
change.


* Re: [PATCH v2 2/8] nvme: have nvme_wait_freeze_timeout return if it timed out
  2020-08-14  6:45   ` Christoph Hellwig
@ 2020-08-14  7:09     ` Sagi Grimberg
  0 siblings, 0 replies; 30+ messages in thread
From: Sagi Grimberg @ 2020-08-14  7:09 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, James Smart, linux-nvme


>> users can detect if the wait has completed or not
> 
> Please capitalize the first word in a sentence and end it with a
> punctuation.

ack.


* Re: [PATCH v2 7/8] nvme-rdma: fix timeout handler
  2020-08-14  6:52   ` Christoph Hellwig
@ 2020-08-14  7:14     ` Sagi Grimberg
  2020-08-14 23:19       ` James Smart
  0 siblings, 1 reply; 30+ messages in thread
From: Sagi Grimberg @ 2020-08-14  7:14 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, James Smart, linux-nvme


>> +++ b/drivers/nvme/host/rdma.c
>> @@ -1185,6 +1185,7 @@ static void nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
>>   	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
>>   		return;
>>   
>> +	dev_warn(ctrl->ctrl.device, "starting error recovery\n");
> 
> Should this really be a warning?  I'd turn this down to _info.

I can do that, but given that this is a fix, I'll do that
as a separate patch.

>> +static void nvme_rdma_complete_timed_out(struct request *rq)
>> +{
>> +	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
>> +	struct nvme_rdma_queue *queue = req->queue;
>> +	struct nvme_rdma_ctrl *ctrl = queue->ctrl;
>> +
>> +	/* fence other contexts that may complete the command */
>> +	mutex_lock(&ctrl->teardown_lock);
>> +	nvme_rdma_stop_queue(queue);
>> +	if (blk_mq_request_completed(rq))
>> +		goto out;
>> +	nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
>> +	blk_mq_complete_request(rq);
>> +out:
> 
> Nit: I'd probably avoid the goto here for a slightly simpler flow.

OK

>> +			nvme_rdma_complete_timed_out(rq);
>> +			return BLK_EH_DONE;
>> +		}
>> +		/*
>> +		 * Restart the timer if a controller reset is already scheduled.
>> +		 * Any timed out commands would be handled before entering the
>> +		 * connecting state.
>> +		 */
>>   		return BLK_EH_RESET_TIMER;
>> +	case NVME_CTRL_CONNECTING:
>> +		if (reserved || !nvme_rdma_queue_idx(queue)) {
>> +			/*
>> +			 * if we are connecting we must complete immediately
>> +			 * connect (reserved) or admin requests because we may
>> +			 * block controller setup sequence.
>> +			 */
>> +			nvme_rdma_complete_timed_out(rq);
>> +			return BLK_EH_DONE;
> 
> A goto to share the immediate completion branch would be nice.  I wonder
> if we should also do it for the reserved case during shutdown even if
> that should never happen and entirely share the code, though:
> 
> 	switch (ctrl->ctrl.state) {
> 	case NVME_CTRL_RESETTING:
> 	case NVME_CTRL_CONNECTING:
> 		/*
> 		 * If we are connecting or connecting, we must complete
> 		 * connect (reserved) or admin requests immediately, because
> 		 * they may block the controller setup or teardown sequence.
> 		 */
> 		if (reserved || !nvme_rdma_queue_idx(queue)) {
> 			nvme_rdma_complete_timed_out(rq);
> 			return BLK_EH_DONE;
> 		}
> 		break;
> 	default:
> 		break;
> 	}
> 
> 	nvme_rdma_error_recovery(ctrl);
>   	return BLK_EH_RESET_TIMER;
> }

Maybe that can work, I'll look into that. thanks.


* Re: [PATCH v2 1/8] nvme-fabrics: allow to queue requests for live queues
  2020-08-14  7:08     ` Sagi Grimberg
@ 2020-08-14  7:22       ` Christoph Hellwig
  2020-08-14 15:55         ` James Smart
  2020-08-14 17:49         ` Sagi Grimberg
  0 siblings, 2 replies; 30+ messages in thread
From: Christoph Hellwig @ 2020-08-14  7:22 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, Christoph Hellwig, linux-nvme, James Smart

On Fri, Aug 14, 2020 at 12:08:52AM -0700, Sagi Grimberg wrote:
>> Which will still happen with the admin queue user passthrough
>> commands with this patch, so I don't think it actually solves anything,
>> it just reduces the exposure a bit.
>
> The original version of the patch removed that as well, but james
> indicated that it's still needed because we have no way to make sure
> the admin (re)connect will be the first request when we unquiesce.

Is that whole thing really a problem?  All the pass through requests
are inserted at the head of the queue, so how could something else
slip in before it?  If we have a race window we probably need
to use BLK_MQ_REQ_PREEMPT or something like to force executing the
connect on an otherwise frozen queue.


* Re: [PATCH v2 1/8] nvme-fabrics: allow to queue requests for live queues
  2020-08-14  7:22       ` Christoph Hellwig
@ 2020-08-14 15:55         ` James Smart
  2020-08-14 17:49         ` Sagi Grimberg
  1 sibling, 0 replies; 30+ messages in thread
From: James Smart @ 2020-08-14 15:55 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg; +Cc: Keith Busch, linux-nvme



On 8/14/2020 12:22 AM, Christoph Hellwig wrote:
> On Fri, Aug 14, 2020 at 12:08:52AM -0700, Sagi Grimberg wrote:
>>> Which will still happen with the admin queue user passthrough
>>> commands with this patch, so I don't think it actually solves anything,
>>> it just reduces the exposure a bit.
>> The original version of the patch removed that as well, but james
>> indicated that it's still needed because we have no way to make sure
>> the admin (re)connect will be the first request when we unquiesce.
> Is that whole thing really a problem?  All the pass through requests
> are inserted at the head of the queue, so how could something else
> slip in before it?  If we have a race window we probably need
> to use BLK_MQ_REQ_PREEMPT or something like to force executing the
> connect on an otherwise frozen queue.

It wasn't the connect I was worried about, but rather ioctls competing
with the I/Os that initialize the controller on the admin queue. We had
really odd ioctls getting in immediately after the connect and before
the controller was enabled.

-- james



* Re: [PATCH v2 1/8] nvme-fabrics: allow to queue requests for live queues
  2020-08-14  7:22       ` Christoph Hellwig
  2020-08-14 15:55         ` James Smart
@ 2020-08-14 17:49         ` Sagi Grimberg
  1 sibling, 0 replies; 30+ messages in thread
From: Sagi Grimberg @ 2020-08-14 17:49 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, James Smart, linux-nvme


>>> Which will still happen with the admin queue user passthrough
>>> commands with this patch, so I don't think it actually solves anything,
>>> it just reduces the exposure a bit.
>>
>> The original version of the patch removed that as well, but james
>> indicated that it's still needed because we have no way to make sure
>> the admin (re)connect will be the first request when we unquiesce.
> 
> Is that whole thing really a problem?  All the pass through requests
> are inserted at the head of the queue, so how could something else
> slip in before it?  If we have a race window we probably need
> to use BLK_MQ_REQ_PREEMPT or something like to force executing the
> connect on an otherwise frozen queue.

The problem is that we have to unquiesce the admin queue in order
to send the connect request (because it's sent on the ctrl->admin_q),
and once we do that, previously quiesced commands go through, and the
condition being discussed is what prevents them from passing through.

As James noted, this applies not only before the admin connect but also
before the controller is enabled.

Anyway, this needs to be resolved, but it doesn't make this change
incorrect.


* Re: [PATCH v2 6/8] nvme-rdma: serialize controller teardown sequences
  2020-08-06 19:11 ` [PATCH v2 6/8] nvme-rdma: serialize controller teardown sequences Sagi Grimberg
  2020-08-14  6:45   ` Christoph Hellwig
@ 2020-08-14 21:12   ` James Smart
  2020-08-19  0:35     ` Sagi Grimberg
  1 sibling, 1 reply; 30+ messages in thread
From: James Smart @ 2020-08-14 21:12 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch



On 8/6/2020 12:11 PM, Sagi Grimberg wrote:
> In the timeout handler we may need to complete a request because the
> request that timed out may be an I/O that is a part of a serial sequence
> of controller teardown or initialization. In order to complete the
> request, we need to fence any other context that may compete with us
> and complete the request that is timing out.
>
> In this case, we could have a potential double completion in case
> a hard-irq or a different competing context triggered error recovery
> and is running inflight request cancellation concurrently with the
> timeout handler.
>
> Protect using a ctrl teardown_lock to serialize contexts that may
> complete a cancelled request due to error recovery or a reset.
>
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>   drivers/nvme/host/rdma.c | 6 ++++++
>   1 file changed, 6 insertions(+)
>
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 44c76ffbb264..abc318737f35 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -122,6 +122,7 @@ struct nvme_rdma_ctrl {
>   	struct sockaddr_storage src_addr;
>   
>   	struct nvme_ctrl	ctrl;
> +	struct mutex		teardown_lock;
>   	bool			use_inline_data;
>   	u32			io_queues[HCTX_MAX_TYPES];
>   };
> @@ -997,6 +998,7 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
>   static void nvme_rdma_teardown_admin_queue(struct nvme_rdma_ctrl *ctrl,
>   		bool remove)
>   {
> +	mutex_lock(&ctrl->teardown_lock);
>   	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
>   	nvme_rdma_stop_queue(&ctrl->queues[0]);
>   	if (ctrl->ctrl.admin_tagset) {
> @@ -1007,11 +1009,13 @@ static void nvme_rdma_teardown_admin_queue(struct nvme_rdma_ctrl *ctrl,
>   	if (remove)
>   		blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
>   	nvme_rdma_destroy_admin_queue(ctrl, remove);
> +	mutex_unlock(&ctrl->teardown_lock);
>   }
>   
>   static void nvme_rdma_teardown_io_queues(struct nvme_rdma_ctrl *ctrl,
>   		bool remove)
>   {
> +	mutex_lock(&ctrl->teardown_lock);
>   	if (ctrl->ctrl.queue_count > 1) {
>   		nvme_start_freeze(&ctrl->ctrl);
>   		nvme_stop_queues(&ctrl->ctrl);
> @@ -1025,6 +1029,7 @@ static void nvme_rdma_teardown_io_queues(struct nvme_rdma_ctrl *ctrl,
>   			nvme_start_queues(&ctrl->ctrl);
>   		nvme_rdma_destroy_io_queues(ctrl, remove);
>   	}
> +	mutex_unlock(&ctrl->teardown_lock);
>   }
>   
>   static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl)
> @@ -2278,6 +2283,7 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
>   		return ERR_PTR(-ENOMEM);
>   	ctrl->ctrl.opts = opts;
>   	INIT_LIST_HEAD(&ctrl->list);
> +	mutex_init(&ctrl->teardown_lock);
>   
>   	if (!(opts->mask & NVMF_OPT_TRSVCID)) {
>   		opts->trsvcid =

Looks good - but....

I hit this same issue on FC - I will need to post a similar patch. My
problem was that the reset/teardown path triggered by the timeout raced
with the error that the connect path saw for its io, which dropped into
the partial-teardown steps as connect backed out.  So I recommend
looking at nvme_rdma_setup_ctrl() and any of its teardown paths that
don't take the mutex and may race with cases that do.

If it's all good - you can add my Reviewed-by for it.

-- james



* Re: [PATCH v2 7/8] nvme-rdma: fix timeout handler
  2020-08-14  7:14     ` Sagi Grimberg
@ 2020-08-14 23:19       ` James Smart
  2020-08-19  0:26         ` Sagi Grimberg
  0 siblings, 1 reply; 30+ messages in thread
From: James Smart @ 2020-08-14 23:19 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme



On 8/14/2020 12:14 AM, Sagi Grimberg wrote:
>
>>> +++ b/drivers/nvme/host/rdma.c
>>> @@ -1185,6 +1185,7 @@ static void nvme_rdma_error_recovery(struct 
>>> nvme_rdma_ctrl *ctrl)
>>>       if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
>>>           return;
>>>   +    dev_warn(ctrl->ctrl.device, "starting error recovery\n");
>>
>> Should this really be a warning?  I'd turn this down to _info.
>
> I can do that, but given that this is a fix, I'll do that
> as a separate patch.

I've found the impact of the error_recovery/ctrl reset aborting ios
to be a significant event, and since it should happen rarely, it is
much better to log it as a warning than as an info.

-- james



* Re: [PATCH v2 7/8] nvme-rdma: fix timeout handler
  2020-08-06 19:11 ` [PATCH v2 7/8] nvme-rdma: fix timeout handler Sagi Grimberg
  2020-08-14  6:52   ` Christoph Hellwig
@ 2020-08-14 23:27   ` James Smart
  2020-08-14 23:30     ` James Smart
  2020-08-19  0:38     ` Sagi Grimberg
  1 sibling, 2 replies; 30+ messages in thread
From: James Smart @ 2020-08-14 23:27 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch



On 8/6/2020 12:11 PM, Sagi Grimberg wrote:
> Currently we check if the controller state != LIVE, and
> we directly fail the command under the assumption that this
> is the connect command or an admin command within the
> controller initialization sequence.
>
> This is wrong; we need to check whether the request risks
> blocking the controller setup/teardown sequence if it is not
> completed, and only then fail it.
>
> The logic should be:
> - RESETTING, only fail fabrics/admin commands, otherwise
>    controller teardown will block; for any other command, reset
>    the timer and come back again.
> - CONNECTING, if this is a connect (or an admin command), we fail
>    right away (unblock controller initialization), otherwise we
>    treat it like anything else.
> - otherwise trigger error recovery and reset the timer (the
>    error handler will take care of completing/delaying it).
>
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>   drivers/nvme/host/rdma.c | 68 +++++++++++++++++++++++++++++-----------
>   1 file changed, 50 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index abc318737f35..30b401fcc06a 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -1185,6 +1185,7 @@ static void nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
>   	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
>   		return;
>   
> +	dev_warn(ctrl->ctrl.device, "starting error recovery\n");
>   	queue_work(nvme_reset_wq, &ctrl->err_work);
>   }
>   
> @@ -1951,6 +1952,23 @@ static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
>   	return 0;
>   }
>   
> +static void nvme_rdma_complete_timed_out(struct request *rq)
> +{
> +	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
> +	struct nvme_rdma_queue *queue = req->queue;
> +	struct nvme_rdma_ctrl *ctrl = queue->ctrl;
> +
> +	/* fence other contexts that may complete the command */
> +	mutex_lock(&ctrl->teardown_lock);
> +	nvme_rdma_stop_queue(queue);
> +	if (blk_mq_request_completed(rq))
> +		goto out;
> +	nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
> +	blk_mq_complete_request(rq);
> +out:
> +	mutex_unlock(&ctrl->teardown_lock);
> +}
> +

I believe there should be some comment explaining why it's ok to leave 
the rdma queue stopped.
I think it's ok as:
resetting: the controller will be reset, so the queue will be deleted
connecting: init io failures will teardown partially initialized 
controller, so the queue will be deleted

>   static enum blk_eh_timer_return
>   nvme_rdma_timeout(struct request *rq, bool reserved)
>   {
> @@ -1961,29 +1979,43 @@ nvme_rdma_timeout(struct request *rq, bool reserved)
>   	dev_warn(ctrl->ctrl.device, "I/O %d QID %d timeout\n",
>   		 rq->tag, nvme_rdma_queue_idx(queue));
>   
> -	/*
> -	 * Restart the timer if a controller reset is already scheduled. Any
> -	 * timed out commands would be handled before entering the connecting
> -	 * state.
> -	 */
> -	if (ctrl->ctrl.state == NVME_CTRL_RESETTING)
> +	switch (ctrl->ctrl.state) {
> +	case NVME_CTRL_RESETTING:
> +		if (!nvme_rdma_queue_idx(queue)) {
> +			/*
> +			 * if we are in teardown we must complete immediately
> +			 * because we may block the teardown sequence (e.g.
> +			 * nvme_disable_ctrl timed out).
> +			 */
> +			nvme_rdma_complete_timed_out(rq);
> +			return BLK_EH_DONE;
> +		}
> +		/*
> +		 * Restart the timer if a controller reset is already scheduled.
> +		 * Any timed out commands would be handled before entering the
> +		 * connecting state.
> +		 */
>   		return BLK_EH_RESET_TIMER;

If you're in RESETTING, why do you need to qualify ios only on the admin
queue? Can't all ios, regardless of queue, just be complete_timed_out()?
Isn't this just a race between the io timeout and the resetting routine
reaching the io?


> -
> -	if (ctrl->ctrl.state != NVME_CTRL_LIVE) {
> +	case NVME_CTRL_CONNECTING:
> +		if (reserved || !nvme_rdma_queue_idx(queue)) {
> +			/*
> +			 * if we are connecting we must complete immediately
> +			 * connect (reserved) or admin requests because we may
> +			 * block controller setup sequence.
> +			 */
> +			nvme_rdma_complete_timed_out(rq);
> +			return BLK_EH_DONE;
> +		}

This is reasonable.  But I'm wondering why this too isn't just
completing any io that timed out.  For the non-controller create/init
ios - they'll either bounce back to the multipath layer or be requeued.
With the requeue, there's an opportunity for Victor Gladkov's "reject
I/O to offline device" work to bounce it if it's been waiting a while.

> +		/* fallthru */
> +	default:
>   		/*
> -		 * Teardown immediately if controller times out while starting
> -		 * or we are already started error recovery. all outstanding
> -		 * requests are completed on shutdown, so we return BLK_EH_DONE.
> +		 * every other state should trigger the error recovery
> +		 * which will be handled by the flow and controller state
> +		 * machine
>   		 */
> -		flush_work(&ctrl->err_work);
> -		nvme_rdma_teardown_io_queues(ctrl, false);
> -		nvme_rdma_teardown_admin_queue(ctrl, false);
> -		return BLK_EH_DONE;
> +		nvme_rdma_error_recovery(ctrl);
>   	}
>   
> -	dev_warn(ctrl->ctrl.device, "starting error recovery\n");
> -	nvme_rdma_error_recovery(ctrl);
> -
>   	return BLK_EH_RESET_TIMER;
>   }
>   

-- james




* Re: [PATCH v2 7/8] nvme-rdma: fix timeout handler
  2020-08-14 23:27   ` James Smart
@ 2020-08-14 23:30     ` James Smart
  2020-08-19  0:39       ` Sagi Grimberg
  2020-08-19  0:38     ` Sagi Grimberg
  1 sibling, 1 reply; 30+ messages in thread
From: James Smart @ 2020-08-14 23:30 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch



On 8/14/2020 4:27 PM, James Smart wrote:
>
>
> On 8/6/2020 12:11 PM, Sagi Grimberg wrote:
>> Currently we check if the controller state != LIVE, and
>> we directly fail the command under the assumption that this
>> is the connect command or an admin command within the
>> controller initialization sequence.
>>
>> This is wrong; we need to check whether the request risks
>> blocking the controller setup/teardown sequence if it is not
>> completed, and only then fail it.
>>
>> The logic should be:
>> - RESETTING, only fail fabrics/admin commands, otherwise
>>    controller teardown will block; for any other command, reset
>>    the timer and come back again.
>> - CONNECTING, if this is a connect (or an admin command), we fail
>>    right away (unblock controller initialization), otherwise we
>>    treat it like anything else.
>> - otherwise trigger error recovery and reset the timer (the
>>    error handler will take care of completing/delaying it).
>>
>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>> ---
>>   drivers/nvme/host/rdma.c | 68 +++++++++++++++++++++++++++++-----------
>>   1 file changed, 50 insertions(+), 18 deletions(-)
>>
>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>> index abc318737f35..30b401fcc06a 100644

Note: FC has to do this a little differently. From what I can tell,
there are no cross-overs between your patches and mine, so no need to
wait for the FC patches to marry with them.

-- james



* Re: [PATCH v2 7/8] nvme-rdma: fix timeout handler
  2020-08-14 23:19       ` James Smart
@ 2020-08-19  0:26         ` Sagi Grimberg
  0 siblings, 0 replies; 30+ messages in thread
From: Sagi Grimberg @ 2020-08-19  0:26 UTC (permalink / raw)
  To: James Smart, Christoph Hellwig; +Cc: Keith Busch, linux-nvme


>>>> +++ b/drivers/nvme/host/rdma.c
>>>> @@ -1185,6 +1185,7 @@ static void nvme_rdma_error_recovery(struct 
>>>> nvme_rdma_ctrl *ctrl)
>>>>       if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
>>>>           return;
>>>>   +    dev_warn(ctrl->ctrl.device, "starting error recovery\n");
>>>
>>> Should this really be a warning?  I'd turn this down to _info.
>>
>> I can do that, but given that this is a fix, I'll do that
>> as a separate patch.
> 
> I've found the impact of the error_recovery/ctrl reset aborting ios
> to be a significant event, and since it should happen rarely, it is
> much better to log it as a warning than as an info.

We already log a warning in the timeout handler. I can keep it as is;
I'm pretty indifferent about it.


* Re: [PATCH v2 6/8] nvme-rdma: serialize controller teardown sequences
  2020-08-14 21:12   ` James Smart
@ 2020-08-19  0:35     ` Sagi Grimberg
  0 siblings, 0 replies; 30+ messages in thread
From: Sagi Grimberg @ 2020-08-19  0:35 UTC (permalink / raw)
  To: James Smart, linux-nvme, Christoph Hellwig, Keith Busch



On 8/14/20 2:12 PM, James Smart wrote:
> 
> 
> On 8/6/2020 12:11 PM, Sagi Grimberg wrote:
>> In the timeout handler we may need to complete a request because the
>> request that timed out may be an I/O that is a part of a serial sequence
>> of controller teardown or initialization. In order to complete the
>> request, we need to fence any other context that may compete with us
>> and complete the request that is timing out.
>>
>> In this case, we could have a potential double completion in case
>> a hard-irq or a different competing context triggered error recovery
>> and is running inflight request cancellation concurrently with the
>> timeout handler.
>>
>> Protect using a ctrl teardown_lock to serialize contexts that may
>> complete a cancelled request due to error recovery or a reset.
>>
>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>> ---
>>   drivers/nvme/host/rdma.c | 6 ++++++
>>   1 file changed, 6 insertions(+)
>>
>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>> index 44c76ffbb264..abc318737f35 100644
>> --- a/drivers/nvme/host/rdma.c
>> +++ b/drivers/nvme/host/rdma.c
>> @@ -122,6 +122,7 @@ struct nvme_rdma_ctrl {
>>       struct sockaddr_storage src_addr;
>>       struct nvme_ctrl    ctrl;
>> +    struct mutex        teardown_lock;
>>       bool            use_inline_data;
>>       u32            io_queues[HCTX_MAX_TYPES];
>>   };
>> @@ -997,6 +998,7 @@ static int nvme_rdma_configure_io_queues(struct 
>> nvme_rdma_ctrl *ctrl, bool new)
>>   static void nvme_rdma_teardown_admin_queue(struct nvme_rdma_ctrl *ctrl,
>>           bool remove)
>>   {
>> +    mutex_lock(&ctrl->teardown_lock);
>>       blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
>>       nvme_rdma_stop_queue(&ctrl->queues[0]);
>>       if (ctrl->ctrl.admin_tagset) {
>> @@ -1007,11 +1009,13 @@ static void 
>> nvme_rdma_teardown_admin_queue(struct nvme_rdma_ctrl *ctrl,
>>       if (remove)
>>           blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
>>       nvme_rdma_destroy_admin_queue(ctrl, remove);
>> +    mutex_unlock(&ctrl->teardown_lock);
>>   }
>>   static void nvme_rdma_teardown_io_queues(struct nvme_rdma_ctrl *ctrl,
>>           bool remove)
>>   {
>> +    mutex_lock(&ctrl->teardown_lock);
>>       if (ctrl->ctrl.queue_count > 1) {
>>           nvme_start_freeze(&ctrl->ctrl);
>>           nvme_stop_queues(&ctrl->ctrl);
>> @@ -1025,6 +1029,7 @@ static void nvme_rdma_teardown_io_queues(struct 
>> nvme_rdma_ctrl *ctrl,
>>               nvme_start_queues(&ctrl->ctrl);
>>           nvme_rdma_destroy_io_queues(ctrl, remove);
>>       }
>> +    mutex_unlock(&ctrl->teardown_lock);
>>   }
>>   static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl)
>> @@ -2278,6 +2283,7 @@ static struct nvme_ctrl 
>> *nvme_rdma_create_ctrl(struct device *dev,
>>           return ERR_PTR(-ENOMEM);
>>       ctrl->ctrl.opts = opts;
>>       INIT_LIST_HEAD(&ctrl->list);
>> +    mutex_init(&ctrl->teardown_lock);
>>       if (!(opts->mask & NVMF_OPT_TRSVCID)) {
>>           opts->trsvcid =
> 
> Looks good - but....
> 
> I hit this same issue on FC - I will need to post a similar patch. My
> problem was that the reset/teardown path triggered by the timeout raced
> with the error that the connect path saw for its io, which dropped into
> the partial-teardown steps as connect backed out.  So I recommend
> looking at nvme_rdma_setup_ctrl() and any of its teardown paths that
> don't take the mutex and may race with cases that do.

Good point.

The synchronization is not really required for the entire teardown path,
because the delete_work, the flushing of the connect_work, and the state
machine don't allow reset and reconnect to compete. So this
synchronization is really just against the timeout handler.


* Re: [PATCH v2 7/8] nvme-rdma: fix timeout handler
  2020-08-14 23:27   ` James Smart
  2020-08-14 23:30     ` James Smart
@ 2020-08-19  0:38     ` Sagi Grimberg
  1 sibling, 0 replies; 30+ messages in thread
From: Sagi Grimberg @ 2020-08-19  0:38 UTC (permalink / raw)
  To: James Smart, linux-nvme, Christoph Hellwig, Keith Busch


>> +static void nvme_rdma_complete_timed_out(struct request *rq)
>> +{
>> +    struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
>> +    struct nvme_rdma_queue *queue = req->queue;
>> +    struct nvme_rdma_ctrl *ctrl = queue->ctrl;
>> +
>> +    /* fence other contexts that may complete the command */
>> +    mutex_lock(&ctrl->teardown_lock);
>> +    nvme_rdma_stop_queue(queue);
>> +    if (blk_mq_request_completed(rq))
>> +        goto out;
>> +    nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
>> +    blk_mq_complete_request(rq);
>> +out:
>> +    mutex_unlock(&ctrl->teardown_lock);
>> +}
>> +
> 
> I believe there should be some comment explaining why it's ok to leave 
> the rdma queue stopped.
> I think it's ok as:
> resetting: the controller will be reset, so the queue will be deleted
> connecting: init io failures will teardown partially initialized 
> controller, so the queue will be deleted

I can add this comment.
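
Something along these lines, layered onto the function as posted above
(the comment wording is only a sketch, not final):

static void nvme_rdma_complete_timed_out(struct request *rq)
{
	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
	struct nvme_rdma_queue *queue = req->queue;
	struct nvme_rdma_ctrl *ctrl = queue->ctrl;

	/* fence other contexts that may complete the command */
	mutex_lock(&ctrl->teardown_lock);
	/*
	 * It is safe to leave the rdma queue stopped here:
	 * RESETTING - the controller is being torn down, so the queue
	 *             will be deleted by the reset path anyway.
	 * CONNECTING - an init I/O failure tears down the partially
	 *              initialized controller, which also deletes the
	 *              queue.
	 */
	nvme_rdma_stop_queue(queue);
	if (blk_mq_request_completed(rq))
		goto out;
	nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
	blk_mq_complete_request(rq);
out:
	mutex_unlock(&ctrl->teardown_lock);
}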

> 
>>   static enum blk_eh_timer_return
>>   nvme_rdma_timeout(struct request *rq, bool reserved)
>>   {
>> @@ -1961,29 +1979,43 @@ nvme_rdma_timeout(struct request *rq, bool 
>> reserved)
>>       dev_warn(ctrl->ctrl.device, "I/O %d QID %d timeout\n",
>>            rq->tag, nvme_rdma_queue_idx(queue));
>> -    /*
>> -     * Restart the timer if a controller reset is already scheduled. Any
>> -     * timed out commands would be handled before entering the 
>> connecting
>> -     * state.
>> -     */
>> -    if (ctrl->ctrl.state == NVME_CTRL_RESETTING)
>> +    switch (ctrl->ctrl.state) {
>> +    case NVME_CTRL_RESETTING:
>> +        if (!nvme_rdma_queue_idx(queue)) {
>> +            /*
>> +             * if we are in teardown we must complete immediately
>> +             * because we may block the teardown sequence (e.g.
>> +             * nvme_disable_ctrl timed out).
>> +             */
>> +            nvme_rdma_complete_timed_out(rq);
>> +            return BLK_EH_DONE;
>> +        }
>> +        /*
>> +         * Restart the timer if a controller reset is already scheduled.
>> +         * Any timed out commands would be handled before entering the
>> +         * connecting state.
>> +         */
>>           return BLK_EH_RESET_TIMER;
> 
> If you're in RESETTING, why do you need to qualify ios only on the admin
> queue? Can't all ios, regardless of queue, just be complete_timed_out()?
> Isn't this just a race between the io timeout and the resetting routine
> reaching the io?

You are correct; given that we are serialized against the reset/error
recovery, we can just do the same for both. The request is going to
be cancelled anyway.

> 
> 
>> -
>> -    if (ctrl->ctrl.state != NVME_CTRL_LIVE) {
>> +    case NVME_CTRL_CONNECTING:
>> +        if (reserved || !nvme_rdma_queue_idx(queue)) {
>> +            /*
>> +             * if we are connecting we must complete immediately
>> +             * connect (reserved) or admin requests because we may
>> +             * block controller setup sequence.
>> +             */
>> +            nvme_rdma_complete_timed_out(rq);
>> +            return BLK_EH_DONE;
>> +        }
> 
> This is reasonable.  But I'm wondering why this too isn't just
> completing any io that timed out.  For the non-controller create/init
> ios - they'll either bounce back to the multipath layer or be requeued.
> With the requeue, there's an opportunity for Victor Gladkov's "reject
> I/O to offline device" work to bounce it if it's been waiting a while.

You are right, I can do that to any state that is not LIVE.
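
Roughly, the handler would then collapse to something like this (a
sketch of a possible v3 based on the two points above, not what was
posted in v2):

static enum blk_eh_timer_return
nvme_rdma_timeout(struct request *rq, bool reserved)
{
	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
	struct nvme_rdma_queue *queue = req->queue;
	struct nvme_rdma_ctrl *ctrl = queue->ctrl;

	dev_warn(ctrl->ctrl.device, "I/O %d QID %d timeout\n",
		 rq->tag, nvme_rdma_queue_idx(queue));

	if (ctrl->ctrl.state != NVME_CTRL_LIVE) {
		/*
		 * If we are resetting, connecting or deleting we should
		 * complete immediately because we may block controller
		 * setup or teardown sequences.
		 */
		nvme_rdma_complete_timed_out(rq);
		return BLK_EH_DONE;
	}

	/*
	 * LIVE state should trigger the normal error recovery which
	 * will handle completing this request.
	 */
	nvme_rdma_error_recovery(ctrl);
	return BLK_EH_RESET_TIMER;
}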

Thanks for the review!


* Re: [PATCH v2 7/8] nvme-rdma: fix timeout handler
  2020-08-14 23:30     ` James Smart
@ 2020-08-19  0:39       ` Sagi Grimberg
  0 siblings, 0 replies; 30+ messages in thread
From: Sagi Grimberg @ 2020-08-19  0:39 UTC (permalink / raw)
  To: James Smart, linux-nvme, Christoph Hellwig, Keith Busch


>> On 8/6/2020 12:11 PM, Sagi Grimberg wrote:
>>> Currently we check if the controller state != LIVE, and
>>> we directly fail the command under the assumption that this
>>> is the connect command or an admin command within the
>>> controller initialization sequence.
>>>
>>> This is wrong; we need to check whether the request risks
>>> blocking the controller setup/teardown sequence if it is not
>>> completed, and only then fail it.
>>>
>>> The logic should be:
>>> - RESETTING, only fail fabrics/admin commands, otherwise
>>>    controller teardown will block; for any other command, reset
>>>    the timer and come back again.
>>> - CONNECTING, if this is a connect (or an admin command), we fail
>>>    right away (unblock controller initialization), otherwise we
>>>    treat it like anything else.
>>> - otherwise trigger error recovery and reset the timer (the
>>>    error handler will take care of completing/delaying it).
>>>
>>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>>> ---
>>>   drivers/nvme/host/rdma.c | 68 +++++++++++++++++++++++++++++-----------
>>>   1 file changed, 50 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>>> index abc318737f35..30b401fcc06a 100644
> 
> Note: FC has to do this a little differently. From what I can tell, 
> there's no cross-overs in your patches vs mine, so no need to wait for 
> FC patches to marry with them.

Good to know, James.


Thread overview: 30+ messages
2020-08-06 19:11 [PATCH v2 0/8] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
2020-08-06 19:11 ` [PATCH v2 1/8] nvme-fabrics: allow to queue requests for live queues Sagi Grimberg
2020-08-14  6:44   ` Christoph Hellwig
2020-08-14  7:08     ` Sagi Grimberg
2020-08-14  7:22       ` Christoph Hellwig
2020-08-14 15:55         ` James Smart
2020-08-14 17:49         ` Sagi Grimberg
2020-08-06 19:11 ` [PATCH v2 2/8] nvme: have nvme_wait_freeze_timeout return if it timed out Sagi Grimberg
2020-08-14  6:45   ` Christoph Hellwig
2020-08-14  7:09     ` Sagi Grimberg
2020-08-06 19:11 ` [PATCH v2 3/8] nvme-tcp: serialize controller teardown double completion Sagi Grimberg
2020-08-06 19:11 ` [PATCH v2 4/8] nvme-tcp: fix timeout handler Sagi Grimberg
2020-08-06 19:11 ` [PATCH v2 5/8] nvme-tcp: fix reset hang if controller died in the middle of a reset Sagi Grimberg
2020-08-06 19:11 ` [PATCH v2 6/8] nvme-rdma: serialize controller teardown sequences Sagi Grimberg
2020-08-14  6:45   ` Christoph Hellwig
2020-08-14 21:12   ` James Smart
2020-08-19  0:35     ` Sagi Grimberg
2020-08-06 19:11 ` [PATCH v2 7/8] nvme-rdma: fix timeout handler Sagi Grimberg
2020-08-14  6:52   ` Christoph Hellwig
2020-08-14  7:14     ` Sagi Grimberg
2020-08-14 23:19       ` James Smart
2020-08-19  0:26         ` Sagi Grimberg
2020-08-14 23:27   ` James Smart
2020-08-14 23:30     ` James Smart
2020-08-19  0:39       ` Sagi Grimberg
2020-08-19  0:38     ` Sagi Grimberg
2020-08-06 19:11 ` [PATCH v2 8/8] nvme-rdma: fix reset hang if controller died in the middle of a reset Sagi Grimberg
2020-08-14  6:53   ` Christoph Hellwig
2020-08-11 22:16 ` [PATCH v2 0/8] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
2020-08-13 15:39   ` Christoph Hellwig
