* [PATCH v3 0/9] fix possible controller reset hangs in nvme-tcp/nvme-rdma
@ 2020-08-20  5:36 Sagi Grimberg
  2020-08-20  5:36 ` [PATCH v3 1/9] nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance Sagi Grimberg
                   ` (9 more replies)
  0 siblings, 10 replies; 44+ messages in thread
From: Sagi Grimberg @ 2020-08-20  5:36 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, James Smart

When a controller reset runs during I/O we may hang if the controller
suddenly becomes unresponsive during the reset and/or the reconnection
stages. This is because the timeout handler did not fail inflight
commands properly and could not abort the controller reset sequence
when the controller becomes unresponsive (hence we can never recover,
even if the controller becomes responsive again later).

This set fixes nvme-tcp and nvme-rdma for exactly the same scenarios.

Changes from v2:
- moved the NVME_CTRL_NEW state check in __nvmf_check_ready to a separate patch
- various comment phrasing fixes
- fixed change log descriptions
- changed the nvme-tcp/nvme-rdma "fix timeout handler" patches to restore the
  cancellation of timed-out requests for all non-LIVE states, as the request
  is going to be cancelled anyway. The change now purely fixes how we
  serialize and fence against error recovery (as pointed out by James).

Changes from v1:
- added patches 3,6 to protect against possible (but rare) double
  completions for timed out requests.

Sagi Grimberg (9):
  nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance
  nvme-fabrics: allow to queue requests for live queues
  nvme: have nvme_wait_freeze_timeout return if it timed out
  nvme-tcp: serialize controller teardown sequences
  nvme-tcp: fix timeout handler
  nvme-tcp: fix reset hang if controller died in the middle of a reset
  nvme-rdma: serialize controller teardown sequences
  nvme-rdma: fix timeout handler
  nvme-rdma: fix reset hang if controller died in the middle of a reset

 drivers/nvme/host/core.c    |  3 +-
 drivers/nvme/host/fabrics.c | 13 +++---
 drivers/nvme/host/nvme.h    |  2 +-
 drivers/nvme/host/rdma.c    | 68 +++++++++++++++++++++++--------
 drivers/nvme/host/tcp.c     | 80 ++++++++++++++++++++++++++-----------
 5 files changed, 119 insertions(+), 47 deletions(-)

-- 
2.25.1



* [PATCH v3 1/9] nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance
  2020-08-20  5:36 [PATCH v3 0/9] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
@ 2020-08-20  5:36 ` Sagi Grimberg
  2020-08-20  6:02   ` Christoph Hellwig
  2020-08-20 20:49   ` James Smart
  2020-08-20  5:36 ` [PATCH v3 2/9] nvme-fabrics: allow to queue requests for live queues Sagi Grimberg
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 44+ messages in thread
From: Sagi Grimberg @ 2020-08-20  5:36 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, James Smart

NVME_CTRL_NEW should never see any I/O, because in order to start
initialization it has to transition to NVME_CTRL_CONNECTING and from
there it will never return to this state.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/fabrics.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 4ec4829d6233..32f61fc5f4c5 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -576,7 +576,6 @@ bool __nvmf_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
 	 * which is require to set the queue live in the appropinquate states.
 	 */
 	switch (ctrl->state) {
-	case NVME_CTRL_NEW:
 	case NVME_CTRL_CONNECTING:
 		if (nvme_is_fabrics(req->cmd) &&
 		    req->cmd->fabrics.fctype == nvme_fabrics_type_connect)
-- 
2.25.1



* [PATCH v3 2/9] nvme-fabrics: allow to queue requests for live queues
  2020-08-20  5:36 [PATCH v3 0/9] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
  2020-08-20  5:36 ` [PATCH v3 1/9] nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance Sagi Grimberg
@ 2020-08-20  5:36 ` Sagi Grimberg
  2020-08-20  6:09   ` Christoph Hellwig
                     ` (2 more replies)
  2020-08-20  5:36 ` [PATCH v3 3/9] nvme: have nvme_wait_freeze_timeout return if it timed out Sagi Grimberg
                   ` (7 subsequent siblings)
  9 siblings, 3 replies; 44+ messages in thread
From: Sagi Grimberg @ 2020-08-20  5:36 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, James Smart

Right now we are failing requests based on the controller state (which
is checked inline in nvmf_check_ready); however, we should definitely
accept requests if the queue is live.

When entering controller reset, we transition the controller into
NVME_CTRL_RESETTING, and then return BLK_STS_RESOURCE for non-mpath
requests (those that do not have blk_noretry_request set).

This is also the case for NVME_REQ_USERCMD requests, for the wrong
reason: there shouldn't be any reason for us to reject this I/O during
a controller reset. We do want to prevent passthru commands on the
admin queue, because we need the controller to fully initialize first
before we let user passthru admin commands be issued.

In a non-mpath setup, this means that the requests will simply be
requeued over and over forever, never allowing the q_usage_counter to
drop its final reference, causing the controller reset to hang if it
runs concurrently with heavy I/O.
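
For reference, the request gating in __nvmf_check_ready after this patch
boils down to the following (condensed sketch; the surrounding queue-live
handling is omitted):

	/* defer user passthru on the admin queue until the controller is LIVE */
	if (rq->q == ctrl->admin_q && (req->flags & NVME_REQ_USERCMD))
		return false;

	switch (ctrl->state) {
	case NVME_CTRL_CONNECTING:
		/* only the fabrics connect command may go out while connecting */
		if (blk_rq_is_passthrough(rq) && nvme_is_fabrics(req->cmd) &&
		    req->cmd->fabrics.fctype == nvme_fabrics_type_connect)
			return true;
		break;
	default:
		break;
	}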

Fixes: 35897b920c8a ("nvme-fabrics: fix and refine state checks in __nvmf_check_ready")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/fabrics.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 32f61fc5f4c5..8575724734e0 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -565,10 +565,14 @@ bool __nvmf_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
 	struct nvme_request *req = nvme_req(rq);
 
 	/*
-	 * If we are in some state of setup or teardown only allow
-	 * internally generated commands.
+	 * currently we have a problem sending passthru commands
+	 * on the admin_q if the controller is not LIVE because we can't
+	 * make sure that they are going out after the admin connect,
+	 * controller enable and/or other commands in the initialization
+	 * sequence. until the controller will be LIVE, fail with
+	 * BLK_STS_RESOURCE so that they will be rescheduled.
 	 */
-	if (!blk_rq_is_passthrough(rq) || (req->flags & NVME_REQ_USERCMD))
+	if (rq->q == ctrl->admin_q && (req->flags & NVME_REQ_USERCMD))
 		return false;
 
 	/*
@@ -577,7 +581,7 @@ bool __nvmf_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
 	 */
 	switch (ctrl->state) {
 	case NVME_CTRL_CONNECTING:
-		if (nvme_is_fabrics(req->cmd) &&
+		if (blk_rq_is_passthrough(rq) && nvme_is_fabrics(req->cmd) &&
 		    req->cmd->fabrics.fctype == nvme_fabrics_type_connect)
 			return true;
 		break;
-- 
2.25.1



* [PATCH v3 3/9] nvme: have nvme_wait_freeze_timeout return if it timed out
  2020-08-20  5:36 [PATCH v3 0/9] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
  2020-08-20  5:36 ` [PATCH v3 1/9] nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance Sagi Grimberg
  2020-08-20  5:36 ` [PATCH v3 2/9] nvme-fabrics: allow to queue requests for live queues Sagi Grimberg
@ 2020-08-20  5:36 ` Sagi Grimberg
  2020-08-20  6:09   ` Christoph Hellwig
  2020-08-20  5:36 ` [PATCH v3 4/9] nvme-tcp: serialize controller teardown sequences Sagi Grimberg
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 44+ messages in thread
From: Sagi Grimberg @ 2020-08-20  5:36 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, James Smart

Users can detect whether the wait has completed or not and take
appropriate action based on this information (e.g. whether to continue
initialization or rather fail and schedule another initialization
attempt).
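
For illustration, a caller is expected to consume the new return value
roughly as the later patches in this series do (sketch based on patches
6 and 9):

	nvme_start_queues(ctrl);
	if (!nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT)) {
		/*
		 * Freeze did not complete in time, the controller is likely
		 * unresponsive; fail the (re)initialization attempt.
		 */
		ret = -ENODEV;
		goto out_wait_freeze_timed_out;
	}
	blk_mq_update_nr_hw_queues(ctrl->tagset, ctrl->queue_count - 1);
	nvme_unfreeze(ctrl);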

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/core.c | 3 ++-
 drivers/nvme/host/nvme.h | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 725f1263b7d2..65333f67e2ff 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4541,7 +4541,7 @@ void nvme_unfreeze(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_unfreeze);
 
-void nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout)
+int nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout)
 {
 	struct nvme_ns *ns;
 
@@ -4552,6 +4552,7 @@ void nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout)
 			break;
 	}
 	up_read(&ctrl->namespaces_rwsem);
+	return timeout;
 }
 EXPORT_SYMBOL_GPL(nvme_wait_freeze_timeout);
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 49a566d192a7..c676af150665 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -598,7 +598,7 @@ void nvme_kill_queues(struct nvme_ctrl *ctrl);
 void nvme_sync_queues(struct nvme_ctrl *ctrl);
 void nvme_unfreeze(struct nvme_ctrl *ctrl);
 void nvme_wait_freeze(struct nvme_ctrl *ctrl);
-void nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout);
+int nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout);
 void nvme_start_freeze(struct nvme_ctrl *ctrl);
 
 #define NVME_QID_ANY -1
-- 
2.25.1



* [PATCH v3 4/9] nvme-tcp: serialize controller teardown sequences
  2020-08-20  5:36 [PATCH v3 0/9] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
                   ` (2 preceding siblings ...)
  2020-08-20  5:36 ` [PATCH v3 3/9] nvme: have nvme_wait_freeze_timeout return if it timed out Sagi Grimberg
@ 2020-08-20  5:36 ` Sagi Grimberg
  2020-08-20  5:36 ` [PATCH v3 5/9] nvme-tcp: fix timeout handler Sagi Grimberg
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 44+ messages in thread
From: Sagi Grimberg @ 2020-08-20  5:36 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, James Smart

In the timeout handler we may need to complete a request because the
request that timed out may be an I/O that is part of a serial sequence
of controller teardown or initialization. In order to complete the
request, we need to fence any other context that may compete with us
and complete the request that is timing out.

In this case, we could have a potential double completion if a hard-irq
or a different competing context triggered error recovery and is
running inflight request cancellation concurrently with the timeout
handler.

Protect with a per-controller teardown_lock that serializes contexts
which may complete a cancelled request due to error recovery or a
reset.
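
The timeout handler in the next patch then relies on this lock to make
sure only a single context completes a timed-out request, along these
lines (sketch taken from patch 5):

	/* fence other contexts that may complete the command */
	mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
	nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
	if (!blk_mq_request_completed(rq)) {
		nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
		blk_mq_complete_request(rq);
	}
	mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);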

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/tcp.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 3be4749dbf11..d58a6e2a4ab1 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -124,6 +124,7 @@ struct nvme_tcp_ctrl {
 	struct sockaddr_storage src_addr;
 	struct nvme_ctrl	ctrl;
 
+	struct mutex		teardown_lock;
 	struct work_struct	err_work;
 	struct delayed_work	connect_work;
 	struct nvme_tcp_request async_req;
@@ -1527,7 +1528,6 @@ static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
 
 	if (!test_and_clear_bit(NVME_TCP_Q_LIVE, &queue->flags))
 		return;
-
 	__nvme_tcp_stop_queue(queue);
 }
 
@@ -1875,6 +1875,7 @@ static int nvme_tcp_configure_admin_queue(struct nvme_ctrl *ctrl, bool new)
 static void nvme_tcp_teardown_admin_queue(struct nvme_ctrl *ctrl,
 		bool remove)
 {
+	mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
 	blk_mq_quiesce_queue(ctrl->admin_q);
 	nvme_tcp_stop_queue(ctrl, 0);
 	if (ctrl->admin_tagset) {
@@ -1885,13 +1886,16 @@ static void nvme_tcp_teardown_admin_queue(struct nvme_ctrl *ctrl,
 	if (remove)
 		blk_mq_unquiesce_queue(ctrl->admin_q);
 	nvme_tcp_destroy_admin_queue(ctrl, remove);
+	mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
 }
 
 static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
 		bool remove)
 {
+	mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
 	if (ctrl->queue_count <= 1)
-		return;
+		goto out;
+	blk_mq_quiesce_queue(ctrl->admin_q);
 	nvme_start_freeze(ctrl);
 	nvme_stop_queues(ctrl);
 	nvme_tcp_stop_io_queues(ctrl);
@@ -1903,6 +1907,8 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
 	if (remove)
 		nvme_start_queues(ctrl);
 	nvme_tcp_destroy_io_queues(ctrl, remove);
+out:
+	mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
 }
 
 static void nvme_tcp_reconnect_or_remove(struct nvme_ctrl *ctrl)
@@ -2423,6 +2429,7 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
 			nvme_tcp_reconnect_ctrl_work);
 	INIT_WORK(&ctrl->err_work, nvme_tcp_error_recovery_work);
 	INIT_WORK(&ctrl->ctrl.reset_work, nvme_reset_ctrl_work);
+	mutex_init(&ctrl->teardown_lock);
 
 	if (!(opts->mask & NVMF_OPT_TRSVCID)) {
 		opts->trsvcid =
-- 
2.25.1



* [PATCH v3 5/9] nvme-tcp: fix timeout handler
  2020-08-20  5:36 [PATCH v3 0/9] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
                   ` (3 preceding siblings ...)
  2020-08-20  5:36 ` [PATCH v3 4/9] nvme-tcp: serialize controller teardown sequences Sagi Grimberg
@ 2020-08-20  5:36 ` Sagi Grimberg
  2020-08-20  5:36 ` [PATCH v3 6/9] nvme-tcp: fix reset hang if controller died in the middle of a reset Sagi Grimberg
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 44+ messages in thread
From: Sagi Grimberg @ 2020-08-20  5:36 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, James Smart

When a request times out in a LIVE state, we simply trigger error
recovery and let the error recovery handle the request cancellation.
However, when a request times out in a non-LIVE state, we make sure to
complete it immediately, as it might block controller setup or teardown
and prevent forward progress.

However, tearing down the entire set of I/O and admin queues causes a
freeze/unfreeze imbalance (q->mq_freeze_depth) and is really overkill
for what we actually need, which is to just fence the controller
teardown that may be running, stop the queue, and cancel the request if
it is not already completed.

Now that we have the controller teardown_lock, we can safely serialize
request cancellation. This addresses a hang caused by calling an extra
queue freeze on the controller namespaces, which caused unfreeze to not
complete correctly.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/tcp.c | 56 ++++++++++++++++++++++++++---------------
 1 file changed, 36 insertions(+), 20 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index d58a6e2a4ab1..dbb661b899ef 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -465,6 +465,7 @@ static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
 	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
 		return;
 
+	dev_warn(ctrl->device, "starting error recovery\n");
 	queue_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work);
 }
 
@@ -2155,40 +2156,55 @@ static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
 	nvme_tcp_queue_request(&ctrl->async_req, true, true);
 }
 
+static void nvme_tcp_complete_timed_out(struct request *rq)
+{
+	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
+	struct nvme_ctrl *ctrl = &req->queue->ctrl->ctrl;
+
+	/* fence other contexts that may complete the command */
+	mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
+	nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
+	if (!blk_mq_request_completed(rq)) {
+		nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
+		blk_mq_complete_request(rq);
+	}
+	mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
+}
+
 static enum blk_eh_timer_return
 nvme_tcp_timeout(struct request *rq, bool reserved)
 {
 	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
-	struct nvme_tcp_ctrl *ctrl = req->queue->ctrl;
+	struct nvme_ctrl *ctrl = &req->queue->ctrl->ctrl;
 	struct nvme_tcp_cmd_pdu *pdu = req->pdu;
 
-	/*
-	 * Restart the timer if a controller reset is already scheduled. Any
-	 * timed out commands would be handled before entering the connecting
-	 * state.
-	 */
-	if (ctrl->ctrl.state == NVME_CTRL_RESETTING)
-		return BLK_EH_RESET_TIMER;
-
-	dev_warn(ctrl->ctrl.device,
+	dev_warn(ctrl->device,
 		"queue %d: timeout request %#x type %d\n",
 		nvme_tcp_queue_id(req->queue), rq->tag, pdu->hdr.type);
 
-	if (ctrl->ctrl.state != NVME_CTRL_LIVE) {
+	if (ctrl->state != NVME_CTRL_LIVE) {
 		/*
-		 * Teardown immediately if controller times out while starting
-		 * or we are already started error recovery. all outstanding
-		 * requests are completed on shutdown, so we return BLK_EH_DONE.
+		 * If we are resetting, connecting or deleting we should
+		 * complete immediately because we may block controller
+		 * teardown or setup sequence
+		 * - ctrl disable/shutdown fabrics requests
+		 * - connect requests
+		 * - initialization admin requests
+		 * - I/O requests that entered after unquiescing and
+		 *   the controller stopped responding
+		 *
+		 * All other requests should be cancelled by the error
+		 * recovery work, so it's fine that we fail it here.
 		 */
-		flush_work(&ctrl->err_work);
-		nvme_tcp_teardown_io_queues(&ctrl->ctrl, false);
-		nvme_tcp_teardown_admin_queue(&ctrl->ctrl, false);
+		nvme_tcp_complete_timed_out(rq);
 		return BLK_EH_DONE;
 	}
 
-	dev_warn(ctrl->ctrl.device, "starting error recovery\n");
-	nvme_tcp_error_recovery(&ctrl->ctrl);
-
+	/*
+	 * LIVE state should trigger the normal error recovery which will
+	 * handle completing this request.
+	 */
+	nvme_tcp_error_recovery(ctrl);
 	return BLK_EH_RESET_TIMER;
 }
 
-- 
2.25.1



* [PATCH v3 6/9] nvme-tcp: fix reset hang if controller died in the middle of a reset
  2020-08-20  5:36 [PATCH v3 0/9] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
                   ` (4 preceding siblings ...)
  2020-08-20  5:36 ` [PATCH v3 5/9] nvme-tcp: fix timeout handler Sagi Grimberg
@ 2020-08-20  5:36 ` Sagi Grimberg
  2020-08-20  5:36 ` [PATCH v3 7/9] nvme-rdma: serialize controller teardown sequences Sagi Grimberg
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 44+ messages in thread
From: Sagi Grimberg @ 2020-08-20  5:36 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, James Smart

If the controller becomes unresponsive in the middle of a reset, we will
hang because we are waiting for the freeze to complete, but that cannot
happen since we have commands inflight holding the q_usage_counter, and
we can't blindly fail requests that time out.

So wait for the queue freeze with a timeout, and if it does not complete
in time, fail and have the error handling take care of how to proceed
(either schedule a reconnect or remove the controller).

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/tcp.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index dbb661b899ef..873d6afcbc9e 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1783,7 +1783,15 @@ static int nvme_tcp_configure_io_queues(struct nvme_ctrl *ctrl, bool new)
 
 	if (!new) {
 		nvme_start_queues(ctrl);
-		nvme_wait_freeze(ctrl);
+		if (!nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT)) {
+			/*
+			 * If we timed out waiting for freeze we are likely to
+			 * be stuck.  Fail the controller initialization just
+			 * to be safe.
+			 */
+			ret = -ENODEV;
+			goto out_wait_freeze_timed_out;
+		}
 		blk_mq_update_nr_hw_queues(ctrl->tagset,
 			ctrl->queue_count - 1);
 		nvme_unfreeze(ctrl);
@@ -1791,6 +1799,9 @@ static int nvme_tcp_configure_io_queues(struct nvme_ctrl *ctrl, bool new)
 
 	return 0;
 
+out_wait_freeze_timed_out:
+	nvme_stop_queues(ctrl);
+	nvme_tcp_stop_io_queues(ctrl);
 out_cleanup_connect_q:
 	if (new)
 		blk_cleanup_queue(ctrl->connect_q);
-- 
2.25.1



* [PATCH v3 7/9] nvme-rdma: serialize controller teardown sequences
  2020-08-20  5:36 [PATCH v3 0/9] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
                   ` (5 preceding siblings ...)
  2020-08-20  5:36 ` [PATCH v3 6/9] nvme-tcp: fix reset hang if controller died in the middle of a reset Sagi Grimberg
@ 2020-08-20  5:36 ` Sagi Grimberg
  2020-08-20 21:04   ` James Smart
  2020-08-21 21:08   ` James Smart
  2020-08-20  5:36 ` [PATCH v3 8/9] nvme-rdma: fix timeout handler Sagi Grimberg
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 44+ messages in thread
From: Sagi Grimberg @ 2020-08-20  5:36 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, James Smart

In the timeout handler we may need to complete a request because the
request that timed out may be an I/O that is part of a serial sequence
of controller teardown or initialization. In order to complete the
request, we need to fence any other context that may compete with us
and complete the request that is timing out.

In this case, we could have a potential double completion if a hard-irq
or a different competing context triggered error recovery and is
running inflight request cancellation concurrently with the timeout
handler.

Protect with a per-controller teardown_lock that serializes contexts
which may complete a cancelled request due to error recovery or a
reset.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/rdma.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 4c5660201009..ed387f61f8f7 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -122,6 +122,7 @@ struct nvme_rdma_ctrl {
 	struct sockaddr_storage src_addr;
 
 	struct nvme_ctrl	ctrl;
+	struct mutex		teardown_lock;
 	bool			use_inline_data;
 	u32			io_queues[HCTX_MAX_TYPES];
 };
@@ -997,6 +998,7 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
 static void nvme_rdma_teardown_admin_queue(struct nvme_rdma_ctrl *ctrl,
 		bool remove)
 {
+	mutex_lock(&ctrl->teardown_lock);
 	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
 	nvme_rdma_stop_queue(&ctrl->queues[0]);
 	if (ctrl->ctrl.admin_tagset) {
@@ -1007,11 +1009,13 @@ static void nvme_rdma_teardown_admin_queue(struct nvme_rdma_ctrl *ctrl,
 	if (remove)
 		blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
 	nvme_rdma_destroy_admin_queue(ctrl, remove);
+	mutex_unlock(&ctrl->teardown_lock);
 }
 
 static void nvme_rdma_teardown_io_queues(struct nvme_rdma_ctrl *ctrl,
 		bool remove)
 {
+	mutex_lock(&ctrl->teardown_lock);
 	if (ctrl->ctrl.queue_count > 1) {
 		nvme_start_freeze(&ctrl->ctrl);
 		nvme_stop_queues(&ctrl->ctrl);
@@ -1025,6 +1029,7 @@ static void nvme_rdma_teardown_io_queues(struct nvme_rdma_ctrl *ctrl,
 			nvme_start_queues(&ctrl->ctrl);
 		nvme_rdma_destroy_io_queues(ctrl, remove);
 	}
+	mutex_unlock(&ctrl->teardown_lock);
 }
 
 static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl)
@@ -2278,6 +2283,7 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
 		return ERR_PTR(-ENOMEM);
 	ctrl->ctrl.opts = opts;
 	INIT_LIST_HEAD(&ctrl->list);
+	mutex_init(&ctrl->teardown_lock);
 
 	if (!(opts->mask & NVMF_OPT_TRSVCID)) {
 		opts->trsvcid =
-- 
2.25.1



* [PATCH v3 8/9] nvme-rdma: fix timeout handler
  2020-08-20  5:36 [PATCH v3 0/9] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
                   ` (6 preceding siblings ...)
  2020-08-20  5:36 ` [PATCH v3 7/9] nvme-rdma: serialize controller teardown sequences Sagi Grimberg
@ 2020-08-20  5:36 ` Sagi Grimberg
  2020-08-20  6:10   ` Christoph Hellwig
  2020-08-20 21:37   ` James Smart
  2020-08-20  5:36 ` [PATCH v3 9/9] nvme-rdma: fix reset hang if controller died in the middle of a reset Sagi Grimberg
  2020-08-24 18:29 ` [PATCH v3 0/9] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
  9 siblings, 2 replies; 44+ messages in thread
From: Sagi Grimberg @ 2020-08-20  5:36 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, James Smart

When a request times out in a LIVE state, we simply trigger error
recovery and let the error recovery handle the request cancellation.
However, when a request times out in a non-LIVE state, we make sure to
complete it immediately, as it might block controller setup or teardown
and prevent forward progress.

However, tearing down the entire set of I/O and admin queues causes a
freeze/unfreeze imbalance (q->mq_freeze_depth) and is really overkill
for what we actually need, which is to just fence the controller
teardown that may be running, stop the queue, and cancel the request if
it is not already completed.

Now that we have the controller teardown_lock, we can safely serialize
request cancellation. This addresses a hang caused by calling an extra
queue freeze on the controller namespaces, which caused unfreeze to not
complete correctly.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/rdma.c | 49 +++++++++++++++++++++++++++-------------
 1 file changed, 33 insertions(+), 16 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index ed387f61f8f7..cb8731f4b316 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1185,6 +1185,7 @@ static void nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
 	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
 		return;
 
+	dev_warn(ctrl->ctrl.device, "starting error recovery\n");
 	queue_work(nvme_reset_wq, &ctrl->err_work);
 }
 
@@ -1951,6 +1952,22 @@ static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
 	return 0;
 }
 
+static void nvme_rdma_complete_timed_out(struct request *rq)
+{
+	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
+	struct nvme_rdma_queue *queue = req->queue;
+	struct nvme_rdma_ctrl *ctrl = queue->ctrl;
+
+	/* fence other contexts that may complete the command */
+	mutex_lock(&ctrl->teardown_lock);
+	nvme_rdma_stop_queue(queue);
+	if (!blk_mq_request_completed(rq)) {
+		nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
+		blk_mq_complete_request(rq);
+	}
+	mutex_unlock(&ctrl->teardown_lock);
+}
+
 static enum blk_eh_timer_return
 nvme_rdma_timeout(struct request *rq, bool reserved)
 {
@@ -1961,29 +1978,29 @@ nvme_rdma_timeout(struct request *rq, bool reserved)
 	dev_warn(ctrl->ctrl.device, "I/O %d QID %d timeout\n",
 		 rq->tag, nvme_rdma_queue_idx(queue));
 
-	/*
-	 * Restart the timer if a controller reset is already scheduled. Any
-	 * timed out commands would be handled before entering the connecting
-	 * state.
-	 */
-	if (ctrl->ctrl.state == NVME_CTRL_RESETTING)
-		return BLK_EH_RESET_TIMER;
-
 	if (ctrl->ctrl.state != NVME_CTRL_LIVE) {
 		/*
-		 * Teardown immediately if controller times out while starting
-		 * or we are already started error recovery. all outstanding
-		 * requests are completed on shutdown, so we return BLK_EH_DONE.
+		 * If we are resetting, connecting or deleting we should
+		 * complete immediately because we may block controller
+		 * teardown or setup sequence
+		 * - ctrl disable/shutdown fabrics requests
+		 * - connect requests
+		 * - initialization admin requests
+		 * - I/O requests that entered after unquiescing and
+		 *   the controller stopped responding
+		 *
+		 * All other requests should be cancelled by the error
+		 * recovery work, so it's fine that we fail it here.
 		 */
-		flush_work(&ctrl->err_work);
-		nvme_rdma_teardown_io_queues(ctrl, false);
-		nvme_rdma_teardown_admin_queue(ctrl, false);
+		nvme_rdma_complete_timed_out(rq);
 		return BLK_EH_DONE;
 	}
 
-	dev_warn(ctrl->ctrl.device, "starting error recovery\n");
+	/*
+	 * LIVE state should trigger the normal error recovery which will
+	 * handle completing this request.
+	 */
 	nvme_rdma_error_recovery(ctrl);
-
 	return BLK_EH_RESET_TIMER;
 }
 
-- 
2.25.1



* [PATCH v3 9/9] nvme-rdma: fix reset hang if controller died in the middle of a reset
  2020-08-20  5:36 [PATCH v3 0/9] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
                   ` (7 preceding siblings ...)
  2020-08-20  5:36 ` [PATCH v3 8/9] nvme-rdma: fix timeout handler Sagi Grimberg
@ 2020-08-20  5:36 ` Sagi Grimberg
  2020-08-20  6:10   ` Christoph Hellwig
  2020-08-24 18:29 ` [PATCH v3 0/9] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
  9 siblings, 1 reply; 44+ messages in thread
From: Sagi Grimberg @ 2020-08-20  5:36 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, James Smart

If the controller becomes unresponsive in the middle of a reset, we
will hang because we are waiting for the freeze to complete, but that
cannot happen since we have commands inflight holding the
q_usage_counter, and we can't blindly fail requests that time out.

So wait for the queue freeze with a timeout, and if it does not
complete in time, fail and have the error handling take care of how to
proceed (either schedule a reconnect or remove the controller).

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/rdma.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index cb8731f4b316..1d8ea5305d60 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -976,7 +976,15 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
 
 	if (!new) {
 		nvme_start_queues(&ctrl->ctrl);
-		nvme_wait_freeze(&ctrl->ctrl);
+		if (!nvme_wait_freeze_timeout(&ctrl->ctrl, NVME_IO_TIMEOUT)) {
+			/*
+			 * If we timed out waiting for freeze we are likely to
+			 * be stuck.  Fail the controller initialization just
+			 * to be safe.
+			 */
+			ret = -ENODEV;
+			goto out_wait_freeze_timed_out;
+		}
 		blk_mq_update_nr_hw_queues(ctrl->ctrl.tagset,
 			ctrl->ctrl.queue_count - 1);
 		nvme_unfreeze(&ctrl->ctrl);
@@ -984,6 +992,9 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
 
 	return 0;
 
+out_wait_freeze_timed_out:
+	nvme_stop_queues(&ctrl->ctrl);
+	nvme_rdma_stop_io_queues(ctrl);
 out_cleanup_connect_q:
 	if (new)
 		blk_cleanup_queue(ctrl->ctrl.connect_q);
-- 
2.25.1



* Re: [PATCH v3 1/9] nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance
  2020-08-20  5:36 ` [PATCH v3 1/9] nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance Sagi Grimberg
@ 2020-08-20  6:02   ` Christoph Hellwig
  2020-08-20 20:49   ` James Smart
  1 sibling, 0 replies; 44+ messages in thread
From: Christoph Hellwig @ 2020-08-20  6:02 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, Christoph Hellwig, linux-nvme, James Smart

On Wed, Aug 19, 2020 at 10:36:43PM -0700, Sagi Grimberg wrote:
> NVME_CTRL_NEW should never see any I/O, because in order to start
> initialization it has to transition to NVME_CTRL_CONNECTING and from
> there it will never return to this state.
> 
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v3 2/9] nvme-fabrics: allow to queue requests for live queues
  2020-08-20  5:36 ` [PATCH v3 2/9] nvme-fabrics: allow to queue requests for live queues Sagi Grimberg
@ 2020-08-20  6:09   ` Christoph Hellwig
  2020-08-20 16:58     ` Sagi Grimberg
  2020-08-20 20:54   ` James Smart
  2020-08-20 20:56   ` James Smart
  2 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2020-08-20  6:09 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, Christoph Hellwig, linux-nvme, James Smart

On Wed, Aug 19, 2020 at 10:36:44PM -0700, Sagi Grimberg wrote:
> Right now we are failing requests based on the controller state (which
> is checked inline in nvmf_check_ready) however we should definitely
> accept requests if the queue is live.
> 
> When entering controller reset, we transition the controller into
> NVME_CTRL_RESETTING, and then return BLK_STS_RESOURCE for non-mpath
> requests (have blk_noretry_request set).
> 
> This is also the case for NVME_REQ_USER for the wrong reason. There
> shouldn't be any reason for us to reject this I/O in a controller reset.
> We do want to prevent passthru commands on the admin queue because we
> need the controller to fully initialize first before we let user passthru
> admin commands to be issued.
> 
> In a non-mpath setup, this means that the requests will simply be
> requeued over and over forever not allowing the q_usage_counter to drop
> its final reference, causing controller reset to hang if running
> concurrently with heavy I/O.

I'm still rather bothered by the admin queue exception.  And given that
the q_usage_counter problem should only really be an issue for file system
requests, as passthrough requests do not automatically get retried, why
can't we just reject all user commands to be symmetric and straightforward?
The callers in userspace need to be able to cope with retryable errors
anyway.
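
In other words, something along these lines instead of the admin_q
special case (just a sketch of the suggestion, not tested):

	/* reject all user commands until the controller is LIVE */
	if (req->flags & NVME_REQ_USERCMD)
		return false;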

>  	/*
> +	 * currently we have a problem sending passthru commands
> +	 * on the admin_q if the controller is not LIVE because we can't
> +	 * make sure that they are going out after the admin connect,
> +	 * controller enable and/or other commands in the initialization
> +	 * sequence. until the controller will be LIVE, fail with
> +	 * BLK_STS_RESOURCE so that they will be rescheduled.
>  	 */

Nit: please start multi-line comments with a capital letter.  Also, I
think some of the lines don't come close to using up the 80 characters
available.


* Re: [PATCH v3 3/9] nvme: have nvme_wait_freeze_timeout return if it timed out
  2020-08-20  5:36 ` [PATCH v3 3/9] nvme: have nvme_wait_freeze_timeout return if it timed out Sagi Grimberg
@ 2020-08-20  6:09   ` Christoph Hellwig
  0 siblings, 0 replies; 44+ messages in thread
From: Christoph Hellwig @ 2020-08-20  6:09 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, Christoph Hellwig, linux-nvme, James Smart

On Wed, Aug 19, 2020 at 10:36:45PM -0700, Sagi Grimberg wrote:
> Users can detect if the wait has completed or not and take appropriate
> actions based on this information (e.g. weather to continue
> initialization or rather fail and schedule another initialization
> attempt).
> 
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v3 8/9] nvme-rdma: fix timeout handler
  2020-08-20  5:36 ` [PATCH v3 8/9] nvme-rdma: fix timeout handler Sagi Grimberg
@ 2020-08-20  6:10   ` Christoph Hellwig
  2020-08-20 21:37   ` James Smart
  1 sibling, 0 replies; 44+ messages in thread
From: Christoph Hellwig @ 2020-08-20  6:10 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, Christoph Hellwig, linux-nvme, James Smart

On Wed, Aug 19, 2020 at 10:36:50PM -0700, Sagi Grimberg wrote:
> When a request times out in a LIVE state, we simply trigger error
> recovery and let the error recovery handle the request cancellation,
> however when a request times out in a non LIVE state, we make sure to
> complete it immediately as it might block controller setup or teardown
> and prevent forward progress.
> 
> However tearing down the entire set of I/O and admin queues causes
> freeze/unfreeze imbalance (q->mq_freeze_depth) because and is really
> an overkill to what we actually need, which is to just fence controller
> teardown that may be running, stop the queue, and cancel the request if
> it is not already completed.
> 
> Now that we have the controller teardown_lock, we can safely serialize
> request cancellation. This addresses a hang caused by calling extra
> queue freeze on controller namespaces, causing unfreeze to not complete
> correctly.

I still think this should be dev_info instead of dev_warn, but otherwise:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v3 9/9] nvme-rdma: fix reset hang if controller died in the middle of a reset
  2020-08-20  5:36 ` [PATCH v3 9/9] nvme-rdma: fix reset hang if controller died in the middle of a reset Sagi Grimberg
@ 2020-08-20  6:10   ` Christoph Hellwig
  0 siblings, 0 replies; 44+ messages in thread
From: Christoph Hellwig @ 2020-08-20  6:10 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, Christoph Hellwig, linux-nvme, James Smart

On Wed, Aug 19, 2020 at 10:36:51PM -0700, Sagi Grimberg wrote:
> If the controller becomes unresponsive in the middle of a reset, we
> will hang because we are waiting for the freeze to complete, but that
> cannot happen since we have commands that are inflight holding the
> q_usage_counter, and we can't blindly fail requests that times out.
> 
> So give a timeout and if we cannot wait for queue freeze before
> unfreezing, fail and have the error handling take care how to
> proceed (either schedule a reconnect of remove the controller).
> 
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v3 2/9] nvme-fabrics: allow to queue requests for live queues
  2020-08-20  6:09   ` Christoph Hellwig
@ 2020-08-20 16:58     ` Sagi Grimberg
  2020-08-20 20:45       ` James Smart
  0 siblings, 1 reply; 44+ messages in thread
From: Sagi Grimberg @ 2020-08-20 16:58 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, James Smart, linux-nvme


>> Right now we are failing requests based on the controller state (which
>> is checked inline in nvmf_check_ready) however we should definitely
>> accept requests if the queue is live.
>>
>> When entering controller reset, we transition the controller into
>> NVME_CTRL_RESETTING, and then return BLK_STS_RESOURCE for non-mpath
>> requests (have blk_noretry_request set).
>>
>> This is also the case for NVME_REQ_USER for the wrong reason. There
>> shouldn't be any reason for us to reject this I/O in a controller reset.
>> We do want to prevent passthru commands on the admin queue because we
>> need the controller to fully initialize first before we let user passthru
>> admin commands to be issued.
>>
>> In a non-mpath setup, this means that the requests will simply be
>> requeued over and over forever not allowing the q_usage_counter to drop
>> its final reference, causing controller reset to hang if running
>> concurrently with heavy I/O.
> 
> I'm still rather bothered with the admin queue exception.  And given that
> the q_usage_counter problem should only really be an issue for file system
> requests, as passthrough requests do not automatically get retried why
> can't we just reject all user command to be symetric and straight forward?
> The callers in userspace need to be able to cope with retryable errors
> anyway.

Looking at the code again, I think we can kill it as well.

The concern is that we may issue user-generated admin commands before
the controller is enabled (generating an unforced error just because we
queued too early). That used to be the case when the admin connect used
the admin_q, which meant we needed to unquiesce it beforehand, but now
that the admin connect uses the fabrics_q, that should no longer be an
issue.

In nvme-tcp we unquiesce after we enable the ctrl:
--
         error = nvme_tcp_start_queue(ctrl, 0);
         if (error)
                 goto out_cleanup_queue;

         error = nvme_enable_ctrl(ctrl);
         if (error)
                 goto out_stop_queue;

         blk_mq_unquiesce_queue(ctrl->admin_q);
--

Also in nvme-rdma:
--
         error = nvme_rdma_start_queue(ctrl, 0);
         if (error)
                 goto out_cleanup_queue;

         error = nvme_enable_ctrl(&ctrl->ctrl);
         if (error)
                 goto out_stop_queue;

         ctrl->ctrl.max_segments = ctrl->max_fr_pages;
         ctrl->ctrl.max_hw_sectors = ctrl->max_fr_pages << (ilog2(SZ_4K) - 9);
         if (pi_capable)
                 ctrl->ctrl.max_integrity_segments = ctrl->max_fr_pages;
         else
                 ctrl->ctrl.max_integrity_segments = 0;

         blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
--

And also in nvme-fc:
--
         ret = nvmf_connect_admin_queue(&ctrl->ctrl);
         if (ret)
                 goto out_disconnect_admin_queue;

         set_bit(NVME_FC_Q_LIVE, &ctrl->queues[0].flags);

         /*
          * Check controller capabilities
          *
          * todo:- add code to check if ctrl attributes changed from
          * prior connection values
          */

         ret = nvme_enable_ctrl(&ctrl->ctrl);
         if (ret)
                 goto out_disconnect_admin_queue;

         ctrl->ctrl.max_segments = ctrl->lport->ops->max_sgl_segments;
         ctrl->ctrl.max_hw_sectors = ctrl->ctrl.max_segments <<
                                                 (ilog2(SZ_4K) - 9);

         blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
--

James, can you please have a look at whether this is still an issue?

>>   	/*
>> +	 * currently we have a problem sending passthru commands
>> +	 * on the admin_q if the controller is not LIVE because we can't
>> +	 * make sure that they are going out after the admin connect,
>> +	 * controller enable and/or other commands in the initialization
>> +	 * sequence. until the controller will be LIVE, fail with
>> +	 * BLK_STS_RESOURCE so that they will be rescheduled.
>>   	 */
> 
> Nit: please start multi-line comments with a capital letter.  Also I
> think some of the lines do not nearly use up the 80 characters available.

Will fix.


* Re: [PATCH v3 2/9] nvme-fabrics: allow to queue requests for live queues
  2020-08-20 16:58     ` Sagi Grimberg
@ 2020-08-20 20:45       ` James Smart
  2020-08-20 22:13         ` Sagi Grimberg
  2020-08-21  6:22         ` Christoph Hellwig
  0 siblings, 2 replies; 44+ messages in thread
From: James Smart @ 2020-08-20 20:45 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme



On 8/20/2020 9:58 AM, Sagi Grimberg wrote:
>
>>> Right now we are failing requests based on the controller state (which
>>> is checked inline in nvmf_check_ready) however we should definitely
>>> accept requests if the queue is live.
>>>
>>> When entering controller reset, we transition the controller into
>>> NVME_CTRL_RESETTING, and then return BLK_STS_RESOURCE for non-mpath
>>> requests (have blk_noretry_request set).
>>>
>>> This is also the case for NVME_REQ_USER for the wrong reason. There
>>> shouldn't be any reason for us to reject this I/O in a controller 
>>> reset.
>>> We do want to prevent passthru commands on the admin queue because we
>>> need the controller to fully initialize first before we let user 
>>> passthru
>>> admin commands to be issued.
>>>
>>> In a non-mpath setup, this means that the requests will simply be
>>> requeued over and over forever not allowing the q_usage_counter to drop
>>> its final reference, causing controller reset to hang if running
>>> concurrently with heavy I/O.
>>
>> I'm still rather bothered with the admin queue exception.  And given 
>> that
>> the q_usage_counter problem should only really be an issue for file 
>> system
>> requests, as passthrough requests do not automatically get retried why
>> can't we just reject all user command to be symetric and straight 
>> forward?
>> The callers in userspace need to be able to cope with retryable errors
>> anyway.
>
> Looking at the code again, I think we can kill it as well.
>
> The concern is we may issue user generated admin commands before
> the controller is enabled (generating an unforced error just because
> we queued to early). That used to be the case when the admin connect
> used the admin_q which meant we needed to unquiesce before, but
> now that the admin connect uses the fabrics_q, that should no
> longer be an issue.
>
> in nvme-tcp we unquiesce after we enable the ctrl:
> ...
>
> James, can you please have a look if this is still an issue?

I still dislike any random ioctls coming in while we're still
initializing the controller.  Looking at the flow, I wouldn't want them
to be allowed until after nvme_init_identify() is complete, especially
if the ioctls are doing subsystem or controller dumping, or using
commands that should be capped by values set by nvme_queue_limits().
But if we're going to allow nvme_init_identify(), the admin_q needs to
be unquiesced.

So I'm still voting for the admin queue exception.

-- james



* Re: [PATCH v3 1/9] nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance
  2020-08-20  5:36 ` [PATCH v3 1/9] nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance Sagi Grimberg
  2020-08-20  6:02   ` Christoph Hellwig
@ 2020-08-20 20:49   ` James Smart
  1 sibling, 0 replies; 44+ messages in thread
From: James Smart @ 2020-08-20 20:49 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch



On 8/19/2020 10:36 PM, Sagi Grimberg wrote:
> NVME_CTRL_NEW should never see any I/O, because in order to start
> initialization it has to transition to NVME_CTRL_CONNECTING and from
> there it will never return to this state.
>
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>   drivers/nvme/host/fabrics.c | 1 -
>   1 file changed, 1 deletion(-)
>
> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
> index 4ec4829d6233..32f61fc5f4c5 100644
> --- a/drivers/nvme/host/fabrics.c
> +++ b/drivers/nvme/host/fabrics.c
> @@ -576,7 +576,6 @@ bool __nvmf_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
>   	 * which is require to set the queue live in the appropinquate states.
>   	 */
>   	switch (ctrl->state) {
> -	case NVME_CTRL_NEW:
>   	case NVME_CTRL_CONNECTING:
>   		if (nvme_is_fabrics(req->cmd) &&
>   		    req->cmd->fabrics.fctype == nvme_fabrics_type_connect)

Reviewed-by: James Smart <james.smart@broadcom.com>

-- james


* Re: [PATCH v3 2/9] nvme-fabrics: allow to queue requests for live queues
  2020-08-20  5:36 ` [PATCH v3 2/9] nvme-fabrics: allow to queue requests for live queues Sagi Grimberg
  2020-08-20  6:09   ` Christoph Hellwig
@ 2020-08-20 20:54   ` James Smart
  2020-08-20 20:56   ` James Smart
  2 siblings, 0 replies; 44+ messages in thread
From: James Smart @ 2020-08-20 20:54 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch



On 8/19/2020 10:36 PM, Sagi Grimberg wrote:
> Right now we are failing requests based on the controller state (which
> is checked inline in nvmf_check_ready) however we should definitely
> accept requests if the queue is live.
>
> When entering controller reset, we transition the controller into
> NVME_CTRL_RESETTING, and then return BLK_STS_RESOURCE for non-mpath
> requests (have blk_noretry_request set).
>
> This is also the case for NVME_REQ_USER for the wrong reason. There
> shouldn't be any reason for us to reject this I/O in a controller reset.
> We do want to prevent passthru commands on the admin queue because we
> need the controller to fully initialize first before we let user passthru
> admin commands to be issued.
>
> In a non-mpath setup, this means that the requests will simply be
> requeued over and over forever not allowing the q_usage_counter to drop
> its final reference, causing controller reset to hang if running
> concurrently with heavy I/O.
>
> Fixes: 35897b920c8a ("nvme-fabrics: fix and refine state checks in __nvmf_check_ready")
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>   drivers/nvme/host/fabrics.c | 12 ++++++++----
>   1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
> index 32f61fc5f4c5..8575724734e0 100644
> --- a/drivers/nvme/host/fabrics.c
> +++ b/drivers/nvme/host/fabrics.c
> @@ -565,10 +565,14 @@ bool __nvmf_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
>   	struct nvme_request *req = nvme_req(rq);
>   
>   	/*
> -	 * If we are in some state of setup or teardown only allow
> -	 * internally generated commands.
> +	 * currently we have a problem sending passthru commands
> +	 * on the admin_q if the controller is not LIVE because we can't
> +	 * make sure that they are going out after the admin connect,
> +	 * controller enable and/or other commands in the initialization
> +	 * sequence. until the controller will be LIVE, fail with
> +	 * BLK_STS_RESOURCE so that they will be rescheduled.
>   	 */
> -	if (!blk_rq_is_passthrough(rq) || (req->flags & NVME_REQ_USERCMD))
> +	if (rq->q == ctrl->admin_q && (req->flags & NVME_REQ_USERCMD))
>   		return false;
>   
>   	/*
> @@ -577,7 +581,7 @@ bool __nvmf_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
>   	 */
>   	switch (ctrl->state) {
>   	case NVME_CTRL_CONNECTING:
> -		if (nvme_is_fabrics(req->cmd) &&
> +		if (blk_rq_is_passthrough(rq) && nvme_is_fabrics(req->cmd) &&
>   		    req->cmd->fabrics.fctype == nvme_fabrics_type_connect)
>   			return true;
>   		break;

Reviewed-by: James Smart <james.smart@broadcom.com>

-- james


* Re: [PATCH v3 2/9] nvme-fabrics: allow to queue requests for live queues
  2020-08-20  5:36 ` [PATCH v3 2/9] nvme-fabrics: allow to queue requests for live queues Sagi Grimberg
  2020-08-20  6:09   ` Christoph Hellwig
  2020-08-20 20:54   ` James Smart
@ 2020-08-20 20:56   ` James Smart
  2 siblings, 0 replies; 44+ messages in thread
From: James Smart @ 2020-08-20 20:56 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch



On 8/19/2020 10:36 PM, Sagi Grimberg wrote:
> Right now we are failing requests based on the controller state (which
> is checked inline in nvmf_check_ready) however we should definitely
> accept requests if the queue is live.
>
> When entering controller reset, we transition the controller into
> NVME_CTRL_RESETTING, and then return BLK_STS_RESOURCE for non-mpath
> requests (have blk_noretry_request set).
>
> This is also the case for NVME_REQ_USER for the wrong reason. There
> shouldn't be any reason for us to reject this I/O in a controller reset.
> We do want to prevent passthru commands on the admin queue because we
> need the controller to fully initialize first before we let user passthru
> admin commands to be issued.
>
> In a non-mpath setup, this means that the requests will simply be
> requeued over and over forever not allowing the q_usage_counter to drop
> its final reference, causing controller reset to hang if running
> concurrently with heavy I/O.
>
> Fixes: 35897b920c8a ("nvme-fabrics: fix and refine state checks in __nvmf_check_ready")
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>   drivers/nvme/host/fabrics.c | 12 ++++++++----
>   1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
> index 32f61fc5f4c5..8575724734e0 100644
> --- a/drivers/nvme/host/fabrics.c
> +++ b/drivers/nvme/host/fabrics.c
> @@ -565,10 +565,14 @@ bool __nvmf_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
>   	struct nvme_request *req = nvme_req(rq);
>   
>   	/*
> -	 * If we are in some state of setup or teardown only allow
> -	 * internally generated commands.
> +	 * currently we have a problem sending passthru commands
> +	 * on the admin_q if the controller is not LIVE because we can't
> +	 * make sure that they are going out after the admin connect,
> +	 * controller enable and/or other commands in the initialization
> +	 * sequence. until the controller will be LIVE, fail with
> +	 * BLK_STS_RESOURCE so that they will be rescheduled.
>   	 */
> -	if (!blk_rq_is_passthrough(rq) || (req->flags & NVME_REQ_USERCMD))
> +	if (rq->q == ctrl->admin_q && (req->flags & NVME_REQ_USERCMD))
>   		return false;
>   
>   	/*
> @@ -577,7 +581,7 @@ bool __nvmf_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
>   	 */
>   	switch (ctrl->state) {
>   	case NVME_CTRL_CONNECTING:
> -		if (nvme_is_fabrics(req->cmd) &&
> +		if (blk_rq_is_passthrough(rq) && nvme_is_fabrics(req->cmd) &&
>   		    req->cmd->fabrics.fctype == nvme_fabrics_type_connect)
>   			return true;
>   		break;

Reviewed-by: James Smart <james.smart@broadcom.com>

-- james


* Re: [PATCH v3 7/9] nvme-rdma: serialize controller teardown sequences
  2020-08-20  5:36 ` [PATCH v3 7/9] nvme-rdma: serialize controller teardown sequences Sagi Grimberg
@ 2020-08-20 21:04   ` James Smart
  2020-08-20 22:16     ` Sagi Grimberg
  2020-08-21 21:08   ` James Smart
  1 sibling, 1 reply; 44+ messages in thread
From: James Smart @ 2020-08-20 21:04 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch



On 8/19/2020 10:36 PM, Sagi Grimberg wrote:
> In the timeout handler we may need to complete a request because the
> request that timed out may be an I/O that is a part of a serial sequence
> of controller teardown or initialization. In order to complete the
> request, we need to fence any other context that may compete with us
> and complete the request that is timing out.
>
> In this case, we could have a potential double completion in case
> a hard-irq or a different competing context triggered error recovery
> and is running inflight request cancellation concurrently with the
> timeout handler.
>
> Protect using a ctrl teardown_lock to serialize contexts that may
> complete a cancelled request due to error recovery or a reset.
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>   drivers/nvme/host/rdma.c | 6 ++++++
>   1 file changed, 6 insertions(+)
>
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 4c5660201009..ed387f61f8f7 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -122,6 +122,7 @@ struct nvme_rdma_ctrl {
>   	struct sockaddr_storage src_addr;
>   
>   	struct nvme_ctrl	ctrl;
> +	struct mutex		teardown_lock;
>   	bool			use_inline_data;
>   	u32			io_queues[HCTX_MAX_TYPES];
>   };
> @@ -997,6 +998,7 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
>   static void nvme_rdma_teardown_admin_queue(struct nvme_rdma_ctrl *ctrl,
>   		bool remove)
>   {
> +	mutex_lock(&ctrl->teardown_lock);
>   	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
>   	nvme_rdma_stop_queue(&ctrl->queues[0]);
>   	if (ctrl->ctrl.admin_tagset) {
> @@ -1007,11 +1009,13 @@ static void nvme_rdma_teardown_admin_queue(struct nvme_rdma_ctrl *ctrl,
>   	if (remove)
>   		blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
>   	nvme_rdma_destroy_admin_queue(ctrl, remove);
> +	mutex_unlock(&ctrl->teardown_lock);
>   }
>   
>   static void nvme_rdma_teardown_io_queues(struct nvme_rdma_ctrl *ctrl,
>   		bool remove)
>   {
> +	mutex_lock(&ctrl->teardown_lock);
>   	if (ctrl->ctrl.queue_count > 1) {
>   		nvme_start_freeze(&ctrl->ctrl);
>   		nvme_stop_queues(&ctrl->ctrl);
> @@ -1025,6 +1029,7 @@ static void nvme_rdma_teardown_io_queues(struct nvme_rdma_ctrl *ctrl,
>   			nvme_start_queues(&ctrl->ctrl);
>   		nvme_rdma_destroy_io_queues(ctrl, remove);
>   	}
> +	mutex_unlock(&ctrl->teardown_lock);
>   }
>   
>   static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl)
> @@ -2278,6 +2283,7 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
>   		return ERR_PTR(-ENOMEM);
>   	ctrl->ctrl.opts = opts;
>   	INIT_LIST_HEAD(&ctrl->list);
> +	mutex_init(&ctrl->teardown_lock);
>   
>   	if (!(opts->mask & NVMF_OPT_TRSVCID)) {
>   		opts->trsvcid =

Q: so you don't believe there's any conflict between these teardown paths
(possibly invoked by sysfs delete ctrl) and a reconnect (i.e.
nvme_rdma_setup_ctrl) that encounters a failure during controller init and
thus hits the destroy_io and destroy_admin exit paths - which call
nvme_rdma_destroy_io_queues and stop/destroy_admin_queue - and which can
run simultaneously with the teardowns, without the mutex?

-- james



* Re: [PATCH v3 8/9] nvme-rdma: fix timeout handler
  2020-08-20  5:36 ` [PATCH v3 8/9] nvme-rdma: fix timeout handler Sagi Grimberg
  2020-08-20  6:10   ` Christoph Hellwig
@ 2020-08-20 21:37   ` James Smart
  1 sibling, 0 replies; 44+ messages in thread
From: James Smart @ 2020-08-20 21:37 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch



On 8/19/2020 10:36 PM, Sagi Grimberg wrote:
> When a request times out in a LIVE state, we simply trigger error
> recovery and let the error recovery handle the request cancellation,
> however when a request times out in a non LIVE state, we make sure to
> complete it immediately as it might block controller setup or teardown
> and prevent forward progress.
>
> However, tearing down the entire set of I/O and admin queues causes a
> freeze/unfreeze imbalance (q->mq_freeze_depth) and is really overkill for
> what we actually need, which is to just fence controller
> teardown that may be running, stop the queue, and cancel the request if
> it is not already completed.
>
> Now that we have the controller teardown_lock, we can safely serialize
> request cancellation. This addresses a hang caused by calling extra
> queue freeze on controller namespaces, causing unfreeze to not complete
> correctly.
>
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>

Reviewed-by: James Smart <james.smart@broadcom.com>

Looks good

-- james
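
Roughly, the pattern under review looks like the following simplified sketch
(not the exact v3 hunk; the helper name here is assumed): the teardown_lock
fences error recovery and teardown, the queue is stopped so no new
completions can arrive, and only then is the timed-out request completed if
nobody else already did so.

static void nvme_rdma_complete_timed_out_sketch(struct request *rq)
{
	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
	struct nvme_rdma_queue *queue = req->queue;
	struct nvme_rdma_ctrl *ctrl = queue->ctrl;

	/* fence teardown / error recovery so we cannot race a cancellation */
	mutex_lock(&ctrl->teardown_lock);
	nvme_rdma_stop_queue(queue);
	if (!blk_mq_request_completed(rq)) {
		nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
		blk_mq_complete_request(rq);
	}
	mutex_unlock(&ctrl->teardown_lock);
}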


* Re: [PATCH v3 2/9] nvme-fabrics: allow to queue requests for live queues
  2020-08-20 20:45       ` James Smart
@ 2020-08-20 22:13         ` Sagi Grimberg
  2020-08-20 22:17           ` James Smart
  2020-08-21  6:22         ` Christoph Hellwig
  1 sibling, 1 reply; 44+ messages in thread
From: Sagi Grimberg @ 2020-08-20 22:13 UTC (permalink / raw)
  To: James Smart, Christoph Hellwig; +Cc: Keith Busch, linux-nvme


>>> I'm still rather bothered with the admin queue exception.  And given that
>>> the q_usage_counter problem should only really be an issue for file system
>>> requests, as passthrough requests do not automatically get retried, why
>>> can't we just reject all user commands to be symmetric and straightforward?
>>> The callers in userspace need to be able to cope with retryable errors
>>> anyway.
>>
>> Looking at the code again, I think we can kill it as well.
>>
>> The concern is we may issue user-generated admin commands before
>> the controller is enabled (generating an unforced error just because
>> we queued too early). That used to be the case when the admin connect
>> used the admin_q, which meant we needed to unquiesce it before, but
>> now that the admin connect uses the fabrics_q, that should no
>> longer be an issue.
>>
>> in nvme-tcp we unquiesce after we enable the ctrl:
>> ...
>>
>> James, can you please have a look if this is still an issue?
> 
> I still dislike any random ioctls coming in while we're still 
> initializing the controller.  Looking at the flow - I wouldn't want them 
> to be allowed until after nvme_init_identify() is complete. Especially 
> if the ioctls are doing subsystem or controller dumping or using 
> commands that should be capped by values set by nvme_queue_limits(). 
> But, if we're going to allow nvme_init_identify the admin_q needs to be 
> unquiesced.
> 
> So I'm still voting for the admin queue exception.

I think that if the controller is enabled, it should be able to accept
commands. For your specific concern, I don't think that the right place
to protect against it is __nvmf_check_ready, because from that
standpoint the controller _is_ ready.

Perhaps, if this is a real concern, we should have passthru commands
check the controller state before executing?

Christoph, Keith, WDYT?
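
A hypothetical sketch of that alternative (not part of the posted series;
the helper name and the -EAGAIN choice are assumptions) could be as small
as:

static int nvme_user_cmd_permitted_sketch(struct nvme_ctrl *ctrl)
{
	/* gate user passthru in the ioctl path, not in __nvmf_check_ready() */
	if (ctrl->state != NVME_CTRL_LIVE)
		return -EAGAIN;		/* let userspace retry later */
	return 0;
}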


* Re: [PATCH v3 7/9] nvme-rdma: serialize controller teardown sequences
  2020-08-20 21:04   ` James Smart
@ 2020-08-20 22:16     ` Sagi Grimberg
  0 siblings, 0 replies; 44+ messages in thread
From: Sagi Grimberg @ 2020-08-20 22:16 UTC (permalink / raw)
  To: James Smart, linux-nvme, Christoph Hellwig, Keith Busch


>> @@ -997,6 +998,7 @@ static int nvme_rdma_configure_io_queues(struct 
>> nvme_rdma_ctrl *ctrl, bool new)
>>   static void nvme_rdma_teardown_admin_queue(struct nvme_rdma_ctrl *ctrl,
>>           bool remove)
>>   {
>> +    mutex_lock(&ctrl->teardown_lock);
>>       blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
>>       nvme_rdma_stop_queue(&ctrl->queues[0]);
>>       if (ctrl->ctrl.admin_tagset) {
>> @@ -1007,11 +1009,13 @@ static void 
>> nvme_rdma_teardown_admin_queue(struct nvme_rdma_ctrl *ctrl,
>>       if (remove)
>>           blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
>>       nvme_rdma_destroy_admin_queue(ctrl, remove);
>> +    mutex_unlock(&ctrl->teardown_lock);
>>   }
>>   static void nvme_rdma_teardown_io_queues(struct nvme_rdma_ctrl *ctrl,
>>           bool remove)
>>   {
>> +    mutex_lock(&ctrl->teardown_lock);
>>       if (ctrl->ctrl.queue_count > 1) {
>>           nvme_start_freeze(&ctrl->ctrl);
>>           nvme_stop_queues(&ctrl->ctrl);
>> @@ -1025,6 +1029,7 @@ static void nvme_rdma_teardown_io_queues(struct 
>> nvme_rdma_ctrl *ctrl,
>>               nvme_start_queues(&ctrl->ctrl);
>>           nvme_rdma_destroy_io_queues(ctrl, remove);
>>       }
>> +    mutex_unlock(&ctrl->teardown_lock);
>>   }
>>   static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl)
>> @@ -2278,6 +2283,7 @@ static struct nvme_ctrl 
>> *nvme_rdma_create_ctrl(struct device *dev,
>>           return ERR_PTR(-ENOMEM);
>>       ctrl->ctrl.opts = opts;
>>       INIT_LIST_HEAD(&ctrl->list);
>> +    mutex_init(&ctrl->teardown_lock);
>>       if (!(opts->mask & NVMF_OPT_TRSVCID)) {
>>           opts->trsvcid =
> 
> Q: so you don't believe there's any conflict from these teardown paths 
> (possibly invoked by sysfs delete ctrl)  vs  a reconnect (thus 
> nvme_rdma_setup_ctrl) encountering a failure during controller init, 
> thus it hits the destroy_io and destroy_admin exit paths - which call 
> nvme_rdma_destroy_io_queues and stop/destroy_admin_queue - which can be 
> simultaneous to the teardowns and without the mutex ?

Not really, the first thing rdma/tcp does in .delete_ctrl is to
cancel_work_sync the connect_work and the err_work, and any other
invocation of these works cannot occur because the state machine
won't allow it. So that serialization should be sufficient, unless I'm
missing something...
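
In other words, the delete path looks roughly like this (a simplified sketch
of the rdma variant; the field and helper names follow the rdma driver and
are not a verbatim quote):

static void nvme_rdma_teardown_ctrl_sketch(struct nvme_rdma_ctrl *ctrl,
					   bool shutdown)
{
	/* sync the async contexts; the state machine prevents re-arming them */
	cancel_work_sync(&ctrl->err_work);
	cancel_delayed_work_sync(&ctrl->reconnect_work);

	/* only now run the (serialized) teardown */
	nvme_rdma_teardown_io_queues(ctrl, shutdown);
	nvme_rdma_teardown_admin_queue(ctrl, shutdown);
}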


* Re: [PATCH v3 2/9] nvme-fabrics: allow to queue requests for live queues
  2020-08-20 22:13         ` Sagi Grimberg
@ 2020-08-20 22:17           ` James Smart
  0 siblings, 0 replies; 44+ messages in thread
From: James Smart @ 2020-08-20 22:17 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme



On 8/20/2020 3:13 PM, Sagi Grimberg wrote:
>
>>>> I'm still rather bothered with the admin queue exception.  And 
>>>> given that
>>>> the q_usage_counter problem should only really be an issue for file 
>>>> system
>>>> requests, as passthrough requests do not automatically get retried, why
>>>> can't we just reject all user commands to be symmetric and
>>>> straightforward?
>>>> The callers in userspace need to be able to cope with retryable errors
>>>> anyway.
>>>
>>> Looking at the code again, I think we can kill it as well.
>>>
>>> The concern is we may issue user-generated admin commands before
>>> the controller is enabled (generating an unforced error just because
>>> we queued too early). That used to be the case when the admin connect
>>> used the admin_q, which meant we needed to unquiesce it before, but
>>> now that the admin connect uses the fabrics_q, that should no
>>> longer be an issue.
>>>
>>> in nvme-tcp we unquiesce after we enable the ctrl:
>>> ...
>>>
>>> James, can you please have a look if this is still an issue?
>>
>> I still dislike any random ioctls coming in while we're still 
>> initializing the controller.  Looking at the flow - I wouldn't want 
>> them to be allowed until after nvme_init_identify() is complete. 
>> Especially if the ioctls are doing subsystem or controller dumping or 
>> using commands that should be capped by values set by 
>> nvme_queue_limits(). But, if we're going to allow nvme_init_identify 
>> the admin_q needs to be unquiesced.
>>
>> So I'm still voting for the admin queue exception.
>
> I think that if the controller is enabled, it should be able to accept
> commands. For your specific concern, I don't think that the right place
> to protect against it is __nvmf_check_ready, because from that
> standpoint the controller _is_ ready.
>
> Perhaps if this is a real concern we should have passthru commands
> to check the controller state before executing?
>
> Christoph, Keith, WDYT?

There is always a window between when the cmd is submitted (after passing 
the check) and when it comes off the q for execution, by which time the 
state of the controller may have changed.

-- james



* Re: [PATCH v3 2/9] nvme-fabrics: allow to queue requests for live queues
  2020-08-20 20:45       ` James Smart
  2020-08-20 22:13         ` Sagi Grimberg
@ 2020-08-21  6:22         ` Christoph Hellwig
  2020-08-21 15:22           ` James Smart
                             ` (2 more replies)
  1 sibling, 3 replies; 44+ messages in thread
From: Christoph Hellwig @ 2020-08-21  6:22 UTC (permalink / raw)
  To: James Smart; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, Christoph Hellwig

On Thu, Aug 20, 2020 at 01:45:50PM -0700, James Smart wrote:
> I still dislike any random ioctls coming in while we're still initializing 
> the controller.

Agreed.

>   Looking at the flow - I wouldn't want them to be allowed 
> until after nvme_init_identify() is complete. Especially if the ioctls are 
> doing subsystem or controller dumping or using commands that should be 
> capped by values set by nvme_queue_limits().   But, if we're going to 
> allow nvme_init_identify the admin_q needs to be unquiesced.
>
> So I'm still voting for the admin queue exception.

And I really don't like the admin queue special case.  What is the
advantage of letting user space passthrough I/O commands in at this
point in time?


* Re: [PATCH v3 2/9] nvme-fabrics: allow to queue requests for live queues
  2020-08-21  6:22         ` Christoph Hellwig
@ 2020-08-21 15:22           ` James Smart
  2020-08-21 19:44           ` Sagi Grimberg
  2020-08-24  8:02           ` Sagi Grimberg
  2 siblings, 0 replies; 44+ messages in thread
From: James Smart @ 2020-08-21 15:22 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, Sagi Grimberg, linux-nvme

On 8/20/2020 11:22 PM, Christoph Hellwig wrote:
> On Thu, Aug 20, 2020 at 01:45:50PM -0700, James Smart wrote:
>> I still dislike any random ioctls coming in while we're still initializing
>> the controller.
> Agreed.
>
>>    Looking at the flow - I wouldn't want them to be allowed
>> until after nvme_init_identify() is complete. Especially if the ioctls are
>> doing subsystem or controller dumping or using commands that should be
>> capped by values set by nvme_queue_limits().   But, if we're going to
>> allow nvme_init_identify the admin_q needs to be unquiesced.
>>
>> So I'm still voting for the admin queue exception.
> And I really don't like the admin queue special case.  What is the
> advantage of letting user space passthrough I/O commands in at this
> point in time?

You somewhat lost me... the special case/exception is what prevents the 
commands from getting in at this point in time.  I don't know of an 
advantage to letting them through.

In some respects, this goes back to the ioctl issue from the cli that we 
had 1.5+ yrs ago... where we backed out retry patches.  The desire was 
to have the ioctl rejected if the state wasn't live. Granted, there's 
still the race of it passing this check but the controller changing state 
by the time the request comes off the request queue (which is ultimately 
what this exception prevents). The biggest issue I've seen is that in 
between the 1st connect and the read of the discovery log, something 
happened such that the controller is still connected yet has to go 
through a reconnect. The cli saw the error on the read log ioctl, gave 
up, then proceeded with its exit path that deletes the controller - 
leaving nothing discovered and no event to cause it to retry.  When we 
last discussed this - we were looking for killable ioctls and also 
wanted to add retries to nvme-cli, but using a standard method.   I've 
had someone working with me on putting at least the retry part into the 
cli, so I'll post them soon.

Regardless - given there's the window between the check and the request 
being executed by the handler - I don't see it as worthwhile to remove 
the special case.

-- james



* Re: [PATCH v3 2/9] nvme-fabrics: allow to queue requests for live queues
  2020-08-21  6:22         ` Christoph Hellwig
  2020-08-21 15:22           ` James Smart
@ 2020-08-21 19:44           ` Sagi Grimberg
  2020-08-23 15:19             ` James Smart
  2020-08-24  8:02           ` Sagi Grimberg
  2 siblings, 1 reply; 44+ messages in thread
From: Sagi Grimberg @ 2020-08-21 19:44 UTC (permalink / raw)
  To: Christoph Hellwig, James Smart; +Cc: Keith Busch, linux-nvme


>> I still dislike any random ioctls coming in while we're still initializing
>> the controller.
> 
> Agreed.
> 
>>    Looking at the flow - I wouldn't want them to be allowed
>> until after nvme_init_identify() is complete. Especially if the ioctls are
>> doing subsystem or controller dumping or using commands that should be
>> capped by values set by nvme_queue_limits().   But, if we're going to
>> allow nvme_init_identify the admin_q needs to be unquiesced.
>>
>> So I'm still voting for the admin queue exception.
> 
> And I really don't like the admin queue special case.  What is the
> advantage of letting user space passthrough I/O commands in at this
> point in time?

We need to pass in normal I/O commands for sure in order to have a
robust reset and error recovery (which is what the patchset addresses
in general). What is the difference between FS I/O commands and
passthru I/O commands? In fact, user passthru I/O commands will never
execute before nvme_init_identify() because we always start
the I/O queues after that.

Let's look at pci: do we have the same enforcement for passthru
commands? What's special about fabrics that we need to deny
these commands from going through?
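
The ordering referred to here is roughly the following (sketch only; the
sketch_configure_* helpers are placeholder names, while nvme_init_identify()
and nvme_start_ctrl() are the real core helpers):

static int nvmf_setup_ctrl_sketch(struct nvme_ctrl *ctrl, bool new)
{
	int ret;

	ret = sketch_configure_admin_queue(ctrl, new);	/* admin connect + enable */
	if (ret)
		return ret;

	ret = nvme_init_identify(ctrl);		/* runs on the admin queue only */
	if (ret)
		return ret;

	ret = sketch_configure_io_queues(ctrl, new);	/* I/O queues start here */
	if (ret)
		return ret;

	nvme_start_ctrl(ctrl);		/* scanning etc.; user I/O can now flow */
	return 0;
}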


* Re: [PATCH v3 7/9] nvme-rdma: serialize controller teardown sequences
  2020-08-20  5:36 ` [PATCH v3 7/9] nvme-rdma: serialize controller teardown sequences Sagi Grimberg
  2020-08-20 21:04   ` James Smart
@ 2020-08-21 21:08   ` James Smart
  1 sibling, 0 replies; 44+ messages in thread
From: James Smart @ 2020-08-21 21:08 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch

On 8/19/2020 10:36 PM, Sagi Grimberg wrote:
> In the timeout handler we may need to complete a request because the
> request that timed out may be an I/O that is a part of a serial sequence
> of controller teardown or initialization. In order to complete the
> request, we need to fence any other context that may compete with us
> and complete the request that is timing out.
>
> In this case, we could have a potential double completion in case
> a hard-irq or a different competing context triggered error recovery
> and is running inflight request cancellation concurrently with the
> timeout handler.
>
> Protect using a ctrl teardown_lock to serialize contexts that may
> complete a cancelled request due to error recovery or a reset.
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>   drivers/nvme/host/rdma.c | 6 ++++++
>   1 file changed, 6 insertions(+)
>
>

Reviewed-by:  James Smart <james.smart@broadcom.com>

Yes, it looks better. The reworked eh handler definitely helps.

-- james



* Re: [PATCH v3 2/9] nvme-fabrics: allow to queue requests for live queues
  2020-08-21 19:44           ` Sagi Grimberg
@ 2020-08-23 15:19             ` James Smart
  2020-08-24  8:06               ` Sagi Grimberg
  0 siblings, 1 reply; 44+ messages in thread
From: James Smart @ 2020-08-23 15:19 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme



On 8/21/2020 12:44 PM, Sagi Grimberg wrote:
>
>>> I still dislike any random ioctls coming in while we're still 
>>> initializing
>>> the controller.
>>
>> Agreed.
>>
>>>    Looking at the flow - I wouldn't want them to be allowed
>>> until after nvme_init_identify() is complete. Especially if the 
>>> ioctls are
>>> doing subsystem or controller dumping or using commands that should be
>>> capped by values set by nvme_queue_limits().   But, if we're going to
>>> allow nvme_init_identify the admin_q needs to be unquiesced.
>>>
>>> So I'm still voting for the admin queue exception.
>>
>> And I really don't like the admin queue special case.  What is the
>> advantage of letting user space passthrough I/O commands in at this
>> point in time?
>
> We need to pass in normal I/O commands for sure in order to have a
> robust reset and error recovery (what in general the patchset
> addresses). What is the difference between FS I/O commands and
> passthru I/O commands? In fact, user passthru I/O commands will never
> execute before nvme_init_identify() because we always start
> the I/O queues after that.
>
> Let's look at pci, do we have the same enforcement for passthru
> commands? What's special about fabrics that we need to deny
> these commands from going through?

Short answer - timing/latency and a much, much lower chance of failure.

I also don't think people are querying pci (locally attached drivers) the 
way they are querying fabric attachments.

-- james



* Re: [PATCH v3 2/9] nvme-fabrics: allow to queue requests for live queues
  2020-08-21  6:22         ` Christoph Hellwig
  2020-08-21 15:22           ` James Smart
  2020-08-21 19:44           ` Sagi Grimberg
@ 2020-08-24  8:02           ` Sagi Grimberg
  2020-08-25  7:13             ` Christoph Hellwig
  2 siblings, 1 reply; 44+ messages in thread
From: Sagi Grimberg @ 2020-08-24  8:02 UTC (permalink / raw)
  To: Christoph Hellwig, James Smart; +Cc: Keith Busch, linux-nvme


>> I still dislike any random ioctls coming in while we're still initializing
>> the controller.
> 
> Agreed.
> 
>>    Looking at the flow - I wouldn't want them to be allowed
>> until after nvme_init_identify() is complete. Especially if the ioctls are
>> doing subsystem or controller dumping or using commands that should be
>> capped by values set by nvme_queue_limits().   But, if we're going to
>> allow nvme_init_identify the admin_q needs to be unquiesced.
>>
>> So I'm still voting for the admin queue exception.
> 
> And I really don't like the admin queue special case.  What is the
> advantage of letting user space passthrough I/O commands in at this
> point in time?

I checked again, and regarding the comment:

"I'm still rather bothered with the admin queue exception.  And given that
the q_usage_counter problem should only really be an issue for file system
requests, as passthrough requests do not automatically get retried, why
can't we just reject all user commands to be symmetric and straightforward?
The callers in userspace need to be able to cope with retryable errors
anyway."

blk_mq_alloc_request calls blk_queue_enter, which means that if we don't
let them in, controller reset can hang, just like in normal fs I/O.
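
A minimal sketch of that point (simplified; the wrapper name is made up,
while the calls are the real block/NVMe APIs):

static struct request *alloc_user_rq_sketch(struct request_queue *q,
					    struct nvme_command *cmd)
{
	struct request *rq;

	/* blk_mq_alloc_request() -> blk_queue_enter(): takes q_usage_counter */
	rq = blk_mq_alloc_request(q, nvme_is_write(cmd) ?
				  REQ_OP_DRV_OUT : REQ_OP_DRV_IN, 0);
	if (IS_ERR(rq))
		return rq;

	nvme_req(rq)->cmd = cmd;
	/*
	 * The q_usage_counter reference is only dropped when the request
	 * completes or is freed, so bouncing it with BLK_STS_RESOURCE in
	 * queue_rq keeps the queue un-freezable for as long as it bounces.
	 */
	return rq;
}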

So we either keep this exception for admin commands, or we also let
them through as it seems to be safe with the current code (from the
reset forward-progress perspective).

I vote to remove this exception altogether...


* Re: [PATCH v3 2/9] nvme-fabrics: allow to queue requests for live queues
  2020-08-23 15:19             ` James Smart
@ 2020-08-24  8:06               ` Sagi Grimberg
  0 siblings, 0 replies; 44+ messages in thread
From: Sagi Grimberg @ 2020-08-24  8:06 UTC (permalink / raw)
  To: James Smart, Christoph Hellwig; +Cc: Keith Busch, linux-nvme


>>>> I still dislike any random ioctls coming in while we're still 
>>>> initializing
>>>> the controller.
>>>
>>> Agreed.
>>>
>>>>    Looking at the flow - I wouldn't want them to be allowed
>>>> until after nvme_init_identify() is complete. Especially if the 
>>>> ioctls are
>>>> doing subsystem or controller dumping or using commands that should be
>>>> capped by values set by nvme_queue_limits().   But, if we're going to
>>>> allow nvme_init_identify the admin_q needs to be unquiesced.
>>>>
>>>> So I'm still voting for the admin queue exception.
>>>
>>> And I really don't like the admin queue special case.  What is the
>>> advantage of letting user space passthrough I/O commands in at this
>>> point in time?
>>
>> We need to pass in normal I/O commands for sure in order to have a
>> robust reset and error recovery (what in general the patchset
>> addresses). What is the difference between FS I/O commands and
>> passthru I/O commands? In fact, user passthru I/O commands will never
>> execute before nvme_init_identify() because we always start
>> the I/O queues after that.
>>
>> Let's look at pci, do we have the same enforcement for passthru
>> commands? What's special about fabrics that we need to deny
>> these commands from going through?
> 
> short answer - timing/latency and much much lower chance of failure.

So are you saying there is a bug hiding in pci?

> I also don't think people are querying pci (local attached drivers) like 
> they are fabric attachments.

Not sure about that at all, I'd even speculate it is the other way
around...


* Re: [PATCH v3 0/9] fix possible controller reset hangs in nvme-tcp/nvme-rdma
  2020-08-20  5:36 [PATCH v3 0/9] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
                   ` (8 preceding siblings ...)
  2020-08-20  5:36 ` [PATCH v3 9/9] nvme-rdma: fix reset hang if controller died in the middle of a reset Sagi Grimberg
@ 2020-08-24 18:29 ` Sagi Grimberg
  2020-08-25  7:16   ` Christoph Hellwig
  9 siblings, 1 reply; 44+ messages in thread
From: Sagi Grimberg @ 2020-08-24 18:29 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, James Smart

Guys,

Given that there is consensus on all patches beside patch 2/9,
can we get all the rest in and specifically take this one as
an incremental step?


* Re: [PATCH v3 2/9] nvme-fabrics: allow to queue requests for live queues
  2020-08-24  8:02           ` Sagi Grimberg
@ 2020-08-25  7:13             ` Christoph Hellwig
  2020-08-25 15:00               ` Sagi Grimberg
  0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2020-08-25  7:13 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, Christoph Hellwig, linux-nvme, James Smart

On Mon, Aug 24, 2020 at 01:02:40AM -0700, Sagi Grimberg wrote:
> I checked again, and regarding the comment:
>
> "I'm still rather bothered with the admin queue exception.  And given that
> the q_usage_counter problem should only really be an issue for file system
> requests, as passthrough requests do not automatically get retried, why
> can't we just reject all user commands to be symmetric and straightforward?
> The callers in userspace need to be able to cope with retryable errors
> anyway."
>
> blk_mq_alloc_request calls blk_queue_enter, which means that if we don't
> let them in, controller reset can hang, just like in normal fs I/O.

Yes.  But the difference is that they don't get retries, but instead just
fail.

> So we either keep this exception for admin commands, or we also let
> them through as it seems to be safe with the current code (from the
> reset forward-progress perspective).
>
> I vote to remove this exception altogether...

To me the right answer would be to reject user commands on the admin
or I/O queue of a controller that is not live, as the callers need to
handle it.  That seems to make more sense to me than a special admin
queue exception.
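
A hypothetical sketch of that approach (not part of the series; the helper
name and exact status code are assumptions), called from the fail-nonready
path:

static bool sketch_reject_user_cmd(struct nvme_ctrl *ctrl, struct request *rq)
{
	if (ctrl->state != NVME_CTRL_LIVE &&
	    (nvme_req(rq)->flags & NVME_REQ_USERCMD)) {
		/* fail outright instead of re-queueing; userspace must retry */
		nvme_req(rq)->status = NVME_SC_HOST_PATH_ERROR;
		blk_mq_start_request(rq);
		nvme_complete_rq(rq);
		return true;	/* completed here; caller returns BLK_STS_OK */
	}
	return false;
}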


* Re: [PATCH v3 0/9] fix possible controller reset hangs in nvme-tcp/nvme-rdma
  2020-08-24 18:29 ` [PATCH v3 0/9] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
@ 2020-08-25  7:16   ` Christoph Hellwig
  2020-08-25 15:35     ` James Smart
  0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2020-08-25  7:16 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, Christoph Hellwig, linux-nvme, James Smart

On Mon, Aug 24, 2020 at 11:29:49AM -0700, Sagi Grimberg wrote:
> Guys,
>
> Given that there is consensus on all patches beside patch 2/9,
> can we get all the rest in and specifically take this one as
> an incremental step?

Fine with me.


* Re: [PATCH v3 2/9] nvme-fabrics: allow to queue requests for live queues
  2020-08-25  7:13             ` Christoph Hellwig
@ 2020-08-25 15:00               ` Sagi Grimberg
  2020-08-25 15:41                 ` James Smart
  0 siblings, 1 reply; 44+ messages in thread
From: Sagi Grimberg @ 2020-08-25 15:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, James Smart, linux-nvme


>> I checked again, and regarding the comment:
>>
>> "I'm still rather bothered with the admin queue exception.  And given that
>> the q_usage_counter problem should only really be an issue for file system
>> requests, as passthrough requests do not automatically get retried, why
>> can't we just reject all user commands to be symmetric and straightforward?
>> The callers in userspace need to be able to cope with retryable errors
>> anyway."
>>
>> blk_mq_alloc_request calls blk_queue_enter, which means that if we don't
>> let them in, controller reset can hang, just like in normal fs I/O.
> 
> Yes.  But the difference is that they don't get retrieѕ, but instead just
> fail.

Doesn't matter. The issue is that we get a user I/O request, it enters
the q_usage_counter, goes into queue_rq, and we check and decide we want
to fail it. Then in nvmf_fail_nonready_command we return BLK_STS_RESOURCE,
and the request is retried and fails again... until it is allowed to pass
through (when the controller is LIVE again).

If someone freezes the queue, the freeze cannot complete because we have
an entered request that will never complete.


>> So we either keep this exception for admin commands, or we also let
>> them through as it seems to be safe with the current code (from the
>> reset forward-progress perspective).
>>
>> I vote to remove this exception altogether...
> 
> To me the right answer would be to reject user commands on the admin
> or I/O queue for the not live controller as the callers need to handle
> it.  That seems to make more sense to me than a special admin queue
> exception.

So you're saying that we should never return BLK_STS_RESOURCE but rather
fail all requests, regardless of multipath or not?
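
For reference, the behavior being debated corresponds roughly to the
following (a simplified sketch of what nvmf_fail_nonready_command() does
today, not the verbatim kernel code):

static blk_status_t fail_nonready_sketch(struct nvme_ctrl *ctrl,
					 struct request *rq)
{
	/* deleting/dead controllers, noretry and mpath requests fail fast */
	if (ctrl->state == NVME_CTRL_DELETING || ctrl->state == NVME_CTRL_DEAD ||
	    blk_noretry_request(rq) || (rq->cmd_flags & REQ_NVME_MPATH)) {
		nvme_req(rq)->status = NVME_SC_HOST_PATH_ERROR;
		blk_mq_start_request(rq);
		nvme_complete_rq(rq);
		return BLK_STS_OK;
	}

	/* everything else is re-queued and retried until the queue is live */
	return BLK_STS_RESOURCE;
}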


* Re: [PATCH v3 0/9] fix possible controller reset hangs in nvme-tcp/nvme-rdma
  2020-08-25  7:16   ` Christoph Hellwig
@ 2020-08-25 15:35     ` James Smart
  0 siblings, 0 replies; 44+ messages in thread
From: James Smart @ 2020-08-25 15:35 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg; +Cc: Keith Busch, linux-nvme


On 8/25/2020 12:16 AM, Christoph Hellwig wrote:
> On Mon, Aug 24, 2020 at 11:29:49AM -0700, Sagi Grimberg wrote:
>> Guys,
>>
>> Given that there is consensus on all patches beside patch 2/9,
>> can we get all the rest in and specifically take this one as
>> an incremental step?
> Fine with me.

fine with me


* Re: [PATCH v3 2/9] nvme-fabrics: allow to queue requests for live queues
  2020-08-25 15:00               ` Sagi Grimberg
@ 2020-08-25 15:41                 ` James Smart
  2020-08-25 17:35                   ` Sagi Grimberg
  0 siblings, 1 reply; 44+ messages in thread
From: James Smart @ 2020-08-25 15:41 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme



On 8/25/2020 8:00 AM, Sagi Grimberg wrote:
>
>>> So we either keep this exception for admin commands, or we also let
>>> them through as it seems to be safe with the current code (from the
>>> reset forward-progress perspective).
>>>
>>> I vote to remove this exception altogether...
>>
>> To me the right answer would be to reject user commands on the admin
>> or I/O queue for the not live controller as the callers need to handle
>> it.  That seems to make more sense to me than a special admin queue
>> exception.
>
> So you say that we should never return BLK_STS_RESOURCE but rather fail
> all requests? regardless for multipath or not?

How about we keep the exception for now? As I mentioned earlier, I have 
an ioctl retry patch coming for nvme-cli, and if we agree with what it 
does, then we can put in the reject and remove the exception.

-- james
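
Such a retry could look something like the sketch below (purely
illustrative userspace code, not James's actual nvme-cli patch; the
function name and retry policy are assumptions):

#include <errno.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

/* retry an admin passthru ioctl a few times on transient errors */
static int nvme_admin_passthru_retry_sketch(int fd, struct nvme_admin_cmd *cmd,
					    int retries, unsigned int delay_us)
{
	int err;

	do {
		err = ioctl(fd, NVME_IOCTL_ADMIN_CMD, cmd);
		if (err >= 0 || (errno != EAGAIN && errno != EINTR))
			return err;	/* success, NVMe status, or hard error */
		usleep(delay_us);
	} while (--retries > 0);

	return err;
}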



* Re: [PATCH v3 2/9] nvme-fabrics: allow to queue requests for live queues
  2020-08-25 15:41                 ` James Smart
@ 2020-08-25 17:35                   ` Sagi Grimberg
  2020-09-04 20:26                     ` Sagi Grimberg
  0 siblings, 1 reply; 44+ messages in thread
From: Sagi Grimberg @ 2020-08-25 17:35 UTC (permalink / raw)
  To: James Smart, Christoph Hellwig; +Cc: Keith Busch, linux-nvme


>>>> So we either keep this exception for admin commands, or we also let
>>>> them through as it seems to be safe with the current code (from the
>>>> reset forward-progress perspective).
>>>>
>>>> I vote to remove this exception altogether...
>>>
>>> To me the right answer would be to reject user commands on the admin
>>> or I/O queue for the not live controller as the callers need to handle
>>> it.  That seems to make more sense to me than a special admin queue
>>> exception.
>>
>> So you say that we should never return BLK_STS_RESOURCE but rather fail
>> all requests? regardless for multipath or not?
> 
> How about we keep the exception for now. As I mentioned earlier, I have 
> an ioctl retry patch coming for nvme-cli and if we agree with what it 
> does then we can put in the reject and remove the exception.

I honestly don't think we should fail user I/O commands; there is no
reason for us to do that.
I also don't think we should fail user admin commands either, at least
not in nvmf_check_ready.


* Re: [PATCH v3 2/9] nvme-fabrics: allow to queue requests for live queues
  2020-08-25 17:35                   ` Sagi Grimberg
@ 2020-09-04 20:26                     ` Sagi Grimberg
  2020-09-08  9:05                       ` Christoph Hellwig
  0 siblings, 1 reply; 44+ messages in thread
From: Sagi Grimberg @ 2020-09-04 20:26 UTC (permalink / raw)
  To: James Smart, Christoph Hellwig; +Cc: Keith Busch, linux-nvme


>>>>> So we either keep this exception for admin commands, or we also let
>>>>> them through as it seems to be safe with the current code (from the
>>>>> reset forward-progress perspective).
>>>>>
>>>>> I vote to remove this exception altogether...
>>>>
>>>> To me the right answer would be to reject user commands on the admin
>>>> or I/O queue for the not live controller as the callers need to handle
>>>> it.  That seems to make more sense to me than a special admin queue
>>>> exception.
>>>
>>> So you say that we should never return BLK_STS_RESOURCE but rather fail
>>> all requests? regardless for multipath or not?
>>
>> How about we keep the exception for now. As I mentioned earlier, I 
>> have an ioctl retry patch coming for nvme-cli and if we agree with 
>> what it does then we can put in the reject and remove the exception.
> 
> I honestly don't think we should fail user I/O commands; there is no
> reason for us to do that.
> I also don't think we should fail user admin commands either, at least
> not in nvmf_check_ready.

Guys, we need to resolve this one; it is needed to prevent a freeze
during reset from hanging.

As I said, any I/O that will use the ns->queue shouldn't be denied here,
because it takes the q_usage_counter and will prevent a freeze (this is
true for both fs and user I/O commands).
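
For completeness, the freeze handling this interacts with in the reset path
looks roughly like the sketch below (simplified from the direction of
patches 3/9 and 6/9; the wrapper name is made up). The matching
nvme_start_freeze()/nvme_stop_queues() pair runs in the I/O queue teardown,
as visible in the patch 7/9 hunk quoted earlier in the thread:

static int io_queues_reconnect_sketch(struct nvme_ctrl *ctrl)
{
	nvme_start_queues(ctrl);
	if (!nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT)) {
		/*
		 * The controller died in the middle of the reset: the
		 * frozen queues never drained, so bail out instead of
		 * waiting forever.
		 */
		return -ENODEV;
	}
	blk_mq_update_nr_hw_queues(ctrl->tagset, ctrl->queue_count - 1);
	nvme_unfreeze(ctrl);
	return 0;
}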

I still think we should allow the admin ones as well and deny them
somewhere else if we want to, but I'm fine with keeping this exception
for now because no one is expected to freeze the admin queue.

Let me know what you think. I want it to go into an rc soon.


* Re: [PATCH v3 2/9] nvme-fabrics: allow to queue requests for live queues
  2020-09-04 20:26                     ` Sagi Grimberg
@ 2020-09-08  9:05                       ` Christoph Hellwig
  2020-09-08 16:47                         ` Sagi Grimberg
  0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2020-09-08  9:05 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, James Smart, linux-nvme, Christoph Hellwig

On Fri, Sep 04, 2020 at 01:26:37PM -0700, Sagi Grimberg wrote:
> Guys, we need to resolve this one, this is needed to prevent freeze
> during reset from hanging.
>
> As I said, any I/O that will use the ns->queue shouldn't be denied here
> because it takes the q_usage_counter and will prevent a freeze (this is
> true for both fs or user I/O commands).
>
> I still think we should allow the admin ones as well and deny them
> somewhere else if we want to, but I'm fine with keeping this exception
> now because no one is expected to freeze the admin queue.
>
> Let me know what you think. I want it to go into an rc soon.

Let's do the minimum change variant for now then.


* Re: [PATCH v3 2/9] nvme-fabrics: allow to queue requests for live queues
  2020-09-08  9:05                       ` Christoph Hellwig
@ 2020-09-08 16:47                         ` Sagi Grimberg
  2020-09-08 16:48                           ` Christoph Hellwig
  0 siblings, 1 reply; 44+ messages in thread
From: Sagi Grimberg @ 2020-09-08 16:47 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, James Smart, linux-nvme


>> Guys, we need to resolve this one, this is needed to prevent freeze
>> during reset from hanging.
>>
>> As I said, any I/O that will use the ns->queue shouldn't be denied here
>> because it takes the q_usage_counter and will prevent a freeze (this is
>> true for both fs or user I/O commands).
>>
>> I still think we should allow the admin ones as well and deny them
>> somewhere else if we want to, but I'm fine with keeping this exception
>> now because no one is expected to freeze the admin queue.
>>
>> Let me know what you think. I want it to go into an rc soon.
> 
> Let's do the minimum change variant for now then.

The patch in its current form is that minimum. I suggest we take that
and work our way from there.


* Re: [PATCH v3 2/9] nvme-fabrics: allow to queue requests for live queues
  2020-09-08 16:47                         ` Sagi Grimberg
@ 2020-09-08 16:48                           ` Christoph Hellwig
  2020-09-08 19:56                             ` Sagi Grimberg
  0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2020-09-08 16:48 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, Christoph Hellwig, linux-nvme, James Smart

On Tue, Sep 08, 2020 at 09:47:24AM -0700, Sagi Grimberg wrote:
>
>>> Guys, we need to resolve this one, this is needed to prevent freeze
>>> during reset from hanging.
>>>
>>> As I said, any I/O that will use the ns->queue shouldn't be denied here
>>> because it takes the q_usage_counter and will prevent a freeze (this is
>>> true for both fs or user I/O commands).
>>>
>>> I still think we should allow the admin ones as well and deny them
>>> somewhere else if we want to, but I'm fine with keeping this exception
>>> now because no one is expected to freeze the admin queue.
>>>
>>> Let me know what you think. I want it to go into an rc soon.
>>
>> Let's do the minimum change variant for now then.
>
> The patch in its current form is that minimum. I suggest we take that
> and work our way from there.

Can you resend it?  I'm about to do a patch sweep.


* Re: [PATCH v3 2/9] nvme-fabrics: allow to queue requests for live queues
  2020-09-08 16:48                           ` Christoph Hellwig
@ 2020-09-08 19:56                             ` Sagi Grimberg
  0 siblings, 0 replies; 44+ messages in thread
From: Sagi Grimberg @ 2020-09-08 19:56 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, James Smart, linux-nvme


>>> Let's do the minimum change variant for now then.
>>
>> The patch in its current form is that minimum. I suggest we take that
>> and work our way from there.
> 
> Can you resend it?  I'm about to do a patch sweep.

Sent.


end of thread

Thread overview: 44+ messages
2020-08-20  5:36 [PATCH v3 0/9] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
2020-08-20  5:36 ` [PATCH v3 1/9] nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance Sagi Grimberg
2020-08-20  6:02   ` Christoph Hellwig
2020-08-20 20:49   ` James Smart
2020-08-20  5:36 ` [PATCH v3 2/9] nvme-fabrics: allow to queue requests for live queues Sagi Grimberg
2020-08-20  6:09   ` Christoph Hellwig
2020-08-20 16:58     ` Sagi Grimberg
2020-08-20 20:45       ` James Smart
2020-08-20 22:13         ` Sagi Grimberg
2020-08-20 22:17           ` James Smart
2020-08-21  6:22         ` Christoph Hellwig
2020-08-21 15:22           ` James Smart
2020-08-21 19:44           ` Sagi Grimberg
2020-08-23 15:19             ` James Smart
2020-08-24  8:06               ` Sagi Grimberg
2020-08-24  8:02           ` Sagi Grimberg
2020-08-25  7:13             ` Christoph Hellwig
2020-08-25 15:00               ` Sagi Grimberg
2020-08-25 15:41                 ` James Smart
2020-08-25 17:35                   ` Sagi Grimberg
2020-09-04 20:26                     ` Sagi Grimberg
2020-09-08  9:05                       ` Christoph Hellwig
2020-09-08 16:47                         ` Sagi Grimberg
2020-09-08 16:48                           ` Christoph Hellwig
2020-09-08 19:56                             ` Sagi Grimberg
2020-08-20 20:54   ` James Smart
2020-08-20 20:56   ` James Smart
2020-08-20  5:36 ` [PATCH v3 3/9] nvme: have nvme_wait_freeze_timeout return if it timed out Sagi Grimberg
2020-08-20  6:09   ` Christoph Hellwig
2020-08-20  5:36 ` [PATCH v3 4/9] nvme-tcp: serialize controller teardown sequences Sagi Grimberg
2020-08-20  5:36 ` [PATCH v3 5/9] nvme-tcp: fix timeout handler Sagi Grimberg
2020-08-20  5:36 ` [PATCH v3 6/9] nvme-tcp: fix reset hang if controller died in the middle of a reset Sagi Grimberg
2020-08-20  5:36 ` [PATCH v3 7/9] nvme-rdma: serialize controller teardown sequences Sagi Grimberg
2020-08-20 21:04   ` James Smart
2020-08-20 22:16     ` Sagi Grimberg
2020-08-21 21:08   ` James Smart
2020-08-20  5:36 ` [PATCH v3 8/9] nvme-rdma: fix timeout handler Sagi Grimberg
2020-08-20  6:10   ` Christoph Hellwig
2020-08-20 21:37   ` James Smart
2020-08-20  5:36 ` [PATCH v3 9/9] nvme-rdma: fix reset hang if controller died in the middle of a reset Sagi Grimberg
2020-08-20  6:10   ` Christoph Hellwig
2020-08-24 18:29 ` [PATCH v3 0/9] fix possible controller reset hangs in nvme-tcp/nvme-rdma Sagi Grimberg
2020-08-25  7:16   ` Christoph Hellwig
2020-08-25 15:35     ` James Smart
