All of lore.kernel.org
 help / color / mirror / Atom feed
* PATCH V4 0/5 nvme-pci: fixes on nvme_timeout and nvme_dev_disable
@ 2018-03-08  6:19 ` Jianchao Wang
  0 siblings, 0 replies; 38+ messages in thread
From: Jianchao Wang @ 2018-03-08  6:19 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

Firstly, really appreciate Keith and Sagi's precious advice on previous versions.
And this is the version 4.

Some patches of the previous patchset have been submitted and the left is this patchset
which has been refactored. Please consider it for 4.17.

The target of this patchset is to avoid nvme_dev_disable to be invoked by nvme_timeout.
As we know, nvme_dev_disable will issue commands on adminq, if the controller no response,
it has to depend on timeout path. However, nvme_timeout will also need to invoke
nvme_dev_disable. This will introduce dangerous circular dependence. Moreover,
nvme_dev_disable is under the shutdown_lock, even when it go to sleep, this makes things
worse.

The basic idea of this patchset is:
 - When need to schedule reset_work, hand over expired requests to nvme_dev_disable. They
   will be completed after the controller is disabled/shtudown.

 - When requests from nvme_dev_disable and nvme_reset_work expires, disable the controller
   directly then the request could be completed to wakeup the waiter. 

The 'disable the controller directly' here means that it doesn't send commands on adminq.
A new interface is introduced for this, nvme_pci_disable_ctrl_directly. More details,
please refer to the comment of the function.

Then nvme_timeout doesn't depends on nvme_dev_disable any more.

Because there is big difference from previous version, and some relatively independent patches
have been submitted, so I just reserve the key part of previous version change log following.

Change V3->V4
 - refactor the interfaces flushing in-flight requests and add them to nvme core.
 - refactor the nvme_timeout to make it more clearly

Change V2->V3:
 - discard the patch which unfreeze the queue after nvme_dev_disable

Changes V1->V2:
 - disable PCI controller bus master in nvme_pci_disable_ctrl_directly

There are 5 patches:
1st one is to change the operations on nvme_request->flags to atomic operations, then we could introduce
another NVME_REQ_ABORTED next.
2nd patch introduce two new interfaces to flush in-flight requests in nvme core.
3rd patch is to avoid the nvme_dev_disable in nvme_timeout, it introduce new interface nvme_pci_disable_ctrl_directly
and refactor the nvme_timeout
4th~5th is to fix issues introduced after 3rd patch.

Jianchao Wang (5)
0001-nvme-do-atomically-bit-operations-on-nvme_request.fl.patch
0002-nvme-add-helper-interface-to-flush-in-flight-request.patch
0003-nvme-pci-avoid-nvme_dev_disable-to-be-invoked-in-nvm.patch
0004-nvme-pci-discard-wait-timeout-when-delete-cq-sq.patch
0005-nvme-pci-add-the-timeout-case-for-DELETEING-state.patch

diff stat
 drivers/nvme/host/core.c |  96 +++++++++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h |   4 +-
 drivers/nvme/host/pci.c  | 224 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------------
 
 Thanks
 Jianchao

^ permalink raw reply	[flat|nested] 38+ messages in thread

* PATCH V4 0/5 nvme-pci: fixes on nvme_timeout and nvme_dev_disable
@ 2018-03-08  6:19 ` Jianchao Wang
  0 siblings, 0 replies; 38+ messages in thread
From: Jianchao Wang @ 2018-03-08  6:19 UTC (permalink / raw)


Firstly, really appreciate Keith and Sagi's precious advice on previous versions.
And this is the version 4.

Some patches of the previous patchset have been submitted and the left is this patchset
which has been refactored. Please consider it for 4.17.

The target of this patchset is to avoid nvme_dev_disable to be invoked by nvme_timeout.
As we know, nvme_dev_disable will issue commands on adminq, if the controller no response,
it has to depend on timeout path. However, nvme_timeout will also need to invoke
nvme_dev_disable. This will introduce dangerous circular dependence. Moreover,
nvme_dev_disable is under the shutdown_lock, even when it go to sleep, this makes things
worse.

The basic idea of this patchset is:
 - When need to schedule reset_work, hand over expired requests to nvme_dev_disable. They
   will be completed after the controller is disabled/shtudown.

 - When requests from nvme_dev_disable and nvme_reset_work expires, disable the controller
   directly then the request could be completed to wakeup the waiter. 

The 'disable the controller directly' here means that it doesn't send commands on adminq.
A new interface is introduced for this, nvme_pci_disable_ctrl_directly. More details,
please refer to the comment of the function.

Then nvme_timeout doesn't depends on nvme_dev_disable any more.

Because there is big difference from previous version, and some relatively independent patches
have been submitted, so I just reserve the key part of previous version change log following.

Change V3->V4
 - refactor the interfaces flushing in-flight requests and add them to nvme core.
 - refactor the nvme_timeout to make it more clearly

Change V2->V3:
 - discard the patch which unfreeze the queue after nvme_dev_disable

Changes V1->V2:
 - disable PCI controller bus master in nvme_pci_disable_ctrl_directly

There are 5 patches:
1st one is to change the operations on nvme_request->flags to atomic operations, then we could introduce
another NVME_REQ_ABORTED next.
2nd patch introduce two new interfaces to flush in-flight requests in nvme core.
3rd patch is to avoid the nvme_dev_disable in nvme_timeout, it introduce new interface nvme_pci_disable_ctrl_directly
and refactor the nvme_timeout
4th~5th is to fix issues introduced after 3rd patch.

Jianchao Wang (5)
0001-nvme-do-atomically-bit-operations-on-nvme_request.fl.patch
0002-nvme-add-helper-interface-to-flush-in-flight-request.patch
0003-nvme-pci-avoid-nvme_dev_disable-to-be-invoked-in-nvm.patch
0004-nvme-pci-discard-wait-timeout-when-delete-cq-sq.patch
0005-nvme-pci-add-the-timeout-case-for-DELETEING-state.patch

diff stat
 drivers/nvme/host/core.c |  96 +++++++++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h |   4 +-
 drivers/nvme/host/pci.c  | 224 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------------
 
 Thanks
 Jianchao

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH V4 1/5] nvme: do atomically bit operations on nvme_request.flags
  2018-03-08  6:19 ` Jianchao Wang
@ 2018-03-08  6:19   ` Jianchao Wang
  -1 siblings, 0 replies; 38+ messages in thread
From: Jianchao Wang @ 2018-03-08  6:19 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

Do atomically bit operations on nvme_request.flags instead of
regular read/write, then we could add other flags and set/clear
them safely.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/core.c     | 4 ++--
 drivers/nvme/host/lightnvm.c | 4 ++--
 drivers/nvme/host/nvme.h     | 4 ++--
 drivers/nvme/host/pci.c      | 4 ++--
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 29ead91..7b8df47 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -674,7 +674,7 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
 	blk_execute_rq(req->q, NULL, req, at_head);
 	if (result)
 		*result = nvme_req(req)->result;
-	if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
+	if (test_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags))
 		ret = -EINTR;
 	else
 		ret = nvme_req(req)->status;
@@ -763,7 +763,7 @@ static int nvme_submit_user_cmd(struct request_queue *q,
 	}
 
 	blk_execute_rq(req->q, disk, req, 0);
-	if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
+	if (test_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags))
 		ret = -EINTR;
 	else
 		ret = nvme_req(req)->status;
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 50ef71ee..fd0c499 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -470,7 +470,7 @@ static int nvme_nvm_submit_io_sync(struct nvm_dev *dev, struct nvm_rq *rqd)
 	 * handle the error accordingly.
 	 */
 	blk_execute_rq(q, NULL, rq, 0);
-	if (nvme_req(rq)->flags & NVME_REQ_CANCELLED)
+	if (test_bit(NVME_REQ_CANCELLED, &nvme_req(rq)->flags))
 		ret = -EINTR;
 
 	rqd->ppa_status = le64_to_cpu(nvme_req(rq)->result.u64);
@@ -599,7 +599,7 @@ static int nvme_nvm_submit_user_cmd(struct request_queue *q,
 
 	blk_execute_rq(q, NULL, rq, 0);
 
-	if (nvme_req(rq)->flags & NVME_REQ_CANCELLED)
+	if (test_bit(NVME_REQ_CANCELLED, &nvme_req(rq)->flags))
 		ret = -EINTR;
 	else if (nvme_req(rq)->status & 0x7ff)
 		ret = -EIO;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index e80fd74..02097e8 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -93,8 +93,8 @@ struct nvme_request {
 	struct nvme_command	*cmd;
 	union nvme_result	result;
 	u8			retries;
-	u8			flags;
 	u16			status;
+	unsigned long flags;
 };
 
 /*
@@ -103,7 +103,7 @@ struct nvme_request {
 #define REQ_NVME_MPATH		REQ_DRV
 
 enum {
-	NVME_REQ_CANCELLED		= (1 << 0),
+	NVME_REQ_CANCELLED		= 0,
 };
 
 static inline struct nvme_request *nvme_req(struct request *req)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 73036d2..e186158 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1222,7 +1222,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 			 "I/O %d QID %d timeout, disable controller\n",
 			 req->tag, nvmeq->qid);
 		nvme_dev_disable(dev, false);
-		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
+		set_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags);
 		return BLK_EH_HANDLED;
 	default:
 		break;
@@ -1244,7 +1244,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		 * Mark the request as handled, since the inline shutdown
 		 * forces all outstanding requests to complete.
 		 */
-		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
+		set_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags);
 		return BLK_EH_HANDLED;
 	}
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH V4 1/5] nvme: do atomically bit operations on nvme_request.flags
@ 2018-03-08  6:19   ` Jianchao Wang
  0 siblings, 0 replies; 38+ messages in thread
From: Jianchao Wang @ 2018-03-08  6:19 UTC (permalink / raw)


Do atomically bit operations on nvme_request.flags instead of
regular read/write, then we could add other flags and set/clear
them safely.

Signed-off-by: Jianchao Wang <jianchao.w.wang at oracle.com>
---
 drivers/nvme/host/core.c     | 4 ++--
 drivers/nvme/host/lightnvm.c | 4 ++--
 drivers/nvme/host/nvme.h     | 4 ++--
 drivers/nvme/host/pci.c      | 4 ++--
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 29ead91..7b8df47 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -674,7 +674,7 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
 	blk_execute_rq(req->q, NULL, req, at_head);
 	if (result)
 		*result = nvme_req(req)->result;
-	if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
+	if (test_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags))
 		ret = -EINTR;
 	else
 		ret = nvme_req(req)->status;
@@ -763,7 +763,7 @@ static int nvme_submit_user_cmd(struct request_queue *q,
 	}
 
 	blk_execute_rq(req->q, disk, req, 0);
-	if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
+	if (test_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags))
 		ret = -EINTR;
 	else
 		ret = nvme_req(req)->status;
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 50ef71ee..fd0c499 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -470,7 +470,7 @@ static int nvme_nvm_submit_io_sync(struct nvm_dev *dev, struct nvm_rq *rqd)
 	 * handle the error accordingly.
 	 */
 	blk_execute_rq(q, NULL, rq, 0);
-	if (nvme_req(rq)->flags & NVME_REQ_CANCELLED)
+	if (test_bit(NVME_REQ_CANCELLED, &nvme_req(rq)->flags))
 		ret = -EINTR;
 
 	rqd->ppa_status = le64_to_cpu(nvme_req(rq)->result.u64);
@@ -599,7 +599,7 @@ static int nvme_nvm_submit_user_cmd(struct request_queue *q,
 
 	blk_execute_rq(q, NULL, rq, 0);
 
-	if (nvme_req(rq)->flags & NVME_REQ_CANCELLED)
+	if (test_bit(NVME_REQ_CANCELLED, &nvme_req(rq)->flags))
 		ret = -EINTR;
 	else if (nvme_req(rq)->status & 0x7ff)
 		ret = -EIO;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index e80fd74..02097e8 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -93,8 +93,8 @@ struct nvme_request {
 	struct nvme_command	*cmd;
 	union nvme_result	result;
 	u8			retries;
-	u8			flags;
 	u16			status;
+	unsigned long flags;
 };
 
 /*
@@ -103,7 +103,7 @@ struct nvme_request {
 #define REQ_NVME_MPATH		REQ_DRV
 
 enum {
-	NVME_REQ_CANCELLED		= (1 << 0),
+	NVME_REQ_CANCELLED		= 0,
 };
 
 static inline struct nvme_request *nvme_req(struct request *req)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 73036d2..e186158 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1222,7 +1222,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 			 "I/O %d QID %d timeout, disable controller\n",
 			 req->tag, nvmeq->qid);
 		nvme_dev_disable(dev, false);
-		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
+		set_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags);
 		return BLK_EH_HANDLED;
 	default:
 		break;
@@ -1244,7 +1244,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		 * Mark the request as handled, since the inline shutdown
 		 * forces all outstanding requests to complete.
 		 */
-		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
+		set_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags);
 		return BLK_EH_HANDLED;
 	}
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH V4 2/5] nvme: add helper interface to flush in-flight requests
  2018-03-08  6:19 ` Jianchao Wang
@ 2018-03-08  6:19   ` Jianchao Wang
  -1 siblings, 0 replies; 38+ messages in thread
From: Jianchao Wang @ 2018-03-08  6:19 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

Currently, we use nvme_cancel_request to complete the request
forcedly. This has following defects:
 - It is not safe to race with the normal completion path.
   blk_mq_complete_request is ok to race with timeout path,
   but not with itself.
 - Cannot ensure all the requests have been handled. The timeout
   path may grab some expired requests, blk_mq_complete_request
   cannot touch them.

add two helper interface to flush in-flight requests more safely.
- nvme_abort_requests_sync
use nvme_abort_req to timeout all the in-flight requests and wait
until timeout work and irq completion path completes. More details
please refer to the comment of this interface.
- nvme_flush_aborted_requests
complete the requests 'aborted' by nvme_abort_requests_sync. It will
be invoked after the controller is disabled/shutdown.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/core.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h |  4 +-
 2 files changed, 99 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 7b8df47..e26759b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3567,6 +3567,102 @@ void nvme_start_queues(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_start_queues);
 
+static void nvme_abort_req(struct request *req, void *data, bool reserved)
+{
+	if (!blk_mq_request_started(req))
+		return;
+
+	dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device,
+				"Abort I/O %d", req->tag);
+
+	/* The timeout path need identify this flag and return
+	 * BLK_EH_NOT_HANDLED, then the request will not be completed.
+	 * we will defer the completion after the controller is disabled or
+	 * shutdown.
+	 */
+	set_bit(NVME_REQ_ABORTED, &nvme_req(req)->flags);
+	blk_abort_request(req);
+}
+
+/*
+ * This function will ensure all the in-flight requests on the
+ * controller to be handled by timeout path or irq completion path.
+ * It has to pair with nvme_flush_aborted_requests which will be
+ * invoked after the controller has been disabled/shutdown and
+ * complete the requests 'aborted' by nvme_abort_req.
+ *
+ * Note, the driver layer will not be quiescent before disable
+ * controller, because the requests aborted by blk_abort_request
+ * are still active and the irq will fire at any time, but it can
+ * not enter into completion path, because the request has been
+ * timed out.
+ */
+void nvme_abort_requests_sync(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	blk_mq_tagset_busy_iter(ctrl->tagset, nvme_abort_req, ctrl);
+	blk_mq_tagset_busy_iter(ctrl->admin_tagset, nvme_abort_req, ctrl);
+	/*
+	 * ensure the timeout_work is queued, thus needn't to sync
+	 * the timer
+	 */
+	kblockd_schedule_work(&ctrl->admin_q->timeout_work);
+
+	down_read(&ctrl->namespaces_rwsem);
+
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		kblockd_schedule_work(&ns->queue->timeout_work);
+
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		flush_work(&ns->queue->timeout_work);
+
+	up_read(&ctrl->namespaces_rwsem);
+	/* This will ensure all the nvme irq completion path have exited */
+	synchronize_sched();
+}
+EXPORT_SYMBOL_GPL(nvme_abort_requests_sync);
+
+static void nvme_comp_req(struct request *req, void *data, bool reserved)
+{
+	struct nvme_ctrl *ctrl = (struct nvme_ctrl *)data;
+
+	if (!test_bit(NVME_REQ_ABORTED, &nvme_req(req)->flags))
+		return;
+
+	WARN_ON(!blk_mq_request_started(req));
+
+	if (ctrl->tagset && ctrl->tagset->ops->complete) {
+		clear_bit(NVME_REQ_ABORTED, &nvme_req(req)->flags);
+		/*
+		 * We set the status to NVME_SC_ABORT_REQ, then ioq request
+		 * will be requeued and adminq one will be failed.
+		 */
+		nvme_req(req)->status = NVME_SC_ABORT_REQ;
+		/*
+		 * For ioq request, blk_mq_requeue_request should be better
+		 * here. But the nvme code will still setup the cmd even if
+		 * the RQF_DONTPREP is set. We have to use .complete to free
+		 * the cmd and then requeue it.
+		 *
+		 * For adminq request, invoking .complete directly will miss
+		 * blk_mq_sched_completed_request, but this is ok because we
+		 * won't have io scheduler for adminq.
+		 */
+		ctrl->tagset->ops->complete(req);
+	}
+}
+
+/*
+ * Should pair with nvme_abort_requests_sync
+ */
+void nvme_flush_aborted_requests(struct nvme_ctrl *ctrl)
+{
+	blk_mq_tagset_busy_iter(ctrl->tagset, nvme_comp_req, ctrl);
+	blk_mq_tagset_busy_iter(ctrl->admin_tagset, nvme_comp_req, ctrl);
+}
+EXPORT_SYMBOL_GPL(nvme_flush_aborted_requests);
+
 int nvme_reinit_tagset(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set)
 {
 	if (!ctrl->ops->reinit_request)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 02097e8..3c71c73 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -104,6 +104,7 @@ struct nvme_request {
 
 enum {
 	NVME_REQ_CANCELLED		= 0,
+	NVME_REQ_ABORTED,            /* cmd is aborted by nvme_abort_request */
 };
 
 static inline struct nvme_request *nvme_req(struct request *req)
@@ -381,7 +382,8 @@ void nvme_wait_freeze(struct nvme_ctrl *ctrl);
 void nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout);
 void nvme_start_freeze(struct nvme_ctrl *ctrl);
 int nvme_reinit_tagset(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set);
-
+void nvme_abort_requests_sync(struct nvme_ctrl *ctrl);
+void nvme_flush_aborted_requests(struct nvme_ctrl *ctrl);
 #define NVME_QID_ANY -1
 struct request *nvme_alloc_request(struct request_queue *q,
 		struct nvme_command *cmd, blk_mq_req_flags_t flags, int qid);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH V4 2/5] nvme: add helper interface to flush in-flight requests
@ 2018-03-08  6:19   ` Jianchao Wang
  0 siblings, 0 replies; 38+ messages in thread
From: Jianchao Wang @ 2018-03-08  6:19 UTC (permalink / raw)


Currently, we use nvme_cancel_request to complete the request
forcedly. This has following defects:
 - It is not safe to race with the normal completion path.
   blk_mq_complete_request is ok to race with timeout path,
   but not with itself.
 - Cannot ensure all the requests have been handled. The timeout
   path may grab some expired requests, blk_mq_complete_request
   cannot touch them.

add two helper interface to flush in-flight requests more safely.
- nvme_abort_requests_sync
use nvme_abort_req to timeout all the in-flight requests and wait
until timeout work and irq completion path completes. More details
please refer to the comment of this interface.
- nvme_flush_aborted_requests
complete the requests 'aborted' by nvme_abort_requests_sync. It will
be invoked after the controller is disabled/shutdown.

Signed-off-by: Jianchao Wang <jianchao.w.wang at oracle.com>
---
 drivers/nvme/host/core.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h |  4 +-
 2 files changed, 99 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 7b8df47..e26759b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3567,6 +3567,102 @@ void nvme_start_queues(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_start_queues);
 
+static void nvme_abort_req(struct request *req, void *data, bool reserved)
+{
+	if (!blk_mq_request_started(req))
+		return;
+
+	dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device,
+				"Abort I/O %d", req->tag);
+
+	/* The timeout path need identify this flag and return
+	 * BLK_EH_NOT_HANDLED, then the request will not be completed.
+	 * we will defer the completion after the controller is disabled or
+	 * shutdown.
+	 */
+	set_bit(NVME_REQ_ABORTED, &nvme_req(req)->flags);
+	blk_abort_request(req);
+}
+
+/*
+ * This function will ensure all the in-flight requests on the
+ * controller to be handled by timeout path or irq completion path.
+ * It has to pair with nvme_flush_aborted_requests which will be
+ * invoked after the controller has been disabled/shutdown and
+ * complete the requests 'aborted' by nvme_abort_req.
+ *
+ * Note, the driver layer will not be quiescent before disable
+ * controller, because the requests aborted by blk_abort_request
+ * are still active and the irq will fire at any time, but it can
+ * not enter into completion path, because the request has been
+ * timed out.
+ */
+void nvme_abort_requests_sync(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	blk_mq_tagset_busy_iter(ctrl->tagset, nvme_abort_req, ctrl);
+	blk_mq_tagset_busy_iter(ctrl->admin_tagset, nvme_abort_req, ctrl);
+	/*
+	 * ensure the timeout_work is queued, thus needn't to sync
+	 * the timer
+	 */
+	kblockd_schedule_work(&ctrl->admin_q->timeout_work);
+
+	down_read(&ctrl->namespaces_rwsem);
+
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		kblockd_schedule_work(&ns->queue->timeout_work);
+
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		flush_work(&ns->queue->timeout_work);
+
+	up_read(&ctrl->namespaces_rwsem);
+	/* This will ensure all the nvme irq completion path have exited */
+	synchronize_sched();
+}
+EXPORT_SYMBOL_GPL(nvme_abort_requests_sync);
+
+static void nvme_comp_req(struct request *req, void *data, bool reserved)
+{
+	struct nvme_ctrl *ctrl = (struct nvme_ctrl *)data;
+
+	if (!test_bit(NVME_REQ_ABORTED, &nvme_req(req)->flags))
+		return;
+
+	WARN_ON(!blk_mq_request_started(req));
+
+	if (ctrl->tagset && ctrl->tagset->ops->complete) {
+		clear_bit(NVME_REQ_ABORTED, &nvme_req(req)->flags);
+		/*
+		 * We set the status to NVME_SC_ABORT_REQ, then ioq request
+		 * will be requeued and adminq one will be failed.
+		 */
+		nvme_req(req)->status = NVME_SC_ABORT_REQ;
+		/*
+		 * For ioq request, blk_mq_requeue_request should be better
+		 * here. But the nvme code will still setup the cmd even if
+		 * the RQF_DONTPREP is set. We have to use .complete to free
+		 * the cmd and then requeue it.
+		 *
+		 * For adminq request, invoking .complete directly will miss
+		 * blk_mq_sched_completed_request, but this is ok because we
+		 * won't have io scheduler for adminq.
+		 */
+		ctrl->tagset->ops->complete(req);
+	}
+}
+
+/*
+ * Should pair with nvme_abort_requests_sync
+ */
+void nvme_flush_aborted_requests(struct nvme_ctrl *ctrl)
+{
+	blk_mq_tagset_busy_iter(ctrl->tagset, nvme_comp_req, ctrl);
+	blk_mq_tagset_busy_iter(ctrl->admin_tagset, nvme_comp_req, ctrl);
+}
+EXPORT_SYMBOL_GPL(nvme_flush_aborted_requests);
+
 int nvme_reinit_tagset(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set)
 {
 	if (!ctrl->ops->reinit_request)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 02097e8..3c71c73 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -104,6 +104,7 @@ struct nvme_request {
 
 enum {
 	NVME_REQ_CANCELLED		= 0,
+	NVME_REQ_ABORTED,            /* cmd is aborted by nvme_abort_request */
 };
 
 static inline struct nvme_request *nvme_req(struct request *req)
@@ -381,7 +382,8 @@ void nvme_wait_freeze(struct nvme_ctrl *ctrl);
 void nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout);
 void nvme_start_freeze(struct nvme_ctrl *ctrl);
 int nvme_reinit_tagset(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set);
-
+void nvme_abort_requests_sync(struct nvme_ctrl *ctrl);
+void nvme_flush_aborted_requests(struct nvme_ctrl *ctrl);
 #define NVME_QID_ANY -1
 struct request *nvme_alloc_request(struct request_queue *q,
 		struct nvme_command *cmd, blk_mq_req_flags_t flags, int qid);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH V4 3/5] nvme-pci: avoid nvme_dev_disable to be invoked in nvme_timeout
  2018-03-08  6:19 ` Jianchao Wang
@ 2018-03-08  6:19   ` Jianchao Wang
  -1 siblings, 0 replies; 38+ messages in thread
From: Jianchao Wang @ 2018-03-08  6:19 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

nvme_dev_disable will issue command on adminq to clear HMB and
delete io cq/sqs, maybe more in the future. When adminq no response,
it has to depends on timeout path. However, nvme_timeout has to
invoke nvme_dev_disable before return, so that the DMA mappings
could be released safely. This will introduce dangerous circular
dependence. Moreover, the whole nvme_dev_disable is under
shutdown_lock even waiting for the command, this makes things
worse.

To avoid this, this patch refactors the nvme_timeout. The basic
principle is:
 - When need to schedule reset_work, hand over expired requests
   to nvme_dev_disable. They will be completed after the controller
   is disabled/shtudown.
 - When requests from nvme_dev_disable and nvme_reset_work expires,
   disable the controller directly then the request could be completed
   to wakeup the waiter. nvme_pci_disable_ctrl_directly is introduced
   for this, it doesn't send commands on adminq and the shutdown_lock
   is not needed here, because the nvme_abort_requests_sync in
   nvme_dev_disable could synchronize with nvme_timeout.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/pci.c | 199 ++++++++++++++++++++++++++++++++----------------
 1 file changed, 134 insertions(+), 65 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e186158..ce09057 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -70,6 +70,7 @@ struct nvme_queue;
 
 static void nvme_process_cq(struct nvme_queue *nvmeq);
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
+static void nvme_pci_disable_ctrl_directly(struct nvme_dev *dev);
 
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
@@ -98,6 +99,7 @@ struct nvme_dev {
 	u32 cmbloc;
 	struct nvme_ctrl ctrl;
 	struct completion ioq_wait;
+	bool inflight_flushed;
 
 	/* shadow doorbell buffer support: */
 	u32 *dbbuf_dbs;
@@ -1180,73 +1182,13 @@ static void nvme_warn_reset(struct nvme_dev *dev, u32 csts)
 			 csts, result);
 }
 
-static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
+static enum blk_eh_timer_return nvme_pci_abort_io_req(struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	struct nvme_queue *nvmeq = iod->nvmeq;
 	struct nvme_dev *dev = nvmeq->dev;
-	struct request *abort_req;
 	struct nvme_command cmd;
-	u32 csts = readl(dev->bar + NVME_REG_CSTS);
-
-	/*
-	 * Reset immediately if the controller is failed
-	 */
-	if (nvme_should_reset(dev, csts)) {
-		nvme_warn_reset(dev, csts);
-		nvme_dev_disable(dev, false);
-		nvme_reset_ctrl(&dev->ctrl);
-		return BLK_EH_HANDLED;
-	}
-
-	/*
-	 * Did we miss an interrupt?
-	 */
-	if (__nvme_poll(nvmeq, req->tag)) {
-		dev_warn(dev->ctrl.device,
-			 "I/O %d QID %d timeout, completion polled\n",
-			 req->tag, nvmeq->qid);
-		return BLK_EH_HANDLED;
-	}
-
-	/*
-	 * Shutdown immediately if controller times out while starting. The
-	 * reset work will see the pci device disabled when it gets the forced
-	 * cancellation error. All outstanding requests are completed on
-	 * shutdown, so we return BLK_EH_HANDLED.
-	 */
-	switch (dev->ctrl.state) {
-	case NVME_CTRL_CONNECTING:
-	case NVME_CTRL_RESETTING:
-		dev_warn(dev->ctrl.device,
-			 "I/O %d QID %d timeout, disable controller\n",
-			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false);
-		set_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags);
-		return BLK_EH_HANDLED;
-	default:
-		break;
-	}
-
-	/*
- 	 * Shutdown the controller immediately and schedule a reset if the
- 	 * command was already aborted once before and still hasn't been
- 	 * returned to the driver, or if this is the admin queue.
-	 */
-	if (!nvmeq->qid || iod->aborted) {
-		dev_warn(dev->ctrl.device,
-			 "I/O %d QID %d timeout, reset controller\n",
-			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false);
-		nvme_reset_ctrl(&dev->ctrl);
-
-		/*
-		 * Mark the request as handled, since the inline shutdown
-		 * forces all outstanding requests to complete.
-		 */
-		set_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags);
-		return BLK_EH_HANDLED;
-	}
+	struct request *abort_req;
 
 	if (atomic_dec_return(&dev->ctrl.abort_limit) < 0) {
 		atomic_inc(&dev->ctrl.abort_limit);
@@ -1282,6 +1224,105 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	return BLK_EH_RESET_TIMER;
 }
 
+static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
+{
+	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	struct nvme_queue *nvmeq = iod->nvmeq;
+	struct nvme_dev *dev = nvmeq->dev;
+	u32 csts = readl(dev->bar + NVME_REG_CSTS);
+	enum {ABORT, RESET, DISABLE} action;
+	enum blk_eh_timer_return ret;
+	/*
+	 * This is for nvme_abort_req. This request will be completed
+	 * by nvme_flush_aborted_requests after the controller is
+	 * disabled/shutdown
+	 */
+	if (test_bit(NVME_REQ_ABORTED, &nvme_req(req)->flags))
+		return BLK_EH_NOT_HANDLED;
+
+	/*
+	 * Reset immediately if the controller is failed.
+	 * Defer the completion to nvme_flush_aborted_requests.
+	 */
+	if (nvme_should_reset(dev, csts)) {
+		nvme_warn_reset(dev, csts);
+		nvme_reset_ctrl(&dev->ctrl);
+		return BLK_EH_RESET_TIMER;
+	}
+
+	/*
+	 * Did we miss an interrupt?
+	 */
+	if (__nvme_poll(nvmeq, req->tag)) {
+		dev_warn(dev->ctrl.device,
+			 "I/O %d QID %d timeout, completion polled\n",
+			 req->tag, nvmeq->qid);
+		return BLK_EH_HANDLED;
+	}
+
+	if (nvmeq->qid) {
+		if (dev->ctrl.state == NVME_CTRL_RESETTING ||
+				iod->aborted)
+			action = RESET;
+		else
+			action = ABORT;
+	} else {
+		/*
+		 * Disable immediately if controller times out while disabling/
+		 * shuting down/starting. The nvme_dev_disable/nvme_reset_work
+		 * will see the error.
+		 * Note: inflight_flushed is set in nvme_dev_disable when it
+		 * abort all the inflight requests. Introducing this flag is due
+		 * to there is no state to represent the shutdown procedure.
+		 */
+		if (dev->ctrl.state == NVME_CTRL_CONNECTING ||
+			dev->inflight_flushed)
+			action = DISABLE;
+		else
+			action = RESET;
+	}
+
+	switch (action) {
+	case ABORT:
+		ret = nvme_pci_abort_io_req(req);
+		break;
+	case RESET:
+		dev_warn(dev->ctrl.device,
+			 "I/O %d QID %d timeout, reset controller\n",
+			 req->tag, nvmeq->qid);
+		set_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags);
+		nvme_reset_ctrl(&dev->ctrl);
+		/*
+		 * The reset work will take over this request. nvme_abort_req
+		 * employs blk_abort_request to force the request to be timed
+		 * out. So we need to return BLK_EH_RESET_TIMER then the
+		 * RQF_MQ_TIMEOUT_EXPIRED could be cleared.
+		 */
+		ret = BLK_EH_RESET_TIMER;
+		break;
+	case DISABLE:
+		/*
+		 * We disable the controller directly here so that we could
+		 * complete the request safely to wake up the nvme_dev_disable/
+		 * nvme_reset_work who is waiting on adminq. We cannot return
+		 * BLK_EH_RESET_TIMER to depend on error recovery procedure
+		 * because it is waiting for timeout path.
+		 */
+		dev_warn(dev->ctrl.device,
+			 "I/O %d QID %d timeout, disable controller\n",
+			 req->tag, nvmeq->qid);
+		nvme_pci_disable_ctrl_directly(dev);
+		set_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags);
+		ret = BLK_EH_HANDLED;
+		break;
+	default:
+		WARN_ON(1);
+		break;
+	}
+
+	return ret;
+}
+
 static void nvme_free_queue(struct nvme_queue *nvmeq)
 {
 	dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
@@ -2169,6 +2210,33 @@ static void nvme_pci_disable(struct nvme_dev *dev)
 	}
 }
 
+/*
+ * This is only invoked by nvme_timeout. shutdown_lock is not needed
+ * here because nvme_abort_requests_sync in nvme_dev_disable will
+ * synchronize with the timeout path.
+ */
+static void nvme_pci_disable_ctrl_directly(struct nvme_dev *dev)
+{
+	int i;
+	struct pci_dev *pdev = to_pci_dev(dev->dev);
+	bool dead = true;
+
+	if (pci_is_enabled(pdev)) {
+		u32 csts = readl(dev->bar + NVME_REG_CSTS);
+
+		dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
+			pdev->error_state  != pci_channel_io_normal);
+
+		if (!dead)
+			nvme_disable_ctrl(&dev->ctrl, dev->ctrl.cap);
+	}
+
+	for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
+		nvme_suspend_queue(&dev->queues[i]);
+
+	nvme_pci_disable(dev);
+}
+
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 {
 	int i;
@@ -2205,6 +2273,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 
 	}
 	nvme_stop_queues(&dev->ctrl);
+	nvme_abort_requests_sync(&dev->ctrl);
+	dev->inflight_flushed = true;
 
 	if (!dead) {
 		nvme_disable_io_queues(dev);
@@ -2215,9 +2285,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 
 	nvme_pci_disable(dev);
 
-	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
-	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
-
+	nvme_flush_aborted_requests(&dev->ctrl);
+	dev->inflight_flushed = false;
 	/*
 	 * The driver will not be starting up queues again if shutting down so
 	 * must flush all entered requests to their failed completion to avoid
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH V4 3/5] nvme-pci: avoid nvme_dev_disable to be invoked in nvme_timeout
@ 2018-03-08  6:19   ` Jianchao Wang
  0 siblings, 0 replies; 38+ messages in thread
From: Jianchao Wang @ 2018-03-08  6:19 UTC (permalink / raw)


nvme_dev_disable will issue command on adminq to clear HMB and
delete io cq/sqs, maybe more in the future. When adminq no response,
it has to depends on timeout path. However, nvme_timeout has to
invoke nvme_dev_disable before return, so that the DMA mappings
could be released safely. This will introduce dangerous circular
dependence. Moreover, the whole nvme_dev_disable is under
shutdown_lock even waiting for the command, this makes things
worse.

To avoid this, this patch refactors the nvme_timeout. The basic
principle is:
 - When need to schedule reset_work, hand over expired requests
   to nvme_dev_disable. They will be completed after the controller
   is disabled/shtudown.
 - When requests from nvme_dev_disable and nvme_reset_work expires,
   disable the controller directly then the request could be completed
   to wakeup the waiter. nvme_pci_disable_ctrl_directly is introduced
   for this, it doesn't send commands on adminq and the shutdown_lock
   is not needed here, because the nvme_abort_requests_sync in
   nvme_dev_disable could synchronize with nvme_timeout.

Signed-off-by: Jianchao Wang <jianchao.w.wang at oracle.com>
---
 drivers/nvme/host/pci.c | 199 ++++++++++++++++++++++++++++++++----------------
 1 file changed, 134 insertions(+), 65 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e186158..ce09057 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -70,6 +70,7 @@ struct nvme_queue;
 
 static void nvme_process_cq(struct nvme_queue *nvmeq);
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
+static void nvme_pci_disable_ctrl_directly(struct nvme_dev *dev);
 
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
@@ -98,6 +99,7 @@ struct nvme_dev {
 	u32 cmbloc;
 	struct nvme_ctrl ctrl;
 	struct completion ioq_wait;
+	bool inflight_flushed;
 
 	/* shadow doorbell buffer support: */
 	u32 *dbbuf_dbs;
@@ -1180,73 +1182,13 @@ static void nvme_warn_reset(struct nvme_dev *dev, u32 csts)
 			 csts, result);
 }
 
-static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
+static enum blk_eh_timer_return nvme_pci_abort_io_req(struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	struct nvme_queue *nvmeq = iod->nvmeq;
 	struct nvme_dev *dev = nvmeq->dev;
-	struct request *abort_req;
 	struct nvme_command cmd;
-	u32 csts = readl(dev->bar + NVME_REG_CSTS);
-
-	/*
-	 * Reset immediately if the controller is failed
-	 */
-	if (nvme_should_reset(dev, csts)) {
-		nvme_warn_reset(dev, csts);
-		nvme_dev_disable(dev, false);
-		nvme_reset_ctrl(&dev->ctrl);
-		return BLK_EH_HANDLED;
-	}
-
-	/*
-	 * Did we miss an interrupt?
-	 */
-	if (__nvme_poll(nvmeq, req->tag)) {
-		dev_warn(dev->ctrl.device,
-			 "I/O %d QID %d timeout, completion polled\n",
-			 req->tag, nvmeq->qid);
-		return BLK_EH_HANDLED;
-	}
-
-	/*
-	 * Shutdown immediately if controller times out while starting. The
-	 * reset work will see the pci device disabled when it gets the forced
-	 * cancellation error. All outstanding requests are completed on
-	 * shutdown, so we return BLK_EH_HANDLED.
-	 */
-	switch (dev->ctrl.state) {
-	case NVME_CTRL_CONNECTING:
-	case NVME_CTRL_RESETTING:
-		dev_warn(dev->ctrl.device,
-			 "I/O %d QID %d timeout, disable controller\n",
-			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false);
-		set_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags);
-		return BLK_EH_HANDLED;
-	default:
-		break;
-	}
-
-	/*
- 	 * Shutdown the controller immediately and schedule a reset if the
- 	 * command was already aborted once before and still hasn't been
- 	 * returned to the driver, or if this is the admin queue.
-	 */
-	if (!nvmeq->qid || iod->aborted) {
-		dev_warn(dev->ctrl.device,
-			 "I/O %d QID %d timeout, reset controller\n",
-			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false);
-		nvme_reset_ctrl(&dev->ctrl);
-
-		/*
-		 * Mark the request as handled, since the inline shutdown
-		 * forces all outstanding requests to complete.
-		 */
-		set_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags);
-		return BLK_EH_HANDLED;
-	}
+	struct request *abort_req;
 
 	if (atomic_dec_return(&dev->ctrl.abort_limit) < 0) {
 		atomic_inc(&dev->ctrl.abort_limit);
@@ -1282,6 +1224,105 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	return BLK_EH_RESET_TIMER;
 }
 
+static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
+{
+	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	struct nvme_queue *nvmeq = iod->nvmeq;
+	struct nvme_dev *dev = nvmeq->dev;
+	u32 csts = readl(dev->bar + NVME_REG_CSTS);
+	enum {ABORT, RESET, DISABLE} action;
+	enum blk_eh_timer_return ret;
+	/*
+	 * This is for nvme_abort_req. This request will be completed
+	 * by nvme_flush_aborted_requests after the controller is
+	 * disabled/shutdown
+	 */
+	if (test_bit(NVME_REQ_ABORTED, &nvme_req(req)->flags))
+		return BLK_EH_NOT_HANDLED;
+
+	/*
+	 * Reset immediately if the controller is failed.
+	 * Defer the completion to nvme_flush_aborted_requests.
+	 */
+	if (nvme_should_reset(dev, csts)) {
+		nvme_warn_reset(dev, csts);
+		nvme_reset_ctrl(&dev->ctrl);
+		return BLK_EH_RESET_TIMER;
+	}
+
+	/*
+	 * Did we miss an interrupt?
+	 */
+	if (__nvme_poll(nvmeq, req->tag)) {
+		dev_warn(dev->ctrl.device,
+			 "I/O %d QID %d timeout, completion polled\n",
+			 req->tag, nvmeq->qid);
+		return BLK_EH_HANDLED;
+	}
+
+	if (nvmeq->qid) {
+		if (dev->ctrl.state == NVME_CTRL_RESETTING ||
+				iod->aborted)
+			action = RESET;
+		else
+			action = ABORT;
+	} else {
+		/*
+		 * Disable immediately if controller times out while disabling/
+		 * shuting down/starting. The nvme_dev_disable/nvme_reset_work
+		 * will see the error.
+		 * Note: inflight_flushed is set in nvme_dev_disable when it
+		 * abort all the inflight requests. Introducing this flag is due
+		 * to there is no state to represent the shutdown procedure.
+		 */
+		if (dev->ctrl.state == NVME_CTRL_CONNECTING ||
+			dev->inflight_flushed)
+			action = DISABLE;
+		else
+			action = RESET;
+	}
+
+	switch (action) {
+	case ABORT:
+		ret = nvme_pci_abort_io_req(req);
+		break;
+	case RESET:
+		dev_warn(dev->ctrl.device,
+			 "I/O %d QID %d timeout, reset controller\n",
+			 req->tag, nvmeq->qid);
+		set_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags);
+		nvme_reset_ctrl(&dev->ctrl);
+		/*
+		 * The reset work will take over this request. nvme_abort_req
+		 * employs blk_abort_request to force the request to be timed
+		 * out. So we need to return BLK_EH_RESET_TIMER then the
+		 * RQF_MQ_TIMEOUT_EXPIRED could be cleared.
+		 */
+		ret = BLK_EH_RESET_TIMER;
+		break;
+	case DISABLE:
+		/*
+		 * We disable the controller directly here so that we could
+		 * complete the request safely to wake up the nvme_dev_disable/
+		 * nvme_reset_work who is waiting on adminq. We cannot return
+		 * BLK_EH_RESET_TIMER to depend on error recovery procedure
+		 * because it is waiting for timeout path.
+		 */
+		dev_warn(dev->ctrl.device,
+			 "I/O %d QID %d timeout, disable controller\n",
+			 req->tag, nvmeq->qid);
+		nvme_pci_disable_ctrl_directly(dev);
+		set_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags);
+		ret = BLK_EH_HANDLED;
+		break;
+	default:
+		WARN_ON(1);
+		break;
+	}
+
+	return ret;
+}
+
 static void nvme_free_queue(struct nvme_queue *nvmeq)
 {
 	dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
@@ -2169,6 +2210,33 @@ static void nvme_pci_disable(struct nvme_dev *dev)
 	}
 }
 
+/*
+ * This is only invoked by nvme_timeout. shutdown_lock is not needed
+ * here because nvme_abort_requests_sync in nvme_dev_disable will
+ * synchronize with the timeout path.
+ */
+static void nvme_pci_disable_ctrl_directly(struct nvme_dev *dev)
+{
+	int i;
+	struct pci_dev *pdev = to_pci_dev(dev->dev);
+	bool dead = true;
+
+	if (pci_is_enabled(pdev)) {
+		u32 csts = readl(dev->bar + NVME_REG_CSTS);
+
+		dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
+			pdev->error_state  != pci_channel_io_normal);
+
+		if (!dead)
+			nvme_disable_ctrl(&dev->ctrl, dev->ctrl.cap);
+	}
+
+	for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
+		nvme_suspend_queue(&dev->queues[i]);
+
+	nvme_pci_disable(dev);
+}
+
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 {
 	int i;
@@ -2205,6 +2273,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 
 	}
 	nvme_stop_queues(&dev->ctrl);
+	nvme_abort_requests_sync(&dev->ctrl);
+	dev->inflight_flushed = true;
 
 	if (!dead) {
 		nvme_disable_io_queues(dev);
@@ -2215,9 +2285,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 
 	nvme_pci_disable(dev);
 
-	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
-	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
-
+	nvme_flush_aborted_requests(&dev->ctrl);
+	dev->inflight_flushed = false;
 	/*
 	 * The driver will not be starting up queues again if shutting down so
 	 * must flush all entered requests to their failed completion to avoid
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH V4 4/5] nvme-pci: discard wait timeout when delete cq/sq
  2018-03-08  6:19 ` Jianchao Wang
@ 2018-03-08  6:19   ` Jianchao Wang
  -1 siblings, 0 replies; 38+ messages in thread
From: Jianchao Wang @ 2018-03-08  6:19 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

nvme_disable_io_queues could be wakeup by both request completion
and wait timeout path. If delete cq/sq command expires, the
nvme_disable_io_queues will be wakeup, return, even get to
nvme_reset_work, while the timeout path is still ongoing. The disable
and initialization procedure may run in parallel.

Use wait_for_completion instead of the timeout one here. The timeout
path is reliable now and will complete the request to wakeup it.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/pci.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index ce09057..6c7c19cb 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2062,7 +2062,6 @@ static int nvme_delete_queue(struct nvme_queue *nvmeq, u8 opcode)
 static void nvme_disable_io_queues(struct nvme_dev *dev)
 {
 	int pass, queues = dev->online_queues - 1;
-	unsigned long timeout;
 	u8 opcode = nvme_admin_delete_sq;
 
 	for (pass = 0; pass < 2; pass++) {
@@ -2070,15 +2069,12 @@ static void nvme_disable_io_queues(struct nvme_dev *dev)
 
 		reinit_completion(&dev->ioq_wait);
  retry:
-		timeout = ADMIN_TIMEOUT;
 		for (; i > 0; i--, sent++)
 			if (nvme_delete_queue(&dev->queues[i], opcode))
 				break;
 
 		while (sent--) {
-			timeout = wait_for_completion_io_timeout(&dev->ioq_wait, timeout);
-			if (timeout == 0)
-				return;
+			wait_for_completion(&dev->ioq_wait);
 			if (i)
 				goto retry;
 		}
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH V4 4/5] nvme-pci: discard wait timeout when delete cq/sq
@ 2018-03-08  6:19   ` Jianchao Wang
  0 siblings, 0 replies; 38+ messages in thread
From: Jianchao Wang @ 2018-03-08  6:19 UTC (permalink / raw)


nvme_disable_io_queues could be wakeup by both request completion
and wait timeout path. If delete cq/sq command expires, the
nvme_disable_io_queues will be wakeup, return, even get to
nvme_reset_work, while the timeout path is still ongoing. The disable
and initialization procedure may run in parallel.

Use wait_for_completion instead of the timeout one here. The timeout
path is reliable now and will complete the request to wakeup it.

Signed-off-by: Jianchao Wang <jianchao.w.wang at oracle.com>
---
 drivers/nvme/host/pci.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index ce09057..6c7c19cb 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2062,7 +2062,6 @@ static int nvme_delete_queue(struct nvme_queue *nvmeq, u8 opcode)
 static void nvme_disable_io_queues(struct nvme_dev *dev)
 {
 	int pass, queues = dev->online_queues - 1;
-	unsigned long timeout;
 	u8 opcode = nvme_admin_delete_sq;
 
 	for (pass = 0; pass < 2; pass++) {
@@ -2070,15 +2069,12 @@ static void nvme_disable_io_queues(struct nvme_dev *dev)
 
 		reinit_completion(&dev->ioq_wait);
  retry:
-		timeout = ADMIN_TIMEOUT;
 		for (; i > 0; i--, sent++)
 			if (nvme_delete_queue(&dev->queues[i], opcode))
 				break;
 
 		while (sent--) {
-			timeout = wait_for_completion_io_timeout(&dev->ioq_wait, timeout);
-			if (timeout == 0)
-				return;
+			wait_for_completion(&dev->ioq_wait);
 			if (i)
 				goto retry;
 		}
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH V4 5/5] nvme-pci: add the timeout case for DELETEING state
  2018-03-08  6:19 ` Jianchao Wang
@ 2018-03-08  6:19   ` Jianchao Wang
  -1 siblings, 0 replies; 38+ messages in thread
From: Jianchao Wang @ 2018-03-08  6:19 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

When the controller is being removed, blk_cleanup_queue will try
to drain the queues. At the moment, if the controller no response,
because of DELETEING state, reset_work will not be able to be
scheduled, and completion of the expired request is deferred to
nvme_dev_disable, blk_cleanup_queue will hang forever. Add case
for DELETEING in nvme_timeout, when abort fails, disable the
controller and complete the request directly.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/pci.c | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6c7c19cb..ac9efcd 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1261,11 +1261,30 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	}
 
 	if (nvmeq->qid) {
-		if (dev->ctrl.state == NVME_CTRL_RESETTING ||
-				iod->aborted)
+		switch (dev->ctrl.state) {
+		case NVME_CTRL_RESETTING:
 			action = RESET;
-		else
-			action = ABORT;
+			break;
+		case NVME_CTRL_DELETING:
+			/*
+			 * When ctrl is being removed, we try to abort the
+			 * expired request first, if success, it will be
+			 * requeued, otherwise, disable the controller and
+			 * complete it directly, because we cannot schedule
+			 * the reset_work to do recovery in DELELTING state.
+			 */
+			if (iod->aborted)
+				action = DISABLE;
+			else
+				action = ABORT;
+			break;
+		default:
+			if (iod->aborted)
+				action = RESET;
+			else
+				action = ABORT;
+			break;
+		}
 	} else {
 		/*
 		 * Disable immediately if controller times out while disabling/
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH V4 5/5] nvme-pci: add the timeout case for DELETEING state
@ 2018-03-08  6:19   ` Jianchao Wang
  0 siblings, 0 replies; 38+ messages in thread
From: Jianchao Wang @ 2018-03-08  6:19 UTC (permalink / raw)


When the controller is being removed, blk_cleanup_queue will try
to drain the queues. At the moment, if the controller no response,
because of DELETEING state, reset_work will not be able to be
scheduled, and completion of the expired request is deferred to
nvme_dev_disable, blk_cleanup_queue will hang forever. Add case
for DELETEING in nvme_timeout, when abort fails, disable the
controller and complete the request directly.

Signed-off-by: Jianchao Wang <jianchao.w.wang at oracle.com>
---
 drivers/nvme/host/pci.c | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6c7c19cb..ac9efcd 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1261,11 +1261,30 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	}
 
 	if (nvmeq->qid) {
-		if (dev->ctrl.state == NVME_CTRL_RESETTING ||
-				iod->aborted)
+		switch (dev->ctrl.state) {
+		case NVME_CTRL_RESETTING:
 			action = RESET;
-		else
-			action = ABORT;
+			break;
+		case NVME_CTRL_DELETING:
+			/*
+			 * When ctrl is being removed, we try to abort the
+			 * expired request first, if success, it will be
+			 * requeued, otherwise, disable the controller and
+			 * complete it directly, because we cannot schedule
+			 * the reset_work to do recovery in DELELTING state.
+			 */
+			if (iod->aborted)
+				action = DISABLE;
+			else
+				action = ABORT;
+			break;
+		default:
+			if (iod->aborted)
+				action = RESET;
+			else
+				action = ABORT;
+			break;
+		}
 	} else {
 		/*
 		 * Disable immediately if controller times out while disabling/
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 1/5] nvme: do atomically bit operations on nvme_request.flags
  2018-03-08  6:19   ` Jianchao Wang
@ 2018-03-08  7:57     ` Christoph Hellwig
  -1 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2018-03-08  7:57 UTC (permalink / raw)
  To: Jianchao Wang; +Cc: keith.busch, axboe, hch, sagi, linux-nvme, linux-kernel

On Thu, Mar 08, 2018 at 02:19:27PM +0800, Jianchao Wang wrote:
> Do atomically bit operations on nvme_request.flags instead of
> regular read/write, then we could add other flags and set/clear
> them safely.
> 
> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>

Looks good, assuming an actual need for this shows up in the next
patches :)

Reviewed-by: Christoph Hellwig <hch@lst.de>

> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index e80fd74..02097e8 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -93,8 +93,8 @@ struct nvme_request {
>  	struct nvme_command	*cmd;
>  	union nvme_result	result;
>  	u8			retries;
> -	u8			flags;
>  	u16			status;
> +	unsigned long flags;

Please align the field name like the others, though.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH V4 1/5] nvme: do atomically bit operations on nvme_request.flags
@ 2018-03-08  7:57     ` Christoph Hellwig
  0 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2018-03-08  7:57 UTC (permalink / raw)


On Thu, Mar 08, 2018@02:19:27PM +0800, Jianchao Wang wrote:
> Do atomically bit operations on nvme_request.flags instead of
> regular read/write, then we could add other flags and set/clear
> them safely.
> 
> Signed-off-by: Jianchao Wang <jianchao.w.wang at oracle.com>

Looks good, assuming an actual need for this shows up in the next
patches :)

Reviewed-by: Christoph Hellwig <hch at lst.de>

> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index e80fd74..02097e8 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -93,8 +93,8 @@ struct nvme_request {
>  	struct nvme_command	*cmd;
>  	union nvme_result	result;
>  	u8			retries;
> -	u8			flags;
>  	u16			status;
> +	unsigned long flags;

Please align the field name like the others, though.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 2/5] nvme: add helper interface to flush in-flight requests
  2018-03-08  6:19   ` Jianchao Wang
@ 2018-03-08 13:11     ` Ming Lei
  -1 siblings, 0 replies; 38+ messages in thread
From: Ming Lei @ 2018-03-08 13:11 UTC (permalink / raw)
  To: Jianchao Wang
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Linux Kernel Mailing List, linux-nvme

On Thu, Mar 8, 2018 at 2:19 PM, Jianchao Wang
<jianchao.w.wang@oracle.com> wrote:
> Currently, we use nvme_cancel_request to complete the request
> forcedly. This has following defects:
>  - It is not safe to race with the normal completion path.
>    blk_mq_complete_request is ok to race with timeout path,
>    but not with itself.

The irq path shouldn't be raced with nvme_cancel_request()
because io queues are suspended before calling nvme_cancel_request().

Could you please explain a bit why one same request can be
completed at the same time via blk_mq_complete_request()?

>  - Cannot ensure all the requests have been handled. The timeout
>    path may grab some expired requests, blk_mq_complete_request
>    cannot touch them.
>
> add two helper interface to flush in-flight requests more safely.
> - nvme_abort_requests_sync
> use nvme_abort_req to timeout all the in-flight requests and wait
> until timeout work and irq completion path completes. More details
> please refer to the comment of this interface.
> - nvme_flush_aborted_requests
> complete the requests 'aborted' by nvme_abort_requests_sync. It will
> be invoked after the controller is disabled/shutdown.

IMO, the helper's name of 'abort' is very misleading since the request
isn't aborted actually, it is just cancelled from dispatched state, once
it is cancelled, most of times the request is just re-inserted to sw
queue or scheduler queue. After NVMe controller is resetted successfully,
these request will be dispatched again.

So please keep the name of 'cancel' or use sort of name.

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH V4 2/5] nvme: add helper interface to flush in-flight requests
@ 2018-03-08 13:11     ` Ming Lei
  0 siblings, 0 replies; 38+ messages in thread
From: Ming Lei @ 2018-03-08 13:11 UTC (permalink / raw)


On Thu, Mar 8, 2018 at 2:19 PM, Jianchao Wang
<jianchao.w.wang@oracle.com> wrote:
> Currently, we use nvme_cancel_request to complete the request
> forcedly. This has following defects:
>  - It is not safe to race with the normal completion path.
>    blk_mq_complete_request is ok to race with timeout path,
>    but not with itself.

The irq path shouldn't be raced with nvme_cancel_request()
because io queues are suspended before calling nvme_cancel_request().

Could you please explain a bit why one same request can be
completed at the same time via blk_mq_complete_request()?

>  - Cannot ensure all the requests have been handled. The timeout
>    path may grab some expired requests, blk_mq_complete_request
>    cannot touch them.
>
> add two helper interface to flush in-flight requests more safely.
> - nvme_abort_requests_sync
> use nvme_abort_req to timeout all the in-flight requests and wait
> until timeout work and irq completion path completes. More details
> please refer to the comment of this interface.
> - nvme_flush_aborted_requests
> complete the requests 'aborted' by nvme_abort_requests_sync. It will
> be invoked after the controller is disabled/shutdown.

IMO, the helper's name of 'abort' is very misleading since the request
isn't aborted actually, it is just cancelled from dispatched state, once
it is cancelled, most of times the request is just re-inserted to sw
queue or scheduler queue. After NVMe controller is resetted successfully,
these request will be dispatched again.

So please keep the name of 'cancel' or use sort of name.

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 1/5] nvme: do atomically bit operations on nvme_request.flags
  2018-03-08  7:57     ` Christoph Hellwig
@ 2018-03-08 14:32       ` jianchao.wang
  -1 siblings, 0 replies; 38+ messages in thread
From: jianchao.wang @ 2018-03-08 14:32 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: axboe, sagi, linux-kernel, linux-nvme, keith.busch

Hi Christoph

Thanks for your precious time for reviewing this.

On 03/08/2018 03:57 PM, Christoph Hellwig wrote:
>> -	u8			flags;
>>  	u16			status;
>> +	unsigned long flags;
> Please align the field name like the others, though

Yes, I will change this next version.

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH V4 1/5] nvme: do atomically bit operations on nvme_request.flags
@ 2018-03-08 14:32       ` jianchao.wang
  0 siblings, 0 replies; 38+ messages in thread
From: jianchao.wang @ 2018-03-08 14:32 UTC (permalink / raw)


Hi Christoph

Thanks for your precious time for reviewing this.

On 03/08/2018 03:57 PM, Christoph Hellwig wrote:
>> -	u8			flags;
>>  	u16			status;
>> +	unsigned long flags;
> Please align the field name like the others, though

Yes, I will change this next version.

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 2/5] nvme: add helper interface to flush in-flight requests
  2018-03-08 13:11     ` Ming Lei
@ 2018-03-08 14:44       ` jianchao.wang
  -1 siblings, 0 replies; 38+ messages in thread
From: jianchao.wang @ 2018-03-08 14:44 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Sagi Grimberg, Linux Kernel Mailing List, linux-nvme,
	Keith Busch, Christoph Hellwig

Hi Ming

Thanks for your precious time for reviewing and comment.

On 03/08/2018 09:11 PM, Ming Lei wrote:
> On Thu, Mar 8, 2018 at 2:19 PM, Jianchao Wang
> <jianchao.w.wang@oracle.com> wrote:
>> Currently, we use nvme_cancel_request to complete the request
>> forcedly. This has following defects:
>>  - It is not safe to race with the normal completion path.
>>    blk_mq_complete_request is ok to race with timeout path,
>>    but not with itself.
> 
> The irq path shouldn't be raced with nvme_cancel_request()
> because io queues are suspended before calling nvme_cancel_request().
> 
> Could you please explain a bit why one same request can be
> completed at the same time via blk_mq_complete_request()?

In fact, this interface will be used before suspend ioqs and disable controller.
Then the timeout path could be more clearly when we issue adminq commands during
nvme_dev_disable. Otherwise, it is hard to distinguish which is from previous workload
, which is from nvme_dev_disable. We will take different action for them.

>>  - Cannot ensure all the requests have been handled. The timeout
>>    path may grab some expired requests, blk_mq_complete_request
>>    cannot touch them.
>>
>> add two helper interface to flush in-flight requests more safely.
>> - nvme_abort_requests_sync
>> use nvme_abort_req to timeout all the in-flight requests and wait
>> until timeout work and irq completion path completes. More details
>> please refer to the comment of this interface.
>> - nvme_flush_aborted_requests
>> complete the requests 'aborted' by nvme_abort_requests_sync. It will
>> be invoked after the controller is disabled/shutdown.
> 
> IMO, the helper's name of 'abort' is very misleading since the request
> isn't aborted actually, it is just cancelled from dispatched state, once
> it is cancelled, most of times the request is just re-inserted to sw
> queue or scheduler queue. After NVMe controller is resetted successfully,
> these request will be dispatched again.
> 
> So please keep the name of 'cancel' or use sort of name.

Yes, it is indeed misleading.
In fact, this 'abort' inherits from the blk_abort_request which is invoked by
nvme_abort_req.
 
Thanks
Jianchao

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH V4 2/5] nvme: add helper interface to flush in-flight requests
@ 2018-03-08 14:44       ` jianchao.wang
  0 siblings, 0 replies; 38+ messages in thread
From: jianchao.wang @ 2018-03-08 14:44 UTC (permalink / raw)


Hi Ming

Thanks for your precious time for reviewing and comment.

On 03/08/2018 09:11 PM, Ming Lei wrote:
> On Thu, Mar 8, 2018 at 2:19 PM, Jianchao Wang
> <jianchao.w.wang@oracle.com> wrote:
>> Currently, we use nvme_cancel_request to complete the request
>> forcedly. This has following defects:
>>  - It is not safe to race with the normal completion path.
>>    blk_mq_complete_request is ok to race with timeout path,
>>    but not with itself.
> 
> The irq path shouldn't be raced with nvme_cancel_request()
> because io queues are suspended before calling nvme_cancel_request().
> 
> Could you please explain a bit why one same request can be
> completed at the same time via blk_mq_complete_request()?

In fact, this interface will be used before suspend ioqs and disable controller.
Then the timeout path could be more clearly when we issue adminq commands during
nvme_dev_disable. Otherwise, it is hard to distinguish which is from previous workload
, which is from nvme_dev_disable. We will take different action for them.

>>  - Cannot ensure all the requests have been handled. The timeout
>>    path may grab some expired requests, blk_mq_complete_request
>>    cannot touch them.
>>
>> add two helper interface to flush in-flight requests more safely.
>> - nvme_abort_requests_sync
>> use nvme_abort_req to timeout all the in-flight requests and wait
>> until timeout work and irq completion path completes. More details
>> please refer to the comment of this interface.
>> - nvme_flush_aborted_requests
>> complete the requests 'aborted' by nvme_abort_requests_sync. It will
>> be invoked after the controller is disabled/shutdown.
> 
> IMO, the helper's name of 'abort' is very misleading since the request
> isn't aborted actually, it is just cancelled from dispatched state, once
> it is cancelled, most of times the request is just re-inserted to sw
> queue or scheduler queue. After NVMe controller is resetted successfully,
> these request will be dispatched again.
> 
> So please keep the name of 'cancel' or use sort of name.

Yes, it is indeed misleading.
In fact, this 'abort' inherits from the blk_abort_request which is invoked by
nvme_abort_req.
 
Thanks
Jianchao

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 2/5] nvme: add helper interface to flush in-flight requests
  2018-03-08  6:19   ` Jianchao Wang
@ 2018-03-08 18:21     ` Sagi Grimberg
  -1 siblings, 0 replies; 38+ messages in thread
From: Sagi Grimberg @ 2018-03-08 18:21 UTC (permalink / raw)
  To: Jianchao Wang, keith.busch, axboe, hch; +Cc: linux-nvme, linux-kernel



On 03/08/2018 08:19 AM, Jianchao Wang wrote:
> Currently, we use nvme_cancel_request to complete the request
> forcedly. This has following defects:
>   - It is not safe to race with the normal completion path.
>     blk_mq_complete_request is ok to race with timeout path,
>     but not with itself.
>   - Cannot ensure all the requests have been handled. The timeout
>     path may grab some expired requests, blk_mq_complete_request
>     cannot touch them.
> 
> add two helper interface to flush in-flight requests more safely.
> - nvme_abort_requests_sync
> use nvme_abort_req to timeout all the in-flight requests and wait
> until timeout work and irq completion path completes. More details
> please refer to the comment of this interface.
> - nvme_flush_aborted_requests
> complete the requests 'aborted' by nvme_abort_requests_sync. It will
> be invoked after the controller is disabled/shutdown.
> 
> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
> ---
>   drivers/nvme/host/core.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++++
>   drivers/nvme/host/nvme.h |  4 +-
>   2 files changed, 99 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 7b8df47..e26759b 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3567,6 +3567,102 @@ void nvme_start_queues(struct nvme_ctrl *ctrl)
>   }
>   EXPORT_SYMBOL_GPL(nvme_start_queues);
>   
> +static void nvme_abort_req(struct request *req, void *data, bool reserved)
> +{
> +	if (!blk_mq_request_started(req))
> +		return;
> +
> +	dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device,
> +				"Abort I/O %d", req->tag);
> +
> +	/* The timeout path need identify this flag and return
> +	 * BLK_EH_NOT_HANDLED, then the request will not be completed.
> +	 * we will defer the completion after the controller is disabled or
> +	 * shutdown.
> +	 */
> +	set_bit(NVME_REQ_ABORTED, &nvme_req(req)->flags);
> +	blk_abort_request(req);
> +}
> +
> +/*
> + * This function will ensure all the in-flight requests on the
> + * controller to be handled by timeout path or irq completion path.
> + * It has to pair with nvme_flush_aborted_requests which will be
> + * invoked after the controller has been disabled/shutdown and
> + * complete the requests 'aborted' by nvme_abort_req.
> + *
> + * Note, the driver layer will not be quiescent before disable
> + * controller, because the requests aborted by blk_abort_request
> + * are still active and the irq will fire at any time, but it can
> + * not enter into completion path, because the request has been
> + * timed out.
> + */
> +void nvme_abort_requests_sync(struct nvme_ctrl *ctrl)
> +{
> +	struct nvme_ns *ns;
> +
> +	blk_mq_tagset_busy_iter(ctrl->tagset, nvme_abort_req, ctrl);
> +	blk_mq_tagset_busy_iter(ctrl->admin_tagset, nvme_abort_req, ctrl);
> +	/*
> +	 * ensure the timeout_work is queued, thus needn't to sync
> +	 * the timer
> +	 */
> +	kblockd_schedule_work(&ctrl->admin_q->timeout_work);
> +
> +	down_read(&ctrl->namespaces_rwsem);
> +
> +	list_for_each_entry(ns, &ctrl->namespaces, list)
> +		kblockd_schedule_work(&ns->queue->timeout_work);
> +
> +	list_for_each_entry(ns, &ctrl->namespaces, list)
> +		flush_work(&ns->queue->timeout_work);
> +
> +	up_read(&ctrl->namespaces_rwsem);
> +	/* This will ensure all the nvme irq completion path have exited */
> +	synchronize_sched();
> +}
> +EXPORT_SYMBOL_GPL(nvme_abort_requests_sync);
> +
> +static void nvme_comp_req(struct request *req, void *data, bool reserved)

Not a very good name...

> +{
> +	struct nvme_ctrl *ctrl = (struct nvme_ctrl *)data;
> +
> +	if (!test_bit(NVME_REQ_ABORTED, &nvme_req(req)->flags))
> +		return;
> +
> +	WARN_ON(!blk_mq_request_started(req));
> +
> +	if (ctrl->tagset && ctrl->tagset->ops->complete) {

What happens when this called on the admin tagset when the controller
does not have an io tagset?

> +		clear_bit(NVME_REQ_ABORTED, &nvme_req(req)->flags);
> +		/*
> +		 * We set the status to NVME_SC_ABORT_REQ, then ioq request
> +		 * will be requeued and adminq one will be failed.
> +		 */
> +		nvme_req(req)->status = NVME_SC_ABORT_REQ;
> +		/*
> +		 * For ioq request, blk_mq_requeue_request should be better
> +		 * here. But the nvme code will still setup the cmd even if
> +		 * the RQF_DONTPREP is set. We have to use .complete to free
> +		 * the cmd and then requeue it.

IMO, its better to fix nvme to not setup the command if RQF_DONTPREP is 
on (other than the things it must setup).

> +		 *
> +		 * For adminq request, invoking .complete directly will miss
> +		 * blk_mq_sched_completed_request, but this is ok because we
> +		 * won't have io scheduler for adminq.
> +		 */
> +		ctrl->tagset->ops->complete(req);

I don't think that directly calling .complete is a good idea...

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH V4 2/5] nvme: add helper interface to flush in-flight requests
@ 2018-03-08 18:21     ` Sagi Grimberg
  0 siblings, 0 replies; 38+ messages in thread
From: Sagi Grimberg @ 2018-03-08 18:21 UTC (permalink / raw)




On 03/08/2018 08:19 AM, Jianchao Wang wrote:
> Currently, we use nvme_cancel_request to complete the request
> forcedly. This has following defects:
>   - It is not safe to race with the normal completion path.
>     blk_mq_complete_request is ok to race with timeout path,
>     but not with itself.
>   - Cannot ensure all the requests have been handled. The timeout
>     path may grab some expired requests, blk_mq_complete_request
>     cannot touch them.
> 
> add two helper interface to flush in-flight requests more safely.
> - nvme_abort_requests_sync
> use nvme_abort_req to timeout all the in-flight requests and wait
> until timeout work and irq completion path completes. More details
> please refer to the comment of this interface.
> - nvme_flush_aborted_requests
> complete the requests 'aborted' by nvme_abort_requests_sync. It will
> be invoked after the controller is disabled/shutdown.
> 
> Signed-off-by: Jianchao Wang <jianchao.w.wang at oracle.com>
> ---
>   drivers/nvme/host/core.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++++
>   drivers/nvme/host/nvme.h |  4 +-
>   2 files changed, 99 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 7b8df47..e26759b 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3567,6 +3567,102 @@ void nvme_start_queues(struct nvme_ctrl *ctrl)
>   }
>   EXPORT_SYMBOL_GPL(nvme_start_queues);
>   
> +static void nvme_abort_req(struct request *req, void *data, bool reserved)
> +{
> +	if (!blk_mq_request_started(req))
> +		return;
> +
> +	dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device,
> +				"Abort I/O %d", req->tag);
> +
> +	/* The timeout path need identify this flag and return
> +	 * BLK_EH_NOT_HANDLED, then the request will not be completed.
> +	 * we will defer the completion after the controller is disabled or
> +	 * shutdown.
> +	 */
> +	set_bit(NVME_REQ_ABORTED, &nvme_req(req)->flags);
> +	blk_abort_request(req);
> +}
> +
> +/*
> + * This function will ensure all the in-flight requests on the
> + * controller to be handled by timeout path or irq completion path.
> + * It has to pair with nvme_flush_aborted_requests which will be
> + * invoked after the controller has been disabled/shutdown and
> + * complete the requests 'aborted' by nvme_abort_req.
> + *
> + * Note, the driver layer will not be quiescent before disable
> + * controller, because the requests aborted by blk_abort_request
> + * are still active and the irq will fire at any time, but it can
> + * not enter into completion path, because the request has been
> + * timed out.
> + */
> +void nvme_abort_requests_sync(struct nvme_ctrl *ctrl)
> +{
> +	struct nvme_ns *ns;
> +
> +	blk_mq_tagset_busy_iter(ctrl->tagset, nvme_abort_req, ctrl);
> +	blk_mq_tagset_busy_iter(ctrl->admin_tagset, nvme_abort_req, ctrl);
> +	/*
> +	 * ensure the timeout_work is queued, thus needn't to sync
> +	 * the timer
> +	 */
> +	kblockd_schedule_work(&ctrl->admin_q->timeout_work);
> +
> +	down_read(&ctrl->namespaces_rwsem);
> +
> +	list_for_each_entry(ns, &ctrl->namespaces, list)
> +		kblockd_schedule_work(&ns->queue->timeout_work);
> +
> +	list_for_each_entry(ns, &ctrl->namespaces, list)
> +		flush_work(&ns->queue->timeout_work);
> +
> +	up_read(&ctrl->namespaces_rwsem);
> +	/* This will ensure all the nvme irq completion path have exited */
> +	synchronize_sched();
> +}
> +EXPORT_SYMBOL_GPL(nvme_abort_requests_sync);
> +
> +static void nvme_comp_req(struct request *req, void *data, bool reserved)

Not a very good name...

> +{
> +	struct nvme_ctrl *ctrl = (struct nvme_ctrl *)data;
> +
> +	if (!test_bit(NVME_REQ_ABORTED, &nvme_req(req)->flags))
> +		return;
> +
> +	WARN_ON(!blk_mq_request_started(req));
> +
> +	if (ctrl->tagset && ctrl->tagset->ops->complete) {

What happens when this called on the admin tagset when the controller
does not have an io tagset?

> +		clear_bit(NVME_REQ_ABORTED, &nvme_req(req)->flags);
> +		/*
> +		 * We set the status to NVME_SC_ABORT_REQ, then ioq request
> +		 * will be requeued and adminq one will be failed.
> +		 */
> +		nvme_req(req)->status = NVME_SC_ABORT_REQ;
> +		/*
> +		 * For ioq request, blk_mq_requeue_request should be better
> +		 * here. But the nvme code will still setup the cmd even if
> +		 * the RQF_DONTPREP is set. We have to use .complete to free
> +		 * the cmd and then requeue it.

IMO, its better to fix nvme to not setup the command if RQF_DONTPREP is 
on (other than the things it must setup).

> +		 *
> +		 * For adminq request, invoking .complete directly will miss
> +		 * blk_mq_sched_completed_request, but this is ok because we
> +		 * won't have io scheduler for adminq.
> +		 */
> +		ctrl->tagset->ops->complete(req);

I don't think that directly calling .complete is a good idea...

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 2/5] nvme: add helper interface to flush in-flight requests
  2018-03-08 18:21     ` Sagi Grimberg
@ 2018-03-09  1:59       ` jianchao.wang
  -1 siblings, 0 replies; 38+ messages in thread
From: jianchao.wang @ 2018-03-09  1:59 UTC (permalink / raw)
  To: Sagi Grimberg, keith.busch, axboe, hch; +Cc: linux-kernel, linux-nvme

Hi Sagi

Thanks for your precious time for review and comment.

On 03/09/2018 02:21 AM, Sagi Grimberg wrote:

>> +EXPORT_SYMBOL_GPL(nvme_abort_requests_sync);
>> +
>> +static void nvme_comp_req(struct request *req, void *data, bool reserved)
> 
> Not a very good name...

Yes, indeed.

> 
>> +{
>> +    struct nvme_ctrl *ctrl = (struct nvme_ctrl *)data;
>> +
>> +    if (!test_bit(NVME_REQ_ABORTED, &nvme_req(req)->flags))
>> +        return;
>> +
>> +    WARN_ON(!blk_mq_request_started(req));
>> +
>> +    if (ctrl->tagset && ctrl->tagset->ops->complete) {
> 
> What happens when this called on the admin tagset when the controller
> does not have an io tagset?

Yes, nvme_ctrl.admin_tagset should be used here for adminq request.


> 
>> +        clear_bit(NVME_REQ_ABORTED, &nvme_req(req)->flags);
>> +        /*
>> +         * We set the status to NVME_SC_ABORT_REQ, then ioq request
>> +         * will be requeued and adminq one will be failed.
>> +         */
>> +        nvme_req(req)->status = NVME_SC_ABORT_REQ;
>> +        /*
>> +         * For ioq request, blk_mq_requeue_request should be better
>> +         * here. But the nvme code will still setup the cmd even if
>> +         * the RQF_DONTPREP is set. We have to use .complete to free
>> +         * the cmd and then requeue it.
> 
> IMO, its better to fix nvme to not setup the command if RQF_DONTPREP is on (other than the things it must setup).

Yes, I used to consider change that.
But I'm not sure whether there is special consideration there.
If you and Keith think it is ok that not setup command when RQF_DONTPREP is set, we could do that.

> 
>> +         *
>> +         * For adminq request, invoking .complete directly will miss
>> +         * blk_mq_sched_completed_request, but this is ok because we
>> +         * won't have io scheduler for adminq.
>> +         */
>> +        ctrl->tagset->ops->complete(req);
> 
> I don't think that directly calling .complete is a good idea...

Yes, indeed.

Thanks a lot 
Jianchao

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH V4 2/5] nvme: add helper interface to flush in-flight requests
@ 2018-03-09  1:59       ` jianchao.wang
  0 siblings, 0 replies; 38+ messages in thread
From: jianchao.wang @ 2018-03-09  1:59 UTC (permalink / raw)


Hi Sagi

Thanks for your precious time for review and comment.

On 03/09/2018 02:21 AM, Sagi Grimberg wrote:

>> +EXPORT_SYMBOL_GPL(nvme_abort_requests_sync);
>> +
>> +static void nvme_comp_req(struct request *req, void *data, bool reserved)
> 
> Not a very good name...

Yes, indeed.

> 
>> +{
>> +??? struct nvme_ctrl *ctrl = (struct nvme_ctrl *)data;
>> +
>> +??? if (!test_bit(NVME_REQ_ABORTED, &nvme_req(req)->flags))
>> +??????? return;
>> +
>> +??? WARN_ON(!blk_mq_request_started(req));
>> +
>> +??? if (ctrl->tagset && ctrl->tagset->ops->complete) {
> 
> What happens when this called on the admin tagset when the controller
> does not have an io tagset?

Yes, nvme_ctrl.admin_tagset should be used here for adminq request.


> 
>> +??????? clear_bit(NVME_REQ_ABORTED, &nvme_req(req)->flags);
>> +??????? /*
>> +???????? * We set the status to NVME_SC_ABORT_REQ, then ioq request
>> +???????? * will be requeued and adminq one will be failed.
>> +???????? */
>> +??????? nvme_req(req)->status = NVME_SC_ABORT_REQ;
>> +??????? /*
>> +???????? * For ioq request, blk_mq_requeue_request should be better
>> +???????? * here. But the nvme code will still setup the cmd even if
>> +???????? * the RQF_DONTPREP is set. We have to use .complete to free
>> +???????? * the cmd and then requeue it.
> 
> IMO, its better to fix nvme to not setup the command if RQF_DONTPREP is on (other than the things it must setup).

Yes, I used to consider change that.
But I'm not sure whether there is special consideration there.
If you and Keith think it is ok that not setup command when RQF_DONTPREP is set, we could do that.

> 
>> +???????? *
>> +???????? * For adminq request, invoking .complete directly will miss
>> +???????? * blk_mq_sched_completed_request, but this is ok because we
>> +???????? * won't have io scheduler for adminq.
>> +???????? */
>> +??????? ctrl->tagset->ops->complete(req);
> 
> I don't think that directly calling .complete is a good idea...

Yes, indeed.

Thanks a lot 
Jianchao

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 3/5] nvme-pci: avoid nvme_dev_disable to be invoked in nvme_timeout
  2018-03-08  6:19   ` Jianchao Wang
@ 2018-03-09  2:01     ` jianchao.wang
  -1 siblings, 0 replies; 38+ messages in thread
From: jianchao.wang @ 2018-03-09  2:01 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi; +Cc: linux-kernel, linux-nvme

Hi Keith

Can I have the honor of getting your comment on this patch?

Thanks in advance
Jianchao

On 03/08/2018 02:19 PM, Jianchao Wang wrote:
> nvme_dev_disable will issue command on adminq to clear HMB and
> delete io cq/sqs, maybe more in the future. When adminq no response,
> it has to depends on timeout path. However, nvme_timeout has to
> invoke nvme_dev_disable before return, so that the DMA mappings
> could be released safely. This will introduce dangerous circular
> dependence. Moreover, the whole nvme_dev_disable is under
> shutdown_lock even waiting for the command, this makes things
> worse.
> 
> To avoid this, this patch refactors the nvme_timeout. The basic
> principle is:
>  - When need to schedule reset_work, hand over expired requests
>    to nvme_dev_disable. They will be completed after the controller
>    is disabled/shtudown.
>  - When requests from nvme_dev_disable and nvme_reset_work expires,
>    disable the controller directly then the request could be completed
>    to wakeup the waiter. nvme_pci_disable_ctrl_directly is introduced
>    for this, it doesn't send commands on adminq and the shutdown_lock
>    is not needed here, because the nvme_abort_requests_sync in
>    nvme_dev_disable could synchronize with nvme_timeout.
> 
> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
> ---
>  drivers/nvme/host/pci.c | 199 ++++++++++++++++++++++++++++++++----------------
>  1 file changed, 134 insertions(+), 65 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index e186158..ce09057 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -70,6 +70,7 @@ struct nvme_queue;
>  
>  static void nvme_process_cq(struct nvme_queue *nvmeq);
>  static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
> +static void nvme_pci_disable_ctrl_directly(struct nvme_dev *dev);
>  
>  /*
>   * Represents an NVM Express device.  Each nvme_dev is a PCI function.
> @@ -98,6 +99,7 @@ struct nvme_dev {
>  	u32 cmbloc;
>  	struct nvme_ctrl ctrl;
>  	struct completion ioq_wait;
> +	bool inflight_flushed;
>  
>  	/* shadow doorbell buffer support: */
>  	u32 *dbbuf_dbs;
> @@ -1180,73 +1182,13 @@ static void nvme_warn_reset(struct nvme_dev *dev, u32 csts)
>  			 csts, result);
>  }
>  
> -static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
> +static enum blk_eh_timer_return nvme_pci_abort_io_req(struct request *req)
>  {
>  	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
>  	struct nvme_queue *nvmeq = iod->nvmeq;
>  	struct nvme_dev *dev = nvmeq->dev;
> -	struct request *abort_req;
>  	struct nvme_command cmd;
> -	u32 csts = readl(dev->bar + NVME_REG_CSTS);
> -
> -	/*
> -	 * Reset immediately if the controller is failed
> -	 */
> -	if (nvme_should_reset(dev, csts)) {
> -		nvme_warn_reset(dev, csts);
> -		nvme_dev_disable(dev, false);
> -		nvme_reset_ctrl(&dev->ctrl);
> -		return BLK_EH_HANDLED;
> -	}
> -
> -	/*
> -	 * Did we miss an interrupt?
> -	 */
> -	if (__nvme_poll(nvmeq, req->tag)) {
> -		dev_warn(dev->ctrl.device,
> -			 "I/O %d QID %d timeout, completion polled\n",
> -			 req->tag, nvmeq->qid);
> -		return BLK_EH_HANDLED;
> -	}
> -
> -	/*
> -	 * Shutdown immediately if controller times out while starting. The
> -	 * reset work will see the pci device disabled when it gets the forced
> -	 * cancellation error. All outstanding requests are completed on
> -	 * shutdown, so we return BLK_EH_HANDLED.
> -	 */
> -	switch (dev->ctrl.state) {
> -	case NVME_CTRL_CONNECTING:
> -	case NVME_CTRL_RESETTING:
> -		dev_warn(dev->ctrl.device,
> -			 "I/O %d QID %d timeout, disable controller\n",
> -			 req->tag, nvmeq->qid);
> -		nvme_dev_disable(dev, false);
> -		set_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags);
> -		return BLK_EH_HANDLED;
> -	default:
> -		break;
> -	}
> -
> -	/*
> - 	 * Shutdown the controller immediately and schedule a reset if the
> - 	 * command was already aborted once before and still hasn't been
> - 	 * returned to the driver, or if this is the admin queue.
> -	 */
> -	if (!nvmeq->qid || iod->aborted) {
> -		dev_warn(dev->ctrl.device,
> -			 "I/O %d QID %d timeout, reset controller\n",
> -			 req->tag, nvmeq->qid);
> -		nvme_dev_disable(dev, false);
> -		nvme_reset_ctrl(&dev->ctrl);
> -
> -		/*
> -		 * Mark the request as handled, since the inline shutdown
> -		 * forces all outstanding requests to complete.
> -		 */
> -		set_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags);
> -		return BLK_EH_HANDLED;
> -	}
> +	struct request *abort_req;
>  
>  	if (atomic_dec_return(&dev->ctrl.abort_limit) < 0) {
>  		atomic_inc(&dev->ctrl.abort_limit);
> @@ -1282,6 +1224,105 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>  	return BLK_EH_RESET_TIMER;
>  }
>  
> +static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
> +{
> +	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> +	struct nvme_queue *nvmeq = iod->nvmeq;
> +	struct nvme_dev *dev = nvmeq->dev;
> +	u32 csts = readl(dev->bar + NVME_REG_CSTS);
> +	enum {ABORT, RESET, DISABLE} action;
> +	enum blk_eh_timer_return ret;
> +	/*
> +	 * This is for nvme_abort_req. This request will be completed
> +	 * by nvme_flush_aborted_requests after the controller is
> +	 * disabled/shutdown
> +	 */
> +	if (test_bit(NVME_REQ_ABORTED, &nvme_req(req)->flags))
> +		return BLK_EH_NOT_HANDLED;
> +
> +	/*
> +	 * Reset immediately if the controller is failed.
> +	 * Defer the completion to nvme_flush_aborted_requests.
> +	 */
> +	if (nvme_should_reset(dev, csts)) {
> +		nvme_warn_reset(dev, csts);
> +		nvme_reset_ctrl(&dev->ctrl);
> +		return BLK_EH_RESET_TIMER;
> +	}
> +
> +	/*
> +	 * Did we miss an interrupt?
> +	 */
> +	if (__nvme_poll(nvmeq, req->tag)) {
> +		dev_warn(dev->ctrl.device,
> +			 "I/O %d QID %d timeout, completion polled\n",
> +			 req->tag, nvmeq->qid);
> +		return BLK_EH_HANDLED;
> +	}
> +
> +	if (nvmeq->qid) {
> +		if (dev->ctrl.state == NVME_CTRL_RESETTING ||
> +				iod->aborted)
> +			action = RESET;
> +		else
> +			action = ABORT;
> +	} else {
> +		/*
> +		 * Disable immediately if controller times out while disabling/
> +		 * shuting down/starting. The nvme_dev_disable/nvme_reset_work
> +		 * will see the error.
> +		 * Note: inflight_flushed is set in nvme_dev_disable when it
> +		 * abort all the inflight requests. Introducing this flag is due
> +		 * to there is no state to represent the shutdown procedure.
> +		 */
> +		if (dev->ctrl.state == NVME_CTRL_CONNECTING ||
> +			dev->inflight_flushed)
> +			action = DISABLE;
> +		else
> +			action = RESET;
> +	}
> +
> +	switch (action) {
> +	case ABORT:
> +		ret = nvme_pci_abort_io_req(req);
> +		break;
> +	case RESET:
> +		dev_warn(dev->ctrl.device,
> +			 "I/O %d QID %d timeout, reset controller\n",
> +			 req->tag, nvmeq->qid);
> +		set_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags);
> +		nvme_reset_ctrl(&dev->ctrl);
> +		/*
> +		 * The reset work will take over this request. nvme_abort_req
> +		 * employs blk_abort_request to force the request to be timed
> +		 * out. So we need to return BLK_EH_RESET_TIMER then the
> +		 * RQF_MQ_TIMEOUT_EXPIRED could be cleared.
> +		 */
> +		ret = BLK_EH_RESET_TIMER;
> +		break;
> +	case DISABLE:
> +		/*
> +		 * We disable the controller directly here so that we could
> +		 * complete the request safely to wake up the nvme_dev_disable/
> +		 * nvme_reset_work who is waiting on adminq. We cannot return
> +		 * BLK_EH_RESET_TIMER to depend on error recovery procedure
> +		 * because it is waiting for timeout path.
> +		 */
> +		dev_warn(dev->ctrl.device,
> +			 "I/O %d QID %d timeout, disable controller\n",
> +			 req->tag, nvmeq->qid);
> +		nvme_pci_disable_ctrl_directly(dev);
> +		set_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags);
> +		ret = BLK_EH_HANDLED;
> +		break;
> +	default:
> +		WARN_ON(1);
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
>  static void nvme_free_queue(struct nvme_queue *nvmeq)
>  {
>  	dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
> @@ -2169,6 +2210,33 @@ static void nvme_pci_disable(struct nvme_dev *dev)
>  	}
>  }
>  
> +/*
> + * This is only invoked by nvme_timeout. shutdown_lock is not needed
> + * here because nvme_abort_requests_sync in nvme_dev_disable will
> + * synchronize with the timeout path.
> + */
> +static void nvme_pci_disable_ctrl_directly(struct nvme_dev *dev)
> +{
> +	int i;
> +	struct pci_dev *pdev = to_pci_dev(dev->dev);
> +	bool dead = true;
> +
> +	if (pci_is_enabled(pdev)) {
> +		u32 csts = readl(dev->bar + NVME_REG_CSTS);
> +
> +		dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
> +			pdev->error_state  != pci_channel_io_normal);
> +
> +		if (!dead)
> +			nvme_disable_ctrl(&dev->ctrl, dev->ctrl.cap);
> +	}
> +
> +	for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
> +		nvme_suspend_queue(&dev->queues[i]);
> +
> +	nvme_pci_disable(dev);
> +}
> +
>  static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>  {
>  	int i;
> @@ -2205,6 +2273,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>  
>  	}
>  	nvme_stop_queues(&dev->ctrl);
> +	nvme_abort_requests_sync(&dev->ctrl);
> +	dev->inflight_flushed = true;
>  
>  	if (!dead) {
>  		nvme_disable_io_queues(dev);
> @@ -2215,9 +2285,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>  
>  	nvme_pci_disable(dev);
>  
> -	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
> -	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
> -
> +	nvme_flush_aborted_requests(&dev->ctrl);
> +	dev->inflight_flushed = false;
>  	/*
>  	 * The driver will not be starting up queues again if shutting down so
>  	 * must flush all entered requests to their failed completion to avoid
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH V4 3/5] nvme-pci: avoid nvme_dev_disable to be invoked in nvme_timeout
@ 2018-03-09  2:01     ` jianchao.wang
  0 siblings, 0 replies; 38+ messages in thread
From: jianchao.wang @ 2018-03-09  2:01 UTC (permalink / raw)


Hi Keith

Can I have the honor of getting your comment on this patch?

Thanks in advance
Jianchao

On 03/08/2018 02:19 PM, Jianchao Wang wrote:
> nvme_dev_disable will issue command on adminq to clear HMB and
> delete io cq/sqs, maybe more in the future. When adminq no response,
> it has to depends on timeout path. However, nvme_timeout has to
> invoke nvme_dev_disable before return, so that the DMA mappings
> could be released safely. This will introduce dangerous circular
> dependence. Moreover, the whole nvme_dev_disable is under
> shutdown_lock even waiting for the command, this makes things
> worse.
> 
> To avoid this, this patch refactors the nvme_timeout. The basic
> principle is:
>  - When need to schedule reset_work, hand over expired requests
>    to nvme_dev_disable. They will be completed after the controller
>    is disabled/shtudown.
>  - When requests from nvme_dev_disable and nvme_reset_work expires,
>    disable the controller directly then the request could be completed
>    to wakeup the waiter. nvme_pci_disable_ctrl_directly is introduced
>    for this, it doesn't send commands on adminq and the shutdown_lock
>    is not needed here, because the nvme_abort_requests_sync in
>    nvme_dev_disable could synchronize with nvme_timeout.
> 
> Signed-off-by: Jianchao Wang <jianchao.w.wang at oracle.com>
> ---
>  drivers/nvme/host/pci.c | 199 ++++++++++++++++++++++++++++++++----------------
>  1 file changed, 134 insertions(+), 65 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index e186158..ce09057 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -70,6 +70,7 @@ struct nvme_queue;
>  
>  static void nvme_process_cq(struct nvme_queue *nvmeq);
>  static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
> +static void nvme_pci_disable_ctrl_directly(struct nvme_dev *dev);
>  
>  /*
>   * Represents an NVM Express device.  Each nvme_dev is a PCI function.
> @@ -98,6 +99,7 @@ struct nvme_dev {
>  	u32 cmbloc;
>  	struct nvme_ctrl ctrl;
>  	struct completion ioq_wait;
> +	bool inflight_flushed;
>  
>  	/* shadow doorbell buffer support: */
>  	u32 *dbbuf_dbs;
> @@ -1180,73 +1182,13 @@ static void nvme_warn_reset(struct nvme_dev *dev, u32 csts)
>  			 csts, result);
>  }
>  
> -static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
> +static enum blk_eh_timer_return nvme_pci_abort_io_req(struct request *req)
>  {
>  	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
>  	struct nvme_queue *nvmeq = iod->nvmeq;
>  	struct nvme_dev *dev = nvmeq->dev;
> -	struct request *abort_req;
>  	struct nvme_command cmd;
> -	u32 csts = readl(dev->bar + NVME_REG_CSTS);
> -
> -	/*
> -	 * Reset immediately if the controller is failed
> -	 */
> -	if (nvme_should_reset(dev, csts)) {
> -		nvme_warn_reset(dev, csts);
> -		nvme_dev_disable(dev, false);
> -		nvme_reset_ctrl(&dev->ctrl);
> -		return BLK_EH_HANDLED;
> -	}
> -
> -	/*
> -	 * Did we miss an interrupt?
> -	 */
> -	if (__nvme_poll(nvmeq, req->tag)) {
> -		dev_warn(dev->ctrl.device,
> -			 "I/O %d QID %d timeout, completion polled\n",
> -			 req->tag, nvmeq->qid);
> -		return BLK_EH_HANDLED;
> -	}
> -
> -	/*
> -	 * Shutdown immediately if controller times out while starting. The
> -	 * reset work will see the pci device disabled when it gets the forced
> -	 * cancellation error. All outstanding requests are completed on
> -	 * shutdown, so we return BLK_EH_HANDLED.
> -	 */
> -	switch (dev->ctrl.state) {
> -	case NVME_CTRL_CONNECTING:
> -	case NVME_CTRL_RESETTING:
> -		dev_warn(dev->ctrl.device,
> -			 "I/O %d QID %d timeout, disable controller\n",
> -			 req->tag, nvmeq->qid);
> -		nvme_dev_disable(dev, false);
> -		set_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags);
> -		return BLK_EH_HANDLED;
> -	default:
> -		break;
> -	}
> -
> -	/*
> - 	 * Shutdown the controller immediately and schedule a reset if the
> - 	 * command was already aborted once before and still hasn't been
> - 	 * returned to the driver, or if this is the admin queue.
> -	 */
> -	if (!nvmeq->qid || iod->aborted) {
> -		dev_warn(dev->ctrl.device,
> -			 "I/O %d QID %d timeout, reset controller\n",
> -			 req->tag, nvmeq->qid);
> -		nvme_dev_disable(dev, false);
> -		nvme_reset_ctrl(&dev->ctrl);
> -
> -		/*
> -		 * Mark the request as handled, since the inline shutdown
> -		 * forces all outstanding requests to complete.
> -		 */
> -		set_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags);
> -		return BLK_EH_HANDLED;
> -	}
> +	struct request *abort_req;
>  
>  	if (atomic_dec_return(&dev->ctrl.abort_limit) < 0) {
>  		atomic_inc(&dev->ctrl.abort_limit);
> @@ -1282,6 +1224,105 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>  	return BLK_EH_RESET_TIMER;
>  }
>  
> +static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
> +{
> +	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> +	struct nvme_queue *nvmeq = iod->nvmeq;
> +	struct nvme_dev *dev = nvmeq->dev;
> +	u32 csts = readl(dev->bar + NVME_REG_CSTS);
> +	enum {ABORT, RESET, DISABLE} action;
> +	enum blk_eh_timer_return ret;
> +	/*
> +	 * This is for nvme_abort_req. This request will be completed
> +	 * by nvme_flush_aborted_requests after the controller is
> +	 * disabled/shutdown
> +	 */
> +	if (test_bit(NVME_REQ_ABORTED, &nvme_req(req)->flags))
> +		return BLK_EH_NOT_HANDLED;
> +
> +	/*
> +	 * Reset immediately if the controller is failed.
> +	 * Defer the completion to nvme_flush_aborted_requests.
> +	 */
> +	if (nvme_should_reset(dev, csts)) {
> +		nvme_warn_reset(dev, csts);
> +		nvme_reset_ctrl(&dev->ctrl);
> +		return BLK_EH_RESET_TIMER;
> +	}
> +
> +	/*
> +	 * Did we miss an interrupt?
> +	 */
> +	if (__nvme_poll(nvmeq, req->tag)) {
> +		dev_warn(dev->ctrl.device,
> +			 "I/O %d QID %d timeout, completion polled\n",
> +			 req->tag, nvmeq->qid);
> +		return BLK_EH_HANDLED;
> +	}
> +
> +	if (nvmeq->qid) {
> +		if (dev->ctrl.state == NVME_CTRL_RESETTING ||
> +				iod->aborted)
> +			action = RESET;
> +		else
> +			action = ABORT;
> +	} else {
> +		/*
> +		 * Disable immediately if controller times out while disabling/
> +		 * shuting down/starting. The nvme_dev_disable/nvme_reset_work
> +		 * will see the error.
> +		 * Note: inflight_flushed is set in nvme_dev_disable when it
> +		 * abort all the inflight requests. Introducing this flag is due
> +		 * to there is no state to represent the shutdown procedure.
> +		 */
> +		if (dev->ctrl.state == NVME_CTRL_CONNECTING ||
> +			dev->inflight_flushed)
> +			action = DISABLE;
> +		else
> +			action = RESET;
> +	}
> +
> +	switch (action) {
> +	case ABORT:
> +		ret = nvme_pci_abort_io_req(req);
> +		break;
> +	case RESET:
> +		dev_warn(dev->ctrl.device,
> +			 "I/O %d QID %d timeout, reset controller\n",
> +			 req->tag, nvmeq->qid);
> +		set_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags);
> +		nvme_reset_ctrl(&dev->ctrl);
> +		/*
> +		 * The reset work will take over this request. nvme_abort_req
> +		 * employs blk_abort_request to force the request to be timed
> +		 * out. So we need to return BLK_EH_RESET_TIMER then the
> +		 * RQF_MQ_TIMEOUT_EXPIRED could be cleared.
> +		 */
> +		ret = BLK_EH_RESET_TIMER;
> +		break;
> +	case DISABLE:
> +		/*
> +		 * We disable the controller directly here so that we could
> +		 * complete the request safely to wake up the nvme_dev_disable/
> +		 * nvme_reset_work who is waiting on adminq. We cannot return
> +		 * BLK_EH_RESET_TIMER to depend on error recovery procedure
> +		 * because it is waiting for timeout path.
> +		 */
> +		dev_warn(dev->ctrl.device,
> +			 "I/O %d QID %d timeout, disable controller\n",
> +			 req->tag, nvmeq->qid);
> +		nvme_pci_disable_ctrl_directly(dev);
> +		set_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags);
> +		ret = BLK_EH_HANDLED;
> +		break;
> +	default:
> +		WARN_ON(1);
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
>  static void nvme_free_queue(struct nvme_queue *nvmeq)
>  {
>  	dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
> @@ -2169,6 +2210,33 @@ static void nvme_pci_disable(struct nvme_dev *dev)
>  	}
>  }
>  
> +/*
> + * This is only invoked by nvme_timeout. shutdown_lock is not needed
> + * here because nvme_abort_requests_sync in nvme_dev_disable will
> + * synchronize with the timeout path.
> + */
> +static void nvme_pci_disable_ctrl_directly(struct nvme_dev *dev)
> +{
> +	int i;
> +	struct pci_dev *pdev = to_pci_dev(dev->dev);
> +	bool dead = true;
> +
> +	if (pci_is_enabled(pdev)) {
> +		u32 csts = readl(dev->bar + NVME_REG_CSTS);
> +
> +		dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
> +			pdev->error_state  != pci_channel_io_normal);
> +
> +		if (!dead)
> +			nvme_disable_ctrl(&dev->ctrl, dev->ctrl.cap);
> +	}
> +
> +	for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
> +		nvme_suspend_queue(&dev->queues[i]);
> +
> +	nvme_pci_disable(dev);
> +}
> +
>  static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>  {
>  	int i;
> @@ -2205,6 +2273,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>  
>  	}
>  	nvme_stop_queues(&dev->ctrl);
> +	nvme_abort_requests_sync(&dev->ctrl);
> +	dev->inflight_flushed = true;
>  
>  	if (!dead) {
>  		nvme_disable_io_queues(dev);
> @@ -2215,9 +2285,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>  
>  	nvme_pci_disable(dev);
>  
> -	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
> -	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
> -
> +	nvme_flush_aborted_requests(&dev->ctrl);
> +	dev->inflight_flushed = false;
>  	/*
>  	 * The driver will not be starting up queues again if shutting down so
>  	 * must flush all entered requests to their failed completion to avoid
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V4 3/5] nvme-pci: avoid nvme_dev_disable to be invoked in nvme_timeout
  2018-03-09  2:01     ` jianchao.wang
@ 2018-03-13 13:29       ` jianchao.wang
  -1 siblings, 0 replies; 38+ messages in thread
From: jianchao.wang @ 2018-03-13 13:29 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi; +Cc: linux-kernel, linux-nvme

Hi Keith

Would you please take a look at this patch.
I really need your suggestion on this.

Sincerely
Jianchao

On 03/09/2018 10:01 AM, jianchao.wang wrote:
> Hi Keith
> 
> Can I have the honor of getting your comment on this patch?
> 
> Thanks in advance
> Jianchao
> 
> On 03/08/2018 02:19 PM, Jianchao Wang wrote:
>> nvme_dev_disable will issue command on adminq to clear HMB and
>> delete io cq/sqs, maybe more in the future. When adminq no response,
>> it has to depends on timeout path. However, nvme_timeout has to
>> invoke nvme_dev_disable before return, so that the DMA mappings
>> could be released safely. This will introduce dangerous circular
>> dependence. Moreover, the whole nvme_dev_disable is under
>> shutdown_lock even waiting for the command, this makes things
>> worse.
>>
>> To avoid this, this patch refactors the nvme_timeout. The basic
>> principle is:
>>  - When need to schedule reset_work, hand over expired requests
>>    to nvme_dev_disable. They will be completed after the controller
>>    is disabled/shtudown.
>>  - When requests from nvme_dev_disable and nvme_reset_work expires,
>>    disable the controller directly then the request could be completed
>>    to wakeup the waiter. nvme_pci_disable_ctrl_directly is introduced
>>    for this, it doesn't send commands on adminq and the shutdown_lock
>>    is not needed here, because the nvme_abort_requests_sync in
>>    nvme_dev_disable could synchronize with nvme_timeout.
>>
>> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
>> ---
>>  drivers/nvme/host/pci.c | 199 ++++++++++++++++++++++++++++++++----------------
>>  1 file changed, 134 insertions(+), 65 deletions(-)
>>
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index e186158..ce09057 100644
>> --- a/drivers/nvme/host/pci.c
>> +++ b/drivers/nvme/host/pci.c
>> @@ -70,6 +70,7 @@ struct nvme_queue;
>>  
>>  static void nvme_process_cq(struct nvme_queue *nvmeq);
>>  static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
>> +static void nvme_pci_disable_ctrl_directly(struct nvme_dev *dev);
>>  
>>  /*
>>   * Represents an NVM Express device.  Each nvme_dev is a PCI function.
>> @@ -98,6 +99,7 @@ struct nvme_dev {
>>  	u32 cmbloc;
>>  	struct nvme_ctrl ctrl;
>>  	struct completion ioq_wait;
>> +	bool inflight_flushed;
>>  
>>  	/* shadow doorbell buffer support: */
>>  	u32 *dbbuf_dbs;
>> @@ -1180,73 +1182,13 @@ static void nvme_warn_reset(struct nvme_dev *dev, u32 csts)
>>  			 csts, result);
>>  }
>>  
>> -static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>> +static enum blk_eh_timer_return nvme_pci_abort_io_req(struct request *req)
>>  {
>>  	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
>>  	struct nvme_queue *nvmeq = iod->nvmeq;
>>  	struct nvme_dev *dev = nvmeq->dev;
>> -	struct request *abort_req;
>>  	struct nvme_command cmd;
>> -	u32 csts = readl(dev->bar + NVME_REG_CSTS);
>> -
>> -	/*
>> -	 * Reset immediately if the controller is failed
>> -	 */
>> -	if (nvme_should_reset(dev, csts)) {
>> -		nvme_warn_reset(dev, csts);
>> -		nvme_dev_disable(dev, false);
>> -		nvme_reset_ctrl(&dev->ctrl);
>> -		return BLK_EH_HANDLED;
>> -	}
>> -
>> -	/*
>> -	 * Did we miss an interrupt?
>> -	 */
>> -	if (__nvme_poll(nvmeq, req->tag)) {
>> -		dev_warn(dev->ctrl.device,
>> -			 "I/O %d QID %d timeout, completion polled\n",
>> -			 req->tag, nvmeq->qid);
>> -		return BLK_EH_HANDLED;
>> -	}
>> -
>> -	/*
>> -	 * Shutdown immediately if controller times out while starting. The
>> -	 * reset work will see the pci device disabled when it gets the forced
>> -	 * cancellation error. All outstanding requests are completed on
>> -	 * shutdown, so we return BLK_EH_HANDLED.
>> -	 */
>> -	switch (dev->ctrl.state) {
>> -	case NVME_CTRL_CONNECTING:
>> -	case NVME_CTRL_RESETTING:
>> -		dev_warn(dev->ctrl.device,
>> -			 "I/O %d QID %d timeout, disable controller\n",
>> -			 req->tag, nvmeq->qid);
>> -		nvme_dev_disable(dev, false);
>> -		set_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags);
>> -		return BLK_EH_HANDLED;
>> -	default:
>> -		break;
>> -	}
>> -
>> -	/*
>> - 	 * Shutdown the controller immediately and schedule a reset if the
>> - 	 * command was already aborted once before and still hasn't been
>> - 	 * returned to the driver, or if this is the admin queue.
>> -	 */
>> -	if (!nvmeq->qid || iod->aborted) {
>> -		dev_warn(dev->ctrl.device,
>> -			 "I/O %d QID %d timeout, reset controller\n",
>> -			 req->tag, nvmeq->qid);
>> -		nvme_dev_disable(dev, false);
>> -		nvme_reset_ctrl(&dev->ctrl);
>> -
>> -		/*
>> -		 * Mark the request as handled, since the inline shutdown
>> -		 * forces all outstanding requests to complete.
>> -		 */
>> -		set_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags);
>> -		return BLK_EH_HANDLED;
>> -	}
>> +	struct request *abort_req;
>>  
>>  	if (atomic_dec_return(&dev->ctrl.abort_limit) < 0) {
>>  		atomic_inc(&dev->ctrl.abort_limit);
>> @@ -1282,6 +1224,105 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>>  	return BLK_EH_RESET_TIMER;
>>  }
>>  
>> +static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>> +{
>> +	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
>> +	struct nvme_queue *nvmeq = iod->nvmeq;
>> +	struct nvme_dev *dev = nvmeq->dev;
>> +	u32 csts = readl(dev->bar + NVME_REG_CSTS);
>> +	enum {ABORT, RESET, DISABLE} action;
>> +	enum blk_eh_timer_return ret;
>> +	/*
>> +	 * This is for nvme_abort_req. This request will be completed
>> +	 * by nvme_flush_aborted_requests after the controller is
>> +	 * disabled/shutdown
>> +	 */
>> +	if (test_bit(NVME_REQ_ABORTED, &nvme_req(req)->flags))
>> +		return BLK_EH_NOT_HANDLED;
>> +
>> +	/*
>> +	 * Reset immediately if the controller is failed.
>> +	 * Defer the completion to nvme_flush_aborted_requests.
>> +	 */
>> +	if (nvme_should_reset(dev, csts)) {
>> +		nvme_warn_reset(dev, csts);
>> +		nvme_reset_ctrl(&dev->ctrl);
>> +		return BLK_EH_RESET_TIMER;
>> +	}
>> +
>> +	/*
>> +	 * Did we miss an interrupt?
>> +	 */
>> +	if (__nvme_poll(nvmeq, req->tag)) {
>> +		dev_warn(dev->ctrl.device,
>> +			 "I/O %d QID %d timeout, completion polled\n",
>> +			 req->tag, nvmeq->qid);
>> +		return BLK_EH_HANDLED;
>> +	}
>> +
>> +	if (nvmeq->qid) {
>> +		if (dev->ctrl.state == NVME_CTRL_RESETTING ||
>> +				iod->aborted)
>> +			action = RESET;
>> +		else
>> +			action = ABORT;
>> +	} else {
>> +		/*
>> +		 * Disable immediately if controller times out while disabling/
>> +		 * shuting down/starting. The nvme_dev_disable/nvme_reset_work
>> +		 * will see the error.
>> +		 * Note: inflight_flushed is set in nvme_dev_disable when it
>> +		 * abort all the inflight requests. Introducing this flag is due
>> +		 * to there is no state to represent the shutdown procedure.
>> +		 */
>> +		if (dev->ctrl.state == NVME_CTRL_CONNECTING ||
>> +			dev->inflight_flushed)
>> +			action = DISABLE;
>> +		else
>> +			action = RESET;
>> +	}
>> +
>> +	switch (action) {
>> +	case ABORT:
>> +		ret = nvme_pci_abort_io_req(req);
>> +		break;
>> +	case RESET:
>> +		dev_warn(dev->ctrl.device,
>> +			 "I/O %d QID %d timeout, reset controller\n",
>> +			 req->tag, nvmeq->qid);
>> +		set_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags);
>> +		nvme_reset_ctrl(&dev->ctrl);
>> +		/*
>> +		 * The reset work will take over this request. nvme_abort_req
>> +		 * employs blk_abort_request to force the request to be timed
>> +		 * out. So we need to return BLK_EH_RESET_TIMER then the
>> +		 * RQF_MQ_TIMEOUT_EXPIRED could be cleared.
>> +		 */
>> +		ret = BLK_EH_RESET_TIMER;
>> +		break;
>> +	case DISABLE:
>> +		/*
>> +		 * We disable the controller directly here so that we could
>> +		 * complete the request safely to wake up the nvme_dev_disable/
>> +		 * nvme_reset_work who is waiting on adminq. We cannot return
>> +		 * BLK_EH_RESET_TIMER to depend on error recovery procedure
>> +		 * because it is waiting for timeout path.
>> +		 */
>> +		dev_warn(dev->ctrl.device,
>> +			 "I/O %d QID %d timeout, disable controller\n",
>> +			 req->tag, nvmeq->qid);
>> +		nvme_pci_disable_ctrl_directly(dev);
>> +		set_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags);
>> +		ret = BLK_EH_HANDLED;
>> +		break;
>> +	default:
>> +		WARN_ON(1);
>> +		break;
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>>  static void nvme_free_queue(struct nvme_queue *nvmeq)
>>  {
>>  	dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
>> @@ -2169,6 +2210,33 @@ static void nvme_pci_disable(struct nvme_dev *dev)
>>  	}
>>  }
>>  
>> +/*
>> + * This is only invoked by nvme_timeout. shutdown_lock is not needed
>> + * here because nvme_abort_requests_sync in nvme_dev_disable will
>> + * synchronize with the timeout path.
>> + */
>> +static void nvme_pci_disable_ctrl_directly(struct nvme_dev *dev)
>> +{
>> +	int i;
>> +	struct pci_dev *pdev = to_pci_dev(dev->dev);
>> +	bool dead = true;
>> +
>> +	if (pci_is_enabled(pdev)) {
>> +		u32 csts = readl(dev->bar + NVME_REG_CSTS);
>> +
>> +		dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
>> +			pdev->error_state  != pci_channel_io_normal);
>> +
>> +		if (!dead)
>> +			nvme_disable_ctrl(&dev->ctrl, dev->ctrl.cap);
>> +	}
>> +
>> +	for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
>> +		nvme_suspend_queue(&dev->queues[i]);
>> +
>> +	nvme_pci_disable(dev);
>> +}
>> +
>>  static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>>  {
>>  	int i;
>> @@ -2205,6 +2273,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>>  
>>  	}
>>  	nvme_stop_queues(&dev->ctrl);
>> +	nvme_abort_requests_sync(&dev->ctrl);
>> +	dev->inflight_flushed = true;
>>  
>>  	if (!dead) {
>>  		nvme_disable_io_queues(dev);
>> @@ -2215,9 +2285,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>>  
>>  	nvme_pci_disable(dev);
>>  
>> -	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
>> -	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
>> -
>> +	nvme_flush_aborted_requests(&dev->ctrl);
>> +	dev->inflight_flushed = false;
>>  	/*
>>  	 * The driver will not be starting up queues again if shutting down so
>>  	 * must flush all entered requests to their failed completion to avoid
>>
> 
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.infradead.org_mailman_listinfo_linux-2Dnvme&d=DwICAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=7WdAxUBeiTUTCy8v-7zXyr4qk7sx26ATvfo6QSTvZyQ&m=5Mw5gjePWHT_YZ77OroeGMhDmXLF0LW6EtV-hBMRYas&s=VM31J1nvWkmoWzuZIHV0f5nlrSW0ZTKUQ1TjLwbB18g&e=
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH V4 3/5] nvme-pci: avoid nvme_dev_disable to be invoked in nvme_timeout
@ 2018-03-13 13:29       ` jianchao.wang
  0 siblings, 0 replies; 38+ messages in thread
From: jianchao.wang @ 2018-03-13 13:29 UTC (permalink / raw)


Hi Keith

Would you please take a look at this patch.
I really need your suggestion on this.

Sincerely
Jianchao

On 03/09/2018 10:01 AM, jianchao.wang wrote:
> Hi Keith
> 
> Can I have the honor of getting your comment on this patch?
> 
> Thanks in advance
> Jianchao
> 
> On 03/08/2018 02:19 PM, Jianchao Wang wrote:
>> nvme_dev_disable will issue command on adminq to clear HMB and
>> delete io cq/sqs, maybe more in the future. When adminq no response,
>> it has to depends on timeout path. However, nvme_timeout has to
>> invoke nvme_dev_disable before return, so that the DMA mappings
>> could be released safely. This will introduce dangerous circular
>> dependence. Moreover, the whole nvme_dev_disable is under
>> shutdown_lock even waiting for the command, this makes things
>> worse.
>>
>> To avoid this, this patch refactors the nvme_timeout. The basic
>> principle is:
>>  - When need to schedule reset_work, hand over expired requests
>>    to nvme_dev_disable. They will be completed after the controller
>>    is disabled/shtudown.
>>  - When requests from nvme_dev_disable and nvme_reset_work expires,
>>    disable the controller directly then the request could be completed
>>    to wakeup the waiter. nvme_pci_disable_ctrl_directly is introduced
>>    for this, it doesn't send commands on adminq and the shutdown_lock
>>    is not needed here, because the nvme_abort_requests_sync in
>>    nvme_dev_disable could synchronize with nvme_timeout.
>>
>> Signed-off-by: Jianchao Wang <jianchao.w.wang at oracle.com>
>> ---
>>  drivers/nvme/host/pci.c | 199 ++++++++++++++++++++++++++++++++----------------
>>  1 file changed, 134 insertions(+), 65 deletions(-)
>>
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index e186158..ce09057 100644
>> --- a/drivers/nvme/host/pci.c
>> +++ b/drivers/nvme/host/pci.c
>> @@ -70,6 +70,7 @@ struct nvme_queue;
>>  
>>  static void nvme_process_cq(struct nvme_queue *nvmeq);
>>  static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
>> +static void nvme_pci_disable_ctrl_directly(struct nvme_dev *dev);
>>  
>>  /*
>>   * Represents an NVM Express device.  Each nvme_dev is a PCI function.
>> @@ -98,6 +99,7 @@ struct nvme_dev {
>>  	u32 cmbloc;
>>  	struct nvme_ctrl ctrl;
>>  	struct completion ioq_wait;
>> +	bool inflight_flushed;
>>  
>>  	/* shadow doorbell buffer support: */
>>  	u32 *dbbuf_dbs;
>> @@ -1180,73 +1182,13 @@ static void nvme_warn_reset(struct nvme_dev *dev, u32 csts)
>>  			 csts, result);
>>  }
>>  
>> -static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>> +static enum blk_eh_timer_return nvme_pci_abort_io_req(struct request *req)
>>  {
>>  	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
>>  	struct nvme_queue *nvmeq = iod->nvmeq;
>>  	struct nvme_dev *dev = nvmeq->dev;
>> -	struct request *abort_req;
>>  	struct nvme_command cmd;
>> -	u32 csts = readl(dev->bar + NVME_REG_CSTS);
>> -
>> -	/*
>> -	 * Reset immediately if the controller is failed
>> -	 */
>> -	if (nvme_should_reset(dev, csts)) {
>> -		nvme_warn_reset(dev, csts);
>> -		nvme_dev_disable(dev, false);
>> -		nvme_reset_ctrl(&dev->ctrl);
>> -		return BLK_EH_HANDLED;
>> -	}
>> -
>> -	/*
>> -	 * Did we miss an interrupt?
>> -	 */
>> -	if (__nvme_poll(nvmeq, req->tag)) {
>> -		dev_warn(dev->ctrl.device,
>> -			 "I/O %d QID %d timeout, completion polled\n",
>> -			 req->tag, nvmeq->qid);
>> -		return BLK_EH_HANDLED;
>> -	}
>> -
>> -	/*
>> -	 * Shutdown immediately if controller times out while starting. The
>> -	 * reset work will see the pci device disabled when it gets the forced
>> -	 * cancellation error. All outstanding requests are completed on
>> -	 * shutdown, so we return BLK_EH_HANDLED.
>> -	 */
>> -	switch (dev->ctrl.state) {
>> -	case NVME_CTRL_CONNECTING:
>> -	case NVME_CTRL_RESETTING:
>> -		dev_warn(dev->ctrl.device,
>> -			 "I/O %d QID %d timeout, disable controller\n",
>> -			 req->tag, nvmeq->qid);
>> -		nvme_dev_disable(dev, false);
>> -		set_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags);
>> -		return BLK_EH_HANDLED;
>> -	default:
>> -		break;
>> -	}
>> -
>> -	/*
>> - 	 * Shutdown the controller immediately and schedule a reset if the
>> - 	 * command was already aborted once before and still hasn't been
>> - 	 * returned to the driver, or if this is the admin queue.
>> -	 */
>> -	if (!nvmeq->qid || iod->aborted) {
>> -		dev_warn(dev->ctrl.device,
>> -			 "I/O %d QID %d timeout, reset controller\n",
>> -			 req->tag, nvmeq->qid);
>> -		nvme_dev_disable(dev, false);
>> -		nvme_reset_ctrl(&dev->ctrl);
>> -
>> -		/*
>> -		 * Mark the request as handled, since the inline shutdown
>> -		 * forces all outstanding requests to complete.
>> -		 */
>> -		set_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags);
>> -		return BLK_EH_HANDLED;
>> -	}
>> +	struct request *abort_req;
>>  
>>  	if (atomic_dec_return(&dev->ctrl.abort_limit) < 0) {
>>  		atomic_inc(&dev->ctrl.abort_limit);
>> @@ -1282,6 +1224,105 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>>  	return BLK_EH_RESET_TIMER;
>>  }
>>  
>> +static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>> +{
>> +	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
>> +	struct nvme_queue *nvmeq = iod->nvmeq;
>> +	struct nvme_dev *dev = nvmeq->dev;
>> +	u32 csts = readl(dev->bar + NVME_REG_CSTS);
>> +	enum {ABORT, RESET, DISABLE} action;
>> +	enum blk_eh_timer_return ret;
>> +	/*
>> +	 * This is for nvme_abort_req. This request will be completed
>> +	 * by nvme_flush_aborted_requests after the controller is
>> +	 * disabled/shutdown
>> +	 */
>> +	if (test_bit(NVME_REQ_ABORTED, &nvme_req(req)->flags))
>> +		return BLK_EH_NOT_HANDLED;
>> +
>> +	/*
>> +	 * Reset immediately if the controller is failed.
>> +	 * Defer the completion to nvme_flush_aborted_requests.
>> +	 */
>> +	if (nvme_should_reset(dev, csts)) {
>> +		nvme_warn_reset(dev, csts);
>> +		nvme_reset_ctrl(&dev->ctrl);
>> +		return BLK_EH_RESET_TIMER;
>> +	}
>> +
>> +	/*
>> +	 * Did we miss an interrupt?
>> +	 */
>> +	if (__nvme_poll(nvmeq, req->tag)) {
>> +		dev_warn(dev->ctrl.device,
>> +			 "I/O %d QID %d timeout, completion polled\n",
>> +			 req->tag, nvmeq->qid);
>> +		return BLK_EH_HANDLED;
>> +	}
>> +
>> +	if (nvmeq->qid) {
>> +		if (dev->ctrl.state == NVME_CTRL_RESETTING ||
>> +				iod->aborted)
>> +			action = RESET;
>> +		else
>> +			action = ABORT;
>> +	} else {
>> +		/*
>> +		 * Disable immediately if controller times out while disabling/
>> +		 * shuting down/starting. The nvme_dev_disable/nvme_reset_work
>> +		 * will see the error.
>> +		 * Note: inflight_flushed is set in nvme_dev_disable when it
>> +		 * abort all the inflight requests. Introducing this flag is due
>> +		 * to there is no state to represent the shutdown procedure.
>> +		 */
>> +		if (dev->ctrl.state == NVME_CTRL_CONNECTING ||
>> +			dev->inflight_flushed)
>> +			action = DISABLE;
>> +		else
>> +			action = RESET;
>> +	}
>> +
>> +	switch (action) {
>> +	case ABORT:
>> +		ret = nvme_pci_abort_io_req(req);
>> +		break;
>> +	case RESET:
>> +		dev_warn(dev->ctrl.device,
>> +			 "I/O %d QID %d timeout, reset controller\n",
>> +			 req->tag, nvmeq->qid);
>> +		set_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags);
>> +		nvme_reset_ctrl(&dev->ctrl);
>> +		/*
>> +		 * The reset work will take over this request. nvme_abort_req
>> +		 * employs blk_abort_request to force the request to be timed
>> +		 * out. So we need to return BLK_EH_RESET_TIMER then the
>> +		 * RQF_MQ_TIMEOUT_EXPIRED could be cleared.
>> +		 */
>> +		ret = BLK_EH_RESET_TIMER;
>> +		break;
>> +	case DISABLE:
>> +		/*
>> +		 * We disable the controller directly here so that we could
>> +		 * complete the request safely to wake up the nvme_dev_disable/
>> +		 * nvme_reset_work who is waiting on adminq. We cannot return
>> +		 * BLK_EH_RESET_TIMER to depend on error recovery procedure
>> +		 * because it is waiting for timeout path.
>> +		 */
>> +		dev_warn(dev->ctrl.device,
>> +			 "I/O %d QID %d timeout, disable controller\n",
>> +			 req->tag, nvmeq->qid);
>> +		nvme_pci_disable_ctrl_directly(dev);
>> +		set_bit(NVME_REQ_CANCELLED, &nvme_req(req)->flags);
>> +		ret = BLK_EH_HANDLED;
>> +		break;
>> +	default:
>> +		WARN_ON(1);
>> +		break;
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>>  static void nvme_free_queue(struct nvme_queue *nvmeq)
>>  {
>>  	dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
>> @@ -2169,6 +2210,33 @@ static void nvme_pci_disable(struct nvme_dev *dev)
>>  	}
>>  }
>>  
>> +/*
>> + * This is only invoked by nvme_timeout. shutdown_lock is not needed
>> + * here because nvme_abort_requests_sync in nvme_dev_disable will
>> + * synchronize with the timeout path.
>> + */
>> +static void nvme_pci_disable_ctrl_directly(struct nvme_dev *dev)
>> +{
>> +	int i;
>> +	struct pci_dev *pdev = to_pci_dev(dev->dev);
>> +	bool dead = true;
>> +
>> +	if (pci_is_enabled(pdev)) {
>> +		u32 csts = readl(dev->bar + NVME_REG_CSTS);
>> +
>> +		dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
>> +			pdev->error_state  != pci_channel_io_normal);
>> +
>> +		if (!dead)
>> +			nvme_disable_ctrl(&dev->ctrl, dev->ctrl.cap);
>> +	}
>> +
>> +	for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
>> +		nvme_suspend_queue(&dev->queues[i]);
>> +
>> +	nvme_pci_disable(dev);
>> +}
>> +
>>  static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>>  {
>>  	int i;
>> @@ -2205,6 +2273,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>>  
>>  	}
>>  	nvme_stop_queues(&dev->ctrl);
>> +	nvme_abort_requests_sync(&dev->ctrl);
>> +	dev->inflight_flushed = true;
>>  
>>  	if (!dead) {
>>  		nvme_disable_io_queues(dev);
>> @@ -2215,9 +2285,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>>  
>>  	nvme_pci_disable(dev);
>>  
>> -	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
>> -	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
>> -
>> +	nvme_flush_aborted_requests(&dev->ctrl);
>> +	dev->inflight_flushed = false;
>>  	/*
>>  	 * The driver will not be starting up queues again if shutting down so
>>  	 * must flush all entered requests to their failed completion to avoid
>>
> 
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme at lists.infradead.org
> https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.infradead.org_mailman_listinfo_linux-2Dnvme&d=DwICAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=7WdAxUBeiTUTCy8v-7zXyr4qk7sx26ATvfo6QSTvZyQ&m=5Mw5gjePWHT_YZ77OroeGMhDmXLF0LW6EtV-hBMRYas&s=VM31J1nvWkmoWzuZIHV0f5nlrSW0ZTKUQ1TjLwbB18g&e=
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: PATCH V4 0/5 nvme-pci: fixes on nvme_timeout and nvme_dev_disable
  2018-03-08  6:19 ` Jianchao Wang
@ 2018-04-17 15:17   ` Ming Lei
  -1 siblings, 0 replies; 38+ messages in thread
From: Ming Lei @ 2018-04-17 15:17 UTC (permalink / raw)
  To: Jianchao Wang; +Cc: keith.busch, axboe, hch, sagi, linux-kernel, linux-nvme

On Thu, Mar 08, 2018 at 02:19:26PM +0800, Jianchao Wang wrote:
> Firstly, really appreciate Keith and Sagi's precious advice on previous versions.
> And this is the version 4.
> 
> Some patches of the previous patchset have been submitted and the left is this patchset
> which has been refactored. Please consider it for 4.17.
> 
> The target of this patchset is to avoid nvme_dev_disable to be invoked by nvme_timeout.
> As we know, nvme_dev_disable will issue commands on adminq, if the controller no response,
> it has to depend on timeout path. However, nvme_timeout will also need to invoke
> nvme_dev_disable. This will introduce dangerous circular dependence. Moreover,
> nvme_dev_disable is under the shutdown_lock, even when it go to sleep, this makes things
> worse.
> 
> The basic idea of this patchset is:
>  - When need to schedule reset_work, hand over expired requests to nvme_dev_disable. They
>    will be completed after the controller is disabled/shtudown.
> 
>  - When requests from nvme_dev_disable and nvme_reset_work expires, disable the controller
>    directly then the request could be completed to wakeup the waiter. 
> 
> The 'disable the controller directly' here means that it doesn't send commands on adminq.
> A new interface is introduced for this, nvme_pci_disable_ctrl_directly. More details,
> please refer to the comment of the function.
> 
> Then nvme_timeout doesn't depends on nvme_dev_disable any more.
> 
> Because there is big difference from previous version, and some relatively independent patches
> have been submitted, so I just reserve the key part of previous version change log following.
> 
> Change V3->V4
>  - refactor the interfaces flushing in-flight requests and add them to nvme core.
>  - refactor the nvme_timeout to make it more clearly
> 
> Change V2->V3:
>  - discard the patch which unfreeze the queue after nvme_dev_disable
> 
> Changes V1->V2:
>  - disable PCI controller bus master in nvme_pci_disable_ctrl_directly
> 
> There are 5 patches:
> 1st one is to change the operations on nvme_request->flags to atomic operations, then we could introduce
> another NVME_REQ_ABORTED next.
> 2nd patch introduce two new interfaces to flush in-flight requests in nvme core.
> 3rd patch is to avoid the nvme_dev_disable in nvme_timeout, it introduce new interface nvme_pci_disable_ctrl_directly
> and refactor the nvme_timeout
> 4th~5th is to fix issues introduced after 3rd patch.
> 
> Jianchao Wang (5)
> 0001-nvme-do-atomically-bit-operations-on-nvme_request.fl.patch
> 0002-nvme-add-helper-interface-to-flush-in-flight-request.patch
> 0003-nvme-pci-avoid-nvme_dev_disable-to-be-invoked-in-nvm.patch
> 0004-nvme-pci-discard-wait-timeout-when-delete-cq-sq.patch
> 0005-nvme-pci-add-the-timeout-case-for-DELETEING-state.patch
> 
> diff stat
>  drivers/nvme/host/core.c |  96 +++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/nvme/host/nvme.h |   4 +-
>  drivers/nvme/host/pci.c  | 224 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------------
>  

Hi Jianchao,

Looks blktest(block/011) can trigger IO hang easily on NVMe PCI device,
and all are related with nvme_dev_disable():

1) admin queue may be disabled by nvme_dev_disable() from timeout path
during resetting, then reset can't move on

2) the nvme_dev_disable() called from nvme_reset_work() may cause double
completion on timed-out request 

So could you share us what your plan is about this patchset?

Thanks,
Ming

^ permalink raw reply	[flat|nested] 38+ messages in thread

* PATCH V4 0/5 nvme-pci: fixes on nvme_timeout and nvme_dev_disable
@ 2018-04-17 15:17   ` Ming Lei
  0 siblings, 0 replies; 38+ messages in thread
From: Ming Lei @ 2018-04-17 15:17 UTC (permalink / raw)


On Thu, Mar 08, 2018@02:19:26PM +0800, Jianchao Wang wrote:
> Firstly, really appreciate Keith and Sagi's precious advice on previous versions.
> And this is the version 4.
> 
> Some patches of the previous patchset have been submitted and the left is this patchset
> which has been refactored. Please consider it for 4.17.
> 
> The target of this patchset is to avoid nvme_dev_disable to be invoked by nvme_timeout.
> As we know, nvme_dev_disable will issue commands on adminq, if the controller no response,
> it has to depend on timeout path. However, nvme_timeout will also need to invoke
> nvme_dev_disable. This will introduce dangerous circular dependence. Moreover,
> nvme_dev_disable is under the shutdown_lock, even when it go to sleep, this makes things
> worse.
> 
> The basic idea of this patchset is:
>  - When need to schedule reset_work, hand over expired requests to nvme_dev_disable. They
>    will be completed after the controller is disabled/shtudown.
> 
>  - When requests from nvme_dev_disable and nvme_reset_work expires, disable the controller
>    directly then the request could be completed to wakeup the waiter. 
> 
> The 'disable the controller directly' here means that it doesn't send commands on adminq.
> A new interface is introduced for this, nvme_pci_disable_ctrl_directly. More details,
> please refer to the comment of the function.
> 
> Then nvme_timeout doesn't depends on nvme_dev_disable any more.
> 
> Because there is big difference from previous version, and some relatively independent patches
> have been submitted, so I just reserve the key part of previous version change log following.
> 
> Change V3->V4
>  - refactor the interfaces flushing in-flight requests and add them to nvme core.
>  - refactor the nvme_timeout to make it more clearly
> 
> Change V2->V3:
>  - discard the patch which unfreeze the queue after nvme_dev_disable
> 
> Changes V1->V2:
>  - disable PCI controller bus master in nvme_pci_disable_ctrl_directly
> 
> There are 5 patches:
> 1st one is to change the operations on nvme_request->flags to atomic operations, then we could introduce
> another NVME_REQ_ABORTED next.
> 2nd patch introduce two new interfaces to flush in-flight requests in nvme core.
> 3rd patch is to avoid the nvme_dev_disable in nvme_timeout, it introduce new interface nvme_pci_disable_ctrl_directly
> and refactor the nvme_timeout
> 4th~5th is to fix issues introduced after 3rd patch.
> 
> Jianchao Wang (5)
> 0001-nvme-do-atomically-bit-operations-on-nvme_request.fl.patch
> 0002-nvme-add-helper-interface-to-flush-in-flight-request.patch
> 0003-nvme-pci-avoid-nvme_dev_disable-to-be-invoked-in-nvm.patch
> 0004-nvme-pci-discard-wait-timeout-when-delete-cq-sq.patch
> 0005-nvme-pci-add-the-timeout-case-for-DELETEING-state.patch
> 
> diff stat
>  drivers/nvme/host/core.c |  96 +++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/nvme/host/nvme.h |   4 +-
>  drivers/nvme/host/pci.c  | 224 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------------
>  

Hi Jianchao,

Looks blktest(block/011) can trigger IO hang easily on NVMe PCI device,
and all are related with nvme_dev_disable():

1) admin queue may be disabled by nvme_dev_disable() from timeout path
during resetting, then reset can't move on

2) the nvme_dev_disable() called from nvme_reset_work() may cause double
completion on timed-out request 

So could you share us what your plan is about this patchset?

Thanks,
Ming

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: PATCH V4 0/5 nvme-pci: fixes on nvme_timeout and nvme_dev_disable
  2018-04-17 15:17   ` Ming Lei
@ 2018-04-18 14:24     ` jianchao.wang
  -1 siblings, 0 replies; 38+ messages in thread
From: jianchao.wang @ 2018-04-18 14:24 UTC (permalink / raw)
  To: Ming Lei; +Cc: axboe, sagi, linux-kernel, linux-nvme, keith.busch, hch

Hi Ming

On 04/17/2018 11:17 PM, Ming Lei wrote:
> Looks blktest(block/011) can trigger IO hang easily on NVMe PCI device,
> and all are related with nvme_dev_disable():
> 
> 1) admin queue may be disabled by nvme_dev_disable() from timeout path
> during resetting, then reset can't move on
> 
> 2) the nvme_dev_disable() called from nvme_reset_work() may cause double
> completion on timed-out request 
> 
> So could you share us what your plan is about this patchset?

Regarding to this patchset, it is mainly to fix the dependency between
nvme_timeout and nvme_dev_disable, as your can see:
nvme_timeout will invoke nvme_dev_disable, and nvme_dev_disable have to
depend on nvme_timeout when controller no response. Till now, some parts
of the patchset looks bad and seem to have a lot of work need to be done.
:)

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 38+ messages in thread

* PATCH V4 0/5 nvme-pci: fixes on nvme_timeout and nvme_dev_disable
@ 2018-04-18 14:24     ` jianchao.wang
  0 siblings, 0 replies; 38+ messages in thread
From: jianchao.wang @ 2018-04-18 14:24 UTC (permalink / raw)


Hi Ming

On 04/17/2018 11:17 PM, Ming Lei wrote:
> Looks blktest(block/011) can trigger IO hang easily on NVMe PCI device,
> and all are related with nvme_dev_disable():
> 
> 1) admin queue may be disabled by nvme_dev_disable() from timeout path
> during resetting, then reset can't move on
> 
> 2) the nvme_dev_disable() called from nvme_reset_work() may cause double
> completion on timed-out request 
> 
> So could you share us what your plan is about this patchset?

Regarding to this patchset, it is mainly to fix the dependency between
nvme_timeout and nvme_dev_disable, as your can see:
nvme_timeout will invoke nvme_dev_disable, and nvme_dev_disable have to
depend on nvme_timeout when controller no response. Till now, some parts
of the patchset looks bad and seem to have a lot of work need to be done.
:)

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: PATCH V4 0/5 nvme-pci: fixes on nvme_timeout and nvme_dev_disable
  2018-04-18 14:24     ` jianchao.wang
@ 2018-04-18 15:40       ` Ming Lei
  -1 siblings, 0 replies; 38+ messages in thread
From: Ming Lei @ 2018-04-18 15:40 UTC (permalink / raw)
  To: jianchao.wang; +Cc: axboe, sagi, linux-kernel, linux-nvme, keith.busch, hch

On Wed, Apr 18, 2018 at 10:24:28PM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 04/17/2018 11:17 PM, Ming Lei wrote:
> > Looks blktest(block/011) can trigger IO hang easily on NVMe PCI device,
> > and all are related with nvme_dev_disable():
> > 
> > 1) admin queue may be disabled by nvme_dev_disable() from timeout path
> > during resetting, then reset can't move on
> > 
> > 2) the nvme_dev_disable() called from nvme_reset_work() may cause double
> > completion on timed-out request 
> > 
> > So could you share us what your plan is about this patchset?
> 
> Regarding to this patchset, it is mainly to fix the dependency between
> nvme_timeout and nvme_dev_disable, as your can see:
> nvme_timeout will invoke nvme_dev_disable, and nvme_dev_disable have to
> depend on nvme_timeout when controller no response.

Do you mean nvme_disable_io_queues()? If yes, this one has been handled
by wait_for_completion_io_timeout() already, and looks the block timeout
can be disabled simply. Or are there others?

> Till now, some parts
> of the patchset looks bad and seem to have a lot of work need to be done.
> :)

Yeah, this part is much more complicated than I thought.

I think it is a good topic to discuss in the coming LSF/MM, and the NVMe
timeout(EH) may need to be refactored/cleaned up, and current issues
should be addressed in clean way.

Guys, are there other issues wrt. NVMe timeout & reset except for the
above?

Thanks,
Ming

^ permalink raw reply	[flat|nested] 38+ messages in thread

* PATCH V4 0/5 nvme-pci: fixes on nvme_timeout and nvme_dev_disable
@ 2018-04-18 15:40       ` Ming Lei
  0 siblings, 0 replies; 38+ messages in thread
From: Ming Lei @ 2018-04-18 15:40 UTC (permalink / raw)


On Wed, Apr 18, 2018@10:24:28PM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 04/17/2018 11:17 PM, Ming Lei wrote:
> > Looks blktest(block/011) can trigger IO hang easily on NVMe PCI device,
> > and all are related with nvme_dev_disable():
> > 
> > 1) admin queue may be disabled by nvme_dev_disable() from timeout path
> > during resetting, then reset can't move on
> > 
> > 2) the nvme_dev_disable() called from nvme_reset_work() may cause double
> > completion on timed-out request 
> > 
> > So could you share us what your plan is about this patchset?
> 
> Regarding to this patchset, it is mainly to fix the dependency between
> nvme_timeout and nvme_dev_disable, as your can see:
> nvme_timeout will invoke nvme_dev_disable, and nvme_dev_disable have to
> depend on nvme_timeout when controller no response.

Do you mean nvme_disable_io_queues()? If yes, this one has been handled
by wait_for_completion_io_timeout() already, and looks the block timeout
can be disabled simply. Or are there others?

> Till now, some parts
> of the patchset looks bad and seem to have a lot of work need to be done.
> :)

Yeah, this part is much more complicated than I thought.

I think it is a good topic to discuss in the coming LSF/MM, and the NVMe
timeout(EH) may need to be refactored/cleaned up, and current issues
should be addressed in clean way.

Guys, are there other issues wrt. NVMe timeout & reset except for the
above?

Thanks,
Ming

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: PATCH V4 0/5 nvme-pci: fixes on nvme_timeout and nvme_dev_disable
  2018-04-18 15:40       ` Ming Lei
@ 2018-04-19  1:51         ` jianchao.wang
  -1 siblings, 0 replies; 38+ messages in thread
From: jianchao.wang @ 2018-04-19  1:51 UTC (permalink / raw)
  To: Ming Lei; +Cc: keith.busch, sagi, linux-kernel, linux-nvme, axboe, hch

Hi Ming

Thanks for your kindly response.

On 04/18/2018 11:40 PM, Ming Lei wrote:
>> Regarding to this patchset, it is mainly to fix the dependency between
>> nvme_timeout and nvme_dev_disable, as your can see:
>> nvme_timeout will invoke nvme_dev_disable, and nvme_dev_disable have to
>> depend on nvme_timeout when controller no response.
> Do you mean nvme_disable_io_queues()? If yes, this one has been handled
> by wait_for_completion_io_timeout() already, and looks the block timeout
> can be disabled simply. Or are there others?
> 
Here is one possible scenario currently

nvme_dev_disable // hold shutdown_lock             nvme_timeout
  -> nvme_set_host_mem                               -> nvme_dev_disable
    -> nvme_submit_sync_cmd                            -> try to require shutdown_lock 
      -> __nvme_submit_sync_cmd
        -> blk_execute_rq
          //if sysctl_hung_task_timeout_secs == 0
          -> wait_for_completion_io
And maybe nvme_dev_disable need to issue other commands in the future.

Even if we could fix these kind of issues as nvme_disable_io_queues, 
it is still a risk I think.

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 38+ messages in thread

* PATCH V4 0/5 nvme-pci: fixes on nvme_timeout and nvme_dev_disable
@ 2018-04-19  1:51         ` jianchao.wang
  0 siblings, 0 replies; 38+ messages in thread
From: jianchao.wang @ 2018-04-19  1:51 UTC (permalink / raw)


Hi Ming

Thanks for your kindly response.

On 04/18/2018 11:40 PM, Ming Lei wrote:
>> Regarding to this patchset, it is mainly to fix the dependency between
>> nvme_timeout and nvme_dev_disable, as your can see:
>> nvme_timeout will invoke nvme_dev_disable, and nvme_dev_disable have to
>> depend on nvme_timeout when controller no response.
> Do you mean nvme_disable_io_queues()? If yes, this one has been handled
> by wait_for_completion_io_timeout() already, and looks the block timeout
> can be disabled simply. Or are there others?
> 
Here is one possible scenario currently

nvme_dev_disable // hold shutdown_lock             nvme_timeout
  -> nvme_set_host_mem                               -> nvme_dev_disable
    -> nvme_submit_sync_cmd                            -> try to require shutdown_lock 
      -> __nvme_submit_sync_cmd
        -> blk_execute_rq
          //if sysctl_hung_task_timeout_secs == 0
          -> wait_for_completion_io
And maybe nvme_dev_disable need to issue other commands in the future.

Even if we could fix these kind of issues as nvme_disable_io_queues, 
it is still a risk I think.

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: PATCH V4 0/5 nvme-pci: fixes on nvme_timeout and nvme_dev_disable
  2018-04-19  1:51         ` jianchao.wang
@ 2018-04-19  2:27           ` Ming Lei
  -1 siblings, 0 replies; 38+ messages in thread
From: Ming Lei @ 2018-04-19  2:27 UTC (permalink / raw)
  To: jianchao.wang; +Cc: keith.busch, sagi, linux-kernel, linux-nvme, axboe, hch

On Thu, Apr 19, 2018 at 09:51:16AM +0800, jianchao.wang wrote:
> Hi Ming
> 
> Thanks for your kindly response.
> 
> On 04/18/2018 11:40 PM, Ming Lei wrote:
> >> Regarding to this patchset, it is mainly to fix the dependency between
> >> nvme_timeout and nvme_dev_disable, as your can see:
> >> nvme_timeout will invoke nvme_dev_disable, and nvme_dev_disable have to
> >> depend on nvme_timeout when controller no response.
> > Do you mean nvme_disable_io_queues()? If yes, this one has been handled
> > by wait_for_completion_io_timeout() already, and looks the block timeout
> > can be disabled simply. Or are there others?
> > 
> Here is one possible scenario currently
> 
> nvme_dev_disable // hold shutdown_lock             nvme_timeout
>   -> nvme_set_host_mem                               -> nvme_dev_disable
>     -> nvme_submit_sync_cmd                            -> try to require shutdown_lock 
>       -> __nvme_submit_sync_cmd
>         -> blk_execute_rq
>           //if sysctl_hung_task_timeout_secs == 0
>           -> wait_for_completion_io
> And maybe nvme_dev_disable need to issue other commands in the future.

OK, thanks for sharing this one, for now I think it might need to be
handled by wait_for_completion_io_timeout() for working around this issue.

> 
> Even if we could fix these kind of issues as nvme_disable_io_queues, 
> it is still a risk I think.

Yeah, I can't agree more, that is why I think the nvme time/eh code should
be refactored, and solve the current issues in a more clean/maintainable
way.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 38+ messages in thread

* PATCH V4 0/5 nvme-pci: fixes on nvme_timeout and nvme_dev_disable
@ 2018-04-19  2:27           ` Ming Lei
  0 siblings, 0 replies; 38+ messages in thread
From: Ming Lei @ 2018-04-19  2:27 UTC (permalink / raw)


On Thu, Apr 19, 2018@09:51:16AM +0800, jianchao.wang wrote:
> Hi Ming
> 
> Thanks for your kindly response.
> 
> On 04/18/2018 11:40 PM, Ming Lei wrote:
> >> Regarding to this patchset, it is mainly to fix the dependency between
> >> nvme_timeout and nvme_dev_disable, as your can see:
> >> nvme_timeout will invoke nvme_dev_disable, and nvme_dev_disable have to
> >> depend on nvme_timeout when controller no response.
> > Do you mean nvme_disable_io_queues()? If yes, this one has been handled
> > by wait_for_completion_io_timeout() already, and looks the block timeout
> > can be disabled simply. Or are there others?
> > 
> Here is one possible scenario currently
> 
> nvme_dev_disable // hold shutdown_lock             nvme_timeout
>   -> nvme_set_host_mem                               -> nvme_dev_disable
>     -> nvme_submit_sync_cmd                            -> try to require shutdown_lock 
>       -> __nvme_submit_sync_cmd
>         -> blk_execute_rq
>           //if sysctl_hung_task_timeout_secs == 0
>           -> wait_for_completion_io
> And maybe nvme_dev_disable need to issue other commands in the future.

OK, thanks for sharing this one, for now I think it might need to be
handled by wait_for_completion_io_timeout() for working around this issue.

> 
> Even if we could fix these kind of issues as nvme_disable_io_queues, 
> it is still a risk I think.

Yeah, I can't agree more, that is why I think the nvme time/eh code should
be refactored, and solve the current issues in a more clean/maintainable
way.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2018-04-19  2:28 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-08  6:19 PATCH V4 0/5 nvme-pci: fixes on nvme_timeout and nvme_dev_disable Jianchao Wang
2018-03-08  6:19 ` Jianchao Wang
2018-03-08  6:19 ` [PATCH V4 1/5] nvme: do atomically bit operations on nvme_request.flags Jianchao Wang
2018-03-08  6:19   ` Jianchao Wang
2018-03-08  7:57   ` Christoph Hellwig
2018-03-08  7:57     ` Christoph Hellwig
2018-03-08 14:32     ` jianchao.wang
2018-03-08 14:32       ` jianchao.wang
2018-03-08  6:19 ` [PATCH V4 2/5] nvme: add helper interface to flush in-flight requests Jianchao Wang
2018-03-08  6:19   ` Jianchao Wang
2018-03-08 13:11   ` Ming Lei
2018-03-08 13:11     ` Ming Lei
2018-03-08 14:44     ` jianchao.wang
2018-03-08 14:44       ` jianchao.wang
2018-03-08 18:21   ` Sagi Grimberg
2018-03-08 18:21     ` Sagi Grimberg
2018-03-09  1:59     ` jianchao.wang
2018-03-09  1:59       ` jianchao.wang
2018-03-08  6:19 ` [PATCH V4 3/5] nvme-pci: avoid nvme_dev_disable to be invoked in nvme_timeout Jianchao Wang
2018-03-08  6:19   ` Jianchao Wang
2018-03-09  2:01   ` jianchao.wang
2018-03-09  2:01     ` jianchao.wang
2018-03-13 13:29     ` jianchao.wang
2018-03-13 13:29       ` jianchao.wang
2018-03-08  6:19 ` [PATCH V4 4/5] nvme-pci: discard wait timeout when delete cq/sq Jianchao Wang
2018-03-08  6:19   ` Jianchao Wang
2018-03-08  6:19 ` [PATCH V4 5/5] nvme-pci: add the timeout case for DELETEING state Jianchao Wang
2018-03-08  6:19   ` Jianchao Wang
2018-04-17 15:17 ` PATCH V4 0/5 nvme-pci: fixes on nvme_timeout and nvme_dev_disable Ming Lei
2018-04-17 15:17   ` Ming Lei
2018-04-18 14:24   ` jianchao.wang
2018-04-18 14:24     ` jianchao.wang
2018-04-18 15:40     ` Ming Lei
2018-04-18 15:40       ` Ming Lei
2018-04-19  1:51       ` jianchao.wang
2018-04-19  1:51         ` jianchao.wang
2018-04-19  2:27         ` Ming Lei
2018-04-19  2:27           ` Ming Lei

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.