* [PATCH 0/6] nvme-pci: fixes on nvme_timeout and nvme_dev_disable
@ 2018-02-02  7:00 ` Jianchao Wang
  0 siblings, 0 replies; 48+ messages in thread
From: Jianchao Wang @ 2018-02-02  7:00 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

Hi Christoph, Keith and Sagi

Please consider and comment on the following patchset; that would be
really appreciated.

There is a complicated relationship between nvme_timeout and nvme_dev_disable.
 - nvme_timeout has to invoke nvme_dev_disable to stop the
   controller doing DMA access before freeing the request.
 - nvme_dev_disable has to depend on nvme_timeout to complete
   adminq requests to set HMB or delete sq/cq when the controller
   has no response.
 - nvme_dev_disable will race with nvme_timeout when it cancels the
   outstanding requests.
We have found some issues introduced by this relationship; please refer to the following links:

http://lists.infradead.org/pipermail/linux-nvme/2018-January/015053.html 
http://lists.infradead.org/pipermail/linux-nvme/2018-January/015276.html
http://lists.infradead.org/pipermail/linux-nvme/2018-January/015328.html
Even so, we cannot be sure there are no other issues.

The best way to fix these issues is to break up the relationship between the two.
With this patchset, nvme_dev_disable is no longer invoked by
nvme_timeout, and the race between nvme_timeout and nvme_dev_disable
on outstanding requests is eliminated.


There are 6 patches:

Patches 1-3 do some preparation for the 4th one.
The 4th avoids nvme_dev_disable being invoked by nvme_timeout and implements
the synchronization between them. For more details, please refer to the
comment of that patch.
The 5th fixes a bug introduced by the 4th patch: it lets nvme_delete_io_queues
be woken up only by the completion path.
The 6th fixes a bug found during testing; it is not related to the 4th patch.

This patchset was tested with a debug patch for several days,
and some bugfixes have been made.
The debug patch and the other patches are available in the following git branch:
https://github.com/jianchwa/linux-blcok.git nvme_fixes_test

Jianchao Wang (6):
0001-nvme-pci-move-clearing-host-mem-behind-stopping-queu.patch
0002-nvme-pci-fix-the-freeze-and-quiesce-for-shutdown-and.patch
0003-blk-mq-make-blk_mq_rq_update_aborted_gstate-a-extern.patch
0004-nvme-pci-break-up-nvme_timeout-and-nvme_dev_disable.patch
0005-nvme-pci-discard-wait-timeout-when-delete-cq-sq.patch
0006-nvme-pci-suspend-queues-based-on-online_queues.patch

The diffstat follows:
 block/blk-mq.c          |   3 +-
 drivers/nvme/host/pci.c | 225 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------
 include/linux/blk-mq.h  |   1 +
 3 files changed, 169 insertions(+), 60 deletions(-)

Thanks
Jianchao



* [PATCH 1/6] nvme-pci: move clearing host mem behind stopping queues
@ 2018-02-02  7:00   ` Jianchao Wang
From: Jianchao Wang @ 2018-02-02  7:00 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

Move clearing the host memory buffer behind stopping the queues, to
prepare for a following patch which will grab all the outstanding requests.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/pci.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6fe7af0..00cffed 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2186,7 +2186,10 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	if (!dead) {
 		if (shutdown)
 			nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
+	}
+	nvme_stop_queues(&dev->ctrl);
 
+	if (!dead) {
 		/*
 		 * If the controller is still alive tell it to stop using the
 		 * host memory buffer.  In theory the shutdown / reset should
@@ -2195,11 +2198,6 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 		 */
 		if (dev->host_mem_descs)
 			nvme_set_host_mem(dev, 0);
-
-	}
-	nvme_stop_queues(&dev->ctrl);
-
-	if (!dead) {
 		nvme_disable_io_queues(dev);
 		nvme_disable_admin_queue(dev, shutdown);
 	}
-- 
2.7.4



* [PATCH 2/6] nvme-pci: fix the freeze and quiesce for shutdown and reset case
@ 2018-02-02  7:00   ` Jianchao Wang
From: Jianchao Wang @ 2018-02-02  7:00 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

Currently, the request queues are frozen and quiesced for both the reset
and shutdown cases. This triggers ioq requests in the RECONNECTING
state, which must be avoided to prepare for a following patch.
Freeze the request queues only for the shutdown case, and drain all the
residual entered requests after the controller has been shut down.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/pci.c | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 00cffed..a7fa397 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2172,21 +2172,23 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	if (pci_is_enabled(pdev)) {
 		u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
-		if (dev->ctrl.state == NVME_CTRL_LIVE ||
-		    dev->ctrl.state == NVME_CTRL_RESETTING)
-			nvme_start_freeze(&dev->ctrl);
 		dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
 			pdev->error_state  != pci_channel_io_normal);
 	}
 
-	/*
-	 * Give the controller a chance to complete all entered requests if
-	 * doing a safe shutdown.
-	 */
-	if (!dead) {
-		if (shutdown)
+	/* Just freeze the queue for shutdown case */
+	if (shutdown) {
+		if (dev->ctrl.state == NVME_CTRL_LIVE ||
+			dev->ctrl.state == NVME_CTRL_RESETTING)
+			nvme_start_freeze(&dev->ctrl);
+		/*
+		 * Give the controller a chance to complete all
+		 * entered requests if doing a safe shutdown.
+		 */
+		if (!dead)
 			nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
 	}
+
 	nvme_stop_queues(&dev->ctrl);
 
 	if (!dead) {
@@ -2210,12 +2212,15 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
 
 	/*
-	 * The driver will not be starting up queues again if shutting down so
-	 * must flush all entered requests to their failed completion to avoid
-	 * deadlocking blk-mq hot-cpu notifier.
+	 * For shutdown case, controller will not be setup again soon. If any
+	 * residual requests here, the controller must have gone wrong. Drain and
+	 * fail all the residual entered IO requests.
 	 */
-	if (shutdown)
+	if (shutdown) {
 		nvme_start_queues(&dev->ctrl);
+		nvme_wait_freeze(&dev->ctrl);
+		nvme_stop_queues(&dev->ctrl);
+	}
 	mutex_unlock(&dev->shutdown_lock);
 }
 
@@ -2349,12 +2354,11 @@ static void nvme_reset_work(struct work_struct *work)
 		nvme_remove_namespaces(&dev->ctrl);
 		new_state = NVME_CTRL_ADMIN_ONLY;
 	} else {
-		nvme_start_queues(&dev->ctrl);
-		nvme_wait_freeze(&dev->ctrl);
 		/* hit this only when allocate tagset fails */
 		if (nvme_dev_add(dev))
 			new_state = NVME_CTRL_ADMIN_ONLY;
-		nvme_unfreeze(&dev->ctrl);
+		if (was_suspend)
+			nvme_unfreeze(&dev->ctrl);
 	}
 
 	/*
-- 
2.7.4



* [PATCH 3/6] blk-mq: make blk_mq_rq_update_aborted_gstate an external interface
@ 2018-02-02  7:00   ` Jianchao Wang
From: Jianchao Wang @ 2018-02-02  7:00 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

No functional change; just make blk_mq_rq_update_aborted_gstate an
external interface.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 block/blk-mq.c         | 3 ++-
 include/linux/blk-mq.h | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 01f271d..a027ca2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -581,7 +581,7 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
 		*srcu_idx = srcu_read_lock(hctx->srcu);
 }
 
-static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
+void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
 {
 	unsigned long flags;
 
@@ -597,6 +597,7 @@ static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
 	u64_stats_update_end(&rq->aborted_gstate_sync);
 	local_irq_restore(flags);
 }
+EXPORT_SYMBOL(blk_mq_rq_update_aborted_gstate);
 
 static u64 blk_mq_rq_aborted_gstate(struct request *rq)
 {
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8efcf49..ad54024 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -257,6 +257,7 @@ void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
 void blk_mq_kick_requeue_list(struct request_queue *q);
 void blk_mq_delay_kick_requeue_list(struct request_queue *q, unsigned long msecs);
 void blk_mq_complete_request(struct request *rq);
+void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate);
 
 bool blk_mq_queue_stopped(struct request_queue *q);
 void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx);
-- 
2.7.4



* [PATCH 4/6] nvme-pci: break up nvme_timeout and nvme_dev_disable
@ 2018-02-02  7:00   ` Jianchao Wang
From: Jianchao Wang @ 2018-02-02  7:00 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

Currently, the complicated relationship between nvme_dev_disable
and nvme_timeout introduces many circular patterns which may trigger
deadlocks or IO hangs. Let's enumerate the tangles between them:
 - nvme_timeout has to invoke nvme_dev_disable to stop the
   controller doing DMA access before freeing the request.
 - nvme_dev_disable has to depend on nvme_timeout to complete
   adminq requests to set HMB or delete sq/cq when the controller
   has no response.
 - nvme_dev_disable will race with nvme_timeout when it cancels the
   outstanding requests.

To break them up, let's first look at what kinds of requests
nvme_timeout has to handle.

RESETTING          previous adminq/IOq request
or shutdown        adminq requests from nvme_dev_disable

RECONNECTING       adminq requests from nvme_reset_work

nvme_timeout has to invoke nvme_dev_disable before completing
any of the expired requests above. We avoid this as follows.

For the previous adminq/IOq requests:
use blk_abort_request in nvme_dev_disable to force all the outstanding
requests to expire. In nvme_timeout, set NVME_REQ_CANCELLED and
return BLK_EH_NOT_HANDLED. The request is then neither completed nor
freed, and we no longer need to invoke nvme_dev_disable.

blk_abort_request is safe when racing with the irq completion path,
so we are able to grab all the outstanding requests. This eliminates
the race between nvme_timeout and nvme_dev_disable.

We use NVME_REQ_CANCELLED to identify them. After the controller is
totally disabled/shut down, we invoke blk_mq_rq_update_aborted_gstate
to clear the requests' aborted gstate and blk_mq_complete_request to
complete them.

In addition, to distinguish the previous adminq/IOq requests from the
adminq requests issued by nvme_dev_disable, we introduce
NVME_PCI_OUTSTANDING_GRABBING and NVME_PCI_OUTSTANDING_GRABBED so that
nvme_timeout can tell them apart.

For the adminq requests from nvme_dev_disable/nvme_reset_work:
invoke nvme_disable_ctrl directly, then set NVME_REQ_CANCELLED and
return BLK_EH_HANDLED. nvme_dev_disable/nvme_reset_work will
see the error.

With this patch, nvme_dev_disable is no longer invoked by nvme_timeout,
and the race between nvme_timeout and nvme_dev_disable on outstanding
requests is eliminated.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/pci.c | 146 ++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 123 insertions(+), 23 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a7fa397..5b192b0 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -70,6 +70,8 @@ struct nvme_queue;
 
 static void nvme_process_cq(struct nvme_queue *nvmeq);
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
+#define NVME_PCI_OUTSTANDING_GRABBING 1
+#define NVME_PCI_OUTSTANDING_GRABBED 2
 
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
@@ -80,6 +82,7 @@ struct nvme_dev {
 	struct blk_mq_tag_set admin_tagset;
 	u32 __iomem *dbs;
 	struct device *dev;
+	int grab_flag;
 	struct dma_pool *prp_page_pool;
 	struct dma_pool *prp_small_pool;
 	unsigned online_queues;
@@ -1130,6 +1133,23 @@ static void abort_endio(struct request *req, blk_status_t error)
 	blk_mq_free_request(req);
 }
 
+static void nvme_pci_disable_ctrl_directly(struct nvme_dev *dev)
+{
+	struct pci_dev *pdev = to_pci_dev(dev->dev);
+	u32 csts;
+	bool dead;
+
+	if (!pci_is_enabled(pdev))
+		return;
+
+	csts = readl(dev->bar + NVME_REG_CSTS);
+	dead = !!((csts & NVME_CSTS_CFS) ||
+			!(csts & NVME_CSTS_RDY) ||
+			pdev->error_state  != pci_channel_io_normal);
+	if (!dead)
+		nvme_disable_ctrl(&dev->ctrl, dev->ctrl.cap);
+}
+
 static bool nvme_should_reset(struct nvme_dev *dev, u32 csts)
 {
 
@@ -1191,12 +1211,13 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 
 	/*
 	 * Reset immediately if the controller is failed
+	 * nvme_dev_disable will take over the expired requests.
 	 */
 	if (nvme_should_reset(dev, csts)) {
+		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
 		nvme_warn_reset(dev, csts);
-		nvme_dev_disable(dev, false);
 		nvme_reset_ctrl(&dev->ctrl);
-		return BLK_EH_HANDLED;
+		return BLK_EH_NOT_HANDLED;
 	}
 
 	/*
@@ -1210,38 +1231,51 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	}
 
 	/*
-	 * Shutdown immediately if controller times out while starting. The
-	 * reset work will see the pci device disabled when it gets the forced
-	 * cancellation error. All outstanding requests are completed on
-	 * shutdown, so we return BLK_EH_HANDLED.
+	 * The previous outstanding requests on the adminq and ioq have been
+	 * grabbed or drained for the RECONNECTING state, so it must be an admin
+	 * request from nvme_dev_disable or nvme_reset_work.
+	 * - stop the controller directly
+	 * - set NVME_REQ_CANCELLED, nvme_dev_disable or nvme_reset_work
+	 *   could see this error and stop.
+	 * - return BLK_EH_HANDLED
 	 */
-	if (dev->ctrl.state == NVME_CTRL_RESETTING) {
-		dev_warn(dev->ctrl.device,
-			 "I/O %d QID %d timeout, disable controller\n",
-			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false);
+	if ((READ_ONCE(dev->grab_flag) == NVME_PCI_OUTSTANDING_GRABBED) ||
+		(dev->ctrl.state == NVME_CTRL_RECONNECTING)) {
+		nvme_pci_disable_ctrl_directly(dev);
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
 		return BLK_EH_HANDLED;
 	}
 
 	/*
- 	 * Shutdown the controller immediately and schedule a reset if the
- 	 * command was already aborted once before and still hasn't been
- 	 * returned to the driver, or if this is the admin queue.
+	 * nvme_dev_disable is grabbing all the outstanding requests.
+	 * Just set NVME_REQ_CANCELLED and return BLK_EH_NOT_HANDLED;
+	 * nvme_dev_disable will take over everything.
+	 */
+	if (READ_ONCE(dev->grab_flag) == NVME_PCI_OUTSTANDING_GRABBING) {
+		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
+		return BLK_EH_NOT_HANDLED;
+	}
+
+	/*
+	 * If the controller is DELETING, we needn't do any recovery.
+	 */
+	if (dev->ctrl.state == NVME_CTRL_DELETING) {
+		nvme_pci_disable_ctrl_directly(dev);
+		nvme_req(req)->status = NVME_SC_ABORT_REQ;
+		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
+		return BLK_EH_HANDLED;
+	}
+	/*
+	 * Just set NVME_REQ_CANCELLED and return BLK_EH_NOT_HANDLED;
+	 * nvme_dev_disable will take over everything.
 	 */
 	if (!nvmeq->qid || iod->aborted) {
 		dev_warn(dev->ctrl.device,
 			 "I/O %d QID %d timeout, reset controller\n",
 			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false);
 		nvme_reset_ctrl(&dev->ctrl);
-
-		/*
-		 * Mark the request as handled, since the inline shutdown
-		 * forces all outstanding requests to complete.
-		 */
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
-		return BLK_EH_HANDLED;
+		return BLK_EH_NOT_HANDLED;
 	}
 
 	if (atomic_dec_return(&dev->ctrl.abort_limit) < 0) {
@@ -2162,6 +2196,66 @@ static void nvme_pci_disable(struct nvme_dev *dev)
 	}
 }
 
+static void nvme_pci_abort_rq(struct request *req, void *data, bool reserved)
+{
+	if (!blk_mq_request_started(req))
+		return;
+
+	dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device,
+				"Abort I/O %d", req->tag);
+
+	nvme_req(req)->status = NVME_SC_ABORT_REQ;
+	blk_abort_request(req);
+}
+/*
+ * Now the controller has been disabled, no interrupt will race with us,
+ * so it is safe to complete them
+ *  - IO requests will be requeued.
+ *  - adminq requests will be failed, the tags will also be freed.
+ */
+static void nvme_pci_cancel_rq(struct request *req, void *data, bool reserved)
+{
+	if (!blk_mq_request_started(req))
+		return;
+
+	dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device,
+				"Cancel I/O %d", req->tag);
+
+	/* The controller has been disabled/shut down, so it is safe here to
+	 * clear the aborted_gstate and complete the request.
+	 */
+	if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
+		blk_mq_rq_update_aborted_gstate(req, 0);
+	blk_mq_complete_request(req);
+}
+
+
+static void nvme_pci_sync_timeout_work(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	mutex_lock(&ctrl->namespaces_mutex);
+	/* ensure the timeout_work is queued */
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		kblockd_schedule_work(&ns->queue->timeout_work);
+
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		flush_work(&ns->queue->timeout_work);
+	mutex_unlock(&ctrl->namespaces_mutex);
+}
+
+static void nvme_pci_grab_outstanding(struct nvme_dev *dev)
+{
+	blk_mq_tagset_busy_iter(&dev->tagset, nvme_pci_abort_rq, &dev->ctrl);
+	blk_mq_tagset_busy_iter(&dev->admin_tagset,
+			nvme_pci_abort_rq, &dev->ctrl);
+	nvme_pci_sync_timeout_work(&dev->ctrl);
+	/* All the outstanding requests have been grabbed. They are forced to be
+	 * expired, neither irq completion path nor timeout path could take them
+	 * away.
+	 */
+}
+
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 {
 	int i;
@@ -2191,6 +2285,10 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 
 	nvme_stop_queues(&dev->ctrl);
 
+	WRITE_ONCE(dev->grab_flag, NVME_PCI_OUTSTANDING_GRABBING);
+	nvme_pci_grab_outstanding(dev);
+	WRITE_ONCE(dev->grab_flag, NVME_PCI_OUTSTANDING_GRABBED);
+
 	if (!dead) {
 		/*
 		 * If the controller is still alive tell it to stop using the
@@ -2208,9 +2306,11 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 
 	nvme_pci_disable(dev);
 
-	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
-	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
+	blk_mq_tagset_busy_iter(&dev->tagset, nvme_pci_cancel_rq, &dev->ctrl);
+	blk_mq_tagset_busy_iter(&dev->admin_tagset,
+			nvme_pci_cancel_rq, &dev->ctrl);
 
+	WRITE_ONCE(dev->grab_flag, 0);
 	/*
 	 * For shutdown case, controller will not be setup again soon. If any
 	 * residual requests here, the controller must have gone wrong. Drain and
-- 
2.7.4


* [PATCH 4/6] nvme-pci: break up nvme_timeout and nvme_dev_disable
@ 2018-02-02  7:00   ` Jianchao Wang
  0 siblings, 0 replies; 48+ messages in thread
From: Jianchao Wang @ 2018-02-02  7:00 UTC (permalink / raw)


Currently, the complicated relationship between nvme_dev_disable
and nvme_timeout has become a devil that will introduce many
circular pattern which may trigger deadlock or IO hang. Let's
enumerate the tangles between them:
 - nvme_timeout has to invoke nvme_dev_disable to stop the
   controller doing DMA access before free the request.
 - nvme_dev_disable has to depend on nvme_timeout to complete
   adminq requests to set HMB or delete sq/cq when the controller
   has no response.
 - nvme_dev_disable will race with nvme_timeout when cancels the
   outstanding requests.

To break up them, let's first look at what's kind of requests
nvme_timeout has to handle.

RESETTING          previous adminq/IOq request
or shutdown        adminq requests from nvme_dev_disable

RECONNECTING       adminq requests from nvme_reset_work

nvme_timeout has to invoke nvme_dev_disable first before complete
all the expired request above. We avoid this as following.

For the previous adminq/IOq request:
use blk_abort_request to force all the outstanding requests expired
in nvme_dev_disable. In nvme_timeout, set NVME_REQ_CANCELLED and
return BLK_EH_NOT_HANDLED. Then the request will not be completed and
freed. We needn't invoke nvme_dev_disable any more.

blk_abort_request is safe when race with irq completion path.
we have been able to grab all the outstanding requests. This could
eliminate the race between nvme_timeout and nvme_dev_disable.

We use NVME_REQ_CANCELLED to identify them. After the controller is
totally disabled/shutdown, we invoke blk_mq_rq_update_aborted_gstate
to clear requests and invoke blk_mq_complete_request to complete them.

In addition, to identify the previous adminq/IOq request and adminq
requests from nvme_dev_disable, we introduce NVME_PCI_OUTSTANDING_GRABBING
and NVME_PCI_OUTSTANDING_GRABBED to let nvme_timeout be able to
distinguish them.

For the adminq requests from nvme_dev_disable/nvme_reset_work:
invoke nvme_disable_ctrl directly, then set NVME_REQ_CANCELLED and
return BLK_EH_HANDLED. nvme_dev_disable/nvme_reset_work will
see the error.

With this patch, we could avoid nvme_dev_disable to be invoked
by nvme_timeout and eliminate the race between nvme_timeout and
nvme_dev_disable on outstanding requests.

Signed-off-by: Jianchao Wang <jianchao.w.wang at oracle.com>
---
 drivers/nvme/host/pci.c | 146 ++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 123 insertions(+), 23 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a7fa397..5b192b0 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -70,6 +70,8 @@ struct nvme_queue;
 
 static void nvme_process_cq(struct nvme_queue *nvmeq);
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
+#define NVME_PCI_OUTSTANDING_GRABBING 1
+#define NVME_PCI_OUTSTANDING_GRABBED 2
 
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
@@ -80,6 +82,7 @@ struct nvme_dev {
 	struct blk_mq_tag_set admin_tagset;
 	u32 __iomem *dbs;
 	struct device *dev;
+	int grab_flag;
 	struct dma_pool *prp_page_pool;
 	struct dma_pool *prp_small_pool;
 	unsigned online_queues;
@@ -1130,6 +1133,23 @@ static void abort_endio(struct request *req, blk_status_t error)
 	blk_mq_free_request(req);
 }
 
+static void nvme_pci_disable_ctrl_directly(struct nvme_dev *dev)
+{
+	struct pci_dev *pdev = to_pci_dev(dev->dev);
+	u32 csts;
+	bool dead;
+
+	if (!pci_is_enabled(pdev))
+		return;
+
+	csts = readl(dev->bar + NVME_REG_CSTS);
+	dead = !!((csts & NVME_CSTS_CFS) ||
+			!(csts & NVME_CSTS_RDY) ||
+			pdev->error_state  != pci_channel_io_normal);
+	if (!dead)
+		nvme_disable_ctrl(&dev->ctrl, dev->ctrl.cap);
+}
+
 static bool nvme_should_reset(struct nvme_dev *dev, u32 csts)
 {
 
@@ -1191,12 +1211,13 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 
 	/*
 	 * Reset immediately if the controller is failed
+	 * nvme_dev_disable will take over the expired requests.
 	 */
 	if (nvme_should_reset(dev, csts)) {
+		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
 		nvme_warn_reset(dev, csts);
-		nvme_dev_disable(dev, false);
 		nvme_reset_ctrl(&dev->ctrl);
-		return BLK_EH_HANDLED;
+		return BLK_EH_NOT_HANDLED;
 	}
 
 	/*
@@ -1210,38 +1231,51 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	}
 
 	/*
-	 * Shutdown immediately if controller times out while starting. The
-	 * reset work will see the pci device disabled when it gets the forced
-	 * cancellation error. All outstanding requests are completed on
-	 * shutdown, so we return BLK_EH_HANDLED.
+	 * The previous outstanding requests on the adminq and ioqs have
+	 * been grabbed or drained for the RECONNECTING state, so this must
+	 * be an admin request from nvme_dev_disable or nvme_reset_work:
+	 * - stop the controller directly
+	 * - set NVME_REQ_CANCELLED so that nvme_dev_disable or
+	 *   nvme_reset_work can see the error and stop
+	 * - return BLK_EH_HANDLED
 	 */
-	if (dev->ctrl.state == NVME_CTRL_RESETTING) {
-		dev_warn(dev->ctrl.device,
-			 "I/O %d QID %d timeout, disable controller\n",
-			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false);
+	if ((READ_ONCE(dev->grab_flag) == NVME_PCI_OUTSTANDING_GRABBED) ||
+		(dev->ctrl.state == NVME_CTRL_RECONNECTING)) {
+		nvme_pci_disable_ctrl_directly(dev);
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
 		return BLK_EH_HANDLED;
 	}
 
 	/*
- 	 * Shutdown the controller immediately and schedule a reset if the
- 	 * command was already aborted once before and still hasn't been
- 	 * returned to the driver, or if this is the admin queue.
+	 * nvme_dev_disable is grabbing all the outstanding requests;
+	 * just set NVME_REQ_CANCELLED and return BLK_EH_NOT_HANDLED.
+	 * nvme_dev_disable will take care of them.
+	 */
+	if (READ_ONCE(dev->grab_flag) == NVME_PCI_OUTSTANDING_GRABBING) {
+		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
+		return BLK_EH_NOT_HANDLED;
+	}
+
+	/*
+	 * If the controller is DELETING, no recovery is needed.
+	 */
+	if (dev->ctrl.state == NVME_CTRL_DELETING) {
+		nvme_pci_disable_ctrl_directly(dev);
+		nvme_req(req)->status = NVME_SC_ABORT_REQ;
+		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
+		return BLK_EH_HANDLED;
+	}
+	/*
+	 * Just set NVME_REQ_CANCELLED and return BLK_EH_NOT_HANDLED;
+	 * nvme_dev_disable will take care of everything.
+	 */
 	if (!nvmeq->qid || iod->aborted) {
 		dev_warn(dev->ctrl.device,
 			 "I/O %d QID %d timeout, reset controller\n",
 			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false);
 		nvme_reset_ctrl(&dev->ctrl);
-
-		/*
-		 * Mark the request as handled, since the inline shutdown
-		 * forces all outstanding requests to complete.
-		 */
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
-		return BLK_EH_HANDLED;
+		return BLK_EH_NOT_HANDLED;
 	}
 
 	if (atomic_dec_return(&dev->ctrl.abort_limit) < 0) {
@@ -2162,6 +2196,66 @@ static void nvme_pci_disable(struct nvme_dev *dev)
 	}
 }
 
+static void nvme_pci_abort_rq(struct request *req, void *data, bool reserved)
+{
+	if (!blk_mq_request_started(req))
+		return;
+
+	dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device,
+				"Abort I/O %d", req->tag);
+
+	nvme_req(req)->status = NVME_SC_ABORT_REQ;
+	blk_abort_request(req);
+}
+/*
+ * Now the controller has been disabled, no interrupt will race with us,
+ * so it is safe to complete them
+ *  - IO requests will be requeued.
+ *  - adminq requests will be failed, the tags will also be freed.
+ */
+static void nvme_pci_cancel_rq(struct request *req, void *data, bool reserved)
+{
+	if (!blk_mq_request_started(req))
+		return;
+
+	dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device,
+				"Cancel I/O %d", req->tag);
+
+	/* The controller has been disabled/shut down, so it is safe to
+	 * clear the aborted_gstate and complete the request here.
+	 */
+	if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
+		blk_mq_rq_update_aborted_gstate(req, 0);
+	blk_mq_complete_request(req);
+}
+
+
+static void nvme_pci_sync_timeout_work(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	mutex_lock(&ctrl->namespaces_mutex);
+	/* ensure the timeout_work is queued */
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		kblockd_schedule_work(&ns->queue->timeout_work);
+
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		flush_work(&ns->queue->timeout_work);
+	mutex_unlock(&ctrl->namespaces_mutex);
+}
+
+static void nvme_pci_grab_outstanding(struct nvme_dev *dev)
+{
+	blk_mq_tagset_busy_iter(&dev->tagset, nvme_pci_abort_rq, &dev->ctrl);
+	blk_mq_tagset_busy_iter(&dev->admin_tagset,
+			nvme_pci_abort_rq, &dev->ctrl);
+	nvme_pci_sync_timeout_work(&dev->ctrl);
+	/* All the outstanding requests have been grabbed: they are forced
+	 * to expire, and neither the irq completion path nor the timeout
+	 * path can take them away.
+	 */
+}
+
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 {
 	int i;
@@ -2191,6 +2285,10 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 
 	nvme_stop_queues(&dev->ctrl);
 
+	WRITE_ONCE(dev->grab_flag, NVME_PCI_OUTSTANDING_GRABBING);
+	nvme_pci_grab_outstanding(dev);
+	WRITE_ONCE(dev->grab_flag, NVME_PCI_OUTSTANDING_GRABBED);
+
 	if (!dead) {
 		/*
 		 * If the controller is still alive tell it to stop using the
@@ -2208,9 +2306,11 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 
 	nvme_pci_disable(dev);
 
-	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
-	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
+	blk_mq_tagset_busy_iter(&dev->tagset, nvme_pci_cancel_rq, &dev->ctrl);
+	blk_mq_tagset_busy_iter(&dev->admin_tagset,
+			nvme_pci_cancel_rq, &dev->ctrl);
 
+	WRITE_ONCE(dev->grab_flag, 0);
 	/*
 	 * For shutdown case, controller will not be setup again soon. If any
 	 * residual requests here, the controller must have gone wrong. Drain and
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 5/6] nvme-pci: discard wait timeout when delete cq/sq
  2018-02-02  7:00 ` Jianchao Wang
@ 2018-02-02  7:00   ` Jianchao Wang
  -1 siblings, 0 replies; 48+ messages in thread
From: Jianchao Wang @ 2018-02-02  7:00 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

Currently, nvme_disable_io_queues can be woken up by both the request
completion path and the wait-timeout path. This is unnecessary and can
introduce a race between nvme_dev_disable and the request timeout path:
when a delete cq/sq command expires, nvme_disable_io_queues is also
woken up and returns to nvme_dev_disable, which then handles the
outstanding requests and races with the request timeout path.

To fix this, just use wait_for_completion instead of the timed-out
variant; the request timeout path will wake it up.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/pci.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 5b192b0..a838713c 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2048,7 +2048,6 @@ static int nvme_delete_queue(struct nvme_queue *nvmeq, u8 opcode)
 static void nvme_disable_io_queues(struct nvme_dev *dev)
 {
 	int pass, queues = dev->online_queues - 1;
-	unsigned long timeout;
 	u8 opcode = nvme_admin_delete_sq;
 
 	for (pass = 0; pass < 2; pass++) {
@@ -2056,15 +2055,12 @@ static void nvme_disable_io_queues(struct nvme_dev *dev)
 
 		reinit_completion(&dev->ioq_wait);
  retry:
-		timeout = ADMIN_TIMEOUT;
 		for (; i > 0; i--, sent++)
 			if (nvme_delete_queue(&dev->queues[i], opcode))
 				break;
 
 		while (sent--) {
-			timeout = wait_for_completion_io_timeout(&dev->ioq_wait, timeout);
-			if (timeout == 0)
-				return;
+			wait_for_completion(&dev->ioq_wait);
 			if (i)
 				goto retry;
 		}
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 6/6] nvme-pci: suspend queues based on online_queues
  2018-02-02  7:00 ` Jianchao Wang
@ 2018-02-02  7:00   ` Jianchao Wang
  -1 siblings, 0 replies; 48+ messages in thread
From: Jianchao Wang @ 2018-02-02  7:00 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

The nvme cq irq is freed based on queue_count. When sq/cq creation
fails, the irq will not have been set up, and free_irq will warn
'Trying to free already-free IRQ'.

To fix this, only increase online_queues once the adminq/sq/cq has
been created and its associated irq has been set up, then suspend
queues based on online_queues.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/pci.c | 31 ++++++++++++++++++++-----------
 1 file changed, 20 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a838713c..e37f209 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1349,9 +1349,6 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)
 	nvmeq->cq_vector = -1;
 	spin_unlock_irq(&nvmeq->q_lock);
 
-	if (!nvmeq->qid && nvmeq->dev->ctrl.admin_q)
-		blk_mq_quiesce_queue(nvmeq->dev->ctrl.admin_q);
-
 	pci_free_irq(to_pci_dev(nvmeq->dev->dev), vector, nvmeq);
 
 	return 0;
@@ -1495,13 +1492,15 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 	nvme_init_queue(nvmeq, qid);
 	result = queue_request_irq(nvmeq);
 	if (result < 0)
-		goto release_sq;
+		goto offline;
 
 	return result;
 
- release_sq:
+offline:
+	dev->online_queues--;
+release_sq:
 	adapter_delete_sq(dev, qid);
- release_cq:
+release_cq:
 	adapter_delete_cq(dev, qid);
 	return result;
 }
@@ -1641,6 +1640,7 @@ static int nvme_pci_configure_admin_queue(struct nvme_dev *dev)
 	result = queue_request_irq(nvmeq);
 	if (result) {
 		nvmeq->cq_vector = -1;
+		dev->online_queues--;
 		return result;
 	}
 
@@ -1988,6 +1988,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	result = queue_request_irq(adminq);
 	if (result) {
 		adminq->cq_vector = -1;
+		dev->online_queues--;
 		return result;
 	}
 	return nvme_create_io_queues(dev);
@@ -2257,13 +2258,16 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	int i;
 	bool dead = true;
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
+	int onlines;
 
 	mutex_lock(&dev->shutdown_lock);
 	if (pci_is_enabled(pdev)) {
 		u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
-		dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
-			pdev->error_state  != pci_channel_io_normal);
+		dead = !!((csts & NVME_CSTS_CFS) ||
+				!(csts & NVME_CSTS_RDY) ||
+				(pdev->error_state  != pci_channel_io_normal) ||
+				(dev->online_queues == 0));
 	}
 
 	/* Just freeze the queue for shutdown case */
@@ -2297,9 +2301,14 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 		nvme_disable_io_queues(dev);
 		nvme_disable_admin_queue(dev, shutdown);
 	}
-	for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
+
+	onlines = dev->online_queues;
+	for (i = onlines - 1; i >= 0; i--)
 		nvme_suspend_queue(&dev->queues[i]);
 
+	if (dev->ctrl.admin_q)
+		blk_mq_quiesce_queue(dev->ctrl.admin_q);
+
 	nvme_pci_disable(dev);
 
 	blk_mq_tagset_busy_iter(&dev->tagset, nvme_pci_cancel_rq, &dev->ctrl);
@@ -2444,12 +2453,12 @@ static void nvme_reset_work(struct work_struct *work)
 	 * Keep the controller around but remove all namespaces if we don't have
 	 * any working I/O queue.
 	 */
-	if (dev->online_queues < 2) {
+	if (dev->online_queues == 1) {
 		dev_warn(dev->ctrl.device, "IO queues not created\n");
 		nvme_kill_queues(&dev->ctrl);
 		nvme_remove_namespaces(&dev->ctrl);
 		new_state = NVME_CTRL_ADMIN_ONLY;
-	} else {
+	} else if (dev->online_queues > 1) {
 		/* hit this only when allocate tagset fails */
 		if (nvme_dev_add(dev))
 			new_state = NVME_CTRL_ADMIN_ONLY;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/6] nvme-pci: fix the freeze and quiesce for shutdown and reset case
  2018-02-02  7:00   ` Jianchao Wang
@ 2018-02-02 18:24     ` Keith Busch
  -1 siblings, 0 replies; 48+ messages in thread
From: Keith Busch @ 2018-02-02 18:24 UTC (permalink / raw)
  To: Jianchao Wang; +Cc: axboe, hch, sagi, linux-nvme, linux-kernel

On Fri, Feb 02, 2018 at 03:00:45PM +0800, Jianchao Wang wrote:
> Currently, request queue will be frozen and quiesced for both reset
> and shutdown case. This will trigger ioq requests in RECONNECTING
> state which should be avoided to prepare for following patch.
> Just freeze the request queue for the shutdown case and drain all the residual
> entered requests after controller has been shutdown.

Freezing is not just for shutdown. It's also used so
blk_mq_update_nr_hw_queues will work if the queue count changes across
resets.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/6] nvme-pci: break up nvme_timeout and nvme_dev_disable
  2018-02-02  7:00   ` Jianchao Wang
@ 2018-02-02 18:31     ` Keith Busch
  -1 siblings, 0 replies; 48+ messages in thread
From: Keith Busch @ 2018-02-02 18:31 UTC (permalink / raw)
  To: Jianchao Wang; +Cc: axboe, hch, sagi, linux-nvme, linux-kernel

On Fri, Feb 02, 2018 at 03:00:47PM +0800, Jianchao Wang wrote:
> Currently, the complicated relationship between nvme_dev_disable
> and nvme_timeout has become a devil that will introduce many
> circular pattern which may trigger deadlock or IO hang. Let's
> enumerate the tangles between them:
>  - nvme_timeout has to invoke nvme_dev_disable to stop the
>    controller doing DMA access before free the request.
>  - nvme_dev_disable has to depend on nvme_timeout to complete
>    adminq requests to set HMB or delete sq/cq when the controller
>    has no response.
>  - nvme_dev_disable will race with nvme_timeout when cancels the
>    outstanding requests.

Your patch is releasing a command back to the OS with the
PCI controller bus master still enabled. This could lead to data or
memory corruption.

In any case, it's not as complicated as you're making it out to
be. It'd be easier to just enforce the existing rule that commands
issued in the disabling path not depend on completions or timeout
handling. All commands issued in this path already do this except
for HMB disabling. Let's just fix that command, right?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/6] nvme-pci: move clearing host mem behind stopping queues
  2018-02-02  7:00   ` Jianchao Wang
@ 2018-02-02 18:46     ` Keith Busch
  -1 siblings, 0 replies; 48+ messages in thread
From: Keith Busch @ 2018-02-02 18:46 UTC (permalink / raw)
  To: Jianchao Wang; +Cc: axboe, hch, sagi, linux-nvme, linux-kernel

On Fri, Feb 02, 2018 at 03:00:44PM +0800, Jianchao Wang wrote:
> Move clearing host mem behind stopping queues. Prepare for
> following patch which will grab all the outstanding requests.
> 
> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>

This one makes sense, though I would alter the change log to something
like:

  This patch quiesces new IO prior to disabling device HMB access.
  A controller using HMB may be relying on it to efficiently complete
  IO commands.

Reviewed-by: Keith Busch <keith.busch@intel.com>

> ---
>  drivers/nvme/host/pci.c | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 6fe7af0..00cffed 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2186,7 +2186,10 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>  	if (!dead) {
>  		if (shutdown)
>  			nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
> +	}
> +	nvme_stop_queues(&dev->ctrl);
>  
> +	if (!dead) {
>  		/*
>  		 * If the controller is still alive tell it to stop using the
>  		 * host memory buffer.  In theory the shutdown / reset should
> @@ -2195,11 +2198,6 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>  		 */
>  		if (dev->host_mem_descs)
>  			nvme_set_host_mem(dev, 0);
> -
> -	}
> -	nvme_stop_queues(&dev->ctrl);
> -
> -	if (!dead) {
>  		nvme_disable_io_queues(dev);
>  		nvme_disable_admin_queue(dev, shutdown);
>  	}
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/6] nvme-pci: break up nvme_timeout and nvme_dev_disable
  2018-02-02 18:31     ` Keith Busch
@ 2018-02-05  2:22       ` jianchao.wang
  -1 siblings, 0 replies; 48+ messages in thread
From: jianchao.wang @ 2018-02-05  2:22 UTC (permalink / raw)
  To: Keith Busch; +Cc: axboe, hch, sagi, linux-nvme, linux-kernel

Hi Keith

Thanks for your kind response and comments.
That's really appreciated.

On 02/03/2018 02:31 AM, Keith Busch wrote:
> On Fri, Feb 02, 2018 at 03:00:47PM +0800, Jianchao Wang wrote:
>> Currently, the complicated relationship between nvme_dev_disable
>> and nvme_timeout has become a devil that will introduce many
>> circular pattern which may trigger deadlock or IO hang. Let's
>> enumerate the tangles between them:
>>  - nvme_timeout has to invoke nvme_dev_disable to stop the
>>    controller doing DMA access before free the request.
>>  - nvme_dev_disable has to depend on nvme_timeout to complete
>>    adminq requests to set HMB or delete sq/cq when the controller
>>    has no response.
>>  - nvme_dev_disable will race with nvme_timeout when cancels the
>>    outstanding requests.
> 
> Your patch is releasing a command back to the OS with the
> PCI controller bus master still enabled. This could lead to data or
> memory corruption.
> 

There are two cases in which nvme_timeout returns:
BLK_EH_HANDLED
BLK_EH_NOT_HANDLED

For the first case, the patch will disable the controller.
The controller will then stop processing any outstanding commands and delete the sq/cq queues per the protocol.
It looks like this is still not enough; I will also disable the _pci_ in nvme_pci_disable_dev_directly in the next version.
Really, thanks for your guidance here.

For the second case, it returns BLK_EH_NOT_HANDLED.
blk_mq_rq_timed_out will do nothing in this case.
All the commands will be handled after everything is disabled.

> In any case, it's not as complicated as you're making it out to
> be. It'd be easier to just enforce the exisiting rule that commands
> issued in the disabling path not depend on completions or timeout
> handling. All of commands issued in this path already do this except
> for HMB disabling. Let'sjust fix that command, right?
> 
We will still run into nvme_timeout invoking nvme_dev_disable without being able to synchronize on the
outstanding requests. This is really a devil and blocks other improvements.
This patch does just two things:
1. Grab all the previous outstanding requests with blk_abort_request, then release them after the controller is totally disabled/shut down.
   Consequently, during the disable/shutdown and initialization procedure, the nvme_timeout path only needs to serve them. This also
   ensures there will be _no_ outstanding requests after nvme_dev_disable.
2. Fail the adminq commands issued during the disable/shutdown and initialization procedure when the controller does not respond. We need two
   steps for this: disable the controller/pci and complete the command. Then nvme_timeout will no longer need to invoke nvme_dev_disable, and
   nvme_dev_disable will be independent.
 
Please consider this.

Many thanks
Jianchao

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/6] nvme-pci: fix the freeze and quiesce for shutdown and reset case
  2018-02-02 18:24     ` Keith Busch
@ 2018-02-05  2:26       ` jianchao.wang
  -1 siblings, 0 replies; 48+ messages in thread
From: jianchao.wang @ 2018-02-05  2:26 UTC (permalink / raw)
  To: Keith Busch; +Cc: axboe, linux-kernel, hch, linux-nvme, sagi

Hi Keith

Thanks for your kind response.

On 02/03/2018 02:24 AM, Keith Busch wrote:
> On Fri, Feb 02, 2018 at 03:00:45PM +0800, Jianchao Wang wrote:
>> Currently, request queue will be frozen and quiesced for both reset
>> and shutdown case. This will trigger ioq requests in RECONNECTING
>> state which should be avoided to prepare for following patch.
>> Just freeze request queue for shutdown case and drain all the residual
>> entered requests after controller has been shutdown.

> Freezing is not just for shutdown. It's also used so
> blk_mq_update_nr_hw_queues will work if the queue count changes across
> resets.
blk_mq_update_nr_hw_queues freezes the queue itself. Please refer to:
static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
							int nr_hw_queues)
{
	struct request_queue *q;

	lockdep_assert_held(&set->tag_list_lock);

	if (nr_hw_queues > nr_cpu_ids)
		nr_hw_queues = nr_cpu_ids;
	if (nr_hw_queues < 1 || nr_hw_queues == set->nr_hw_queues)
		return;

	list_for_each_entry(q, &set->tag_list, tag_set_list)
		blk_mq_freeze_queue(q);

	set->nr_hw_queues = nr_hw_queues;
	blk_mq_update_queue_map(set);
	list_for_each_entry(q, &set->tag_list, tag_set_list) {
		blk_mq_realloc_hw_ctxs(set, q);
		blk_mq_queue_reinit(q);
	}

	list_for_each_entry(q, &set->tag_list, tag_set_list)
		blk_mq_unfreeze_queue(q);
}

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 48+ messages in thread


* Re: [PATCH 1/6] nvme-pci: move clearing host mem behind stopping queues
  2018-02-02 18:46     ` Keith Busch
@ 2018-02-05  2:30       ` jianchao.wang
  -1 siblings, 0 replies; 48+ messages in thread
From: jianchao.wang @ 2018-02-05  2:30 UTC (permalink / raw)
  To: Keith Busch; +Cc: axboe, hch, sagi, linux-nvme, linux-kernel

Hi Keith

Thanks for your kind response and guidance.

On 02/03/2018 02:46 AM, Keith Busch wrote:
> This one makes sense, though I would alter the change log to something
> like:
> 
>   This patch quiesces new IO prior to disabling device HMB access.
>   A controller using HMB may be relying on it to efficiently complete
>   IO commands
Yes, I will change it in the next version.

Sincerely
Jianchao

^ permalink raw reply	[flat|nested] 48+ messages in thread


* Re: [PATCH 2/6] nvme-pci: fix the freeze and quiesce for shutdown and reset case
  2018-02-05  2:26       ` jianchao.wang
@ 2018-02-05 15:13         ` Keith Busch
  -1 siblings, 0 replies; 48+ messages in thread
From: Keith Busch @ 2018-02-05 15:13 UTC (permalink / raw)
  To: jianchao.wang; +Cc: axboe, linux-kernel, hch, linux-nvme, sagi

On Mon, Feb 05, 2018 at 10:26:03AM +0800, jianchao.wang wrote:
> > Freezing is not just for shutdown. It's also used so
> > blk_mq_update_nr_hw_queues will work if the queue count changes across
> > resets.
> blk_mq_update_nr_hw_queues will freeze the queue itself. Please refer to.
> static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,

Yes, but how many requests are you letting enter to their demise by
freezing on the wrong side of the reset? We don't use the nuclear option
out of convenience.

^ permalink raw reply	[flat|nested] 48+ messages in thread


* Re: [PATCH 2/6] nvme-pci: fix the freeze and quiesce for shutdown and reset case
  2018-02-05 15:13         ` Keith Busch
@ 2018-02-06  1:46           ` jianchao.wang
  -1 siblings, 0 replies; 48+ messages in thread
From: jianchao.wang @ 2018-02-06  1:46 UTC (permalink / raw)
  To: Keith Busch; +Cc: axboe, linux-kernel, hch, linux-nvme, sagi

Hi Keith

Thanks for your kind response.

On 02/05/2018 11:13 PM, Keith Busch wrote:
>  but how many requests are you letting enter to their demise by
> freezing on the wrong side of the reset?

There are only two differences between this patch and the original one:
1. Don't freeze the queue for the reset case. At this point, the outstanding requests will be requeued back to the blk-mq queues.
   Requests that newly enter during the reset will also stay in the blk-mq queues. None of these requests will enter the nvme driver layer,
   because the request_queues are quiesced. They will be issued after the reset completes successfully.
2. Drain the request queue before nvme_dev_disable. This is nearly the same as the previous behavior, which also unquiesces the queue
   and lets the requests be drained. The only difference is that this patch invokes wait_freeze in nvme_dev_disable instead
   of nvme_reset_work.

We don't sacrifice any request. This patch does the same thing as the previous one and makes things clearer.
:)


Thanks
Jianchao

^ permalink raw reply	[flat|nested] 48+ messages in thread


* Re: [PATCH 2/6] nvme-pci: fix the freeze and quiesce for shutdown and reset case
  2018-02-06  1:46           ` jianchao.wang
@ 2018-02-06 15:13             ` Keith Busch
  -1 siblings, 0 replies; 48+ messages in thread
From: Keith Busch @ 2018-02-06 15:13 UTC (permalink / raw)
  To: jianchao.wang; +Cc: axboe, linux-kernel, hch, linux-nvme, sagi

On Tue, Feb 06, 2018 at 09:46:36AM +0800, jianchao.wang wrote:
> Hi Keith
> 
> Thanks for your kindly response.
> 
> On 02/05/2018 11:13 PM, Keith Busch wrote:
> >  but how many requests are you letting enter to their demise by
> > freezing on the wrong side of the reset?
> 
> There are only two differences between this patch and the original one:
> 1. Don't freeze the queue for the reset case. At this point, the outstanding requests will be requeued back to the blk-mq queues.
>    Requests that newly enter during the reset will also stay in the blk-mq queues. None of these requests will enter the nvme driver layer,
>    because the request_queues are quiesced. They will be issued after the reset completes successfully.
> 2. Drain the request queue before nvme_dev_disable. This is nearly the same as the previous behavior, which also unquiesces the queue
>    and lets the requests be drained. The only difference is that this patch invokes wait_freeze in nvme_dev_disable instead
>    of nvme_reset_work.
> 
> We don't sacrifice any request. This patch does the same thing as the previous one and makes things clearer.

No, what you're proposing is quite different.

By "enter", I'm referring to blk_queue_enter. Once a request enters
into an hctx, it can not be backed out to re-enter a new hctx if the
original one is invalidated.

Prior to a reset, all requests that have entered the queue are committed
to that hctx, and we can't do anything about that. The only thing we can
do is prevent new requests from entering until we're sure that hctx is
valid on the other side of the reset.

^ permalink raw reply	[flat|nested] 48+ messages in thread


* Re: [PATCH 2/6] nvme-pci: fix the freeze and quiesce for shutdown and reset case
  2018-02-06 15:13             ` Keith Busch
@ 2018-02-07  2:03               ` jianchao.wang
  -1 siblings, 0 replies; 48+ messages in thread
From: jianchao.wang @ 2018-02-07  2:03 UTC (permalink / raw)
  To: Keith Busch; +Cc: axboe, linux-kernel, hch, linux-nvme, sagi

Hi Keith

Thanks for your time and kind response on this.

On 02/06/2018 11:13 PM, Keith Busch wrote:
> On Tue, Feb 06, 2018 at 09:46:36AM +0800, jianchao.wang wrote:
>> Hi Keith
>>
>> Thanks for your kindly response.
>>
>> On 02/05/2018 11:13 PM, Keith Busch wrote:
>>>  but how many requests are you letting enter to their demise by
>>> freezing on the wrong side of the reset?
>>
>> There are only two differences between this patch and the original one:
>> 1. Don't freeze the queue for the reset case. At this point, the outstanding requests will be requeued back to the blk-mq queues.
>>    Requests that newly enter during the reset will also stay in the blk-mq queues. None of these requests will enter the nvme driver layer,
>>    because the request_queues are quiesced. They will be issued after the reset completes successfully.
>> 2. Drain the request queue before nvme_dev_disable. This is nearly the same as the previous behavior, which also unquiesces the queue
>>    and lets the requests be drained. The only difference is that this patch invokes wait_freeze in nvme_dev_disable instead
>>    of nvme_reset_work.
>>
>> We don't sacrifice any request. This patch does the same thing as the previous one and makes things clearer.
> 
> No, what you're proposing is quite different.
> 
> By "enter", I'm referring to blk_queue_enter. 
When a request is allocated, it holds a reference on request_queue->q_usage_counter until it is freed.
Please refer to:
blk_mq_get_request -> blk_queue_enter_live
blk_mq_free_request -> blk_exit_queue

Regarding 'a request enters into an hctx', I cannot get the point.
I think you mean it enters the nvme driver layer.

> Once a request enters
> into an hctx, it can not be backed out to re-enter a new hctx if the
> original one is invalidated.

I also cannot get the point here. We certainly will not issue a request which
has already been issued to another hctx.
What this patch, and also the original one, does is disable/shutdown the controller,
then cancel and requeue, or fail, the outstanding requests.

The requeue mechanism ensures the requests are inserted into the ctx that req->mq_ctx->cpu
points to.

> 
> Prior to a reset, all requests that have entered the queue are committed
> to that hctx, 

A request could be on:
- the blk-mq per-cpu ctx->rq_list or an IO scheduler list,
- the hctx->dispatch list, or
- request_queue->requeue_list (it will be re-inserted into the 1st case).

When requests are issued, they are dequeued from the 1st or 2nd case and submitted to the nvme driver layer.
These are the _outstanding_ requests.

When the request queue is quiesced, requests stay on the blk-mq per-cpu ctx->rq_list, IO scheduler list,
or hctx->dispatch list, and cannot be issued to the driver layer any more.
When the request queue is frozen, bios are gated out in generic_make_request, so new requests cannot enter
the blk-mq layer any more, and certainly not the nvme driver layer.

For the reset case, the nvme controller will be back soon, so we needn't freeze the queue; quiescing is enough.
The outstanding requests are canceled and _requeued_ to request_queue->requeue_list, then inserted into the
blk-mq layer again by requeue_work. When reset_work completes and starts the queues again, all the requests will be
issued again. :)

For the shutdown case, freezing plus quiescing is safer. We also wait for the requests to complete if the controller is still
alive. If it is dead, we need to fail them directly instead of requeueing them; otherwise an IO hang will come up, because the controller
will be offline for some time.
  

> and we can't do anything about that. The only thing we can
> do is prevent new requests from entering until we're sure that hctx is
> valid on the other side of the reset.
> 
Yes, that's what this patch does.

Let me add some explanation here about the 4th patch, nvme-pci: break up nvme_timeout and nvme_dev_disable.
Also, thanks for taking the time to look into this. That's really appreciated!

The key point is blk_abort_request. It forces the request to expire and then handles it
in the timeout work context. This is safe against racing with the irq completion path, which is the most important reason
to use blk_abort_request.
We don't _complete_ the request or _rearm_ the timer; we just set a CANCELED flag, so the request will not be freed.
These requests then cannot be taken away by the irq completion path, and the timeout path is avoided as well (there are no outstanding requests any more).
That is why we say 'all the outstanding requests are grabbed'. Once the controller is closed completely, we can complete
these requests safely. This is the core idea of the 4th patch.

Many thanks
Jianchao
 

^ permalink raw reply	[flat|nested] 48+ messages in thread


* Re: [PATCH 2/6] nvme-pci: fix the freeze and quiesce for shutdown and reset case
  2018-02-07  2:03               ` jianchao.wang
@ 2018-02-07  2:13                 ` jianchao.wang
  -1 siblings, 0 replies; 48+ messages in thread
From: jianchao.wang @ 2018-02-07  2:13 UTC (permalink / raw)
  To: Keith Busch; +Cc: axboe, linux-kernel, hch, linux-nvme, sagi

Hi Keith

Sorry for bothering you again.

On 02/07/2018 10:03 AM, jianchao.wang wrote:
> Hi Keith
> 
> Thanks for your time and kindly response on this.
> 
> On 02/06/2018 11:13 PM, Keith Busch wrote:
>> On Tue, Feb 06, 2018 at 09:46:36AM +0800, jianchao.wang wrote:
>>> Hi Keith
>>>
>>> Thanks for your kindly response.
>>>
>>> On 02/05/2018 11:13 PM, Keith Busch wrote:
>>>>  but how many requests are you letting enter to their demise by
>>>> freezing on the wrong side of the reset?
>>>
>>> There are only two differences between this patch and the original one:
>>> 1. Don't freeze the queue for the reset case. At this point, the outstanding requests will be requeued back to the blk-mq queues.
>>>    Requests that newly enter during the reset will also stay in the blk-mq queues. None of these requests will enter the nvme driver layer,
>>>    because the request_queues are quiesced. They will be issued after the reset completes successfully.
>>> 2. Drain the request queue before nvme_dev_disable. This is nearly the same as the previous behavior, which also unquiesces the queue
>>>    and lets the requests be drained. The only difference is that this patch invokes wait_freeze in nvme_dev_disable instead
>>>    of nvme_reset_work.
>>>
>>> We don't sacrifice any request. This patch does the same thing as the previous one and makes things clearer.
>>
>> No, what you're proposing is quite different.

What's the difference? Could you please point it out?
I have shared my understanding below,
but honestly I don't see the difference you mean.
Or are you referring to the 4th patch? If so, I also explain that below.

I really appreciate you taking the time to explain this. :)
Many thanks
Jianchao

>>
>> By "enter", I'm referring to blk_queue_enter. 
> When a request is allocated, it holds a reference on request_queue->q_usage_counter until it is freed.
> Please refer to:
> blk_mq_get_request -> blk_queue_enter_live
> blk_mq_free_request -> blk_exit_queue
> 
> Regarding 'a request enters into an hctx', I cannot get the point.
> I think you mean it enters the nvme driver layer.
> 
>> Once a request enters
>> into an hctx, it can not be backed out to re-enter a new hctx if the
>> original one is invalidated.
> 
> I also cannot get the point here. We certainly will not issue a request which
> has already been issued to another hctx.
> What this patch, and also the original one, does is disable/shutdown the controller,
> then cancel and requeue, or fail, the outstanding requests.
> 
> The requeue mechanism ensures the requests are inserted into the ctx that req->mq_ctx->cpu
> points to.
> 
>>
>> Prior to a reset, all requests that have entered the queue are committed
>> to that hctx, 
> 
> A request could be on:
> - the blk-mq per-cpu ctx->rq_list or an IO scheduler list,
> - the hctx->dispatch list, or
> - request_queue->requeue_list (it will be re-inserted into the 1st case).
> 
> When requests are issued, they are dequeued from the 1st or 2nd case and submitted to the nvme driver layer.
> These are the _outstanding_ requests.
> 
> When the request queue is quiesced, requests stay on the blk-mq per-cpu ctx->rq_list, IO scheduler list,
> or hctx->dispatch list, and cannot be issued to the driver layer any more.
> When the request queue is frozen, bios are gated out in generic_make_request, so new requests cannot enter
> the blk-mq layer any more, and certainly not the nvme driver layer.
> 
> For the reset case, the nvme controller will be back soon, so we needn't freeze the queue; quiescing is enough.
> The outstanding requests are canceled and _requeued_ to request_queue->requeue_list, then inserted into the
> blk-mq layer again by requeue_work. When reset_work completes and starts the queues again, all the requests will be
> issued again. :)
> 
> For the shutdown case, freezing plus quiescing is safer. We also wait for the requests to complete if the controller is still
> alive. If it is dead, we need to fail them directly instead of requeueing them; otherwise an IO hang will come up, because the controller
> will be offline for some time.
>   
> 
>> and we can't do anything about that. The only thing we can
>> do is prevent new requests from entering until we're sure that hctx is
>> valid on the other side of the reset.
>>
> Yes, that's what this patch does.
> 
> Let me add some explanation here about the 4th patch, nvme-pci: break up nvme_timeout and nvme_dev_disable.
> Also, thanks for taking the time to look into this. That's really appreciated!
> 
> The key point is blk_abort_request. It forces the request to expire and then handles it
> in the timeout work context. This is safe against racing with the irq completion path, which is the most important reason
> to use blk_abort_request.
> We don't _complete_ the request or _rearm_ the timer; we just set a CANCELED flag, so the request will not be freed.
> These requests then cannot be taken away by the irq completion path, and the timeout path is avoided as well (there are no outstanding requests any more).
> That is why we say 'all the outstanding requests are grabbed'. Once the controller is closed completely, we can complete
> these requests safely. This is the core idea of the 4th patch.
> 
> Many thanks
> Jianchao
>  
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 2/6] nvme-pci: fix the freeze and quiesce for shutdown and reset case
@ 2018-02-07  2:13                 ` jianchao.wang
  0 siblings, 0 replies; 48+ messages in thread
From: jianchao.wang @ 2018-02-07  2:13 UTC (permalink / raw)


Hi Keith

Sorry for bothering you again.

On 02/07/2018 10:03 AM, jianchao.wang wrote:
> Hi Keith
> 
> Thanks for your time and kindly response on this.
> 
> On 02/06/2018 11:13 PM, Keith Busch wrote:
>> On Tue, Feb 06, 2018@09:46:36AM +0800, jianchao.wang wrote:
>>> Hi Keith
>>>
>>> Thanks for your kindly response.
>>>
>>> On 02/05/2018 11:13 PM, Keith Busch wrote:
>>>>  but how many requests are you letting enter to their demise by
>>>> freezing on the wrong side of the reset?
>>>
>>> There are only two difference with this patch from the original one.
>>> 1. Don't freeze the queue for the reset case. At the moment, the outstanding requests will be requeued back to blk-mq queues.
>>>    The new entered requests during reset will also stay in blk-mq queues. All this requests will not enter into nvme driver layer
>>>    due to quiescent request_queues. And they will be issued after the reset is completed successfully.
>>> 2. Drain the request queue before nvme_dev_disable. This is nearly same with the previous rule which will also unquiesce the queue
>>>    and let the requests be able to be drained. The only difference is this patch will invoke wait_freeze in nvme_dev_disable instead
>>>    of nvme_reset_work.
>>>
>>> We don't sacrifice any request. This patch do the same thing with the previous one and make things clearer.
>>
>> No, what you're proposing is quite different.

What's the difference ? Can you please point out.
I have shared my understanding below.
But actually, I don't get the point what's the difference you said.
Or what you refer to is about the 4th patch ? If yes, I also explain this below.

Really appreciate your precious time to explain this. :)
Many thanks
Jianchao

>>
>> By "enter", I'm referring to blk_queue_enter. 
> When a request is allocated, it holds a reference on request_queue->q_usage_counter until it is freed.
> Please refer to 
> blk_mq_get_request -> blk_queue_enter_live
> blk_mq_free_request -> blk_queue_exit
> 
> Regarding 'request enters into an hctx', I cannot get the point.
> I think you mean it enters into the nvme driver layer.
> 
>> Once a request enters
>> into an hctx, it can not be backed out to re-enter a new hctx if the
>> original one is invalidated.
> 
> I also cannot get the point here. We certainly will not issue a request which
> has already been issued to another hctx.
> What this patch and also the original one do is disable/shutdown the controller,
> then cancel and requeue or fail the outstanding requests.
> 
> The requeue mechanism will ensure the requests to be inserted to the ctx where req->mq_ctx->cpu
> points to. 
> 
>>
>> Prior to a reset, all requests that have entered the queue are committed
>> to that hctx, 
> 
> A request could be on
> - the blk-mq per-cpu ctx->rq_list or IO scheduler list,
> - the hctx->dispatch list, or
> - request_queue->requeue_list (will be inserted into the 1st case again).
> 
> When requests are issued, they will be dequeued from 1st or 2nd case and submitted to nvme driver layer.
> These requests are _outstanding_ ones.
> 
> When the request queue is quiesced, the requests will stay in the blk-mq per-cpu ctx->rq_list, IO scheduler list
> or hctx->dispatch list, and cannot be issued to the driver layer any more.
> When the request queue is frozen, it will gate the bios out in generic_make_request, so new requests cannot enter the
> blk-mq layer any more, and certainly not the nvme driver layer.
> 
> For the reset case, the nvme controller will be back soon, so we needn't freeze the queue; just quiescing is enough.
> The outstanding ones will be canceled and _requeued_ to request_queue->requeue_list, then they will be inserted into the
> blk-mq layer again by requeue_work. When reset_work completes and starts the queues again, all the requests will be
> issued again. :)
> 
> For the shutdown case, freezing and quiescing is safer. Also, we will wait for them to be completed if the controller is still
> alive. If it is dead, we need to fail them directly instead of requeuing them; otherwise IO hangs will come up, because the controller
> will be offline for some time.
>   
> 
>> and we can't do anything about that. The only thing we can
>> do is prevent new requests from entering until we're sure that hctx is
>> valid on the other side of the reset.
>>
> Yes, that is what this patch does.
> 
> Add some explaining about the 4th patch nvme-pci: break up nvme_timeout and nvme_dev_disable here.
> Also thanks for your time to look into this. That's really appreciated!
> 
> The key point is blk_abort_request. It will force the request to be expired and then handle the request
> in the timeout work context. It is safe against racing with the irq completion path. This is the most important reason
> to use blk_abort_request.
> We don't _complete_ the request or _rearm_ the timer, but just set a CANCELED flag, so the request will not be freed.
> Then these requests cannot be taken away by the irq completion path, and the timeout path is also avoided (no outstanding requests any more).
> So we say 'all the outstanding requests are grabbed'. When we close the controller totally, we can complete
> these requests safely. This is the core idea of the 4th patch.
> 
> Many thanks
> Jianchao
>  
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/6] nvme-pci: fix the freeze and quiesce for shutdown and reset case
  2018-02-07  2:13                 ` jianchao.wang
@ 2018-02-07 16:13                   ` Keith Busch
  -1 siblings, 0 replies; 48+ messages in thread
From: Keith Busch @ 2018-02-07 16:13 UTC (permalink / raw)
  To: jianchao.wang; +Cc: axboe, linux-kernel, hch, linux-nvme, sagi

On Wed, Feb 07, 2018 at 10:13:51AM +0800, jianchao.wang wrote:
> What's the difference ? Can you please point out.
> I have shared my understanding below.
> But actually, I don't get the point what's the difference you said.

It sounds like you have all the pieces. Just keep this in mind: we don't
want to fail IO if we can prevent it.

A request is allocated from an hctx pool of tags. Once the request is
allocated, it is permanently tied to that hctx because that's where its
tag came from. If that hctx becomes invalid, the request has to be ended
with an error, and we can't do anything about that[*].

Prior to a reset, we currently halt new requests from being allocated by
freezing the request queues. We unfreeze the queues after the new state
of the hctx's is established. This way all IO requests that were gating
on the unfreeze are guaranteed to enter into a valid context.

You are proposing to skip freeze on a reset. New requests will then be
allocated before we've established the hctx map. Any request allocated
will have to be terminated in failure if the hctx is no longer valid
once the reset completes.

Yes, it's entirely possible today a request allocated prior to the reset
may need to be terminated after the reset. There's nothing we can do
about those except end them in failure, but we can prevent new ones from
sharing the same fate. You are removing that prevention, and that's what
I am complaining about.

 * Future consideration: we recently obtained a way to "steal" bios that
looks like it may be used to back out certain types of requests and let
the bio create a new request.



* Re: [PATCH 2/6] nvme-pci: fix the freeze and quiesce for shutdown and reset case
  2018-02-07 16:13                   ` Keith Busch
@ 2018-02-08  1:40                     ` jianchao.wang
  -1 siblings, 0 replies; 48+ messages in thread
From: jianchao.wang @ 2018-02-08  1:40 UTC (permalink / raw)
  To: Keith Busch; +Cc: axboe, linux-kernel, hch, linux-nvme, sagi

Hi Keith


Really, thanks for your precious time and kind guidance.
That's really appreciated. :)

On 02/08/2018 12:13 AM, Keith Busch wrote:
> On Wed, Feb 07, 2018 at 10:13:51AM +0800, jianchao.wang wrote:
>> What's the difference ? Can you please point out.
>> I have shared my understanding below.
>> But actually, I don't get the point what's the difference you said.
> 
> It sounds like you have all the pieces. Just keep this in mind: we don't
> want to fail IO if we can prevent it.
> 
Yes, absolutely.

> A request is allocated from an hctx pool of tags. Once the request is
> allocated, it is permanently tied to that hctx because that's where its
> tag came from. If that hctx becomes invalid, the request has to be ended
> with an error, and we can't do anything about that[*].
> 
> Prior to a reset, we currently halt new requests from being allocated by
> freezing the request queues. We unfreeze the queues after the new state
> of the hctx's is established. This way all IO requests that were gating
> on the unfreeze are guaranteed to enter into a valid context.
> 
> You are proposing to skip freeze on a reset. New requests will then be
> allocated before we've established the hctx map. Any request allocated
> will have to be terminated in failure if the hctx is no longer valid
> once the reset completes.
Yes, if any previous hctx doesn't come back, the requests on that hctx
will be drained with BLK_STS_IOERR.
__blk_mq_update_nr_hw_queues
  -> blk_mq_freeze_queue
    -> blk_freeze_queue
      -> blk_mq_freeze_queue_wait 
But the nvmeq's cq_vector is -1.
 
> Yes, it's entirely possible today a request allocated prior to the reset
> may need to be terminated after the reset. There's nothing we can do
> about those except end them in failure, but we can prevent new ones from
> sharing the same fate. You are removing that prevention, and that's what
> I am complaining about.

Thanks again for your precious time to detail this.
So what I got is that your concern is that this patch doesn't freeze the queue for the reset case
any more. New requests may then enter, and they will be failed when the associated
hctx doesn't come back during the reset procedure. This should be avoided.

I will change this in the next V3 version.


>  * Future consideration: we recently obtained a way to "steal" bios that
> looks like it may be used to back out certain types of requests and let
> the bio create a new one.
> 
Yeah, that would be a great way to reduce the loss when an hctx is gone.

Sincerely
Jianchao



* Re: [PATCH 2/6] nvme-pci: fix the freeze and quiesce for shutdown and reset case
  2018-02-08  1:40                     ` jianchao.wang
@ 2018-02-08 14:17                       ` jianchao.wang
  -1 siblings, 0 replies; 48+ messages in thread
From: jianchao.wang @ 2018-02-08 14:17 UTC (permalink / raw)
  To: Keith Busch; +Cc: axboe, sagi, linux-kernel, linux-nvme, hch

Hi Keith

Sorry for bothering you again. ;)

There is a dangerous scenario caused by nvme_wait_freeze in nvme_reset_work.
Please consider it.

nvme_reset_work
  -> nvme_start_queues
  -> nvme_wait_freeze

If the controller gives no response, we have to rely on the timeout path.
There are issues below:
nvme_dev_disable needs to be invoked.
nvme_dev_disable will quiesce the queues, and cancel and requeue the outstanding requests.
nvme_reset_work will hang at nvme_wait_freeze.

If we set NVME_REQ_CANCELLED and return BLK_EH_HANDLED as in the RESETTING case,
nvme_reset_work will hang forever, because no one could complete the entered requests.

If we invoke nvme_reset_ctrl after modifying the state machine to be able to change from RESETTING
to RECONNECTING and queue reset_work, we still cannot move things forward, because the reset_work
is being executed.

If we use nvme_wait_freeze_timeout in nvme_reset_work, we could unfreeze and return when it expires. But the
timeout value is tricky.

Maybe we could use blk_set_preempt_only, which is also gated on in blk_queue_enter.
We needn't drain the queue for it; it is lightweight.
And nvme needn't worry about a full queue preventing admin requests from being submitted.

Looking forward to your advice.

Sincerely
Jianchao 


On 02/08/2018 09:40 AM, jianchao.wang wrote:
> Hi Keith
> 
> 
> Really, thanks for your precious time and kind guidance.
> That's really appreciated. :)
> 
> On 02/08/2018 12:13 AM, Keith Busch wrote:
>> On Wed, Feb 07, 2018 at 10:13:51AM +0800, jianchao.wang wrote:
>>> What's the difference ? Can you please point out.
>>> I have shared my understanding below.
>>> But actually, I don't get the point what's the difference you said.
>>
>> It sounds like you have all the pieces. Just keep this in mind: we don't
>> want to fail IO if we can prevent it.
>>
> Yes, absolutely.
> 
>> A request is allocated from an hctx pool of tags. Once the request is
>> allocated, it is permently tied to that hctx because that's where its
>> tag came from. If that hctx becomes invalid, the request has to be ended
>> with an error, and we can't do anything about that[*].
>>
>> Prior to a reset, we currently halt new requests from being allocated by
>> freezing the request queues. We unfreeze the queues after the new state
>> of the hctx's is established. This way all IO requests that were gating
>> on the unfreeze are guaranteed to enter into a valid context.
>>
>> You are proposing to skip freeze on a reset. New requests will then be
>> allocated before we've established the hctx map. Any request allocated
>> will have to be terminated in failure if the hctx is no longer valid
>> once the reset completes.
> Yes, if any previous hctx doesn't come back, the requests on that hctx
> will be drained with BLK_STS_IOERR.
> __blk_mq_update_nr_hw_queues
>   -> blk_mq_freeze_queue
>     -> blk_freeze_queue
>       -> blk_mq_freeze_queue_wait 
> But the nvmeq's cq_vector is -1.
>  
>> Yes, it's entirely possible today a request allocated prior to the reset
>> may need to be terminated after the reset. There's nothing we can do
>> about those except end them in failure, but we can prevent new ones from
>> sharing the same fate. You are removing that prevention, and that's what
>> I am complaining about.
> 
> Thanks again for your precious time to detail this.
> So what I got is that your concern is that this patch doesn't freeze the queue for the reset case
> any more. New requests may then enter, and they will be failed when the associated
> hctx doesn't come back during the reset procedure. This should be avoided.
> 
> I will change this in the next V3 version.
> 
> 
>>  * Future consideration: we recently obtained a way to "steal" bios that
>> looks like it may be used to back out certain types of requests and let
>> the bio create a new one.
>>
> Yeah, that will be a great idea to reduce the loss when hctx is gone.
> 
> Sincerely
> Jianchao
> 



* Re: [PATCH 2/6] nvme-pci: fix the freeze and quiesce for shutdown and reset case
  2018-02-08 14:17                       ` jianchao.wang
@ 2018-02-08 15:15                         ` Keith Busch
  -1 siblings, 0 replies; 48+ messages in thread
From: Keith Busch @ 2018-02-08 15:15 UTC (permalink / raw)
  To: jianchao.wang; +Cc: axboe, sagi, linux-kernel, linux-nvme, hch

On Thu, Feb 08, 2018 at 10:17:00PM +0800, jianchao.wang wrote:
> There is a dangerous scenario which caused by nvme_wait_freeze in nvme_reset_work.
> please consider it.
> 
> nvme_reset_work
>   -> nvme_start_queues
>   -> nvme_wait_freeze
> 
> if the controller no response, we have to rely on the timeout path.
> there are issues below:
> nvme_dev_disable need to be invoked.
> nvme_dev_disable will quiesce queues, cancel and requeue and outstanding requests.
> nvme_reset_work will hang at nvme_wait_freeze

We used to not requeue timed out commands, so that wasn't a problem
before. Oh well, I'll take a look.



* Re: [PATCH 2/6] nvme-pci: fix the freeze and quiesce for shutdown and reset case
  2018-02-08 15:15                         ` Keith Busch
@ 2018-02-09  1:41                           ` jianchao.wang
  -1 siblings, 0 replies; 48+ messages in thread
From: jianchao.wang @ 2018-02-09  1:41 UTC (permalink / raw)
  To: Keith Busch; +Cc: axboe, sagi, linux-kernel, linux-nvme, hch

Hi Keith

Thanks for your precious time and kindly response.

On 02/08/2018 11:15 PM, Keith Busch wrote:
> On Thu, Feb 08, 2018 at 10:17:00PM +0800, jianchao.wang wrote:
>> There is a dangerous scenario which caused by nvme_wait_freeze in nvme_reset_work.
>> please consider it.
>>
>> nvme_reset_work
>>   -> nvme_start_queues
>>   -> nvme_wait_freeze
>>
>> if the controller no response, we have to rely on the timeout path.
>> there are issues below:
>> nvme_dev_disable need to be invoked.
>> nvme_dev_disable will quiesce queues, cancel and requeue and outstanding requests.
>> nvme_reset_work will hang at nvme_wait_freeze
> 
> We used to not requeue timed out commands, so that wasn't a problem
> before. Oh well, I'll take a look.
> 
Yes, we indeed don't requeue the timed-out commands, but nvme_dev_disable will requeue the other
outstanding requests and quiesce the request queues; this will block nvme_reset_work->nvme_wait_freeze
from moving forward.

As I shared in my last email, can we use (or abuse?) blk_set_preempt_only to gate the new bios in generic_make_request?
Freezing queues is good, but wait_freeze in reset_work is a devil.

Many thanks
Jianchao



* [PATCH 2/6] nvme-pci: fix the freeze and quiesce for shutdown and reset case
  2018-02-02  6:54 Jianchao Wang
@ 2018-02-02  6:54   ` Jianchao Wang
  0 siblings, 0 replies; 48+ messages in thread
From: Jianchao Wang @ 2018-02-02  6:54 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

Currently, the request queues will be frozen and quiesced for both the reset
and shutdown cases. This will trigger ioq requests in RECONNECTING
state, which should be avoided to prepare for the following patch.
Just freeze the request queues for the shutdown case and drain all the residual
entered requests after the controller has been shut down.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/pci.c | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 00cffed..a7fa397 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2172,21 +2172,23 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	if (pci_is_enabled(pdev)) {
 		u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
-		if (dev->ctrl.state == NVME_CTRL_LIVE ||
-		    dev->ctrl.state == NVME_CTRL_RESETTING)
-			nvme_start_freeze(&dev->ctrl);
 		dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
 			pdev->error_state  != pci_channel_io_normal);
 	}
 
-	/*
-	 * Give the controller a chance to complete all entered requests if
-	 * doing a safe shutdown.
-	 */
-	if (!dead) {
-		if (shutdown)
+	/* Just freeze the queue for shutdown case */
+	if (shutdown) {
+		if (dev->ctrl.state == NVME_CTRL_LIVE ||
+			dev->ctrl.state == NVME_CTRL_RESETTING)
+			nvme_start_freeze(&dev->ctrl);
+		/*
+		 * Give the controller a chance to complete all
+		 * entered requests if doing a safe shutdown.
+		 */
+		if (!dead)
 			nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
 	}
+
 	nvme_stop_queues(&dev->ctrl);
 
 	if (!dead) {
@@ -2210,12 +2212,15 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
 
 	/*
-	 * The driver will not be starting up queues again if shutting down so
-	 * must flush all entered requests to their failed completion to avoid
-	 * deadlocking blk-mq hot-cpu notifier.
+	 * For the shutdown case, the controller will not be set up again soon. If there
+	 * are any residual requests here, the controller must have gone wrong. Drain
+	 * and fail all the residual entered IO requests.
 	 */
-	if (shutdown)
+	if (shutdown) {
 		nvme_start_queues(&dev->ctrl);
+		nvme_wait_freeze(&dev->ctrl);
+		nvme_stop_queues(&dev->ctrl);
+	}
 	mutex_unlock(&dev->shutdown_lock);
 }
 
@@ -2349,12 +2354,11 @@ static void nvme_reset_work(struct work_struct *work)
 		nvme_remove_namespaces(&dev->ctrl);
 		new_state = NVME_CTRL_ADMIN_ONLY;
 	} else {
-		nvme_start_queues(&dev->ctrl);
-		nvme_wait_freeze(&dev->ctrl);
 		/* hit this only when allocate tagset fails */
 		if (nvme_dev_add(dev))
 			new_state = NVME_CTRL_ADMIN_ONLY;
-		nvme_unfreeze(&dev->ctrl);
+		if (was_suspend)
+			nvme_unfreeze(&dev->ctrl);
 	}
 
 	/*
-- 
2.7.4


* [PATCH 2/6] nvme-pci: fix the freeze and quiesce for shutdown and reset case
@ 2018-02-02  6:54   ` Jianchao Wang
  0 siblings, 0 replies; 48+ messages in thread
From: Jianchao Wang @ 2018-02-02  6:54 UTC (permalink / raw)


Currently, request queue will be frozen and quiesced for both reset
and shutdown case. This will trigger ioq requests in RECONNECTING
state which should be avoided to prepare for following patch.
Just freeze request queue for shutdown case and drain all the resudual
entered requests after controller has been shutdown.

Signed-off-by: Jianchao Wang <jianchao.w.wang at oracle.com>
---
 drivers/nvme/host/pci.c | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 00cffed..a7fa397 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2172,21 +2172,23 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	if (pci_is_enabled(pdev)) {
 		u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
-		if (dev->ctrl.state == NVME_CTRL_LIVE ||
-		    dev->ctrl.state == NVME_CTRL_RESETTING)
-			nvme_start_freeze(&dev->ctrl);
 		dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
 			pdev->error_state  != pci_channel_io_normal);
 	}
 
-	/*
-	 * Give the controller a chance to complete all entered requests if
-	 * doing a safe shutdown.
-	 */
-	if (!dead) {
-		if (shutdown)
+	/* Only freeze the queues in the shutdown case */
+	if (shutdown) {
+		if (dev->ctrl.state == NVME_CTRL_LIVE ||
+			dev->ctrl.state == NVME_CTRL_RESETTING)
+			nvme_start_freeze(&dev->ctrl);
+		/*
+		 * Give the controller a chance to complete all
+		 * entered requests if doing a safe shutdown.
+		 */
+		if (!dead)
 			nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
 	}
+
 	nvme_stop_queues(&dev->ctrl);
 
 	if (!dead) {
@@ -2210,12 +2212,15 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
 
 	/*
-	 * The driver will not be starting up queues again if shutting down so
-	 * must flush all entered requests to their failed completion to avoid
-	 * deadlocking blk-mq hot-cpu notifier.
+	 * In the shutdown case, the controller will not be set up again soon.
+	 * If any residual requests remain, the controller must have gone
+	 * wrong. Drain and fail all residual entered I/O requests.
 	 */
-	if (shutdown)
+	if (shutdown) {
 		nvme_start_queues(&dev->ctrl);
+		nvme_wait_freeze(&dev->ctrl);
+		nvme_stop_queues(&dev->ctrl);
+	}
 	mutex_unlock(&dev->shutdown_lock);
 }
 
@@ -2349,12 +2354,11 @@ static void nvme_reset_work(struct work_struct *work)
 		nvme_remove_namespaces(&dev->ctrl);
 		new_state = NVME_CTRL_ADMIN_ONLY;
 	} else {
-		nvme_start_queues(&dev->ctrl);
-		nvme_wait_freeze(&dev->ctrl);
 		/* hit this only when allocate tagset fails */
 		if (nvme_dev_add(dev))
 			new_state = NVME_CTRL_ADMIN_ONLY;
-		nvme_unfreeze(&dev->ctrl);
+		if (was_suspend)
+			nvme_unfreeze(&dev->ctrl);
 	}
 
 	/*
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2018-02-09  1:43 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-02  7:00 [PATCH 0/6]nvme-pci: fixes on nvme_timeout and nvme_dev_disable Jianchao Wang
2018-02-02  7:00 ` Jianchao Wang
2018-02-02  7:00 ` [PATCH 1/6] nvme-pci: move clearing host mem behind stopping queues Jianchao Wang
2018-02-02  7:00   ` Jianchao Wang
2018-02-02 18:46   ` Keith Busch
2018-02-02 18:46     ` Keith Busch
2018-02-05  2:30     ` jianchao.wang
2018-02-05  2:30       ` jianchao.wang
2018-02-02  7:00 ` [PATCH 2/6] nvme-pci: fix the freeze and quiesce for shutdown and reset case Jianchao Wang
2018-02-02  7:00   ` Jianchao Wang
2018-02-02 18:24   ` Keith Busch
2018-02-02 18:24     ` Keith Busch
2018-02-05  2:26     ` jianchao.wang
2018-02-05  2:26       ` jianchao.wang
2018-02-05 15:13       ` Keith Busch
2018-02-05 15:13         ` Keith Busch
2018-02-06  1:46         ` jianchao.wang
2018-02-06  1:46           ` jianchao.wang
2018-02-06 15:13           ` Keith Busch
2018-02-06 15:13             ` Keith Busch
2018-02-07  2:03             ` jianchao.wang
2018-02-07  2:03               ` jianchao.wang
2018-02-07  2:13               ` jianchao.wang
2018-02-07  2:13                 ` jianchao.wang
2018-02-07 16:13                 ` Keith Busch
2018-02-07 16:13                   ` Keith Busch
2018-02-08  1:40                   ` jianchao.wang
2018-02-08  1:40                     ` jianchao.wang
2018-02-08 14:17                     ` jianchao.wang
2018-02-08 14:17                       ` jianchao.wang
2018-02-08 15:15                       ` Keith Busch
2018-02-08 15:15                         ` Keith Busch
2018-02-09  1:41                         ` jianchao.wang
2018-02-09  1:41                           ` jianchao.wang
2018-02-02  7:00 ` [PATCH 3/6] blk-mq: make blk_mq_rq_update_aborted_gstate a external interface Jianchao Wang
2018-02-02  7:00   ` Jianchao Wang
2018-02-02  7:00 ` [PATCH 4/6] nvme-pci: break up nvme_timeout and nvme_dev_disable Jianchao Wang
2018-02-02  7:00   ` Jianchao Wang
2018-02-02 18:31   ` Keith Busch
2018-02-02 18:31     ` Keith Busch
2018-02-05  2:22     ` jianchao.wang
2018-02-05  2:22       ` jianchao.wang
2018-02-02  7:00 ` [PATCH 5/6] nvme-pci: discard wait timeout when delete cq/sq Jianchao Wang
2018-02-02  7:00   ` Jianchao Wang
2018-02-02  7:00 ` [PATCH 6/6] nvme-pci: suspend queues based on online_queues Jianchao Wang
2018-02-02  7:00   ` Jianchao Wang
  -- strict thread matches above, loose matches on Subject: below --
2018-02-02  6:54 Jianchao Wang
2018-02-02  6:54 ` [PATCH 2/6] nvme-pci: fix the freeze and quiesce for shutdown and reset case Jianchao Wang
2018-02-02  6:54   ` Jianchao Wang
