All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH V2 0/6]nvme-pci: fixes on nvme_timeout and nvme_dev_disable
@ 2018-02-05  9:20 ` Jianchao Wang
  0 siblings, 0 replies; 28+ messages in thread
From: Jianchao Wang @ 2018-02-05  9:20 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

Hi Christoph, Keith and Sagi

Please consider and comment on the following patchset.
That's really appreciated.

There is a complicated relationship between nvme_timeout and nvme_dev_disable.
 - nvme_timeout has to invoke nvme_dev_disable to stop the
   controller doing DMA access before free the request.
 - nvme_dev_disable has to depend on nvme_timeout to complete
   adminq requests to set HMB or delete sq/cq when the controller
   has no response.
 - nvme_dev_disable will race with nvme_timeout when cancels the
   outstanding requests.
We have found some issues introduced by them, please refer the following link

http://lists.infradead.org/pipermail/linux-nvme/2018-January/015053.html 
http://lists.infradead.org/pipermail/linux-nvme/2018-January/015276.html
http://lists.infradead.org/pipermail/linux-nvme/2018-January/015328.html
Even we cannot ensure there is no other issue.

The best way to fix them is to break up the relationship between them.
With this patch, we could avoid nvme_dev_disable to be invoked
by nvme_timeout and eliminate the race between nvme_timeout and
nvme_dev_disable on outstanding requests.

Changes V1->V2:
 - free and disable pci things in nvme_pci_disable_ctrl_directly
 - change comment and add reviewed-by in 1st patch
 - resort patches
 - other misc changes

There are 6 patches:

1st ~ 3th patches does some preparation for the 4th one.
4th fixes a bug found when test.
5th is to avoid nvme_dev_disable to be invoked by nvme_timeout, and implement
the synchronization between them. More details, please refer to the comment of
this patch.
6th fixes a bug after 4th patch is introduced. It let nvme_delete_io_queues can
only be wakeup by completion path.

This patchset was tested under debug patch for some days.
And some bugfix have been done.
The patches are available in following it branch:
https://github.com/jianchwa/linux-blcok.git nvme_fixes_V2

Jianchao Wang (6)
0001-nvme-pci-quiesce-IO-queues-prior-to-disabling-device.patch
0002-nvme-pci-fix-the-freeze-and-quiesce-for-shutdown-and.patch
0003-blk-mq-make-blk_mq_rq_update_aborted_gstate-a-extern.patch
0004-nvme-pci-suspend-queues-based-on-online_queues.patch
0005-nvme-pci-break-up-nvme_timeout-and-nvme_dev_disable.patch
0006-nvme-pci-discard-wait-timeout-when-delete-cq-sq.patch


diff stat:
block/blk-mq.c          |   3 +-
drivers/nvme/host/pci.c | 250 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------
include/linux/blk-mq.h  |   1 +
3 files changed, 188 insertions(+), 66 deletions(-)

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH V2 0/6]nvme-pci: fixes on nvme_timeout and nvme_dev_disable
@ 2018-02-05  9:20 ` Jianchao Wang
  0 siblings, 0 replies; 28+ messages in thread
From: Jianchao Wang @ 2018-02-05  9:20 UTC (permalink / raw)


Hi Christoph, Keith and Sagi

Please consider and comment on the following patchset.
That's really appreciated.

There is a complicated relationship between nvme_timeout and nvme_dev_disable.
 - nvme_timeout has to invoke nvme_dev_disable to stop the
   controller doing DMA access before free the request.
 - nvme_dev_disable has to depend on nvme_timeout to complete
   adminq requests to set HMB or delete sq/cq when the controller
   has no response.
 - nvme_dev_disable will race with nvme_timeout when cancels the
   outstanding requests.
We have found some issues introduced by them, please refer the following link

http://lists.infradead.org/pipermail/linux-nvme/2018-January/015053.html 
http://lists.infradead.org/pipermail/linux-nvme/2018-January/015276.html
http://lists.infradead.org/pipermail/linux-nvme/2018-January/015328.html
Even we cannot ensure there is no other issue.

The best way to fix them is to break up the relationship between them.
With this patch, we could avoid nvme_dev_disable to be invoked
by nvme_timeout and eliminate the race between nvme_timeout and
nvme_dev_disable on outstanding requests.

Changes V1->V2:
 - free and disable pci things in nvme_pci_disable_ctrl_directly
 - change comment and add reviewed-by in 1st patch
 - resort patches
 - other misc changes

There are 6 patches:

1st ~ 3th patches does some preparation for the 4th one.
4th fixes a bug found when test.
5th is to avoid nvme_dev_disable to be invoked by nvme_timeout, and implement
the synchronization between them. More details, please refer to the comment of
this patch.
6th fixes a bug after 4th patch is introduced. It let nvme_delete_io_queues can
only be wakeup by completion path.

This patchset was tested under debug patch for some days.
And some bugfix have been done.
The patches are available in following it branch:
https://github.com/jianchwa/linux-blcok.git nvme_fixes_V2

Jianchao Wang (6)
0001-nvme-pci-quiesce-IO-queues-prior-to-disabling-device.patch
0002-nvme-pci-fix-the-freeze-and-quiesce-for-shutdown-and.patch
0003-blk-mq-make-blk_mq_rq_update_aborted_gstate-a-extern.patch
0004-nvme-pci-suspend-queues-based-on-online_queues.patch
0005-nvme-pci-break-up-nvme_timeout-and-nvme_dev_disable.patch
0006-nvme-pci-discard-wait-timeout-when-delete-cq-sq.patch


diff stat:
block/blk-mq.c          |   3 +-
drivers/nvme/host/pci.c | 250 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------
include/linux/blk-mq.h  |   1 +
3 files changed, 188 insertions(+), 66 deletions(-)

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH V2 1/6] nvme-pci: quiesce IO queues prior to disabling device HMB accesses
  2018-02-05  9:20 ` Jianchao Wang
@ 2018-02-05  9:20   ` Jianchao Wang
  -1 siblings, 0 replies; 28+ messages in thread
From: Jianchao Wang @ 2018-02-05  9:20 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

Quiesce IO queues prior to disabling device HMB accesses. A controller
using HMB may relay on it to efficiently complete IO commands.

Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/pci.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6fe7af0..00cffed 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2186,7 +2186,10 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	if (!dead) {
 		if (shutdown)
 			nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
+	}
+	nvme_stop_queues(&dev->ctrl);
 
+	if (!dead) {
 		/*
 		 * If the controller is still alive tell it to stop using the
 		 * host memory buffer.  In theory the shutdown / reset should
@@ -2195,11 +2198,6 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 		 */
 		if (dev->host_mem_descs)
 			nvme_set_host_mem(dev, 0);
-
-	}
-	nvme_stop_queues(&dev->ctrl);
-
-	if (!dead) {
 		nvme_disable_io_queues(dev);
 		nvme_disable_admin_queue(dev, shutdown);
 	}
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH V2 1/6] nvme-pci: quiesce IO queues prior to disabling device HMB accesses
@ 2018-02-05  9:20   ` Jianchao Wang
  0 siblings, 0 replies; 28+ messages in thread
From: Jianchao Wang @ 2018-02-05  9:20 UTC (permalink / raw)


Quiesce IO queues prior to disabling device HMB accesses. A controller
using HMB may relay on it to efficiently complete IO commands.

Reviewed-by: Keith Busch <keith.busch at intel.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang at oracle.com>
---
 drivers/nvme/host/pci.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6fe7af0..00cffed 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2186,7 +2186,10 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	if (!dead) {
 		if (shutdown)
 			nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
+	}
+	nvme_stop_queues(&dev->ctrl);
 
+	if (!dead) {
 		/*
 		 * If the controller is still alive tell it to stop using the
 		 * host memory buffer.  In theory the shutdown / reset should
@@ -2195,11 +2198,6 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 		 */
 		if (dev->host_mem_descs)
 			nvme_set_host_mem(dev, 0);
-
-	}
-	nvme_stop_queues(&dev->ctrl);
-
-	if (!dead) {
 		nvme_disable_io_queues(dev);
 		nvme_disable_admin_queue(dev, shutdown);
 	}
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH V2 2/6] nvme-pci: fix the freeze and quiesce for shutdown and reset case
  2018-02-05  9:20 ` Jianchao Wang
@ 2018-02-05  9:20   ` Jianchao Wang
  -1 siblings, 0 replies; 28+ messages in thread
From: Jianchao Wang @ 2018-02-05  9:20 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

Currently, request queue will be frozen and quiesced for both reset
and shutdown case. This will trigger ioq requests in RECONNECTING
state which should be avoided to prepare for following patch.
Just freeze request queue for shutdown case and drain all the resudual
entered requests after controller has been shutdown.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/pci.c | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 00cffed..a7fa397 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2172,21 +2172,23 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	if (pci_is_enabled(pdev)) {
 		u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
-		if (dev->ctrl.state == NVME_CTRL_LIVE ||
-		    dev->ctrl.state == NVME_CTRL_RESETTING)
-			nvme_start_freeze(&dev->ctrl);
 		dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
 			pdev->error_state  != pci_channel_io_normal);
 	}
 
-	/*
-	 * Give the controller a chance to complete all entered requests if
-	 * doing a safe shutdown.
-	 */
-	if (!dead) {
-		if (shutdown)
+	/* Just freeze the queue for shutdown case */
+	if (shutdown) {
+		if (dev->ctrl.state == NVME_CTRL_LIVE ||
+			dev->ctrl.state == NVME_CTRL_RESETTING)
+			nvme_start_freeze(&dev->ctrl);
+		/*
+		 * Give the controller a chance to complete all
+		 * entered requests if doing a safe shutdown.
+		 */
+		if (!dead)
 			nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
 	}
+
 	nvme_stop_queues(&dev->ctrl);
 
 	if (!dead) {
@@ -2210,12 +2212,15 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
 
 	/*
-	 * The driver will not be starting up queues again if shutting down so
-	 * must flush all entered requests to their failed completion to avoid
-	 * deadlocking blk-mq hot-cpu notifier.
+	 * For shutdown case, controller will not be setup again soon. If any
+	 * residual requests here, the controller must have go wrong. Drain and
+	 * fail all the residual entered IO requests.
 	 */
-	if (shutdown)
+	if (shutdown) {
 		nvme_start_queues(&dev->ctrl);
+		nvme_wait_freeze(&dev->ctrl);
+		nvme_stop_queues(&dev->ctrl);
+	}
 	mutex_unlock(&dev->shutdown_lock);
 }
 
@@ -2349,12 +2354,11 @@ static void nvme_reset_work(struct work_struct *work)
 		nvme_remove_namespaces(&dev->ctrl);
 		new_state = NVME_CTRL_ADMIN_ONLY;
 	} else {
-		nvme_start_queues(&dev->ctrl);
-		nvme_wait_freeze(&dev->ctrl);
 		/* hit this only when allocate tagset fails */
 		if (nvme_dev_add(dev))
 			new_state = NVME_CTRL_ADMIN_ONLY;
-		nvme_unfreeze(&dev->ctrl);
+		if (was_suspend)
+			nvme_unfreeze(&dev->ctrl);
 	}
 
 	/*
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH V2 2/6] nvme-pci: fix the freeze and quiesce for shutdown and reset case
@ 2018-02-05  9:20   ` Jianchao Wang
  0 siblings, 0 replies; 28+ messages in thread
From: Jianchao Wang @ 2018-02-05  9:20 UTC (permalink / raw)


Currently, request queue will be frozen and quiesced for both reset
and shutdown case. This will trigger ioq requests in RECONNECTING
state which should be avoided to prepare for following patch.
Just freeze request queue for shutdown case and drain all the resudual
entered requests after controller has been shutdown.

Signed-off-by: Jianchao Wang <jianchao.w.wang at oracle.com>
---
 drivers/nvme/host/pci.c | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 00cffed..a7fa397 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2172,21 +2172,23 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	if (pci_is_enabled(pdev)) {
 		u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
-		if (dev->ctrl.state == NVME_CTRL_LIVE ||
-		    dev->ctrl.state == NVME_CTRL_RESETTING)
-			nvme_start_freeze(&dev->ctrl);
 		dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
 			pdev->error_state  != pci_channel_io_normal);
 	}
 
-	/*
-	 * Give the controller a chance to complete all entered requests if
-	 * doing a safe shutdown.
-	 */
-	if (!dead) {
-		if (shutdown)
+	/* Just freeze the queue for shutdown case */
+	if (shutdown) {
+		if (dev->ctrl.state == NVME_CTRL_LIVE ||
+			dev->ctrl.state == NVME_CTRL_RESETTING)
+			nvme_start_freeze(&dev->ctrl);
+		/*
+		 * Give the controller a chance to complete all
+		 * entered requests if doing a safe shutdown.
+		 */
+		if (!dead)
 			nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
 	}
+
 	nvme_stop_queues(&dev->ctrl);
 
 	if (!dead) {
@@ -2210,12 +2212,15 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
 
 	/*
-	 * The driver will not be starting up queues again if shutting down so
-	 * must flush all entered requests to their failed completion to avoid
-	 * deadlocking blk-mq hot-cpu notifier.
+	 * For shutdown case, controller will not be setup again soon. If any
+	 * residual requests here, the controller must have go wrong. Drain and
+	 * fail all the residual entered IO requests.
 	 */
-	if (shutdown)
+	if (shutdown) {
 		nvme_start_queues(&dev->ctrl);
+		nvme_wait_freeze(&dev->ctrl);
+		nvme_stop_queues(&dev->ctrl);
+	}
 	mutex_unlock(&dev->shutdown_lock);
 }
 
@@ -2349,12 +2354,11 @@ static void nvme_reset_work(struct work_struct *work)
 		nvme_remove_namespaces(&dev->ctrl);
 		new_state = NVME_CTRL_ADMIN_ONLY;
 	} else {
-		nvme_start_queues(&dev->ctrl);
-		nvme_wait_freeze(&dev->ctrl);
 		/* hit this only when allocate tagset fails */
 		if (nvme_dev_add(dev))
 			new_state = NVME_CTRL_ADMIN_ONLY;
-		nvme_unfreeze(&dev->ctrl);
+		if (was_suspend)
+			nvme_unfreeze(&dev->ctrl);
 	}
 
 	/*
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH V2 3/6] blk-mq: make blk_mq_rq_update_aborted_gstate a external interface
  2018-02-05  9:20 ` Jianchao Wang
@ 2018-02-05  9:20   ` Jianchao Wang
  -1 siblings, 0 replies; 28+ messages in thread
From: Jianchao Wang @ 2018-02-05  9:20 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

No functional change, just make blk_mq_rq_update_aborted_gstate a
external interface.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 block/blk-mq.c         | 3 ++-
 include/linux/blk-mq.h | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 01f271d..a027ca2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -581,7 +581,7 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
 		*srcu_idx = srcu_read_lock(hctx->srcu);
 }
 
-static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
+void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
 {
 	unsigned long flags;
 
@@ -597,6 +597,7 @@ static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
 	u64_stats_update_end(&rq->aborted_gstate_sync);
 	local_irq_restore(flags);
 }
+EXPORT_SYMBOL(blk_mq_rq_update_aborted_gstate);
 
 static u64 blk_mq_rq_aborted_gstate(struct request *rq)
 {
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8efcf49..ad54024 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -257,6 +257,7 @@ void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
 void blk_mq_kick_requeue_list(struct request_queue *q);
 void blk_mq_delay_kick_requeue_list(struct request_queue *q, unsigned long msecs);
 void blk_mq_complete_request(struct request *rq);
+void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate);
 
 bool blk_mq_queue_stopped(struct request_queue *q);
 void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH V2 3/6] blk-mq: make blk_mq_rq_update_aborted_gstate a external interface
@ 2018-02-05  9:20   ` Jianchao Wang
  0 siblings, 0 replies; 28+ messages in thread
From: Jianchao Wang @ 2018-02-05  9:20 UTC (permalink / raw)


No functional change, just make blk_mq_rq_update_aborted_gstate a
external interface.

Signed-off-by: Jianchao Wang <jianchao.w.wang at oracle.com>
---
 block/blk-mq.c         | 3 ++-
 include/linux/blk-mq.h | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 01f271d..a027ca2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -581,7 +581,7 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
 		*srcu_idx = srcu_read_lock(hctx->srcu);
 }
 
-static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
+void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
 {
 	unsigned long flags;
 
@@ -597,6 +597,7 @@ static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
 	u64_stats_update_end(&rq->aborted_gstate_sync);
 	local_irq_restore(flags);
 }
+EXPORT_SYMBOL(blk_mq_rq_update_aborted_gstate);
 
 static u64 blk_mq_rq_aborted_gstate(struct request *rq)
 {
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8efcf49..ad54024 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -257,6 +257,7 @@ void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
 void blk_mq_kick_requeue_list(struct request_queue *q);
 void blk_mq_delay_kick_requeue_list(struct request_queue *q, unsigned long msecs);
 void blk_mq_complete_request(struct request *rq);
+void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate);
 
 bool blk_mq_queue_stopped(struct request_queue *q);
 void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH V2 4/6] nvme-pci: suspend queues based on online_queues
  2018-02-05  9:20 ` Jianchao Wang
@ 2018-02-05  9:20   ` Jianchao Wang
  -1 siblings, 0 replies; 28+ messages in thread
From: Jianchao Wang @ 2018-02-05  9:20 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

nvme cq irq is freed based on queue_count. When the sq/cq creation
fails, irq will not be setup. free_irq will warn 'Try to free
already-free irq'.

To fix it, we only increase online_queues when adminq/sq/cq are
created and associated irq is setup. Then suspend queues based
on online_queues.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/pci.c | 30 +++++++++++++++++++-----------
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a7fa397..117b837 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1315,9 +1315,6 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)
 	nvmeq->cq_vector = -1;
 	spin_unlock_irq(&nvmeq->q_lock);
 
-	if (!nvmeq->qid && nvmeq->dev->ctrl.admin_q)
-		blk_mq_quiesce_queue(nvmeq->dev->ctrl.admin_q);
-
 	pci_free_irq(to_pci_dev(nvmeq->dev->dev), vector, nvmeq);
 
 	return 0;
@@ -1461,13 +1458,14 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 	nvme_init_queue(nvmeq, qid);
 	result = queue_request_irq(nvmeq);
 	if (result < 0)
-		goto release_sq;
+		goto offline;
 
 	return result;
 
- release_sq:
+offline:
+	dev->online_queues--;
 	adapter_delete_sq(dev, qid);
- release_cq:
+release_cq:
 	adapter_delete_cq(dev, qid);
 	return result;
 }
@@ -1607,6 +1605,7 @@ static int nvme_pci_configure_admin_queue(struct nvme_dev *dev)
 	result = queue_request_irq(nvmeq);
 	if (result) {
 		nvmeq->cq_vector = -1;
+		dev->online_queues--;
 		return result;
 	}
 
@@ -1954,6 +1953,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	result = queue_request_irq(adminq);
 	if (result) {
 		adminq->cq_vector = -1;
+		dev->online_queues--;
 		return result;
 	}
 	return nvme_create_io_queues(dev);
@@ -2167,13 +2167,16 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	int i;
 	bool dead = true;
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
+	int onlines;
 
 	mutex_lock(&dev->shutdown_lock);
 	if (pci_is_enabled(pdev)) {
 		u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
-		dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
-			pdev->error_state  != pci_channel_io_normal);
+		dead = !!((csts & NVME_CSTS_CFS) ||
+				!(csts & NVME_CSTS_RDY) ||
+				(pdev->error_state  != pci_channel_io_normal) ||
+				(dev->online_queues == 0));
 	}
 
 	/* Just freeze the queue for shutdown case */
@@ -2203,9 +2206,14 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 		nvme_disable_io_queues(dev);
 		nvme_disable_admin_queue(dev, shutdown);
 	}
-	for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
+
+	onlines = dev->online_queues;
+	for (i = onlines - 1; i >= 0; i--)
 		nvme_suspend_queue(&dev->queues[i]);
 
+	if (dev->ctrl.admin_q)
+		blk_mq_quiesce_queue(dev->ctrl.admin_q);
+
 	nvme_pci_disable(dev);
 
 	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
@@ -2348,12 +2356,12 @@ static void nvme_reset_work(struct work_struct *work)
 	 * Keep the controller around but remove all namespaces if we don't have
 	 * any working I/O queue.
 	 */
-	if (dev->online_queues < 2) {
+	if (dev->online_queues == 1) {
 		dev_warn(dev->ctrl.device, "IO queues not created\n");
 		nvme_kill_queues(&dev->ctrl);
 		nvme_remove_namespaces(&dev->ctrl);
 		new_state = NVME_CTRL_ADMIN_ONLY;
-	} else {
+	} else if (dev->online_queues > 1) {
 		/* hit this only when allocate tagset fails */
 		if (nvme_dev_add(dev))
 			new_state = NVME_CTRL_ADMIN_ONLY;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH V2 4/6] nvme-pci: suspend queues based on online_queues
@ 2018-02-05  9:20   ` Jianchao Wang
  0 siblings, 0 replies; 28+ messages in thread
From: Jianchao Wang @ 2018-02-05  9:20 UTC (permalink / raw)


nvme cq irq is freed based on queue_count. When the sq/cq creation
fails, irq will not be setup. free_irq will warn 'Try to free
already-free irq'.

To fix it, we only increase online_queues when adminq/sq/cq are
created and associated irq is setup. Then suspend queues based
on online_queues.

Signed-off-by: Jianchao Wang <jianchao.w.wang at oracle.com>
---
 drivers/nvme/host/pci.c | 30 +++++++++++++++++++-----------
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a7fa397..117b837 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1315,9 +1315,6 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)
 	nvmeq->cq_vector = -1;
 	spin_unlock_irq(&nvmeq->q_lock);
 
-	if (!nvmeq->qid && nvmeq->dev->ctrl.admin_q)
-		blk_mq_quiesce_queue(nvmeq->dev->ctrl.admin_q);
-
 	pci_free_irq(to_pci_dev(nvmeq->dev->dev), vector, nvmeq);
 
 	return 0;
@@ -1461,13 +1458,14 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 	nvme_init_queue(nvmeq, qid);
 	result = queue_request_irq(nvmeq);
 	if (result < 0)
-		goto release_sq;
+		goto offline;
 
 	return result;
 
- release_sq:
+offline:
+	dev->online_queues--;
 	adapter_delete_sq(dev, qid);
- release_cq:
+release_cq:
 	adapter_delete_cq(dev, qid);
 	return result;
 }
@@ -1607,6 +1605,7 @@ static int nvme_pci_configure_admin_queue(struct nvme_dev *dev)
 	result = queue_request_irq(nvmeq);
 	if (result) {
 		nvmeq->cq_vector = -1;
+		dev->online_queues--;
 		return result;
 	}
 
@@ -1954,6 +1953,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	result = queue_request_irq(adminq);
 	if (result) {
 		adminq->cq_vector = -1;
+		dev->online_queues--;
 		return result;
 	}
 	return nvme_create_io_queues(dev);
@@ -2167,13 +2167,16 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	int i;
 	bool dead = true;
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
+	int onlines;
 
 	mutex_lock(&dev->shutdown_lock);
 	if (pci_is_enabled(pdev)) {
 		u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
-		dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
-			pdev->error_state  != pci_channel_io_normal);
+		dead = !!((csts & NVME_CSTS_CFS) ||
+				!(csts & NVME_CSTS_RDY) ||
+				(pdev->error_state  != pci_channel_io_normal) ||
+				(dev->online_queues == 0));
 	}
 
 	/* Just freeze the queue for shutdown case */
@@ -2203,9 +2206,14 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 		nvme_disable_io_queues(dev);
 		nvme_disable_admin_queue(dev, shutdown);
 	}
-	for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
+
+	onlines = dev->online_queues;
+	for (i = onlines - 1; i >= 0; i--)
 		nvme_suspend_queue(&dev->queues[i]);
 
+	if (dev->ctrl.admin_q)
+		blk_mq_quiesce_queue(dev->ctrl.admin_q);
+
 	nvme_pci_disable(dev);
 
 	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
@@ -2348,12 +2356,12 @@ static void nvme_reset_work(struct work_struct *work)
 	 * Keep the controller around but remove all namespaces if we don't have
 	 * any working I/O queue.
 	 */
-	if (dev->online_queues < 2) {
+	if (dev->online_queues == 1) {
 		dev_warn(dev->ctrl.device, "IO queues not created\n");
 		nvme_kill_queues(&dev->ctrl);
 		nvme_remove_namespaces(&dev->ctrl);
 		new_state = NVME_CTRL_ADMIN_ONLY;
-	} else {
+	} else if (dev->online_queues > 1) {
 		/* hit this only when allocate tagset fails */
 		if (nvme_dev_add(dev))
 			new_state = NVME_CTRL_ADMIN_ONLY;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH V2 5/6] nvme-pci: break up nvme_timeout and nvme_dev_disable
  2018-02-05  9:20 ` Jianchao Wang
@ 2018-02-05  9:20   ` Jianchao Wang
  -1 siblings, 0 replies; 28+ messages in thread
From: Jianchao Wang @ 2018-02-05  9:20 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

Currently, the complicated relationship between nvme_dev_disable
and nvme_timeout has become a devil that will introduce many
circular pattern which may trigger deadlock or IO hang. Let's
enumerate the tangles between them:
 - nvme_timeout has to invoke nvme_dev_disable to stop the
   controller doing DMA access before free the request.
 - nvme_dev_disable has to depend on nvme_timeout to complete
   adminq requests to set HMB or delete sq/cq when the controller
   has no response.
 - nvme_dev_disable will race with nvme_timeout when cancels the
   previous outstanding requests.
2nd case is necessary. This patch is to break up 1st and 3th case.

To achieve this:
Use blk_abort_request to force all the previous outstanding requests
expired in nvme_dev_disable. Set NVME_REQ_CANCELLED and return
BLK_EH_NOT_HANDLED in nvme_timeout. Then the request will not be
completed and freed. We needn't invoke nvme_dev_disable any more.

We use NVME_REQ_CANCELLED to identify them. After the controller is
totally disabled/shutdown, we invoke blk_mq_rq_update_aborted_gstate
to clear requests and invoke blk_mq_complete_request to complete them.
In addition, to identify the previous adminq/IOq request and adminq
requests from nvme_dev_disable, we introduce
NVME_PCI_OUTSTANDING_GRABBING and NVME_PCI_OUTSTANDING_GRABBED to
let nvme_timeout be able to distinguish them.

For the adminq requests from nvme_dev_disable/nvme_reset_work:
invoke nvme_disable_ctrl directly, then set NVME_REQ_CANCELLED and
return BLK_EH_HANDLED. nvme_dev_disable/nvme_reset_work will
see the error.

With this patch, we could avoid nvme_dev_disable to be invoked
by nvme_timeout and implementat synchronization between them.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/pci.c | 184 +++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 149 insertions(+), 35 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 117b837..43033fc 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -70,6 +70,9 @@ struct nvme_queue;
 
 static void nvme_process_cq(struct nvme_queue *nvmeq);
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
+static void nvme_pci_free_and_disable(struct nvme_dev *dev);
+#define NVME_PCI_OUTSTANDING_GRABBING 1
+#define NVME_PCI_OUTSTANDING_GRABBED 2
 
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
@@ -80,6 +83,7 @@ struct nvme_dev {
 	struct blk_mq_tag_set admin_tagset;
 	u32 __iomem *dbs;
 	struct device *dev;
+	int grab_flag;
 	struct dma_pool *prp_page_pool;
 	struct dma_pool *prp_small_pool;
 	unsigned online_queues;
@@ -1130,6 +1134,24 @@ static void abort_endio(struct request *req, blk_status_t error)
 	blk_mq_free_request(req);
 }
 
+static void nvme_pci_disable_ctrl_directly(struct nvme_dev *dev)
+{
+	struct pci_dev *pdev = to_pci_dev(dev->dev);
+	u32 csts;
+	bool dead;
+
+	if (!pci_is_enabled(pdev))
+		return;
+
+	csts = readl(dev->bar + NVME_REG_CSTS);
+	dead = !!((csts & NVME_CSTS_CFS) ||
+			!(csts & NVME_CSTS_RDY) ||
+			pdev->error_state  != pci_channel_io_normal);
+	if (!dead)
+		nvme_disable_ctrl(&dev->ctrl, dev->ctrl.cap);
+	nvme_pci_free_and_disable(dev);
+}
+
 static bool nvme_should_reset(struct nvme_dev *dev, u32 csts)
 {
 
@@ -1191,12 +1213,13 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 
 	/*
 	 * Reset immediately if the controller is failed
+	 * nvme_dev_disable will take over the expired requests.
 	 */
 	if (nvme_should_reset(dev, csts)) {
+		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
 		nvme_warn_reset(dev, csts);
-		nvme_dev_disable(dev, false);
 		nvme_reset_ctrl(&dev->ctrl);
-		return BLK_EH_HANDLED;
+		return BLK_EH_NOT_HANDLED;
 	}
 
 	/*
@@ -1210,38 +1233,51 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	}
 
 	/*
-	 * Shutdown immediately if controller times out while starting. The
-	 * reset work will see the pci device disabled when it gets the forced
-	 * cancellation error. All outstanding requests are completed on
-	 * shutdown, so we return BLK_EH_HANDLED.
+	 * The previous outstanding requests on adminq and ioq have been
+	 * grabbed or drained for RECONNECTING state, so it must be admin
+	 * request from the nvme_dev_disable or  nvme_reset_work.
+	 * - stop the controller directly
+	 * - set NVME_REQ_CANCELLED, nvme_dev_disable or nvme_reset_work
+	 *   could see this error and stop.
+	 * - return BLK_EH_HANDLED
 	 */
-	if (dev->ctrl.state == NVME_CTRL_RESETTING) {
-		dev_warn(dev->ctrl.device,
-			 "I/O %d QID %d timeout, disable controller\n",
-			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false);
+	if ((READ_ONCE(dev->grab_flag) == NVME_PCI_OUTSTANDING_GRABBED) ||
+		(dev->ctrl.state == NVME_CTRL_RECONNECTING)) {
+		nvme_pci_disable_ctrl_directly(dev);
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
 		return BLK_EH_HANDLED;
 	}
 
 	/*
- 	 * Shutdown the controller immediately and schedule a reset if the
- 	 * command was already aborted once before and still hasn't been
- 	 * returned to the driver, or if this is the admin queue.
+	 * nvme_dev_disable is grabbing all the outstanding requests.
+	 * Just set NVME_REQ_CANCELLED and return BLK_EH_NOT_HANDLED.
+	 * The nvme_dev_disable will take over all the things.
+	 */
+	if (READ_ONCE(dev->grab_flag) == NVME_PCI_OUTSTANDING_GRABBING) {
+		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
+		return BLK_EH_NOT_HANDLED;
+	}
+
+	/*
+	 * If the controller is DELETING, needn't to do any recovery,
+	 */
+	if (dev->ctrl.state == NVME_CTRL_DELETING) {
+		nvme_pci_disable_ctrl_directly(dev);
+		nvme_req(req)->status = NVME_SC_ABORT_REQ;
+		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
+		return BLK_EH_HANDLED;
+	}
+	/*
+	 * Just set NVME_REQ_CANCELLED and return BLK_EH_NOT_HANDLED,
+	 * The nvme_dev_disable will take over all the things.
 	 */
 	if (!nvmeq->qid || iod->aborted) {
 		dev_warn(dev->ctrl.device,
 			 "I/O %d QID %d timeout, reset controller\n",
 			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false);
 		nvme_reset_ctrl(&dev->ctrl);
-
-		/*
-		 * Mark the request as handled, since the inline shutdown
-		 * forces all outstanding requests to complete.
-		 */
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
-		return BLK_EH_HANDLED;
+		return BLK_EH_NOT_HANDLED;
 	}
 
 	if (atomic_dec_return(&dev->ctrl.abort_limit) < 0) {
@@ -2149,25 +2185,97 @@ static void nvme_dev_unmap(struct nvme_dev *dev)
 	pci_release_mem_regions(to_pci_dev(dev->dev));
 }
 
-static void nvme_pci_disable(struct nvme_dev *dev)
+static void nvme_pci_abort_rq(struct request *req, void *data, bool reserved)
+{
+	if (!blk_mq_request_started(req))
+		return;
+
+	dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device,
+				"Abort I/O %d", req->tag);
+
+	nvme_req(req)->status = NVME_SC_ABORT_REQ;
+	blk_abort_request(req);
+}
+
+/*
+ * nvme_pci_grab_outstanding and shutdown_lock ensure will be
+ * only one nvme_pci_free_and_disable running at one moment.
+ */
+static void nvme_pci_free_and_disable(struct nvme_dev *dev)
 {
+	int onlines, i;
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 
+	if (!pci_is_enabled(pdev))
+		return;
+
+	onlines = dev->online_queues;
+	for (i = onlines - 1; i >= 0; i--)
+		nvme_suspend_queue(&dev->queues[i]);
+
 	nvme_release_cmb(dev);
 	pci_free_irq_vectors(pdev);
 
-	if (pci_is_enabled(pdev)) {
-		pci_disable_pcie_error_reporting(pdev);
-		pci_disable_device(pdev);
-	}
+	pci_disable_pcie_error_reporting(pdev);
+	pci_disable_device(pdev);
+}
+
+/*
+ * Now the controller has been disabled, no interrupt will race with us,
+ * so it is safe to complete them
+ *  - IO requests will be requeued.
+ *  - adminq requests will be failed, the tags will also be freed.
+ */
+static void nvme_pci_cancel_rq(struct request *req, void *data, bool reserved)
+{
+	if (!blk_mq_request_started(req))
+		return;
+
+	dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device,
+				"Cancel I/O %d", req->tag);
+
+	/* controller has been disabled/shtdown, it is safe here to clear
+	 * the aborted_gstate and complete request.
+	 */
+	if (nvme_req(req)->flags | NVME_REQ_CANCELLED)
+		blk_mq_rq_update_aborted_gstate(req, 0);
+	blk_mq_complete_request(req);
+}
+
+
+static void nvme_pci_sync_timeout_work(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	/* ensure the timeout_work is queued */
+	kblockd_schedule_work(&ctrl->admin_q->timeout_work);
+	flush_work(&ctrl->admin_q->timeout_work);
+
+	mutex_lock(&ctrl->namespaces_mutex);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		kblockd_schedule_work(&ns->queue->timeout_work);
+
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		flush_work(&ns->queue->timeout_work);
+	mutex_unlock(&ctrl->namespaces_mutex);
+}
+
+static void nvme_pci_grab_outstanding(struct nvme_dev *dev)
+{
+	blk_mq_tagset_busy_iter(&dev->tagset, nvme_pci_abort_rq, &dev->ctrl);
+	blk_mq_tagset_busy_iter(&dev->admin_tagset,
+			nvme_pci_abort_rq, &dev->ctrl);
+	nvme_pci_sync_timeout_work(&dev->ctrl);
+	/* All the outstanding requests have been grabbed. They are forced to be
+	 * expired, neither irq completion path nor timeout path could take them
+	 * away.
+	 */
 }
 
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 {
-	int i;
 	bool dead = true;
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
-	int onlines;
 
 	mutex_lock(&dev->shutdown_lock);
 	if (pci_is_enabled(pdev)) {
@@ -2194,6 +2302,14 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 
 	nvme_stop_queues(&dev->ctrl);
 
+	WRITE_ONCE(dev->grab_flag, NVME_PCI_OUTSTANDING_GRABBING);
+	nvme_pci_grab_outstanding(dev);
+	WRITE_ONCE(dev->grab_flag, NVME_PCI_OUTSTANDING_GRABBED);
+	/*
+	 * All the previous outstanding requests have been grabbed and
+	 * nvme_timeout path is guaranteed to be drained.
+	 */
+
 	if (!dead) {
 		/*
 		 * If the controller is still alive tell it to stop using the
@@ -2207,18 +2323,16 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 		nvme_disable_admin_queue(dev, shutdown);
 	}
 
-	onlines = dev->online_queues;
-	for (i = onlines - 1; i >= 0; i--)
-		nvme_suspend_queue(&dev->queues[i]);
-
 	if (dev->ctrl.admin_q)
 		blk_mq_quiesce_queue(dev->ctrl.admin_q);
 
-	nvme_pci_disable(dev);
+	nvme_pci_free_and_disable(dev);
 
-	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
-	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
+	blk_mq_tagset_busy_iter(&dev->tagset, nvme_pci_cancel_rq, &dev->ctrl);
+	blk_mq_tagset_busy_iter(&dev->admin_tagset,
+			nvme_pci_cancel_rq, &dev->ctrl);
 
+	WRITE_ONCE(dev->grab_flag, 0);
 	/*
 	 * For shutdown case, controller will not be setup again soon. If any
 	 * residual requests here, the controller must have go wrong. Drain and
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH V2 5/6] nvme-pci: break up nvme_timeout and nvme_dev_disable
@ 2018-02-05  9:20   ` Jianchao Wang
  0 siblings, 0 replies; 28+ messages in thread
From: Jianchao Wang @ 2018-02-05  9:20 UTC (permalink / raw)


Currently, the complicated relationship between nvme_dev_disable
and nvme_timeout has become a devil that will introduce many
circular pattern which may trigger deadlock or IO hang. Let's
enumerate the tangles between them:
 - nvme_timeout has to invoke nvme_dev_disable to stop the
   controller doing DMA access before free the request.
 - nvme_dev_disable has to depend on nvme_timeout to complete
   adminq requests to set HMB or delete sq/cq when the controller
   has no response.
 - nvme_dev_disable will race with nvme_timeout when cancels the
   previous outstanding requests.
2nd case is necessary. This patch is to break up 1st and 3th case.

To achieve this:
Use blk_abort_request to force all the previous outstanding requests
expired in nvme_dev_disable. Set NVME_REQ_CANCELLED and return
BLK_EH_NOT_HANDLED in nvme_timeout. Then the request will not be
completed and freed. We needn't invoke nvme_dev_disable any more.

We use NVME_REQ_CANCELLED to identify them. After the controller is
totally disabled/shutdown, we invoke blk_mq_rq_update_aborted_gstate
to clear requests and invoke blk_mq_complete_request to complete them.
In addition, to identify the previous adminq/IOq request and adminq
requests from nvme_dev_disable, we introduce
NVME_PCI_OUTSTANDING_GRABBING and NVME_PCI_OUTSTANDING_GRABBED to
let nvme_timeout be able to distinguish them.

For the adminq requests from nvme_dev_disable/nvme_reset_work:
invoke nvme_disable_ctrl directly, then set NVME_REQ_CANCELLED and
return BLK_EH_HANDLED. nvme_dev_disable/nvme_reset_work will
see the error.

With this patch, we could avoid nvme_dev_disable to be invoked
by nvme_timeout and implementat synchronization between them.

Signed-off-by: Jianchao Wang <jianchao.w.wang at oracle.com>
---
 drivers/nvme/host/pci.c | 184 +++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 149 insertions(+), 35 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 117b837..43033fc 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -70,6 +70,9 @@ struct nvme_queue;
 
 static void nvme_process_cq(struct nvme_queue *nvmeq);
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
+static void nvme_pci_free_and_disable(struct nvme_dev *dev);
+#define NVME_PCI_OUTSTANDING_GRABBING 1
+#define NVME_PCI_OUTSTANDING_GRABBED 2
 
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
@@ -80,6 +83,7 @@ struct nvme_dev {
 	struct blk_mq_tag_set admin_tagset;
 	u32 __iomem *dbs;
 	struct device *dev;
+	int grab_flag;
 	struct dma_pool *prp_page_pool;
 	struct dma_pool *prp_small_pool;
 	unsigned online_queues;
@@ -1130,6 +1134,24 @@ static void abort_endio(struct request *req, blk_status_t error)
 	blk_mq_free_request(req);
 }
 
+static void nvme_pci_disable_ctrl_directly(struct nvme_dev *dev)
+{
+	struct pci_dev *pdev = to_pci_dev(dev->dev);
+	u32 csts;
+	bool dead;
+
+	if (!pci_is_enabled(pdev))
+		return;
+
+	csts = readl(dev->bar + NVME_REG_CSTS);
+	dead = !!((csts & NVME_CSTS_CFS) ||
+			!(csts & NVME_CSTS_RDY) ||
+			pdev->error_state  != pci_channel_io_normal);
+	if (!dead)
+		nvme_disable_ctrl(&dev->ctrl, dev->ctrl.cap);
+	nvme_pci_free_and_disable(dev);
+}
+
 static bool nvme_should_reset(struct nvme_dev *dev, u32 csts)
 {
 
@@ -1191,12 +1213,13 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 
 	/*
 	 * Reset immediately if the controller is failed
+	 * nvme_dev_disable will take over the expired requests.
 	 */
 	if (nvme_should_reset(dev, csts)) {
+		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
 		nvme_warn_reset(dev, csts);
-		nvme_dev_disable(dev, false);
 		nvme_reset_ctrl(&dev->ctrl);
-		return BLK_EH_HANDLED;
+		return BLK_EH_NOT_HANDLED;
 	}
 
 	/*
@@ -1210,38 +1233,51 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	}
 
 	/*
-	 * Shutdown immediately if controller times out while starting. The
-	 * reset work will see the pci device disabled when it gets the forced
-	 * cancellation error. All outstanding requests are completed on
-	 * shutdown, so we return BLK_EH_HANDLED.
+	 * The previous outstanding requests on adminq and ioq have been
+	 * grabbed or drained for RECONNECTING state, so it must be admin
+	 * request from the nvme_dev_disable or  nvme_reset_work.
+	 * - stop the controller directly
+	 * - set NVME_REQ_CANCELLED, nvme_dev_disable or nvme_reset_work
+	 *   could see this error and stop.
+	 * - return BLK_EH_HANDLED
 	 */
-	if (dev->ctrl.state == NVME_CTRL_RESETTING) {
-		dev_warn(dev->ctrl.device,
-			 "I/O %d QID %d timeout, disable controller\n",
-			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false);
+	if ((READ_ONCE(dev->grab_flag) == NVME_PCI_OUTSTANDING_GRABBED) ||
+		(dev->ctrl.state == NVME_CTRL_RECONNECTING)) {
+		nvme_pci_disable_ctrl_directly(dev);
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
 		return BLK_EH_HANDLED;
 	}
 
 	/*
- 	 * Shutdown the controller immediately and schedule a reset if the
- 	 * command was already aborted once before and still hasn't been
- 	 * returned to the driver, or if this is the admin queue.
+	 * nvme_dev_disable is grabbing all the outstanding requests.
+	 * Just set NVME_REQ_CANCELLED and return BLK_EH_NOT_HANDLED.
+	 * The nvme_dev_disable will take over all the things.
+	 */
+	if (READ_ONCE(dev->grab_flag) == NVME_PCI_OUTSTANDING_GRABBING) {
+		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
+		return BLK_EH_NOT_HANDLED;
+	}
+
+	/*
+	 * If the controller is DELETING, needn't to do any recovery,
+	 */
+	if (dev->ctrl.state == NVME_CTRL_DELETING) {
+		nvme_pci_disable_ctrl_directly(dev);
+		nvme_req(req)->status = NVME_SC_ABORT_REQ;
+		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
+		return BLK_EH_HANDLED;
+	}
+	/*
+	 * Just set NVME_REQ_CANCELLED and return BLK_EH_NOT_HANDLED,
+	 * The nvme_dev_disable will take over all the things.
 	 */
 	if (!nvmeq->qid || iod->aborted) {
 		dev_warn(dev->ctrl.device,
 			 "I/O %d QID %d timeout, reset controller\n",
 			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false);
 		nvme_reset_ctrl(&dev->ctrl);
-
-		/*
-		 * Mark the request as handled, since the inline shutdown
-		 * forces all outstanding requests to complete.
-		 */
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
-		return BLK_EH_HANDLED;
+		return BLK_EH_NOT_HANDLED;
 	}
 
 	if (atomic_dec_return(&dev->ctrl.abort_limit) < 0) {
@@ -2149,25 +2185,97 @@ static void nvme_dev_unmap(struct nvme_dev *dev)
 	pci_release_mem_regions(to_pci_dev(dev->dev));
 }
 
-static void nvme_pci_disable(struct nvme_dev *dev)
+static void nvme_pci_abort_rq(struct request *req, void *data, bool reserved)
+{
+	if (!blk_mq_request_started(req))
+		return;
+
+	dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device,
+				"Abort I/O %d", req->tag);
+
+	nvme_req(req)->status = NVME_SC_ABORT_REQ;
+	blk_abort_request(req);
+}
+
+/*
+ * nvme_pci_grab_outstanding and shutdown_lock ensure will be
+ * only one nvme_pci_free_and_disable running at one moment.
+ */
+static void nvme_pci_free_and_disable(struct nvme_dev *dev)
 {
+	int onlines, i;
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 
+	if (!pci_is_enabled(pdev))
+		return;
+
+	onlines = dev->online_queues;
+	for (i = onlines - 1; i >= 0; i--)
+		nvme_suspend_queue(&dev->queues[i]);
+
 	nvme_release_cmb(dev);
 	pci_free_irq_vectors(pdev);
 
-	if (pci_is_enabled(pdev)) {
-		pci_disable_pcie_error_reporting(pdev);
-		pci_disable_device(pdev);
-	}
+	pci_disable_pcie_error_reporting(pdev);
+	pci_disable_device(pdev);
+}
+
+/*
+ * Now the controller has been disabled, no interrupt will race with us,
+ * so it is safe to complete them
+ *  - IO requests will be requeued.
+ *  - adminq requests will be failed, the tags will also be freed.
+ */
+static void nvme_pci_cancel_rq(struct request *req, void *data, bool reserved)
+{
+	if (!blk_mq_request_started(req))
+		return;
+
+	dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device,
+				"Cancel I/O %d", req->tag);
+
+	/* controller has been disabled/shtdown, it is safe here to clear
+	 * the aborted_gstate and complete request.
+	 */
+	if (nvme_req(req)->flags | NVME_REQ_CANCELLED)
+		blk_mq_rq_update_aborted_gstate(req, 0);
+	blk_mq_complete_request(req);
+}
+
+
+static void nvme_pci_sync_timeout_work(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	/* ensure the timeout_work is queued */
+	kblockd_schedule_work(&ctrl->admin_q->timeout_work);
+	flush_work(&ctrl->admin_q->timeout_work);
+
+	mutex_lock(&ctrl->namespaces_mutex);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		kblockd_schedule_work(&ns->queue->timeout_work);
+
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		flush_work(&ns->queue->timeout_work);
+	mutex_unlock(&ctrl->namespaces_mutex);
+}
+
+static void nvme_pci_grab_outstanding(struct nvme_dev *dev)
+{
+	blk_mq_tagset_busy_iter(&dev->tagset, nvme_pci_abort_rq, &dev->ctrl);
+	blk_mq_tagset_busy_iter(&dev->admin_tagset,
+			nvme_pci_abort_rq, &dev->ctrl);
+	nvme_pci_sync_timeout_work(&dev->ctrl);
+	/* All the outstanding requests have been grabbed. They are forced to be
+	 * expired, neither irq completion path nor timeout path could take them
+	 * away.
+	 */
 }
 
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 {
-	int i;
 	bool dead = true;
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
-	int onlines;
 
 	mutex_lock(&dev->shutdown_lock);
 	if (pci_is_enabled(pdev)) {
@@ -2194,6 +2302,14 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 
 	nvme_stop_queues(&dev->ctrl);
 
+	WRITE_ONCE(dev->grab_flag, NVME_PCI_OUTSTANDING_GRABBING);
+	nvme_pci_grab_outstanding(dev);
+	WRITE_ONCE(dev->grab_flag, NVME_PCI_OUTSTANDING_GRABBED);
+	/*
+	 * All the previous outstanding requests have been grabbed and
+	 * nvme_timeout path is guaranteed to be drained.
+	 */
+
 	if (!dead) {
 		/*
 		 * If the controller is still alive tell it to stop using the
@@ -2207,18 +2323,16 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 		nvme_disable_admin_queue(dev, shutdown);
 	}
 
-	onlines = dev->online_queues;
-	for (i = onlines - 1; i >= 0; i--)
-		nvme_suspend_queue(&dev->queues[i]);
-
 	if (dev->ctrl.admin_q)
 		blk_mq_quiesce_queue(dev->ctrl.admin_q);
 
-	nvme_pci_disable(dev);
+	nvme_pci_free_and_disable(dev);
 
-	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
-	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
+	blk_mq_tagset_busy_iter(&dev->tagset, nvme_pci_cancel_rq, &dev->ctrl);
+	blk_mq_tagset_busy_iter(&dev->admin_tagset,
+			nvme_pci_cancel_rq, &dev->ctrl);
 
+	WRITE_ONCE(dev->grab_flag, 0);
 	/*
 	 * For shutdown case, controller will not be setup again soon. If any
 	 * residual requests here, the controller must have go wrong. Drain and
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH V2 6/6] nvme-pci: discard wait timeout when delete cq/sq
  2018-02-05  9:20 ` Jianchao Wang
@ 2018-02-05  9:20   ` Jianchao Wang
  -1 siblings, 0 replies; 28+ messages in thread
From: Jianchao Wang @ 2018-02-05  9:20 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

Currently, nvme_disable_io_queues could be wakeup by both request
completion and wait timeout path. This is unnecessary and could
introduce race between nvme_dev_disable and request timeout path.
When delete cq/sq command expires, the nvme_disable_io_queues will
also be wakeup and return to nvme_dev_disable, then handle the
outstanding requests. This will race with the request timeout path.

To fix it, just use wait_for_completion instead of the timeout one.
The request timeout path will wakeup it.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/pci.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 43033fc..f3262cc 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2050,7 +2050,6 @@ static int nvme_delete_queue(struct nvme_queue *nvmeq, u8 opcode)
 static void nvme_disable_io_queues(struct nvme_dev *dev)
 {
 	int pass, queues = dev->online_queues - 1;
-	unsigned long timeout;
 	u8 opcode = nvme_admin_delete_sq;
 
 	for (pass = 0; pass < 2; pass++) {
@@ -2058,15 +2057,12 @@ static void nvme_disable_io_queues(struct nvme_dev *dev)
 
 		reinit_completion(&dev->ioq_wait);
  retry:
-		timeout = ADMIN_TIMEOUT;
 		for (; i > 0; i--, sent++)
 			if (nvme_delete_queue(&dev->queues[i], opcode))
 				break;
 
 		while (sent--) {
-			timeout = wait_for_completion_io_timeout(&dev->ioq_wait, timeout);
-			if (timeout == 0)
-				return;
+			wait_for_completion(&dev->ioq_wait);
 			if (i)
 				goto retry;
 		}
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH V2 6/6] nvme-pci: discard wait timeout when delete cq/sq
@ 2018-02-05  9:20   ` Jianchao Wang
  0 siblings, 0 replies; 28+ messages in thread
From: Jianchao Wang @ 2018-02-05  9:20 UTC (permalink / raw)


Currently, nvme_disable_io_queues could be wakeup by both request
completion and wait timeout path. This is unnecessary and could
introduce race between nvme_dev_disable and request timeout path.
When delete cq/sq command expires, the nvme_disable_io_queues will
also be wakeup and return to nvme_dev_disable, then handle the
outstanding requests. This will race with the request timeout path.

To fix it, just use wait_for_completion instead of the timeout one.
The request timeout path will wakeup it.

Signed-off-by: Jianchao Wang <jianchao.w.wang at oracle.com>
---
 drivers/nvme/host/pci.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 43033fc..f3262cc 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2050,7 +2050,6 @@ static int nvme_delete_queue(struct nvme_queue *nvmeq, u8 opcode)
 static void nvme_disable_io_queues(struct nvme_dev *dev)
 {
 	int pass, queues = dev->online_queues - 1;
-	unsigned long timeout;
 	u8 opcode = nvme_admin_delete_sq;
 
 	for (pass = 0; pass < 2; pass++) {
@@ -2058,15 +2057,12 @@ static void nvme_disable_io_queues(struct nvme_dev *dev)
 
 		reinit_completion(&dev->ioq_wait);
  retry:
-		timeout = ADMIN_TIMEOUT;
 		for (; i > 0; i--, sent++)
 			if (nvme_delete_queue(&dev->queues[i], opcode))
 				break;
 
 		while (sent--) {
-			timeout = wait_for_completion_io_timeout(&dev->ioq_wait, timeout);
-			if (timeout == 0)
-				return;
+			wait_for_completion(&dev->ioq_wait);
 			if (i)
 				goto retry;
 		}
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH V2 0/6]nvme-pci: fixes on nvme_timeout and nvme_dev_disable
  2018-02-05  9:20 ` Jianchao Wang
@ 2018-02-08 15:56   ` Sagi Grimberg
  -1 siblings, 0 replies; 28+ messages in thread
From: Sagi Grimberg @ 2018-02-08 15:56 UTC (permalink / raw)
  To: Jianchao Wang, keith.busch, axboe, hch; +Cc: linux-nvme, linux-kernel

Jianchao,

Given the discussion on this set, you plan to respin again
for 4.16?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH V2 0/6]nvme-pci: fixes on nvme_timeout and nvme_dev_disable
@ 2018-02-08 15:56   ` Sagi Grimberg
  0 siblings, 0 replies; 28+ messages in thread
From: Sagi Grimberg @ 2018-02-08 15:56 UTC (permalink / raw)


Jianchao,

Given the discussion on this set, you plan to respin again
for 4.16?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V2 0/6]nvme-pci: fixes on nvme_timeout and nvme_dev_disable
  2018-02-08 15:56   ` Sagi Grimberg
@ 2018-02-08 17:56     ` Keith Busch
  -1 siblings, 0 replies; 28+ messages in thread
From: Keith Busch @ 2018-02-08 17:56 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Jianchao Wang, axboe, hch, linux-nvme, linux-kernel

On Thu, Feb 08, 2018 at 05:56:49PM +0200, Sagi Grimberg wrote:
> Given the discussion on this set, you plan to respin again
> for 4.16?

With the exception of maybe patch 1, this needs more consideration than
I'd feel okay with for the 4.16 release.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH V2 0/6]nvme-pci: fixes on nvme_timeout and nvme_dev_disable
@ 2018-02-08 17:56     ` Keith Busch
  0 siblings, 0 replies; 28+ messages in thread
From: Keith Busch @ 2018-02-08 17:56 UTC (permalink / raw)


On Thu, Feb 08, 2018@05:56:49PM +0200, Sagi Grimberg wrote:
> Given the discussion on this set, you plan to respin again
> for 4.16?

With the exception of maybe patch 1, this needs more consideration than
I'd feel okay with for the 4.16 release.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V2 0/6]nvme-pci: fixes on nvme_timeout and nvme_dev_disable
  2018-02-08 17:56     ` Keith Busch
@ 2018-02-09  1:50       ` jianchao.wang
  -1 siblings, 0 replies; 28+ messages in thread
From: jianchao.wang @ 2018-02-09  1:50 UTC (permalink / raw)
  To: Keith Busch, Sagi Grimberg; +Cc: axboe, hch, linux-nvme, linux-kernel

Hi Keith and Sagi

Many thanks for your kindly response.
That's really appreciated.

On 02/09/2018 01:56 AM, Keith Busch wrote:
> On Thu, Feb 08, 2018 at 05:56:49PM +0200, Sagi Grimberg wrote:
>> Given the discussion on this set, you plan to respin again
>> for 4.16?
> 
> With the exception of maybe patch 1, this needs more consideration than
> I'd feel okay with for the 4.16 release.
> 
Currently, one of the block is the nvme_wait_freeze in nvme_reset_work.
This cause some issues when I test this patchset yesterday.
As I posted on the V1 patchset mail thread:

if we set NVME_REQ_CANCELLED and return BLK_EH_HANDLED as the RESETTING case,
nvme_reset_work will hang forever, because no one could complete the entered requests.

if we invoke nvme_reset_ctrl after modify the state machine to be able to change to RESETTING
to RECONNECTING and queue reset_work, we still cannot move things forward, because the reset_work
is being executed.

if we use nvme_wait_freeze_timeout in nvme_reset_work, unfreeze and return if expires. But the 
timeout value is tricky..


And actually, one of the possible solution to fix this cleanly is blk_set_preempt_only.
It is a lightweight way to gate the new bios out of generic_make_request.

Looking forward your advice on this.
And many thanks for your precious time on this.

Sincerely
Thanks
Jianchao

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH V2 0/6]nvme-pci: fixes on nvme_timeout and nvme_dev_disable
@ 2018-02-09  1:50       ` jianchao.wang
  0 siblings, 0 replies; 28+ messages in thread
From: jianchao.wang @ 2018-02-09  1:50 UTC (permalink / raw)


Hi Keith and Sagi

Many thanks for your kindly response.
That's really appreciated.

On 02/09/2018 01:56 AM, Keith Busch wrote:
> On Thu, Feb 08, 2018@05:56:49PM +0200, Sagi Grimberg wrote:
>> Given the discussion on this set, you plan to respin again
>> for 4.16?
> 
> With the exception of maybe patch 1, this needs more consideration than
> I'd feel okay with for the 4.16 release.
> 
Currently, one of the block is the nvme_wait_freeze in nvme_reset_work.
This cause some issues when I test this patchset yesterday.
As I posted on the V1 patchset mail thread:

if we set NVME_REQ_CANCELLED and return BLK_EH_HANDLED as the RESETTING case,
nvme_reset_work will hang forever, because no one could complete the entered requests.

if we invoke nvme_reset_ctrl after modify the state machine to be able to change to RESETTING
to RECONNECTING and queue reset_work, we still cannot move things forward, because the reset_work
is being executed.

if we use nvme_wait_freeze_timeout in nvme_reset_work, unfreeze and return if expires. But the 
timeout value is tricky..


And actually, one of the possible solution to fix this cleanly is blk_set_preempt_only.
It is a lightweight way to gate the new bios out of generic_make_request.

Looking forward your advice on this.
And many thanks for your precious time on this.

Sincerely
Thanks
Jianchao

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V2 0/6]nvme-pci: fixes on nvme_timeout and nvme_dev_disable
  2018-02-09  1:50       ` jianchao.wang
@ 2018-02-09 17:12         ` Keith Busch
  -1 siblings, 0 replies; 28+ messages in thread
From: Keith Busch @ 2018-02-09 17:12 UTC (permalink / raw)
  To: jianchao.wang; +Cc: Sagi Grimberg, axboe, hch, linux-nvme, linux-kernel

On Fri, Feb 09, 2018 at 09:50:58AM +0800, jianchao.wang wrote:
> 
> if we set NVME_REQ_CANCELLED and return BLK_EH_HANDLED as the RESETTING case,
> nvme_reset_work will hang forever, because no one could complete the entered requests.

Except it's no longer in the "RESETTING" case since you added the
"CONNECTING" state, so that's already broken for other reasons...

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH V2 0/6]nvme-pci: fixes on nvme_timeout and nvme_dev_disable
@ 2018-02-09 17:12         ` Keith Busch
  0 siblings, 0 replies; 28+ messages in thread
From: Keith Busch @ 2018-02-09 17:12 UTC (permalink / raw)


On Fri, Feb 09, 2018@09:50:58AM +0800, jianchao.wang wrote:
> 
> if we set NVME_REQ_CANCELLED and return BLK_EH_HANDLED as the RESETTING case,
> nvme_reset_work will hang forever, because no one could complete the entered requests.

Except it's no longer in the "RESETTING" case since you added the
"CONNECTING" state, so that's already broken for other reasons...

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V2 0/6]nvme-pci: fixes on nvme_timeout and nvme_dev_disable
  2018-02-09 17:12         ` Keith Busch
@ 2018-02-10  2:32           ` jianchao.wang
  -1 siblings, 0 replies; 28+ messages in thread
From: jianchao.wang @ 2018-02-10  2:32 UTC (permalink / raw)
  To: Keith Busch; +Cc: axboe, linux-kernel, Sagi Grimberg, linux-nvme, hch

Hi Keith

Thanks for your kindly response here.
That's really appreciated.

On 02/10/2018 01:12 AM, Keith Busch wrote:
> On Fri, Feb 09, 2018 at 09:50:58AM +0800, jianchao.wang wrote:
>>
>> if we set NVME_REQ_CANCELLED and return BLK_EH_HANDLED as the RESETTING case,
>> nvme_reset_work will hang forever, because no one could complete the entered requests.
> 
> Except it's no longer in the "RESETTING" case since you added the
> "CONNECTING" state, so that's already broken for other reasons...
> 

Yes, but as your patch, we have to fail the IOs and even kill the controller.
In fact, up to nvme_wait_freeze in nvme_reset_work, the RECONNECTING state has been completed.
We even could say it is in LIVE state. Maybe we should recover the controller again instead
of fail the IOs and kill the controller.

On the other hand, can you share with me why we cannot use blk_set_preempt_only to replace
blk_freeze_queue ? we just want to gate the new bios out of generic_make_request and we 
needn't use the preempt requests.

Looking forward your advice and directive.

Thanks
Jianchao


> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.infradead.org_mailman_listinfo_linux-2Dnvme&d=DwICAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=7WdAxUBeiTUTCy8v-7zXyr4qk7sx26ATvfo6QSTvZyQ&m=UqKQMB3A2ppfm2sN7PyisX0xTtXKsHlTBwjsS18qVx8&s=A2VMSm9IjQQXxM7foB6VUiRHLs-nIREF2_kMstwxlgw&e=
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH V2 0/6]nvme-pci: fixes on nvme_timeout and nvme_dev_disable
@ 2018-02-10  2:32           ` jianchao.wang
  0 siblings, 0 replies; 28+ messages in thread
From: jianchao.wang @ 2018-02-10  2:32 UTC (permalink / raw)


Hi Keith

Thanks for your kindly response here.
That's really appreciated.

On 02/10/2018 01:12 AM, Keith Busch wrote:
> On Fri, Feb 09, 2018@09:50:58AM +0800, jianchao.wang wrote:
>>
>> if we set NVME_REQ_CANCELLED and return BLK_EH_HANDLED as the RESETTING case,
>> nvme_reset_work will hang forever, because no one could complete the entered requests.
> 
> Except it's no longer in the "RESETTING" case since you added the
> "CONNECTING" state, so that's already broken for other reasons...
> 

Yes, but as your patch, we have to fail the IOs and even kill the controller.
In fact, up to nvme_wait_freeze in nvme_reset_work, the RECONNECTING state has been completed.
We even could say it is in LIVE state. Maybe we should recover the controller again instead
of fail the IOs and kill the controller.

On the other hand, can you share with me why we cannot use blk_set_preempt_only to replace
blk_freeze_queue ? we just want to gate the new bios out of generic_make_request and we 
needn't use the preempt requests.

Looking forward your advice and directive.

Thanks
Jianchao


> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme at lists.infradead.org
> https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.infradead.org_mailman_listinfo_linux-2Dnvme&d=DwICAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=7WdAxUBeiTUTCy8v-7zXyr4qk7sx26ATvfo6QSTvZyQ&m=UqKQMB3A2ppfm2sN7PyisX0xTtXKsHlTBwjsS18qVx8&s=A2VMSm9IjQQXxM7foB6VUiRHLs-nIREF2_kMstwxlgw&e=
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V2 0/6]nvme-pci: fixes on nvme_timeout and nvme_dev_disable
  2018-02-10  2:32           ` jianchao.wang
@ 2018-02-10  2:59             ` jianchao.wang
  -1 siblings, 0 replies; 28+ messages in thread
From: jianchao.wang @ 2018-02-10  2:59 UTC (permalink / raw)
  To: Keith Busch; +Cc: axboe, hch, linux-kernel, linux-nvme, Sagi Grimberg

Hi Keith

On 02/10/2018 10:32 AM, jianchao.wang wrote:
> Hi Keith
> 
> Thanks for your kindly response here.
> That's really appreciated.
> 
> On 02/10/2018 01:12 AM, Keith Busch wrote:
>> On Fri, Feb 09, 2018 at 09:50:58AM +0800, jianchao.wang wrote:
>>>
>>> if we set NVME_REQ_CANCELLED and return BLK_EH_HANDLED as the RESETTING case,
>>> nvme_reset_work will hang forever, because no one could complete the entered requests.
>>
>> Except it's no longer in the "RESETTING" case since you added the
>> "CONNECTING" state, so that's already broken for other reasons...
>>
> 
> Yes, but as your patch, we have to fail the IOs and even kill the controller.
> In fact, up to nvme_wait_freeze in nvme_reset_work, the RECONNECTING state has been completed.
> We even could say it is in LIVE state. Maybe we should recover the controller again instead
> of fail the IOs and kill the controller.
> 
> On the other hand, can you share with me why we cannot use blk_set_preempt_only to replace
> blk_freeze_queue ? we just want to gate the new bios out of generic_make_request and we 
> needn't use the preempt requests.
> 
> Looking forward your advice and directive.

Avoid wait_freeze in nvme_reset_work should be a better way to fix this defect.

> 
> Thanks
> Jianchao

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH V2 0/6]nvme-pci: fixes on nvme_timeout and nvme_dev_disable
@ 2018-02-10  2:59             ` jianchao.wang
  0 siblings, 0 replies; 28+ messages in thread
From: jianchao.wang @ 2018-02-10  2:59 UTC (permalink / raw)


Hi Keith

On 02/10/2018 10:32 AM, jianchao.wang wrote:
> Hi Keith
> 
> Thanks for your kindly response here.
> That's really appreciated.
> 
> On 02/10/2018 01:12 AM, Keith Busch wrote:
>> On Fri, Feb 09, 2018@09:50:58AM +0800, jianchao.wang wrote:
>>>
>>> if we set NVME_REQ_CANCELLED and return BLK_EH_HANDLED as the RESETTING case,
>>> nvme_reset_work will hang forever, because no one could complete the entered requests.
>>
>> Except it's no longer in the "RESETTING" case since you added the
>> "CONNECTING" state, so that's already broken for other reasons...
>>
> 
> Yes, but as your patch, we have to fail the IOs and even kill the controller.
> In fact, up to nvme_wait_freeze in nvme_reset_work, the RECONNECTING state has been completed.
> We even could say it is in LIVE state. Maybe we should recover the controller again instead
> of fail the IOs and kill the controller.
> 
> On the other hand, can you share with me why we cannot use blk_set_preempt_only to replace
> blk_freeze_queue ? we just want to gate the new bios out of generic_make_request and we 
> needn't use the preempt requests.
> 
> Looking forward your advice and directive.

Avoid wait_freeze in nvme_reset_work should be a better way to fix this defect.

> 
> Thanks
> Jianchao

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V2 0/6]nvme-pci: fixes on nvme_timeout and nvme_dev_disable
  2018-02-10  2:32           ` jianchao.wang
@ 2018-02-11  3:06             ` jianchao.wang
  -1 siblings, 0 replies; 28+ messages in thread
From: jianchao.wang @ 2018-02-11  3:06 UTC (permalink / raw)
  To: Keith Busch; +Cc: axboe, hch, linux-kernel, linux-nvme, Sagi Grimberg



On 02/10/2018 10:32 AM, jianchao.wang wrote:
> Hi Keith
> 
> Thanks for your kindly response here.
> That's really appreciated.
> 
> On 02/10/2018 01:12 AM, Keith Busch wrote:
>> On Fri, Feb 09, 2018 at 09:50:58AM +0800, jianchao.wang wrote:
>>>
>>> if we set NVME_REQ_CANCELLED and return BLK_EH_HANDLED as the RESETTING case,
>>> nvme_reset_work will hang forever, because no one could complete the entered requests.
>>
>> Except it's no longer in the "RESETTING" case since you added the
>> "CONNECTING" state, so that's already broken for other reasons...
>>
> 
> Yes, but as your patch, we have to fail the IOs and even kill the controller.
> In fact, up to nvme_wait_freeze in nvme_reset_work, the RECONNECTING state has been completed.
> We even could say it is in LIVE state. Maybe we should recover the controller again instead
> of fail the IOs and kill the controller.
> 
> On the other hand, can you share with me why we cannot use blk_set_preempt_only to replace
> blk_freeze_queue ? we just want to gate the new bios out of generic_make_request and we 
> needn't use the preempt requests.

Looks like blk_freeze_queue in blk_mq_update_nr_hw_queues cannot be worked around.
Still face IOs in nvme_reset_work. T_T

> 
> Looking forward your advice and directive.
> 
> Thanks
> Jianchao
> 
> 
>> _______________________________________________
>> Linux-nvme mailing list
>> Linux-nvme@lists.infradead.org
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.infradead.org_mailman_listinfo_linux-2Dnvme&d=DwICAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=7WdAxUBeiTUTCy8v-7zXyr4qk7sx26ATvfo6QSTvZyQ&m=UqKQMB3A2ppfm2sN7PyisX0xTtXKsHlTBwjsS18qVx8&s=A2VMSm9IjQQXxM7foB6VUiRHLs-nIREF2_kMstwxlgw&e=
>>
> 
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.infradead.org_mailman_listinfo_linux-2Dnvme&d=DwICAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=7WdAxUBeiTUTCy8v-7zXyr4qk7sx26ATvfo6QSTvZyQ&m=cstI6JeNHGVX4OV1UdHdoemUr75aCUAjUrPe23Dhv8U&s=MYTmBPYS5tW4vC23iMEKINprtCxnRHe5AXrbST91XpY&e=
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH V2 0/6]nvme-pci: fixes on nvme_timeout and nvme_dev_disable
@ 2018-02-11  3:06             ` jianchao.wang
  0 siblings, 0 replies; 28+ messages in thread
From: jianchao.wang @ 2018-02-11  3:06 UTC (permalink / raw)




On 02/10/2018 10:32 AM, jianchao.wang wrote:
> Hi Keith
> 
> Thanks for your kindly response here.
> That's really appreciated.
> 
> On 02/10/2018 01:12 AM, Keith Busch wrote:
>> On Fri, Feb 09, 2018@09:50:58AM +0800, jianchao.wang wrote:
>>>
>>> if we set NVME_REQ_CANCELLED and return BLK_EH_HANDLED as the RESETTING case,
>>> nvme_reset_work will hang forever, because no one could complete the entered requests.
>>
>> Except it's no longer in the "RESETTING" case since you added the
>> "CONNECTING" state, so that's already broken for other reasons...
>>
> 
> Yes, but as your patch, we have to fail the IOs and even kill the controller.
> In fact, up to nvme_wait_freeze in nvme_reset_work, the RECONNECTING state has been completed.
> We even could say it is in LIVE state. Maybe we should recover the controller again instead
> of fail the IOs and kill the controller.
> 
> On the other hand, can you share with me why we cannot use blk_set_preempt_only to replace
> blk_freeze_queue ? we just want to gate the new bios out of generic_make_request and we 
> needn't use the preempt requests.

Looks like blk_freeze_queue in blk_mq_update_nr_hw_queues cannot be worked around.
Still face IOs in nvme_reset_work. T_T

> 
> Looking forward your advice and directive.
> 
> Thanks
> Jianchao
> 
> 
>> _______________________________________________
>> Linux-nvme mailing list
>> Linux-nvme at lists.infradead.org
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.infradead.org_mailman_listinfo_linux-2Dnvme&d=DwICAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=7WdAxUBeiTUTCy8v-7zXyr4qk7sx26ATvfo6QSTvZyQ&m=UqKQMB3A2ppfm2sN7PyisX0xTtXKsHlTBwjsS18qVx8&s=A2VMSm9IjQQXxM7foB6VUiRHLs-nIREF2_kMstwxlgw&e=
>>
> 
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme at lists.infradead.org
> https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.infradead.org_mailman_listinfo_linux-2Dnvme&d=DwICAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=7WdAxUBeiTUTCy8v-7zXyr4qk7sx26ATvfo6QSTvZyQ&m=cstI6JeNHGVX4OV1UdHdoemUr75aCUAjUrPe23Dhv8U&s=MYTmBPYS5tW4vC23iMEKINprtCxnRHe5AXrbST91XpY&e=
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2018-02-11  3:15 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-05  9:20 [PATCH V2 0/6]nvme-pci: fixes on nvme_timeout and nvme_dev_disable Jianchao Wang
2018-02-05  9:20 ` Jianchao Wang
2018-02-05  9:20 ` [PATCH V2 1/6] nvme-pci: quiesce IO queues prior to disabling device HMB accesses Jianchao Wang
2018-02-05  9:20   ` Jianchao Wang
2018-02-05  9:20 ` [PATCH V2 2/6] nvme-pci: fix the freeze and quiesce for shutdown and reset case Jianchao Wang
2018-02-05  9:20   ` Jianchao Wang
2018-02-05  9:20 ` [PATCH V2 3/6] blk-mq: make blk_mq_rq_update_aborted_gstate a external interface Jianchao Wang
2018-02-05  9:20   ` Jianchao Wang
2018-02-05  9:20 ` [PATCH V2 4/6] nvme-pci: suspend queues based on online_queues Jianchao Wang
2018-02-05  9:20   ` Jianchao Wang
2018-02-05  9:20 ` [PATCH V2 5/6] nvme-pci: break up nvme_timeout and nvme_dev_disable Jianchao Wang
2018-02-05  9:20   ` Jianchao Wang
2018-02-05  9:20 ` [PATCH V2 6/6] nvme-pci: discard wait timeout when delete cq/sq Jianchao Wang
2018-02-05  9:20   ` Jianchao Wang
2018-02-08 15:56 ` [PATCH V2 0/6]nvme-pci: fixes on nvme_timeout and nvme_dev_disable Sagi Grimberg
2018-02-08 15:56   ` Sagi Grimberg
2018-02-08 17:56   ` Keith Busch
2018-02-08 17:56     ` Keith Busch
2018-02-09  1:50     ` jianchao.wang
2018-02-09  1:50       ` jianchao.wang
2018-02-09 17:12       ` Keith Busch
2018-02-09 17:12         ` Keith Busch
2018-02-10  2:32         ` jianchao.wang
2018-02-10  2:32           ` jianchao.wang
2018-02-10  2:59           ` jianchao.wang
2018-02-10  2:59             ` jianchao.wang
2018-02-11  3:06           ` jianchao.wang
2018-02-11  3:06             ` jianchao.wang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.