linux-block.vger.kernel.org archive mirror
* [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
@ 2018-05-11 12:29 Ming Lei
  2018-05-11 12:29 ` [PATCH V5 1/9] block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout() Ming Lei
                   ` (10 more replies)
  0 siblings, 11 replies; 32+ messages in thread
From: Ming Lei @ 2018-05-11 12:29 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, James Smart, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

Hi,

The 1st patch introduces blk_quiesce_timeout() and blk_unquiesce_timeout()
for NVMe, and also fixes blk_sync_queue().

The 2nd patch covers timeout for admin commands used to recover the
controller, so as to avoid a possible deadlock.

The 3rd and 4th patches avoid waiting for freeze on queues which aren't frozen.

The last 5 patches fix several races wrt. the NVMe timeout handler, and
finally make blktests block/011 pass. Meanwhile the NVMe PCI timeout
mechanism becomes much more robust than before.

gitweb:
	https://github.com/ming1/linux/commits/v4.17-rc-nvme-timeout.V5

V5:
	- avoid removing the controller in case of reset failure in inner EHs
	- make sure that nvme_unfreeze and nvme_start_freeze are paired

V4:
	- fix nvme_init_set_host_mem_cmd()
	- use nested EH model, and run both nvme_dev_disable() and
	resetting in one single context

V3:
	- fix one new race related to freezing in patch 4; nvme_reset_work()
	may hang forever without this patch
	- rewrite the last 3 patches, and avoid breaking nvme_reset_ctrl*()

V2:
	- fix draining of the timeout work, so there is no need to change the
	return value of .timeout()
	- fix race between nvme_start_freeze() and nvme_unfreeze()
	- cover timeout for admin commands running in EH

Ming Lei (9):
  block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout()
  nvme: pci: cover timeout for admin commands running in EH
  nvme: pci: only wait freezing if queue is frozen
  nvme: pci: freeze queue in nvme_dev_disable() in case of error
    recovery
  nvme: pci: prepare for supporting error recovery from resetting
    context
  nvme: pci: move error handling out of nvme_reset_dev()
  nvme: pci: don't unfreeze queue until controller state updating
    succeeds
  nvme: core: introduce nvme_force_change_ctrl_state()
  nvme: pci: support nested EH

 block/blk-core.c         |  21 ++-
 block/blk-mq.c           |   9 ++
 block/blk-timeout.c      |   5 +-
 drivers/nvme/host/core.c |  57 +++++++
 drivers/nvme/host/nvme.h |   7 +
 drivers/nvme/host/pci.c  | 402 +++++++++++++++++++++++++++++++++++++++++------
 include/linux/blkdev.h   |  13 ++
 7 files changed, 462 insertions(+), 52 deletions(-)

Cc: James Smart <james.smart@broadcom.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
-- 
2.9.5


* [PATCH V5 1/9] block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout()
  2018-05-11 12:29 [PATCH V5 0/9] nvme: pci: fix & improve timeout handling Ming Lei
@ 2018-05-11 12:29 ` Ming Lei
  2018-05-11 12:29 ` [PATCH V5 2/9] nvme: pci: cover timeout for admin commands running in EH Ming Lei
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Ming Lei @ 2018-05-11 12:29 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, James Smart, Bart Van Assche,
	Jianchao Wang, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	Laurence Oberman

Turns out the current way can't drain timeout handling completely, because
mod_timer() can be triggered from the timeout work function, which may be
running right inside the synced timeout work:

        del_timer_sync(&q->timeout);
        cancel_work_sync(&q->timeout_work);

This patch introduces a 'timeout_off' flag to fix this issue; this simple
approach turns out to work.

Also, blk_quiesce_timeout() and blk_unquiesce_timeout() are introduced for
draining timeout handling, which is needed by NVMe.
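
A minimal usage sketch of the new pair (the actual NVMe hook-up comes in
the nested EH patch later in this series); here 'q' stands for the
request_queue whose timeout handling has to be drained:

	/* stop the timeout timer and drain any running timeout work */
	blk_quiesce_timeout(q);

	/* ... cancel or complete all in-flight requests here ... */

	/* re-arm q->timeout and allow timeouts to fire again */
	blk_unquiesce_timeout(q);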

Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c       | 21 +++++++++++++++++++--
 block/blk-mq.c         |  9 +++++++++
 block/blk-timeout.c    |  5 ++++-
 include/linux/blkdev.h | 13 +++++++++++++
 4 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 85909b431eb0..c277f1023703 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -392,6 +392,22 @@ void blk_stop_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_stop_queue);
 
+void blk_unquiesce_timeout(struct request_queue *q)
+{
+	blk_mark_timeout_quiesce(q, false);
+	mod_timer(&q->timeout, jiffies + q->rq_timeout);
+}
+EXPORT_SYMBOL(blk_unquiesce_timeout);
+
+void blk_quiesce_timeout(struct request_queue *q)
+{
+	blk_mark_timeout_quiesce(q, true);
+
+	del_timer_sync(&q->timeout);
+	cancel_work_sync(&q->timeout_work);
+}
+EXPORT_SYMBOL(blk_quiesce_timeout);
+
 /**
  * blk_sync_queue - cancel any pending callbacks on a queue
  * @q: the queue
@@ -412,8 +428,7 @@ EXPORT_SYMBOL(blk_stop_queue);
  */
 void blk_sync_queue(struct request_queue *q)
 {
-	del_timer_sync(&q->timeout);
-	cancel_work_sync(&q->timeout_work);
+	blk_quiesce_timeout(q);
 
 	if (q->mq_ops) {
 		struct blk_mq_hw_ctx *hctx;
@@ -425,6 +440,8 @@ void blk_sync_queue(struct request_queue *q)
 	} else {
 		cancel_delayed_work_sync(&q->delay_work);
 	}
+
+	blk_mark_timeout_quiesce(q, false);
 }
 EXPORT_SYMBOL(blk_sync_queue);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9ce9cac16c3f..173f1723e48f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -917,6 +917,15 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	};
 	struct blk_mq_hw_ctx *hctx;
 	int i;
+	bool timeout_off;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	timeout_off = q->timeout_off;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	if (timeout_off)
+		return;
 
 	/* A deadlock might occur if a request is stuck requiring a
 	 * timeout at the same time a queue freeze is waiting
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 652d4d4d3e97..ffd0b609091e 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -136,12 +136,15 @@ void blk_timeout_work(struct work_struct *work)
 
 	spin_lock_irqsave(q->queue_lock, flags);
 
+	if (q->timeout_off)
+		goto exit;
+
 	list_for_each_entry_safe(rq, tmp, &q->timeout_list, timeout_list)
 		blk_rq_check_expired(rq, &next, &next_set);
 
 	if (next_set)
 		mod_timer(&q->timeout, round_jiffies_up(next));
-
+exit:
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5c4eee043191..a2cc4aaecf50 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -584,6 +584,7 @@ struct request_queue {
 	struct timer_list	timeout;
 	struct work_struct	timeout_work;
 	struct list_head	timeout_list;
+	bool			timeout_off;
 
 	struct list_head	icq_list;
 #ifdef CONFIG_BLK_CGROUP
@@ -1017,6 +1018,18 @@ extern void blk_execute_rq(struct request_queue *, struct gendisk *,
 extern void blk_execute_rq_nowait(struct request_queue *, struct gendisk *,
 				  struct request *, int, rq_end_io_fn *);
 
+static inline void blk_mark_timeout_quiesce(struct request_queue *q, bool quiesce)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	q->timeout_off = quiesce;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+extern void blk_quiesce_timeout(struct request_queue *q);
+extern void blk_unquiesce_timeout(struct request_queue *q);
+
 int blk_status_to_errno(blk_status_t status);
 blk_status_t errno_to_blk_status(int errno);
 
-- 
2.9.5


* [PATCH V5 2/9] nvme: pci: cover timeout for admin commands running in EH
  2018-05-11 12:29 [PATCH V5 0/9] nvme: pci: fix & improve timeout handling Ming Lei
  2018-05-11 12:29 ` [PATCH V5 1/9] block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout() Ming Lei
@ 2018-05-11 12:29 ` Ming Lei
  2018-05-11 12:29 ` [PATCH V5 3/9] nvme: pci: only wait freezing if queue is frozen Ming Lei
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Ming Lei @ 2018-05-11 12:29 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, James Smart, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

When admin commands are used in EH for recovering the controller, we have
to cover their timeout ourselves and can't depend on the block layer's
timeout handling, since a deadlock may be caused when these commands are
timed out by the block layer again.

Cc: James Smart <james.smart@broadcom.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 81 ++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 70 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fbc71fac6f1e..ff09b1c760ea 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1733,21 +1733,28 @@ static inline void nvme_release_cmb(struct nvme_dev *dev)
 	}
 }
 
-static int nvme_set_host_mem(struct nvme_dev *dev, u32 bits)
+static void nvme_init_set_host_mem_cmd(struct nvme_dev *dev,
+		struct nvme_command *c, u32 bits)
 {
 	u64 dma_addr = dev->host_mem_descs_dma;
+
+	memset(c, 0, sizeof(*c));
+	c->features.opcode	= nvme_admin_set_features;
+	c->features.fid		= cpu_to_le32(NVME_FEAT_HOST_MEM_BUF);
+	c->features.dword11	= cpu_to_le32(bits);
+	c->features.dword12	= cpu_to_le32(dev->host_mem_size >>
+					      ilog2(dev->ctrl.page_size));
+	c->features.dword13	= cpu_to_le32(lower_32_bits(dma_addr));
+	c->features.dword14	= cpu_to_le32(upper_32_bits(dma_addr));
+	c->features.dword15	= cpu_to_le32(dev->nr_host_mem_descs);
+}
+
+static int nvme_set_host_mem(struct nvme_dev *dev, u32 bits)
+{
 	struct nvme_command c;
 	int ret;
 
-	memset(&c, 0, sizeof(c));
-	c.features.opcode	= nvme_admin_set_features;
-	c.features.fid		= cpu_to_le32(NVME_FEAT_HOST_MEM_BUF);
-	c.features.dword11	= cpu_to_le32(bits);
-	c.features.dword12	= cpu_to_le32(dev->host_mem_size >>
-					      ilog2(dev->ctrl.page_size));
-	c.features.dword13	= cpu_to_le32(lower_32_bits(dma_addr));
-	c.features.dword14	= cpu_to_le32(upper_32_bits(dma_addr));
-	c.features.dword15	= cpu_to_le32(dev->nr_host_mem_descs);
+	nvme_init_set_host_mem_cmd(dev, &c, bits);
 
 	ret = nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
 	if (ret) {
@@ -1758,6 +1765,58 @@ static int nvme_set_host_mem(struct nvme_dev *dev, u32 bits)
 	return ret;
 }
 
+static void nvme_set_host_mem_end_io(struct request *rq, blk_status_t sts)
+{
+	struct completion *waiting = rq->end_io_data;
+
+	rq->end_io_data = NULL;
+
+	/*
+	 * complete last, if this is a stack request the process (and thus
+	 * the rq pointer) could be invalid right after this complete()
+	 */
+	complete(waiting);
+}
+
+/*
+ * This function can only be used inside nvme_dev_disable() when timeout
+ * may not work, so this function has to cover the timeout by itself.
+ *
+ * When wait_for_completion_io_timeout() returns 0 and timeout happens,
+ * this request will be completed after controller is shutdown.
+ */
+static int nvme_set_host_mem_timeout(struct nvme_dev *dev, u32 bits)
+{
+	DECLARE_COMPLETION_ONSTACK(wait);
+	struct nvme_command c;
+	struct request_queue *q = dev->ctrl.admin_q;
+	struct request *req;
+	int ret;
+
+	nvme_init_set_host_mem_cmd(dev, &c, bits);
+
+	req = nvme_alloc_request(q, &c, 0, NVME_QID_ANY);
+	if (IS_ERR(req))
+		return PTR_ERR(req);
+
+	req->timeout = ADMIN_TIMEOUT;
+	req->end_io_data = &wait;
+
+	blk_execute_rq_nowait(q, NULL, req, false,
+			nvme_set_host_mem_end_io);
+	ret = wait_for_completion_io_timeout(&wait, ADMIN_TIMEOUT);
+	if (ret > 0) {
+		if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
+			ret = -EINTR;
+		else
+			ret = nvme_req(req)->status;
+		blk_mq_free_request(req);
+	} else
+		ret = -EINTR;
+
+	return ret;
+}
+
 static void nvme_free_host_mem(struct nvme_dev *dev)
 {
 	int i;
@@ -2216,7 +2275,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 		 * but I'd rather be safe than sorry..
 		 */
 		if (dev->host_mem_descs)
-			nvme_set_host_mem(dev, 0);
+			nvme_set_host_mem_timeout(dev, 0);
 		nvme_disable_io_queues(dev);
 		nvme_disable_admin_queue(dev, shutdown);
 	}
-- 
2.9.5


* [PATCH V5 3/9] nvme: pci: only wait freezing if queue is frozen
  2018-05-11 12:29 [PATCH V5 0/9] nvme: pci: fix & improve timeout handling Ming Lei
  2018-05-11 12:29 ` [PATCH V5 1/9] block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout() Ming Lei
  2018-05-11 12:29 ` [PATCH V5 2/9] nvme: pci: cover timeout for admin commands running in EH Ming Lei
@ 2018-05-11 12:29 ` Ming Lei
  2018-05-11 12:29 ` [PATCH V5 4/9] nvme: pci: freeze queue in nvme_dev_disable() in case of error recovery Ming Lei
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Ming Lei @ 2018-05-11 12:29 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, James Smart, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

In nvme_dev_disable(), called while shutting down the controller,
nvme_wait_freeze_timeout() may be run on a controller that isn't
frozen yet, so add a check to avoid that case.

Cc: James Smart <james.smart@broadcom.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index ff09b1c760ea..57bd7bebd1e5 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2244,14 +2244,17 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	int i;
 	bool dead = true;
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
+	bool frozen = false;
 
 	mutex_lock(&dev->shutdown_lock);
 	if (pci_is_enabled(pdev)) {
 		u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
 		if (dev->ctrl.state == NVME_CTRL_LIVE ||
-		    dev->ctrl.state == NVME_CTRL_RESETTING)
+		    dev->ctrl.state == NVME_CTRL_RESETTING) {
 			nvme_start_freeze(&dev->ctrl);
+			frozen = true;
+		}
 		dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
 			pdev->error_state  != pci_channel_io_normal);
 	}
@@ -2261,7 +2264,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	 * doing a safe shutdown.
 	 */
 	if (!dead) {
-		if (shutdown)
+		if (shutdown && frozen)
 			nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
 	}
 
-- 
2.9.5


* [PATCH V5 4/9] nvme: pci: freeze queue in nvme_dev_disable() in case of error recovery
  2018-05-11 12:29 [PATCH V5 0/9] nvme: pci: fix & improve timeout handling Ming Lei
                   ` (2 preceding siblings ...)
  2018-05-11 12:29 ` [PATCH V5 3/9] nvme: pci: only wait freezing if queue is frozen Ming Lei
@ 2018-05-11 12:29 ` Ming Lei
  2018-05-11 12:29 ` [PATCH V5 5/9] nvme: pci: prepare for supporting error recovery from resetting context Ming Lei
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Ming Lei @ 2018-05-11 12:29 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, James Smart, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

When nvme_dev_disable() is used for error recovery, we should always
freeze queues before shutting down the controller:

- the reset handler assumes queues are frozen, and will wait_freeze &
unfreeze them explicitly; if queues aren't frozen during nvme_dev_disable(),
the reset handler may wait forever even though there aren't any requests
allocated.

- this way may avoid cancelling lots of requests during error recovery

This patch introduces a 'freeze_queue' parameter to fix this issue.
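
Roughly, the call sites end up with the following semantics after this
patch (a summary of the hunks below, not a literal excerpt):

	nvme_dev_disable(dev, false, true);	/* error recovery: freeze queues before disabling */
	nvme_dev_disable(dev, false, false);	/* disable without freezing (queues already frozen or controller being removed) */
	nvme_dev_disable(dev, true, false);	/* shutdown: inline freeze + wait_freeze_timeout */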

Cc: James Smart <james.smart@broadcom.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 47 ++++++++++++++++++++++++++++++++---------------
 1 file changed, 32 insertions(+), 15 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 57bd7bebd1e5..1fafe5d01355 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -69,7 +69,8 @@ struct nvme_dev;
 struct nvme_queue;
 
 static void nvme_process_cq(struct nvme_queue *nvmeq);
-static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
+static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
+		freeze_queue);
 
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
@@ -1197,7 +1198,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	 */
 	if (nvme_should_reset(dev, csts)) {
 		nvme_warn_reset(dev, csts);
-		nvme_dev_disable(dev, false);
+		nvme_dev_disable(dev, false, true);
 		nvme_reset_ctrl(&dev->ctrl);
 		return BLK_EH_HANDLED;
 	}
@@ -1224,7 +1225,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		dev_warn(dev->ctrl.device,
 			 "I/O %d QID %d timeout, disable controller\n",
 			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false);
+		nvme_dev_disable(dev, false, false);
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
 		return BLK_EH_HANDLED;
 	default:
@@ -1240,7 +1241,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		dev_warn(dev->ctrl.device,
 			 "I/O %d QID %d timeout, reset controller\n",
 			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false);
+		nvme_dev_disable(dev, false, true);
 		nvme_reset_ctrl(&dev->ctrl);
 
 		/*
@@ -2239,19 +2240,35 @@ static void nvme_pci_disable(struct nvme_dev *dev)
 	}
 }
 
-static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
+/*
+ * Resetting often follows nvme_dev_disable(), so queues need to be frozen
+ * before resetting.
+ */
+static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
+		freeze_queue)
 {
 	int i;
 	bool dead = true;
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 	bool frozen = false;
 
+	/*
+	 * 'freeze_queue' is only valid for non-shutdown, and we do
+	 * inline freeze & wait_freeze_timeout for shutdown just for
+	 * completing as many as possible requests before shutdown
+	 */
+	if (shutdown)
+		freeze_queue = false;
+
+	if (freeze_queue)
+		nvme_start_freeze(&dev->ctrl);
+
 	mutex_lock(&dev->shutdown_lock);
 	if (pci_is_enabled(pdev)) {
 		u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
-		if (dev->ctrl.state == NVME_CTRL_LIVE ||
-		    dev->ctrl.state == NVME_CTRL_RESETTING) {
+		if (shutdown && (dev->ctrl.state == NVME_CTRL_LIVE ||
+		    dev->ctrl.state == NVME_CTRL_RESETTING)) {
 			nvme_start_freeze(&dev->ctrl);
 			frozen = true;
 		}
@@ -2343,7 +2360,7 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
 	dev_warn(dev->ctrl.device, "Removing after probe failure status: %d\n", status);
 
 	nvme_get_ctrl(&dev->ctrl);
-	nvme_dev_disable(dev, false);
+	nvme_dev_disable(dev, false, false);
 	if (!queue_work(nvme_wq, &dev->remove_work))
 		nvme_put_ctrl(&dev->ctrl);
 }
@@ -2364,7 +2381,7 @@ static void nvme_reset_work(struct work_struct *work)
 	 * moving on.
 	 */
 	if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
-		nvme_dev_disable(dev, false);
+		nvme_dev_disable(dev, false, false);
 
 	/*
 	 * Introduce CONNECTING state from nvme-fc/rdma transports to mark the
@@ -2613,7 +2630,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 static void nvme_reset_prepare(struct pci_dev *pdev)
 {
 	struct nvme_dev *dev = pci_get_drvdata(pdev);
-	nvme_dev_disable(dev, false);
+	nvme_dev_disable(dev, false, true);
 }
 
 static void nvme_reset_done(struct pci_dev *pdev)
@@ -2625,7 +2642,7 @@ static void nvme_reset_done(struct pci_dev *pdev)
 static void nvme_shutdown(struct pci_dev *pdev)
 {
 	struct nvme_dev *dev = pci_get_drvdata(pdev);
-	nvme_dev_disable(dev, true);
+	nvme_dev_disable(dev, true, false);
 }
 
 /*
@@ -2644,13 +2661,13 @@ static void nvme_remove(struct pci_dev *pdev)
 
 	if (!pci_device_is_present(pdev)) {
 		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
-		nvme_dev_disable(dev, false);
+		nvme_dev_disable(dev, false, false);
 	}
 
 	flush_work(&dev->ctrl.reset_work);
 	nvme_stop_ctrl(&dev->ctrl);
 	nvme_remove_namespaces(&dev->ctrl);
-	nvme_dev_disable(dev, true);
+	nvme_dev_disable(dev, true, false);
 	nvme_free_host_mem(dev);
 	nvme_dev_remove_admin(dev);
 	nvme_free_queues(dev, 0);
@@ -2684,7 +2701,7 @@ static int nvme_suspend(struct device *dev)
 	struct pci_dev *pdev = to_pci_dev(dev);
 	struct nvme_dev *ndev = pci_get_drvdata(pdev);
 
-	nvme_dev_disable(ndev, true);
+	nvme_dev_disable(ndev, true, false);
 	return 0;
 }
 
@@ -2716,7 +2733,7 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
 	case pci_channel_io_frozen:
 		dev_warn(dev->ctrl.device,
 			"frozen state error detected, reset controller\n");
-		nvme_dev_disable(dev, false);
+		nvme_dev_disable(dev, false, true);
 		return PCI_ERS_RESULT_NEED_RESET;
 	case pci_channel_io_perm_failure:
 		dev_warn(dev->ctrl.device,
-- 
2.9.5


* [PATCH V5 5/9] nvme: pci: prepare for supporting error recovery from resetting context
  2018-05-11 12:29 [PATCH V5 0/9] nvme: pci: fix & improve timeout handling Ming Lei
                   ` (3 preceding siblings ...)
  2018-05-11 12:29 ` [PATCH V5 4/9] nvme: pci: freeze queue in nvme_dev_disable() in case of error recovery Ming Lei
@ 2018-05-11 12:29 ` Ming Lei
  2018-05-11 12:29 ` [PATCH V5 6/9] nvme: pci: move error handling out of nvme_reset_dev() Ming Lei
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Ming Lei @ 2018-05-11 12:29 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, James Smart, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

Either admin or normal IO in the reset context may time out because a
controller error happens. When this timeout happens, we may have to
start controller recovery again.

This patch introduces 'reset_lock' and holds it while running the reset,
so that nested reset can be supported in the following patches.
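
Note that the lock is dropped around nvme_wait_freeze() (see the hunk
below), presumably so that a nested EH, which also needs 'reset_lock',
can still recover the controller while this reset is waiting for the
freeze to complete:

	nvme_start_queues(&dev->ctrl);
	mutex_unlock(&dev->ctrl.reset_lock);

	nvme_wait_freeze(&dev->ctrl);

	mutex_lock(&dev->ctrl.reset_lock);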

Cc: James Smart <james.smart@broadcom.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/core.c |  2 ++
 drivers/nvme/host/nvme.h |  3 +++
 drivers/nvme/host/pci.c  | 20 +++++++++++++++++---
 3 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a3771c5729f5..adb1743e87f7 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3424,6 +3424,8 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	INIT_WORK(&ctrl->fw_act_work, nvme_fw_act_work);
 	INIT_WORK(&ctrl->delete_work, nvme_delete_ctrl_work);
 
+	mutex_init(&ctrl->reset_lock);
+
 	ret = ida_simple_get(&nvme_instance_ida, 0, 0, GFP_KERNEL);
 	if (ret < 0)
 		goto out;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 7ded7a51c430..021f7147f779 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -146,6 +146,9 @@ struct nvme_ctrl {
 	struct device ctrl_device;
 	struct device *device;	/* char device */
 	struct cdev cdev;
+
+	/* sync reset activities */
+	struct mutex reset_lock;
 	struct work_struct reset_work;
 	struct work_struct delete_work;
 
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 1fafe5d01355..a924246ffdb6 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2365,14 +2365,14 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
 		nvme_put_ctrl(&dev->ctrl);
 }
 
-static void nvme_reset_work(struct work_struct *work)
+static void nvme_reset_dev(struct nvme_dev *dev)
 {
-	struct nvme_dev *dev =
-		container_of(work, struct nvme_dev, ctrl.reset_work);
 	bool was_suspend = !!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL);
 	int result = -ENODEV;
 	enum nvme_ctrl_state new_state = NVME_CTRL_LIVE;
 
+	lockdep_assert_held(&dev->ctrl.reset_lock);
+
 	if (WARN_ON(dev->ctrl.state != NVME_CTRL_RESETTING))
 		goto out;
 
@@ -2448,7 +2448,11 @@ static void nvme_reset_work(struct work_struct *work)
 		new_state = NVME_CTRL_ADMIN_ONLY;
 	} else {
 		nvme_start_queues(&dev->ctrl);
+		mutex_unlock(&dev->ctrl.reset_lock);
+
 		nvme_wait_freeze(&dev->ctrl);
+
+		mutex_lock(&dev->ctrl.reset_lock);
 		/* hit this only when allocate tagset fails */
 		if (nvme_dev_add(dev))
 			new_state = NVME_CTRL_ADMIN_ONLY;
@@ -2472,6 +2476,16 @@ static void nvme_reset_work(struct work_struct *work)
 	nvme_remove_dead_ctrl(dev, result);
 }
 
+static void nvme_reset_work(struct work_struct *work)
+{
+	struct nvme_dev *dev =
+		container_of(work, struct nvme_dev, ctrl.reset_work);
+
+	mutex_lock(&dev->ctrl.reset_lock);
+	nvme_reset_dev(dev);
+	mutex_unlock(&dev->ctrl.reset_lock);
+}
+
 static void nvme_remove_dead_ctrl_work(struct work_struct *work)
 {
 	struct nvme_dev *dev = container_of(work, struct nvme_dev, remove_work);
-- 
2.9.5


* [PATCH V5 6/9] nvme: pci: move error handling out of nvme_reset_dev()
  2018-05-11 12:29 [PATCH V5 0/9] nvme: pci: fix & improve timeout handling Ming Lei
                   ` (4 preceding siblings ...)
  2018-05-11 12:29 ` [PATCH V5 5/9] nvme: pci: prepare for supporting error recovery from resetting context Ming Lei
@ 2018-05-11 12:29 ` Ming Lei
  2018-05-11 12:29 ` [PATCH V5 7/9] nvme: pci: don't unfreeze queue until controller state updating succeeds Ming Lei
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Ming Lei @ 2018-05-11 12:29 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, James Smart, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

Once nested EH is introduced, we may not need to handle errors
in the inner EH, so move error handling out of nvme_reset_dev().

Meanwhile, return the reset result to the caller.

Cc: James Smart <james.smart@broadcom.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a924246ffdb6..d880356feee2 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2365,7 +2365,7 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
 		nvme_put_ctrl(&dev->ctrl);
 }
 
-static void nvme_reset_dev(struct nvme_dev *dev)
+static int nvme_reset_dev(struct nvme_dev *dev)
 {
 	bool was_suspend = !!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL);
 	int result = -ENODEV;
@@ -2459,6 +2459,7 @@ static void nvme_reset_dev(struct nvme_dev *dev)
 		nvme_unfreeze(&dev->ctrl);
 	}
 
+	result = -ENODEV;
 	/*
 	 * If only admin queue live, keep it to do further investigation or
 	 * recovery.
@@ -2470,19 +2471,22 @@ static void nvme_reset_dev(struct nvme_dev *dev)
 	}
 
 	nvme_start_ctrl(&dev->ctrl);
-	return;
+	return 0;
 
  out:
-	nvme_remove_dead_ctrl(dev, result);
+	return result;
 }
 
 static void nvme_reset_work(struct work_struct *work)
 {
 	struct nvme_dev *dev =
 		container_of(work, struct nvme_dev, ctrl.reset_work);
+	int result;
 
 	mutex_lock(&dev->ctrl.reset_lock);
-	nvme_reset_dev(dev);
+	result = nvme_reset_dev(dev);
+	if (result)
+		nvme_remove_dead_ctrl(dev, result);
 	mutex_unlock(&dev->ctrl.reset_lock);
 }
 
-- 
2.9.5


* [PATCH V5 7/9] nvme: pci: don't unfreeze queue until controller state updating succeeds
  2018-05-11 12:29 [PATCH V5 0/9] nvme: pci: fix & improve timeout handling Ming Lei
                   ` (5 preceding siblings ...)
  2018-05-11 12:29 ` [PATCH V5 6/9] nvme: pci: move error handling out of nvme_reset_dev() Ming Lei
@ 2018-05-11 12:29 ` Ming Lei
  2018-05-11 12:29 ` [PATCH V5 8/9] nvme: core: introduce nvme_force_change_ctrl_state() Ming Lei
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Ming Lei @ 2018-05-11 12:29 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, James Smart, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

If updating the controller state to LIVE or ADMIN_ONLY fails, the
controller will be removed, so it isn't necessary to unfreeze the
queues any more.

This makes it easier for the following patch to avoid leaking the
freeze counter.

Cc: James Smart <james.smart@broadcom.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d880356feee2..b79c7f016489 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2370,6 +2370,7 @@ static int nvme_reset_dev(struct nvme_dev *dev)
 	bool was_suspend = !!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL);
 	int result = -ENODEV;
 	enum nvme_ctrl_state new_state = NVME_CTRL_LIVE;
+	bool unfreeze_queue = false;
 
 	lockdep_assert_held(&dev->ctrl.reset_lock);
 
@@ -2456,7 +2457,7 @@ static int nvme_reset_dev(struct nvme_dev *dev)
 		/* hit this only when allocate tagset fails */
 		if (nvme_dev_add(dev))
 			new_state = NVME_CTRL_ADMIN_ONLY;
-		nvme_unfreeze(&dev->ctrl);
+		unfreeze_queue = true;
 	}
 
 	result = -ENODEV;
@@ -2471,6 +2472,9 @@ static int nvme_reset_dev(struct nvme_dev *dev)
 	}
 
 	nvme_start_ctrl(&dev->ctrl);
+
+	if (unfreeze_queue)
+		nvme_unfreeze(&dev->ctrl);
 	return 0;
 
  out:
-- 
2.9.5


* [PATCH V5 8/9] nvme: core: introduce nvme_force_change_ctrl_state()
  2018-05-11 12:29 [PATCH V5 0/9] nvme: pci: fix & improve timeout handling Ming Lei
                   ` (6 preceding siblings ...)
  2018-05-11 12:29 ` [PATCH V5 7/9] nvme: pci: don't unfreeze queue until controller state updating succeeds Ming Lei
@ 2018-05-11 12:29 ` Ming Lei
  2018-05-11 12:29 ` [PATCH V5 9/9] nvme: pci: support nested EH Ming Lei
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Ming Lei @ 2018-05-11 12:29 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, James Smart, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

When the controller is being reset, a timeout may still be triggered;
to handle this situation, the controller state has to be changed to
NVME_CTRL_RESETTING first.

So introduce nvme_force_change_ctrl_state() for this purpose.
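
The intended caller is the top-level EH added in the last patch of this
series; roughly:

	/* top EH: force the state so the following reset can proceed */
	nvme_force_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING);
	result = nvme_reset_dev(dev);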

Cc: James Smart <james.smart@broadcom.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/core.c | 33 +++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h |  2 ++
 2 files changed, 35 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index adb1743e87f7..af053309c610 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -254,6 +254,39 @@ void nvme_cancel_request(struct request *req, void *data, bool reserved)
 }
 EXPORT_SYMBOL_GPL(nvme_cancel_request);
 
+/* should only be used by EH for handling error during reset */
+void nvme_force_change_ctrl_state(struct nvme_ctrl *ctrl,
+		enum nvme_ctrl_state new_state)
+{
+	enum nvme_ctrl_state old_state;
+	unsigned long flags;
+	bool warn = true;
+
+	spin_lock_irqsave(&ctrl->lock, flags);
+	old_state = ctrl->state;
+	switch (new_state) {
+	case NVME_CTRL_RESETTING:
+		switch (old_state) {
+		case NVME_CTRL_LIVE:
+		case NVME_CTRL_ADMIN_ONLY:
+		case NVME_CTRL_RESETTING:
+		case NVME_CTRL_CONNECTING:
+			warn = false;
+			break;
+		default:
+			break;
+		}
+	default:
+		break;
+	}
+	if (warn)
+		WARN(1, "Make sure you want to change to %d from %d\n",
+				new_state, old_state);
+	ctrl->state = new_state;
+	spin_unlock_irqrestore(&ctrl->lock, flags);
+}
+EXPORT_SYMBOL_GPL(nvme_force_change_ctrl_state);
+
 bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 		enum nvme_ctrl_state new_state)
 {
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 021f7147f779..2a68df197c71 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -385,6 +385,8 @@ void nvme_complete_rq(struct request *req);
 void nvme_cancel_request(struct request *req, void *data, bool reserved);
 bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 		enum nvme_ctrl_state new_state);
+void nvme_force_change_ctrl_state(struct nvme_ctrl *ctrl,
+		enum nvme_ctrl_state new_state);
 int nvme_disable_ctrl(struct nvme_ctrl *ctrl, u64 cap);
 int nvme_enable_ctrl(struct nvme_ctrl *ctrl, u64 cap);
 int nvme_shutdown_ctrl(struct nvme_ctrl *ctrl);
-- 
2.9.5


* [PATCH V5 9/9] nvme: pci: support nested EH
  2018-05-11 12:29 [PATCH V5 0/9] nvme: pci: fix & improve timeout handling Ming Lei
                   ` (7 preceding siblings ...)
  2018-05-11 12:29 ` [PATCH V5 8/9] nvme: core: introduce nvme_force_change_ctrl_state() Ming Lei
@ 2018-05-11 12:29 ` Ming Lei
  2018-05-15 10:02   ` jianchao.wang
  2018-05-11 20:50 ` [PATCH V5 0/9] nvme: pci: fix & improve timeout handling Keith Busch
  2018-05-14  8:21 ` jianchao.wang
  10 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2018-05-11 12:29 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, James Smart, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

When one request times out, nvme_timeout() currently handles it in the
following way:

	nvme_dev_disable(dev, false);
	nvme_reset_ctrl(&dev->ctrl);
	return BLK_EH_HANDLED.

There are several issues about the above approach:

1) IO may fail during resetting

An admin IO timeout may be triggered in nvme_reset_dev() when an error
happens. A normal IO timeout may be triggered during nvme_wait_freeze()
in the reset path too. When these two kinds of timeout happen, the
current reset mechanism can't work any more.

2) race between nvme_start_freeze() and nvme_wait_freeze() & nvme_unfreeze()

- nvme_dev_disable() and resetting the controller are both required for
recovering the controller, but the two are run from different contexts:
nvme_start_freeze() is called from nvme_dev_disable(), which runs in
timeout work context, while nvme_unfreeze() is run from reset work
context. Unfortunately a timeout may be triggered while the controller is
being reset, so nvme_start_freeze() may be run several times.

- Also, two reset works may run one after another; this may cause a hang
in nvme_wait_freeze() forever too.

3) every namespace's EH requires shutting down & resetting the controller

The block layer's timeout handler is per-request-queue; that means each
namespace's error handling may shut down & reset the whole controller,
so the shutdown from one namespace may quiesce queues while a reset from
another namespace is in progress.

This patch fixes the above issues by using nested EH:

1) run controller shutdown (nvme_dev_disable()) and resetting
(nvme_reset_dev()) from one single EH context, so the above race 2) can
be fixed easily.

2) always start a new context for handling EH, and cancel all in-flight
requests (including the timed-out ones) in nvme_dev_disable() by quiescing
the timeout event before shutting down the controller.

3) limit the max number of nested EH; when the limit is reached, remove
the controller and fail all in-flight requests.

With this approach, blktests block/011 passes.
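
With this patch, nvme_timeout() no longer disables and resets the
controller inline; roughly (see the hunks below), it only schedules an
EH context and re-arms the timer:

	nvme_eh_schedule(dev);
	return BLK_EH_RESET_TIMER;

and the scheduled nvme_eh_work() then runs both the shutdown and the
reset from one single context:

	nvme_dev_disable(dev, false, true);
	nvme_reset_dev(dev);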

Cc: James Smart <james.smart@broadcom.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/core.c |  22 +++++
 drivers/nvme/host/nvme.h |   2 +
 drivers/nvme/host/pci.c  | 243 +++++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 247 insertions(+), 20 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index af053309c610..1dec353388be 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3581,6 +3581,28 @@ void nvme_start_freeze(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_start_freeze);
 
+void nvme_unquiesce_timeout(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	down_read(&ctrl->namespaces_rwsem);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		blk_unquiesce_timeout(ns->queue);
+	up_read(&ctrl->namespaces_rwsem);
+}
+EXPORT_SYMBOL_GPL(nvme_unquiesce_timeout);
+
+void nvme_quiesce_timeout(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	down_read(&ctrl->namespaces_rwsem);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		blk_quiesce_timeout(ns->queue);
+	up_read(&ctrl->namespaces_rwsem);
+}
+EXPORT_SYMBOL_GPL(nvme_quiesce_timeout);
+
 void nvme_stop_queues(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 2a68df197c71..934cf98925f3 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -407,6 +407,8 @@ int nvme_sec_submit(void *data, u16 spsp, u8 secp, void *buffer, size_t len,
 void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
 		union nvme_result *res);
 
+void nvme_unquiesce_timeout(struct nvme_ctrl *ctrl);
+void nvme_quiesce_timeout(struct nvme_ctrl *ctrl);
 void nvme_stop_queues(struct nvme_ctrl *ctrl);
 void nvme_start_queues(struct nvme_ctrl *ctrl);
 void nvme_kill_queues(struct nvme_ctrl *ctrl);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b79c7f016489..4366c21e59b2 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -71,6 +71,7 @@ struct nvme_queue;
 static void nvme_process_cq(struct nvme_queue *nvmeq);
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
 		freeze_queue);
+static int nvme_reset_dev(struct nvme_dev *dev);
 
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
@@ -113,6 +114,23 @@ struct nvme_dev {
 	dma_addr_t host_mem_descs_dma;
 	struct nvme_host_mem_buf_desc *host_mem_descs;
 	void **host_mem_desc_bufs;
+
+	/* EH handler */
+	spinlock_t	eh_lock;
+	bool		ctrl_shutdown_started;
+	bool		ctrl_failed;
+	unsigned int	nested_eh;
+	struct work_struct fail_ctrl_work;
+	wait_queue_head_t	eh_wq;
+	struct list_head	eh_head;
+};
+
+#define  NVME_MAX_NESTED_EH	32
+struct nvme_eh_work {
+	struct work_struct	work;
+	struct nvme_dev		*dev;
+	int			seq;
+	struct list_head	list;
 };
 
 static int io_queue_depth_set(const char *val, const struct kernel_param *kp)
@@ -1177,6 +1195,176 @@ static void nvme_warn_reset(struct nvme_dev *dev, u32 csts)
 			 csts, result);
 }
 
+static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
+{
+	dev_warn(dev->ctrl.device, "Removing after probe failure status: %d\n", status);
+
+	nvme_get_ctrl(&dev->ctrl);
+	nvme_dev_disable(dev, false, false);
+	if (!queue_work(nvme_wq, &dev->remove_work))
+		nvme_put_ctrl(&dev->ctrl);
+}
+
+static void nvme_eh_fail_ctrl_work(struct work_struct *work)
+{
+	struct nvme_dev *dev =
+		container_of(work, struct nvme_dev, fail_ctrl_work);
+
+	dev_info(dev->ctrl.device, "EH: fail controller\n");
+	nvme_remove_dead_ctrl(dev, 0);
+}
+
+static void nvme_eh_mark_ctrl_shutdown(struct nvme_dev *dev)
+{
+	spin_lock(&dev->eh_lock);
+	dev->ctrl_shutdown_started = false;
+	spin_unlock(&dev->eh_lock);
+}
+
+static void nvme_eh_sched_fail_ctrl(struct nvme_dev *dev)
+{
+	INIT_WORK(&dev->fail_ctrl_work, nvme_eh_fail_ctrl_work);
+	queue_work(nvme_reset_wq, &dev->fail_ctrl_work);
+}
+
+/* either controller is updated to LIVE or will be removed */
+static bool nvme_eh_reset_done(struct nvme_dev *dev)
+{
+	return dev->ctrl.state == NVME_CTRL_LIVE || dev->ctrl_failed;
+}
+
+static void nvme_eh_done(struct nvme_eh_work *eh_work, int result)
+{
+	struct nvme_dev *dev = eh_work->dev;
+	bool top_eh;
+
+	spin_lock(&dev->eh_lock);
+	top_eh = list_is_last(&eh_work->list, &dev->eh_head);
+	dev->nested_eh--;
+
+	/* Fail controller if the top EH can't recover it */
+	if (!result)
+		wake_up_all(&dev->eh_wq);
+	else if (top_eh) {
+		dev->ctrl_failed = true;
+		nvme_eh_sched_fail_ctrl(dev);
+		wake_up_all(&dev->eh_wq);
+	}
+
+	list_del(&eh_work->list);
+	spin_unlock(&dev->eh_lock);
+
+	dev_info(dev->ctrl.device, "EH %d: state %d, eh_done %d, top eh %d\n",
+			eh_work->seq, dev->ctrl.state, result, top_eh);
+	wait_event(dev->eh_wq, nvme_eh_reset_done(dev));
+
+	/*
+	 * After controller is recovered in upper EH finally, we have to
+	 * unfreeze queues if reset failed in this EH, otherwise blk-mq
+	 * queues' freeze counter may be leaked.
+	 *
+	 * nvme_unfreeze() can only be called after controller state is
+	 * updated to LIVE.
+	 */
+	if (result && (dev->ctrl.state == NVME_CTRL_LIVE))
+		nvme_unfreeze(&dev->ctrl);
+}
+
+static void nvme_eh_work(struct work_struct *work)
+{
+	struct nvme_eh_work *eh_work =
+		container_of(work, struct nvme_eh_work, work);
+	struct nvme_dev *dev = eh_work->dev;
+	int result = -ENODEV;
+	bool top_eh;
+
+	dev_info(dev->ctrl.device, "EH %d: before shutdown\n",
+			eh_work->seq);
+	nvme_dev_disable(dev, false, true);
+
+	/* allow new EH to be created */
+	nvme_eh_mark_ctrl_shutdown(dev);
+
+	/*
+	 * nvme_dev_disable cancels all in-flight requests, and won't
+	 * cause timeout at all, so I am always the top EH now, but it
+	 * becomes not true after 'reset_lock' is held, so have to check
+	 * if I am still the top EH, and force to update to NVME_CTRL_RESETTING
+	 * if yes.
+	 */
+	mutex_lock(&dev->ctrl.reset_lock);
+	spin_lock(&dev->eh_lock);
+
+	/* allow top EH to preempt other inner EH */
+	top_eh = list_is_last(&eh_work->list, &dev->eh_head);
+	dev_info(dev->ctrl.device, "EH %d: after shutdown, top eh: %d\n",
+			eh_work->seq, top_eh);
+	if (!top_eh) {
+		if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
+			spin_unlock(&dev->eh_lock);
+			goto done;
+		}
+	} else {
+		nvme_force_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING);
+		result = 0;
+	}
+
+	spin_unlock(&dev->eh_lock);
+	result = nvme_reset_dev(dev);
+done:
+	mutex_unlock(&dev->ctrl.reset_lock);
+	nvme_eh_done(eh_work, result);
+	dev_info(dev->ctrl.device, "EH %d: after recovery %d\n",
+			eh_work->seq, result);
+
+	kfree(eh_work);
+}
+
+static void nvme_eh_schedule(struct nvme_dev *dev)
+{
+	bool need_sched = false;
+	bool fail_ctrl = false;
+	struct nvme_eh_work *eh_work;
+	int seq;
+
+	spin_lock(&dev->eh_lock);
+	if (!dev->ctrl_shutdown_started) {
+		need_sched = true;
+		seq = dev->nested_eh;
+		if (++dev->nested_eh >= NVME_MAX_NESTED_EH) {
+			if (!dev->ctrl_failed)
+				dev->ctrl_failed = fail_ctrl = true;
+			else
+				need_sched = false;
+		} else
+			dev->ctrl_shutdown_started = true;
+	}
+	spin_unlock(&dev->eh_lock);
+
+	if (!need_sched)
+		return;
+
+	if (fail_ctrl) {
+ fail_ctrl:
+		nvme_eh_sched_fail_ctrl(dev);
+		return;
+	}
+
+	eh_work = kzalloc(sizeof(*eh_work), GFP_NOIO);
+	if (!eh_work)
+		goto fail_ctrl;
+
+	eh_work->dev = dev;
+	eh_work->seq = seq;
+
+	spin_lock(&dev->eh_lock);
+	list_add_tail(&eh_work->list, &dev->eh_head);
+	spin_unlock(&dev->eh_lock);
+
+	INIT_WORK(&eh_work->work, nvme_eh_work);
+	queue_work(nvme_reset_wq, &eh_work->work);
+}
+
 static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -1198,9 +1386,8 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	 */
 	if (nvme_should_reset(dev, csts)) {
 		nvme_warn_reset(dev, csts);
-		nvme_dev_disable(dev, false, true);
-		nvme_reset_ctrl(&dev->ctrl);
-		return BLK_EH_HANDLED;
+		nvme_eh_schedule(dev);
+		return BLK_EH_RESET_TIMER;
 	}
 
 	/*
@@ -1225,9 +1412,9 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		dev_warn(dev->ctrl.device,
 			 "I/O %d QID %d timeout, disable controller\n",
 			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false, false);
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
-		return BLK_EH_HANDLED;
+		nvme_eh_schedule(dev);
+		return BLK_EH_RESET_TIMER;
 	default:
 		break;
 	}
@@ -1241,15 +1428,13 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		dev_warn(dev->ctrl.device,
 			 "I/O %d QID %d timeout, reset controller\n",
 			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false, true);
-		nvme_reset_ctrl(&dev->ctrl);
-
 		/*
 		 * Mark the request as handled, since the inline shutdown
 		 * forces all outstanding requests to complete.
 		 */
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
-		return BLK_EH_HANDLED;
+		nvme_eh_schedule(dev);
+		return BLK_EH_RESET_TIMER;
 	}
 
 	if (atomic_dec_return(&dev->ctrl.abort_limit) < 0) {
@@ -2301,12 +2486,26 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
 	}
 	for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
 		nvme_suspend_queue(&dev->queues[i]);
+	/*
+	 * safe to sync timeout after queues are quiesced, then all
+	 * requests (including the timed-out ones) will be canceled.
+	 */
+	nvme_quiesce_timeout(&dev->ctrl);
+	blk_quiesce_timeout(dev->ctrl.admin_q);
 
 	nvme_pci_disable(dev);
 
+	/*
+	 * Both timeout and interrupt handler have been drained, and all
+	 * in-flight requests will be canceled now.
+	 */
 	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
 	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
 
+	/* all requests have been canceled now, so enable timeout now */
+	nvme_unquiesce_timeout(&dev->ctrl);
+	blk_unquiesce_timeout(dev->ctrl.admin_q);
+
 	/*
 	 * The driver will not be starting up queues again if shutting down so
 	 * must flush all entered requests to their failed completion to avoid
@@ -2355,16 +2554,6 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
 	kfree(dev);
 }
 
-static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
-{
-	dev_warn(dev->ctrl.device, "Removing after probe failure status: %d\n", status);
-
-	nvme_get_ctrl(&dev->ctrl);
-	nvme_dev_disable(dev, false, false);
-	if (!queue_work(nvme_wq, &dev->remove_work))
-		nvme_put_ctrl(&dev->ctrl);
-}
-
 static int nvme_reset_dev(struct nvme_dev *dev)
 {
 	bool was_suspend = !!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL);
@@ -2374,7 +2563,7 @@ static int nvme_reset_dev(struct nvme_dev *dev)
 
 	lockdep_assert_held(&dev->ctrl.reset_lock);
 
-	if (WARN_ON(dev->ctrl.state != NVME_CTRL_RESETTING))
+	if (dev->ctrl.state != NVME_CTRL_RESETTING)
 		goto out;
 
 	/*
@@ -2460,6 +2649,10 @@ static int nvme_reset_dev(struct nvme_dev *dev)
 		unfreeze_queue = true;
 	}
 
+	/* controller state may have been updated already by inner EH */
+	if (dev->ctrl.state == new_state)
+		goto reset_done;
+
 	result = -ENODEV;
 	/*
 	 * If only admin queue live, keep it to do further investigation or
@@ -2473,6 +2666,7 @@ static int nvme_reset_dev(struct nvme_dev *dev)
 
 	nvme_start_ctrl(&dev->ctrl);
 
+ reset_done:
 	if (unfreeze_queue)
 		nvme_unfreeze(&dev->ctrl);
 	return 0;
@@ -2589,6 +2783,13 @@ static unsigned long check_vendor_combination_bug(struct pci_dev *pdev)
 	return 0;
 }
 
+static void nvme_eh_init(struct nvme_dev *dev)
+{
+	spin_lock_init(&dev->eh_lock);
+	init_waitqueue_head(&dev->eh_wq);
+	INIT_LIST_HEAD(&dev->eh_head);
+}
+
 static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
 	int node, result = -ENOMEM;
@@ -2633,6 +2834,8 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
 	dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev));
 
+	nvme_eh_init(dev);
+
 	nvme_reset_ctrl(&dev->ctrl);
 
 	return 0;
-- 
2.9.5


* Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
  2018-05-11 12:29 [PATCH V5 0/9] nvme: pci: fix & improve timeout handling Ming Lei
                   ` (8 preceding siblings ...)
  2018-05-11 12:29 ` [PATCH V5 9/9] nvme: pci: support nested EH Ming Lei
@ 2018-05-11 20:50 ` Keith Busch
  2018-05-12  0:21   ` Ming Lei
  2018-05-14  8:21 ` jianchao.wang
  10 siblings, 1 reply; 32+ messages in thread
From: Keith Busch @ 2018-05-11 20:50 UTC (permalink / raw)
  To: Ming Lei
  Cc: Keith Busch, Jens Axboe, Laurence Oberman, Sagi Grimberg,
	James Smart, linux-nvme, linux-block, Jianchao Wang,
	Christoph Hellwig

On Fri, May 11, 2018 at 08:29:24PM +0800, Ming Lei wrote:
> Hi,
> 
> The 1st patch introduces blk_quiesce_timeout() and blk_unquiesce_timeout()
> for NVMe, meantime fixes blk_sync_queue().
> 
> The 2nd patch covers timeout for admin commands for recovering controller
> for avoiding possible deadlock.
> 
> The 3rd and 4th patches avoid to wait_freeze on queues which aren't frozen.
> 
> The last 5 patches fixes several races wrt. NVMe timeout handler, and
> finally can make blktests block/011 passed. Meantime the NVMe PCI timeout
> mecanism become much more rebost than before.
> 
> gitweb:
> 	https://github.com/ming1/linux/commits/v4.17-rc-nvme-timeout.V5

Hi Ming,

First test with simulated broken links is unsuccessful. I'm getting
stuck here:

[<0>] blk_mq_freeze_queue_wait+0x46/0xb0
[<0>] blk_cleanup_queue+0x78/0x170
[<0>] nvme_ns_remove+0x137/0x1a0 [nvme_core]
[<0>] nvme_remove_namespaces+0x86/0xc0 [nvme_core]
[<0>] nvme_remove+0x6b/0x130 [nvme]
[<0>] pci_device_remove+0x36/0xb0
[<0>] device_release_driver_internal+0x157/0x220
[<0>] nvme_remove_dead_ctrl_work+0x29/0x40 [nvme]
[<0>] process_one_work+0x170/0x350
[<0>] worker_thread+0x2e/0x380
[<0>] kthread+0x111/0x130
[<0>] ret_from_fork+0x1f/0x30


Here's the last parts of the kernel logs capturing the failure:

[  760.679105] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  760.679116] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  760.679120] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  760.679124] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  760.679127] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  760.679131] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  760.679135] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  760.679138] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  760.679141] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  760.679144] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  760.679148] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  760.679151] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  760.679155] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  760.679158] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  760.679161] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  760.679164] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  760.679169] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  760.679172] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  760.679176] nvme nvme1: EH 0: before shutdown
[  760.679177] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  760.679181] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  760.679185] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  760.679189] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  760.679192] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  760.679196] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  760.679199] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  760.679202] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  760.679240] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  760.679243] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  760.679246] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff

( above repeats a few more hundred times )

[  760.679960] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[  760.701468] nvme nvme1: EH 0: after shutdown, top eh: 1
[  760.727099] pci_raw_set_power_state: 62 callbacks suppressed
[  760.727103] nvme 0000:86:00.0: Refused to change power state, currently in D3
[  760.727483] nvme nvme1: EH 0: state 4, eh_done -19, top eh 1
[  760.727485] nvme nvme1: EH 0: after recovery -19
[  760.727488] nvme nvme1: EH: fail controller
[  760.727491] nvme nvme1: Removing after probe failure status: 0
[  760.735138] nvme1n1: detected capacity change from 1200243695616 to 0


* Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
  2018-05-11 20:50 ` [PATCH V5 0/9] nvme: pci: fix & improve timeout handling Keith Busch
@ 2018-05-12  0:21   ` Ming Lei
  2018-05-14 15:18     ` Keith Busch
  0 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2018-05-12  0:21 UTC (permalink / raw)
  To: Keith Busch
  Cc: Keith Busch, Jens Axboe, Laurence Oberman, Sagi Grimberg,
	James Smart, linux-nvme, linux-block, Jianchao Wang,
	Christoph Hellwig

Hi Keith,

On Fri, May 11, 2018 at 02:50:28PM -0600, Keith Busch wrote:
> On Fri, May 11, 2018 at 08:29:24PM +0800, Ming Lei wrote:
> > Hi,
> > 
> > The 1st patch introduces blk_quiesce_timeout() and blk_unquiesce_timeout()
> > for NVMe, meantime fixes blk_sync_queue().
> > 
> > The 2nd patch covers timeout for admin commands for recovering controller
> > for avoiding possible deadlock.
> > 
> > The 3rd and 4th patches avoid to wait_freeze on queues which aren't frozen.
> > 
> > The last 5 patches fixes several races wrt. NVMe timeout handler, and
> > finally can make blktests block/011 passed. Meantime the NVMe PCI timeout
> > mecanism become much more rebost than before.
> > 
> > gitweb:
> > 	https://github.com/ming1/linux/commits/v4.17-rc-nvme-timeout.V5
> 
> Hi Ming,
> 
> First test with simulated broken links is unsuccessful. I'm getting
> stuck here:
> 
> [<0>] blk_mq_freeze_queue_wait+0x46/0xb0
> [<0>] blk_cleanup_queue+0x78/0x170
> [<0>] nvme_ns_remove+0x137/0x1a0 [nvme_core]
> [<0>] nvme_remove_namespaces+0x86/0xc0 [nvme_core]
> [<0>] nvme_remove+0x6b/0x130 [nvme]
> [<0>] pci_device_remove+0x36/0xb0
> [<0>] device_release_driver_internal+0x157/0x220
> [<0>] nvme_remove_dead_ctrl_work+0x29/0x40 [nvme]
> [<0>] process_one_work+0x170/0x350
> [<0>] worker_thread+0x2e/0x380
> [<0>] kthread+0x111/0x130
> [<0>] ret_from_fork+0x1f/0x30
> 
> 
> Here's the last parts of the kernel logs capturing the failure:
> 
> [  760.679105] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [  760.679116] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [  760.679120] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [  760.679124] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [  760.679127] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [  760.679131] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [  760.679135] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [  760.679138] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [  760.679141] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [  760.679144] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [  760.679148] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [  760.679151] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [  760.679155] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [  760.679158] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [  760.679161] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [  760.679164] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [  760.679169] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [  760.679172] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [  760.679176] nvme nvme1: EH 0: before shutdown
> [  760.679177] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [  760.679181] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [  760.679185] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [  760.679189] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [  760.679192] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [  760.679196] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [  760.679199] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [  760.679202] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [  760.679240] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [  760.679243] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [  760.679246] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> 
> ( above repeats a few more hundred times )
> 
> [  760.679960] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> [  760.701468] nvme nvme1: EH 0: after shutdown, top eh: 1
> [  760.727099] pci_raw_set_power_state: 62 callbacks suppressed
> [  760.727103] nvme 0000:86:00.0: Refused to change power state, currently in D3

EH may not cover this kind of failure, so it fails on the first try.

> [  760.727483] nvme nvme1: EH 0: state 4, eh_done -19, top eh 1
> [  760.727485] nvme nvme1: EH 0: after recovery -19
> [  760.727488] nvme nvme1: EH: fail controller

The above issue (the hang in nvme_remove()) is still an old one: it happens
because the queues are kept quiesced during remove. Could you please test
the following change?

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1dec353388be..c78e5a0cde06 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3254,6 +3254,11 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
         */
        if (ctrl->state == NVME_CTRL_DEAD)
                nvme_kill_queues(ctrl);
+       else {
+               if (ctrl->admin_q)
+                       blk_mq_unquiesce_queue(ctrl->admin_q);
+               nvme_start_queues(ctrl);
+       }

        down_write(&ctrl->namespaces_rwsem);
        list_splice_init(&ctrl->namespaces, &ns_list);

BTW, in my environment it is hard to trigger this failure, so I haven't seen
this issue, but I did verify that the nested EH can recover from errors
during reset.

Thanks,
Ming

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
  2018-05-11 12:29 [PATCH V5 0/9] nvme: pci: fix & improve timeout handling Ming Lei
                   ` (9 preceding siblings ...)
  2018-05-11 20:50 ` [PATCH V5 0/9] nvme: pci: fix & improve timeout handling Keith Busch
@ 2018-05-14  8:21 ` jianchao.wang
  2018-05-14  9:38   ` Ming Lei
  10 siblings, 1 reply; 32+ messages in thread
From: jianchao.wang @ 2018-05-14  8:21 UTC (permalink / raw)
  To: Ming Lei, Keith Busch
  Cc: Jens Axboe, linux-block, James Smart, Christoph Hellwig,
	Sagi Grimberg, linux-nvme, Laurence Oberman

Hi ming

Please refer to my test log and analysis.

[  229.872622] nvme nvme0: I/O 164 QID 1 timeout, reset controller
[  229.872649] nvme nvme0: EH 0: before shutdown
[  229.872683] nvme nvme0: I/O 165 QID 1 timeout, reset controller
[  229.872700] nvme nvme0: I/O 166 QID 1 timeout, reset controller
[  229.872716] nvme nvme0: I/O 167 QID 1 timeout, reset controller
[  229.872733] nvme nvme0: I/O 169 QID 1 timeout, reset controller
[  229.872750] nvme nvme0: I/O 173 QID 1 timeout, reset controller
[  229.872766] nvme nvme0: I/O 174 QID 1 timeout, reset controller
[  229.872783] nvme nvme0: I/O 178 QID 1 timeout, reset controller
[  229.872800] nvme nvme0: I/O 182 QID 1 timeout, aborting
[  229.872900] nvme nvme0: I/O 185 QID 1 timeout, aborting
[  229.872975] nvme nvme0: I/O 190 QID 1 timeout, reset controller
[  229.872990] nvme nvme0: I/O 191 QID 1 timeout, aborting
[  229.873021] nvme nvme0: I/O 5 QID 2 timeout, aborting
[  229.873096] nvme nvme0: I/O 40 QID 2 timeout, aborting
[  229.874041] nvme nvme0: Abort status: 0x0
[  229.874064] nvme nvme0: Abort status: 0x0
[  229.874078] nvme nvme0: Abort status: 0x0
[  229.874092] nvme nvme0: Abort status: 0x0
[  229.874106] nvme nvme0: Abort status: 0x0
[  230.060698] nvme nvme0: EH 0: after shutdown, top eh: 1
[  290.840592] nvme nvme0: I/O 18 QID 0 timeout, disable controller
[  290.840746] nvme nvme0: EH 1: before shutdown

[  290.992650] ------------[ cut here ]------------
[  290.992751] Trying to free already-free IRQ 133
[  290.992783] WARNING: CPU: 6 PID: 227 at /home/will/u04/source_code/linux-stable/kernel/irq/manage.c:1549 __free_irq+0xf6/0x420
[  290.993394] CPU: 6 PID: 227 Comm: kworker/u16:4 Kdump: loaded Not tainted 4.17.0-rc3+ #37
[  290.993402] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017
[  290.993418] Workqueue: nvme-reset-wq nvme_eh_work
[  290.993436] RIP: 0010:__free_irq+0xf6/0x420
...
[  290.993541] Call Trace:
[  290.993581]  free_irq+0x4c/0xa0
[  290.993600]  pci_free_irq+0x18/0x30
[  290.993615]  nvme_dev_disable+0x20b/0x720
[  290.993745]  nvme_eh_work+0xdd/0x4b0
[  290.993866]  process_one_work+0x3ca/0xaa0
[  290.993960]  worker_thread+0x89/0x6c0
[  290.994017]  kthread+0x18d/0x1e0
[  290.994061]  ret_from_fork+0x24/0x30
[  338.912379] INFO: task kworker/u16:4:227 blocked for more than 30 seconds.
[  338.913348]       Tainted: G        W         4.17.0-rc3+ #37
[  338.913549] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  338.913809] kworker/u16:4   D    0   227      2 0x80000000
[  338.913842] Workqueue: nvme-reset-wq nvme_eh_work
[  338.913857] Call Trace:
[  338.914002]  schedule+0x52/0xe0
[  338.914019]  schedule_preempt_disabled+0x14/0x20
[  338.914029]  __mutex_lock+0x5b4/0xca0
[  338.914228]  nvme_eh_work+0x11a/0x4b0
[  338.914344]  process_one_work+0x3ca/0xaa0
[  338.914440]  worker_thread+0x89/0x6c0
[  338.914496]  kthread+0x18d/0x1e0
[  338.914542]  ret_from_fork+0x24/0x30
[  338.914648]

Here is the deadlock scenario.

nvme_eh_work // EH0
  -> nvme_reset_dev //hold reset_lock
    -> nvme_setup_io_queues
      -> nvme_create_io_queues
        -> nvme_create_queue
          -> set nvmeq->cq_vector
          -> adapter_alloc_cq
          -> adapter_alloc_sq
             irq has not been requested
             io timeout 
                                    nvme_eh_work //EH1
                                      -> nvme_dev_disable
                                        -> quiesce the adminq //----> here !
                                        -> nvme_suspend_queue
                                          print out warning Trying to free already-free IRQ 133
                                        -> nvme_cancel_request // complete the timeout admin request
                                      -> require reset_lock
          -> adapter_delete_cq
            -> adapter_delete_queue // submit to the adminq which has been quiesced.
              -> nvme_submit_sync_cmd
                -> blk_execute_rq
                  -> wait_for_completion_io_timeout
                  hang_check is true, so there is no hung task warning for this context

EH0 submits the CQ delete admin command, but it will never complete or time out, because the admin
request queue has been quiesced; as a result the reset_lock cannot be released, and EH1 cannot acquire it to make progress.
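
For reference, blk_execute_rq() in this kernel waits roughly like the
sketch below (paraphrased from memory of the 4.17-era block/blk-exec.c,
not a verbatim copy), which is why the stuck delete command produces no
hung-task splat: with the admin queue quiesced the request is never
dispatched, so the wait below simply never finishes.

void blk_execute_rq(struct request_queue *q, struct gendisk *bd_disk,
		    struct request *rq, int at_head)
{
	DECLARE_COMPLETION_ONSTACK(wait);
	unsigned long hang_check;

	rq->end_io_data = &wait;
	blk_execute_rq_nowait(q, bd_disk, rq, at_head, blk_end_sync_rq);

	/*
	 * Prevent the hung-task timer from firing during very long I/O:
	 * wait in bounded chunks instead of one unbounded wait, so this
	 * context never shows up as a blocked task.
	 */
	hang_check = sysctl_hung_task_timeout_secs;
	if (hang_check)
		while (!wait_for_completion_io_timeout(&wait,
						hang_check * (HZ / 2)))
			;
	else
		wait_for_completion_io(&wait);
}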

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
  2018-05-14  8:21 ` jianchao.wang
@ 2018-05-14  9:38   ` Ming Lei
  2018-05-14 10:05     ` jianchao.wang
  0 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2018-05-14  9:38 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Keith Busch, Jens Axboe, linux-block, James Smart,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

On Mon, May 14, 2018 at 04:21:04PM +0800, jianchao.wang wrote:
> Hi ming
> 
> Please refer to my test log and analysis.
> 
> [  229.872622] nvme nvme0: I/O 164 QID 1 timeout, reset controller
> [  229.872649] nvme nvme0: EH 0: before shutdown
> [  229.872683] nvme nvme0: I/O 165 QID 1 timeout, reset controller
> [  229.872700] nvme nvme0: I/O 166 QID 1 timeout, reset controller
> [  229.872716] nvme nvme0: I/O 167 QID 1 timeout, reset controller
> [  229.872733] nvme nvme0: I/O 169 QID 1 timeout, reset controller
> [  229.872750] nvme nvme0: I/O 173 QID 1 timeout, reset controller
> [  229.872766] nvme nvme0: I/O 174 QID 1 timeout, reset controller
> [  229.872783] nvme nvme0: I/O 178 QID 1 timeout, reset controller
> [  229.872800] nvme nvme0: I/O 182 QID 1 timeout, aborting
> [  229.872900] nvme nvme0: I/O 185 QID 1 timeout, aborting
> [  229.872975] nvme nvme0: I/O 190 QID 1 timeout, reset controller
> [  229.872990] nvme nvme0: I/O 191 QID 1 timeout, aborting
> [  229.873021] nvme nvme0: I/O 5 QID 2 timeout, aborting
> [  229.873096] nvme nvme0: I/O 40 QID 2 timeout, aborting
> [  229.874041] nvme nvme0: Abort status: 0x0
> [  229.874064] nvme nvme0: Abort status: 0x0
> [  229.874078] nvme nvme0: Abort status: 0x0
> [  229.874092] nvme nvme0: Abort status: 0x0
> [  229.874106] nvme nvme0: Abort status: 0x0
> [  230.060698] nvme nvme0: EH 0: after shutdown, top eh: 1
> [  290.840592] nvme nvme0: I/O 18 QID 0 timeout, disable controller
> [  290.840746] nvme nvme0: EH 1: before shutdown
> 
> [  290.992650] ------------[ cut here ]------------
> [  290.992751] Trying to free already-free IRQ 133
> [  290.992783] WARNING: CPU: 6 PID: 227 at /home/will/u04/source_code/linux-stable/kernel/irq/manage.c:1549 __free_irq+0xf6/0x420
> [  290.993394] CPU: 6 PID: 227 Comm: kworker/u16:4 Kdump: loaded Not tainted 4.17.0-rc3+ #37
> [  290.993402] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017
> [  290.993418] Workqueue: nvme-reset-wq nvme_eh_work
> [  290.993436] RIP: 0010:__free_irq+0xf6/0x420
> ...
> [  290.993541] Call Trace:
> [  290.993581]  free_irq+0x4c/0xa0
> [  290.993600]  pci_free_irq+0x18/0x30
> [  290.993615]  nvme_dev_disable+0x20b/0x720
> [  290.993745]  nvme_eh_work+0xdd/0x4b0
> [  290.993866]  process_one_work+0x3ca/0xaa0
> [  290.993960]  worker_thread+0x89/0x6c0
> [  290.994017]  kthread+0x18d/0x1e0
> [  290.994061]  ret_from_fork+0x24/0x30
> [  338.912379] INFO: task kworker/u16:4:227 blocked for more than 30 seconds.
> [  338.913348]       Tainted: G        W         4.17.0-rc3+ #37
> [  338.913549] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  338.913809] kworker/u16:4   D    0   227      2 0x80000000
> [  338.913842] Workqueue: nvme-reset-wq nvme_eh_work
> [  338.913857] Call Trace:
> [  338.914002]  schedule+0x52/0xe0
> [  338.914019]  schedule_preempt_disabled+0x14/0x20
> [  338.914029]  __mutex_lock+0x5b4/0xca0
> [  338.914228]  nvme_eh_work+0x11a/0x4b0
> [  338.914344]  process_one_work+0x3ca/0xaa0
> [  338.914440]  worker_thread+0x89/0x6c0
> [  338.914496]  kthread+0x18d/0x1e0
> [  338.914542]  ret_from_fork+0x24/0x30
> [  338.914648]

The above isn't a new issue either, since nvme_dev_disable() can be
run before the controller is recovered (i.e. before nvme_setup_io_queues()
has completed) even without this patchset.

This can happen in several cases, such as the one you listed below, or
when two or more timed-out requests are triggered from different queues.

The issue is that genirq complains if the same IRQ is freed twice.

> 
> Here is the deadlock scenario.
> 
> nvme_eh_work // EH0
>   -> nvme_reset_dev //hold reset_lock
>     -> nvme_setup_io_queues
>       -> nvme_create_io_queues
>         -> nvme_create_queue
>           -> set nvmeq->cq_vector
>           -> adapter_alloc_cq
>           -> adapter_alloc_sq
>              irq has not been requested
>              io timeout 
>                                     nvme_eh_work //EH1
>                                       -> nvme_dev_disable
>                                         -> quiesce the adminq //----> here !
>                                         -> nvme_suspend_queue
>                                           print out warning Trying to free already-free IRQ 133
>                                         -> nvme_cancel_request // complete the timeout admin request
>                                       -> require reset_lock
>           -> adapter_delete_cq

If the admin IO submitted in adapter_alloc_sq() times out,
nvme_dev_disable() in EH1 will complete it; since it is marked
REQ_FAILFAST_DRIVER, adapter_alloc_sq() should return an error, and the
whole reset in EH0 should then be terminated immediately.

I guess the issue is that nvme_create_io_queues() ignores the failure.
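
Something like the following might make the failure visible to the reset
path (a hypothetical sketch, not the in-tree nvme_create_io_queues(); the
name nvme_create_io_queues_strict() is made up for illustration):

/*
 * Hypothetical sketch: stop creating I/O queues as soon as one creation
 * fails and return the error, so the caller can abort the reset instead
 * of carrying on with a half-initialized controller.
 */
static int nvme_create_io_queues_strict(struct nvme_dev *dev)
{
	unsigned int i;
	int ret = 0;

	for (i = dev->online_queues; i <= dev->max_qid; i++) {
		ret = nvme_create_queue(&dev->queues[i], i);
		if (ret)
			break;	/* don't swallow the error */
	}
	return ret;
}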

Could you dump the stack trace of the EH0 reset task, so that we can see
where the EH0 reset kthread hangs?

>             -> adapter_delete_queue // submit to the adminq which has been quiesced.
>               -> nvme_submit_sync_cmd
>                 -> blk_execute_rq
>                   -> wait_for_completion_io_timeout
>                   hang_check is true, so there is no hung task warning for this context
> 
> EH0 submit cq delete admin command, but it will never be completed or timed out, because the admin request queue has
> been quiesced, so the reset_lock cannot be released, and EH1 cannot get reset_lock and make things forward.

The nvme_dev_disable() in the outer EH (EH1 in the log above) will complete
all admin commands; they won't be retried because they are marked
REQ_FAILFAST_DRIVER, so nvme_cancel_request() completes them in
nvme_dev_disable().
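
For reference, the cancel path relied on here is roughly the following
(paraphrased from memory of the 4.17-era core.c, not a verbatim copy);
REQ_FAILFAST_DRIVER is what makes blk_noretry_request() true and so
prevents the retry on completion:

void nvme_cancel_request(struct request *req, void *data, bool reserved)
{
	/* Force-complete the request with an abort status. */
	nvme_req(req)->status = NVME_SC_ABORT_REQ;
	blk_mq_complete_request(req);
}

static inline bool nvme_req_needs_retry(struct request *req)
{
	if (blk_noretry_request(req))		/* REQ_FAILFAST_* set */
		return false;
	if (nvme_req(req)->status & NVME_SC_DNR)
		return false;
	if (nvme_req(req)->retries >= nvme_max_retries)
		return false;
	return true;
}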

Thanks
Ming

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
  2018-05-14  9:38   ` Ming Lei
@ 2018-05-14 10:05     ` jianchao.wang
  2018-05-14 12:22       ` Ming Lei
  0 siblings, 1 reply; 32+ messages in thread
From: jianchao.wang @ 2018-05-14 10:05 UTC (permalink / raw)
  To: Ming Lei
  Cc: Keith Busch, Jens Axboe, linux-block, James Smart,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

Hi ming

On 05/14/2018 05:38 PM, Ming Lei wrote:
>> Here is the deadlock scenario.
>>
>> nvme_eh_work // EH0
>>   -> nvme_reset_dev //hold reset_lock
>>     -> nvme_setup_io_queues
>>       -> nvme_create_io_queues
>>         -> nvme_create_queue
>>           -> set nvmeq->cq_vector
>>           -> adapter_alloc_cq
>>           -> adapter_alloc_sq
>>              irq has not been requested
>>              io timeout 
>>                                     nvme_eh_work //EH1
>>                                       -> nvme_dev_disable
>>                                         -> quiesce the adminq //----> here !
>>                                         -> nvme_suspend_queue
>>                                           print out warning Trying to free already-free IRQ 133
>>                                         -> nvme_cancel_request // complete the timeout admin request
>>                                       -> require reset_lock
>>           -> adapter_delete_cq
> If the admin IO submitted in adapter_alloc_sq() is timed out,
> nvme_dev_disable() in EH1 will complete it which is set as REQ_FAILFAST_DRIVER,
> then adapter_alloc_sq() should return error, and the whole reset in EH0
> should have been terminated immediately.

Please refer to the following segment:

static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
{
	struct nvme_dev *dev = nvmeq->dev;
	int result;
...
	nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
	result = adapter_alloc_cq(dev, qid, nvmeq);
	if (result < 0)
		goto release_vector;

	result = adapter_alloc_sq(dev, qid, nvmeq);   // if timeout and failed here
	if (result < 0)
		goto release_cq;

	nvme_init_queue(nvmeq, qid);
	result = queue_request_irq(nvmeq);
	if (result < 0)
		goto release_sq;

	return result;

 release_sq:
	dev->online_queues--;
	adapter_delete_sq(dev, qid);
 release_cq:                                        // we will be here !
	adapter_delete_cq(dev, qid);                // another cq delete admin command will be sent out.
 release_vector:
	nvmeq->cq_vector = -1;
	return result;
}


> 
> I guess the issue should be that nvme_create_io_queues() ignores the failure.
> 
> Could you dump the stack trace of EH0 reset task? So that we may see
> where EH0 reset kthread hangs.

root@will-ThinkCentre-M910s:/home/will/Desktop# cat /proc/2273/stack 
[<0>] blk_execute_rq+0xf7/0x150
[<0>] __nvme_submit_sync_cmd+0x94/0x110
[<0>] nvme_submit_sync_cmd+0x1b/0x20
[<0>] adapter_delete_queue+0xad/0xf0
[<0>] nvme_reset_dev+0x1b67/0x2450
[<0>] nvme_eh_work+0x19c/0x4b0
[<0>] process_one_work+0x3ca/0xaa0
[<0>] worker_thread+0x89/0x6c0
[<0>] kthread+0x18d/0x1e0
[<0>] ret_from_fork+0x24/0x30
[<0>] 0xffffffffffffffff
root@will-ThinkCentre-M910s:/home/will/Desktop# cat /proc/2275/stack 
[<0>] nvme_eh_work+0x11a/0x4b0
[<0>] process_one_work+0x3ca/0xaa0
[<0>] worker_thread+0x89/0x6c0
[<0>] kthread+0x18d/0x1e0
[<0>] ret_from_fork+0x24/0x30
[<0>] 0xffffffffffffffff

> 
>>             -> adapter_delete_queue // submit to the adminq which has been quiesced.
>>               -> nvme_submit_sync_cmd
>>                 -> blk_execute_rq
>>                   -> wait_for_completion_io_timeout
>>                   hang_check is true, so there is no hung task warning for this context
>>
>> EH0 submit cq delete admin command, but it will never be completed or timed out, because the admin request queue has
>> been quiesced, so the reset_lock cannot be released, and EH1 cannot get reset_lock and make things forward.
> The nvme_dev_disable() in outer EH(EH1 in above log) will complete all
> admin command, which won't be retried because it is set as
> REQ_FAILFAST_DRIVER, so nvme_cancel_request() will complete it in
> nvme_dev_disable().

This CQ delete admin command is sent out after EH1's nvme_dev_disable() has completed and failed the
previously timed-out SQ alloc admin command; please refer to the code segment above.
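
For reference, that delete is a synchronous command on the (now quiesced)
admin queue, roughly like this (paraphrased from the 4.17-era pci.c, not
a verbatim copy), so it just sits in blk_execute_rq() forever:

static int adapter_delete_queue(struct nvme_dev *dev, u8 opcode, u16 id)
{
	struct nvme_command c;

	memset(&c, 0, sizeof(c));
	c.delete_queue.opcode = opcode;
	c.delete_queue.qid = cpu_to_le16(id);

	/* Sync submission: blocks until the admin command completes. */
	return nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
}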

Thanks
jianchao

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
  2018-05-14 10:05     ` jianchao.wang
@ 2018-05-14 12:22       ` Ming Lei
  2018-05-15  0:33         ` Ming Lei
  0 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2018-05-14 12:22 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Keith Busch, Jens Axboe, linux-block, James Smart,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

Hi Jianchao,

On Mon, May 14, 2018 at 06:05:50PM +0800, jianchao.wang wrote:
> Hi ming
> 
> On 05/14/2018 05:38 PM, Ming Lei wrote:
> >> Here is the deadlock scenario.
> >>
> >> nvme_eh_work // EH0
> >>   -> nvme_reset_dev //hold reset_lock
> >>     -> nvme_setup_io_queues
> >>       -> nvme_create_io_queues
> >>         -> nvme_create_queue
> >>           -> set nvmeq->cq_vector
> >>           -> adapter_alloc_cq
> >>           -> adapter_alloc_sq
> >>              irq has not been requested
> >>              io timeout 
> >>                                     nvme_eh_work //EH1
> >>                                       -> nvme_dev_disable
> >>                                         -> quiesce the adminq //----> here !
> >>                                         -> nvme_suspend_queue
> >>                                           print out warning Trying to free already-free IRQ 133
> >>                                         -> nvme_cancel_request // complete the timeout admin request
> >>                                       -> require reset_lock
> >>           -> adapter_delete_cq
> > If the admin IO submitted in adapter_alloc_sq() is timed out,
> > nvme_dev_disable() in EH1 will complete it which is set as REQ_FAILFAST_DRIVER,
> > then adapter_alloc_sq() should return error, and the whole reset in EH0
> > should have been terminated immediately.
> 
> Please refer to the following segment:
> 
> static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
> {
> 	struct nvme_dev *dev = nvmeq->dev;
> 	int result;
> ...
> 	nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
> 	result = adapter_alloc_cq(dev, qid, nvmeq);
> 	if (result < 0)
> 		goto release_vector;
> 
> 	result = adapter_alloc_sq(dev, qid, nvmeq);   // if timeout and failed here
> 	if (result < 0)
> 		goto release_cq;
> 
> 	nvme_init_queue(nvmeq, qid);
> 	result = queue_request_irq(nvmeq);
> 	if (result < 0)
> 		goto release_sq;
> 
> 	return result;
> 
>  release_sq:
> 	dev->online_queues--;
> 	adapter_delete_sq(dev, qid);
>  release_cq:                                        // we will be here !
> 	adapter_delete_cq(dev, qid);                // another cq delete admin command will be sent out.
>  release_vector:
> 	nvmeq->cq_vector = -1;
> 	return result;
> }

Given that the admin queue has been suspended, all admin IO should have
been terminated immediately, so could you test the following patch?

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f509d37b2fb8..44e38be259f2 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1515,9 +1515,6 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)
 	nvmeq->cq_vector = -1;
 	spin_unlock_irq(&nvmeq->q_lock);
 
-	if (!nvmeq->qid && nvmeq->dev->ctrl.admin_q)
-		blk_mq_quiesce_queue(nvmeq->dev->ctrl.admin_q);
-
 	pci_free_irq(to_pci_dev(nvmeq->dev->dev), vector, nvmeq);
 
 	return 0;
@@ -1741,8 +1738,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
 			dev->ctrl.admin_q = NULL;
 			return -ENODEV;
 		}
-	} else
-		blk_mq_unquiesce_queue(dev->ctrl.admin_q);
+	}
 
 	return 0;
 }

> 
> 
> > 
> > I guess the issue should be that nvme_create_io_queues() ignores the failure.
> > 
> > Could you dump the stack trace of EH0 reset task? So that we may see
> > where EH0 reset kthread hangs.
> 
> root@will-ThinkCentre-M910s:/home/will/Desktop# cat /proc/2273/stack 
> [<0>] blk_execute_rq+0xf7/0x150
> [<0>] __nvme_submit_sync_cmd+0x94/0x110
> [<0>] nvme_submit_sync_cmd+0x1b/0x20
> [<0>] adapter_delete_queue+0xad/0xf0
> [<0>] nvme_reset_dev+0x1b67/0x2450
> [<0>] nvme_eh_work+0x19c/0x4b0
> [<0>] process_one_work+0x3ca/0xaa0
> [<0>] worker_thread+0x89/0x6c0
> [<0>] kthread+0x18d/0x1e0
> [<0>] ret_from_fork+0x24/0x30
> [<0>] 0xffffffffffffffff

Even without this patch, the above hang can happen in the reset context,
so this issue isn't related to the newly introduced 'reset_lock'.

> root@will-ThinkCentre-M910s:/home/will/Desktop# cat /proc/2275/stack 
> [<0>] nvme_eh_work+0x11a/0x4b0
> [<0>] process_one_work+0x3ca/0xaa0
> [<0>] worker_thread+0x89/0x6c0
> [<0>] kthread+0x18d/0x1e0
> [<0>] ret_from_fork+0x24/0x30
> [<0>] 0xffffffffffffffff
> 
> > 
> >>             -> adapter_delete_queue // submit to the adminq which has been quiesced.
> >>               -> nvme_submit_sync_cmd
> >>                 -> blk_execute_rq
> >>                   -> wait_for_completion_io_timeout
> >>                   hang_check is true, so there is no hung task warning for this context
> >>
> >> EH0 submit cq delete admin command, but it will never be completed or timed out, because the admin request queue has
> >> been quiesced, so the reset_lock cannot be released, and EH1 cannot get reset_lock and make things forward.
> > The nvme_dev_disable() in outer EH(EH1 in above log) will complete all
> > admin command, which won't be retried because it is set as
> > REQ_FAILFAST_DRIVER, so nvme_cancel_request() will complete it in
> > nvme_dev_disable().
> 
> This cq delete admin command is sent out after EH 1 nvme_dev_disable completed and failed the
> previous timeout sq alloc admin command. please refer to the code segment above.

Right, as I mentioned above, this admin command should have been failed
immediately.


Thanks,
Ming

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
  2018-05-12  0:21   ` Ming Lei
@ 2018-05-14 15:18     ` Keith Busch
  2018-05-14 23:47       ` Ming Lei
  0 siblings, 1 reply; 32+ messages in thread
From: Keith Busch @ 2018-05-14 15:18 UTC (permalink / raw)
  To: Ming Lei
  Cc: Keith Busch, Jens Axboe, Laurence Oberman, Sagi Grimberg,
	James Smart, linux-nvme, linux-block, Jianchao Wang,
	Christoph Hellwig

Hi Ming,

On Sat, May 12, 2018 at 08:21:22AM +0800, Ming Lei wrote:
> > [  760.679960] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> > [  760.701468] nvme nvme1: EH 0: after shutdown, top eh: 1
> > [  760.727099] pci_raw_set_power_state: 62 callbacks suppressed
> > [  760.727103] nvme 0000:86:00.0: Refused to change power state, currently in D3
> 
> EH may not cover this kind of failure, so it fails in the 1st try.

Indeed, the test is simulating a permanently broken link, so recovery is
not expected. A success in this case is just completing driver
unbinding.
 
> > [  760.727483] nvme nvme1: EH 0: state 4, eh_done -19, top eh 1
> > [  760.727485] nvme nvme1: EH 0: after recovery -19
> > [  760.727488] nvme nvme1: EH: fail controller
> 
> The above issue(hang in nvme_remove()) is still an old issue, which
> is because queues are kept as quiesce during remove, so could you
> please test the following change?
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 1dec353388be..c78e5a0cde06 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3254,6 +3254,11 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
>          */
>         if (ctrl->state == NVME_CTRL_DEAD)
>                 nvme_kill_queues(ctrl);
> +       else {
> +               if (ctrl->admin_q)
> +                       blk_mq_unquiesce_queue(ctrl->admin_q);
> +               nvme_start_queues(ctrl);
> +       }
> 
>         down_write(&ctrl->namespaces_rwsem);
>         list_splice_init(&ctrl->namespaces, &ns_list);

The above won't actually do anything here since the broken link puts the
controller in the DEAD state, so we've killed the queues which also
unquiesces them.

> BTW, in my environment, it is hard to trigger this failure, so not see
> this issue, but I did verify the nested EH which can recover from error
> in reset.

It's actually pretty easy to trigger this one. I just modify block/019 to
remove the check for a hotplug slot, then run it on a block device that
isn't hot-pluggable.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
  2018-05-14 15:18     ` Keith Busch
@ 2018-05-14 23:47       ` Ming Lei
  2018-05-15  0:33         ` Keith Busch
  0 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2018-05-14 23:47 UTC (permalink / raw)
  To: Keith Busch
  Cc: Keith Busch, Jens Axboe, Laurence Oberman, Sagi Grimberg,
	James Smart, linux-nvme, linux-block, Jianchao Wang,
	Christoph Hellwig

On Mon, May 14, 2018 at 09:18:21AM -0600, Keith Busch wrote:
> Hi Ming,
> 
> On Sat, May 12, 2018 at 08:21:22AM +0800, Ming Lei wrote:
> > > [  760.679960] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> > > [  760.701468] nvme nvme1: EH 0: after shutdown, top eh: 1
> > > [  760.727099] pci_raw_set_power_state: 62 callbacks suppressed
> > > [  760.727103] nvme 0000:86:00.0: Refused to change power state, currently in D3
> > 
> > EH may not cover this kind of failure, so it fails in the 1st try.
> 
> Indeed, the test is simulating a permanently broken link, so recovery is
> not expected. A success in this case is just completing driver
> unbinding.
>  
> > > [  760.727483] nvme nvme1: EH 0: state 4, eh_done -19, top eh 1
> > > [  760.727485] nvme nvme1: EH 0: after recovery -19
> > > [  760.727488] nvme nvme1: EH: fail controller
> > 
> > The above issue(hang in nvme_remove()) is still an old issue, which
> > is because queues are kept as quiesce during remove, so could you
> > please test the following change?
> > 
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index 1dec353388be..c78e5a0cde06 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -3254,6 +3254,11 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
> >          */
> >         if (ctrl->state == NVME_CTRL_DEAD)
> >                 nvme_kill_queues(ctrl);
> > +       else {
> > +               if (ctrl->admin_q)
> > +                       blk_mq_unquiesce_queue(ctrl->admin_q);
> > +               nvme_start_queues(ctrl);
> > +       }
> > 
> >         down_write(&ctrl->namespaces_rwsem);
> >         list_splice_init(&ctrl->namespaces, &ns_list);
> 
> The above won't actually do anything here since the broken link puts the
> controller in the DEAD state, so we've killed the queues which also
> unquiesces them.

I suggest you double-check whether the controller is set to DEAD
in nvme_remove(), since no log is dumped when this happens.

If the controller is set to DEAD and the queues are killed, all IO should
have been dispatched to the driver and nvme_queue_rq() will fail it all,
so there is no reason to see the hang in your stack trace log.
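
For reference, the dispatch-side check being relied on is roughly this
(paraphrased from the 4.17-era nvme_queue_rq() in pci.c, not a verbatim
copy); once the queue has been torn down, cq_vector is -1 and every
request fails right away instead of hanging:

static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
				  const struct blk_mq_queue_data *bd)
{
	struct nvme_queue *nvmeq = hctx->driver_data;

	/* Dying/dead queue: fail instead of queueing to dead hardware. */
	if (unlikely(nvmeq->cq_vector < 0))
		return BLK_STS_IOERR;

	/* ... normal command setup and submission elided in this sketch ... */
	return BLK_STS_OK;
}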

> 
> > BTW, in my environment, it is hard to trigger this failure, so not see
> > this issue, but I did verify the nested EH which can recover from error
> > in reset.
> 
> It's actually pretty easy to trigger this one. I just modify block/019 to
> remove the check for a hotplug slot then run it on a block device that's
> not hot-pluggable.

I will try this test, and hope I can reproduce it in my environment.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
  2018-05-14 12:22       ` Ming Lei
@ 2018-05-15  0:33         ` Ming Lei
  2018-05-15  9:56           ` jianchao.wang
  0 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2018-05-15  0:33 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Keith Busch, Jens Axboe, linux-block, James Smart,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

On Mon, May 14, 2018 at 08:22:11PM +0800, Ming Lei wrote:
> Hi Jianchao,
> 
> On Mon, May 14, 2018 at 06:05:50PM +0800, jianchao.wang wrote:
> > Hi ming
> > 
> > On 05/14/2018 05:38 PM, Ming Lei wrote:
> > >> Here is the deadlock scenario.
> > >>
> > >> nvme_eh_work // EH0
> > >>   -> nvme_reset_dev //hold reset_lock
> > >>     -> nvme_setup_io_queues
> > >>       -> nvme_create_io_queues
> > >>         -> nvme_create_queue
> > >>           -> set nvmeq->cq_vector
> > >>           -> adapter_alloc_cq
> > >>           -> adapter_alloc_sq
> > >>              irq has not been requested
> > >>              io timeout 
> > >>                                     nvme_eh_work //EH1
> > >>                                       -> nvme_dev_disable
> > >>                                         -> quiesce the adminq //----> here !
> > >>                                         -> nvme_suspend_queue
> > >>                                           print out warning Trying to free already-free IRQ 133
> > >>                                         -> nvme_cancel_request // complete the timeout admin request
> > >>                                       -> require reset_lock
> > >>           -> adapter_delete_cq
> > > If the admin IO submitted in adapter_alloc_sq() is timed out,
> > > nvme_dev_disable() in EH1 will complete it which is set as REQ_FAILFAST_DRIVER,
> > > then adapter_alloc_sq() should return error, and the whole reset in EH0
> > > should have been terminated immediately.
> > 
> > Please refer to the following segment:
> > 
> > static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
> > {
> > 	struct nvme_dev *dev = nvmeq->dev;
> > 	int result;
> > ...
> > 	nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
> > 	result = adapter_alloc_cq(dev, qid, nvmeq);
> > 	if (result < 0)
> > 		goto release_vector;
> > 
> > 	result = adapter_alloc_sq(dev, qid, nvmeq);   // if timeout and failed here
> > 	if (result < 0)
> > 		goto release_cq;
> > 
> > 	nvme_init_queue(nvmeq, qid);
> > 	result = queue_request_irq(nvmeq);
> > 	if (result < 0)
> > 		goto release_sq;
> > 
> > 	return result;
> > 
> >  release_sq:
> > 	dev->online_queues--;
> > 	adapter_delete_sq(dev, qid);
> >  release_cq:                                        // we will be here !
> > 	adapter_delete_cq(dev, qid);                // another cq delete admin command will be sent out.
> >  release_vector:
> > 	nvmeq->cq_vector = -1;
> > 	return result;
> > }
> 
> Given admin queue has been suspended, all admin IO should have
> been terminated immediately, so could you test the following patch?
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index f509d37b2fb8..44e38be259f2 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1515,9 +1515,6 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)
>  	nvmeq->cq_vector = -1;
>  	spin_unlock_irq(&nvmeq->q_lock);
>  
> -	if (!nvmeq->qid && nvmeq->dev->ctrl.admin_q)
> -		blk_mq_quiesce_queue(nvmeq->dev->ctrl.admin_q);
> -
>  	pci_free_irq(to_pci_dev(nvmeq->dev->dev), vector, nvmeq);
>  
>  	return 0;
> @@ -1741,8 +1738,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
>  			dev->ctrl.admin_q = NULL;
>  			return -ENODEV;
>  		}
> -	} else
> -		blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> +	}
>  
>  	return 0;
>  }

We still have to quiesce the admin queue before canceling requests, so the
following patch looks better. Please ignore the above patch, try the one
below instead, and see whether it addresses your hang:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f509d37b2fb8..c2adc76472a8 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1741,8 +1741,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
 			dev->ctrl.admin_q = NULL;
 			return -ENODEV;
 		}
-	} else
-		blk_mq_unquiesce_queue(dev->ctrl.admin_q);
+	}
 
 	return 0;
 }
@@ -2520,6 +2519,12 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
 	 */
 	if (shutdown)
 		nvme_start_queues(&dev->ctrl);
+
+	/*
+	 * Avoid to suck reset because timeout may happen during reset and
+	 * reset may hang forever if admin queue is kept as quiesced
+	 */
+	blk_mq_unquiesce_queue(dev->ctrl.admin_q);
 	mutex_unlock(&dev->shutdown_lock);
 }
 

-- 
Ming

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
  2018-05-14 23:47       ` Ming Lei
@ 2018-05-15  0:33         ` Keith Busch
  2018-05-15  9:08           ` Ming Lei
  2018-05-16  4:31           ` Ming Lei
  0 siblings, 2 replies; 32+ messages in thread
From: Keith Busch @ 2018-05-15  0:33 UTC (permalink / raw)
  To: Ming Lei
  Cc: Keith Busch, Jens Axboe, Laurence Oberman, Sagi Grimberg,
	James Smart, linux-nvme, linux-block, Jianchao Wang,
	Christoph Hellwig

On Tue, May 15, 2018 at 07:47:07AM +0800, Ming Lei wrote:
> > > > [  760.727485] nvme nvme1: EH 0: after recovery -19
> > > > [  760.727488] nvme nvme1: EH: fail controller
> > > 
> > > The above issue(hang in nvme_remove()) is still an old issue, which
> > > is because queues are kept as quiesce during remove, so could you
> > > please test the following change?
> > > 
> > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > > index 1dec353388be..c78e5a0cde06 100644
> > > --- a/drivers/nvme/host/core.c
> > > +++ b/drivers/nvme/host/core.c
> > > @@ -3254,6 +3254,11 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
> > >          */
> > >         if (ctrl->state == NVME_CTRL_DEAD)
> > >                 nvme_kill_queues(ctrl);
> > > +       else {
> > > +               if (ctrl->admin_q)
> > > +                       blk_mq_unquiesce_queue(ctrl->admin_q);
> > > +               nvme_start_queues(ctrl);
> > > +       }
> > > 
> > >         down_write(&ctrl->namespaces_rwsem);
> > >         list_splice_init(&ctrl->namespaces, &ns_list);
> > 
> > The above won't actually do anything here since the broken link puts the
> > controller in the DEAD state, so we've killed the queues which also
> > unquiesces them.
> 
> I suggest you to double check if the controller is set as DEAD
> in nvme_remove() since there won't be any log dumped when this happen.

Yes, it's dead. pci_device_is_present returns false when the link is
broken.

Also, the logs showed the capacity was set to 0, which only happens when
we kill the namespace queues; killing them is supposed to restart
(unquiesce) the queues too.
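
(Roughly, killing the queues does something like the following in this
era; this is a from-memory sketch, not a verbatim copy, but it shows
where both the capacity-to-0 change and the unquiesce come from:)

void nvme_kill_queues(struct nvme_ctrl *ctrl)
{
	struct nvme_ns *ns;

	down_read(&ctrl->namespaces_rwsem);

	/* Forcibly unquiesce the admin queue to avoid blocking dispatch. */
	if (ctrl->admin_q)
		blk_mq_unquiesce_queue(ctrl->admin_q);

	list_for_each_entry(ns, &ctrl->namespaces, list) {
		if (!ns->disk || test_and_set_bit(NVME_NS_DEAD, &ns->flags))
			continue;
		/* Revalidating a dead namespace sets its capacity to 0. */
		revalidate_disk(ns->disk);
		blk_set_queue_dying(ns->queue);
		/* Make future submissions fail instead of blocking. */
		blk_mq_unquiesce_queue(ns->queue);
	}
	up_read(&ctrl->namespaces_rwsem);
}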
 
> If controller is set as DEAD and queues are killed, and all IO should
> have been dispatched to driver and nvme_queueu_rq() will fail them all,
> then there isn't any reason to see the hang in your stack trace log.

Right, that's the idea. It just doesn't appear to be working here.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
  2018-05-15  0:33         ` Keith Busch
@ 2018-05-15  9:08           ` Ming Lei
  2018-05-16  4:31           ` Ming Lei
  1 sibling, 0 replies; 32+ messages in thread
From: Ming Lei @ 2018-05-15  9:08 UTC (permalink / raw)
  To: Keith Busch
  Cc: Ming Lei, Jens Axboe, linux-block, Laurence Oberman,
	Sagi Grimberg, James Smart, linux-nvme, Keith Busch,
	Jianchao Wang, Christoph Hellwig

On Tue, May 15, 2018 at 8:33 AM, Keith Busch
<keith.busch@linux.intel.com> wrote:
> On Tue, May 15, 2018 at 07:47:07AM +0800, Ming Lei wrote:
>> > > > [  760.727485] nvme nvme1: EH 0: after recovery -19
>> > > > [  760.727488] nvme nvme1: EH: fail controller
>> > >
>> > > The above issue(hang in nvme_remove()) is still an old issue, which
>> > > is because queues are kept as quiesce during remove, so could you
>> > > please test the following change?
>> > >
>> > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> > > index 1dec353388be..c78e5a0cde06 100644
>> > > --- a/drivers/nvme/host/core.c
>> > > +++ b/drivers/nvme/host/core.c
>> > > @@ -3254,6 +3254,11 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
>> > >          */
>> > >         if (ctrl->state == NVME_CTRL_DEAD)
>> > >                 nvme_kill_queues(ctrl);
>> > > +       else {
>> > > +               if (ctrl->admin_q)
>> > > +                       blk_mq_unquiesce_queue(ctrl->admin_q);
>> > > +               nvme_start_queues(ctrl);
>> > > +       }
>> > >
>> > >         down_write(&ctrl->namespaces_rwsem);
>> > >         list_splice_init(&ctrl->namespaces, &ns_list);
>> >
>> > The above won't actually do anything here since the broken link puts the
>> > controller in the DEAD state, so we've killed the queues which also
>> > unquiesces them.
>>
>> I suggest you to double check if the controller is set as DEAD
>> in nvme_remove() since there won't be any log dumped when this happen.
>
> Yes, it's dead. pci_device_is_present returns false when the link is
> broken.
>
> Also, the logs showed the capacity was set to 0, which only happens when
> we kill the namespace queues, which supposedly restarts the queues too.
>

Right, nvme_kill_queues() may trigger that. In my block/019 test I don't see
pci_device_is_present() return false, but nvme_kill_queues() has been
called in nvme_remove_dead_ctrl_work(), and I still haven't reproduced
the hang in blk_cleanup_queue().

Looks a bit weird, but debugfs may show some clue, :-)

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
  2018-05-15  0:33         ` Ming Lei
@ 2018-05-15  9:56           ` jianchao.wang
  2018-05-15 12:56             ` Ming Lei
  2018-05-16  2:04             ` Ming Lei
  0 siblings, 2 replies; 32+ messages in thread
From: jianchao.wang @ 2018-05-15  9:56 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Laurence Oberman, Sagi Grimberg,
	James Smart, linux-nvme, Keith Busch, Christoph Hellwig

Hi ming

On 05/15/2018 08:33 AM, Ming Lei wrote:
> We still have to quiesce admin queue before canceling request, so looks
> the following patch is better, so please ignore the above patch and try
> the following one and see if your hang can be addressed:
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index f509d37b2fb8..c2adc76472a8 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1741,8 +1741,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
>  			dev->ctrl.admin_q = NULL;
>  			return -ENODEV;
>  		}
> -	} else
> -		blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> +	}
>  
>  	return 0;
>  }
> @@ -2520,6 +2519,12 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
>  	 */
>  	if (shutdown)
>  		nvme_start_queues(&dev->ctrl);
> +
> +	/*
> +	 * Avoid to suck reset because timeout may happen during reset and
> +	 * reset may hang forever if admin queue is kept as quiesced
> +	 */
> +	blk_mq_unquiesce_queue(dev->ctrl.admin_q);
>  	mutex_unlock(&dev->shutdown_lock);
>  }

With the patch above and the patch below, neither the warning nor the IO hang issue has reproduced so far.


@@ -1450,6 +1648,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 {
        struct nvme_dev *dev = nvmeq->dev;
        int result;
+       int cq_vector;
 
        if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
                unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
@@ -1462,15 +1661,16 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
         * A queue's vector matches the queue identifier unless the controller
         * has only one vector available.
         */
-       nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
-       result = adapter_alloc_cq(dev, qid, nvmeq);
+       cq_vector = dev->num_vecs == 1 ? 0 : qid;
+       result = adapter_alloc_cq(dev, qid, nvmeq, cq_vector);
        if (result < 0)
-               goto release_vector;
+               goto out;
 
        result = adapter_alloc_sq(dev, qid, nvmeq);
        if (result < 0)
                goto release_cq;
-
+       
+       nvmeq->cq_vector = cq_vector;
        nvme_init_queue(nvmeq, qid);
        result = queue_request_irq(nvmeq);
        if (result < 0)
@@ -1479,12 +1679,12 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
        return result;
 
  release_sq:
+       nvmeq->cq_vector = -1;
        dev->online_queues--;
        adapter_delete_sq(dev, qid);
  release_cq:
        adapter_delete_cq(dev, qid);
- release_vector:
-       nvmeq->cq_vector = -1;
+ out:
        return result;
 }

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V5 9/9] nvme: pci: support nested EH
  2018-05-11 12:29 ` [PATCH V5 9/9] nvme: pci: support nested EH Ming Lei
@ 2018-05-15 10:02   ` jianchao.wang
  2018-05-15 12:39     ` Ming Lei
  0 siblings, 1 reply; 32+ messages in thread
From: jianchao.wang @ 2018-05-15 10:02 UTC (permalink / raw)
  To: Ming Lei, Keith Busch
  Cc: Jens Axboe, Laurence Oberman, Sagi Grimberg, James Smart,
	linux-nvme, linux-block, Christoph Hellwig

Hi ming

On 05/11/2018 08:29 PM, Ming Lei wrote:
> +static void nvme_eh_done(struct nvme_eh_work *eh_work, int result)
> +{
> +	struct nvme_dev *dev = eh_work->dev;
> +	bool top_eh;
> +
> +	spin_lock(&dev->eh_lock);
> +	top_eh = list_is_last(&eh_work->list, &dev->eh_head);
> +	dev->nested_eh--;
> +
> +	/* Fail controller if the top EH can't recover it */
> +	if (!result)
> +		wake_up_all(&dev->eh_wq);
> +	else if (top_eh) {
> +		dev->ctrl_failed = true;
> +		nvme_eh_sched_fail_ctrl(dev);
> +		wake_up_all(&dev->eh_wq);
> +	}
> +
> +	list_del(&eh_work->list);
> +	spin_unlock(&dev->eh_lock);
> +
> +	dev_info(dev->ctrl.device, "EH %d: state %d, eh_done %d, top eh %d\n",
> +			eh_work->seq, dev->ctrl.state, result, top_eh);
> +	wait_event(dev->eh_wq, nvme_eh_reset_done(dev));

nested_eh is decreased before this EH exits, so another new EH will get a confusing seq number;
please refer to the following log:
[ 1342.961869] nvme nvme0: Abort status: 0x0
[ 1342.961878] nvme nvme0: Abort status: 0x0
[ 1343.148341] nvme nvme0: EH 0: after shutdown, top eh: 1
[ 1403.828484] nvme nvme0: I/O 21 QID 0 timeout, disable controller
[ 1403.828603] nvme nvme0: EH 1: before shutdown
... waring logs are ignored here 
[ 1403.984731] nvme nvme0: EH 0: state 4, eh_done -4, top eh 0  // EH0 go to wait
[ 1403.984786] nvme nvme0: EH 1: after shutdown, top eh: 1
[ 1464.856290] nvme nvme0: I/O 22 QID 0 timeout, disable controller  // timeout again in EH 1
[ 1464.856411] nvme nvme0: EH 1: before shutdown // a new EH has a 1 seq number

Is it expected that the new EH has seq number 1 instead of 2?

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V5 9/9] nvme: pci: support nested EH
  2018-05-15 10:02   ` jianchao.wang
@ 2018-05-15 12:39     ` Ming Lei
  0 siblings, 0 replies; 32+ messages in thread
From: Ming Lei @ 2018-05-15 12:39 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Keith Busch, Jens Axboe, Laurence Oberman, Sagi Grimberg,
	James Smart, linux-nvme, linux-block, Christoph Hellwig

On Tue, May 15, 2018 at 06:02:13PM +0800, jianchao.wang wrote:
> Hi ming
> 
> On 05/11/2018 08:29 PM, Ming Lei wrote:
> > +static void nvme_eh_done(struct nvme_eh_work *eh_work, int result)
> > +{
> > +	struct nvme_dev *dev = eh_work->dev;
> > +	bool top_eh;
> > +
> > +	spin_lock(&dev->eh_lock);
> > +	top_eh = list_is_last(&eh_work->list, &dev->eh_head);
> > +	dev->nested_eh--;
> > +
> > +	/* Fail controller if the top EH can't recover it */
> > +	if (!result)
> > +		wake_up_all(&dev->eh_wq);
> > +	else if (top_eh) {
> > +		dev->ctrl_failed = true;
> > +		nvme_eh_sched_fail_ctrl(dev);
> > +		wake_up_all(&dev->eh_wq);
> > +	}
> > +
> > +	list_del(&eh_work->list);
> > +	spin_unlock(&dev->eh_lock);
> > +
> > +	dev_info(dev->ctrl.device, "EH %d: state %d, eh_done %d, top eh %d\n",
> > +			eh_work->seq, dev->ctrl.state, result, top_eh);
> > +	wait_event(dev->eh_wq, nvme_eh_reset_done(dev));
> 
> decrease the nested_eh before it exits, another new EH will have confusing seq number.
> please refer to following log:
> [ 1342.961869] nvme nvme0: Abort status: 0x0
> [ 1342.961878] nvme nvme0: Abort status: 0x0
> [ 1343.148341] nvme nvme0: EH 0: after shutdown, top eh: 1
> [ 1403.828484] nvme nvme0: I/O 21 QID 0 timeout, disable controller
> [ 1403.828603] nvme nvme0: EH 1: before shutdown
> ... waring logs are ignored here 
> [ 1403.984731] nvme nvme0: EH 0: state 4, eh_done -4, top eh 0  // EH0 go to wait
> [ 1403.984786] nvme nvme0: EH 1: after shutdown, top eh: 1
> [ 1464.856290] nvme nvme0: I/O 22 QID 0 timeout, disable controller  // timeout again in EH 1
> [ 1464.856411] nvme nvme0: EH 1: before shutdown // a new EH has a 1 seq number
> 
> Is it expected that the new EH has seq number 1 instead of 2 ?

Right, this has been fixed in my local V5.1 tree:

https://github.com/ming1/linux/commits/v4.17-rc-nvme-timeout.V5.1

And there are also several other fixes in this tree.

All will be merged to V6.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
  2018-05-15  9:56           ` jianchao.wang
@ 2018-05-15 12:56             ` Ming Lei
  2018-05-16  3:03               ` jianchao.wang
  2018-05-16  2:04             ` Ming Lei
  1 sibling, 1 reply; 32+ messages in thread
From: Ming Lei @ 2018-05-15 12:56 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Jens Axboe, Keith Busch, Laurence Oberman, Sagi Grimberg,
	James Smart, linux-nvme, linux-block, Christoph Hellwig

On Tue, May 15, 2018 at 05:56:14PM +0800, jianchao.wang wrote:
> Hi ming
> 
> On 05/15/2018 08:33 AM, Ming Lei wrote:
> > We still have to quiesce admin queue before canceling request, so looks
> > the following patch is better, so please ignore the above patch and try
> > the following one and see if your hang can be addressed:
> > 
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index f509d37b2fb8..c2adc76472a8 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -1741,8 +1741,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
> >  			dev->ctrl.admin_q = NULL;
> >  			return -ENODEV;
> >  		}
> > -	} else
> > -		blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> > +	}
> >  
> >  	return 0;
> >  }
> > @@ -2520,6 +2519,12 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
> >  	 */
> >  	if (shutdown)
> >  		nvme_start_queues(&dev->ctrl);
> > +
> > +	/*
> > +	 * Avoid to suck reset because timeout may happen during reset and
> > +	 * reset may hang forever if admin queue is kept as quiesced
> > +	 */
> > +	blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> >  	mutex_unlock(&dev->shutdown_lock);
> >  }
> 
> w/ patch above and patch below, both the warning and io hung issue didn't reproduce till now.
> 

Great to see that another issue can be covered now. :-)

> 
> @@ -1450,6 +1648,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
>  {
>         struct nvme_dev *dev = nvmeq->dev;
>         int result;
> +       int cq_vector;
>  
>         if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
>                 unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
> @@ -1462,15 +1661,16 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
>          * A queue's vector matches the queue identifier unless the controller
>          * has only one vector available.
>          */
> -       nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
> -       result = adapter_alloc_cq(dev, qid, nvmeq);
> +       cq_vector = dev->num_vecs == 1 ? 0 : qid;
> +       result = adapter_alloc_cq(dev, qid, nvmeq, cq_vector);
>         if (result < 0)
> -               goto release_vector;
> +               goto out;
>  
>         result = adapter_alloc_sq(dev, qid, nvmeq);
>         if (result < 0)
>                 goto release_cq;
> -
> +       
> +       nvmeq->cq_vector = cq_vector;
>         nvme_init_queue(nvmeq, qid);
>         result = queue_request_irq(nvmeq);
>         if (result < 0)
> @@ -1479,12 +1679,12 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
>         return result;
>  
>   release_sq:
> +       nvmeq->cq_vector = -1;
>         dev->online_queues--;
>         adapter_delete_sq(dev, qid);
>   release_cq:
>         adapter_delete_cq(dev, qid);
> - release_vector:
> -       nvmeq->cq_vector = -1;
> + out:
>         return result;
>  }
> 

That looks like a nice fix for nvme_create_queue(), but it seems the
corresponding change to adapter_alloc_cq() is missing from the patch above.

Could you prepare a formal patch so that I can integrate it into V6?
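
For example, something along these lines (only a sketch of the companion
change, using the 4.17-era field names; the formal patch may differ):

static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
		struct nvme_queue *nvmeq, s16 vector)
{
	struct nvme_command c;
	int flags = NVME_QUEUE_PHYS_CONTIG | NVME_CQ_IRQ_ENABLED;

	memset(&c, 0, sizeof(c));
	c.create_cq.opcode = nvme_admin_create_cq;
	c.create_cq.prp1 = cpu_to_le64(nvmeq->cq_dma_addr);
	c.create_cq.cqid = cpu_to_le16(qid);
	c.create_cq.qsize = cpu_to_le16(nvmeq->q_depth - 1);
	c.create_cq.cq_flags = cpu_to_le16(flags);
	/* Use the caller-provided vector, not nvmeq->cq_vector. */
	c.create_cq.irq_vector = cpu_to_le16(vector);

	return nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
}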

Thanks,
Ming

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
  2018-05-15  9:56           ` jianchao.wang
  2018-05-15 12:56             ` Ming Lei
@ 2018-05-16  2:04             ` Ming Lei
  2018-05-16  2:09               ` Ming Lei
  1 sibling, 1 reply; 32+ messages in thread
From: Ming Lei @ 2018-05-16  2:04 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Jens Axboe, linux-block, Laurence Oberman, Sagi Grimberg,
	James Smart, linux-nvme, Keith Busch, Christoph Hellwig

On Tue, May 15, 2018 at 05:56:14PM +0800, jianchao.wang wrote:
> Hi ming
> 
> On 05/15/2018 08:33 AM, Ming Lei wrote:
> > We still have to quiesce admin queue before canceling request, so looks
> > the following patch is better, so please ignore the above patch and try
> > the following one and see if your hang can be addressed:
> > 
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index f509d37b2fb8..c2adc76472a8 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -1741,8 +1741,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
> >  			dev->ctrl.admin_q = NULL;
> >  			return -ENODEV;
> >  		}
> > -	} else
> > -		blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> > +	}
> >  
> >  	return 0;
> >  }
> > @@ -2520,6 +2519,12 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
> >  	 */
> >  	if (shutdown)
> >  		nvme_start_queues(&dev->ctrl);
> > +
> > +	/*
> > +	 * Avoid to suck reset because timeout may happen during reset and
> > +	 * reset may hang forever if admin queue is kept as quiesced
> > +	 */
> > +	blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> >  	mutex_unlock(&dev->shutdown_lock);
> >  }
> 
> w/ patch above and patch below, both the warning and io hung issue didn't reproduce till now.
> 
> 
> @@ -1450,6 +1648,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
>  {
>         struct nvme_dev *dev = nvmeq->dev;
>         int result;
> +       int cq_vector;
>  
>         if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
>                 unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
> @@ -1462,15 +1661,16 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
>          * A queue's vector matches the queue identifier unless the controller
>          * has only one vector available.
>          */
> -       nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
> -       result = adapter_alloc_cq(dev, qid, nvmeq);
> +       cq_vector = dev->num_vecs == 1 ? 0 : qid;
> +       result = adapter_alloc_cq(dev, qid, nvmeq, cq_vector);
>         if (result < 0)
> -               goto release_vector;
> +               goto out;

Thinking about this issue further, the above change will cause
adapter_alloc_cq() to fail immediately, because nvmeq->cq_vector isn't set
before this admin IO is submitted.

So could you check whether the patch ("unquiesce admin queue after shutdown
controller") alone can fix your IO hang issue?

BTW, the warning from genirq can be left alone; that is a separate issue.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
  2018-05-16  2:04             ` Ming Lei
@ 2018-05-16  2:09               ` Ming Lei
  2018-05-16  2:15                 ` jianchao.wang
  0 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2018-05-16  2:09 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Jens Axboe, linux-block, Laurence Oberman, Sagi Grimberg,
	James Smart, linux-nvme, Keith Busch, Christoph Hellwig

On Wed, May 16, 2018 at 10:04:20AM +0800, Ming Lei wrote:
> On Tue, May 15, 2018 at 05:56:14PM +0800, jianchao.wang wrote:
> > Hi ming
> > 
> > On 05/15/2018 08:33 AM, Ming Lei wrote:
> > > We still have to quiesce admin queue before canceling request, so looks
> > > the following patch is better, so please ignore the above patch and try
> > > the following one and see if your hang can be addressed:
> > > 
> > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > > index f509d37b2fb8..c2adc76472a8 100644
> > > --- a/drivers/nvme/host/pci.c
> > > +++ b/drivers/nvme/host/pci.c
> > > @@ -1741,8 +1741,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
> > >  			dev->ctrl.admin_q = NULL;
> > >  			return -ENODEV;
> > >  		}
> > > -	} else
> > > -		blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> > > +	}
> > >  
> > >  	return 0;
> > >  }
> > > @@ -2520,6 +2519,12 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
> > >  	 */
> > >  	if (shutdown)
> > >  		nvme_start_queues(&dev->ctrl);
> > > +
> > > +	/*
> > > +	 * Avoid the reset getting stuck: a timeout may happen during reset, and
> > > +	 * the reset may hang forever if the admin queue is kept quiesced
> > > +	 */
> > > +	blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> > >  	mutex_unlock(&dev->shutdown_lock);
> > >  }
> > 
> > With the patch above and the patch below, neither the warning nor the IO hang issue has reproduced so far.
> > 
> > 
> > @@ -1450,6 +1648,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
> >  {
> >         struct nvme_dev *dev = nvmeq->dev;
> >         int result;
> > +       int cq_vector;
> >  
> >         if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
> >                 unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
> > @@ -1462,15 +1661,16 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
> >          * A queue's vector matches the queue identifier unless the controller
> >          * has only one vector available.
> >          */
> > -       nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
> > -       result = adapter_alloc_cq(dev, qid, nvmeq);
> > +       cq_vector = dev->num_vecs == 1 ? 0 : qid;
> > +       result = adapter_alloc_cq(dev, qid, nvmeq, cq_vector);
> >         if (result < 0)
> > -               goto release_vector;
> > +               goto out;
> 
> Thinking about this issue further, the above change will cause adapter_alloc_cq()
> to fail immediately, because nvmeq->cq_vector isn't set before this admin IO is
> submitted.
> 
> So could you check whether the patch ("unquiesce admin queue after shutdown
> controller") alone can fix your IO hang issue?
> 
> BTW, the warning from genirq can be left alone, that is another issue.

Oops, there is no such issue at all since the admin queue is ready. Please ignore
the noise, sorry. :-(

Thanks,
Ming

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
  2018-05-16  2:09               ` Ming Lei
@ 2018-05-16  2:15                 ` jianchao.wang
  0 siblings, 0 replies; 32+ messages in thread
From: jianchao.wang @ 2018-05-16  2:15 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Laurence Oberman, Sagi Grimberg,
	James Smart, linux-nvme, Keith Busch, Christoph Hellwig

Hi Ming

On 05/16/2018 10:09 AM, Ming Lei wrote:
> So could you check if only the patch("unquiesce admin queue after shutdown
> controller") can fix your IO hang issue?

I did indeed test this before fixing the warning.
It could fix the IO hang issue. :)

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
  2018-05-15 12:56             ` Ming Lei
@ 2018-05-16  3:03               ` jianchao.wang
  0 siblings, 0 replies; 32+ messages in thread
From: jianchao.wang @ 2018-05-16  3:03 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Laurence Oberman, Sagi Grimberg,
	James Smart, linux-nvme, Keith Busch, Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 268 bytes --]

Hi Ming

On 05/15/2018 08:56 PM, Ming Lei wrote:
> Looks like a nice fix to nvme_create_queue(), but it seems the change to
> adapter_alloc_cq() is missing from the above patch.
> 
> Could you prepare a formal one so that I can integrate it into V6?

Please refer to the attached patch.

Thanks
Jianchao

[-- Attachment #2: 0001-nvme-pci-set-nvmeq-cq_vector-after-alloc-cq-sq.patch --]
[-- Type: text/x-patch, Size: 3091 bytes --]

From 9bb6db79901ef303cd40c4c911604bc053b1ad95 Mon Sep 17 00:00:00 2001
From: Jianchao Wang <jianchao.w.wang@oracle.com>
Date: Wed, 16 May 2018 09:45:45 +0800
Subject: [PATCH] nvme-pci: set nvmeq->cq_vector after alloc cq/sq

Currently nvmeq->cq_vector is set before the cq/sq is allocated. If the
alloc cq/sq command times out, nvme_suspend_queue will invoke free_irq
for the nvmeq because the cq_vector looks valid, which causes the warning
'Trying to free already-free IRQ xxx'. Set nvmeq->cq_vector only after the
cq/sq allocation succeeds to fix this.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/pci.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fbc71fa..c830092 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1070,7 +1070,7 @@ static int adapter_delete_queue(struct nvme_dev *dev, u8 opcode, u16 id)
 }
 
 static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
-						struct nvme_queue *nvmeq)
+						struct nvme_queue *nvmeq, int cq_vector)
 {
 	struct nvme_command c;
 	int flags = NVME_QUEUE_PHYS_CONTIG | NVME_CQ_IRQ_ENABLED;
@@ -1085,7 +1085,7 @@ static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
 	c.create_cq.cqid = cpu_to_le16(qid);
 	c.create_cq.qsize = cpu_to_le16(nvmeq->q_depth - 1);
 	c.create_cq.cq_flags = cpu_to_le16(flags);
-	c.create_cq.irq_vector = cpu_to_le16(nvmeq->cq_vector);
+	c.create_cq.irq_vector = cpu_to_le16(cq_vector);
 
 	return nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
 }
@@ -1450,6 +1450,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 {
 	struct nvme_dev *dev = nvmeq->dev;
 	int result;
+	int cq_vector;
 
 	if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
 		unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
@@ -1462,15 +1463,21 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 	 * A queue's vector matches the queue identifier unless the controller
 	 * has only one vector available.
 	 */
-	nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
-	result = adapter_alloc_cq(dev, qid, nvmeq);
+	cq_vector = dev->num_vecs == 1 ? 0 : qid;
+	result = adapter_alloc_cq(dev, qid, nvmeq, cq_vector);
 	if (result < 0)
-		goto release_vector;
+		goto out;
 
 	result = adapter_alloc_sq(dev, qid, nvmeq);
 	if (result < 0)
 		goto release_cq;
 
+	/*
+	 * Set cq_vector only after the cq/sq have been allocated; otherwise, if
+	 * the alloc cq/sq command times out, nvme_suspend_queue will invoke
+	 * free_irq for it and cause the warning 'Trying to free already-free IRQ xxx'.
+	 */
+	nvmeq->cq_vector = cq_vector;
 	nvme_init_queue(nvmeq, qid);
 	result = queue_request_irq(nvmeq);
 	if (result < 0)
@@ -1478,13 +1485,13 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 
 	return result;
 
- release_sq:
+release_sq:
+	nvmeq->cq_vector = -1;
 	dev->online_queues--;
 	adapter_delete_sq(dev, qid);
- release_cq:
+release_cq:
 	adapter_delete_cq(dev, qid);
- release_vector:
-	nvmeq->cq_vector = -1;
+out:
 	return result;
 }
 
-- 
2.7.4
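
The ordering bug the patch above addresses boils down to recording a resource
handle before the resource actually exists. The program below is a minimal,
self-contained sketch of that pattern in plain userspace C; the names
create_hw_queue(), teardown_queue() and free_irq_line() are invented for the
example and are not driver functions. If the vector is recorded in the queue
before the hardware has acknowledged queue creation, a timeout-driven teardown
path sees a "valid" vector and frees an IRQ that was never requested, which is
exactly the "Trying to free already-free IRQ" situation.

#include <stdio.h>

struct queue {
	int cq_vector;          /* -1 means "no IRQ bound to this queue"        */
	int irq_requested;      /* set once the IRQ line is actually requested  */
};

/* Pretend the "create completion queue" admin command timed out. */
static int create_hw_queue(struct queue *q, int vector)
{
	(void)q;
	(void)vector;
	return -1;              /* timeout / failure */
}

static void free_irq_line(int vector)
{
	printf("free_irq(%d)\n", vector);
}

/* Timeout/teardown path: frees the IRQ whenever the vector looks valid. */
static void teardown_queue(struct queue *q)
{
	if (q->cq_vector >= 0 && !q->irq_requested)
		printf("WARN: freeing IRQ %d that was never requested\n",
		       q->cq_vector);
	if (q->cq_vector >= 0)
		free_irq_line(q->cq_vector);
}

int main(void)
{
	struct queue q = { .cq_vector = -1, .irq_requested = 0 };
	int vector = 1;

	/* Buggy ordering: record the vector before the queue exists. */
	q.cq_vector = vector;
	if (create_hw_queue(&q, vector) < 0)
		teardown_queue(&q);     /* triggers the spurious free */

	/* Fixed ordering: record the vector only after creation succeeds. */
	q.cq_vector = -1;
	if (create_hw_queue(&q, vector) == 0)
		q.cq_vector = vector;
	else
		teardown_queue(&q);     /* sees -1: nothing to free, no warning */

	return 0;
}

The attached patch applies the same idea by passing cq_vector into
adapter_alloc_cq() and assigning nvmeq->cq_vector only once both the CQ and SQ
have been created.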


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
  2018-05-15  0:33         ` Keith Busch
  2018-05-15  9:08           ` Ming Lei
@ 2018-05-16  4:31           ` Ming Lei
  2018-05-16 15:18             ` Keith Busch
  1 sibling, 1 reply; 32+ messages in thread
From: Ming Lei @ 2018-05-16  4:31 UTC (permalink / raw)
  To: Keith Busch
  Cc: Keith Busch, Jens Axboe, Laurence Oberman, Sagi Grimberg,
	James Smart, linux-nvme, linux-block, Jianchao Wang,
	Christoph Hellwig

On Mon, May 14, 2018 at 06:33:35PM -0600, Keith Busch wrote:
> On Tue, May 15, 2018 at 07:47:07AM +0800, Ming Lei wrote:
> > > > > [  760.727485] nvme nvme1: EH 0: after recovery -19
> > > > > [  760.727488] nvme nvme1: EH: fail controller
> > > > 
> > > > The above issue (hang in nvme_remove()) is still an old issue, caused by
> > > > the queues being kept quiesced during remove, so could you please test
> > > > the following change?
> > > > 
> > > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > > > index 1dec353388be..c78e5a0cde06 100644
> > > > --- a/drivers/nvme/host/core.c
> > > > +++ b/drivers/nvme/host/core.c
> > > > @@ -3254,6 +3254,11 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
> > > >          */
> > > >         if (ctrl->state == NVME_CTRL_DEAD)
> > > >                 nvme_kill_queues(ctrl);
> > > > +       else {
> > > > +               if (ctrl->admin_q)
> > > > +                       blk_mq_unquiesce_queue(ctrl->admin_q);
> > > > +               nvme_start_queues(ctrl);
> > > > +       }
> > > > 
> > > >         down_write(&ctrl->namespaces_rwsem);
> > > >         list_splice_init(&ctrl->namespaces, &ns_list);
> > > 
> > > The above won't actually do anything here since the broken link puts the
> > > controller in the DEAD state, so we've killed the queues which also
> > > unquiesces them.
> > 
> > I suggest you double-check whether the controller is set as DEAD in
> > nvme_remove(), since there won't be any log dumped when this happens.
> 
> Yes, it's dead. pci_device_is_present returns false when the link is
> broken.
> 
> Also, the logs showed the capacity was set to 0, which only happens when
> we kill the namespace queues, which supposedly restarts the queues too.
>  
> > If the controller is set as DEAD and the queues are killed, all IO should
> > have been dispatched to the driver and nvme_queue_rq() will fail it all,
> > so there isn't any reason to see the hang in your stack trace log.
> 
> Right, that's the idea. It just doesn't appear to be working here.

Hi Keith,

This issue can probably be fixed by Jianchao's patch 'nvme: pci: set nvmeq->cq_vector
after alloc cq/sq' [1] and my other patch 'nvme: pci: unquiesce admin
queue after controller is shutdown' [2]; both have been included in the
posted V6.

[1] https://marc.info/?l=linux-block&m=152644346006002&w=2 
[2] https://marc.info/?l=linux-block&m=152644344805995&w=2

So please retest with V6 and see if this issue is addressed. If not,
could you share the debugfs log, so that we might see what is wrong
from the log?

Thanks,
Ming

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
  2018-05-16  4:31           ` Ming Lei
@ 2018-05-16 15:18             ` Keith Busch
  2018-05-16 22:18               ` Ming Lei
  0 siblings, 1 reply; 32+ messages in thread
From: Keith Busch @ 2018-05-16 15:18 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Laurence Oberman, Sagi Grimberg,
	James Smart, linux-nvme, Keith Busch, Jianchao Wang,
	Christoph Hellwig

On Wed, May 16, 2018 at 12:31:28PM +0800, Ming Lei wrote:
> Hi Keith,
> 
> This issue can probably be fixed by Jianchao's patch 'nvme: pci: set nvmeq->cq_vector
> after alloc cq/sq' [1] and my other patch 'nvme: pci: unquiesce admin
> queue after controller is shutdown' [2]; both have been included in the
> posted V6.

No, it's definitely not related to that patch. The link is down in this
test, I can assure you we're bailing out long before we ever even try to
create an IO queue. The failing condition is detected by nvme_pci_enable's
check for all 1's completions at the very beginning.
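
For context, "all 1's completions" refers to the fact that MMIO reads from a
device whose PCIe link is down return all bits set, so comparing a status
register value against ~0 is a cheap way to notice that the device is gone
before issuing any admin command. A hedged sketch of that kind of check (the
helper name is invented for illustration and is not the driver's code):

#include <stdbool.h>
#include <stdint.h>

/*
 * Reads from a dead PCIe link complete with every bit set, so a register
 * readback of 0xFFFFFFFF is treated as "device gone" rather than real status.
 */
static bool link_looks_dead(uint32_t csts)
{
	return csts == 0xFFFFFFFFu;
}

int main(void)
{
	/* e.g. a CSTS readback of all ones from a dead link */
	return link_looks_dead(0xFFFFFFFFu) ? 0 : 1;
}

A caller in an enable path would bail out early (for example with -ENODEV)
when such a check succeeds, before any queue is created.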

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
  2018-05-16 15:18             ` Keith Busch
@ 2018-05-16 22:18               ` Ming Lei
  0 siblings, 0 replies; 32+ messages in thread
From: Ming Lei @ 2018-05-16 22:18 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Laurence Oberman, Sagi Grimberg,
	James Smart, linux-nvme, Keith Busch, Jianchao Wang,
	Christoph Hellwig

On Wed, May 16, 2018 at 09:18:26AM -0600, Keith Busch wrote:
> On Wed, May 16, 2018 at 12:31:28PM +0800, Ming Lei wrote:
> > Hi Keith,
> > 
> > This issue can probably be fixed by Jianchao's patch 'nvme: pci: set nvmeq->cq_vector
> > after alloc cq/sq' [1] and my other patch 'nvme: pci: unquiesce admin
> > queue after controller is shutdown' [2]; both have been included in the
> > posted V6.
> 
> No, it's definitely not related to that patch. The link is down in this
> test, I can assure you we're bailing out long before we ever even try to
> create an IO queue. The failing condition is detected by nvme_pci_enable's
> check for all 1's completions at the very beginning.

OK, this kind of failure during reset can be triggered easily in my test, and
nvme_remove_dead_ctrl() is called too, but I don't see an IO hang from the remove path.

As we discussed, it shouldn't be so: since the queues are unquiesced &
killed, all IO should have been failed immediately. Also, the controller has
been shut down and the queues are frozen too, so blk_mq_freeze_queue_wait()
won't wait on an unfrozen queue.
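
To make that expectation concrete, the toy model below (plain userspace C, not
block-layer code; all names are invented for the example) shows why the
ordering matters: a freeze-style wait only returns once every in-flight request
has been ended, and requests parked on a quiesced queue are never dispatched,
so they can never be ended. Unquiescing (or killing) the queues first lets a
dead controller fail them immediately, and the wait then completes.

#include <stdbool.h>
#include <stdio.h>

struct toy_queue {
	int inflight;       /* requests that still hold a reference            */
	bool quiesced;      /* quiesced: nothing is dispatched for execution   */
	bool ctrl_dead;     /* a dead controller fails requests on dispatch    */
};

/* Dispatch loop: only runs requests when the queue is not quiesced. */
static void toy_run_queue(struct toy_queue *q)
{
	if (q->quiesced)
		return;
	while (q->inflight > 0 && q->ctrl_dead)
		q->inflight--;          /* request is failed and ended */
}

/* Freeze-style wait: returns only once every request has been ended. */
static bool toy_freeze_wait(struct toy_queue *q, int max_tries)
{
	while (max_tries--) {
		toy_run_queue(q);
		if (q->inflight == 0)
			return true;    /* drained */
	}
	return false;                   /* would hang forever */
}

int main(void)
{
	struct toy_queue q = { .inflight = 8, .quiesced = true, .ctrl_dead = true };

	/* Waiting while still quiesced never drains: this is the hang. */
	printf("wait while quiesced: %s\n",
	       toy_freeze_wait(&q, 100) ? "drained" : "stuck");

	/* Unquiesce (or kill) first, then the wait completes. */
	q.quiesced = false;
	printf("wait after unquiesce: %s\n",
	       toy_freeze_wait(&q, 100) ? "drained" : "stuck");
	return 0;
}

In the discussion above, nvme_start_queues()/nvme_kill_queues() play the
unquiesce role of this model and blk_mq_freeze_queue_wait() is the waiter.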

So could you post the debugfs log when the hang happens so that we may
find some clue?

Also, I don't think your issue is caused by this patchset, since
nvme_remove_dead_ctrl_work() and nvme_remove() aren't touched by it.
That means this issue may be triggered without this patchset too,
so could we start reviewing this patchset in the meantime?


Thanks,
Ming

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2018-05-16 22:18 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-11 12:29 [PATCH V5 0/9] nvme: pci: fix & improve timeout handling Ming Lei
2018-05-11 12:29 ` [PATCH V5 1/9] block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout() Ming Lei
2018-05-11 12:29 ` [PATCH V5 2/9] nvme: pci: cover timeout for admin commands running in EH Ming Lei
2018-05-11 12:29 ` [PATCH V5 3/9] nvme: pci: only wait freezing if queue is frozen Ming Lei
2018-05-11 12:29 ` [PATCH V5 4/9] nvme: pci: freeze queue in nvme_dev_disable() in case of error recovery Ming Lei
2018-05-11 12:29 ` [PATCH V5 5/9] nvme: pci: prepare for supporting error recovery from resetting context Ming Lei
2018-05-11 12:29 ` [PATCH V5 6/9] nvme: pci: move error handling out of nvme_reset_dev() Ming Lei
2018-05-11 12:29 ` [PATCH V5 7/9] nvme: pci: don't unfreeze queue until controller state updating succeeds Ming Lei
2018-05-11 12:29 ` [PATCH V5 8/9] nvme: core: introduce nvme_force_change_ctrl_state() Ming Lei
2018-05-11 12:29 ` [PATCH V5 9/9] nvme: pci: support nested EH Ming Lei
2018-05-15 10:02   ` jianchao.wang
2018-05-15 12:39     ` Ming Lei
2018-05-11 20:50 ` [PATCH V5 0/9] nvme: pci: fix & improve timeout handling Keith Busch
2018-05-12  0:21   ` Ming Lei
2018-05-14 15:18     ` Keith Busch
2018-05-14 23:47       ` Ming Lei
2018-05-15  0:33         ` Keith Busch
2018-05-15  9:08           ` Ming Lei
2018-05-16  4:31           ` Ming Lei
2018-05-16 15:18             ` Keith Busch
2018-05-16 22:18               ` Ming Lei
2018-05-14  8:21 ` jianchao.wang
2018-05-14  9:38   ` Ming Lei
2018-05-14 10:05     ` jianchao.wang
2018-05-14 12:22       ` Ming Lei
2018-05-15  0:33         ` Ming Lei
2018-05-15  9:56           ` jianchao.wang
2018-05-15 12:56             ` Ming Lei
2018-05-16  3:03               ` jianchao.wang
2018-05-16  2:04             ` Ming Lei
2018-05-16  2:09               ` Ming Lei
2018-05-16  2:15                 ` jianchao.wang

This is a public inbox; see the mirroring instructions for how to clone and
mirror all data and code used for this inbox, as well as the URLs for the
NNTP newsgroup(s).