* [PATCH V3 0/8] nvme: pci: fix & improve timeout handling
@ 2018-05-03  3:17 ` Ming Lei
  0 siblings, 0 replies; 38+ messages in thread
From: Ming Lei @ 2018-05-03  3:17 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

Hi,

The 1st patch introduces blk_quiesce_timeout() and blk_unquiesce_timeout()
for NVMe, and meanwhile fixes blk_sync_queue().

The 2nd patch covers the timeout of admin commands used for recovering the
controller, so that a possible deadlock is avoided.

Patches 3-5 fix the race between nvme_start_freeze() and
nvme_unfreeze().

The last 3 patches fix several races w.r.t. the NVMe timeout handler, and
finally make blktests block/011 pass. Meanwhile the NVMe timeout model
gets simplified a lot.

gitweb:
	https://github.com/ming1/linux/commits/v4.17-rc-nvme-timeout.V3

V3:
	- fix one new race related to freezing in patch 4; without this patch,
	nvme_reset_work() may hang forever
	- rewrite the last 3 patches, and avoid breaking nvme_reset_ctrl*()

V2:
	- fix draining the timeout work, so there is no need to change the
	return value of .timeout()
	- fix the race between nvme_start_freeze() and nvme_unfreeze()
	- cover the timeout of admin commands running in EH

Ming Lei (8):
  block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout()
  nvme: pci: cover timeout for admin commands running in EH
  nvme: pci: only wait freezing if queue is frozen
  nvme: pci: freeze queue in nvme_dev_disable() in case of error
    recovery
  nvme: fix race between freeze queues and unfreeze queues
  nvme: pci: split controller resetting into two parts
  nvme: pci: recover controller reliably
  nvme: pci: simplify timeout handling

 block/blk-core.c         |  21 ++-
 block/blk-mq.c           |   9 ++
 block/blk-timeout.c      |   5 +-
 drivers/nvme/host/core.c |  47 ++++++-
 drivers/nvme/host/nvme.h |   6 +
 drivers/nvme/host/pci.c  | 344 ++++++++++++++++++++++++++++++++++++++++-------
 include/linux/blkdev.h   |  13 ++
 7 files changed, 391 insertions(+), 54 deletions(-)

Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>

-- 
2.9.5

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH V3 1/8] block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout()
  2018-05-03  3:17 ` Ming Lei
@ 2018-05-03  3:17   ` Ming Lei
  -1 siblings, 0 replies; 38+ messages in thread
From: Ming Lei @ 2018-05-03  3:17 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, Bart Van Assche,
	Jianchao Wang, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	Laurence Oberman

It turns out that the current way can't drain timeouts completely, because
mod_timer() can be triggered from the work function, which may run right
inside the timeout work being synced:

        del_timer_sync(&q->timeout);
        cancel_work_sync(&q->timeout_work);

This patch introduces a 'timeout_off' flag to fix the issue; this simple
approach turns out to work.

blk_quiesce_timeout() and blk_unquiesce_timeout() are also introduced for
draining timeouts, which is needed by NVMe.
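
As a minimal usage sketch (hypothetical caller, not part of this patch), a
driver would bracket work that must not race with block-layer timeout
handling like this:

        /* stop the timeout timer and work, and prevent re-arming */
        blk_quiesce_timeout(q);

        /* ... recovery steps that must not race with blk_timeout_work() ... */

        /* clear the quiesce flag and re-arm the timeout timer */
        blk_unquiesce_timeout(q);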

Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c       | 21 +++++++++++++++++++--
 block/blk-mq.c         |  9 +++++++++
 block/blk-timeout.c    |  5 ++++-
 include/linux/blkdev.h | 13 +++++++++++++
 4 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 85909b431eb0..c277f1023703 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -392,6 +392,22 @@ void blk_stop_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_stop_queue);
 
+void blk_unquiesce_timeout(struct request_queue *q)
+{
+	blk_mark_timeout_quiesce(q, false);
+	mod_timer(&q->timeout, jiffies + q->rq_timeout);
+}
+EXPORT_SYMBOL(blk_unquiesce_timeout);
+
+void blk_quiesce_timeout(struct request_queue *q)
+{
+	blk_mark_timeout_quiesce(q, true);
+
+	del_timer_sync(&q->timeout);
+	cancel_work_sync(&q->timeout_work);
+}
+EXPORT_SYMBOL(blk_quiesce_timeout);
+
 /**
  * blk_sync_queue - cancel any pending callbacks on a queue
  * @q: the queue
@@ -412,8 +428,7 @@ EXPORT_SYMBOL(blk_stop_queue);
  */
 void blk_sync_queue(struct request_queue *q)
 {
-	del_timer_sync(&q->timeout);
-	cancel_work_sync(&q->timeout_work);
+	blk_quiesce_timeout(q);
 
 	if (q->mq_ops) {
 		struct blk_mq_hw_ctx *hctx;
@@ -425,6 +440,8 @@ void blk_sync_queue(struct request_queue *q)
 	} else {
 		cancel_delayed_work_sync(&q->delay_work);
 	}
+
+	blk_mark_timeout_quiesce(q, false);
 }
 EXPORT_SYMBOL(blk_sync_queue);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c3621453ad87..d0a5dc29c8ef 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -901,6 +901,15 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	};
 	struct blk_mq_hw_ctx *hctx;
 	int i;
+	bool timeout_off;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	timeout_off = q->timeout_off;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	if (timeout_off)
+		return;
 
 	/* A deadlock might occur if a request is stuck requiring a
 	 * timeout at the same time a queue freeze is waiting
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 652d4d4d3e97..ffd0b609091e 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -136,12 +136,15 @@ void blk_timeout_work(struct work_struct *work)
 
 	spin_lock_irqsave(q->queue_lock, flags);
 
+	if (q->timeout_off)
+		goto exit;
+
 	list_for_each_entry_safe(rq, tmp, &q->timeout_list, timeout_list)
 		blk_rq_check_expired(rq, &next, &next_set);
 
 	if (next_set)
 		mod_timer(&q->timeout, round_jiffies_up(next));
-
+exit:
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5c4eee043191..a2cc4aaecf50 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -584,6 +584,7 @@ struct request_queue {
 	struct timer_list	timeout;
 	struct work_struct	timeout_work;
 	struct list_head	timeout_list;
+	bool			timeout_off;
 
 	struct list_head	icq_list;
 #ifdef CONFIG_BLK_CGROUP
@@ -1017,6 +1018,18 @@ extern void blk_execute_rq(struct request_queue *, struct gendisk *,
 extern void blk_execute_rq_nowait(struct request_queue *, struct gendisk *,
 				  struct request *, int, rq_end_io_fn *);
 
+static inline void blk_mark_timeout_quiesce(struct request_queue *q, bool quiesce)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	q->timeout_off = quiesce;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+extern void blk_quiesce_timeout(struct request_queue *q);
+extern void blk_unquiesce_timeout(struct request_queue *q);
+
 int blk_status_to_errno(blk_status_t status);
 blk_status_t errno_to_blk_status(int errno);
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH V3 2/8] nvme: pci: cover timeout for admin commands running in EH
  2018-05-03  3:17 ` Ming Lei
@ 2018-05-03  3:17   ` Ming Lei
  -1 siblings, 0 replies; 38+ messages in thread
From: Ming Lei @ 2018-05-03  3:17 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

When admin commands are used in EH for recovering the controller, we have to
cover their timeout ourselves and can't depend on the block layer's timeout
handling, since a deadlock may be caused if these commands are timed out by
the block layer again.
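
Condensed from the diff below (setup and error handling omitted), the EH
variant issues the command without blocking and bounds the wait itself
instead of relying on the block-layer timeout:

        req = nvme_alloc_request(q, &c, 0, NVME_QID_ANY);
        req->timeout = ADMIN_TIMEOUT;
        req->end_io_data = &wait;

        /* the end_io callback completes 'wait' */
        blk_execute_rq_nowait(q, NULL, req, false, nvme_set_host_mem_end_io);

        /* the timeout is covered here, not by the block layer */
        wait_for_completion_io_timeout(&wait, ADMIN_TIMEOUT);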

Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 77 ++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 66 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fbc71fac6f1e..0e6cd605164a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1733,21 +1733,28 @@ static inline void nvme_release_cmb(struct nvme_dev *dev)
 	}
 }
 
-static int nvme_set_host_mem(struct nvme_dev *dev, u32 bits)
+static void nvme_init_set_host_mem_cmd(struct nvme_dev *dev,
+		struct nvme_command *c, u32 bits)
 {
 	u64 dma_addr = dev->host_mem_descs_dma;
+
+	memset(c, 0, sizeof(*c));
+	c->features.opcode	= nvme_admin_set_features;
+	c->features.fid		= cpu_to_le32(NVME_FEAT_HOST_MEM_BUF);
+	c->features.dword11	= cpu_to_le32(bits);
+	c->features.dword12	= cpu_to_le32(dev->host_mem_size >>
+					      ilog2(dev->ctrl.page_size));
+	c->features.dword13	= cpu_to_le32(lower_32_bits(dma_addr));
+	c->features.dword14	= cpu_to_le32(upper_32_bits(dma_addr));
+	c->features.dword15	= cpu_to_le32(dev->nr_host_mem_descs);
+}
+
+static int nvme_set_host_mem(struct nvme_dev *dev, u32 bits)
+{
 	struct nvme_command c;
 	int ret;
 
-	memset(&c, 0, sizeof(c));
-	c.features.opcode	= nvme_admin_set_features;
-	c.features.fid		= cpu_to_le32(NVME_FEAT_HOST_MEM_BUF);
-	c.features.dword11	= cpu_to_le32(bits);
-	c.features.dword12	= cpu_to_le32(dev->host_mem_size >>
-					      ilog2(dev->ctrl.page_size));
-	c.features.dword13	= cpu_to_le32(lower_32_bits(dma_addr));
-	c.features.dword14	= cpu_to_le32(upper_32_bits(dma_addr));
-	c.features.dword15	= cpu_to_le32(dev->nr_host_mem_descs);
+	nvme_init_set_host_mem_cmd(dev, &c, bits);
 
 	ret = nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
 	if (ret) {
@@ -1758,6 +1765,54 @@ static int nvme_set_host_mem(struct nvme_dev *dev, u32 bits)
 	return ret;
 }
 
+static void nvme_set_host_mem_end_io(struct request *rq, blk_status_t sts)
+{
+	struct completion *waiting = rq->end_io_data;
+
+	rq->end_io_data = NULL;
+
+	/*
+	 * complete last, if this is a stack request the process (and thus
+	 * the rq pointer) could be invalid right after this complete()
+	 */
+	complete(waiting);
+}
+
+/*
+ * This function can be used inside timeout handler, when the block layer
+ * timeout may not work, so this function has to cover the timeout by itself.
+ */
+static int nvme_set_host_mem_timeout(struct nvme_dev *dev, u32 bits)
+{
+	DECLARE_COMPLETION_ONSTACK(wait);
+	struct nvme_command c;
+	struct request_queue *q = dev->ctrl.admin_q;
+	struct request *req;
+	int ret;
+
+	nvme_init_set_host_mem_cmd(dev, &c, bits);
+
+	req = nvme_alloc_request(q, &c, 0, NVME_QID_ANY);
+	if (IS_ERR(req))
+		return PTR_ERR(req);
+
+	req->timeout = ADMIN_TIMEOUT;
+	req->end_io_data = &wait;
+
+	blk_execute_rq_nowait(q, NULL, req, false,
+			nvme_set_host_mem_end_io);
+	wait_for_completion_io_timeout(&wait, ADMIN_TIMEOUT);
+
+	if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
+		ret = -EINTR;
+	else
+		ret = nvme_req(req)->status;
+
+	blk_mq_free_request(req);
+
+	return ret;
+}
+
 static void nvme_free_host_mem(struct nvme_dev *dev)
 {
 	int i;
@@ -2216,7 +2271,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 		 * but I'd rather be safe than sorry..
 		 */
 		if (dev->host_mem_descs)
-			nvme_set_host_mem(dev, 0);
+			nvme_set_host_mem_timeout(dev, 0);
 		nvme_disable_io_queues(dev);
 		nvme_disable_admin_queue(dev, shutdown);
 	}
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH V3 3/8] nvme: pci: only wait freezing if queue is frozen
  2018-05-03  3:17 ` Ming Lei
@ 2018-05-03  3:17   ` Ming Lei
  -1 siblings, 0 replies; 38+ messages in thread
From: Ming Lei @ 2018-05-03  3:17 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

In nvme_dev_disable(), called while shutting down the controller,
nvme_wait_freeze_timeout() may be run on a controller whose queues were
never frozen, so add a check to avoid that case.
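
The change boils down to remembering whether this call actually started the
freeze (condensed from the diff below):

        if (dev->ctrl.state == NVME_CTRL_LIVE ||
            dev->ctrl.state == NVME_CTRL_RESETTING) {
                nvme_start_freeze(&dev->ctrl);
                frozen = true;
        }

        /* ... */

        if (shutdown && frozen)
                nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);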

Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 0e6cd605164a..8172ee584130 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2240,14 +2240,17 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	int i;
 	bool dead = true;
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
+	bool frozen = false;
 
 	mutex_lock(&dev->shutdown_lock);
 	if (pci_is_enabled(pdev)) {
 		u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
 		if (dev->ctrl.state == NVME_CTRL_LIVE ||
-		    dev->ctrl.state == NVME_CTRL_RESETTING)
+		    dev->ctrl.state == NVME_CTRL_RESETTING) {
 			nvme_start_freeze(&dev->ctrl);
+			frozen = true;
+		}
 		dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
 			pdev->error_state  != pci_channel_io_normal);
 	}
@@ -2257,7 +2260,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	 * doing a safe shutdown.
 	 */
 	if (!dead) {
-		if (shutdown)
+		if (shutdown && frozen)
 			nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
 	}
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH V3 4/8] nvme: pci: freeze queue in nvme_dev_disable() in case of error recovery
  2018-05-03  3:17 ` Ming Lei
@ 2018-05-03  3:17   ` Ming Lei
  -1 siblings, 0 replies; 38+ messages in thread
From: Ming Lei @ 2018-05-03  3:17 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

When nvme_dev_disable() is used for error recovery, we should always
freeze queues before shutting down the controller:

- the reset handler assumes queues are frozen, and will wait_freeze &
unfreeze them explicitly; if queues aren't frozen during nvme_dev_disable(),
the reset handler may wait forever even though there aren't any requests
allocated

- this way we may also avoid cancelling lots of requests during error
recovery

This patch introduces a 'freeze_queue' parameter to fix the issue (see the
condensed sketch below).
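
Condensed from the diff below, 'freeze_queue' only applies to the
non-shutdown (error recovery) case, and the freeze is started before taking
the shutdown lock:

        static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown,
                        bool freeze_queue)
        {
                /* shutdown keeps the inline freeze & wait_freeze_timeout */
                if (shutdown)
                        freeze_queue = false;

                if (freeze_queue)
                        nvme_start_freeze(&dev->ctrl);

                mutex_lock(&dev->shutdown_lock);
                /* ... */
        }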

Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 47 ++++++++++++++++++++++++++++++++---------------
 1 file changed, 32 insertions(+), 15 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8172ee584130..ef80e064a62c 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -69,7 +69,8 @@ struct nvme_dev;
 struct nvme_queue;
 
 static void nvme_process_cq(struct nvme_queue *nvmeq);
-static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
+static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
+		freeze_queue);
 
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
@@ -1197,7 +1198,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	 */
 	if (nvme_should_reset(dev, csts)) {
 		nvme_warn_reset(dev, csts);
-		nvme_dev_disable(dev, false);
+		nvme_dev_disable(dev, false, true);
 		nvme_reset_ctrl(&dev->ctrl);
 		return BLK_EH_HANDLED;
 	}
@@ -1224,7 +1225,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		dev_warn(dev->ctrl.device,
 			 "I/O %d QID %d timeout, disable controller\n",
 			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false);
+		nvme_dev_disable(dev, false, false);
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
 		return BLK_EH_HANDLED;
 	default:
@@ -1240,7 +1241,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		dev_warn(dev->ctrl.device,
 			 "I/O %d QID %d timeout, reset controller\n",
 			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false);
+		nvme_dev_disable(dev, false, true);
 		nvme_reset_ctrl(&dev->ctrl);
 
 		/*
@@ -2235,19 +2236,35 @@ static void nvme_pci_disable(struct nvme_dev *dev)
 	}
 }
 
-static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
+/*
+ * Resetting often follows nvme_dev_disable(), so queues need to be frozen
+ * before resetting.
+ */
+static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
+		freeze_queue)
 {
 	int i;
 	bool dead = true;
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 	bool frozen = false;
 
+	/*
+	 * 'freeze_queue' is only valid for non-shutdown, and we do
+	 * inline freeze & wait_freeze_timeout for shutdown just for
+	 * completing as many as possible requests before shutdown
+	 */
+	if (shutdown)
+		freeze_queue = false;
+
+	if (freeze_queue)
+		nvme_start_freeze(&dev->ctrl);
+
 	mutex_lock(&dev->shutdown_lock);
 	if (pci_is_enabled(pdev)) {
 		u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
-		if (dev->ctrl.state == NVME_CTRL_LIVE ||
-		    dev->ctrl.state == NVME_CTRL_RESETTING) {
+		if (shutdown && (dev->ctrl.state == NVME_CTRL_LIVE ||
+		    dev->ctrl.state == NVME_CTRL_RESETTING)) {
 			nvme_start_freeze(&dev->ctrl);
 			frozen = true;
 		}
@@ -2339,7 +2356,7 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
 	dev_warn(dev->ctrl.device, "Removing after probe failure status: %d\n", status);
 
 	nvme_get_ctrl(&dev->ctrl);
-	nvme_dev_disable(dev, false);
+	nvme_dev_disable(dev, false, false);
 	if (!queue_work(nvme_wq, &dev->remove_work))
 		nvme_put_ctrl(&dev->ctrl);
 }
@@ -2360,7 +2377,7 @@ static void nvme_reset_work(struct work_struct *work)
 	 * moving on.
 	 */
 	if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
-		nvme_dev_disable(dev, false);
+		nvme_dev_disable(dev, false, false);
 
 	/*
 	 * Introduce CONNECTING state from nvme-fc/rdma transports to mark the
@@ -2609,7 +2626,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 static void nvme_reset_prepare(struct pci_dev *pdev)
 {
 	struct nvme_dev *dev = pci_get_drvdata(pdev);
-	nvme_dev_disable(dev, false);
+	nvme_dev_disable(dev, false, true);
 }
 
 static void nvme_reset_done(struct pci_dev *pdev)
@@ -2621,7 +2638,7 @@ static void nvme_reset_done(struct pci_dev *pdev)
 static void nvme_shutdown(struct pci_dev *pdev)
 {
 	struct nvme_dev *dev = pci_get_drvdata(pdev);
-	nvme_dev_disable(dev, true);
+	nvme_dev_disable(dev, true, false);
 }
 
 /*
@@ -2640,13 +2657,13 @@ static void nvme_remove(struct pci_dev *pdev)
 
 	if (!pci_device_is_present(pdev)) {
 		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
-		nvme_dev_disable(dev, false);
+		nvme_dev_disable(dev, false, false);
 	}
 
 	flush_work(&dev->ctrl.reset_work);
 	nvme_stop_ctrl(&dev->ctrl);
 	nvme_remove_namespaces(&dev->ctrl);
-	nvme_dev_disable(dev, true);
+	nvme_dev_disable(dev, true, false);
 	nvme_free_host_mem(dev);
 	nvme_dev_remove_admin(dev);
 	nvme_free_queues(dev, 0);
@@ -2680,7 +2697,7 @@ static int nvme_suspend(struct device *dev)
 	struct pci_dev *pdev = to_pci_dev(dev);
 	struct nvme_dev *ndev = pci_get_drvdata(pdev);
 
-	nvme_dev_disable(ndev, true);
+	nvme_dev_disable(ndev, true, false);
 	return 0;
 }
 
@@ -2712,7 +2729,7 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
 	case pci_channel_io_frozen:
 		dev_warn(dev->ctrl.device,
 			"frozen state error detected, reset controller\n");
-		nvme_dev_disable(dev, false);
+		nvme_dev_disable(dev, false, true);
 		return PCI_ERS_RESULT_NEED_RESET;
 	case pci_channel_io_perm_failure:
 		dev_warn(dev->ctrl.device,
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH V3 5/8] nvme: fix race between freeze queues and unfreeze queues
  2018-05-03  3:17 ` Ming Lei
@ 2018-05-03  3:17   ` Ming Lei
  -1 siblings, 0 replies; 38+ messages in thread
From: Ming Lei @ 2018-05-03  3:17 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

nvme_dev_disable() and resetting the controller are both required for
recovering the controller, but the two are run from different contexts.

nvme_start_freeze() is run from nvme_dev_disable(), and nvme_unfreeze()
is run from the resetting context. Unfortunately a timeout may be triggered
while draining IO during controller reset, so nvme_start_freeze() may
be run several times.

This patch fixes the race between nvme_start_freeze() and
nvme_unfreeze() by introducing a 'queues_freezed' flag and a
'freeze_mutex' lock.
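
Condensed from the diff below, nvme_start_freeze() now only starts the
freeze once, until a matching nvme_unfreeze() clears the flag:

        mutex_lock(&ctrl->freeze_mutex);
        if (!ctrl->queues_freezed) {
                down_read(&ctrl->namespaces_rwsem);
                list_for_each_entry(ns, &ctrl->namespaces, list)
                        blk_freeze_queue_start(ns->queue);
                up_read(&ctrl->namespaces_rwsem);
                ctrl->queues_freezed = true;
        }
        mutex_unlock(&ctrl->freeze_mutex);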

Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/core.c | 29 +++++++++++++++++++++--------
 drivers/nvme/host/nvme.h |  3 +++
 2 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 9df4f71e58ca..f9028873298e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3447,6 +3447,9 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	INIT_WORK(&ctrl->fw_act_work, nvme_fw_act_work);
 	INIT_WORK(&ctrl->delete_work, nvme_delete_ctrl_work);
 
+	ctrl->queues_freezed = false;
+	mutex_init(&ctrl->freeze_mutex);
+
 	ret = ida_simple_get(&nvme_instance_ida, 0, 0, GFP_KERNEL);
 	if (ret < 0)
 		goto out;
@@ -3526,10 +3529,15 @@ void nvme_unfreeze(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
 
-	down_read(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list)
-		blk_mq_unfreeze_queue(ns->queue);
-	up_read(&ctrl->namespaces_rwsem);
+	mutex_lock(&ctrl->freeze_mutex);
+	if (ctrl->queues_freezed) {
+		down_read(&ctrl->namespaces_rwsem);
+		list_for_each_entry(ns, &ctrl->namespaces, list)
+			blk_mq_unfreeze_queue(ns->queue);
+		up_read(&ctrl->namespaces_rwsem);
+		ctrl->queues_freezed = false;
+	}
+	mutex_unlock(&ctrl->freeze_mutex);
 }
 EXPORT_SYMBOL_GPL(nvme_unfreeze);
 
@@ -3562,10 +3570,15 @@ void nvme_start_freeze(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
 
-	down_read(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list)
-		blk_freeze_queue_start(ns->queue);
-	up_read(&ctrl->namespaces_rwsem);
+	mutex_lock(&ctrl->freeze_mutex);
+	if (!ctrl->queues_freezed) {
+		down_read(&ctrl->namespaces_rwsem);
+		list_for_each_entry(ns, &ctrl->namespaces, list)
+			blk_freeze_queue_start(ns->queue);
+		up_read(&ctrl->namespaces_rwsem);
+		ctrl->queues_freezed = true;
+	}
+	mutex_unlock(&ctrl->freeze_mutex);
 }
 EXPORT_SYMBOL_GPL(nvme_start_freeze);
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 061fecfd44f5..a19b7f04ac24 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -149,6 +149,9 @@ struct nvme_ctrl {
 	struct work_struct reset_work;
 	struct work_struct delete_work;
 
+	bool queues_freezed;
+	struct mutex freeze_mutex;
+
 	struct nvme_subsystem *subsys;
 	struct list_head subsys_entry;
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH V3 6/8] nvme: pci: split controller resetting into two parts
  2018-05-03  3:17 ` Ming Lei
@ 2018-05-03  3:17   ` Ming Lei
  -1 siblings, 0 replies; 38+ messages in thread
From: Ming Lei @ 2018-05-03  3:17 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

This patch splits controller resetting into the following two parts:

1) the real resetting part

2) a 2nd part that drains IO and updates the controller state

The patch prepares for supporting reliable controller recovery: for example,
an IO timeout may still be triggered while running part 2 above, so we still
have to recover the hardware in that situation.

No functional change; the resulting nvme_reset_work() is sketched below.
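
Condensed from the diff below, nvme_reset_work() now just sequences the two
parts:

        static void nvme_reset_work(struct work_struct *work)
        {
                struct nvme_dev *dev =
                        container_of(work, struct nvme_dev, ctrl.reset_work);
                int ret;

                if (WARN_ON(dev->ctrl.state != NVME_CTRL_RESETTING))
                        return;

                /* part 1: the real resetting */
                ret = nvme_pre_reset_dev(dev);
                if (ret) {
                        nvme_remove_dead_ctrl(dev, ret);
                        return;
                }

                /* part 2: drain IO and update controller state */
                nvme_post_reset_dev(dev);
        }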

Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 37 +++++++++++++++++++++++++++----------
 1 file changed, 27 insertions(+), 10 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index ef80e064a62c..16d7507bfd79 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2361,16 +2361,10 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
 		nvme_put_ctrl(&dev->ctrl);
 }
 
-static void nvme_reset_work(struct work_struct *work)
+static int nvme_pre_reset_dev(struct nvme_dev *dev)
 {
-	struct nvme_dev *dev =
-		container_of(work, struct nvme_dev, ctrl.reset_work);
 	bool was_suspend = !!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL);
 	int result = -ENODEV;
-	enum nvme_ctrl_state new_state = NVME_CTRL_LIVE;
-
-	if (WARN_ON(dev->ctrl.state != NVME_CTRL_RESETTING))
-		goto out;
 
 	/*
 	 * If we're called to reset a live controller first shut it down before
@@ -2430,8 +2424,13 @@ static void nvme_reset_work(struct work_struct *work)
 	}
 
 	result = nvme_setup_io_queues(dev);
-	if (result)
-		goto out;
+ out:
+	return result;
+}
+
+static void nvme_post_reset_dev(struct nvme_dev *dev)
+{
+	enum nvme_ctrl_state new_state = NVME_CTRL_LIVE;
 
 	/*
 	 * Keep the controller around but remove all namespaces if we don't have
@@ -2465,7 +2464,25 @@ static void nvme_reset_work(struct work_struct *work)
 	return;
 
  out:
-	nvme_remove_dead_ctrl(dev, result);
+	nvme_remove_dead_ctrl(dev, 0);
+}
+
+static void nvme_reset_work(struct work_struct *work)
+{
+	struct nvme_dev *dev =
+		container_of(work, struct nvme_dev, ctrl.reset_work);
+	int ret;
+
+	if (WARN_ON(dev->ctrl.state != NVME_CTRL_RESETTING))
+		return;
+
+	ret = nvme_pre_reset_dev(dev);
+	if (ret) {
+		nvme_remove_dead_ctrl(dev, ret);
+		return;
+	}
+
+	nvme_post_reset_dev(dev);
 }
 
 static void nvme_remove_dead_ctrl_work(struct work_struct *work)
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH V3 7/8] nvme: pci: recover controller reliably
  2018-05-03  3:17 ` Ming Lei
@ 2018-05-03  3:17   ` Ming Lei
  -1 siblings, 0 replies; 38+ messages in thread
From: Ming Lei @ 2018-05-03  3:17 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

While draining IO during a controller reset, errors may still happen and
timeouts can be triggered; the current implementation can't recover the
controller any more in that situation. This patch fixes the issue with the
following approach (see the condensed sketch after the list):

- introduce eh_reset_work, and move draining IO and updating controller
state into this work function

- reset the controller right after nvme_dev_disable(), then schedule
eh_reset_work for draining IO & updating controller state

- introduce a reset lock to synchronize the above two parts

- move nvme_start_queues() into the controller resetting part, so that
timeout handling won't quiesce queues any more while IO is being drained
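
Condensed from the diff below (state checks and error handling trimmed), the
timeout handler now calls nvme_eh_reset(), which performs the real reset
inline under the reset lock and defers IO draining to eh_reset_work:

        static int nvme_eh_reset(struct nvme_dev *dev)
        {
                int ret;

                mutex_lock(&dev->reset_lock);

                /* part 1: the real resetting, done inline */
                ret = nvme_pre_reset_dev(dev);
                if (ret) {
                        nvme_remove_dead_ctrl(dev, ret);
                        mutex_unlock(&dev->reset_lock);
                        return ret;
                }
                mutex_unlock(&dev->reset_lock);

                /* part 2: drain IO & update state from eh_reset_work */
                queue_work(nvme_reset_wq, &dev->eh_reset_work);

                return 0;
        }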

Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 58 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 16d7507bfd79..64bbccdb5091 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -71,6 +71,7 @@ struct nvme_queue;
 static void nvme_process_cq(struct nvme_queue *nvmeq);
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
 		freeze_queue);
+static int nvme_eh_reset(struct nvme_dev *dev);
 
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
@@ -113,6 +114,10 @@ struct nvme_dev {
 	dma_addr_t host_mem_descs_dma;
 	struct nvme_host_mem_buf_desc *host_mem_descs;
 	void **host_mem_desc_bufs;
+
+	/* reset for timeout */
+	struct mutex reset_lock;
+	struct work_struct eh_reset_work;
 };
 
 static int io_queue_depth_set(const char *val, const struct kernel_param *kp)
@@ -1199,7 +1204,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	if (nvme_should_reset(dev, csts)) {
 		nvme_warn_reset(dev, csts);
 		nvme_dev_disable(dev, false, true);
-		nvme_reset_ctrl(&dev->ctrl);
+		nvme_eh_reset(dev);
 		return BLK_EH_HANDLED;
 	}
 
@@ -1242,7 +1247,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 			 "I/O %d QID %d timeout, reset controller\n",
 			 req->tag, nvmeq->qid);
 		nvme_dev_disable(dev, false, true);
-		nvme_reset_ctrl(&dev->ctrl);
+		nvme_eh_reset(dev);
 
 		/*
 		 * Mark the request as handled, since the inline shutdown
@@ -2424,6 +2429,10 @@ static int nvme_pre_reset_dev(struct nvme_dev *dev)
 	}
 
 	result = nvme_setup_io_queues(dev);
+	if (result)
+		goto out;
+
+	nvme_start_queues(&dev->ctrl);
  out:
 	return result;
 }
@@ -2432,6 +2441,7 @@ static void nvme_post_reset_dev(struct nvme_dev *dev)
 {
 	enum nvme_ctrl_state new_state = NVME_CTRL_LIVE;
 
+	mutex_lock(&dev->reset_lock);
 	/*
 	 * Keep the controller around but remove all namespaces if we don't have
 	 * any working I/O queue.
@@ -2442,8 +2452,12 @@ static void nvme_post_reset_dev(struct nvme_dev *dev)
 		nvme_remove_namespaces(&dev->ctrl);
 		new_state = NVME_CTRL_ADMIN_ONLY;
 	} else {
-		nvme_start_queues(&dev->ctrl);
+		mutex_unlock(&dev->reset_lock);
+
+		/* error may happen during draining IO again */
 		nvme_wait_freeze(&dev->ctrl);
+
+		mutex_lock(&dev->reset_lock);
 		/* hit this only when allocate tagset fails */
 		if (nvme_dev_add(dev))
 			new_state = NVME_CTRL_ADMIN_ONLY;
@@ -2461,10 +2475,12 @@ static void nvme_post_reset_dev(struct nvme_dev *dev)
 	}
 
 	nvme_start_ctrl(&dev->ctrl);
+	mutex_unlock(&dev->reset_lock);
 	return;
 
  out:
 	nvme_remove_dead_ctrl(dev, 0);
+	mutex_unlock(&dev->reset_lock);
 }
 
 static void nvme_reset_work(struct work_struct *work)
@@ -2485,6 +2501,43 @@ static void nvme_reset_work(struct work_struct *work)
 	nvme_post_reset_dev(dev);
 }
 
+static void nvme_eh_reset_work(struct work_struct *work)
+{
+	struct nvme_dev *dev =
+		container_of(work, struct nvme_dev, eh_reset_work);
+
+	nvme_post_reset_dev(dev);
+}
+
+static int nvme_eh_reset(struct nvme_dev *dev)
+{
+	int ret = -EBUSY;
+
+	mutex_lock(&dev->reset_lock);
+	if (dev->ctrl.state != NVME_CTRL_RESETTING &&
+	    dev->ctrl.state != NVME_CTRL_CONNECTING) {
+	    if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING))
+		goto exit;
+	}
+
+	ret = nvme_pre_reset_dev(dev);
+	if (ret) {
+		dev_warn(dev->ctrl.device,
+			"failed to pre-reset controller: %d\n", ret);
+		nvme_remove_dead_ctrl(dev, ret);
+		goto exit;
+	}
+
+	mutex_unlock(&dev->reset_lock);
+
+	queue_work(nvme_reset_wq, &dev->eh_reset_work);
+
+	return 0;
+ exit:
+	mutex_unlock(&dev->reset_lock);
+	return ret;
+}
+
 static void nvme_remove_dead_ctrl_work(struct work_struct *work)
 {
 	struct nvme_dev *dev = container_of(work, struct nvme_dev, remove_work);
@@ -2606,6 +2659,8 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (result)
 		goto put_pci;
 
+	mutex_init(&dev->reset_lock);
+	INIT_WORK(&dev->eh_reset_work, nvme_eh_reset_work);
 	INIT_WORK(&dev->ctrl.reset_work, nvme_reset_work);
 	INIT_WORK(&dev->remove_work, nvme_remove_dead_ctrl_work);
 	mutex_init(&dev->shutdown_lock);
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH V3 8/8] nvme: pci: simplify timeout handling
  2018-05-03  3:17 ` Ming Lei
@ 2018-05-03  3:17   ` Ming Lei
  -1 siblings, 0 replies; 38+ messages in thread
From: Ming Lei @ 2018-05-03  3:17 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

When a request times out, nvme_timeout() currently handles it in the
following way:

	nvme_dev_disable(dev, false);
	nvme_reset_ctrl(&dev->ctrl);
	return BLK_EH_HANDLED.

The block layer's timeout handler is per-request-queue, which means each
namespace's error handling has to shut down & reset the whole controller.

This patch moves the error handling into one per-controller kthread:

- Move timeout handling into one EH thread, and wake up this thread to
handle timed-out events.
- Return BLK_EH_RESET_TIMER for the timed-out request so that it is
handled in the EH thread after the controller is shut down; this avoids
the risk of memory corruption.
- Inside the EH handler, timeout work is drained before canceling in-flight
requests, so all requests can be canceled (completed) by the EH handler.

With this approach, we no longer need to worry about all kinds of races,
and the model is much simpler than the current concurrent timeout model.

Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/core.c |  22 ++++++++
 drivers/nvme/host/nvme.h |   3 ++
 drivers/nvme/host/pci.c  | 133 ++++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 145 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f9028873298e..664a268accd1 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3582,6 +3582,28 @@ void nvme_start_freeze(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_start_freeze);
 
+void nvme_unquiesce_timeout(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	down_read(&ctrl->namespaces_rwsem);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		blk_unquiesce_timeout(ns->queue);
+	up_read(&ctrl->namespaces_rwsem);
+}
+EXPORT_SYMBOL_GPL(nvme_unquiesce_timeout);
+
+void nvme_quiesce_timeout(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	down_read(&ctrl->namespaces_rwsem);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		blk_quiesce_timeout(ns->queue);
+	up_read(&ctrl->namespaces_rwsem);
+}
+EXPORT_SYMBOL_GPL(nvme_quiesce_timeout);
+
 void nvme_stop_queues(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index a19b7f04ac24..956ee19ff403 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -19,6 +19,7 @@
 #include <linux/pci.h>
 #include <linux/kref.h>
 #include <linux/blk-mq.h>
+#include <linux/blkdev.h>
 #include <linux/lightnvm.h>
 #include <linux/sed-opal.h>
 #include <linux/fault-inject.h>
@@ -405,6 +406,8 @@ int nvme_sec_submit(void *data, u16 spsp, u8 secp, void *buffer, size_t len,
 void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
 		union nvme_result *res);
 
+void nvme_unquiesce_timeout(struct nvme_ctrl *ctrl);
+void nvme_quiesce_timeout(struct nvme_ctrl *ctrl);
 void nvme_stop_queues(struct nvme_ctrl *ctrl);
 void nvme_start_queues(struct nvme_ctrl *ctrl);
 void nvme_kill_queues(struct nvme_ctrl *ctrl);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 64bbccdb5091..0bdb523645d2 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -29,6 +29,7 @@
 #include <linux/types.h>
 #include <linux/io-64-nonatomic-lo-hi.h>
 #include <linux/sed-opal.h>
+#include <linux/kthread.h>
 
 #include "nvme.h"
 
@@ -118,6 +119,10 @@ struct nvme_dev {
 	/* reset for timeout */
 	struct mutex reset_lock;
 	struct work_struct eh_reset_work;
+
+	spinlock_t	  eh_lock;
+	bool		eh_in_recovery;
+	struct task_struct    *ehandler;
 };
 
 static int io_queue_depth_set(const char *val, const struct kernel_param *kp)
@@ -1182,6 +1187,87 @@ static void nvme_warn_reset(struct nvme_dev *dev, u32 csts)
 			 csts, result);
 }
 
+static void nvme_eh_schedule(struct nvme_dev *dev, struct request *rq)
+{
+	/*
+	 * mark timeout as quiesce for avoiding the timed-out request to
+	 * be timed out again
+	 */
+	blk_mark_timeout_quiesce(rq->q, true);
+
+	spin_lock(&dev->eh_lock);
+	if (!dev->eh_in_recovery) {
+		dev->eh_in_recovery = true;
+		wake_up_process(dev->ehandler);
+	}
+	spin_unlock(&dev->eh_lock);
+}
+
+static void nvme_eh_done(struct nvme_dev *dev)
+{
+	spin_lock(&dev->eh_lock);
+	if (dev->eh_in_recovery)
+		dev->eh_in_recovery = false;
+	spin_unlock(&dev->eh_lock);
+}
+
+static int nvme_error_handler(void *data)
+{
+	struct nvme_dev *dev = data;
+
+	while (true) {
+		/*
+		 * The sequence in kthread_stop() sets the stop flag first
+		 * then wakes the process.  To avoid missed wakeups, the task
+		 * should always be in a non running state before the stop
+		 * flag is checked
+		 */
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (kthread_should_stop())
+			break;
+
+		spin_lock(&dev->eh_lock);
+		if (!dev->eh_in_recovery) {
+			spin_unlock(&dev->eh_lock);
+			schedule();
+			continue;
+		}
+		spin_unlock(&dev->eh_lock);
+
+		__set_current_state(TASK_RUNNING);
+
+		dev_info(dev->ctrl.device, "start eh recovery\n");
+		nvme_dev_disable(dev, false, true);
+
+		nvme_eh_reset(dev);
+
+		dev_info(dev->ctrl.device, "eh recovery done\n");
+	}
+	__set_current_state(TASK_RUNNING);
+	dev->ehandler = NULL;
+
+	return 0;
+}
+
+static int nvme_eh_init(struct nvme_dev *dev)
+{
+	spin_lock_init(&dev->eh_lock);
+
+	dev->ehandler = kthread_run(nvme_error_handler, dev,
+			"nvme_eh_%d", dev->ctrl.instance);
+	if (IS_ERR(dev->ehandler)) {
+		dev_err(dev->ctrl.device, "error handler thread failed to spawn, error = %ld\n",
+			PTR_ERR(dev->ehandler));
+		return PTR_ERR(dev->ehandler);
+	}
+	return 0;
+}
+
+/*
+ * Controller will be shutdown in EH thread, return BLK_EH_RESET_TIMER
+ * after nvme_eh_schedule() so the timed-out request can be completed
+ * after controller is shutdown, then memory corruption can be avoided
+ */
 static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -1203,9 +1289,8 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	 */
 	if (nvme_should_reset(dev, csts)) {
 		nvme_warn_reset(dev, csts);
-		nvme_dev_disable(dev, false, true);
-		nvme_eh_reset(dev);
-		return BLK_EH_HANDLED;
+		nvme_eh_schedule(dev, req);
+		return BLK_EH_RESET_TIMER;
 	}
 
 	/*
@@ -1221,8 +1306,9 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	/*
 	 * Shutdown immediately if controller times out while starting. The
 	 * reset work will see the pci device disabled when it gets the forced
-	 * cancellation error. All outstanding requests are completed on
-	 * shutdown, so we return BLK_EH_HANDLED.
+	 * cancellation error. All outstanding requests(include the timed-out
+	 * one) are completed on shutdown, so we return BLK_EH_RESET_TIMER for
+	 * avoiding to complete it by block layer.
 	 */
 	switch (dev->ctrl.state) {
 	case NVME_CTRL_CONNECTING:
@@ -1230,9 +1316,9 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		dev_warn(dev->ctrl.device,
 			 "I/O %d QID %d timeout, disable controller\n",
 			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false, false);
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
-		return BLK_EH_HANDLED;
+		nvme_eh_schedule(dev, req);
+		return BLK_EH_RESET_TIMER;
 	default:
 		break;
 	}
@@ -1246,15 +1332,14 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		dev_warn(dev->ctrl.device,
 			 "I/O %d QID %d timeout, reset controller\n",
 			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false, true);
-		nvme_eh_reset(dev);
-
 		/*
-		 * Mark the request as handled, since the inline shutdown
+		 * Mark the request as RESET_TIMER for avoiding to complete
+		 * this one by block layer, since the inline shutdown
 		 * forces all outstanding requests to complete.
 		 */
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
-		return BLK_EH_HANDLED;
+		nvme_eh_schedule(dev, req);
+		return BLK_EH_RESET_TIMER;
 	}
 
 	if (atomic_dec_return(&dev->ctrl.abort_limit) < 0) {
@@ -2303,11 +2388,25 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
 	for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
 		nvme_suspend_queue(&dev->queues[i]);
 
+	/*
+	 * safe to sync timeout after queues are quiesced, then all
+	 * requests(include the time-out ones) will be canceled.
+	 */
+	nvme_quiesce_timeout(&dev->ctrl);
+	blk_quiesce_timeout(dev->ctrl.admin_q);
+
 	nvme_pci_disable(dev);
 
+	/*
+	 * Both timeout and interrupt handler have been drained, and all
+	 * in-flight requests will be canceled now.
+	 */
 	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
 	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
 
+	/* all requests are canceled now, so enable timeout now */
+	nvme_unquiesce_timeout(&dev->ctrl);
+	blk_unquiesce_timeout(dev->ctrl.admin_q);
 	/*
 	 * The driver will not be starting up queues again if shutting down so
 	 * must flush all entered requests to their failed completion to avoid
@@ -2527,13 +2626,14 @@ static int nvme_eh_reset(struct nvme_dev *dev)
 		nvme_remove_dead_ctrl(dev, ret);
 		goto exit;
 	}
-
+	nvme_eh_done(dev);
 	mutex_unlock(&dev->reset_lock);
 
 	queue_work(nvme_reset_wq, &dev->eh_reset_work);
 
 	return 0;
  exit:
+	nvme_eh_done(dev);
 	mutex_unlock(&dev->reset_lock);
 	return ret;
 }
@@ -2679,10 +2779,15 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
 	dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev));
 
+	if (nvme_eh_init(dev))
+		goto uninit_ctrl;
+
 	nvme_reset_ctrl(&dev->ctrl);
 
 	return 0;
 
+ uninit_ctrl:
+	nvme_uninit_ctrl(&dev->ctrl);
  release_pools:
 	nvme_release_prp_pools(dev);
  unmap:
@@ -2736,6 +2841,8 @@ static void nvme_remove(struct pci_dev *pdev)
 	nvme_stop_ctrl(&dev->ctrl);
 	nvme_remove_namespaces(&dev->ctrl);
 	nvme_dev_disable(dev, true, false);
+	if (dev->ehandler)
+		kthread_stop(dev->ehandler);
 	nvme_free_host_mem(dev);
 	nvme_dev_remove_admin(dev);
 	nvme_free_queues(dev, 0);
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH V3 7/8] nvme: pci: recover controller reliably
  2018-05-03  3:17   ` Ming Lei
@ 2018-05-03  9:14     ` jianchao.wang
  -1 siblings, 0 replies; 38+ messages in thread
From: jianchao.wang @ 2018-05-03  9:14 UTC (permalink / raw)
  To: Ming Lei, Keith Busch
  Cc: Jens Axboe, Laurence Oberman, Sagi Grimberg, linux-nvme,
	linux-block, Christoph Hellwig

Hi Ming

On 05/03/2018 11:17 AM, Ming Lei wrote:
>  static int io_queue_depth_set(const char *val, const struct kernel_param *kp)
> @@ -1199,7 +1204,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>  	if (nvme_should_reset(dev, csts)) {
>  		nvme_warn_reset(dev, csts);
>  		nvme_dev_disable(dev, false, true);
> -		nvme_reset_ctrl(&dev->ctrl);
> +		nvme_eh_reset(dev);
>  		return BLK_EH_HANDLED;
>  	}
>  
> @@ -1242,7 +1247,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>  			 "I/O %d QID %d timeout, reset controller\n",
>  			 req->tag, nvmeq->qid);
>  		nvme_dev_disable(dev, false, true);
> -		nvme_reset_ctrl(&dev->ctrl);
> +		nvme_eh_reset(dev);

Without the 8th patch, invoking nvme_eh_reset in nvme_timeout is dangerous.
nvme_pre_reset_dev will send a lot of admin I/O when initializing the controller.
If these admin I/Os time out, nvme_timeout cannot handle them because the timeout
work is sleeping, waiting for the admin I/Os.

In addition, even if we take the nvme_wait_freeze out of nvme_eh_reset and put it into another context,
the ctrl state is still CONNECTING, so nvme_eh_reset cannot move forward.

Actually, I reported this issue to Keith before. I hit an I/O hang when the controller died in
nvme_reset_work -> nvme_wait_freeze. As you know, nvme_reset_work cannot make progress because it is waiting.
Here is Keith's commit for this:
http://lists.infradead.org/pipermail/linux-nvme/2018-February/015603.html

Thanks
Jianchao



 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V3 7/8] nvme: pci: recover controller reliably
  2018-05-03  9:14     ` jianchao.wang
@ 2018-05-03 10:08       ` Ming Lei
  -1 siblings, 0 replies; 38+ messages in thread
From: Ming Lei @ 2018-05-03 10:08 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Keith Busch, Jens Axboe, Laurence Oberman, Sagi Grimberg,
	linux-nvme, linux-block, Christoph Hellwig

On Thu, May 03, 2018 at 05:14:30PM +0800, jianchao.wang wrote:
> Hi ming
> 
> On 05/03/2018 11:17 AM, Ming Lei wrote:
> >  static int io_queue_depth_set(const char *val, const struct kernel_param *kp)
> > @@ -1199,7 +1204,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
> >  	if (nvme_should_reset(dev, csts)) {
> >  		nvme_warn_reset(dev, csts);
> >  		nvme_dev_disable(dev, false, true);
> > -		nvme_reset_ctrl(&dev->ctrl);
> > +		nvme_eh_reset(dev);
> >  		return BLK_EH_HANDLED;
> >  	}
> >  
> > @@ -1242,7 +1247,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
> >  			 "I/O %d QID %d timeout, reset controller\n",
> >  			 req->tag, nvmeq->qid);
> >  		nvme_dev_disable(dev, false, true);
> > -		nvme_reset_ctrl(&dev->ctrl);
> > +		nvme_eh_reset(dev);
> 
> w/o the 8th patch, invoke nvme_eh_reset in nvme_timeout is dangerous.
> nvme_pre_reset_dev will send a lot of admin io when initialize the controller.
> if this admin ios timeout, the nvme_timeout cannot handle this because the timeout work is sleeping
> to wait admin ios.

I was just trying not to make the 8th patch too big, but it looks like we
have to merge the two.

Once the EH kthread is introduced in the 8th patch, this issue no longer exists.

> 
> In addition, even if we take the nvme_wait_freeze out of nvme_eh_reset and put it into another context,
> but the ctrl state is still CONNECTING, the nvme_eh_reset cannot move forward.

nvme_eh_reset() can move on: if the controller state is either CONNECTING or
RESETTING, nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING) won't
be called in nvme_eh_reset(); nvme_pre_reset_dev() will be called
directly, followed by scheduling of the eh_reset_work.

> 
> Actually, I used to report this issue to Keith. I met io hung when the controller die in
> nvme_reset_work -> nvme_wait_freeze. As you know, the nvme_reset_work cannot be scheduled because it is waiting.
> Here is Keith's commit for this:
> http://lists.infradead.org/pipermail/linux-nvme/2018-February/015603.html

There are actually two issues here, and both are covered in this patchset:

1) while nvme_wait_freeze() is draining I/O, a controller error may
happen again; this patch can shut down & reset the controller again to
recover it.

2) there is another issue: queues may not be put into the freeze state
in nvme_dev_disable(), so nvme_wait_freeze() will wait forever. This
issue is addressed by the 4th patch.

Both can be observed in blktests block/011 and are verified by these patches.

Thanks for your great review; please let me know if there are other
issues.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V3 7/8] nvme: pci: recover controller reliably
  2018-05-03 10:08       ` Ming Lei
@ 2018-05-03 15:46         ` jianchao.wang
  -1 siblings, 0 replies; 38+ messages in thread
From: jianchao.wang @ 2018-05-03 15:46 UTC (permalink / raw)
  To: Ming Lei
  Cc: Keith Busch, Jens Axboe, Laurence Oberman, Sagi Grimberg,
	linux-nvme, linux-block, Christoph Hellwig

Hi Ming

Thanks for your kind response.

On 05/03/2018 06:08 PM, Ming Lei wrote:
> nvme_eh_reset() can move on, if controller state is either CONNECTING or
> RESETTING, nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING) won't
> be called in nvme_eh_reset(), and nvme_pre_reset_dev() will be called
> directly, then follows scheduling of the eh_reset_work.

We have to consider another context, nvme_reset_work.
There are two contexts that will invoke nvme_pre_reset_dev.
nvme_eh_reset could invoke nvme_pre_reset_dev even if the state is RESETTING or CONNECTING.

nvme_error_handler                nvme_reset_ctrl
                                    -> change state to RESETTING
                                    -> queue reset work
                                  nvme_reset_work          
  -> nvme_dev_disable                                     
  -> nvme_eh_reset
    -> nvme_pre_reset_dev          -> nvme_pre_reset_dev
      -> change state to CONNECTING   -> change state fails
                                   -> nvme_remove_dead_ctrl
There seem to be other race scenarios between them.

Just invoking nvme_dev_disable in the nvme_error_handler context and handing over the other things
to nvme_reset_work, as the v2 patch series does, seems clearer.
The primary target of nvme_error_handler is to drain I/Os and disable the controller;
reset_work could take over the rest of the recovery work.

IMO, changing the nvme_reset_ctrl_sync in nvme_error_handler to nvme_reset_ctrl should be OK.
If the admin I/Os in reset_work time out, nvme_error_handler could also help complete them.
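
(Illustration only: a rough sketch of the suggestion above, i.e. disable the
controller in the error-handler context and queue the reset asynchronously
instead of waiting for it inline; the function name is hypothetical:)

	static void nvme_error_handler_sketch(struct nvme_dev *dev)
	{
		/* drain in-flight I/O and disable the controller */
		nvme_dev_disable(dev, false, true);

		/*
		 * Queue the reset instead of calling nvme_reset_ctrl_sync();
		 * if admin I/O times out during reset_work, the error handler
		 * is then still free to help complete it.
		 */
		nvme_reset_ctrl(&dev->ctrl);
	}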

> There are actually two issues here, both are covered in this patchset:
> 
> 1) when nvme_wait_freeze() is for draining IO, controller error may
> happen again, this patch can shutdown & reset controller again to
> recover it.
>

We could split nvme_reset_work into two contexts instead of introducing another nvme_eh_reset;
otherwise the code looks really complicated.
In addition, the fix for this issue could be deferred until after your EH kthread solution is submitted.

Thanks
Jianchao

> 2) there is also another issue: queues may not be put into freeze state
> in nvme_dev_disable(), then nvme_wait_freeze() will wait forever. This
> issue is addressed by 4th patch.
> 
> Both two can be observed in blktests block/011 and verified by this patches.
> 
> Thanks for your great review, please let me know if there are other
> issues.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V3 7/8] nvme: pci: recover controller reliably
  2018-05-03 15:46         ` jianchao.wang
@ 2018-05-04  4:24           ` Ming Lei
  -1 siblings, 0 replies; 38+ messages in thread
From: Ming Lei @ 2018-05-04  4:24 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Keith Busch, Jens Axboe, Laurence Oberman, Sagi Grimberg,
	linux-nvme, linux-block, Christoph Hellwig

On Thu, May 03, 2018 at 11:46:56PM +0800, jianchao.wang wrote:
> Hi Ming
> 
> Thanks for your kindly response.
> 
> On 05/03/2018 06:08 PM, Ming Lei wrote:
> > nvme_eh_reset() can move on, if controller state is either CONNECTING or
> > RESETTING, nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING) won't
> > be called in nvme_eh_reset(), and nvme_pre_reset_dev() will be called
> > directly, then follows scheduling of the eh_reset_work.
> 
> We have to consider another context, nvme_reset_work.
> There are two contexts that will invoke nvme_pre_reset_dev.
> nvme_eh_reset could invoke nvme_pre_reset_dev even if the state is RESETTING or CONNECTING.
> 
> nvme_error_handler                nvme_reset_ctrl
>                                     -> change state to RESETTING
>                                     -> queue reset work
>                                   nvme_reset_work          
>   -> nvme_dev_disable                                     
>   -> nvme_eh_reset
>     -> nvme_pre_reset_dev          -> nvme_pre_reset_dev
>       -> change state to CONNECTING   -> change state fails
>                                    -> nvme_remove_dead_ctrl
> There seems to be other race scenarios between them.

IMO it is fine to change the state transition in nvme_pre_reset_dev()
in the following way:

if (dev->ctrl.state != NVME_CTRL_CONNECTING)
	if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_CONNECTING))
		goto out;

Also, dev->reset_lock has been introduced, and we may hold it around
nvme_pre_reset_dev() in the nvme_reset_ctrl() path too; then the above
race no longer exists.
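
(Illustration only, not code from the posted series: a minimal sketch of the
serialization described above, where the nvme_reset_ctrl() path also takes
dev->reset_lock around the pre-reset step, assuming nvme_pre_reset_dev()
itself performs the CONNECTING state check:)

	static void nvme_reset_path_sketch(struct nvme_dev *dev)
	{
		int ret;

		/* serialize with nvme_eh_reset(), which takes the same lock */
		mutex_lock(&dev->reset_lock);
		ret = nvme_pre_reset_dev(dev);
		mutex_unlock(&dev->reset_lock);

		if (ret)
			nvme_remove_dead_ctrl(dev, ret);
	}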

> 
> Just invoke nvme_dev_disable in nvme_error_handler context and hand over the other things
> to nvme_reset_work as the v2 patch series seems clearer.

That way may not fix the current race: nvme_dev_disable() will
quiesce/freeze the queues again while resetting is in progress, and then
nothing can move on.

One big change in V3 is to run nvme_dev_disable() & nvme_pre_reset_dev()
in the EH thread, and make sure queues are started once
nvme_pre_reset_dev() returns. But as you pointed out, it may not work well
enough if admin requests time out in nvme_pre_reset_dev().

> The primary target of nvme_error_handler is to drain IOs and disable controller.
> reset_work could take over the left things of the recovery work

Draining I/O isn't necessary; we could remove freezing queues & wait_freeze()
from nvme_dev_disable() & nvme_reset_work() for the non-shutdown case. That
would make it easy to fix all these races, since .eh_reset_work wouldn't be
needed any more: quiescing is enough to prevent I/O from entering the
driver/device.

The only issue is that more requests may fail when the reset fails, so I
don't want to try that and would rather just follow the current behaviour.

IMO the primary goal of NVMe EH is to recover the controller; that is
what nvme_dev_disable() and resetting should do. If I/O isn't drained,
timeouts may still be triggered.

The big trouble is with percpu_ref: once the q_usage_counter is killed
to start freezing (to prevent I/O from entering the block layer), we have
to run blk_mq_freeze_queue_wait() before unfreezing the usage counter. If
blk_mq_freeze_queue_wait() weren't needed, things would be much easier for us.
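
(For reference, this is the generic blk-mq freeze sequence being referred to,
not code from this series; once the usage counter is killed, the wait step
cannot be skipped before unfreezing:)

	/* start freezing: kill q_usage_counter so new I/O is blocked */
	blk_freeze_queue_start(q);

	/* wait for every in-flight request to drain ... */
	blk_mq_freeze_queue_wait(q);

	/* ... only then can the queue be unfrozen and I/O admitted again */
	blk_mq_unfreeze_queue(q);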

> 
> IMO, modify the nvme_reset_ctrl_sync in nvme_error_handler to nvme_reset_ctrl should be ok.
> If the admin ios in reset_work timeout, the nvme_error_handler could also provide help to complete them.
> 

Right, that is one goal of this patchset.

> > There are actually two issues here, both are covered in this patchset:
> > 
> > 1) when nvme_wait_freeze() is for draining IO, controller error may
> > happen again, this patch can shutdown & reset controller again to
> > recover it.
> >
> 
> We could split the nvme_reset_work into two contexts instead of introduce another nvme_eh_reset.
> Otherwise, the code looks really complicated.

Yeah, that is what I plan to do in V4: introduce .eh_pre_reset_work and
.eh_post_reset_work, and run the following things in .eh_pre_reset_work:

	- nvme_pre_reset_dev()
	- schedule .eh_post_reset_work

then run the I/O draining & controller state updating in .eh_post_reset_work.
.eh_pre_reset_work will be scheduled in nvme_error_handler().
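
(A rough sketch of that V4 plan, for illustration only; the
eh_pre_reset_work/eh_post_reset_work members and these handlers are
assumptions about the not-yet-posted V4, not code from this series:)

	static void nvme_eh_pre_reset_work(struct work_struct *work)
	{
		struct nvme_dev *dev =
			container_of(work, struct nvme_dev, eh_pre_reset_work);

		if (nvme_pre_reset_dev(dev)) {
			nvme_remove_dead_ctrl(dev, -EIO);
			return;
		}

		/* draining I/O & updating controller state happens later */
		queue_work(nvme_reset_wq, &dev->eh_post_reset_work);
	}

	static void nvme_eh_post_reset_work(struct work_struct *work)
	{
		struct nvme_dev *dev =
			container_of(work, struct nvme_dev, eh_post_reset_work);

		/* wait for the freeze to finish, then start the controller */
		nvme_post_reset_dev(dev);
	}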

This way may address the issue of admin request timeouts in nvme_pre_reset_dev();
what do you think of this approach?

Thanks,
Ming

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V3 7/8] nvme: pci: recover controller reliably
  2018-05-04  4:24           ` Ming Lei
@ 2018-05-04  6:10             ` jianchao.wang
  -1 siblings, 0 replies; 38+ messages in thread
From: jianchao.wang @ 2018-05-04  6:10 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Laurence Oberman, Sagi Grimberg,
	linux-nvme, Keith Busch, Christoph Hellwig

Hi Ming

On 05/04/2018 12:24 PM, Ming Lei wrote:
>> Just invoke nvme_dev_disable in nvme_error_handler context and hand over the other things
>> to nvme_reset_work as the v2 patch series seems clearer.
> That way may not fix the current race: nvme_dev_disable() will
> quiesce/freeze queue again when resetting is in-progress, then
> nothing can move on.

With timeout quiescing and the NVMe EH kthread, nvme_dev_disable can ensure all in-flight
requests are handled. When we queue the reset_work in nvme_error_handler, there will be no
timeouts any more.
If there is a timeout due to admin I/O in reset_work, this is expected; reset_work will be
interrupted and fail.
If I/O times out while reset_work is waiting for the freeze, things cannot move on because
reset_work is waiting. We have multiple methods to handle this:
1. as in Keith's solution, change the ctrl state to DELETING and fail the reset_work.
2. as in the current patch set, try to recover it, but then we have to split reset_work into two contexts.

Actually, there are many other places where nvme_dev_disable will be invoked.
> 
> One big change in V3 is to run nvme_dev_disable() & nvme_pre_reset_dev()
> in the EH thread, and make sure queues are started once
> nvme_pre_reset_dev() returns. But as you pointed, it may not work well
> enough if admin requests are timed-out in nvme_pre_reset_dev().
> 
>> The primary target of nvme_error_handler is to drain IOs and disable controller.
>> reset_work could take over the left things of the recovery work
> Draining IOs isn't necessary, we can remove freezing queues & wait_freeze()
> from nvme_dev_disable() & nvme_reset_work() for non-shutdown, that can be
> easy to fix all these races since the .eh_reset_work isn't needed any more,
> because quiescing is enough to prevent IOs from entering driver/device.
> 
> The only issue is that more requests may be failed when reset failed,
> so I don't want to try that, and just follows the current behaviour.
> 
> IMO the primary goal of nvme EH is to recover controller, that is
> what nvme_dev_disable() and resetting should do, if IO isn't drained,
> timeout still may be triggered.
> 
> The big trouble is about percpu_ref, once the q_usage_counter is killed
> to start freezing for preventing IO from entering block layer, we have
> to run blk_mq_freeze_queue_wait() before unfreezing the usage couner. If
> blk_mq_freeze_queue_wait() isn't needed, things can be much easier for us.


Sorry for my poor description. I mean that nvme_dev_disable will drain all the
in-flight requests and requeue or fail them, so we will not get any timeouts again.

In fact, I have talked with Keith about removing the queue freezing & wait_freeze for non-shutdown.
There is an obstacle: nvme_dev_add -> blk_mq_update_nr_hw_queues.
When the controller comes back, the number of hw queues may have changed.
Even if we remove the wait_freeze, blk_mq_update_nr_hw_queues will do it anyway.
On the other hand, if we do not freeze the queues, requests that are submitted to a dying hw queue
during the reset will be sacrificed.
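
(For context, a simplified sketch of the block-layer behaviour mentioned above:
blk_mq_update_nr_hw_queues() freezes every queue in the tag set itself before
remapping, so removing wait_freeze from the driver alone does not avoid the
wait. This is an approximation of the upstream code, shown only to make the
obstacle concrete:)

	void blk_mq_update_nr_hw_queues_sketch(struct blk_mq_tag_set *set,
					       int nr_hw_queues)
	{
		struct request_queue *q;

		list_for_each_entry(q, &set->tag_list, tag_set_list)
			blk_mq_freeze_queue(q);		/* freeze + wait */

		set->nr_hw_queues = nr_hw_queues;
		/* ... remap hw queues ... */

		list_for_each_entry(q, &set->tag_list, tag_set_list)
			blk_mq_unfreeze_queue(q);
	}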


> 
>> IMO, modify the nvme_reset_ctrl_sync in nvme_error_handler to nvme_reset_ctrl should be ok.
>> If the admin ios in reset_work timeout, the nvme_error_handler could also provide help to complete them.
>>
> Right, that is one goal of this patchset.
> 
>>> There are actually two issues here, both are covered in this patchset:
>>>
>>> 1) when nvme_wait_freeze() is for draining IO, controller error may
>>> happen again, this patch can shutdown & reset controller again to
>>> recover it.
>>>
>> We could split the nvme_reset_work into two contexts instead of introduce another nvme_eh_reset.
>> Otherwise, the code looks really complicated.
> Yeah, that is what I plan to do in V4, introduce .eh_pre_reset_work and
> .eh_post_reset_work, and run the following things in .eh_pre_reset_work:
> 
> 	- nvme_pre_reset_dev()
> 	- schedule .eh_post_reset_work
> 
> and run draining IO & updating controller state in .eh_post_reset_work.
> .eh_pre_reset_work will be scheduled in nvme_error_handler().
> 
> This way may address the issue of admin req timeout in nvme_pre_reset_dev(),
> what do you think of this approach?

nvme_error_handler should invoke nvme_reset_ctrl instead of introducing another interface;
that makes it easier to ensure that only one resetting instance is running.

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V3 7/8] nvme: pci: recover controller reliably
  2018-05-04  6:10             ` jianchao.wang
@ 2018-05-04  6:21               ` jianchao.wang
  -1 siblings, 0 replies; 38+ messages in thread
From: jianchao.wang @ 2018-05-04  6:21 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Keith Busch, Laurence Oberman, Sagi Grimberg,
	linux-nvme, linux-block, Christoph Hellwig

Oh sorry.

On 05/04/2018 02:10 PM, jianchao.wang wrote:
> nvme_error_handler should invoke nvme_reset_ctrl instead of introducing another interface.
> Then it is more convenient to ensure that there will be only one resetting instance running.

The ctrl is still in the RESETTING state, so nvme_reset_ctrl cannot work.
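
FWIW, nvme_reset_ctrl() serializes on the ctrl state machine, which is exactly
why it cannot be reused here; roughly (from memory, the exact code in the tree
may differ slightly):

        int nvme_reset_ctrl(struct nvme_ctrl *ctrl)
        {
                /* fails while the state is already RESETTING */
                if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
                        return -EBUSY;
                if (!queue_work(nvme_wq, &ctrl->reset_work))
                        return -EBUSY;
                return 0;
        }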

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V3 7/8] nvme: pci: recover controller reliably
  2018-05-04  6:10             ` jianchao.wang
@ 2018-05-04  8:02               ` Ming Lei
  -1 siblings, 0 replies; 38+ messages in thread
From: Ming Lei @ 2018-05-04  8:02 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Jens Axboe, linux-block, Laurence Oberman, Sagi Grimberg,
	linux-nvme, Keith Busch, Christoph Hellwig

On Fri, May 04, 2018 at 02:10:19PM +0800, jianchao.wang wrote:
> Hi ming
> 
> On 05/04/2018 12:24 PM, Ming Lei wrote:
> >> Just invoke nvme_dev_disable in nvme_error_handler context and hand over the other things
> >> to nvme_reset_work as the v2 patch series seems clearer.
> > That way may not fix the current race: nvme_dev_disable() will
> > quiesce/freeze queue again when resetting is in-progress, then
> > nothing can move on.
> 
> With timeout quiesce and nvme EH kthread, the nvme_dev_disable could ensure all the in-flight
> requests to be handled. When we queue the reset_work in nvme_error_handler, there will be no
> timeout anymore.

Right.

> If there is timeout due to the admin io in reset_work, this is expected, reset_work will be
> interrupted and fail.

At least with V3, the admin io submitted in reset_work won't be handled
well. When the EH thread hangs in the sync admin io, nvme_eh_schedule()
can't wake up the EH thread any more, so the admin io may hang forever
in reset_work. That is why we have to run nvme_pre_reset_dev() in another
context.

> If the io timeout when reset_work wait freeze, because the reset_work is waiting, things cannot
> move on. we have multiple method to handle this:
> 1. as Keith's solution, change ctrl state to DELETING and fail the reset_work.
> 2. as current patch set, try to recovery it. but we have to split reset_work into two contexts.
>

Right, nvme_dev_disable() can guarantee forward progress, and we have to
make sure that nvme_dev_disable() is called when a timeout happens.

> Actually, there are many other places where the nvme_dev_disable will be invoked.   
> > 
> > One big change in V3 is to run nvme_dev_disable() & nvme_pre_reset_dev()
> > in the EH thread, and make sure queues are started once
> > nvme_pre_reset_dev() returns. But as you pointed, it may not work well
> > enough if admin requests are timed-out in nvme_pre_reset_dev().
> > 
> >> The primary target of nvme_error_handler is to drain IOs and disable controller.
> >> reset_work could take over the left things of the recovery work
> > Draining IOs isn't necessary, we can remove freezing queues & wait_freeze()
> > from nvme_dev_disable() & nvme_reset_work() for non-shutdown, that can be
> > easy to fix all these races since the .eh_reset_work isn't needed any more,
> > because quiescing is enough to prevent IOs from entering driver/device.
> > 
> > The only issue is that more requests may be failed when reset failed,
> > so I don't want to try that, and just follows the current behaviour.
> > 
> > IMO the primary goal of nvme EH is to recover controller, that is
> > what nvme_dev_disable() and resetting should do, if IO isn't drained,
> > timeout still may be triggered.
> > 
> > The big trouble is about percpu_ref, once the q_usage_counter is killed
> > to start freezing for preventing IO from entering block layer, we have
> > to run blk_mq_freeze_queue_wait() before unfreezing the usage counter. If
> > blk_mq_freeze_queue_wait() isn't needed, things can be much easier for us.
> 
> 
> Sorry for my bad description. I mean nvme_dev_disable will drain all the 
> in-flight requests and requeue or fail them, then we will not get any timeout again.

OK, got it, sorry for misunderstanding your point.

> 
> In fact, I used to talk with Keith about removing freezing queues & wait_freeze for non-shutdown.
> There is an obstacle: nvme_dev_add -> blk_mq_update_nr_hw_queues.
> When the ctrl comes back, the number of hw queues may have changed.
> Even if we remove the wait_freeze, blk_mq_update_nr_hw_queues will do the freeze and wait itself.
> On the other hand, if we do not freeze the queues, the requests that are submitted to the dying hw
> queues during the resetting will be sacrificed.

Right.

> 
> 
> > 
> >> IMO, modify the nvme_reset_ctrl_sync in nvme_error_handler to nvme_reset_ctrl should be ok.
> >> If the admin ios in reset_work timeout, the nvme_error_handler could also provide help to complete them.
> >>
> > Right, that is one goal of this patchset.
> > 
> >>> There are actually two issues here, both are covered in this patchset:
> >>>
> >>> 1) when nvme_wait_freeze() is for draining IO, controller error may
> >>> happen again, this patch can shutdown & reset controller again to
> >>> recover it.
> >>>
> >> We could split the nvme_reset_work into two contexts instead of introduce another nvme_eh_reset.
> >> Otherwise, the code looks really complicated.
> > Yeah, that is what I plan to do in V4, introduce .eh_pre_reset_work and
> > .eh_post_reset_work, and run the following things in .eh_pre_reset_work:
> > 
> > 	- nvme_pre_reset_dev()
> > 	- schedule .eh_post_reset_work
> > 
> > and run draining IO & updating controller state in .eh_post_reset_work.
> > .eh_pre_reset_work will be scheduled in nvme_error_handler().
> > 
> > This way may address the issue of admin req timeout in nvme_pre_reset_dev(),
> > what do you think of this approach?
> 
> nvme_error_handler should invoke nvme_reset_ctrl instead of introducing another interface.
> Then it is more convenient to ensure that there will be only one resetting instance running.
> 

But as you mentioned above, reset_work has to be split into two
contexts to handle IO timeouts during wait_freeze in reset_work,
so a single instance of nvme_reset_ctrl() may not work well.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V3 7/8] nvme: pci: recover controller reliably
  2018-05-04  8:02               ` Ming Lei
@ 2018-05-04  8:28                 ` jianchao.wang
  -1 siblings, 0 replies; 38+ messages in thread
From: jianchao.wang @ 2018-05-04  8:28 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Laurence Oberman, Sagi Grimberg,
	linux-nvme, Keith Busch, Christoph Hellwig

Hi ming

On 05/04/2018 04:02 PM, Ming Lei wrote:
>> nvme_error_handler should invoke nvme_reset_ctrl instead of introducing another interface.
>> Then it is more convenient to ensure that there will be only one resetting instance running.
>>
> But as you mentioned above, reset_work has to be split into two
> contexts to handle IO timeouts during wait_freeze in reset_work,
> so a single instance of nvme_reset_ctrl() may not work well.

I mean the EH kthread and the reset_work could both reset the ctrl, instead of using
the pre- and post-reset contexts.

Honestly, I doubt a bit whether it is worth trying to recover from [1].
The EH kthread solution could make things easier, but the code for recovering from [1] has
made things really complicated. It also makes it more difficult to unify nvme-pci, rdma and fc.

How about just failing the resetting, as in Keith's solution?

[1] io timeout when nvme_reset_work or the new nvme_post_reset_dev invokes nvme_wait_freeze.

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V3 7/8] nvme: pci: recover controller reliably
  2018-05-04  8:28                 ` jianchao.wang
@ 2018-05-04  9:16                   ` Ming Lei
  -1 siblings, 0 replies; 38+ messages in thread
From: Ming Lei @ 2018-05-04  9:16 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Jens Axboe, linux-block, Laurence Oberman, Sagi Grimberg,
	linux-nvme, Keith Busch, Christoph Hellwig

On Fri, May 04, 2018 at 04:28:23PM +0800, jianchao.wang wrote:
> Hi ming
> 
> On 05/04/2018 04:02 PM, Ming Lei wrote:
> >> nvme_error_handler should invoke nvme_reset_ctrl instead of introducing another interface.
> >> Then it is more convenient to ensure that there will be only one resetting instance running.
> >>
> > But as you mentioned above, reset_work has to be split into two
> > contexts to handle IO timeouts during wait_freeze in reset_work,
> > so a single instance of nvme_reset_ctrl() may not work well.
> 
> I mean the EH kthread and the reset_work could both reset the ctrl, instead of using
> the pre- and post-reset contexts.

That may run nvme_pre_reset_dev() twice, from both the EH thread and
reset_work, which doesn't look good.

> 
> Honestly, I doubt a bit whether it is worth trying to recover from [1].
> The EH kthread solution could make things easier, but the code for recovering from [1] has
> made things really complicated. It also makes it more difficult to unify nvme-pci, rdma and fc.

IMO the model is not complicated (one EH thread with two-stage resetting), and
it may be cloned to rdma, fc, etc. without much difficulty.

The model is as follows:

1) single EH thread, in which the controller is guaranteed to be shut
down, and the EH thread is woken up whenever the controller needs to be
recovered.

2) 1st stage resetting: recover the controller, which has to run in its own
work context, since the admin commands used for recovery may time out.

3) 2nd stage resetting: draining IO & updating the controller state to live,
which has to run in another context, since IOs during this stage may
time out too.

The reset lock is used to sync between the 1st stage resetting and the 2nd
stage resetting. The '1st stage resetting' is scheduled from the EH thread,
and the '2nd stage resetting' is scheduled after the '1st stage
resetting' is done.

The implementation might not be so complicated either, since V3 has already
partitioned reset_work into two parts; what we need to do in V4 is to run
each part in its own work item from the EH thread.
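
For illustration only, a rough sketch of the model above; the struct fields
(eh_wq, reset_lock, eh_pre/post_reset_work), the eh_needed() check and
nvme_pre_reset_dev() are placeholders mirroring the names used in this thread,
not actual tree code:

        static int nvme_eh_thread(void *data)
        {
                struct nvme_dev *dev = data;

                while (!kthread_should_stop()) {
                        wait_event_interruptible(dev->eh_wq,
                                eh_needed(dev) || kthread_should_stop());
                        if (kthread_should_stop())
                                break;
                        /* controller is guaranteed to be shut down here */
                        nvme_dev_disable(dev, false);
                        /* kick the 1st stage resetting */
                        queue_work(nvme_wq, &dev->eh_pre_reset_work);
                }
                return 0;
        }

        static void nvme_eh_pre_reset_work(struct work_struct *work)
        {
                struct nvme_dev *dev =
                        container_of(work, struct nvme_dev, eh_pre_reset_work);

                mutex_lock(&dev->reset_lock);
                /* 1st stage: recover the controller; admin commands may time out */
                nvme_pre_reset_dev(dev);
                mutex_unlock(&dev->reset_lock);
                /* schedule the 2nd stage */
                queue_work(nvme_wq, &dev->eh_post_reset_work);
        }

        static void nvme_eh_post_reset_work(struct work_struct *work)
        {
                struct nvme_dev *dev =
                        container_of(work, struct nvme_dev, eh_post_reset_work);

                mutex_lock(&dev->reset_lock);
                /* 2nd stage: drain IO and mark the controller live; may time out too */
                nvme_wait_freeze(&dev->ctrl);
                nvme_unfreeze(&dev->ctrl);
                nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_LIVE);
                mutex_unlock(&dev->reset_lock);
        }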

> How about just failing the resetting, as in Keith's solution?
> 
> [1] io timeout when nvme_reset_work or the new nvme_post_reset_dev invokes nvme_wait_freeze.
> 

I'd suggest not changing the controller state to DELETING in this case,
otherwise it may be reported as another bug, in which the controller becomes
completely unusable. This issue can easily be triggered by blktests
block/011, and in that test case the controller is always recoverable.

I will post V4, then we can see whether all the current issues are covered and
how complicated the implementation is.

Thanks
Ming

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V3 7/8] nvme: pci: recover controller reliably
  2018-05-04  8:28                 ` jianchao.wang
@ 2018-05-05  0:16                   ` Ming Lei
  -1 siblings, 0 replies; 38+ messages in thread
From: Ming Lei @ 2018-05-05  0:16 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Ming Lei, Jens Axboe, Keith Busch, Laurence Oberman,
	Sagi Grimberg, linux-nvme, linux-block, Christoph Hellwig

On Fri, May 4, 2018 at 4:28 PM, jianchao.wang
<jianchao.w.wang@oracle.com> wrote:
> Hi ming
>
> On 05/04/2018 04:02 PM, Ming Lei wrote:
>>> nvme_error_handler should invoke nvme_reset_ctrl instead of introducing another interface.
>>> Then it is more convenient to ensure that there will be only one resetting instance running.
>>>
>> But as you mentioned above, reset_work has to be split into two
>> contexts to handle IO timeouts during wait_freeze in reset_work,
>> so a single instance of nvme_reset_ctrl() may not work well.
>
> I mean the EH kthread and the reset_work could both reset the ctrl, instead of using
> the pre- and post-reset contexts.
>
> Honestly, I doubt a bit whether it is worth trying to recover from [1].
> The EH kthread solution could make things easier, but the code for recovering from [1] has
> made things really complicated. It also makes it more difficult to unify nvme-pci, rdma and fc.

Another choice may be nested EH, which should be easier to implement:

- run the whole recovery procedure (shutdown & reset) in one single context
- and start a new context to handle any new timeout during the last recovery in
the same way

The two approaches are just like sync IO vs AIO.
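
For illustration only, a rough sketch of the nested idea; all names are
hypothetical, and nvme_reset_dev() just stands in for the shutdown & reset
body:

        struct nvme_eh_ctx {
                struct work_struct work;
                struct nvme_dev *dev;
        };

        static void nvme_nested_eh_work(struct work_struct *work)
        {
                struct nvme_eh_ctx *ctx =
                        container_of(work, struct nvme_eh_ctx, work);

                /* the whole recovery runs in this one context */
                nvme_dev_disable(ctx->dev, false);
                nvme_reset_dev(ctx->dev);
                kfree(ctx);
        }

        /*
         * Called from the timeout handler: a timeout that fires during a
         * previous recovery simply spawns another recovery context in the
         * same way.
         */
        static void nvme_nested_eh_schedule(struct nvme_dev *dev)
        {
                struct nvme_eh_ctx *ctx = kzalloc(sizeof(*ctx), GFP_ATOMIC);

                if (!ctx)
                        return;
                ctx->dev = dev;
                INIT_WORK(&ctx->work, nvme_nested_eh_work);
                queue_work(nvme_wq, &ctx->work);
        }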

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2018-05-05  0:16 UTC | newest]

Thread overview: 38+ messages
2018-05-03  3:17 [PATCH V3 0/8] nvme: pci: fix & improve timeout handling Ming Lei
2018-05-03  3:17 ` Ming Lei
2018-05-03  3:17 ` [PATCH V3 1/8] block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout() Ming Lei
2018-05-03  3:17   ` Ming Lei
2018-05-03  3:17 ` [PATCH V3 2/8] nvme: pci: cover timeout for admin commands running in EH Ming Lei
2018-05-03  3:17   ` Ming Lei
2018-05-03  3:17 ` [PATCH V3 3/8] nvme: pci: only wait freezing if queue is frozen Ming Lei
2018-05-03  3:17   ` Ming Lei
2018-05-03  3:17 ` [PATCH V3 4/8] nvme: pci: freeze queue in nvme_dev_disable() in case of error recovery Ming Lei
2018-05-03  3:17   ` Ming Lei
2018-05-03  3:17 ` [PATCH V3 5/8] nvme: fix race between freeze queues and unfreeze queues Ming Lei
2018-05-03  3:17   ` Ming Lei
2018-05-03  3:17 ` [PATCH V3 6/8] nvme: pci: split controller resetting into two parts Ming Lei
2018-05-03  3:17   ` Ming Lei
2018-05-03  3:17 ` [PATCH V3 7/8] nvme: pci: recover controller reliably Ming Lei
2018-05-03  3:17   ` Ming Lei
2018-05-03  9:14   ` jianchao.wang
2018-05-03  9:14     ` jianchao.wang
2018-05-03 10:08     ` Ming Lei
2018-05-03 10:08       ` Ming Lei
2018-05-03 15:46       ` jianchao.wang
2018-05-03 15:46         ` jianchao.wang
2018-05-04  4:24         ` Ming Lei
2018-05-04  4:24           ` Ming Lei
2018-05-04  6:10           ` jianchao.wang
2018-05-04  6:10             ` jianchao.wang
2018-05-04  6:21             ` jianchao.wang
2018-05-04  6:21               ` jianchao.wang
2018-05-04  8:02             ` Ming Lei
2018-05-04  8:02               ` Ming Lei
2018-05-04  8:28               ` jianchao.wang
2018-05-04  8:28                 ` jianchao.wang
2018-05-04  9:16                 ` Ming Lei
2018-05-04  9:16                   ` Ming Lei
2018-05-05  0:16                 ` Ming Lei
2018-05-05  0:16                   ` Ming Lei
2018-05-03  3:17 ` [PATCH V3 8/8] nvme: pci: simplify timeout handling Ming Lei
2018-05-03  3:17   ` Ming Lei
