* [PATCH V2 0/5] nvme: pci: fix & improve timeout handling
From: Ming Lei @ 2018-04-29 15:41 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme

Hi,

The 1st patch introduces blk_quiesce_timeout() and blk_unquiesce_timeout()
for NVMe, and also fixes blk_sync_queue().

The 2nd patch covers the timeout of admin commands used for recovering the
controller, so that a possible deadlock is avoided.

The other three patches fix several races in the NVMe timeout handler, and
finally make blktests block/011 pass.
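
Just to illustrate the intended usage of the two new helpers (a rough
sketch based on patch 5, not a literal hunk from the series): timeout
handling is quiesced while in-flight requests are being canceled, then
re-enabled afterwards:

        /* drain timeout handling while requests are canceled */
        nvme_quiesce_timeout(&dev->ctrl);
        blk_quiesce_timeout(dev->ctrl.admin_q);

        /* ... disable the controller and cancel in-flight requests ... */

        /* all requests are canceled now, so enable timeout handling again */
        nvme_unquiesce_timeout(&dev->ctrl);
        blk_unquiesce_timeout(dev->ctrl.admin_q);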


Ming Lei (5):
  block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout()
  nvme: pci: cover timeout for admin commands running in EH
  nvme: pci: only wait freezing if queue is frozen
  nvme: fix race between freeze queues and unfreeze queues
  nvme: pci: simplify timeout handling

V2:
	- fix draining of the timeout work, so there is no need to change
	the return value of .timeout()
	- fix race between nvme_start_freeze() and nvme_unfreeze()
	- cover timeout for admin commands running in EH

 block/blk-core.c         |  34 +++++++-
 block/blk-mq.c           |   9 ++
 block/blk-timeout.c      |   5 +-
 drivers/nvme/host/core.c |  47 ++++++++--
 drivers/nvme/host/nvme.h |   6 ++
 drivers/nvme/host/pci.c  | 222 ++++++++++++++++++++++++++++++++++++++++++-----
 include/linux/blkdev.h   |   4 +
 7 files changed, 296 insertions(+), 31 deletions(-)

Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org

-- 
2.9.5

* [PATCH V2 1/5] block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout()
From: Ming Lei @ 2018-04-29 15:41 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, Bart Van Assche,
	Jianchao Wang, Christoph Hellwig, Sagi Grimberg, linux-nvme

Turns out the current way can't drain timeout handling completely, because
mod_timer() can be triggered from the timeout work function, and that work
may be the very one being synced below, so the timer gets re-armed after it
has already been deleted:

        del_timer_sync(&q->timeout);
        cancel_work_sync(&q->timeout_work);

This patch introduces a 'timeout_off' flag to fix the issue; this simple
approach turns out to work.

Also blk_quiesce_timeout() and blk_unquiesce_timeout() are introduced for
draining timeout handling, which is needed by NVMe.
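
To spell out the race being closed, here is an illustrative interleaving
(a sketch only, not code from this patch):

        /*
         * CPU0: blk_sync_queue()              CPU1: q->timeout_work running
         *
         * del_timer_sync(&q->timeout);
         *                                     blk_timeout_work()
         *                                       mod_timer(&q->timeout, ...);
         * cancel_work_sync(&q->timeout_work);
         *
         * The timer was re-armed after del_timer_sync(), so it fires later
         * and queues the timeout work again: timeout handling was never
         * really drained.  With 'timeout_off' set before the two sync
         * calls, the work function returns early and never re-arms the
         * timer, so the drain actually completes.
         */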

Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c       | 34 ++++++++++++++++++++++++++++++++--
 block/blk-mq.c         |  9 +++++++++
 block/blk-timeout.c    |  5 ++++-
 include/linux/blkdev.h |  4 ++++
 4 files changed, 49 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 806ce2442819..1288a0673ed1 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -388,6 +388,35 @@ void blk_stop_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_stop_queue);
 
+static void __blk_unquiesce_timeout(struct request_queue *q)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	q->timeout_off = false;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+void blk_unquiesce_timeout(struct request_queue *q)
+{
+	__blk_unquiesce_timeout(q);
+	mod_timer(&q->timeout, jiffies + q->rq_timeout);
+}
+EXPORT_SYMBOL(blk_unquiesce_timeout);
+
+void blk_quiesce_timeout(struct request_queue *q)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	q->timeout_off = true;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	del_timer_sync(&q->timeout);
+	cancel_work_sync(&q->timeout_work);
+}
+EXPORT_SYMBOL(blk_quiesce_timeout);
+
 /**
  * blk_sync_queue - cancel any pending callbacks on a queue
  * @q: the queue
@@ -408,8 +437,7 @@ EXPORT_SYMBOL(blk_stop_queue);
  */
 void blk_sync_queue(struct request_queue *q)
 {
-	del_timer_sync(&q->timeout);
-	cancel_work_sync(&q->timeout_work);
+	blk_quiesce_timeout(q);
 
 	if (q->mq_ops) {
 		struct blk_mq_hw_ctx *hctx;
@@ -421,6 +449,8 @@ void blk_sync_queue(struct request_queue *q)
 	} else {
 		cancel_delayed_work_sync(&q->delay_work);
 	}
+
+	__blk_unquiesce_timeout(q);
 }
 EXPORT_SYMBOL(blk_sync_queue);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0dc9e341c2a7..890dd3b3e3e1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -901,6 +901,15 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	};
 	struct blk_mq_hw_ctx *hctx;
 	int i;
+	bool timeout_off;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	timeout_off = q->timeout_off;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	if (timeout_off)
+		return;
 
 	/* A deadlock might occur if a request is stuck requiring a
 	 * timeout at the same time a queue freeze is waiting
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 652d4d4d3e97..ffd0b609091e 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -136,12 +136,15 @@ void blk_timeout_work(struct work_struct *work)
 
 	spin_lock_irqsave(q->queue_lock, flags);
 
+	if (q->timeout_off)
+		goto exit;
+
 	list_for_each_entry_safe(rq, tmp, &q->timeout_list, timeout_list)
 		blk_rq_check_expired(rq, &next, &next_set);
 
 	if (next_set)
 		mod_timer(&q->timeout, round_jiffies_up(next));
-
+exit:
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9af3e0f430bc..99488a5c7d34 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -584,6 +584,7 @@ struct request_queue {
 	struct timer_list	timeout;
 	struct work_struct	timeout_work;
 	struct list_head	timeout_list;
+	bool			timeout_off;
 
 	struct list_head	icq_list;
 #ifdef CONFIG_BLK_CGROUP
@@ -1011,6 +1012,9 @@ extern void blk_execute_rq(struct request_queue *, struct gendisk *,
 extern void blk_execute_rq_nowait(struct request_queue *, struct gendisk *,
 				  struct request *, int, rq_end_io_fn *);
 
+extern void blk_quiesce_timeout(struct request_queue *q);
+extern void blk_unquiesce_timeout(struct request_queue *q);
+
 int blk_status_to_errno(blk_status_t status);
 blk_status_t errno_to_blk_status(int errno);
 
-- 
2.9.5

* [PATCH V2 2/5] nvme: pci: cover timeout for admin commands running in EH
From: Ming Lei @ 2018-04-29 15:41 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme

When admin commands are used in EH for recovering the controller, we have to
cover their timeout ourselves and can't depend on the block layer's timeout
handling, since a deadlock may be caused if these commands are timed out by
the block layer again.
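
The approach, in condensed sketch form (it mirrors nvme_set_host_mem_timeout()
in the hunk below; the snippet itself is only illustrative):

        DECLARE_COMPLETION_ONSTACK(wait);

        req = nvme_alloc_request(dev->ctrl.admin_q, &c, 0, NVME_QID_ANY);
        req->timeout = ADMIN_TIMEOUT;
        req->end_io_data = &wait;

        /* submit asynchronously instead of using nvme_submit_sync_cmd() ... */
        blk_execute_rq_nowait(dev->ctrl.admin_q, NULL, req, false,
                        nvme_set_host_mem_end_io);

        /*
         * ... and bound the wait here, so EH never depends on the block
         * layer timeout handler (which may be the very context we are
         * running in) to reclaim a stuck admin command.
         */
        wait_for_completion_io_timeout(&wait, ADMIN_TIMEOUT);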

Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 77 ++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 66 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fbc71fac6f1e..0e6cd605164a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1733,21 +1733,28 @@ static inline void nvme_release_cmb(struct nvme_dev *dev)
 	}
 }
 
-static int nvme_set_host_mem(struct nvme_dev *dev, u32 bits)
+static void nvme_init_set_host_mem_cmd(struct nvme_dev *dev,
+		struct nvme_command *c, u32 bits)
 {
 	u64 dma_addr = dev->host_mem_descs_dma;
+
+	memset(c, 0, sizeof(*c));
+	c->features.opcode	= nvme_admin_set_features;
+	c->features.fid		= cpu_to_le32(NVME_FEAT_HOST_MEM_BUF);
+	c->features.dword11	= cpu_to_le32(bits);
+	c->features.dword12	= cpu_to_le32(dev->host_mem_size >>
+					      ilog2(dev->ctrl.page_size));
+	c->features.dword13	= cpu_to_le32(lower_32_bits(dma_addr));
+	c->features.dword14	= cpu_to_le32(upper_32_bits(dma_addr));
+	c->features.dword15	= cpu_to_le32(dev->nr_host_mem_descs);
+}
+
+static int nvme_set_host_mem(struct nvme_dev *dev, u32 bits)
+{
 	struct nvme_command c;
 	int ret;
 
-	memset(&c, 0, sizeof(c));
-	c.features.opcode	= nvme_admin_set_features;
-	c.features.fid		= cpu_to_le32(NVME_FEAT_HOST_MEM_BUF);
-	c.features.dword11	= cpu_to_le32(bits);
-	c.features.dword12	= cpu_to_le32(dev->host_mem_size >>
-					      ilog2(dev->ctrl.page_size));
-	c.features.dword13	= cpu_to_le32(lower_32_bits(dma_addr));
-	c.features.dword14	= cpu_to_le32(upper_32_bits(dma_addr));
-	c.features.dword15	= cpu_to_le32(dev->nr_host_mem_descs);
+	nvme_init_set_host_mem_cmd(dev, &c, bits);
 
 	ret = nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
 	if (ret) {
@@ -1758,6 +1765,54 @@ static int nvme_set_host_mem(struct nvme_dev *dev, u32 bits)
 	return ret;
 }
 
+static void nvme_set_host_mem_end_io(struct request *rq, blk_status_t sts)
+{
+	struct completion *waiting = rq->end_io_data;
+
+	rq->end_io_data = NULL;
+
+	/*
+	 * complete last, if this is a stack request the process (and thus
+	 * the rq pointer) could be invalid right after this complete()
+	 */
+	complete(waiting);
+}
+
+/*
+ * This function can be used inside timeout handler, when the block layer
+ * timeout may not work, so this function has to cover the timeout by itself.
+ */
+static int nvme_set_host_mem_timeout(struct nvme_dev *dev, u32 bits)
+{
+	DECLARE_COMPLETION_ONSTACK(wait);
+	struct nvme_command c;
+	struct request_queue *q = dev->ctrl.admin_q;
+	struct request *req;
+	int ret;
+
+	nvme_init_set_host_mem_cmd(dev, &c, bits);
+
+	req = nvme_alloc_request(q, &c, 0, NVME_QID_ANY);
+	if (IS_ERR(req))
+		return PTR_ERR(req);
+
+	req->timeout = ADMIN_TIMEOUT;
+	req->end_io_data = &wait;
+
+	blk_execute_rq_nowait(q, NULL, req, false,
+			nvme_set_host_mem_end_io);
+	wait_for_completion_io_timeout(&wait, ADMIN_TIMEOUT);
+
+	if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
+		ret = -EINTR;
+	else
+		ret = nvme_req(req)->status;
+
+	blk_mq_free_request(req);
+
+	return ret;
+}
+
 static void nvme_free_host_mem(struct nvme_dev *dev)
 {
 	int i;
@@ -2216,7 +2271,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 		 * but I'd rather be safe than sorry..
 		 */
 		if (dev->host_mem_descs)
-			nvme_set_host_mem(dev, 0);
+			nvme_set_host_mem_timeout(dev, 0);
 		nvme_disable_io_queues(dev);
 		nvme_disable_admin_queue(dev, shutdown);
 	}
-- 
2.9.5

* [PATCH V2 3/5] nvme: pci: only wait freezing if queue is frozen
From: Ming Lei @ 2018-04-29 15:41 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme

In nvme_dev_disable(), which is called while shutting down the controller,
nvme_wait_freeze_timeout() may be run on a controller that hasn't been
frozen yet, so add a check to avoid that case.

Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 0e6cd605164a..8172ee584130 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2240,14 +2240,17 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	int i;
 	bool dead = true;
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
+	bool frozen = false;
 
 	mutex_lock(&dev->shutdown_lock);
 	if (pci_is_enabled(pdev)) {
 		u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
 		if (dev->ctrl.state == NVME_CTRL_LIVE ||
-		    dev->ctrl.state == NVME_CTRL_RESETTING)
+		    dev->ctrl.state == NVME_CTRL_RESETTING) {
 			nvme_start_freeze(&dev->ctrl);
+			frozen = true;
+		}
 		dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
 			pdev->error_state  != pci_channel_io_normal);
 	}
@@ -2257,7 +2260,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	 * doing a safe shutdown.
 	 */
 	if (!dead) {
-		if (shutdown)
+		if (shutdown && frozen)
 			nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
 	}
 
-- 
2.9.5

* [PATCH V2 4/5] nvme: fix race between freeze queues and unfreeze queues
From: Ming Lei @ 2018-04-29 15:41 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme

Both nvme_dev_disable() and resetting the controller are required for
recovering the controller, but the two are run from different contexts.

nvme_start_freeze() is run from nvme_dev_disable(), and nvme_unfreeze()
is run from the resetting context. Unfortunately a timeout may be triggered
while draining IO from the resetting controller, so nvme_start_freeze() may
be run several times.

This patch fixes the race between nvme_start_freeze() and
nvme_unfreeze() by introducing a 'queues_freezed' flag and a
'freeze_mutex' lock.
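
Some background on why the extra nvme_start_freeze() is harmful (an
illustrative note, not part of the patch): queue freezing in blk-mq is
depth counted, so every blk_freeze_queue_start() needs exactly one
matching blk_mq_unfreeze_queue():

        /*
         * disable/reset path                timeout during reset
         *
         * nvme_start_freeze()               freeze depth 0 -> 1
         * ... draining IO ...               request times out
         *                                   nvme_start_freeze()  1 -> 2
         * nvme_unfreeze()                   2 -> 1: the queues stay frozen
         *                                   and IO hangs
         *
         * With 'queues_freezed' serialized by 'freeze_mutex', the second
         * freeze (and any unbalanced unfreeze) becomes a no-op.
         */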

Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/core.c | 29 +++++++++++++++++++++--------
 drivers/nvme/host/nvme.h |  3 +++
 2 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 9df4f71e58ca..f9028873298e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3447,6 +3447,9 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	INIT_WORK(&ctrl->fw_act_work, nvme_fw_act_work);
 	INIT_WORK(&ctrl->delete_work, nvme_delete_ctrl_work);
 
+	ctrl->queues_freezed = false;
+	mutex_init(&ctrl->freeze_mutex);
+
 	ret = ida_simple_get(&nvme_instance_ida, 0, 0, GFP_KERNEL);
 	if (ret < 0)
 		goto out;
@@ -3526,10 +3529,15 @@ void nvme_unfreeze(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
 
-	down_read(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list)
-		blk_mq_unfreeze_queue(ns->queue);
-	up_read(&ctrl->namespaces_rwsem);
+	mutex_lock(&ctrl->freeze_mutex);
+	if (ctrl->queues_freezed) {
+		down_read(&ctrl->namespaces_rwsem);
+		list_for_each_entry(ns, &ctrl->namespaces, list)
+			blk_mq_unfreeze_queue(ns->queue);
+		up_read(&ctrl->namespaces_rwsem);
+		ctrl->queues_freezed = false;
+	}
+	mutex_unlock(&ctrl->freeze_mutex);
 }
 EXPORT_SYMBOL_GPL(nvme_unfreeze);
 
@@ -3562,10 +3570,15 @@ void nvme_start_freeze(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
 
-	down_read(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list)
-		blk_freeze_queue_start(ns->queue);
-	up_read(&ctrl->namespaces_rwsem);
+	mutex_lock(&ctrl->freeze_mutex);
+	if (!ctrl->queues_freezed) {
+		down_read(&ctrl->namespaces_rwsem);
+		list_for_each_entry(ns, &ctrl->namespaces, list)
+			blk_freeze_queue_start(ns->queue);
+		up_read(&ctrl->namespaces_rwsem);
+		ctrl->queues_freezed = true;
+	}
+	mutex_unlock(&ctrl->freeze_mutex);
 }
 EXPORT_SYMBOL_GPL(nvme_start_freeze);
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 061fecfd44f5..a19b7f04ac24 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -149,6 +149,9 @@ struct nvme_ctrl {
 	struct work_struct reset_work;
 	struct work_struct delete_work;
 
+	bool queues_freezed;
+	struct mutex freeze_mutex;
+
 	struct nvme_subsystem *subsys;
 	struct list_head subsys_entry;
 
-- 
2.9.5

* [PATCH V2 5/5] nvme: pci: simplify timeout handling
From: Ming Lei @ 2018-04-29 15:41 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme

When a request times out, nvme_timeout() currently handles it in the
following way:

	nvme_dev_disable(dev, false);
	nvme_reset_ctrl(&dev->ctrl);
	return BLK_EH_HANDLED;

which may introduce the following issues:

1) a subsequent timeout on other requests may call nvme_dev_disable()
again, which may quiesce the queues again while resetting is in progress,
so that finally nothing can make progress.

2) when a timeout is triggered in the reset work function,
nvme_wait_freeze() may wait forever because the controller can't be
recovered at all.

This patch fixes the issues by:

1) handling timeout events in one EH thread, and waking up this thread
when controller recovery is needed (see the flow sketch below)

2) draining timeout handling via nvme_quiesce_timeout() inside the EH
path before canceling in-flight requests, so all requests can be
canceled or completed by either EH or the block timeout handler

3) moving draining IO and unfreezing queues into a separate post_eh work
context, so the controller can always be recovered.

This patch fixes the failures reported by the horrible test block/011.
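
For reference, the recovery flow after this patch can be summarized as
follows (a sketch of the call sequence only; all names are taken from
the diff below):

        /*
         * nvme_timeout()
         *   nvme_eh_schedule()              mark eh_in_recovery, wake EH thread
         *
         * nvme_error_handler()  (kthread)
         *   nvme_start_freeze(&dev->ctrl)   freeze queues before recovery
         *   nvme_dev_disable(dev, false)    quiesce timeout handling and
         *                                   cancel in-flight requests
         *   nvme_reset_ctrl_sync()          reset no longer waits for IO
         *                                   completion, so a sync reset is safe
         *   queue_work(post_eh_work)
         *
         * nvme_post_eh_work()
         *   nvme_wait_freeze()              drain IO
         *   nvme_unfreeze()                 unfreeze queues
         */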

Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/core.c |  22 ++++++++
 drivers/nvme/host/nvme.h |   3 +
 drivers/nvme/host/pci.c  | 140 +++++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 155 insertions(+), 10 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f9028873298e..664a268accd1 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3582,6 +3582,28 @@ void nvme_start_freeze(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_start_freeze);
 
+void nvme_unquiesce_timeout(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	down_read(&ctrl->namespaces_rwsem);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		blk_unquiesce_timeout(ns->queue);
+	up_read(&ctrl->namespaces_rwsem);
+}
+EXPORT_SYMBOL_GPL(nvme_unquiesce_timeout);
+
+void nvme_quiesce_timeout(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	down_read(&ctrl->namespaces_rwsem);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		blk_quiesce_timeout(ns->queue);
+	up_read(&ctrl->namespaces_rwsem);
+}
+EXPORT_SYMBOL_GPL(nvme_quiesce_timeout);
+
 void nvme_stop_queues(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index a19b7f04ac24..956ee19ff403 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -19,6 +19,7 @@
 #include <linux/pci.h>
 #include <linux/kref.h>
 #include <linux/blk-mq.h>
+#include <linux/blkdev.h>
 #include <linux/lightnvm.h>
 #include <linux/sed-opal.h>
 #include <linux/fault-inject.h>
@@ -405,6 +406,8 @@ int nvme_sec_submit(void *data, u16 spsp, u8 secp, void *buffer, size_t len,
 void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
 		union nvme_result *res);
 
+void nvme_unquiesce_timeout(struct nvme_ctrl *ctrl);
+void nvme_quiesce_timeout(struct nvme_ctrl *ctrl);
 void nvme_stop_queues(struct nvme_ctrl *ctrl);
 void nvme_start_queues(struct nvme_ctrl *ctrl);
 void nvme_kill_queues(struct nvme_ctrl *ctrl);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8172ee584130..86c14d9051ff 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -29,6 +29,7 @@
 #include <linux/types.h>
 #include <linux/io-64-nonatomic-lo-hi.h>
 #include <linux/sed-opal.h>
+#include <linux/kthread.h>
 
 #include "nvme.h"
 
@@ -112,6 +113,11 @@ struct nvme_dev {
 	dma_addr_t host_mem_descs_dma;
 	struct nvme_host_mem_buf_desc *host_mem_descs;
 	void **host_mem_desc_bufs;
+
+	spinlock_t	  eh_lock;
+	bool		eh_in_recovery;
+	struct task_struct    *ehandler;
+	struct work_struct post_eh_work;
 };
 
 static int io_queue_depth_set(const char *val, const struct kernel_param *kp)
@@ -1176,6 +1182,100 @@ static void nvme_warn_reset(struct nvme_dev *dev, u32 csts)
 			 csts, result);
 }
 
+static void nvme_eh_schedule(struct nvme_dev *dev)
+{
+	spin_lock(&dev->eh_lock);
+	if (!dev->eh_in_recovery) {
+		dev->eh_in_recovery = true;
+		wake_up_process(dev->ehandler);
+	}
+	spin_unlock(&dev->eh_lock);
+}
+
+static void nvme_eh_done(struct nvme_dev *dev)
+{
+	spin_lock(&dev->eh_lock);
+	if (dev->eh_in_recovery)
+		dev->eh_in_recovery = false;
+	spin_unlock(&dev->eh_lock);
+}
+
+static int nvme_error_handler(void *data)
+{
+	struct nvme_dev *dev = data;
+
+	while (true) {
+		/*
+		 * The sequence in kthread_stop() sets the stop flag first
+		 * then wakes the process.  To avoid missed wakeups, the task
+		 * should always be in a non running state before the stop
+		 * flag is checked
+		 */
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (kthread_should_stop())
+			break;
+
+		spin_lock(&dev->eh_lock);
+		if (!dev->eh_in_recovery) {
+			spin_unlock(&dev->eh_lock);
+			schedule();
+			continue;
+		}
+		spin_unlock(&dev->eh_lock);
+
+		__set_current_state(TASK_RUNNING);
+
+		dev_info(dev->ctrl.device, "start eh recovery\n");
+
+		/* freeze queues before recovery */
+		nvme_start_freeze(&dev->ctrl);
+
+		nvme_dev_disable(dev, false);
+
+		/*
+		 * reset won't wait for IO completion any more, so it
+		 * is safe to reset controller in sync way
+		 */
+		nvme_reset_ctrl_sync(&dev->ctrl);
+
+		dev_info(dev->ctrl.device, "eh recovery done\n");
+
+		/*
+		 * drain IO & unfreeze queues in another context because
+		 * these IOs may trigger timeout again
+		 */
+		queue_work(nvme_reset_wq, &dev->post_eh_work);
+	}
+	__set_current_state(TASK_RUNNING);
+	dev->ehandler = NULL;
+
+	return 0;
+}
+
+static void nvme_post_eh_work(struct work_struct *work)
+{
+	struct nvme_dev *dev =
+		container_of(work, struct nvme_dev, post_eh_work);
+
+	nvme_wait_freeze(&dev->ctrl);
+	nvme_unfreeze(&dev->ctrl);
+}
+
+static int nvme_eh_init(struct nvme_dev *dev)
+{
+	spin_lock_init(&dev->eh_lock);
+	INIT_WORK(&dev->post_eh_work, nvme_post_eh_work);
+
+	dev->ehandler = kthread_run(nvme_error_handler, dev,
+			"nvme_eh_%d", dev->ctrl.instance);
+	if (IS_ERR(dev->ehandler)) {
+		dev_err(dev->ctrl.device, "error handler thread failed to spawn, error = %ld\n",
+			PTR_ERR(dev->ehandler));
+		return PTR_ERR(dev->ehandler);
+	}
+	return 0;
+}
+
 static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -1197,8 +1297,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	 */
 	if (nvme_should_reset(dev, csts)) {
 		nvme_warn_reset(dev, csts);
-		nvme_dev_disable(dev, false);
-		nvme_reset_ctrl(&dev->ctrl);
+		nvme_eh_schedule(dev);
 		return BLK_EH_HANDLED;
 	}
 
@@ -1224,8 +1323,8 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		dev_warn(dev->ctrl.device,
 			 "I/O %d QID %d timeout, disable controller\n",
 			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false);
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
+		nvme_eh_schedule(dev);
 		return BLK_EH_HANDLED;
 	default:
 		break;
@@ -1240,14 +1339,12 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		dev_warn(dev->ctrl.device,
 			 "I/O %d QID %d timeout, reset controller\n",
 			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false);
-		nvme_reset_ctrl(&dev->ctrl);
-
 		/*
 		 * Mark the request as handled, since the inline shutdown
 		 * forces all outstanding requests to complete.
 		 */
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
+		nvme_eh_schedule(dev);
 		return BLK_EH_HANDLED;
 	}
 
@@ -2246,8 +2343,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	if (pci_is_enabled(pdev)) {
 		u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
-		if (dev->ctrl.state == NVME_CTRL_LIVE ||
-		    dev->ctrl.state == NVME_CTRL_RESETTING) {
+		if (shutdown && (dev->ctrl.state == NVME_CTRL_LIVE ||
+		    dev->ctrl.state == NVME_CTRL_RESETTING)) {
 			nvme_start_freeze(&dev->ctrl);
 			frozen = true;
 		}
@@ -2281,11 +2378,23 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
 		nvme_suspend_queue(&dev->queues[i]);
 
+	/* safe to sync timeout after queues are quiesced */
+	nvme_quiesce_timeout(&dev->ctrl);
+	blk_quiesce_timeout(dev->ctrl.admin_q);
+
 	nvme_pci_disable(dev);
 
+	/*
+	 * Both timeout and interrupt handler have been drained, and all
+	 * in-flight requests will be canceled now.
+	 */
 	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
 	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
 
+	/* all requests are canceled now, so enable timeout now */
+	nvme_unquiesce_timeout(&dev->ctrl);
+	blk_unquiesce_timeout(dev->ctrl.admin_q);
+
 	/*
 	 * The driver will not be starting up queues again if shutting down so
 	 * must flush all entered requests to their failed completion to avoid
@@ -2416,9 +2525,14 @@ static void nvme_reset_work(struct work_struct *work)
 	if (result)
 		goto out;
 
+	nvme_eh_done(dev);
+
 	/*
 	 * Keep the controller around but remove all namespaces if we don't have
 	 * any working I/O queue.
+	 *
+	 * Now we won't wait for IO completion here any more, and sync reset
+	 * can be used safely anywhere.
 	 */
 	if (dev->online_queues < 2) {
 		dev_warn(dev->ctrl.device, "IO queues not created\n");
@@ -2427,11 +2541,9 @@ static void nvme_reset_work(struct work_struct *work)
 		new_state = NVME_CTRL_ADMIN_ONLY;
 	} else {
 		nvme_start_queues(&dev->ctrl);
-		nvme_wait_freeze(&dev->ctrl);
 		/* hit this only when allocate tagset fails */
 		if (nvme_dev_add(dev))
 			new_state = NVME_CTRL_ADMIN_ONLY;
-		nvme_unfreeze(&dev->ctrl);
 	}
 
 	/*
@@ -2449,6 +2561,7 @@ static void nvme_reset_work(struct work_struct *work)
 
  out:
 	nvme_remove_dead_ctrl(dev, result);
+	nvme_eh_done(dev);
 }
 
 static void nvme_remove_dead_ctrl_work(struct work_struct *work)
@@ -2590,10 +2703,15 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
 	dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev));
 
+	if (nvme_eh_init(dev))
+		goto uninit_ctrl;
+
 	nvme_reset_ctrl(&dev->ctrl);
 
 	return 0;
 
+ uninit_ctrl:
+	nvme_uninit_ctrl(&dev->ctrl);
  release_pools:
 	nvme_release_prp_pools(dev);
  unmap:
@@ -2647,6 +2765,8 @@ static void nvme_remove(struct pci_dev *pdev)
 	nvme_stop_ctrl(&dev->ctrl);
 	nvme_remove_namespaces(&dev->ctrl);
 	nvme_dev_disable(dev, true);
+	if (dev->ehandler)
+		kthread_stop(dev->ehandler);
 	nvme_free_host_mem(dev);
 	nvme_dev_remove_admin(dev);
 	nvme_free_queues(dev, 0);
-- 
2.9.5

* Re: [PATCH V2 5/5] nvme: pci: simplify timeout handling
From: jianchao.wang @ 2018-05-02  2:23 UTC (permalink / raw)
  To: Ming Lei, Keith Busch
  Cc: Jens Axboe, Sagi Grimberg, linux-nvme, linux-block, Christoph Hellwig

Hi Ming

On 04/29/2018 11:41 PM, Ming Lei wrote:
> +
>  static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>  {
>  	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> @@ -1197,8 +1297,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>  	 */
>  	if (nvme_should_reset(dev, csts)) {
>  		nvme_warn_reset(dev, csts);
> -		nvme_dev_disable(dev, false);
> -		nvme_reset_ctrl(&dev->ctrl);
> +		nvme_eh_schedule(dev);
>  		return BLK_EH_HANDLED;
>  	}
>  
> @@ -1224,8 +1323,8 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>  		dev_warn(dev->ctrl.device,
>  			 "I/O %d QID %d timeout, disable controller\n",
>  			 req->tag, nvmeq->qid);
> -		nvme_dev_disable(dev, false);
>  		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
> +		nvme_eh_schedule(dev);
>  		return BLK_EH_HANDLED;
>  	default:
>  		break;
> @@ -1240,14 +1339,12 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>  		dev_warn(dev->ctrl.device,
>  			 "I/O %d QID %d timeout, reset controller\n",
>  			 req->tag, nvmeq->qid);
> -		nvme_dev_disable(dev, false);
> -		nvme_reset_ctrl(&dev->ctrl);
> -
>  		/*
>  		 * Mark the request as handled, since the inline shutdown
>  		 * forces all outstanding requests to complete.
>  		 */
>  		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
> +		nvme_eh_schedule(dev);
>  		return BLK_EH_HANDLED;
>  	}
>  
> @@ -2246,8 +2343,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>  	if (pci_is_enabled(pdev)) {
>  		u32 csts = readl(dev->bar + NVME_REG_CSTS);
>  
> -		if (dev->ctrl.state == NVME_CTRL_LIVE ||
> -		    dev->ctrl.state == NVME_CTRL_RESETTING) {
> +		if (shutdown && (dev->ctrl.state == NVME_CTRL_LIVE ||
> +		    dev->ctrl.state == NVME_CTRL_RESETTING)) {
>  			nvme_start_freeze(&dev->ctrl);
>  			frozen = true;
>  		}
> @@ -2281,11 +2378,23 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>  	for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
>  		nvme_suspend_queue(&dev->queues[i]);
>  
> +	/* safe to sync timeout after queues are quiesced */
> +	nvme_quiesce_timeout(&dev->ctrl);
> +	blk_quiesce_timeout(dev->ctrl.admin_q);
> +
>  	nvme_pci_disable(dev);
>  
> +	/*
> +	 * Both timeout and interrupt handler have been drained, and all
> +	 * in-flight requests will be canceled now.
> +	 */
>  	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
>  	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);


We need to return BLK_EH_RESET_TIMER in nvme_timeout then:
1. to defer the completion: we can't unmap the IO request before the controller is shut down completely, so not BLK_EH_HANDLED.
2. so that nvme_cancel_request() can complete it: blk_mq_complete_request() is invoked by nvme_cancel_request(), so not BLK_EH_NOT_HANDLED.

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V2 1/5] block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout()
  2018-04-29 15:41   ` Ming Lei
@ 2018-05-02  2:23     ` jianchao.wang
  -1 siblings, 0 replies; 26+ messages in thread
From: jianchao.wang @ 2018-05-02  2:23 UTC (permalink / raw)
  To: Ming Lei, Keith Busch
  Cc: Jens Axboe, Sagi Grimberg, linux-nvme, linux-block,
	Bart Van Assche, Christoph Hellwig

Hi ming

On 04/29/2018 11:41 PM, Ming Lei wrote:
>  
> +static void __blk_unquiesce_timeout(struct request_queue *q)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(q->queue_lock, flags);
> +	q->timeout_off = false;
> +	spin_unlock_irqrestore(q->queue_lock, flags);
> +}
> +
> +void blk_unquiesce_timeout(struct request_queue *q)
> +{
> +	__blk_unquiesce_timeout(q);
> +	mod_timer(&q->timeout, jiffies + q->rq_timeout);
> +}
> +EXPORT_SYMBOL(blk_unquiesce_timeout);
> +
> +void blk_quiesce_timeout(struct request_queue *q)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(q->queue_lock, flags);
> +	q->timeout_off = true;
> +	spin_unlock_irqrestore(q->queue_lock, flags);
> +
> +	del_timer_sync(&q->timeout);
> +	cancel_work_sync(&q->timeout_work);
> +}
> +EXPORT_SYMBOL(blk_quiesce_timeout);
> +
>  /**
>   * blk_sync_queue - cancel any pending callbacks on a queue
>   * @q: the queue
> @@ -408,8 +437,7 @@ EXPORT_SYMBOL(blk_stop_queue);
>   */
>  void blk_sync_queue(struct request_queue *q)
>  {
> -	del_timer_sync(&q->timeout);
> -	cancel_work_sync(&q->timeout_work);
> +	blk_quiesce_timeout(q);
>  
>  	if (q->mq_ops) {
>  		struct blk_mq_hw_ctx *hctx;
> @@ -421,6 +449,8 @@ void blk_sync_queue(struct request_queue *q)
>  	} else {
>  		cancel_delayed_work_sync(&q->delay_work);
>  	}
> +
> +	__blk_unquiesce_timeout(q);
>  }
>  EXPORT_SYMBOL(blk_sync_queue);
>  
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 0dc9e341c2a7..890dd3b3e3e1 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -901,6 +901,15 @@ static void blk_mq_timeout_work(struct work_struct *work)
>  	};
>  	struct blk_mq_hw_ctx *hctx;
>  	int i;
> +	bool timeout_off;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(q->queue_lock, flags);
> +	timeout_off = q->timeout_off;
> +	spin_unlock_irqrestore(q->queue_lock, flags);
> +
> +	if (timeout_off)
> +		return;

Looks like there is still a race, as follows:

blk_quiesce_timeout                          blk_mq_timeout_work
                                               -> timeout_off = q->timeout_off; //still a false
  -> q->timeout_off = true;
  -> del_timer_sync(&q->timeout);
                                               -> mod_timer
  -> cancel_work_sync(&q->timeout_work);


How about this:

Introduce a new request queue flag QUEUE_FLAG_TIMEOUT_QUIESCED

void blk_quiesce_queue_timeout(struct request_queue *q)
{
	unsigned long flags;

	spin_lock_irqsave(q->queue_lock, flags);
	queue_flag_set(QUEUE_FLAG_TIMEOUT_QUIESCED, q);
	spin_unlock_irqrestore(q->queue_lock, flags);

	synchronize_rcu_bh();
	cancel_work_sync(&q->timeout_work);
	/*
	 * The timeout work will not be invoked anymore.
	 */
	del_timer_sync(&q->timeout);
}

 void blk_set_queue_dying(struct request_queue *q)
 {
 	spin_lock_irq(q->queue_lock);
@@ -867,6 +905,9 @@ static void blk_rq_timed_out_timer(struct timer_list *t)
 {
 	struct request_queue *q = from_timer(q, t, timeout);
 
+	if (blk_queue_timeout_quiesced(q))
+		return;

 	kblockd_schedule_work(&q->timeout_work);
 }
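
(The sketch above assumes a new queue flag plus a test helper; following the usual blkdev.h pattern they could hypothetically be declared as below; the bit number is made up and would need a free slot.)

#define QUEUE_FLAG_TIMEOUT_QUIESCED	29	/* hypothetical bit number */
#define blk_queue_timeout_quiesced(q)	\
	test_bit(QUEUE_FLAG_TIMEOUT_QUIESCED, &(q)->queue_flags)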

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V2 1/5] block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout()
  2018-05-02  2:23     ` jianchao.wang
@ 2018-05-02  3:33       ` Ming Lei
  -1 siblings, 0 replies; 26+ messages in thread
From: Ming Lei @ 2018-05-02  3:33 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, linux-block,
	Bart Van Assche, Christoph Hellwig

On Wed, May 02, 2018 at 10:23:35AM +0800, jianchao.wang wrote:
> Hi ming
> 
> On 04/29/2018 11:41 PM, Ming Lei wrote:
> >  
> > +static void __blk_unquiesce_timeout(struct request_queue *q)
> > +{
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(q->queue_lock, flags);
> > +	q->timeout_off = false;
> > +	spin_unlock_irqrestore(q->queue_lock, flags);
> > +}
> > +
> > +void blk_unquiesce_timeout(struct request_queue *q)
> > +{
> > +	__blk_unquiesce_timeout(q);
> > +	mod_timer(&q->timeout, jiffies + q->rq_timeout);
> > +}
> > +EXPORT_SYMBOL(blk_unquiesce_timeout);
> > +
> > +void blk_quiesce_timeout(struct request_queue *q)
> > +{
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(q->queue_lock, flags);
> > +	q->timeout_off = true;
> > +	spin_unlock_irqrestore(q->queue_lock, flags);
> > +
> > +	del_timer_sync(&q->timeout);
> > +	cancel_work_sync(&q->timeout_work);
> > +}
> > +EXPORT_SYMBOL(blk_quiesce_timeout);
> > +
> >  /**
> >   * blk_sync_queue - cancel any pending callbacks on a queue
> >   * @q: the queue
> > @@ -408,8 +437,7 @@ EXPORT_SYMBOL(blk_stop_queue);
> >   */
> >  void blk_sync_queue(struct request_queue *q)
> >  {
> > -	del_timer_sync(&q->timeout);
> > -	cancel_work_sync(&q->timeout_work);
> > +	blk_quiesce_timeout(q);
> >  
> >  	if (q->mq_ops) {
> >  		struct blk_mq_hw_ctx *hctx;
> > @@ -421,6 +449,8 @@ void blk_sync_queue(struct request_queue *q)
> >  	} else {
> >  		cancel_delayed_work_sync(&q->delay_work);
> >  	}
> > +
> > +	__blk_unquiesce_timeout(q);
> >  }
> >  EXPORT_SYMBOL(blk_sync_queue);
> >  
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 0dc9e341c2a7..890dd3b3e3e1 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -901,6 +901,15 @@ static void blk_mq_timeout_work(struct work_struct *work)
> >  	};
> >  	struct blk_mq_hw_ctx *hctx;
> >  	int i;
> > +	bool timeout_off;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(q->queue_lock, flags);
> > +	timeout_off = q->timeout_off;
> > +	spin_unlock_irqrestore(q->queue_lock, flags);
> > +
> > +	if (timeout_off)
> > +		return;
> 
> Looks like there is still a race, as follows:
> 
> blk_quiesce_timeout                          blk_mq_timeout_work
>                                                -> timeout_off = q->timeout_off; //still a false
>   -> q->timeout_off = true;
>   -> del_timer_sync(&q->timeout);
>                                                -> mod_timer
>   -> cancel_work_sync(&q->timeout_work);

No, there isn't such a race: the mod_timer() doesn't make a difference,
because 'q->timeout_off' will be visible to any new run of the work func
after cancel_work_sync() returns. So even if the timer has expired, the
work func still returns immediately.

That means del_timer_sync() isn't strictly necessary either.
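
Put differently, the hunk quoted above only relies on this ordering (the comment is annotation, not new code):

	spin_lock_irqsave(q->queue_lock, flags);
	timeout_off = q->timeout_off;
	spin_unlock_irqrestore(q->queue_lock, flags);

	/*
	 * Any run of blk_mq_timeout_work() that starts after
	 * cancel_work_sync() has returned sees the earlier store of
	 * q->timeout_off = true, so a timer re-armed by a racing
	 * mod_timer() at worst schedules work that bails out here.
	 */
	if (timeout_off)
		return;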

Thanks,
Ming

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V2 5/5] nvme: pci: simplify timeout handling
  2018-05-02  2:23     ` jianchao.wang
@ 2018-05-02  4:54       ` Ming Lei
  -1 siblings, 0 replies; 26+ messages in thread
From: Ming Lei @ 2018-05-02  4:54 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, linux-block,
	Christoph Hellwig

On Wed, May 02, 2018 at 10:23:20AM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 04/29/2018 11:41 PM, Ming Lei wrote:
> > +
> >  static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
> >  {
> >  	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> > @@ -1197,8 +1297,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
> >  	 */
> >  	if (nvme_should_reset(dev, csts)) {
> >  		nvme_warn_reset(dev, csts);
> > -		nvme_dev_disable(dev, false);
> > -		nvme_reset_ctrl(&dev->ctrl);
> > +		nvme_eh_schedule(dev);
> >  		return BLK_EH_HANDLED;
> >  	}
> >  
> > @@ -1224,8 +1323,8 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
> >  		dev_warn(dev->ctrl.device,
> >  			 "I/O %d QID %d timeout, disable controller\n",
> >  			 req->tag, nvmeq->qid);
> > -		nvme_dev_disable(dev, false);
> >  		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
> > +		nvme_eh_schedule(dev);
> >  		return BLK_EH_HANDLED;
> >  	default:
> >  		break;
> > @@ -1240,14 +1339,12 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
> >  		dev_warn(dev->ctrl.device,
> >  			 "I/O %d QID %d timeout, reset controller\n",
> >  			 req->tag, nvmeq->qid);
> > -		nvme_dev_disable(dev, false);
> > -		nvme_reset_ctrl(&dev->ctrl);
> > -
> >  		/*
> >  		 * Mark the request as handled, since the inline shutdown
> >  		 * forces all outstanding requests to complete.
> >  		 */
> >  		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
> > +		nvme_eh_schedule(dev);
> >  		return BLK_EH_HANDLED;
> >  	}
> >  
> > @@ -2246,8 +2343,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
> >  	if (pci_is_enabled(pdev)) {
> >  		u32 csts = readl(dev->bar + NVME_REG_CSTS);
> >  
> > -		if (dev->ctrl.state == NVME_CTRL_LIVE ||
> > -		    dev->ctrl.state == NVME_CTRL_RESETTING) {
> > +		if (shutdown && (dev->ctrl.state == NVME_CTRL_LIVE ||
> > +		    dev->ctrl.state == NVME_CTRL_RESETTING)) {
> >  			nvme_start_freeze(&dev->ctrl);
> >  			frozen = true;
> >  		}
> > @@ -2281,11 +2378,23 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
> >  	for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
> >  		nvme_suspend_queue(&dev->queues[i]);
> >  
> > +	/* safe to sync timeout after queues are quiesced */
> > +	nvme_quiesce_timeout(&dev->ctrl);
> > +	blk_quiesce_timeout(dev->ctrl.admin_q);
> > +
> >  	nvme_pci_disable(dev);
> >  
> > +	/*
> > +	 * Both timeout and interrupt handler have been drained, and all
> > +	 * in-flight requests will be canceled now.
> > +	 */
> >  	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
> >  	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
> 
> 
> We need to return BLK_EH_RESET_TIMER from nvme_timeout in that case:
> 1. It defers the completion: we can't unmap the I/O request before the controller is shut down completely, so not BLK_EH_HANDLED.
> 2. nvme_cancel_request can still complete it: blk_mq_complete_request is invoked by nvme_cancel_request, so not BLK_EH_NOT_HANDLED.

We don't need to change the return value of .timeout() any more after
calling nvme_quiesce_timeout():

Thanks to the single EH kthread, once nvme_quiesce_timeout() returns, all
timed-out requests have been handled already. Some of them may be completed,
and others may be handled as RESET_TIMER, but none of them are handled as
NOT_HANDLED because nvme_timeout() won't return that value.

So the following blk_mq_tagset_busy_iter(nvme_cancel_request) can handle
all in-flight requests, because timeout handling is drained and quiesced.
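
nvme_quiesce_timeout() itself is added earlier in the series and is not visible in the hunks quoted here; by analogy with the other controller-wide helpers it presumably just walks the namespace queues, roughly like this (a guess, not the posted code; the locking assumes the current namespaces_rwsem, older trees used a mutex):

void nvme_quiesce_timeout(struct nvme_ctrl *ctrl)
{
	struct nvme_ns *ns;

	down_read(&ctrl->namespaces_rwsem);
	list_for_each_entry(ns, &ctrl->namespaces, list)
		blk_quiesce_timeout(ns->queue);
	up_read(&ctrl->namespaces_rwsem);
}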

Thanks,
Ming

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V2 5/5] nvme: pci: simplify timeout handling
  2018-05-02  4:54       ` Ming Lei
@ 2018-05-02  5:12         ` jianchao.wang
  -1 siblings, 0 replies; 26+ messages in thread
From: jianchao.wang @ 2018-05-02  5:12 UTC (permalink / raw)
  To: Ming Lei
  Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, linux-block,
	Christoph Hellwig

Hi Ming

On 05/02/2018 12:54 PM, Ming Lei wrote:
>> We need to return BLK_EH_RESET_TIMER from nvme_timeout in that case:
>> 1. It defers the completion: we can't unmap the I/O request before the controller is shut down completely, so not BLK_EH_HANDLED.
>> 2. nvme_cancel_request can still complete it: blk_mq_complete_request is invoked by nvme_cancel_request, so not BLK_EH_NOT_HANDLED.
> We don't need to change the return value of .timeout() any more after
> calling nvme_quiesce_timeout():
> 
> Thanks to the single EH kthread, once nvme_quiesce_timeout() returns, all
> timed-out requests have been handled already. Some of them may be completed,
> and others may be handled as RESET_TIMER, but none of them are handled as
> NOT_HANDLED because nvme_timeout() won't return that value.
> 
> So the following blk_mq_tagset_busy_iter(nvme_cancel_request) can handle
> all in-flight requests, because timeout handling is drained and quiesced.

The key point here is that we cannot unmap the I/O requests before the controller is closed.
The NVMe controller may still own the command after we complete and unmap the I/O request in nvme_timeout,
which would cause memory corruption.

So we cannot just schedule the EH recovery kthread and then return BLK_EH_HANDLED.
We need to defer the completion of timed-out requests until the controller has been shut down completely in nvme_dev_disable.
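
For context, the unmap in question happens in the driver's normal completion path, roughly as below (paraphrased from the existing PCIe driver; details vary by kernel version):

static void nvme_pci_complete_rq(struct request *req)
{
	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);

	/* DMA-unmaps the data buffers -- unsafe while the controller can still write to them */
	nvme_unmap_data(iod->nvmeq->dev, req);
	nvme_complete_rq(req);
}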

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V2 1/5] block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout()
  2018-05-02  3:33       ` Ming Lei
@ 2018-05-02  5:16         ` jianchao.wang
  -1 siblings, 0 replies; 26+ messages in thread
From: jianchao.wang @ 2018-05-02  5:16 UTC (permalink / raw)
  To: Ming Lei
  Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, linux-block,
	Bart Van Assche, Christoph Hellwig

Hi Ming

On 05/02/2018 11:33 AM, Ming Lei wrote:
> No, there isn't such a race: the mod_timer() doesn't make a difference,
> because 'q->timeout_off' will be visible to any new run of the work func
> after cancel_work_sync() returns. So even if the timer has expired, the
> work func still returns immediately.

Yes, you are right.
Even if the timer gets re-armed, the timeout work will still return immediately.

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V2 5/5] nvme: pci: simplify timeout handling
  2018-05-02  5:12         ` jianchao.wang
@ 2018-05-02  7:27           ` Ming Lei
  -1 siblings, 0 replies; 26+ messages in thread
From: Ming Lei @ 2018-05-02  7:27 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Jens Axboe, linux-block, Sagi Grimberg, linux-nvme, Keith Busch,
	Christoph Hellwig

On Wed, May 02, 2018 at 01:12:57PM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 05/02/2018 12:54 PM, Ming Lei wrote:
> >> We need to return BLK_EH_RESET_TIMER from nvme_timeout in that case:
> >> 1. It defers the completion: we can't unmap the I/O request before the controller is shut down completely, so not BLK_EH_HANDLED.
> >> 2. nvme_cancel_request can still complete it: blk_mq_complete_request is invoked by nvme_cancel_request, so not BLK_EH_NOT_HANDLED.
> > We don't need to change the return value of .timeout() any more after
> > calling nvme_quiesce_timeout():
> > 
> > Thanks to the single EH kthread, once nvme_quiesce_timeout() returns, all
> > timed-out requests have been handled already. Some of them may be completed,
> > and others may be handled as RESET_TIMER, but none of them are handled as
> > NOT_HANDLED because nvme_timeout() won't return that value.
> > 
> > So the following blk_mq_tagset_busy_iter(nvme_cancel_request) can handle
> > all in-flight requests, because timeout handling is drained and quiesced.
> 
> The key point here is that we cannot unmap the I/O requests before the controller is closed.
> The NVMe controller may still own the command after we complete and unmap the I/O request in nvme_timeout,
> which would cause memory corruption.
> 
> So we cannot just schedule the EH recovery kthread and then return BLK_EH_HANDLED.
> We need to defer the completion of timed-out requests until the controller has been shut down completely in nvme_dev_disable.
> 

Good catch!

I am going to document this point since it is easy to overlook.

Yeah, it looks simpler to return BLK_EH_RESET_TIMER for NVMe to
handle this issue, but it isn't very readable: we have already stopped
the timeout, yet BLK_EH_RESET_TIMER still has to be returned to ask the
block layer not to complete this request.

So IMO it may be better to return BLK_EH_NOT_HANDLED, but then it becomes
a bit tricky to handle this request in the EH thread; maybe an NVMe request
flag is needed for completing the request in nvme_cancel_request().
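
A possible shape for that, purely as a sketch (the flag name is invented here, and this glosses over the block-layer bookkeeping called tricky above):

	/* in nvme_timeout(): hand the request over to the EH thread */
	nvme_req(req)->flags |= NVME_REQ_EH_PENDING;	/* hypothetical flag */
	nvme_eh_schedule(dev);
	return BLK_EH_NOT_HANDLED;

	/* in nvme_cancel_request(), run after nvme_dev_disable() has finished */
	if (nvme_req(req)->flags & NVME_REQ_EH_PENDING) {
		nvme_req(req)->status = NVME_SC_ABORT_REQ;
		blk_mq_complete_request(req);
	}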

Thanks
Ming

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2018-05-02  7:27 UTC | newest]

Thread overview: 26+ messages
2018-04-29 15:41 [PATCH V2 0/5] nvme: pci: fix & improve timeout handling Ming Lei
2018-04-29 15:41 ` [PATCH V2 1/5] block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout() Ming Lei
2018-05-02  2:23   ` jianchao.wang
2018-05-02  3:33     ` Ming Lei
2018-05-02  5:16       ` jianchao.wang
2018-04-29 15:41 ` [PATCH V2 2/5] nvme: pci: cover timeout for admin commands running in EH Ming Lei
2018-04-29 15:41 ` [PATCH V2 3/5] nvme: pci: only wait freezing if queue is frozen Ming Lei
2018-04-29 15:41 ` [PATCH V2 4/5] nvme: fix race between freeze queues and unfreeze queues Ming Lei
2018-04-29 15:41 ` [PATCH V2 5/5] nvme: pci: simplify timeout handling Ming Lei
2018-05-02  2:23   ` jianchao.wang
2018-05-02  4:54     ` Ming Lei
2018-05-02  5:12       ` jianchao.wang
2018-05-02  7:27         ` Ming Lei
