* [PATCH V6 00/11] nvme: pci: fix & improve timeout handling
@ 2018-05-16  4:03 ` Ming Lei
  0 siblings, 0 replies; 60+ messages in thread
From: Ming Lei @ 2018-05-16  4:03 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, James Smart, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

Hi,

The 1st patch introduces blk_quiesce_timeout() and blk_unquiesce_timeout()
for NVMe, and meanwhile fixes blk_sync_queue().

The 2nd and 3rd patches fix the race between nvme_dev_disable() and
controller reset, and avoid double IRQ freeing and I/O hangs after
queues are killed.

The 4th patch covers the timeout of admin commands used for recovering
the controller, to avoid a possible deadlock.

The 5th and 6th patches avoid calling wait_freeze on queues which aren't frozen.

The last 5 patches fix several races wrt. the NVMe timeout handler, and
make the NVMe PCI timeout mechanism much more robust than before.

With this patchset, block/011 passes.

block/019 was also run, and it still passes.

Please review, ack and test!

gitweb:
	https://github.com/ming1/linux/commits/v4.17-rc-nvme-timeout.V6

V6:
	- fix the EH seq number so that the correct EH name is shown in logs
	- avoid a NULL pointer dereference of the admin queue
	- avoid a request leak in nvme_set_host_mem_timeout
	- cover races between nvme_dev_disable() and reset wrt. cq_vector
	- treat EH as done when its state is updated to ADMIN_ONLY

V5:
	- avoid removing the controller in case of reset failure in inner EHs
	- make sure that nvme_unfreeze and nvme_start_freeze are paired

V4:
	- fix nvme_init_set_host_mem_cmd()
	- use nested EH model, and run both nvme_dev_disable() and
	resetting in the same context

V3:
	- fix a new race related to freezing in patch 4; nvme_reset_work()
	may hang forever without this patch
	- rewrite the last 3 patches, and avoid breaking nvme_reset_ctrl*()

V2:
	- fix draining timeout work, so no need to change return value from
	.timeout()
	- fix race between nvme_start_freeze() and nvme_unfreeze()
	- cover timeout for admin commands running in EH

Ming Lei (10):
  block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout()
  nvme: pci: cover timeout for admin commands running in EH
  nvme: pci: unquiesce admin queue after controller is shutdown
  nvme: pci: only wait freezing if queue is frozen
  nvme: pci: freeze queue in nvme_dev_disable() in case of error
    recovery
  nvme: pci: prepare for supporting error recovery from resetting
    context
  nvme: pci: move error handling out of nvme_reset_dev()
  nvme: pci: don't unfreeze queue until controller state updating
    succeeds
  nvme: core: introduce nvme_force_change_ctrl_state()
  nvme: pci: support nested EH

jianchao.wang (1):
  nvme: pci: set nvmeq->cq_vector after alloc cq/sq

 block/blk-core.c         |  21 ++-
 block/blk-mq.c           |   9 +
 block/blk-timeout.c      |   5 +-
 drivers/nvme/host/core.c |  57 ++++++
 drivers/nvme/host/nvme.h |   7 +
 drivers/nvme/host/pci.c  | 450 ++++++++++++++++++++++++++++++++++++++++-------
 include/linux/blkdev.h   |  13 ++
 7 files changed, 499 insertions(+), 63 deletions(-)

Cc: James Smart <james.smart@broadcom.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
-- 
2.9.5

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH V6 01/11] block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout()
  2018-05-16  4:03 ` Ming Lei
@ 2018-05-16  4:03   ` Ming Lei
  -1 siblings, 0 replies; 60+ messages in thread
From: Ming Lei @ 2018-05-16  4:03 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, James Smart, Bart Van Assche,
	Jianchao Wang, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	Laurence Oberman

It turns out that the current approach can't drain timeout handling
completely, because mod_timer() can be triggered from the timeout work
function itself, which may run right after the following synced teardown:

        del_timer_sync(&q->timeout);
        cancel_work_sync(&q->timeout_work);

This patch introduces a 'timeout_off' flag to fix the issue; this
simple approach turns out to work.

blk_quiesce_timeout() and blk_unquiesce_timeout() are also introduced
for draining timeout handling, which is needed by NVMe.

Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c       | 21 +++++++++++++++++++--
 block/blk-mq.c         |  9 +++++++++
 block/blk-timeout.c    |  5 ++++-
 include/linux/blkdev.h | 13 +++++++++++++
 4 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 85909b431eb0..c277f1023703 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -392,6 +392,22 @@ void blk_stop_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_stop_queue);
 
+void blk_unquiesce_timeout(struct request_queue *q)
+{
+	blk_mark_timeout_quiesce(q, false);
+	mod_timer(&q->timeout, jiffies + q->rq_timeout);
+}
+EXPORT_SYMBOL(blk_unquiesce_timeout);
+
+void blk_quiesce_timeout(struct request_queue *q)
+{
+	blk_mark_timeout_quiesce(q, true);
+
+	del_timer_sync(&q->timeout);
+	cancel_work_sync(&q->timeout_work);
+}
+EXPORT_SYMBOL(blk_quiesce_timeout);
+
 /**
  * blk_sync_queue - cancel any pending callbacks on a queue
  * @q: the queue
@@ -412,8 +428,7 @@ EXPORT_SYMBOL(blk_stop_queue);
  */
 void blk_sync_queue(struct request_queue *q)
 {
-	del_timer_sync(&q->timeout);
-	cancel_work_sync(&q->timeout_work);
+	blk_quiesce_timeout(q);
 
 	if (q->mq_ops) {
 		struct blk_mq_hw_ctx *hctx;
@@ -425,6 +440,8 @@ void blk_sync_queue(struct request_queue *q)
 	} else {
 		cancel_delayed_work_sync(&q->delay_work);
 	}
+
+	blk_mark_timeout_quiesce(q, false);
 }
 EXPORT_SYMBOL(blk_sync_queue);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9ce9cac16c3f..173f1723e48f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -917,6 +917,15 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	};
 	struct blk_mq_hw_ctx *hctx;
 	int i;
+	bool timeout_off;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	timeout_off = q->timeout_off;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	if (timeout_off)
+		return;
 
 	/* A deadlock might occur if a request is stuck requiring a
 	 * timeout at the same time a queue freeze is waiting
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 652d4d4d3e97..ffd0b609091e 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -136,12 +136,15 @@ void blk_timeout_work(struct work_struct *work)
 
 	spin_lock_irqsave(q->queue_lock, flags);
 
+	if (q->timeout_off)
+		goto exit;
+
 	list_for_each_entry_safe(rq, tmp, &q->timeout_list, timeout_list)
 		blk_rq_check_expired(rq, &next, &next_set);
 
 	if (next_set)
 		mod_timer(&q->timeout, round_jiffies_up(next));
-
+exit:
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5c4eee043191..a2cc4aaecf50 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -584,6 +584,7 @@ struct request_queue {
 	struct timer_list	timeout;
 	struct work_struct	timeout_work;
 	struct list_head	timeout_list;
+	bool			timeout_off;
 
 	struct list_head	icq_list;
 #ifdef CONFIG_BLK_CGROUP
@@ -1017,6 +1018,18 @@ extern void blk_execute_rq(struct request_queue *, struct gendisk *,
 extern void blk_execute_rq_nowait(struct request_queue *, struct gendisk *,
 				  struct request *, int, rq_end_io_fn *);
 
+static inline void blk_mark_timeout_quiesce(struct request_queue *q, bool quiesce)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	q->timeout_off = quiesce;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+extern void blk_quiesce_timeout(struct request_queue *q);
+extern void blk_unquiesce_timeout(struct request_queue *q);
+
 int blk_status_to_errno(blk_status_t status);
 blk_status_t errno_to_blk_status(int errno);
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH V6 01/11] block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout()
@ 2018-05-16  4:03   ` Ming Lei
  0 siblings, 0 replies; 60+ messages in thread
From: Ming Lei @ 2018-05-16  4:03 UTC (permalink / raw)


Turns out the current way can't drain timout completely because mod_timer()
can be triggered in the work func, which can be just run inside the synced
timeout work:

        del_timer_sync(&q->timeout);
        cancel_work_sync(&q->timeout_work);

This patch introduces one flag of 'timeout_off' for fixing this issue, turns
out this simple way does work.

Also blk_quiesce_timeout() and blk_unquiesce_timeout() are introduced for
draining timeout, which is needed by NVMe.

Cc: James Smart <james.smart at broadcom.com>
Cc: Bart Van Assche <bart.vanassche at wdc.com>
Cc: Jianchao Wang <jianchao.w.wang at oracle.com>
Cc: Christoph Hellwig <hch at lst.de>
Cc: Sagi Grimberg <sagi at grimberg.me>
Cc: linux-nvme at lists.infradead.org
Cc: Laurence Oberman <loberman at redhat.com>
Signed-off-by: Ming Lei <ming.lei at redhat.com>
---
 block/blk-core.c       | 21 +++++++++++++++++++--
 block/blk-mq.c         |  9 +++++++++
 block/blk-timeout.c    |  5 ++++-
 include/linux/blkdev.h | 13 +++++++++++++
 4 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 85909b431eb0..c277f1023703 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -392,6 +392,22 @@ void blk_stop_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_stop_queue);
 
+void blk_unquiesce_timeout(struct request_queue *q)
+{
+	blk_mark_timeout_quiesce(q, false);
+	mod_timer(&q->timeout, jiffies + q->rq_timeout);
+}
+EXPORT_SYMBOL(blk_unquiesce_timeout);
+
+void blk_quiesce_timeout(struct request_queue *q)
+{
+	blk_mark_timeout_quiesce(q, true);
+
+	del_timer_sync(&q->timeout);
+	cancel_work_sync(&q->timeout_work);
+}
+EXPORT_SYMBOL(blk_quiesce_timeout);
+
 /**
  * blk_sync_queue - cancel any pending callbacks on a queue
  * @q: the queue
@@ -412,8 +428,7 @@ EXPORT_SYMBOL(blk_stop_queue);
  */
 void blk_sync_queue(struct request_queue *q)
 {
-	del_timer_sync(&q->timeout);
-	cancel_work_sync(&q->timeout_work);
+	blk_quiesce_timeout(q);
 
 	if (q->mq_ops) {
 		struct blk_mq_hw_ctx *hctx;
@@ -425,6 +440,8 @@ void blk_sync_queue(struct request_queue *q)
 	} else {
 		cancel_delayed_work_sync(&q->delay_work);
 	}
+
+	blk_mark_timeout_quiesce(q, false);
 }
 EXPORT_SYMBOL(blk_sync_queue);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9ce9cac16c3f..173f1723e48f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -917,6 +917,15 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	};
 	struct blk_mq_hw_ctx *hctx;
 	int i;
+	bool timeout_off;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	timeout_off = q->timeout_off;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	if (timeout_off)
+		return;
 
 	/* A deadlock might occur if a request is stuck requiring a
 	 * timeout at the same time a queue freeze is waiting
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 652d4d4d3e97..ffd0b609091e 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -136,12 +136,15 @@ void blk_timeout_work(struct work_struct *work)
 
 	spin_lock_irqsave(q->queue_lock, flags);
 
+	if (q->timeout_off)
+		goto exit;
+
 	list_for_each_entry_safe(rq, tmp, &q->timeout_list, timeout_list)
 		blk_rq_check_expired(rq, &next, &next_set);
 
 	if (next_set)
 		mod_timer(&q->timeout, round_jiffies_up(next));
-
+exit:
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5c4eee043191..a2cc4aaecf50 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -584,6 +584,7 @@ struct request_queue {
 	struct timer_list	timeout;
 	struct work_struct	timeout_work;
 	struct list_head	timeout_list;
+	bool			timeout_off;
 
 	struct list_head	icq_list;
 #ifdef CONFIG_BLK_CGROUP
@@ -1017,6 +1018,18 @@ extern void blk_execute_rq(struct request_queue *, struct gendisk *,
 extern void blk_execute_rq_nowait(struct request_queue *, struct gendisk *,
 				  struct request *, int, rq_end_io_fn *);
 
+static inline void blk_mark_timeout_quiesce(struct request_queue *q, bool quiesce)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	q->timeout_off = quiesce;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+extern void blk_quiesce_timeout(struct request_queue *q);
+extern void blk_unquiesce_timeout(struct request_queue *q);
+
 int blk_status_to_errno(blk_status_t status);
 blk_status_t errno_to_blk_status(int errno);
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH V6 02/11] nvme: pci: cover timeout for admin commands running in EH
  2018-05-16  4:03 ` Ming Lei
@ 2018-05-16  4:03   ` Ming Lei
  -1 siblings, 0 replies; 60+ messages in thread
From: Ming Lei @ 2018-05-16  4:03 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, James Smart, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

When admin commands are used in EH for recovering the controller, we have
to handle their timeout ourselves and can't depend on the block layer's
timeout handling, since a deadlock may be caused when these commands are
timed out by the block layer again.

Cc: James Smart <james.smart@broadcom.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 81 ++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 70 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 17a0190bd88f..b60f727b844c 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1742,21 +1742,28 @@ static inline void nvme_release_cmb(struct nvme_dev *dev)
 	}
 }
 
-static int nvme_set_host_mem(struct nvme_dev *dev, u32 bits)
+static void nvme_init_set_host_mem_cmd(struct nvme_dev *dev,
+		struct nvme_command *c, u32 bits)
 {
 	u64 dma_addr = dev->host_mem_descs_dma;
+
+	memset(c, 0, sizeof(*c));
+	c->features.opcode	= nvme_admin_set_features;
+	c->features.fid		= cpu_to_le32(NVME_FEAT_HOST_MEM_BUF);
+	c->features.dword11	= cpu_to_le32(bits);
+	c->features.dword12	= cpu_to_le32(dev->host_mem_size >>
+					      ilog2(dev->ctrl.page_size));
+	c->features.dword13	= cpu_to_le32(lower_32_bits(dma_addr));
+	c->features.dword14	= cpu_to_le32(upper_32_bits(dma_addr));
+	c->features.dword15	= cpu_to_le32(dev->nr_host_mem_descs);
+}
+
+static int nvme_set_host_mem(struct nvme_dev *dev, u32 bits)
+{
 	struct nvme_command c;
 	int ret;
 
-	memset(&c, 0, sizeof(c));
-	c.features.opcode	= nvme_admin_set_features;
-	c.features.fid		= cpu_to_le32(NVME_FEAT_HOST_MEM_BUF);
-	c.features.dword11	= cpu_to_le32(bits);
-	c.features.dword12	= cpu_to_le32(dev->host_mem_size >>
-					      ilog2(dev->ctrl.page_size));
-	c.features.dword13	= cpu_to_le32(lower_32_bits(dma_addr));
-	c.features.dword14	= cpu_to_le32(upper_32_bits(dma_addr));
-	c.features.dword15	= cpu_to_le32(dev->nr_host_mem_descs);
+	nvme_init_set_host_mem_cmd(dev, &c, bits);
 
 	ret = nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
 	if (ret) {
@@ -1767,6 +1774,58 @@ static int nvme_set_host_mem(struct nvme_dev *dev, u32 bits)
 	return ret;
 }
 
+static void nvme_set_host_mem_end_io(struct request *rq, blk_status_t sts)
+{
+	struct completion *waiting = rq->end_io_data;
+
+	rq->end_io_data = NULL;
+	blk_mq_free_request(rq);
+
+	/*
+	 * complete last, if this is a stack request the process (and thus
+	 * the rq pointer) could be invalid right after this complete()
+	 */
+	complete(waiting);
+}
+
+/*
+ * This function may only be used inside nvme_dev_disable(), where timeout
+ * handling may not work, so it has to cover the timeout by itself.
+ *
+ * When wait_for_completion_io_timeout() returns 0 (a timeout happened),
+ * the request will be completed after the controller is shut down.
+ */
+static int nvme_set_host_mem_timeout(struct nvme_dev *dev, u32 bits)
+{
+	DECLARE_COMPLETION_ONSTACK(wait);
+	struct nvme_command c;
+	struct request_queue *q = dev->ctrl.admin_q;
+	struct request *req;
+	int ret;
+
+	nvme_init_set_host_mem_cmd(dev, &c, bits);
+
+	req = nvme_alloc_request(q, &c, 0, NVME_QID_ANY);
+	if (IS_ERR(req))
+		return PTR_ERR(req);
+
+	req->timeout = ADMIN_TIMEOUT;
+	req->end_io_data = &wait;
+
+	blk_execute_rq_nowait(q, NULL, req, false,
+			nvme_set_host_mem_end_io);
+	ret = wait_for_completion_io_timeout(&wait, ADMIN_TIMEOUT);
+	if (ret > 0) {
+		if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
+			ret = -EINTR;
+		else
+			ret = nvme_req(req)->status;
+	} else
+		ret = -EINTR;
+
+	return ret;
+}
+
 static void nvme_free_host_mem(struct nvme_dev *dev)
 {
 	int i;
@@ -2225,7 +2284,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 		 * but I'd rather be safe than sorry..
 		 */
 		if (dev->host_mem_descs)
-			nvme_set_host_mem(dev, 0);
+			nvme_set_host_mem_timeout(dev, 0);
 		nvme_disable_io_queues(dev);
 		nvme_disable_admin_queue(dev, shutdown);
 	}
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH V6 03/11] nvme: pci: unquiesce admin queue after controller is shutdown
  2018-05-16  4:03 ` Ming Lei
@ 2018-05-16  4:03   ` Ming Lei
  -1 siblings, 0 replies; 60+ messages in thread
From: Ming Lei @ 2018-05-16  4:03 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, James Smart, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

Given that a timeout event can come during reset, nvme_dev_disable()
shouldn't keep the admin queue quiesced after the controller is shut
down. Otherwise it may block admin I/O during reset, and cause the
reset to hang forever.

This patch fixes the issue by unquiescing the admin queue at the end
of nvme_dev_disable().

Cc: James Smart <james.smart@broadcom.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Reported-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b60f727b844c..9e28d7118232 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1557,8 +1557,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
 			dev->ctrl.admin_q = NULL;
 			return -ENODEV;
 		}
-	} else
-		blk_mq_unquiesce_queue(dev->ctrl.admin_q);
+	}
 
 	return 0;
 }
@@ -2303,6 +2302,17 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	 */
 	if (shutdown)
 		nvme_start_queues(&dev->ctrl);
+
+	/*
+	 * Avoid a stuck reset: a timeout may happen during reset, and
+	 * the reset may hang forever if the admin queue is kept quiesced.
+	 *
+	 * During reset, if the admin queue isn't ready, commands fail
+	 * immediately, which means we don't need to quiesce the admin
+	 * queue.
+	 */
+	if (dev->ctrl.admin_q)
+		blk_mq_unquiesce_queue(dev->ctrl.admin_q);
 	mutex_unlock(&dev->shutdown_lock);
 }
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH V6 04/11] nvme: pci: set nvmeq->cq_vector after alloc cq/sq
  2018-05-16  4:03 ` Ming Lei
@ 2018-05-16  4:03   ` Ming Lei
  -1 siblings, 0 replies; 60+ messages in thread
From: Ming Lei @ 2018-05-16  4:03 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, jianchao.wang, James Smart,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman,
	Ming Lei

From: "jianchao.wang" <jianchao.w.wang@oracle.com>

Currently nvmeq->cq_vector is set before allocating the cq/sq. If the
cq/sq allocation command times out, nvme_suspend_queue() will invoke
free_irq() for the nvmeq because the cq_vector looks valid, which
causes the warning 'Trying to free already-free IRQ xxx'. Set
nvmeq->cq_vector only after the cq/sq allocation succeeds to fix this.

Cc: James Smart <james.smart@broadcom.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/pci.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 9e28d7118232..e94a103ead1e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1070,7 +1070,7 @@ static int adapter_delete_queue(struct nvme_dev *dev, u8 opcode, u16 id)
 }
 
 static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
-						struct nvme_queue *nvmeq)
+						struct nvme_queue *nvmeq, int cq_vector)
 {
 	struct nvme_command c;
 	int flags = NVME_QUEUE_PHYS_CONTIG | NVME_CQ_IRQ_ENABLED;
@@ -1085,7 +1085,7 @@ static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
 	c.create_cq.cqid = cpu_to_le16(qid);
 	c.create_cq.qsize = cpu_to_le16(nvmeq->q_depth - 1);
 	c.create_cq.cq_flags = cpu_to_le16(flags);
-	c.create_cq.irq_vector = cpu_to_le16(nvmeq->cq_vector);
+	c.create_cq.irq_vector = cpu_to_le16(cq_vector);
 
 	return nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
 }
@@ -1459,6 +1459,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 {
 	struct nvme_dev *dev = nvmeq->dev;
 	int result;
+	int cq_vector;
 
 	if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
 		unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
@@ -1471,15 +1472,21 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 	 * A queue's vector matches the queue identifier unless the controller
 	 * has only one vector available.
 	 */
-	nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
-	result = adapter_alloc_cq(dev, qid, nvmeq);
+	cq_vector = dev->num_vecs == 1 ? 0 : qid;
+	result = adapter_alloc_cq(dev, qid, nvmeq, cq_vector);
 	if (result < 0)
-		goto release_vector;
+		goto out;
 
 	result = adapter_alloc_sq(dev, qid, nvmeq);
 	if (result < 0)
 		goto release_cq;
 
+	/*
+	 * Set cq_vector only after the cq/sq are allocated; otherwise, if the
+	 * alloc command times out, nvme_suspend_queue will invoke free_irq for
+	 * it and trigger the warning 'Trying to free already-free IRQ xxx'.
+	 */
+	nvmeq->cq_vector = cq_vector;
 	nvme_init_queue(nvmeq, qid);
 	result = queue_request_irq(nvmeq);
 	if (result < 0)
@@ -1487,13 +1494,13 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 
 	return result;
 
- release_sq:
+release_sq:
+	nvmeq->cq_vector = -1;
 	dev->online_queues--;
 	adapter_delete_sq(dev, qid);
- release_cq:
+release_cq:
 	adapter_delete_cq(dev, qid);
- release_vector:
-	nvmeq->cq_vector = -1;
+out:
 	return result;
 }
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH V6 05/11] nvme: pci: only wait freezing if queue is frozen
  2018-05-16  4:03 ` Ming Lei
@ 2018-05-16  4:03   ` Ming Lei
  -1 siblings, 0 replies; 60+ messages in thread
From: Ming Lei @ 2018-05-16  4:03 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, James Smart, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

When nvme_dev_disable() is called while shutting down the controller,
nvme_wait_freeze_timeout() may be run on a controller that hasn't
been frozen yet, so add a check to avoid that case.

Cc: James Smart <james.smart@broadcom.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e94a103ead1e..6413dad51107 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2259,14 +2259,17 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	int i;
 	bool dead = true;
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
+	bool frozen = false;
 
 	mutex_lock(&dev->shutdown_lock);
 	if (pci_is_enabled(pdev)) {
 		u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
 		if (dev->ctrl.state == NVME_CTRL_LIVE ||
-		    dev->ctrl.state == NVME_CTRL_RESETTING)
+		    dev->ctrl.state == NVME_CTRL_RESETTING) {
 			nvme_start_freeze(&dev->ctrl);
+			frozen = true;
+		}
 		dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
 			pdev->error_state  != pci_channel_io_normal);
 	}
@@ -2276,7 +2279,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	 * doing a safe shutdown.
 	 */
 	if (!dead) {
-		if (shutdown)
+		if (shutdown && frozen)
 			nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
 	}
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH V6 06/11] nvme: pci: freeze queue in nvme_dev_disable() in case of error recovery
  2018-05-16  4:03 ` Ming Lei
@ 2018-05-16  4:03   ` Ming Lei
  -1 siblings, 0 replies; 60+ messages in thread
From: Ming Lei @ 2018-05-16  4:03 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, James Smart, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

When nvme_dev_disable() is used for error recovery, we should always
freeze queues before shutting down the controller:

- the reset handler assumes queues are frozen, and will wait_freeze &
unfreeze them explicitly; if queues aren't frozen during
nvme_dev_disable(), the reset handler may wait forever even though no
requests are allocated.

- freezing first also avoids cancelling lots of requests during error
recovery.

This patch introduces the 'freeze_queue' parameter to fix this
issue.

Cc: James Smart <james.smart@broadcom.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 47 ++++++++++++++++++++++++++++++++---------------
 1 file changed, 32 insertions(+), 15 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6413dad51107..365d1a5ee1eb 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -69,7 +69,8 @@ struct nvme_dev;
 struct nvme_queue;
 
 static void nvme_process_cq(struct nvme_queue *nvmeq);
-static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
+static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
+		freeze_queue);
 
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
@@ -1206,7 +1207,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	 */
 	if (nvme_should_reset(dev, csts)) {
 		nvme_warn_reset(dev, csts);
-		nvme_dev_disable(dev, false);
+		nvme_dev_disable(dev, false, true);
 		nvme_reset_ctrl(&dev->ctrl);
 		return BLK_EH_HANDLED;
 	}
@@ -1233,7 +1234,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		dev_warn(dev->ctrl.device,
 			 "I/O %d QID %d timeout, disable controller\n",
 			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false);
+		nvme_dev_disable(dev, false, false);
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
 		return BLK_EH_HANDLED;
 	default:
@@ -1249,7 +1250,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		dev_warn(dev->ctrl.device,
 			 "I/O %d QID %d timeout, reset controller\n",
 			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false);
+		nvme_dev_disable(dev, false, true);
 		nvme_reset_ctrl(&dev->ctrl);
 
 		/*
@@ -2254,19 +2255,35 @@ static void nvme_pci_disable(struct nvme_dev *dev)
 	}
 }
 
-static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
+/*
+ * Resetting often follows nvme_dev_disable(), so queues need to be frozen
+ * before resetting.
+ */
+static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
+		freeze_queue)
 {
 	int i;
 	bool dead = true;
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 	bool frozen = false;
 
+	/*
+	 * 'freeze_queue' is only honoured in the non-shutdown case; for
+	 * shutdown we do an inline freeze & wait_freeze_timeout just to
+	 * complete as many requests as possible before shutting down.
+	 */
+	if (shutdown)
+		freeze_queue = false;
+
+	if (freeze_queue)
+		nvme_start_freeze(&dev->ctrl);
+
 	mutex_lock(&dev->shutdown_lock);
 	if (pci_is_enabled(pdev)) {
 		u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
-		if (dev->ctrl.state == NVME_CTRL_LIVE ||
-		    dev->ctrl.state == NVME_CTRL_RESETTING) {
+		if (shutdown && (dev->ctrl.state == NVME_CTRL_LIVE ||
+		    dev->ctrl.state == NVME_CTRL_RESETTING)) {
 			nvme_start_freeze(&dev->ctrl);
 			frozen = true;
 		}
@@ -2369,7 +2386,7 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
 	dev_warn(dev->ctrl.device, "Removing after probe failure status: %d\n", status);
 
 	nvme_get_ctrl(&dev->ctrl);
-	nvme_dev_disable(dev, false);
+	nvme_dev_disable(dev, false, false);
 	if (!queue_work(nvme_wq, &dev->remove_work))
 		nvme_put_ctrl(&dev->ctrl);
 }
@@ -2390,7 +2407,7 @@ static void nvme_reset_work(struct work_struct *work)
 	 * moving on.
 	 */
 	if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
-		nvme_dev_disable(dev, false);
+		nvme_dev_disable(dev, false, false);
 
 	/*
 	 * Introduce CONNECTING state from nvme-fc/rdma transports to mark the
@@ -2639,7 +2656,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 static void nvme_reset_prepare(struct pci_dev *pdev)
 {
 	struct nvme_dev *dev = pci_get_drvdata(pdev);
-	nvme_dev_disable(dev, false);
+	nvme_dev_disable(dev, false, true);
 }
 
 static void nvme_reset_done(struct pci_dev *pdev)
@@ -2651,7 +2668,7 @@ static void nvme_reset_done(struct pci_dev *pdev)
 static void nvme_shutdown(struct pci_dev *pdev)
 {
 	struct nvme_dev *dev = pci_get_drvdata(pdev);
-	nvme_dev_disable(dev, true);
+	nvme_dev_disable(dev, true, false);
 }
 
 /*
@@ -2670,13 +2687,13 @@ static void nvme_remove(struct pci_dev *pdev)
 
 	if (!pci_device_is_present(pdev)) {
 		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
-		nvme_dev_disable(dev, false);
+		nvme_dev_disable(dev, false, false);
 	}
 
 	flush_work(&dev->ctrl.reset_work);
 	nvme_stop_ctrl(&dev->ctrl);
 	nvme_remove_namespaces(&dev->ctrl);
-	nvme_dev_disable(dev, true);
+	nvme_dev_disable(dev, true, false);
 	nvme_free_host_mem(dev);
 	nvme_dev_remove_admin(dev);
 	nvme_free_queues(dev, 0);
@@ -2710,7 +2727,7 @@ static int nvme_suspend(struct device *dev)
 	struct pci_dev *pdev = to_pci_dev(dev);
 	struct nvme_dev *ndev = pci_get_drvdata(pdev);
 
-	nvme_dev_disable(ndev, true);
+	nvme_dev_disable(ndev, true, false);
 	return 0;
 }
 
@@ -2742,7 +2759,7 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
 	case pci_channel_io_frozen:
 		dev_warn(dev->ctrl.device,
 			"frozen state error detected, reset controller\n");
-		nvme_dev_disable(dev, false);
+		nvme_dev_disable(dev, false, true);
 		return PCI_ERS_RESULT_NEED_RESET;
 	case pci_channel_io_perm_failure:
 		dev_warn(dev->ctrl.device,
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH V6 07/11] nvme: pci: prepare for supporting error recovery from resetting context
  2018-05-16  4:03 ` Ming Lei
@ 2018-05-16  4:03   ` Ming Lei
  -1 siblings, 0 replies; 60+ messages in thread
From: Ming Lei @ 2018-05-16  4:03 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, James Smart, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

Either admin or normal I/O in the reset context may time out because
a controller error happens. When this timeout occurs, we may have to
start controller recovery again.

This patch introduces 'reset_lock' and holds it while running the
reset, so that we may support nested reset in the following patches.

Cc: James Smart <james.smart@broadcom.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/core.c |  2 ++
 drivers/nvme/host/nvme.h |  3 +++
 drivers/nvme/host/pci.c  | 20 +++++++++++++++++---
 3 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 99b857e5a7a9..3b0cf2fd3f53 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3430,6 +3430,8 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	INIT_WORK(&ctrl->fw_act_work, nvme_fw_act_work);
 	INIT_WORK(&ctrl->delete_work, nvme_delete_ctrl_work);
 
+	mutex_init(&ctrl->reset_lock);
+
 	ret = ida_simple_get(&nvme_instance_ida, 0, 0, GFP_KERNEL);
 	if (ret < 0)
 		goto out;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 17d2f7cf3fed..b00a56412bab 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -151,6 +151,9 @@ struct nvme_ctrl {
 	struct device ctrl_device;
 	struct device *device;	/* char device */
 	struct cdev cdev;
+
+	/* sync reset activities */
+	struct mutex reset_lock;
 	struct work_struct reset_work;
 	struct work_struct delete_work;
 
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 365d1a5ee1eb..50bd1818deb5 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2391,14 +2391,14 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
 		nvme_put_ctrl(&dev->ctrl);
 }
 
-static void nvme_reset_work(struct work_struct *work)
+static void nvme_reset_dev(struct nvme_dev *dev)
 {
-	struct nvme_dev *dev =
-		container_of(work, struct nvme_dev, ctrl.reset_work);
 	bool was_suspend = !!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL);
 	int result = -ENODEV;
 	enum nvme_ctrl_state new_state = NVME_CTRL_LIVE;
 
+	lockdep_assert_held(&dev->ctrl.reset_lock);
+
 	if (WARN_ON(dev->ctrl.state != NVME_CTRL_RESETTING))
 		goto out;
 
@@ -2474,7 +2474,11 @@ static void nvme_reset_work(struct work_struct *work)
 		new_state = NVME_CTRL_ADMIN_ONLY;
 	} else {
 		nvme_start_queues(&dev->ctrl);
+		mutex_unlock(&dev->ctrl.reset_lock);
+
 		nvme_wait_freeze(&dev->ctrl);
+
+		mutex_lock(&dev->ctrl.reset_lock);
 		/* hit this only when allocate tagset fails */
 		if (nvme_dev_add(dev))
 			new_state = NVME_CTRL_ADMIN_ONLY;
@@ -2498,6 +2502,16 @@ static void nvme_reset_work(struct work_struct *work)
 	nvme_remove_dead_ctrl(dev, result);
 }
 
+static void nvme_reset_work(struct work_struct *work)
+{
+	struct nvme_dev *dev =
+		container_of(work, struct nvme_dev, ctrl.reset_work);
+
+	mutex_lock(&dev->ctrl.reset_lock);
+	nvme_reset_dev(dev);
+	mutex_unlock(&dev->ctrl.reset_lock);
+}
+
 static void nvme_remove_dead_ctrl_work(struct work_struct *work)
 {
 	struct nvme_dev *dev = container_of(work, struct nvme_dev, remove_work);
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH V6 08/11] nvme: pci: move error handling out of nvme_reset_dev()
  2018-05-16  4:03 ` Ming Lei
@ 2018-05-16  4:03   ` Ming Lei
  -1 siblings, 0 replies; 60+ messages in thread
From: Ming Lei @ 2018-05-16  4:03 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, James Smart, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

Once nested EH is introduced, we may not need to handle errors in the
inner EH, so move error handling out of nvme_reset_dev().

Meanwhile, return the reset result to the caller.

Cc: James Smart <james.smart@broadcom.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 50bd1818deb5..634fd7f17950 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2391,7 +2391,7 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
 		nvme_put_ctrl(&dev->ctrl);
 }
 
-static void nvme_reset_dev(struct nvme_dev *dev)
+static int nvme_reset_dev(struct nvme_dev *dev)
 {
 	bool was_suspend = !!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL);
 	int result = -ENODEV;
@@ -2485,6 +2485,7 @@ static void nvme_reset_dev(struct nvme_dev *dev)
 		nvme_unfreeze(&dev->ctrl);
 	}
 
+	result = -ENODEV;
 	/*
 	 * If only admin queue live, keep it to do further investigation or
 	 * recovery.
@@ -2496,19 +2497,22 @@ static void nvme_reset_dev(struct nvme_dev *dev)
 	}
 
 	nvme_start_ctrl(&dev->ctrl);
-	return;
+	return 0;
 
  out:
-	nvme_remove_dead_ctrl(dev, result);
+	return result;
 }
 
 static void nvme_reset_work(struct work_struct *work)
 {
 	struct nvme_dev *dev =
 		container_of(work, struct nvme_dev, ctrl.reset_work);
+	int result;
 
 	mutex_lock(&dev->ctrl.reset_lock);
-	nvme_reset_dev(dev);
+	result = nvme_reset_dev(dev);
+	if (result)
+		nvme_remove_dead_ctrl(dev, result);
 	mutex_unlock(&dev->ctrl.reset_lock);
 }
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH V6 09/11] nvme: pci: don't unfreeze queue until controller state updating succeeds
  2018-05-16  4:03 ` Ming Lei
@ 2018-05-16  4:03   ` Ming Lei
  -1 siblings, 0 replies; 60+ messages in thread
From: Ming Lei @ 2018-05-16  4:03 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, James Smart, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

If updating the controller state to LIVE or ADMIN_ONLY fails, the
controller will be removed, so there is no need to unfreeze the
queues any more.

This also makes it easier for the following patch to avoid leaking
the freeze counter.
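The resulting ordering can be sketched in plain userspace C (all names here are stand-ins for the kernel functions; this is an illustration of the control flow, not the driver code):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the ordering change in this patch: instead of calling
 * nvme_unfreeze() immediately, remember that it is needed and only
 * unfreeze after the controller state update succeeds.  On failure
 * the controller will be removed by the caller, so the freeze
 * reference must not be dropped here.  Purely illustrative. */
static int unfrozen;

static int reset_dev(bool state_update_ok)
{
	bool unfreeze_queue = true;	/* queues were frozen earlier */

	if (!state_update_ok)
		return -19;		/* -ENODEV: caller removes ctrl */

	/* state is LIVE/ADMIN_ONLY now, safe to unfreeze */
	if (unfreeze_queue)
		unfrozen = 1;		/* stands in for nvme_unfreeze() */
	return 0;
}
```

On the failure path the freeze reference is intentionally kept, which is what lets the following patch balance the freeze counter in the outer EH.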

Cc: James Smart <james.smart@broadcom.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 634fd7f17950..4236d79e3643 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2396,6 +2396,7 @@ static int nvme_reset_dev(struct nvme_dev *dev)
 	bool was_suspend = !!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL);
 	int result = -ENODEV;
 	enum nvme_ctrl_state new_state = NVME_CTRL_LIVE;
+	bool unfreeze_queue = false;
 
 	lockdep_assert_held(&dev->ctrl.reset_lock);
 
@@ -2482,7 +2483,7 @@ static int nvme_reset_dev(struct nvme_dev *dev)
 		/* hit this only when allocate tagset fails */
 		if (nvme_dev_add(dev))
 			new_state = NVME_CTRL_ADMIN_ONLY;
-		nvme_unfreeze(&dev->ctrl);
+		unfreeze_queue = true;
 	}
 
 	result = -ENODEV;
@@ -2497,6 +2498,9 @@ static int nvme_reset_dev(struct nvme_dev *dev)
 	}
 
 	nvme_start_ctrl(&dev->ctrl);
+
+	if (unfreeze_queue)
+		nvme_unfreeze(&dev->ctrl);
 	return 0;
 
  out:
-- 
2.9.5


* [PATCH V6 10/11] nvme: core: introduce nvme_force_change_ctrl_state()
  2018-05-16  4:03 ` Ming Lei
@ 2018-05-16  4:03   ` Ming Lei
  -1 siblings, 0 replies; 60+ messages in thread
From: Ming Lei @ 2018-05-16  4:03 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, James Smart, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

While the controller is being reset, timeouts may still be triggered.
To handle this situation, the controller state has to be changed to
NVME_CTRL_RESETTING first.

So introduce nvme_force_change_ctrl_state() for this purpose.
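The forced variant differs from nvme_change_ctrl_state() in that it always applies the new state and merely warns on unexpected transitions. A minimal userspace sketch of the warn-free transition table (enum names shortened; purely illustrative, not the kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the check in nvme_force_change_ctrl_state():
 * the transition table mirrors the patch below, but the state is
 * updated regardless of the result; only the WARN() is conditional. */
enum ctrl_state {
	CTRL_LIVE,
	CTRL_ADMIN_ONLY,
	CTRL_RESETTING,
	CTRL_CONNECTING,
	CTRL_DELETING,
};

/* Returns true when forcing old_state -> RESETTING is an expected
 * transition (no warning); anything else would trigger the WARN(). */
static bool force_to_resetting_is_quiet(enum ctrl_state old_state)
{
	switch (old_state) {
	case CTRL_LIVE:
	case CTRL_ADMIN_ONLY:
	case CTRL_RESETTING:
	case CTRL_CONNECTING:
		return true;
	default:
		return false;
	}
}
```

This matches the patch's intent: the EH path must be able to enter RESETTING unconditionally, and the warning only flags transitions the author did not anticipate.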

Cc: James Smart <james.smart@broadcom.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/core.c | 33 +++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h |  2 ++
 2 files changed, 35 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 3b0cf2fd3f53..9e51c3e1f534 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -256,6 +256,39 @@ void nvme_cancel_request(struct request *req, void *data, bool reserved)
 }
 EXPORT_SYMBOL_GPL(nvme_cancel_request);
 
+/* should only be used by EH for handling error during reset */
+void nvme_force_change_ctrl_state(struct nvme_ctrl *ctrl,
+		enum nvme_ctrl_state new_state)
+{
+	enum nvme_ctrl_state old_state;
+	unsigned long flags;
+	bool warn = true;
+
+	spin_lock_irqsave(&ctrl->lock, flags);
+	old_state = ctrl->state;
+	switch (new_state) {
+	case NVME_CTRL_RESETTING:
+		switch (old_state) {
+		case NVME_CTRL_LIVE:
+		case NVME_CTRL_ADMIN_ONLY:
+		case NVME_CTRL_RESETTING:
+		case NVME_CTRL_CONNECTING:
+			warn = false;
+			break;
+		default:
+			break;
+		}
+	default:
+		break;
+	}
+	if (warn)
+		WARN(1, "Make sure you want to change to %d from %d\n",
+				new_state, old_state);
+	ctrl->state = new_state;
+	spin_unlock_irqrestore(&ctrl->lock, flags);
+}
+EXPORT_SYMBOL_GPL(nvme_force_change_ctrl_state);
+
 bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 		enum nvme_ctrl_state new_state)
 {
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index b00a56412bab..715239226f4c 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -390,6 +390,8 @@ void nvme_complete_rq(struct request *req);
 void nvme_cancel_request(struct request *req, void *data, bool reserved);
 bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 		enum nvme_ctrl_state new_state);
+void nvme_force_change_ctrl_state(struct nvme_ctrl *ctrl,
+		enum nvme_ctrl_state new_state);
 int nvme_disable_ctrl(struct nvme_ctrl *ctrl, u64 cap);
 int nvme_enable_ctrl(struct nvme_ctrl *ctrl, u64 cap);
 int nvme_shutdown_ctrl(struct nvme_ctrl *ctrl);
-- 
2.9.5


* [PATCH V6 11/11] nvme: pci: support nested EH
  2018-05-16  4:03 ` Ming Lei
@ 2018-05-16  4:03   ` Ming Lei
  -1 siblings, 0 replies; 60+ messages in thread
From: Ming Lei @ 2018-05-16  4:03 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, James Smart, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

When a request times out, nvme_timeout() currently handles it in the
following way:

	nvme_dev_disable(dev, false);
	nvme_reset_ctrl(&dev->ctrl);
	return BLK_EH_HANDLED.

There are several issues with the above approach:

1) IO may fail during resetting

An admin IO timeout may be triggered in nvme_reset_dev() when an error
happens, and a normal IO timeout may also be triggered during
nvme_wait_freeze() in the reset path. When either kind of timeout
happens, the current reset mechanism can't make progress any more.

2) race between nvme_start_freeze() and nvme_wait_freeze() & nvme_unfreeze()

- nvme_dev_disable() and resetting the controller are both required for
recovering the controller, but the two run from different contexts:
nvme_start_freeze() is called from nvme_dev_disable(), which runs in
timeout work context, while nvme_unfreeze() runs from reset work
context. Unfortunately a timeout may be triggered while the controller
is being reset, so nvme_start_freeze() may run several times.

- Also, two reset works may run one after another, which can make
nvme_wait_freeze() hang forever too.

3) every namespace's EH requires shutting down & resetting the controller

The block layer's timeout handler is per request queue, which means each
namespace's error handling may shut down & reset the whole controller;
the shutdown triggered from one namespace may then quiesce queues while
a reset from another namespace is in progress.

This patch fixes the above issues by using nested EH:

1) run the controller shutdown (nvme_dev_disable()) and the reset
(nvme_reset_dev()) from one and the same EH context, so race 2) above
can be fixed easily.

2) always start a new context for handling EH, and cancel all in-flight
requests (including the timed-out ones) in nvme_dev_disable() by
quiescing the timeout handler before shutting down the controller.

3) limit the max number of nested EH contexts; when the limit is
reached, remove the controller and fail all in-flight requests.

With this approach, blktest block/011 can be passed.
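The scheduling decision in nvme_eh_schedule() can be sketched as a small userspace model (locking and work queueing omitted; all names are stand-ins, purely illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the decision made in nvme_eh_schedule():
 * either start another nested EH context, do nothing because a
 * shutdown is already in progress (or the controller already
 * failed), or give up and fail the controller once the nesting
 * limit is reached. */
#define MAX_NESTED_EH 32

struct eh_state {
	bool shutdown_started;		/* an EH is shutting down the ctrl */
	bool failed;			/* controller marked as failed */
	unsigned int nested_eh;		/* current nesting depth */
};

enum eh_action { EH_NONE, EH_SCHEDULE, EH_FAIL_CTRL };

static enum eh_action eh_schedule(struct eh_state *s)
{
	if (s->shutdown_started)
		return EH_NONE;
	if (++s->nested_eh >= MAX_NESTED_EH) {
		if (s->failed)
			return EH_NONE;
		s->failed = true;
		return EH_FAIL_CTRL;
	}
	s->shutdown_started = true;
	return EH_SCHEDULE;
}
```

The key property is that at most one EH context is being created at a time (guarded by shutdown_started), and the depth counter bounds how many times recovery may be retried before the controller is failed.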

Cc: James Smart <james.smart@broadcom.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/core.c |  22 +++++
 drivers/nvme/host/nvme.h |   2 +
 drivers/nvme/host/pci.c  | 252 +++++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 256 insertions(+), 20 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 9e51c3e1f534..264619dc81db 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3587,6 +3587,28 @@ void nvme_start_freeze(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_start_freeze);
 
+void nvme_unquiesce_timeout(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	down_read(&ctrl->namespaces_rwsem);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		blk_unquiesce_timeout(ns->queue);
+	up_read(&ctrl->namespaces_rwsem);
+}
+EXPORT_SYMBOL_GPL(nvme_unquiesce_timeout);
+
+void nvme_quiesce_timeout(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	down_read(&ctrl->namespaces_rwsem);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		blk_quiesce_timeout(ns->queue);
+	up_read(&ctrl->namespaces_rwsem);
+}
+EXPORT_SYMBOL_GPL(nvme_quiesce_timeout);
+
 void nvme_stop_queues(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 715239226f4c..5ed7d7ddd597 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -412,6 +412,8 @@ int nvme_sec_submit(void *data, u16 spsp, u8 secp, void *buffer, size_t len,
 void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
 		union nvme_result *res);
 
+void nvme_unquiesce_timeout(struct nvme_ctrl *ctrl);
+void nvme_quiesce_timeout(struct nvme_ctrl *ctrl);
 void nvme_stop_queues(struct nvme_ctrl *ctrl);
 void nvme_start_queues(struct nvme_ctrl *ctrl);
 void nvme_kill_queues(struct nvme_ctrl *ctrl);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 4236d79e3643..58e92c7c10e0 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -71,6 +71,7 @@ struct nvme_queue;
 static void nvme_process_cq(struct nvme_queue *nvmeq);
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
 		freeze_queue);
+static int nvme_reset_dev(struct nvme_dev *dev);
 
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
@@ -113,6 +114,23 @@ struct nvme_dev {
 	dma_addr_t host_mem_descs_dma;
 	struct nvme_host_mem_buf_desc *host_mem_descs;
 	void **host_mem_desc_bufs;
+
+	/* EH handler */
+	spinlock_t	eh_lock;
+	bool		ctrl_shutdown_started;
+	bool		ctrl_failed;
+	unsigned int	nested_eh;
+	struct work_struct fail_ctrl_work;
+	wait_queue_head_t	eh_wq;
+	struct list_head	eh_head;
+};
+
+#define  NVME_MAX_NESTED_EH	32
+struct nvme_eh_work {
+	struct work_struct	work;
+	struct nvme_dev		*dev;
+	int			seq;
+	struct list_head	list;
 };
 
 static int io_queue_depth_set(const char *val, const struct kernel_param *kp)
@@ -1186,6 +1204,183 @@ static void nvme_warn_reset(struct nvme_dev *dev, u32 csts)
 			 csts, result);
 }
 
+static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
+{
+	dev_warn(dev->ctrl.device, "Removing after probe failure status: %d\n", status);
+
+	nvme_get_ctrl(&dev->ctrl);
+	nvme_dev_disable(dev, false, false);
+	if (!queue_work(nvme_wq, &dev->remove_work))
+		nvme_put_ctrl(&dev->ctrl);
+}
+
+static void nvme_eh_fail_ctrl_work(struct work_struct *work)
+{
+	struct nvme_dev *dev =
+		container_of(work, struct nvme_dev, fail_ctrl_work);
+
+	dev_info(dev->ctrl.device, "EH: fail controller\n");
+	nvme_remove_dead_ctrl(dev, 0);
+}
+
+static void nvme_eh_mark_ctrl_shutdown(struct nvme_dev *dev)
+{
+	spin_lock(&dev->eh_lock);
+	dev->ctrl_shutdown_started = false;
+	spin_unlock(&dev->eh_lock);
+}
+
+static void nvme_eh_sched_fail_ctrl(struct nvme_dev *dev)
+{
+	INIT_WORK(&dev->fail_ctrl_work, nvme_eh_fail_ctrl_work);
+	queue_work(nvme_reset_wq, &dev->fail_ctrl_work);
+}
+
+/* either controller is updated to LIVE or will be removed */
+static bool nvme_eh_reset_done(struct nvme_dev *dev)
+{
+	return dev->ctrl.state == NVME_CTRL_LIVE ||
+		dev->ctrl.state == NVME_CTRL_ADMIN_ONLY ||
+		dev->ctrl_failed;
+}
+
+static void nvme_eh_done(struct nvme_eh_work *eh_work, int result)
+{
+	struct nvme_dev *dev = eh_work->dev;
+	bool top_eh;
+
+	spin_lock(&dev->eh_lock);
+	top_eh = list_is_last(&eh_work->list, &dev->eh_head);
+
+	/* Fail controller if the top EH can't recover it */
+	if (!result)
+		wake_up_all(&dev->eh_wq);
+	else if (top_eh) {
+		dev->ctrl_failed = true;
+		nvme_eh_sched_fail_ctrl(dev);
+		wake_up_all(&dev->eh_wq);
+	}
+
+	list_del(&eh_work->list);
+	spin_unlock(&dev->eh_lock);
+
+	dev_info(dev->ctrl.device, "EH %d: state %d, eh_done %d, top eh %d\n",
+			eh_work->seq, dev->ctrl.state, result, top_eh);
+	wait_event(dev->eh_wq, nvme_eh_reset_done(dev));
+
+	/* release the EH seq, so outer EH can be allocated bigger seq No. */
+	spin_lock(&dev->eh_lock);
+	dev->nested_eh--;
+	spin_unlock(&dev->eh_lock);
+
+	/*
+	 * After controller is recovered in upper EH finally, we have to
+	 * unfreeze queues if reset failed in this EH, otherwise blk-mq
+	 * queues' freeze counter may be leaked.
+	 *
+	 * nvme_unfreeze() can only be called after controller state is
+	 * updated to LIVE.
+	 */
+	if (result && (dev->ctrl.state == NVME_CTRL_LIVE ||
+				dev->ctrl.state == NVME_CTRL_ADMIN_ONLY))
+		nvme_unfreeze(&dev->ctrl);
+}
+
+static void nvme_eh_work(struct work_struct *work)
+{
+	struct nvme_eh_work *eh_work =
+		container_of(work, struct nvme_eh_work, work);
+	struct nvme_dev *dev = eh_work->dev;
+	int result = -ENODEV;
+	bool top_eh;
+
+	dev_info(dev->ctrl.device, "EH %d: before shutdown\n",
+			eh_work->seq);
+	nvme_dev_disable(dev, false, true);
+
+	/* allow new EH to be created */
+	nvme_eh_mark_ctrl_shutdown(dev);
+
+	/*
+	 * nvme_dev_disable cancels all in-flight requests, and wont't
+	 * cause timout at all, so I am always the top EH now, but it
+	 * becomes not true after 'reset_lock' is held, so have to check
+	 * if I am still the top EH, and force to update to NVME_CTRL_RESETTING
+	 * if yes.
+	 */
+	mutex_lock(&dev->ctrl.reset_lock);
+	spin_lock(&dev->eh_lock);
+
+	/* allow top EH to preempt other inner EH */
+	top_eh = list_is_last(&eh_work->list, &dev->eh_head);
+	dev_info(dev->ctrl.device, "EH %d: after shutdown, top eh: %d\n",
+			eh_work->seq, top_eh);
+	if (!top_eh) {
+		if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
+			spin_unlock(&dev->eh_lock);
+			goto done;
+		}
+	} else {
+		nvme_force_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING);
+		result = 0;
+	}
+
+	spin_unlock(&dev->eh_lock);
+	result = nvme_reset_dev(dev);
+done:
+	mutex_unlock(&dev->ctrl.reset_lock);
+	nvme_eh_done(eh_work, result);
+	dev_info(dev->ctrl.device, "EH %d: after recovery %d\n",
+			eh_work->seq, result);
+
+	kfree(eh_work);
+}
+
+static void nvme_eh_schedule(struct nvme_dev *dev)
+{
+	bool need_sched = false;
+	bool fail_ctrl = false;
+	struct nvme_eh_work *eh_work;
+	int seq;
+
+	spin_lock(&dev->eh_lock);
+	if (!dev->ctrl_shutdown_started) {
+		need_sched = true;
+		seq = dev->nested_eh;
+		if (++dev->nested_eh >= NVME_MAX_NESTED_EH) {
+			if (!dev->ctrl_failed)
+				dev->ctrl_failed = fail_ctrl = true;
+			else
+				need_sched = false;
+		} else
+			dev->ctrl_shutdown_started = true;
+	}
+	spin_unlock(&dev->eh_lock);
+
+	if (!need_sched)
+		return;
+
+	if (fail_ctrl) {
+ fail_ctrl:
+		nvme_eh_sched_fail_ctrl(dev);
+		return;
+	}
+
+	eh_work = kzalloc(sizeof(*eh_work), GFP_NOIO);
+	if (!eh_work)
+		goto fail_ctrl;
+
+	eh_work->dev = dev;
+	eh_work->seq = seq;
+
+	spin_lock(&dev->eh_lock);
+	list_add_tail(&eh_work->list, &dev->eh_head);
+	spin_unlock(&dev->eh_lock);
+
+	INIT_WORK(&eh_work->work, nvme_eh_work);
+	queue_work(nvme_reset_wq, &eh_work->work);
+}
+
 static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -1207,9 +1402,8 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	 */
 	if (nvme_should_reset(dev, csts)) {
 		nvme_warn_reset(dev, csts);
-		nvme_dev_disable(dev, false, true);
-		nvme_reset_ctrl(&dev->ctrl);
-		return BLK_EH_HANDLED;
+		nvme_eh_schedule(dev);
+		return BLK_EH_RESET_TIMER;
 	}
 
 	/*
@@ -1234,9 +1428,9 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		dev_warn(dev->ctrl.device,
 			 "I/O %d QID %d timeout, disable controller\n",
 			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false, false);
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
-		return BLK_EH_HANDLED;
+		nvme_eh_schedule(dev);
+		return BLK_EH_RESET_TIMER;
 	default:
 		break;
 	}
@@ -1250,15 +1444,13 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		dev_warn(dev->ctrl.device,
 			 "I/O %d QID %d timeout, reset controller\n",
 			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false, true);
-		nvme_reset_ctrl(&dev->ctrl);
-
 		/*
 		 * Mark the request as handled, since the inline shutdown
 		 * forces all outstanding requests to complete.
 		 */
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
-		return BLK_EH_HANDLED;
+		nvme_eh_schedule(dev);
+		return BLK_EH_RESET_TIMER;
 	}
 
 	if (atomic_dec_return(&dev->ctrl.abort_limit) < 0) {
@@ -2316,12 +2508,28 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
 	}
 	for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
 		nvme_suspend_queue(&dev->queues[i]);
+	/*
+	 * safe to sync timeout after queues are quiesced, then all
+	 * requests(include the time-out ones) will be canceled.
+	 */
+	nvme_quiesce_timeout(&dev->ctrl);
+	if (dev->ctrl.admin_q)
+		blk_quiesce_timeout(dev->ctrl.admin_q);
 
 	nvme_pci_disable(dev);
 
+	/*
+	 * Both timeout and interrupt handler have been drained, and all
+	 * in-flight requests will be canceled now.
+	 */
 	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
 	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
 
+	/* all requests have been canceled now, so enable timeout now */
+	nvme_unquiesce_timeout(&dev->ctrl);
+	if (dev->ctrl.admin_q)
+		blk_unquiesce_timeout(dev->ctrl.admin_q);
+
 	/*
 	 * The driver will not be starting up queues again if shutting down so
 	 * must flush all entered requests to their failed completion to avoid
@@ -2381,16 +2589,6 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
 	kfree(dev);
 }
 
-static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
-{
-	dev_warn(dev->ctrl.device, "Removing after probe failure status: %d\n", status);
-
-	nvme_get_ctrl(&dev->ctrl);
-	nvme_dev_disable(dev, false, false);
-	if (!queue_work(nvme_wq, &dev->remove_work))
-		nvme_put_ctrl(&dev->ctrl);
-}
-
 static int nvme_reset_dev(struct nvme_dev *dev)
 {
 	bool was_suspend = !!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL);
@@ -2400,7 +2598,7 @@ static int nvme_reset_dev(struct nvme_dev *dev)
 
 	lockdep_assert_held(&dev->ctrl.reset_lock);
 
-	if (WARN_ON(dev->ctrl.state != NVME_CTRL_RESETTING))
+	if (dev->ctrl.state != NVME_CTRL_RESETTING)
 		goto out;
 
 	/*
@@ -2486,6 +2684,10 @@ static int nvme_reset_dev(struct nvme_dev *dev)
 		unfreeze_queue = true;
 	}
 
+	/* controller state may have been updated already by inner EH */
+	if (dev->ctrl.state == new_state)
+		goto reset_done;
+
 	result = -ENODEV;
 	/*
 	 * If only admin queue live, keep it to do further investigation or
@@ -2499,6 +2701,7 @@ static int nvme_reset_dev(struct nvme_dev *dev)
 
 	nvme_start_ctrl(&dev->ctrl);
 
+ reset_done:
 	if (unfreeze_queue)
 		nvme_unfreeze(&dev->ctrl);
 	return 0;
@@ -2615,6 +2818,13 @@ static unsigned long check_vendor_combination_bug(struct pci_dev *pdev)
 	return 0;
 }
 
+static void nvme_eh_init(struct nvme_dev *dev)
+{
+	spin_lock_init(&dev->eh_lock);
+	init_waitqueue_head(&dev->eh_wq);
+	INIT_LIST_HEAD(&dev->eh_head);
+}
+
 static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
 	int node, result = -ENOMEM;
@@ -2659,6 +2869,8 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
 	dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev));
 
+	nvme_eh_init(dev);
+
 	nvme_reset_ctrl(&dev->ctrl);
 
 	return 0;
-- 
2.9.5


* [PATCH V6 11/11] nvme: pci: support nested EH
@ 2018-05-16  4:03   ` Ming Lei
  0 siblings, 0 replies; 60+ messages in thread
From: Ming Lei @ 2018-05-16  4:03 UTC (permalink / raw)


When one req is timed out, now nvme_timeout() handles it by the
following way:

	nvme_dev_disable(dev, false);
	nvme_reset_ctrl(&dev->ctrl);
	return BLK_EH_HANDLED.

There are several issues about the above approach:

1) IO may fail during resetting

Admin IO timeout may be triggered in nvme_reset_dev() when error happens.
Normal IO timeout may be triggered too during nvme_wait_freeze() in reset
path. When the two kinds of timeout happen, the current reset mechanism
can't work any more.

2) race between nvme_start_freeze and nvme_wait_freeze() & nvme_unfreeze()

- nvme_dev_disable() and resetting controller are required for recovering
controller, but the two are run from different contexts. nvme_start_freeze()
is call from nvme_dev_disable() which is run timeout work context, and
nvme_unfreeze() is run from reset work context. Unfortunatley timeout may be
triggered during resetting controller, so nvme_start_freeze() may be run
several times.

- Also two reset work may run one by one, this may cause hang in
nvme_wait_freeze() forever too.

3) all namespace's EH require to shutdown & reset the controller

block's timeout handler is per-request-queue, that means each
namespace's error handling may shutdown & reset the whole controller,
then the shutdown from one namespace may quiese queues when resetting
from another namespace is in-progress.

This patch fixes the above issues by using nested EH:

1) run controller shutdown(nvme_dev_disable()) and resetting(nvme_reset_dev)
from one same EH context, so the above race 2) can be fixed easily.

2) always start a new context for handling EH, and cancel all in-flight
requests(include the timed-out ones) in nvme_dev_disable() by quiescing
timeout event before shutdown controller.

3) limit the max number of nested EH, when the limit is reached, removes
the controller and fails all in-flight request.

With this approach, blktest block/011 can be passed.

Cc: James Smart <james.smart at broadcom.com>
Cc: Jianchao Wang <jianchao.w.wang at oracle.com>
Cc: Christoph Hellwig <hch at lst.de>
Cc: Sagi Grimberg <sagi at grimberg.me>
Cc: linux-nvme at lists.infradead.org
Cc: Laurence Oberman <loberman at redhat.com>
Signed-off-by: Ming Lei <ming.lei at redhat.com>
---
 drivers/nvme/host/core.c |  22 +++++
 drivers/nvme/host/nvme.h |   2 +
 drivers/nvme/host/pci.c  | 252 +++++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 256 insertions(+), 20 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 9e51c3e1f534..264619dc81db 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3587,6 +3587,28 @@ void nvme_start_freeze(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_start_freeze);
 
+void nvme_unquiesce_timeout(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	down_read(&ctrl->namespaces_rwsem);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		blk_unquiesce_timeout(ns->queue);
+	up_read(&ctrl->namespaces_rwsem);
+}
+EXPORT_SYMBOL_GPL(nvme_unquiesce_timeout);
+
+void nvme_quiesce_timeout(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	down_read(&ctrl->namespaces_rwsem);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		blk_quiesce_timeout(ns->queue);
+	up_read(&ctrl->namespaces_rwsem);
+}
+EXPORT_SYMBOL_GPL(nvme_quiesce_timeout);
+
 void nvme_stop_queues(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 715239226f4c..5ed7d7ddd597 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -412,6 +412,8 @@ int nvme_sec_submit(void *data, u16 spsp, u8 secp, void *buffer, size_t len,
 void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
 		union nvme_result *res);
 
+void nvme_unquiesce_timeout(struct nvme_ctrl *ctrl);
+void nvme_quiesce_timeout(struct nvme_ctrl *ctrl);
 void nvme_stop_queues(struct nvme_ctrl *ctrl);
 void nvme_start_queues(struct nvme_ctrl *ctrl);
 void nvme_kill_queues(struct nvme_ctrl *ctrl);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 4236d79e3643..58e92c7c10e0 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -71,6 +71,7 @@ struct nvme_queue;
 static void nvme_process_cq(struct nvme_queue *nvmeq);
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
 		freeze_queue);
+static int nvme_reset_dev(struct nvme_dev *dev);
 
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
@@ -113,6 +114,23 @@ struct nvme_dev {
 	dma_addr_t host_mem_descs_dma;
 	struct nvme_host_mem_buf_desc *host_mem_descs;
 	void **host_mem_desc_bufs;
+
+	/* EH handler */
+	spinlock_t	eh_lock;
+	bool		ctrl_shutdown_started;
+	bool		ctrl_failed;
+	unsigned int	nested_eh;
+	struct work_struct fail_ctrl_work;
+	wait_queue_head_t	eh_wq;
+	struct list_head	eh_head;
+};
+
+#define  NVME_MAX_NESTED_EH	32
+struct nvme_eh_work {
+	struct work_struct	work;
+	struct nvme_dev		*dev;
+	int			seq;
+	struct list_head	list;
 };
 
 static int io_queue_depth_set(const char *val, const struct kernel_param *kp)
@@ -1186,6 +1204,183 @@ static void nvme_warn_reset(struct nvme_dev *dev, u32 csts)
 			 csts, result);
 }
 
+static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
+{
+	dev_warn(dev->ctrl.device, "Removing after probe failure status: %d\n", status);
+
+	nvme_get_ctrl(&dev->ctrl);
+	nvme_dev_disable(dev, false, false);
+	if (!queue_work(nvme_wq, &dev->remove_work))
+		nvme_put_ctrl(&dev->ctrl);
+}
+
+static void nvme_eh_fail_ctrl_work(struct work_struct *work)
+{
+	struct nvme_dev *dev =
+		container_of(work, struct nvme_dev, fail_ctrl_work);
+
+	dev_info(dev->ctrl.device, "EH: fail controller\n");
+	nvme_remove_dead_ctrl(dev, 0);
+}
+
+static void nvme_eh_mark_ctrl_shutdown(struct nvme_dev *dev)
+{
+	spin_lock(&dev->eh_lock);
+	dev->ctrl_shutdown_started = false;
+	spin_unlock(&dev->eh_lock);
+}
+
+static void nvme_eh_sched_fail_ctrl(struct nvme_dev *dev)
+{
+	INIT_WORK(&dev->fail_ctrl_work, nvme_eh_fail_ctrl_work);
+	queue_work(nvme_reset_wq, &dev->fail_ctrl_work);
+}
+
+/* either controller is updated to LIVE or will be removed */
+static bool nvme_eh_reset_done(struct nvme_dev *dev)
+{
+	return dev->ctrl.state == NVME_CTRL_LIVE ||
+		dev->ctrl.state == NVME_CTRL_ADMIN_ONLY ||
+		dev->ctrl_failed;
+}
+
+static void nvme_eh_done(struct nvme_eh_work *eh_work, int result)
+{
+	struct nvme_dev *dev = eh_work->dev;
+	bool top_eh;
+
+	spin_lock(&dev->eh_lock);
+	top_eh = list_is_last(&eh_work->list, &dev->eh_head);
+
+	/* Fail controller if the top EH can't recover it */
+	if (!result)
+		wake_up_all(&dev->eh_wq);
+	else if (top_eh) {
+		dev->ctrl_failed = true;
+		nvme_eh_sched_fail_ctrl(dev);
+		wake_up_all(&dev->eh_wq);
+	}
+
+	list_del(&eh_work->list);
+	spin_unlock(&dev->eh_lock);
+
+	dev_info(dev->ctrl.device, "EH %d: state %d, eh_done %d, top eh %d\n",
+			eh_work->seq, dev->ctrl.state, result, top_eh);
+	wait_event(dev->eh_wq, nvme_eh_reset_done(dev));
+
+	/* release the EH seq so an outer EH can be allocated a bigger seq No. */
+	spin_lock(&dev->eh_lock);
+	dev->nested_eh--;
+	spin_unlock(&dev->eh_lock);
+
+	/*
+	 * After controller is recovered in upper EH finally, we have to
+	 * unfreeze queues if reset failed in this EH, otherwise blk-mq
+	 * queues' freeze counter may be leaked.
+	 *
+	 * nvme_unfreeze() can only be called after controller state is
+	 * updated to LIVE.
+	 */
+	if (result && (dev->ctrl.state == NVME_CTRL_LIVE ||
+				dev->ctrl.state == NVME_CTRL_ADMIN_ONLY))
+		nvme_unfreeze(&dev->ctrl);
+}
+
+static void nvme_eh_work(struct work_struct *work)
+{
+	struct nvme_eh_work *eh_work =
+		container_of(work, struct nvme_eh_work, work);
+	struct nvme_dev *dev = eh_work->dev;
+	int result = -ENODEV;
+	bool top_eh;
+
+	dev_info(dev->ctrl.device, "EH %d: before shutdown\n",
+			eh_work->seq);
+	nvme_dev_disable(dev, false, true);
+
+	/* allow new EH to be created */
+	nvme_eh_mark_ctrl_shutdown(dev);
+
+	/*
+	 * nvme_dev_disable() cancels all in-flight requests and won't
+	 * cause any timeout, so this context is the top EH for now; that
+	 * may no longer hold once 'reset_lock' is held, so re-check
+	 * whether this is still the top EH and, if so, force the state
+	 * to NVME_CTRL_RESETTING.
+	 */
+	mutex_lock(&dev->ctrl.reset_lock);
+	spin_lock(&dev->eh_lock);
+
+	/* allow top EH to preempt other inner EH */
+	top_eh = list_is_last(&eh_work->list, &dev->eh_head);
+	dev_info(dev->ctrl.device, "EH %d: after shutdown, top eh: %d\n",
+			eh_work->seq, top_eh);
+	if (!top_eh) {
+		if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
+			spin_unlock(&dev->eh_lock);
+			goto done;
+		}
+	} else {
+		nvme_force_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING);
+		result = 0;
+	}
+
+	spin_unlock(&dev->eh_lock);
+	result = nvme_reset_dev(dev);
+done:
+	mutex_unlock(&dev->ctrl.reset_lock);
+	nvme_eh_done(eh_work, result);
+	dev_info(dev->ctrl.device, "EH %d: after recovery %d\n",
+			eh_work->seq, result);
+
+	kfree(eh_work);
+}
+
+static void nvme_eh_schedule(struct nvme_dev *dev)
+{
+	bool need_sched = false;
+	bool fail_ctrl = false;
+	struct nvme_eh_work *eh_work;
+	int seq;
+
+	spin_lock(&dev->eh_lock);
+	if (!dev->ctrl_shutdown_started) {
+		need_sched = true;
+		seq = dev->nested_eh;
+		if (++dev->nested_eh >= NVME_MAX_NESTED_EH) {
+			if (!dev->ctrl_failed)
+				dev->ctrl_failed = fail_ctrl = true;
+			else
+				need_sched = false;
+		} else
+			dev->ctrl_shutdown_started = true;
+	}
+	spin_unlock(&dev->eh_lock);
+
+	if (!need_sched)
+		return;
+
+	if (fail_ctrl) {
+ fail_ctrl:
+		nvme_eh_sched_fail_ctrl(dev);
+		return;
+	}
+
+	eh_work = kzalloc(sizeof(*eh_work), GFP_NOIO);
+	if (!eh_work)
+		goto fail_ctrl;
+
+	eh_work->dev = dev;
+	eh_work->seq = seq;
+
+	spin_lock(&dev->eh_lock);
+	list_add_tail(&eh_work->list, &dev->eh_head);
+	spin_unlock(&dev->eh_lock);
+
+	INIT_WORK(&eh_work->work, nvme_eh_work);
+	queue_work(nvme_reset_wq, &eh_work->work);
+}
+
 static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -1207,9 +1402,8 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	 */
 	if (nvme_should_reset(dev, csts)) {
 		nvme_warn_reset(dev, csts);
-		nvme_dev_disable(dev, false, true);
-		nvme_reset_ctrl(&dev->ctrl);
-		return BLK_EH_HANDLED;
+		nvme_eh_schedule(dev);
+		return BLK_EH_RESET_TIMER;
 	}
 
 	/*
@@ -1234,9 +1428,9 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		dev_warn(dev->ctrl.device,
 			 "I/O %d QID %d timeout, disable controller\n",
 			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false, false);
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
-		return BLK_EH_HANDLED;
+		nvme_eh_schedule(dev);
+		return BLK_EH_RESET_TIMER;
 	default:
 		break;
 	}
@@ -1250,15 +1444,13 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		dev_warn(dev->ctrl.device,
 			 "I/O %d QID %d timeout, reset controller\n",
 			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false, true);
-		nvme_reset_ctrl(&dev->ctrl);
-
 		/*
 		 * Mark the request as handled, since the inline shutdown
 		 * forces all outstanding requests to complete.
 		 */
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
-		return BLK_EH_HANDLED;
+		nvme_eh_schedule(dev);
+		return BLK_EH_RESET_TIMER;
 	}
 
 	if (atomic_dec_return(&dev->ctrl.abort_limit) < 0) {
@@ -2316,12 +2508,28 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
 	}
 	for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
 		nvme_suspend_queue(&dev->queues[i]);
+	/*
+	 * It is safe to sync timeout after queues are quiesced; then all
+	 * requests (including the timed-out ones) will be canceled.
+	 */
+	nvme_quiesce_timeout(&dev->ctrl);
+	if (dev->ctrl.admin_q)
+		blk_quiesce_timeout(dev->ctrl.admin_q);
 
 	nvme_pci_disable(dev);
 
+	/*
+	 * Both timeout and interrupt handler have been drained, and all
+	 * in-flight requests will be canceled now.
+	 */
 	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
 	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
 
+	/* all requests have been canceled, so re-enable timeout handling */
+	nvme_unquiesce_timeout(&dev->ctrl);
+	if (dev->ctrl.admin_q)
+		blk_unquiesce_timeout(dev->ctrl.admin_q);
+
 	/*
 	 * The driver will not be starting up queues again if shutting down so
 	 * must flush all entered requests to their failed completion to avoid
@@ -2381,16 +2589,6 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
 	kfree(dev);
 }
 
-static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
-{
-	dev_warn(dev->ctrl.device, "Removing after probe failure status: %d\n", status);
-
-	nvme_get_ctrl(&dev->ctrl);
-	nvme_dev_disable(dev, false, false);
-	if (!queue_work(nvme_wq, &dev->remove_work))
-		nvme_put_ctrl(&dev->ctrl);
-}
-
 static int nvme_reset_dev(struct nvme_dev *dev)
 {
 	bool was_suspend = !!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL);
@@ -2400,7 +2598,7 @@ static int nvme_reset_dev(struct nvme_dev *dev)
 
 	lockdep_assert_held(&dev->ctrl.reset_lock);
 
-	if (WARN_ON(dev->ctrl.state != NVME_CTRL_RESETTING))
+	if (dev->ctrl.state != NVME_CTRL_RESETTING)
 		goto out;
 
 	/*
@@ -2486,6 +2684,10 @@ static int nvme_reset_dev(struct nvme_dev *dev)
 		unfreeze_queue = true;
 	}
 
+	/* controller state may have been updated already by inner EH */
+	if (dev->ctrl.state == new_state)
+		goto reset_done;
+
 	result = -ENODEV;
 	/*
 	 * If only admin queue live, keep it to do further investigation or
@@ -2499,6 +2701,7 @@ static int nvme_reset_dev(struct nvme_dev *dev)
 
 	nvme_start_ctrl(&dev->ctrl);
 
+ reset_done:
 	if (unfreeze_queue)
 		nvme_unfreeze(&dev->ctrl);
 	return 0;
@@ -2615,6 +2818,13 @@ static unsigned long check_vendor_combination_bug(struct pci_dev *pdev)
 	return 0;
 }
 
+static void nvme_eh_init(struct nvme_dev *dev)
+{
+	spin_lock_init(&dev->eh_lock);
+	init_waitqueue_head(&dev->eh_wq);
+	INIT_LIST_HEAD(&dev->eh_head);
+}
+
 static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
 	int node, result = -ENOMEM;
@@ -2659,6 +2869,8 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
 	dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev));
 
+	nvme_eh_init(dev);
+
 	nvme_reset_ctrl(&dev->ctrl);
 
 	return 0;
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH V6 11/11] nvme: pci: support nested EH
  2018-05-16  4:03   ` Ming Lei
@ 2018-05-16 14:12     ` Keith Busch
  -1 siblings, 0 replies; 60+ messages in thread
From: Keith Busch @ 2018-05-16 14:12 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, James Smart, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

Hi Ming,

I'm sorry, but I am frankly not on board with introducing yet
another work-queue into this driver for handling this situation. The
fact you missed syncing with this queue in the surprise remove case,
introducing various use-after-free conditions, just demonstrates exactly
how over-complicated this approach is. That, and the forced controller
state transtions is yet another way surprise removal will break as its
depending on the state machine to prevent certain transitions.

The driver is already in a work queue context when it becomes aware of
corrective action being necessary. Seriously, simply syncing these in the
reset covers nearly all conditions you're concerned with, most of which
will be obviated if Bart's blk-mq timeout rework is added. The only case
that isn't covered is if IO stops when renumbering the hardware contexts
(unlikely as that is), and that's easily fixable just moving that into
the scan_work.

As far as blktests block/011 is concerned, I think this needs to be
rethought considering what it's actually showing us. The fact the
pci driver provides such an easy way to not only muck with PCI config
registers *and* internal kernel structures out from under a driver that's
bound to it is insane.  If PCI really wants to provide this sysfs entry,
it really ought to notify bound drivers that this is occurring, similar
to the 'reset' sysfs.

Anyway, there is merit to some of your earlier patches. In particular,
specifically patches 2, 4, and 5.  On the timing out the host memory
releasing (patch 2), I would just rather see this as a generic API,
though:

  http://lists.infradead.org/pipermail/linux-nvme/2018-January/015313.html
  http://lists.infradead.org/pipermail/linux-nvme/2018-January/015314.html

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH V6 11/11] nvme: pci: support nested EH
  2018-05-16 14:12     ` Keith Busch
@ 2018-05-16 23:10       ` Ming Lei
  -1 siblings, 0 replies; 60+ messages in thread
From: Ming Lei @ 2018-05-16 23:10 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, James Smart, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, Laurence Oberman

Hi Keith,

On Wed, May 16, 2018 at 08:12:42AM -0600, Keith Busch wrote:
> Hi Ming,
> 
> I'm sorry, but I am frankly not on board with introducing yet
> another work-queue into this driver for handling this situation. The
> fact you missed syncing with this queue in the surprise remove case,
> introducing various use-after-free conditions, just demonstrates exactly

blk_cleanup_queue() will drain all requests, and all the timeout activities
will be quiesced, so could you explain what the use-after-free
conditions are? Also we can wait on 'eh_wq' simply until dev->nested_eh
becomes zero before removing 'nvme_dev'.

> how over-complicated this approach is. That, and the forced controller

Now we run controller shutdown and resetting in different contexts,
which has caused big trouble in the form of several kinds of IO hang:

	https://marc.info/?l=linux-block&m=152644353006033&w=2

This patch moves the two into one and the same context; it is a real
simplification and avoids the IO hang caused by the race between
starting freeze and nvme_wait_freeze(), which are currently run from
different contexts.

The sync between all EH contexts looks a bit complicated, but
eh_wq is introduced to sync until the recovery is done for all EHs.

So the new EH model isn't complicated.

I hope all these issues can be fixed, and am open to any solution, but
unfortunately I don't see any other workable way yet, except for this patchset.

> state transtions is yet another way surprise removal will break as its
> depending on the state machine to prevent certain transitions.

We can limit the transitions in nvme_force_change_ctrl_state() too;
that can be documented well.

> 
> The driver is already in a work queue context when it becomes aware of
> corrective action being necessary. Seriously, simply syncing these in the

Yeah, it is in the reset work function, which is part of recovery, but either
admin or normal IO can time out in this context too; that is why I
introduce a new work item to cover this case.

I think we have to handle the issue of IO time-out during reset.

In the block/011 test, it can easily be observed that the 3rd EH is started
to recover from the whole failure, and it finally succeeds.

> reset covers nearly all conditions you're concerned with, most of which

I believe I have explained it to you, that isn't enough:

https://marc.info/?l=linux-block&m=152599770314638&w=2
https://marc.info/?l=linux-block&m=152598984210852&w=2
https://marc.info/?l=linux-block&m=152598664409170&w=2

> will be obviated if Bart's blk-mq timeout rework is added. The only case
> that isn't covered is if IO stops when renumbering the hardware contexts
> (unlikely as that is), and that's easily fixable just moving that into
> the scan_work.

How can Bart's patch cover the multiple-NS case? Timeouts from multiple
namespaces can still arrive at the same time.

> 
> As far as blktests block/011 is concerned, I think this needs to be
> rethought considering what it's actually showing us. The fact the
> pci driver provides such an easy way to not only muck with PCI config
> registers *and* internal kernel structures out from under a driver that's
> bound to it is insane.  If PCI really wants to provide this sysfs entry,
> it really ought to notify bound drivers that this is occurring, similar
> to the 'reset' sysfs.

All simulation in block/011 may happen in reality.

When one IO times out, IO dispatched during the reset can time
out too.

Unfortunately the current NVMe driver can't handle that, and the reset
work will hang forever. I don't believe that is a good way to do EH.

> 
> Anyway, there is merit to some of your earlier patches. In particular,
> specifically patches 2, 4, and 5.  On the timing out the host memory
> releasing (patch 2), I would just rather see this as a generic API,
> though:
> 
>   http://lists.infradead.org/pipermail/linux-nvme/2018-January/015313.html
>   http://lists.infradead.org/pipermail/linux-nvme/2018-January/015314.html

The generic API may not be strictly necessary, since it is only needed
when called from nvme_dev_disable(), but it is still better to have it.

Besides the above, you ignore one big source of IO hang: the race
between starting freeze and blk_mq_freeze_queue_wait().

- nvme_dev_disable() and resetting the controller are both required for
    recovering the controller, but the two are run from different contexts.
    nvme_start_freeze() is called from nvme_dev_disable(), which runs in
    timeout work context, while nvme_unfreeze() is run from reset work
    context. Unfortunately a timeout may be triggered while the controller
    is being reset, so nvme_start_freeze() may be run several times.

- Also, two reset works may run one after another; this may cause
    nvme_wait_freeze() to hang forever too.

In short, I'd like to see all the issues described in the following
commit log fixed:

	https://marc.info/?l=linux-block&m=152644353006033&w=2

If you think there are better ways, let's talk in code. Again, I am
open to any solution.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH V6 11/11] nvme: pci: support nested EH
  2018-05-16 23:10       ` Ming Lei
@ 2018-05-17  2:20         ` Keith Busch
  -1 siblings, 0 replies; 60+ messages in thread
From: Keith Busch @ 2018-05-17  2:20 UTC (permalink / raw)
  To: Ming Lei
  Cc: Keith Busch, Jens Axboe, Laurence Oberman, Sagi Grimberg,
	James Smart, linux-nvme, linux-block, Jianchao Wang,
	Christoph Hellwig

Hi Ming,

I'm developing answers, in code, to the issues you raised. It will just
take a moment to finish fleshing those out. In the meantime I just want
to point out why I think block/011 isn't a real test.

On Thu, May 17, 2018 at 07:10:59AM +0800, Ming Lei wrote:
> All simulation in block/011 may happen in reality.

If this test actually simulates reality, then the following one line
patch (plus explanation for why) would be a real "fix" as this is very
successful in passing block/011. :)

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 1faa32cd07da..dcc5746304c4 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2118,6 +2118,12 @@ static int nvme_pci_enable(struct nvme_dev *dev)
 
 	if (pci_enable_device_mem(pdev))
 		return result;
+	/*
+	 * blktests block/011 disables the device without the driver knowing.
+	 * We'll just enable the device twice to get the enable_cnt > 1
+	 * so that the test's disabling does absolutely nothing.
+	 */
+	pci_enable_device_mem(pdev);
 
 	pci_set_master(pdev);
--

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH V6 11/11] nvme: pci: support nested EH
  2018-05-17  2:20         ` Keith Busch
@ 2018-05-17  8:41           ` Christoph Hellwig
  -1 siblings, 0 replies; 60+ messages in thread
From: Christoph Hellwig @ 2018-05-17  8:41 UTC (permalink / raw)
  To: Keith Busch
  Cc: Ming Lei, Keith Busch, Jens Axboe, Laurence Oberman,
	Sagi Grimberg, James Smart, linux-nvme, linux-block,
	Jianchao Wang, Christoph Hellwig, Johannes Thumshirn, bhelgaas,
	linux-pci, arjan

On Wed, May 16, 2018 at 08:20:31PM -0600, Keith Busch wrote:
> On Thu, May 17, 2018 at 07:10:59AM +0800, Ming Lei wrote:
> > All simulation in block/011 may happen in reality.
> 
> If this test actually simulates reality, then the following one line
> patch (plus explanation for why) would be a real "fix" as this is very
> successful in passing block/011. :)
> 
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 1faa32cd07da..dcc5746304c4 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2118,6 +2118,12 @@ static int nvme_pci_enable(struct nvme_dev *dev)
>  
>  	if (pci_enable_device_mem(pdev))
>  		return result;
> +	/*
> +	 * blktests block/011 disables the device without the driver knowing.
> +	 * We'll just enable the device twice to get the enable_cnt > 1
> +	 * so that the test's disabling does absolutely nothing.
> +	 */
> +	pci_enable_device_mem(pdev);

Heh.  But yes, this test and the PCI "enable" interface in sysfs look
horribly wrong. pci_disable_device/pci_enable_device aren't something we
can just do underneath the driver.  Even if injecting the low-level
config space manipulations done by it might be useful and a good test,
the Linux-state ones just kill the driver.

The enable attribute appears to have been added by Arjan for the
Xorg driver.  I think if we have a driver bound to the device we
should not allow it.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH V6 11/11] nvme: pci: support nested EH
  2018-05-17  8:41           ` Christoph Hellwig
  (?)
@ 2018-05-17 14:20             ` Keith Busch
  -1 siblings, 0 replies; 60+ messages in thread
From: Keith Busch @ 2018-05-17 14:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ming Lei, Keith Busch, Jens Axboe, Laurence Oberman,
	Sagi Grimberg, James Smart, linux-nvme, linux-block,
	Jianchao Wang, Johannes Thumshirn, bhelgaas, linux-pci, arjan

On Thu, May 17, 2018 at 10:41:29AM +0200, Christoph Hellwig wrote:
> On Wed, May 16, 2018 at 08:20:31PM -0600, Keith Busch wrote:
> > On Thu, May 17, 2018 at 07:10:59AM +0800, Ming Lei wrote:
> > > All simulation in block/011 may happen in reality.
> > 
> > If this test actually simulates reality, then the following one line
> > patch (plus explanation for why) would be a real "fix" as this is very
> > successful in passing block/011. :)
> > 
> > ---
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index 1faa32cd07da..dcc5746304c4 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -2118,6 +2118,12 @@ static int nvme_pci_enable(struct nvme_dev *dev)
> >  
> >  	if (pci_enable_device_mem(pdev))
> >  		return result;
> > +	/*
> > +	 * blktests block/011 disables the device without the driver knowing.
> > +	 * We'll just enable the device twice to get the enable_cnt > 1
> > +	 * so that the test's disabling does absolutely nothing.
> > +	 */
> > +	pci_enable_device_mem(pdev);
> 
> Heh.  But yes, this test and the PCI "enable" interface in sysfs look
> horribly wrong. pci_disable_device/pci_enable_device aren't something we
> can just do underneath the driver.  Even if injecting the low-level
> config space manipulations done by it might be useful and a good test,
> the Linux-state ones just kill the driver.

Yes, I'm totally fine with injecting errors into the config space, but
for goodness' sake, don't fuck with the internal kernel structures out
from under drivers using them.

My suggestion is to use 'setpci' to disable the device if you want to
inject this scenario. That way you get the desired broken device
scenario you want to test, but struct pci_dev hasn't been altered.
 
> The enable attribute appears to have been added by Arjan for the
> Xorg driver.  I think if we have a driver bound to the device we
> should not allow it.

Agreed. Alternatively, possibly call the driver's reset_prepare/done
callbacks.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH V6 11/11] nvme: pci: support nested EH
  2018-05-17 14:20             ` Keith Busch
  (?)
@ 2018-05-17 14:23               ` Johannes Thumshirn
  -1 siblings, 0 replies; 60+ messages in thread
From: Johannes Thumshirn @ 2018-05-17 14:23 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Ming Lei, Keith Busch, Jens Axboe,
	Laurence Oberman, Sagi Grimberg, James Smart, linux-nvme,
	linux-block, Jianchao Wang, bhelgaas, linux-pci, arjan

On Thu, May 17, 2018 at 08:20:51AM -0600, Keith Busch wrote:
> > Heh.  But yes, this test and the PCI "enable" interface in sysfs look
> > horribly wrong. pci_disable_device/pci_enable_device aren't something we
> > can just do underneath the driver.  Even if injecting the low-level
> > config space manipulations done by it might be useful and a good test,
> > the Linux state ones are just killing the driver.
> 
> Yes, I'm totally fine with injecting errors into the config space, but
> for goodness sakes, don't fuck with the internal kernel structures out
> from under drivers using them.
> 
> My suggestion is to use 'setpci' to disable the device if you want to
> inject this scenario. That way you get the desired broken device
> scenario you want to test, but struct pci_dev hasn't been altered.
>  
> > The enable attribute appears to have been added by Arjan for the
> > Xorg driver.  I think if we have a driver bound to the device we
> > should not allow it.
> 
> Agreed. Alternatively, possibly call the driver's reset_prepare/done
> callbacks.

Exactly, but as long as we can issue the reset via sysfs, the test case
is still valid.

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH V6 11/11] nvme: pci: support nested EH
  2018-05-17  2:20         ` Keith Busch
@ 2018-05-18  0:20           ` Ming Lei
  -1 siblings, 0 replies; 60+ messages in thread
From: Ming Lei @ 2018-05-18  0:20 UTC (permalink / raw)
  To: Keith Busch
  Cc: Keith Busch, Jens Axboe, Laurence Oberman, Sagi Grimberg,
	James Smart, linux-nvme, linux-block, Jianchao Wang,
	Christoph Hellwig

On Wed, May 16, 2018 at 08:20:31PM -0600, Keith Busch wrote:
> Hi Ming,
> 
> I'm developing the answers in code to the issues you raised. It will just
> take a moment to finish fleshing those out. In the meantime I just want
> to point out why I think block/011 isn't a real test.
> 
> On Thu, May 17, 2018 at 07:10:59AM +0800, Ming Lei wrote:
> > All simulation in block/011 may happen in reality.
> 
> If this test actually simulates reality, then the following one line
> patch (plus explanation for why) would be a real "fix" as this is very
> successful in passing block/011. :)
> 
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 1faa32cd07da..dcc5746304c4 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2118,6 +2118,12 @@ static int nvme_pci_enable(struct nvme_dev *dev)
>  
>  	if (pci_enable_device_mem(pdev))
>  		return result;
> +	/*
> +	 * blktests block/011 disables the device without the driver knowing.
> +	 * We'll just enable the device twice to get the enable_cnt > 1
> +	 * so that the test's disabling does absolutely nothing.
> +	 */
> +	pci_enable_device_mem(pdev);

What I find helpful about block/011 is that it can trigger an IO timeout
during reset, which can happen in reality too.

That is one big problem of the NVMe driver, IMO.

And this patch does fix this issue, together with other timeout-related
races.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH V6 11/11] nvme: pci: support nested EH
  2018-05-18  0:20           ` Ming Lei
@ 2018-05-18  1:01             ` Ming Lei
  -1 siblings, 0 replies; 60+ messages in thread
From: Ming Lei @ 2018-05-18  1:01 UTC (permalink / raw)
  To: Keith Busch
  Cc: Keith Busch, Jens Axboe, Laurence Oberman, Sagi Grimberg,
	James Smart, linux-nvme, linux-block, Jianchao Wang,
	Christoph Hellwig

On Fri, May 18, 2018 at 08:19:58AM +0800, Ming Lei wrote:
> On Wed, May 16, 2018 at 08:20:31PM -0600, Keith Busch wrote:
> > Hi Ming,
> > 
> > I'm developing the answers in code to the issues you raised. It will just
> > take a moment to finish fleshing those out. In the meantime I just want
> > to point out why I think block/011 isn't a real test.
> > 
> > On Thu, May 17, 2018 at 07:10:59AM +0800, Ming Lei wrote:
> > > All simulation in block/011 may happen in reality.
> > 
> > If this test actually simulates reality, then the following one line
> > patch (plus explanation for why) would be a real "fix" as this is very
> > successful in passing block/011. :)
> > 
> > ---
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index 1faa32cd07da..dcc5746304c4 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -2118,6 +2118,12 @@ static int nvme_pci_enable(struct nvme_dev *dev)
> >  
> >  	if (pci_enable_device_mem(pdev))
> >  		return result;
> > +	/*
> > +	 * blktests block/011 disables the device without the driver knowing.
> > +	 * We'll just enable the device twice to get the enable_cnt > 1
> > +	 * so that the test's disabling does absolutely nothing.
> > +	 */
> > +	pci_enable_device_mem(pdev);
> 
> What I find helpful about block/011 is that it can trigger an IO timeout
> during reset, which can happen in reality too.
> 
> That is one big problem of the NVMe driver, IMO.
> 
> And this patch does fix this issue, together with other timeout-related
> races.

BTW, the above patch may not be enough to 'fix' this NVMe issue; I guess
you may have to figure out another one to remove the fault injection, or at
least disable io-timeout-fail on NVMe.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH V6 11/11] nvme: pci: support nested EH
  2018-05-18  0:20           ` Ming Lei
@ 2018-05-18 13:57             ` Keith Busch
  -1 siblings, 0 replies; 60+ messages in thread
From: Keith Busch @ 2018-05-18 13:57 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Laurence Oberman, Sagi Grimberg,
	James Smart, linux-nvme, Keith Busch, Jianchao Wang,
	Christoph Hellwig

On Fri, May 18, 2018 at 08:20:05AM +0800, Ming Lei wrote:
> What I find helpful about block/011 is that it can trigger an IO timeout
> during reset, which can happen in reality too.

As I mentioned earlier, there is nothing wrong with the spirit of
the test. What's wrong with it is the misguided implementation.

Do you understand why it ever passes? The success happens when the
enabling part of the loop happens to coincide with the driver's enabling,
creating pci_dev->enable_cnt > 1, making subsequent disable parts
of the loop do absolutely nothing; the exact same as the one-liner
(non-serious) patch I sent to defeat the test.

A better way to induce the timeout is:

  # setpci -s <B:D.f> 4.w=0:6

This will halt the device without messing with the kernel structures,
just like how a real device failure would occur.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH V6 11/11] nvme: pci: support nested EH
  2018-05-17 14:23               ` Johannes Thumshirn
  (?)
@ 2018-05-18 16:28                 ` Keith Busch
  -1 siblings, 0 replies; 60+ messages in thread
From: Keith Busch @ 2018-05-18 16:28 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Christoph Hellwig, Ming Lei, Keith Busch, Jens Axboe,
	Laurence Oberman, Sagi Grimberg, James Smart, linux-nvme,
	linux-block, Jianchao Wang, bhelgaas, linux-pci, arjan

On Thu, May 17, 2018 at 04:23:45PM +0200, Johannes Thumshirn wrote:
> > Agreed. Alternatively, possibly call the driver's reset_prepare/done
> > callbacks.
> 
> Exactly, but as long as we can issue the reset via sysfs the test-case
> is still valid.

I disagree that the test case is valid. The test writes '0' to the
pci-sysfs 'enable' attribute, but the driver also disables the PCI device
as part of resetting, which is a perfectly reasonable thing for a driver to do.

If the timing of the test's loop happens to write '0' right after the
driver disabled the device that it owns, a 'write error' on that sysfs
write occurs, and blktests then incorrectly claims the test failed.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH V6 11/11] nvme: pci: support nested EH
  2018-05-18 13:57             ` Keith Busch
@ 2018-05-18 16:58               ` Jens Axboe
  -1 siblings, 0 replies; 60+ messages in thread
From: Jens Axboe @ 2018-05-18 16:58 UTC (permalink / raw)
  To: Keith Busch, Ming Lei
  Cc: linux-block, Laurence Oberman, Sagi Grimberg, James Smart,
	linux-nvme, Keith Busch, Jianchao Wang, Christoph Hellwig

On 5/18/18 7:57 AM, Keith Busch wrote:
> On Fri, May 18, 2018 at 08:20:05AM +0800, Ming Lei wrote:
>> What I find helpful about block/011 is that it can trigger an IO timeout
>> during reset, which can happen in reality too.
> 
> As I mentioned earlier, there is nothing wrong with the spirit of
> the test. What's wrong with it is the misguided implementation.
> 
> Do you understand why it ever passes? The success happens when the
> enabling part of the loop happens to coincide with the driver's enabling,
> creating the pci_dev->enable_cnt > 1, making subsequent disable parts
> of the loop do absolutely nothing; the exact same as the one-liner
> (non-serious) patch I sent to defeat the test.
> 
> A better way to induce the timeout is:
> 
>   # setpci -s <B:D.f> 4.w=0:6
> 
> This will halt the device without messing with the kernel structures,
> just like how a real device failure would occur.

Let's just improve/fix the test case. Sounds like the 'enable' sysfs
attribute should never have been exported, and hence the test should
never have used it. blktests is not the source of truth, necessarily,
and it would be silly to work around cases in the kernel if it's a
clear case of "doctor it hurts when I shoot myself in the foot".

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH V6 11/11] nvme: pci: support nested EH
  2018-05-18 13:57             ` Keith Busch
@ 2018-05-18 22:26               ` Ming Lei
  -1 siblings, 0 replies; 60+ messages in thread
From: Ming Lei @ 2018-05-18 22:26 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Keith Busch, Laurence Oberman, Sagi Grimberg,
	James Smart, linux-nvme, linux-block, Jianchao Wang,
	Christoph Hellwig

On Fri, May 18, 2018 at 07:57:51AM -0600, Keith Busch wrote:
> On Fri, May 18, 2018 at 08:20:05AM +0800, Ming Lei wrote:
> > What I find helpful about block/011 is that it can trigger an IO timeout
> > during reset, which can happen in reality too.
> 
> As I mentioned earlier, there is nothing wrong with the spirit of
> the test. What's wrong with it is the misguided implementation.
> 
> Do you understand why it ever passes? The success happens when the
> enabling part of the loop happens to coincide with the driver's enabling,
> creating the pci_dev->enable_cnt > 1, making subsequent disable parts
> of the loop do absolutely nothing; the exact same as the one-liner
> (non-serious) patch I sent to defeat the test.
> 
> A better way to induce the timeout is:
> 
>   # setpci -s <B:D.f> 4.w=0:6
> 
> This will halt the device without messing with the kernel structures,
> just like how a real device failure would occur.

Frankly speaking, I don't care how the test case is implemented at all.

The big problem is that the NVMe driver can't handle IO timeouts in the
reset context, and eventually either the controller becomes DEAD or the
reset context hangs forever, and nothing can move on.

The issue can be reproduced more easily via io-timeout-fail fault injection.
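For anyone wanting to reproduce it this way, a rough sketch of the fault-injection knobs (assumes a kernel built with CONFIG_FAIL_IO_TIMEOUT, debugfs mounted at /sys/kernel/debug, and an nvme0n1 disk; exact paths may vary by kernel version):

```shell
# Make eligible I/O completions be treated as timeouts instead
echo 100 > /sys/kernel/debug/fail_io_timeout/probability
echo -1  > /sys/kernel/debug/fail_io_timeout/times      # no limit on failures
echo 1   > /sys/block/nvme0n1/io-timeout-fail           # opt this disk in
```

Running normal I/O against the disk with these set then exercises the driver's timeout handler without touching struct pci_dev at all.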

So could we please face the real issue instead of working around the
test case?

Thanks,
Ming

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH V6 11/11] nvme: pci: support nested EH
  2018-05-18 22:26               ` Ming Lei
@ 2018-05-18 23:45                 ` Keith Busch
  -1 siblings, 0 replies; 60+ messages in thread
From: Keith Busch @ 2018-05-18 23:45 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Laurence Oberman, Sagi Grimberg,
	James Smart, linux-nvme, Keith Busch, Jianchao Wang,
	Christoph Hellwig

On Sat, May 19, 2018 at 06:26:28AM +0800, Ming Lei wrote:
> So could we please face to the real issue instead of working around
> test case?

Yes, that's why I want you to stop referencing the broken test.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH V6 11/11] nvme: pci: support nested EH
  2018-05-18 23:45                 ` Keith Busch
@ 2018-05-18 23:51                   ` Ming Lei
  -1 siblings, 0 replies; 60+ messages in thread
From: Ming Lei @ 2018-05-18 23:51 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Laurence Oberman, Sagi Grimberg,
	James Smart, linux-nvme, Keith Busch, Jianchao Wang,
	Christoph Hellwig

On Fri, May 18, 2018 at 05:45:00PM -0600, Keith Busch wrote:
> On Sat, May 19, 2018 at 06:26:28AM +0800, Ming Lei wrote:
> > So could we please face the real issue instead of working around
> > the test case?
> 
> Yes, that's why I want you to stop referencing the broken test.

Unfortunately I don't care whether it is broken or not; I think
the test case is great precisely because it exposes the real problem.
That is its value!

The reason I referenced this test is that everyone can reproduce
the real issue with it easily, that is all.
Thanks,
Ming


* Re: [PATCH V6 11/11] nvme: pci: support nested EH
  2018-05-18 16:28                 ` Keith Busch
  (?)
@ 2018-05-22  7:35                   ` Johannes Thumshirn
  -1 siblings, 0 replies; 60+ messages in thread
From: Johannes Thumshirn @ 2018-05-22  7:35 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Ming Lei, Keith Busch, Jens Axboe,
	Laurence Oberman, Sagi Grimberg, James Smart, linux-nvme,
	linux-block, Jianchao Wang, bhelgaas, linux-pci, arjan

On Fri, May 18, 2018 at 10:28:04AM -0600, Keith Busch wrote:
> On Thu, May 17, 2018 at 04:23:45PM +0200, Johannes Thumshirn wrote:
> > > Agreed. Alternatively possibly call the driver's reset_prepare/done
> > > callbacks.
> > 
> > Exactly, but as long as we can issue the reset via sysfs the test-case
> > is still valid.
> 
> I disagree the test case is valid. The test writes '0' to the
> pci-sysfs 'enable', but the driver also disables the pci device as part
> of resetting, which is a perfectly reasonable thing for a driver to do.
> 
> If the timing of the test's loop happens to write '0' right after the
> driver disabled the device that it owns, a 'write error' on that sysfs
> write occurs, and blktests then incorrectly claims the test failed.

Hmm, OK, that's a valid point. But seeing that you have sent a patch for
blktests anyway, I think it's gone now.

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* Re: [PATCH V6 02/11] nvme: pci: cover timeout for admin commands running in EH
  2018-05-16  4:03   ` Ming Lei
@ 2018-05-24 15:39     ` Keith Busch
  -1 siblings, 0 replies; 60+ messages in thread
From: Keith Busch @ 2018-05-24 15:39 UTC (permalink / raw)
  To: Ming Lei
  Cc: Keith Busch, Jens Axboe, Laurence Oberman, Sagi Grimberg,
	James Smart, linux-nvme, linux-block, Jianchao Wang,
	Christoph Hellwig

On Wed, May 16, 2018 at 12:03:04PM +0800, Ming Lei wrote:
> +static void nvme_set_host_mem_end_io(struct request *rq, blk_status_t sts)
> +{
> +	struct completion *waiting = rq->end_io_data;
> +
> +	rq->end_io_data = NULL;
> +	blk_mq_free_request(rq);
> +
> +	/*
> +	 * complete last, if this is a stack request the process (and thus
> +	 * the rq pointer) could be invalid right after this complete()
> +	 */
> +	complete(waiting);
> +}
> +
> +/*
> + * This function can only be used inside nvme_dev_disable() when timeout
> + * may not work, then this function has to cover the timeout by itself.
> + *
> + * When wait_for_completion_io_timeout() returns 0 and timeout happens,
> + * this request will be completed after controller is shutdown.
> + */
> +static int nvme_set_host_mem_timeout(struct nvme_dev *dev, u32 bits)
> +{
> +	DECLARE_COMPLETION_ONSTACK(wait);
> +	struct nvme_command c;
> +	struct request_queue *q = dev->ctrl.admin_q;
> +	struct request *req;
> +	int ret;
> +
> +	nvme_init_set_host_mem_cmd(dev, &c, bits);
> +
> +	req = nvme_alloc_request(q, &c, 0, NVME_QID_ANY);
> +	if (IS_ERR(req))
> +		return PTR_ERR(req);
> +
> +	req->timeout = ADMIN_TIMEOUT;
> +	req->end_io_data = &wait;
> +
> +	blk_execute_rq_nowait(q, NULL, req, false,
> +			nvme_set_host_mem_end_io);
> +	ret = wait_for_completion_io_timeout(&wait, ADMIN_TIMEOUT);
> +	if (ret > 0) {
> +		if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
> +			ret = -EINTR;
> +		else
> +			ret = nvme_req(req)->status;
> +	} else
> +		ret = -EINTR;
> +
> +	return ret;
> +}

This function doesn't need to return anything at all. The caller doesn't
use the return value to decide what to do next.

Now let's say it does time out. The command will be completed later
through nvme_cancel_request, but your completion callback will reference
a now-invalid stack variable.

>  static void nvme_free_host_mem(struct nvme_dev *dev)
>  {
>  	int i;
> @@ -2225,7 +2284,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>  		 * but I'd rather be safe than sorry..
>  		 */
>  		if (dev->host_mem_descs)
> -			nvme_set_host_mem(dev, 0);
> +			nvme_set_host_mem_timeout(dev, 0);
>  		nvme_disable_io_queues(dev);
>  		nvme_disable_admin_queue(dev, shutdown);
>  	}


end of thread, other threads:[~2018-05-24 15:39 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-16  4:03 [PATCH V6 00/11] nvme: pci: fix & improve timeout handling Ming Lei
2018-05-16  4:03 ` Ming Lei
2018-05-16  4:03 ` [PATCH V6 01/11] block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout() Ming Lei
2018-05-16  4:03   ` Ming Lei
2018-05-16  4:03 ` [PATCH V6 02/11] nvme: pci: cover timeout for admin commands running in EH Ming Lei
2018-05-16  4:03   ` Ming Lei
2018-05-24 15:39   ` Keith Busch
2018-05-24 15:39     ` Keith Busch
2018-05-16  4:03 ` [PATCH V6 03/11] nvme: pci: unquiesce admin queue after controller is shutdown Ming Lei
2018-05-16  4:03   ` Ming Lei
2018-05-16  4:03 ` [PATCH V6 04/11] nvme: pci: set nvmeq->cq_vector after alloc cq/sq Ming Lei
2018-05-16  4:03   ` Ming Lei
2018-05-16  4:03 ` [PATCH V6 05/11] nvme: pci: only wait freezing if queue is frozen Ming Lei
2018-05-16  4:03   ` Ming Lei
2018-05-16  4:03 ` [PATCH V6 06/11] nvme: pci: freeze queue in nvme_dev_disable() in case of error recovery Ming Lei
2018-05-16  4:03   ` Ming Lei
2018-05-16  4:03 ` [PATCH V6 07/11] nvme: pci: prepare for supporting error recovery from resetting context Ming Lei
2018-05-16  4:03   ` Ming Lei
2018-05-16  4:03 ` [PATCH V6 08/11] nvme: pci: move error handling out of nvme_reset_dev() Ming Lei
2018-05-16  4:03   ` Ming Lei
2018-05-16  4:03 ` [PATCH V6 09/11] nvme: pci: don't unfreeze queue until controller state updating succeeds Ming Lei
2018-05-16  4:03   ` Ming Lei
2018-05-16  4:03 ` [PATCH V6 10/11] nvme: core: introduce nvme_force_change_ctrl_state() Ming Lei
2018-05-16  4:03   ` Ming Lei
2018-05-16  4:03 ` [PATCH V6 11/11] nvme: pci: support nested EH Ming Lei
2018-05-16  4:03   ` Ming Lei
2018-05-16 14:12   ` Keith Busch
2018-05-16 14:12     ` Keith Busch
2018-05-16 23:10     ` Ming Lei
2018-05-16 23:10       ` Ming Lei
2018-05-17  2:20       ` Keith Busch
2018-05-17  2:20         ` Keith Busch
2018-05-17  8:41         ` Christoph Hellwig
2018-05-17  8:41           ` Christoph Hellwig
2018-05-17 14:20           ` Keith Busch
2018-05-17 14:20             ` Keith Busch
2018-05-17 14:20             ` Keith Busch
2018-05-17 14:23             ` Johannes Thumshirn
2018-05-17 14:23               ` Johannes Thumshirn
2018-05-17 14:23               ` Johannes Thumshirn
2018-05-18 16:28               ` Keith Busch
2018-05-18 16:28                 ` Keith Busch
2018-05-18 16:28                 ` Keith Busch
2018-05-22  7:35                 ` Johannes Thumshirn
2018-05-22  7:35                   ` Johannes Thumshirn
2018-05-22  7:35                   ` Johannes Thumshirn
2018-05-18  0:20         ` Ming Lei
2018-05-18  0:20           ` Ming Lei
2018-05-18  1:01           ` Ming Lei
2018-05-18  1:01             ` Ming Lei
2018-05-18 13:57           ` Keith Busch
2018-05-18 13:57             ` Keith Busch
2018-05-18 16:58             ` Jens Axboe
2018-05-18 16:58               ` Jens Axboe
2018-05-18 22:26             ` Ming Lei
2018-05-18 22:26               ` Ming Lei
2018-05-18 23:45               ` Keith Busch
2018-05-18 23:45                 ` Keith Busch
2018-05-18 23:51                 ` Ming Lei
2018-05-18 23:51                   ` Ming Lei
