* [PATCH 0/2] nvme: pci: fix & improve timeout handling
@ 2018-04-26 12:39 ` Ming Lei
  0 siblings, 0 replies; 72+ messages in thread
From: Ming Lei @ 2018-04-26 12:39 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme

Hi,

The first patch introduces an EH kthread for handling timeouts; it
simplifies the logic considerably and fixes the failures reported by
block/011.

The second patch fixes the issue reported by Jianchao, in which an
admin request may time out while EH is in progress.


Ming Lei (2):
  nvme: pci: simplify timeout handling
  nvme: pci: guarantee EH can make progress

 drivers/nvme/host/core.c |  11 ++++
 drivers/nvme/host/nvme.h |   1 +
 drivers/nvme/host/pci.c  | 161 ++++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 158 insertions(+), 15 deletions(-)

Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org


-- 
2.9.5

* [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-04-26 12:39 ` Ming Lei
@ 2018-04-26 12:39   ` Ming Lei
  -1 siblings, 0 replies; 72+ messages in thread
From: Ming Lei @ 2018-04-26 12:39 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme

When a request times out, nvme_timeout() currently handles it in the
following way:

	nvme_dev_disable(dev, false);
	nvme_reset_ctrl(&dev->ctrl);
	return BLK_EH_HANDLED;

which may introduce the following issues:

1) a subsequent timeout on another request may call nvme_dev_disable()
again, which may quiesce the queues while the reset is still in
progress, so that eventually nothing can make progress.

2) nvme_dev_disable() may be called again from inside the reset, which
can cause a double completion of the previously timed-out request.

3) the timed-out request itself is not covered by nvme_dev_disable();
this approach is too tricky and easy to get wrong.

This patch fixes these issues by moving timeout handling into a single
EH thread: the timeout handler wakes that thread and returns
BLK_EH_NOT_HANDLED, so all timed-out requests are handled by the EH
thread.

This patch fixes the failures reported by the horrible block/011 test.
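As a quick orientation before the diff (the patch below is authoritative;
this is only a simplified sketch of the resulting control flow):

	/* timeout path: hand everything off to the per-device EH kthread */
	nvme_timeout()
	  -> nvme_eh_schedule(dev)           /* set eh_in_recovery, wake nvme_eh_<instance> */
	  -> return BLK_EH_NOT_HANDLED

	/* EH kthread */
	nvme_error_handler()
	  -> nvme_dev_disable(dev, false)    /* quiesce queues, sync timeouts, cancel requests */
	  -> nvme_reset_ctrl(&dev->ctrl)     /* reset work signals nvme_eh_done() when finished */
	  -> wait_event(dev->eh_wq, !dev->eh_in_recovery)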

Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/core.c |  11 ++++
 drivers/nvme/host/nvme.h |   1 +
 drivers/nvme/host/pci.c  | 134 +++++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 135 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 9df4f71e58ca..f68c653f7fbf 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3569,6 +3569,17 @@ void nvme_start_freeze(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_start_freeze);
 
+void nvme_sync_queues(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	down_read(&ctrl->namespaces_rwsem);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		blk_sync_queue(ns->queue);
+	up_read(&ctrl->namespaces_rwsem);
+}
+EXPORT_SYMBOL_GPL(nvme_sync_queues);
+
 void nvme_stop_queues(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 061fecfd44f5..08bb297d5f62 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -402,6 +402,7 @@ int nvme_sec_submit(void *data, u16 spsp, u8 secp, void *buffer, size_t len,
 void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
 		union nvme_result *res);
 
+void nvme_sync_queues(struct nvme_ctrl *ctrl);
 void nvme_stop_queues(struct nvme_ctrl *ctrl);
 void nvme_start_queues(struct nvme_ctrl *ctrl);
 void nvme_kill_queues(struct nvme_ctrl *ctrl);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fbc71fac6f1e..5d05a04f8e72 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -29,6 +29,7 @@
 #include <linux/types.h>
 #include <linux/io-64-nonatomic-lo-hi.h>
 #include <linux/sed-opal.h>
+#include <linux/kthread.h>
 
 #include "nvme.h"
 
@@ -112,6 +113,11 @@ struct nvme_dev {
 	dma_addr_t host_mem_descs_dma;
 	struct nvme_host_mem_buf_desc *host_mem_descs;
 	void **host_mem_desc_bufs;
+
+	spinlock_t	  eh_lock;
+	wait_queue_head_t eh_wq;
+	bool		eh_in_recovery;
+	struct task_struct    *ehandler;
 };
 
 static int io_queue_depth_set(const char *val, const struct kernel_param *kp)
@@ -1176,6 +1182,80 @@ static void nvme_warn_reset(struct nvme_dev *dev, u32 csts)
 			 csts, result);
 }
 
+static void nvme_eh_schedule(struct nvme_dev *dev)
+{
+	spin_lock(&dev->eh_lock);
+	if (!dev->eh_in_recovery) {
+		dev->eh_in_recovery = true;
+		wake_up_process(dev->ehandler);
+	}
+	spin_unlock(&dev->eh_lock);
+}
+
+static void nvme_eh_done(struct nvme_dev *dev)
+{
+	spin_lock(&dev->eh_lock);
+	if (dev->eh_in_recovery) {
+		dev->eh_in_recovery = false;
+		wake_up(&dev->eh_wq);
+	}
+	spin_unlock(&dev->eh_lock);
+}
+
+static int nvme_error_handler(void *data)
+{
+	struct nvme_dev *dev = data;
+
+	while (true) {
+		/*
+		 * The sequence in kthread_stop() sets the stop flag first
+		 * then wakes the process.  To avoid missed wakeups, the task
+		 * should always be in a non running state before the stop
+		 * flag is checked
+		 */
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (kthread_should_stop())
+			break;
+
+		spin_lock(&dev->eh_lock);
+		if (!dev->eh_in_recovery) {
+			spin_unlock(&dev->eh_lock);
+			schedule();
+			continue;
+		}
+		spin_unlock(&dev->eh_lock);
+
+		__set_current_state(TASK_RUNNING);
+
+		dev_info(dev->ctrl.device, "start eh recovery\n");
+		nvme_dev_disable(dev, false);
+		nvme_reset_ctrl(&dev->ctrl);
+
+		wait_event(dev->eh_wq, !dev->eh_in_recovery);
+
+		dev_info(dev->ctrl.device, "eh recovery done\n");
+	}
+	__set_current_state(TASK_RUNNING);
+	dev->ehandler = NULL;
+
+	return 0;
+}
+
+static int nvme_eh_init(struct nvme_dev *dev)
+{
+	spin_lock_init(&dev->eh_lock);
+	init_waitqueue_head(&dev->eh_wq);
+
+	dev->ehandler = kthread_run(nvme_error_handler, dev,
+			"nvme_eh_%d", dev->ctrl.instance);
+	if (IS_ERR(dev->ehandler)) {
+		dev_err(dev->ctrl.device, "error handler thread failed to spawn, error = %ld\n",
+			PTR_ERR(dev->ehandler));
+		return PTR_ERR(dev->ehandler);
+	}
+	return 0;
+}
+
 static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -1197,9 +1277,8 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	 */
 	if (nvme_should_reset(dev, csts)) {
 		nvme_warn_reset(dev, csts);
-		nvme_dev_disable(dev, false);
-		nvme_reset_ctrl(&dev->ctrl);
-		return BLK_EH_HANDLED;
+		nvme_eh_schedule(dev);
+		return BLK_EH_NOT_HANDLED;
 	}
 
 	/*
@@ -1224,9 +1303,9 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		dev_warn(dev->ctrl.device,
 			 "I/O %d QID %d timeout, disable controller\n",
 			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false);
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
-		return BLK_EH_HANDLED;
+		nvme_eh_schedule(dev);
+		return BLK_EH_NOT_HANDLED;
 	default:
 		break;
 	}
@@ -1240,15 +1319,13 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		dev_warn(dev->ctrl.device,
 			 "I/O %d QID %d timeout, reset controller\n",
 			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false);
-		nvme_reset_ctrl(&dev->ctrl);
-
 		/*
 		 * Mark the request as handled, since the inline shutdown
 		 * forces all outstanding requests to complete.
 		 */
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
-		return BLK_EH_HANDLED;
+		nvme_eh_schedule(dev);
+		return BLK_EH_NOT_HANDLED;
 	}
 
 	if (atomic_dec_return(&dev->ctrl.abort_limit) < 0) {
@@ -2180,6 +2257,22 @@ static void nvme_pci_disable(struct nvme_dev *dev)
 	}
 }
 
+/*
+ * This is called after the queues are quiesced, with no in-flight timeout
+ * handling and no nvme interrupt handling in progress.
+ */
+static void nvme_pci_cancel_request(struct request *req, void *data,
+		bool reserved)
+{
+	/* make sure timed-out requests are covered too */
+	if (req->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) {
+		req->aborted_gstate = 0;
+		req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
+	}
+
+	nvme_cancel_request(req, data, reserved);
+}
+
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 {
 	int i;
@@ -2223,10 +2316,17 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
 		nvme_suspend_queue(&dev->queues[i]);
 
+	/*
+	 * It is safe to sync the timeout path once queues are quiesced; then
+	 * all requests (including the timed-out ones) will be canceled.
+	 */
+	nvme_sync_queues(&dev->ctrl);
+	blk_sync_queue(dev->ctrl.admin_q);
+
 	nvme_pci_disable(dev);
 
-	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
-	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
+	blk_mq_tagset_busy_iter(&dev->tagset, nvme_pci_cancel_request, &dev->ctrl);
+	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_pci_cancel_request, &dev->ctrl);
 
 	/*
 	 * The driver will not be starting up queues again if shutting down so
@@ -2358,6 +2458,8 @@ static void nvme_reset_work(struct work_struct *work)
 	if (result)
 		goto out;
 
+	nvme_eh_done(dev);
+
 	/*
 	 * Keep the controller around but remove all namespaces if we don't have
 	 * any working I/O queue.
@@ -2391,6 +2493,9 @@ static void nvme_reset_work(struct work_struct *work)
 
  out:
 	nvme_remove_dead_ctrl(dev, result);
+
+	/* mark EH done even if the controller is dead */
+	nvme_eh_done(dev);
 }
 
 static void nvme_remove_dead_ctrl_work(struct work_struct *work)
@@ -2532,10 +2637,15 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
 	dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev));
 
+	if (nvme_eh_init(dev))
+		goto uninit_ctrl;
+
 	nvme_reset_ctrl(&dev->ctrl);
 
 	return 0;
 
+ uninit_ctrl:
+	nvme_uninit_ctrl(&dev->ctrl);
  release_pools:
 	nvme_release_prp_pools(dev);
  unmap:
@@ -2589,6 +2699,8 @@ static void nvme_remove(struct pci_dev *pdev)
 	nvme_stop_ctrl(&dev->ctrl);
 	nvme_remove_namespaces(&dev->ctrl);
 	nvme_dev_disable(dev, true);
+	if (dev->ehandler)
+		kthread_stop(dev->ehandler);
 	nvme_free_host_mem(dev);
 	nvme_dev_remove_admin(dev);
 	nvme_free_queues(dev, 0);
-- 
2.9.5

* [PATCH 2/2] nvme: pci: guarantee EH can make progress
  2018-04-26 12:39 ` Ming Lei
@ 2018-04-26 12:39   ` Ming Lei
  -1 siblings, 0 replies; 72+ messages in thread
From: Ming Lei @ 2018-04-26 12:39 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Ming Lei, Jianchao Wang,
	Christoph Hellwig, Sagi Grimberg, linux-nvme

While handling an error or timeout, the driver still needs to send
commands to the admin queue, and these commands can time out too, in
which case the EH handler may never make progress.

These admin commands do not actually need to be handled after the
controller has been recovered, because they are all marked as
fail-fast (FAIL_FAST_DRIVER).

So this patch returns BLK_EH_HANDLED for these requests in
nvme_timeout().

Also log this failure when it happens.
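For context, the blk_noretry_request() check used by the patch boils down
to the fail-fast flags; roughly (paraphrased from include/linux/blkdev.h of
this era, the exact definition may differ):

	#define blk_noretry_request(rq) \
		((rq)->cmd_flags & (REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | \
				    REQ_FAILFAST_DRIVER))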

Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-nvme@lists.infradead.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 5d05a04f8e72..1e058deb4718 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1265,6 +1265,20 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	struct nvme_command cmd;
 	u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
+	/*
+	 * If error recovery is in progress and this request does not need
+	 * to be retried, return BLK_EH_HANDLED immediately so that the error
+	 * handler kthread can always make progress: we still need to send
+	 * FAILFAST requests to the admin queue while handling the error.
+	 */
+	spin_lock(&dev->eh_lock);
+	if (dev->eh_in_recovery && blk_noretry_request(req)) {
+		spin_unlock(&dev->eh_lock);
+		nvme_req(req)->status |= NVME_SC_DNR;
+		return BLK_EH_HANDLED;
+	}
+	spin_unlock(&dev->eh_lock);
+
 	/* If PCI error recovery process is happening, we cannot reset or
 	 * the recovery mechanism will surely fail.
 	 */
@@ -2106,7 +2120,7 @@ static int nvme_delete_queue(struct nvme_queue *nvmeq, u8 opcode)
 	return 0;
 }
 
-static void nvme_disable_io_queues(struct nvme_dev *dev)
+static int nvme_disable_io_queues(struct nvme_dev *dev)
 {
 	int pass, queues = dev->online_queues - 1;
 	unsigned long timeout;
@@ -2125,12 +2139,14 @@ static void nvme_disable_io_queues(struct nvme_dev *dev)
 		while (sent--) {
 			timeout = wait_for_completion_io_timeout(&dev->ioq_wait, timeout);
 			if (timeout == 0)
-				return;
+				return -EIO;
 			if (i)
 				goto retry;
 		}
 		opcode = nvme_admin_delete_cq;
 	}
+
+	return 0;
 }
 
 /*
@@ -2302,6 +2318,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	nvme_stop_queues(&dev->ctrl);
 
 	if (!dead && dev->ctrl.queue_count > 0) {
+		int ret = 0;
 		/*
 		 * If the controller is still alive tell it to stop using the
 		 * host memory buffer.  In theory the shutdown / reset should
@@ -2309,8 +2326,10 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 		 * but I'd rather be safe than sorry..
 		 */
 		if (dev->host_mem_descs)
-			nvme_set_host_mem(dev, 0);
-		nvme_disable_io_queues(dev);
+			ret = nvme_set_host_mem(dev, 0);
+		ret |= nvme_disable_io_queues(dev);
+		if (ret)
+			dev_warn(dev->ctrl.device, "failed to configure dev\n");
 		nvme_disable_admin_queue(dev, shutdown);
 	}
 	for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
-- 
2.9.5

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-04-26 12:39   ` Ming Lei
@ 2018-04-26 15:07     ` jianchao.wang
  -1 siblings, 0 replies; 72+ messages in thread
From: jianchao.wang @ 2018-04-26 15:07 UTC (permalink / raw)
  To: Ming Lei, Keith Busch
  Cc: Jens Axboe, Sagi Grimberg, linux-nvme, linux-block, Christoph Hellwig

Hi Ming

Thanks for your wonderful solution. :)

On 04/26/2018 08:39 PM, Ming Lei wrote:
> +/*
> + * This one is called after queues are quiesced, and no in-fligh timeout
> + * and nvme interrupt handling.
> + */
> +static void nvme_pci_cancel_request(struct request *req, void *data,
> +		bool reserved)
> +{
> +	/* make sure timed-out requests are covered too */
> +	if (req->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) {
> +		req->aborted_gstate = 0;
> +		req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
> +	}
> +
> +	nvme_cancel_request(req, data, reserved);
> +}
> +
>  static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>  {
>  	int i;
> @@ -2223,10 +2316,17 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>  	for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
>  		nvme_suspend_queue(&dev->queues[i]);
>  
> +	/*
> +	 * safe to sync timeout after queues are quiesced, then all
> +	 * requests(include the time-out ones) will be canceled.
> +	 */
> +	nvme_sync_queues(&dev->ctrl);
> +	blk_sync_queue(dev->ctrl.admin_q);
> +
Looks like blk_sync_queue cannot drain all the timeout work.

blk_sync_queue
  -> del_timer_sync          
                          blk_mq_timeout_work
                            -> mod_timer
  -> cancel_work_sync
so the timeout work may come back again after blk_sync_queue() returns.
We may need to force all the in-flight requests to time out with
blk_abort_request() instead (see the sketch below).
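For reference, blk_sync_queue() in this era is roughly the following
(simplified sketch, the mainline source is authoritative), which is why a
timeout work that re-arms the timer between the two calls is not fully
drained:

	void blk_sync_queue(struct request_queue *q)
	{
		del_timer_sync(&q->timeout);
		cancel_work_sync(&q->timeout_work);

		/* blk-mq also syncs the requeue work and per-hctx run_work here */
	}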

>  	nvme_pci_disable(dev);

The interrupt will not come any more, but there may still be one running.
Maybe a synchronize_sched() is needed here?

Thanks
Jianchao
>  
> -	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
> -	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
> +	blk_mq_tagset_busy_iter(&dev->tagset, nvme_pci_cancel_request, &dev->ctrl);
> +	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_pci_cancel_request, &dev->ctrl);

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-04-26 15:07     ` jianchao.wang
@ 2018-04-26 15:57       ` Ming Lei
  -1 siblings, 0 replies; 72+ messages in thread
From: Ming Lei @ 2018-04-26 15:57 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Keith Busch, Jens Axboe, linux-block, Sagi Grimberg, linux-nvme,
	Christoph Hellwig

Hi Jianchao,

On Thu, Apr 26, 2018 at 11:07:56PM +0800, jianchao.wang wrote:
> Hi Ming
> 
> Thanks for your wonderful solution. :)
> 
> On 04/26/2018 08:39 PM, Ming Lei wrote:
> > +/*
> > + * This one is called after queues are quiesced, and no in-fligh timeout
> > + * and nvme interrupt handling.
> > + */
> > +static void nvme_pci_cancel_request(struct request *req, void *data,
> > +		bool reserved)
> > +{
> > +	/* make sure timed-out requests are covered too */
> > +	if (req->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) {
> > +		req->aborted_gstate = 0;
> > +		req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
> > +	}
> > +
> > +	nvme_cancel_request(req, data, reserved);
> > +}
> > +
> >  static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
> >  {
> >  	int i;
> > @@ -2223,10 +2316,17 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
> >  	for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
> >  		nvme_suspend_queue(&dev->queues[i]);
> >  
> > +	/*
> > +	 * safe to sync timeout after queues are quiesced, then all
> > +	 * requests(include the time-out ones) will be canceled.
> > +	 */
> > +	nvme_sync_queues(&dev->ctrl);
> > +	blk_sync_queue(dev->ctrl.admin_q);
> > +
> Looks like blk_sync_queue cannot drain all the timeout work.
> 
> blk_sync_queue
>   -> del_timer_sync          
>                           blk_mq_timeout_work
>                             -> mod_timer
>   -> cancel_work_sync
> the timeout work may come back again.
> we may need to force all the in-flight requests to be timed out with blk_abort_request
> 

blk_abort_request() seems like overkill; we could avoid this race simply
by returning EH_NOT_HANDLED if the controller is in recovery (see the
sketch below).
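A minimal sketch of that idea (hypothetical placement near the top of
nvme_timeout(), reusing the eh_in_recovery state introduced in patch 1):

	spin_lock(&dev->eh_lock);
	if (dev->eh_in_recovery) {
		spin_unlock(&dev->eh_lock);
		/* the EH thread will cancel this request via nvme_dev_disable() */
		return BLK_EH_NOT_HANDLED;
	}
	spin_unlock(&dev->eh_lock);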

> >  	nvme_pci_disable(dev);
> 
> the interrupt will not come, but there maybe running one.
> a synchronize_sched() here ?

We may cover this case by moving nvme_suspend_queue() before
nvme_stop_queues().

Both are very good catches, thanks!

-- 
Ming

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-04-26 15:57       ` Ming Lei
@ 2018-04-26 16:16         ` Ming Lei
  -1 siblings, 0 replies; 72+ messages in thread
From: Ming Lei @ 2018-04-26 16:16 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Keith Busch, Jens Axboe, linux-block, Sagi Grimberg, linux-nvme,
	Christoph Hellwig

On Thu, Apr 26, 2018 at 11:57:22PM +0800, Ming Lei wrote:
> Hi Jianchao,
> 
> On Thu, Apr 26, 2018 at 11:07:56PM +0800, jianchao.wang wrote:
> > Hi Ming
> > 
> > Thanks for your wonderful solution. :)
> > 
> > On 04/26/2018 08:39 PM, Ming Lei wrote:
> > > +/*
> > > + * This one is called after queues are quiesced, and no in-fligh timeout
> > > + * and nvme interrupt handling.
> > > + */
> > > +static void nvme_pci_cancel_request(struct request *req, void *data,
> > > +		bool reserved)
> > > +{
> > > +	/* make sure timed-out requests are covered too */
> > > +	if (req->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) {
> > > +		req->aborted_gstate = 0;
> > > +		req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
> > > +	}
> > > +
> > > +	nvme_cancel_request(req, data, reserved);
> > > +}
> > > +
> > >  static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
> > >  {
> > >  	int i;
> > > @@ -2223,10 +2316,17 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
> > >  	for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
> > >  		nvme_suspend_queue(&dev->queues[i]);
> > >  
> > > +	/*
> > > +	 * safe to sync timeout after queues are quiesced, then all
> > > +	 * requests(include the time-out ones) will be canceled.
> > > +	 */
> > > +	nvme_sync_queues(&dev->ctrl);
> > > +	blk_sync_queue(dev->ctrl.admin_q);
> > > +
> > Looks like blk_sync_queue cannot drain all the timeout work.
> > 
> > blk_sync_queue
> >   -> del_timer_sync          
> >                           blk_mq_timeout_work
> >                             -> mod_timer
> >   -> cancel_work_sync
> > the timeout work may come back again.
> > we may need to force all the in-flight requests to be timed out with blk_abort_request
> > 
> 
> blk_abort_request() seems over-kill, we could avoid this race simply by
> returning EH_NOT_HANDLED if the controller is in-recovery.
> 
> > >  	nvme_pci_disable(dev);
> > 
> > the interrupt will not come, but there maybe running one.
> > a synchronize_sched() here ?
> 
> We may cover this case by moving nvme_suspend_queue() before
> nvme_stop_queues().

It looks like the above change is not needed, because pci_free_irq(),
called from nvme_suspend_queue(), won't return until any executing
interrupt handlers for this IRQ have completed.

Thanks,
Ming

* Re: [PATCH 2/2] nvme: pci: guarantee EH can make progress
  2018-04-26 12:39   ` Ming Lei
@ 2018-04-26 16:24     ` Keith Busch
  -1 siblings, 0 replies; 72+ messages in thread
From: Keith Busch @ 2018-04-26 16:24 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Jianchao Wang, Christoph Hellwig,
	Sagi Grimberg, linux-nvme

On Thu, Apr 26, 2018 at 08:39:56PM +0800, Ming Lei wrote:
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 5d05a04f8e72..1e058deb4718 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1265,6 +1265,20 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>  	struct nvme_command cmd;
>  	u32 csts = readl(dev->bar + NVME_REG_CSTS);
>  
> +	/*
> +	 * If error recovery is in-progress and this request needn't to
> +	 * be retried, return BLK_EH_HANDLED immediately, so that error
> +	 * handler kthread can always make progress since we still need
> +	 * to send FAILFAST request to admin queue for handling error.
> +	 */
> +	spin_lock(&dev->eh_lock);
> +	if (dev->eh_in_recovery && blk_noretry_request(req)) {
> +		spin_unlock(&dev->eh_lock);
> +		nvme_req(req)->status |= NVME_SC_DNR;
> +		return BLK_EH_HANDLED;
> +	}
> +	spin_unlock(&dev->eh_lock);

This doesn't really look safe. Even if a command times out, the
controller still owns that command, and completing it while PCI bus
master enable is still set can cause memory corruption.
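In other words (a hedged sketch of the ordering concern, not a concrete
proposal): a timed-out request should only be completed once the device
can no longer DMA into host memory, e.g.:

	nvme_dev_disable(dev, false);	/* nvme_pci_disable() stops DMA/IRQs */
	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
					/* only now is completing them safe */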

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-04-26 15:57       ` Ming Lei
@ 2018-04-27  1:37         ` jianchao.wang
  -1 siblings, 0 replies; 72+ messages in thread
From: jianchao.wang @ 2018-04-27  1:37 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Sagi Grimberg, linux-nvme, Keith Busch,
	Christoph Hellwig



On 04/26/2018 11:57 PM, Ming Lei wrote:
> Hi Jianchao,
> 
> On Thu, Apr 26, 2018 at 11:07:56PM +0800, jianchao.wang wrote:
>> Hi Ming
>>
>> Thanks for your wonderful solution. :)
>>
>> On 04/26/2018 08:39 PM, Ming Lei wrote:
>>> +/*
>>> + * This one is called after queues are quiesced, and no in-fligh timeout
>>> + * and nvme interrupt handling.
>>> + */
>>> +static void nvme_pci_cancel_request(struct request *req, void *data,
>>> +		bool reserved)
>>> +{
>>> +	/* make sure timed-out requests are covered too */
>>> +	if (req->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) {
>>> +		req->aborted_gstate = 0;
>>> +		req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
>>> +	}
>>> +
>>> +	nvme_cancel_request(req, data, reserved);
>>> +}
>>> +
>>>  static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>>>  {
>>>  	int i;
>>> @@ -2223,10 +2316,17 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>>>  	for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
>>>  		nvme_suspend_queue(&dev->queues[i]);
>>>  
>>> +	/*
>>> +	 * safe to sync timeout after queues are quiesced, then all
>>> +	 * requests(include the time-out ones) will be canceled.
>>> +	 */
>>> +	nvme_sync_queues(&dev->ctrl);
>>> +	blk_sync_queue(dev->ctrl.admin_q);
>>> +
>> Looks like blk_sync_queue cannot drain all the timeout work.
>>
>> blk_sync_queue
>>   -> del_timer_sync          
>>                           blk_mq_timeout_work
>>                             -> mod_timer
>>   -> cancel_work_sync
>> the timeout work may come back again.
>> we may need to force all the in-flight requests to be timed out with blk_abort_request
>>
> 
> blk_abort_request() seems over-kill, we could avoid this race simply by
> returning EH_NOT_HANDLED if the controller is in-recovery.
Returning EH_NOT_HANDLED may not be enough.
Please consider the following scenario.

                                      nvme_error_handler
                                        -> nvme_dev_disable
                                          -> blk_sync_queue
//timeout comes again due to the
//scenario above
blk_mq_timeout_work                    
  -> blk_mq_check_expired               
    -> set aborted_gstate
                                          -> nvme_pci_cancel_request
                                            -> RQF_MQ_TIMEOUT_EXPIRED has not been set
                                            -> nvme_cancel_request
                                              -> blk_mq_complete_request
                                                -> do nothing
  -> blk_mq_terminate_expired
    -> blk_mq_rq_timed_out
      -> set RQF_MQ_TIMEOUT_EXPIRED
      -> .timeout return BLK_EH_NOT_HANDLED

Then the timed-out request is leaked.
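A minimal sketch of the blk_abort_request() alternative suggested earlier
(hypothetical helper name, assuming it is safe to run the tag iterator at
this point):

	/* force every in-flight request into the timed-out state before the
	 * queues are synced, so none can race with a re-armed timeout work */
	static void nvme_force_timeout(struct request *req, void *data, bool reserved)
	{
		blk_abort_request(req);
	}

	blk_mq_tagset_busy_iter(&dev->tagset, nvme_force_timeout, NULL);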

> 
>>>  	nvme_pci_disable(dev);
>>
>> the interrupt will not come, but there maybe running one.
>> a synchronize_sched() here ?
> 
> We may cover this case by moving nvme_suspend_queue() before
> nvme_stop_queues().
> 
> Both two are very good catch, thanks!
> 

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-04-27  1:37         ` jianchao.wang
@ 2018-04-27 14:57           ` Ming Lei
  -1 siblings, 0 replies; 72+ messages in thread
From: Ming Lei @ 2018-04-27 14:57 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Jens Axboe, linux-block, Sagi Grimberg, linux-nvme, Keith Busch,
	Christoph Hellwig

On Fri, Apr 27, 2018 at 09:37:06AM +0800, jianchao.wang wrote:
> 
> 
> On 04/26/2018 11:57 PM, Ming Lei wrote:
> > Hi Jianchao,
> > 
> > On Thu, Apr 26, 2018 at 11:07:56PM +0800, jianchao.wang wrote:
> >> Hi Ming
> >>
> >> Thanks for your wonderful solution. :)
> >>
> >> On 04/26/2018 08:39 PM, Ming Lei wrote:
> >>> +/*
> >>> + * This one is called after queues are quiesced, and no in-fligh timeout
> >>> + * and nvme interrupt handling.
> >>> + */
> >>> +static void nvme_pci_cancel_request(struct request *req, void *data,
> >>> +		bool reserved)
> >>> +{
> >>> +	/* make sure timed-out requests are covered too */
> >>> +	if (req->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) {
> >>> +		req->aborted_gstate = 0;
> >>> +		req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
> >>> +	}
> >>> +
> >>> +	nvme_cancel_request(req, data, reserved);
> >>> +}
> >>> +
> >>>  static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
> >>>  {
> >>>  	int i;
> >>> @@ -2223,10 +2316,17 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
> >>>  	for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
> >>>  		nvme_suspend_queue(&dev->queues[i]);
> >>>  
> >>> +	/*
> >>> +	 * safe to sync timeout after queues are quiesced, then all
> >>> +	 * requests(include the time-out ones) will be canceled.
> >>> +	 */
> >>> +	nvme_sync_queues(&dev->ctrl);
> >>> +	blk_sync_queue(dev->ctrl.admin_q);
> >>> +
> >> Looks like blk_sync_queue cannot drain all the timeout work.
> >>
> >> blk_sync_queue
> >>   -> del_timer_sync          
> >>                           blk_mq_timeout_work
> >>                             -> mod_timer
> >>   -> cancel_work_sync
> >> the timeout work may come back again.
> >> we may need to force all the in-flight requests to be timed out with blk_abort_request
> >>
> > 
> > blk_abort_request() seems over-kill, we could avoid this race simply by
> > returning EH_NOT_HANDLED if the controller is in-recovery.
> return EH_NOT_HANDLED maybe not enough.
> please consider the following scenario.
> 
>                                       nvme_error_handler
>                                         -> nvme_dev_disable
>                                           -> blk_sync_queue
> //timeout comes again due to the
> //scenario above

I may not be understanding your point: once blk_sync_queue() returns, the
timer itself is deactivated, and in the meantime the synced .nvme_timeout()
only returns EH_NOT_HANDLED before the deactivation.

That means this timer won't expire any more, so could you explain a bit
why a timeout can come again after blk_sync_queue() returns?

Thanks,
Ming

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-04-26 12:39   ` Ming Lei
@ 2018-04-27 17:51     ` Keith Busch
  -1 siblings, 0 replies; 72+ messages in thread
From: Keith Busch @ 2018-04-27 17:51 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Jianchao Wang, Christoph Hellwig,
	Sagi Grimberg, linux-nvme

On Thu, Apr 26, 2018 at 08:39:55PM +0800, Ming Lei wrote:
> +/*
> + * This one is called after queues are quiesced, and no in-fligh timeout
> + * and nvme interrupt handling.
> + */
> +static void nvme_pci_cancel_request(struct request *req, void *data,
> +		bool reserved)
> +{
> +	/* make sure timed-out requests are covered too */
> +	if (req->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) {
> +		req->aborted_gstate = 0;
> +		req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
> +	}
> +
> +	nvme_cancel_request(req, data, reserved);
> +}

I don't know about this. I feel like blk-mq owns these flags and LLDs
shouldn't require such knowledge of their implementation details.

I understand how the problems are happening a bit better now. It used
to be that blk-mq would lock an expired command one at a time, so when
we had a batch of IO timeouts, the driver was able to complete all of
them inside a single IO timeout handler.

That's not the case anymore, so the driver is called for every IO
timeout even if it already reaped all the commands at once.

IMO, the block layer should allow the driver to complete all timed out
IOs at once rather than go through ->timeout() for each of them, and
without it having to know about the request state details. Many of the
problems would go away in that case.

But I think much of this can be fixed by just syncing the queues in the
probe work, which may be a more reasonable bug-fix for 4.17 than
rewriting the entire timeout handler. What do you think of this patch?
It's an update of one I posted a few months ago, but it uses Jianchao's
safe namespace list locking.

---
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a3771c5729f5..198b4469c3e2 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3562,12 +3562,25 @@ void nvme_start_queues(struct nvme_ctrl *ctrl)
 	struct nvme_ns *ns;
 
 	down_read(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list)
+	list_for_each_entry(ns, &ctrl->namespaces, list) {
 		blk_mq_unquiesce_queue(ns->queue);
+		blk_mq_kick_requeue_list(ns->queue);
+	}
 	up_read(&ctrl->namespaces_rwsem);
 }
 EXPORT_SYMBOL_GPL(nvme_start_queues);
 
+void nvme_sync_queues(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	down_read(&ctrl->namespaces_rwsem);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		blk_sync_queue(ns->queue);
+	up_read(&ctrl->namespaces_rwsem);
+}
+EXPORT_SYMBOL_GPL(nvme_sync_queues);
+
 int nvme_reinit_tagset(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set)
 {
 	if (!ctrl->ops->reinit_request)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 7ded7a51c430..e62273198725 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -402,6 +402,7 @@ int nvme_sec_submit(void *data, u16 spsp, u8 secp, void *buffer, size_t len,
 void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
 		union nvme_result *res);
 
+void nvme_sync_queues(struct nvme_ctrl *ctrl);
 void nvme_stop_queues(struct nvme_ctrl *ctrl);
 void nvme_start_queues(struct nvme_ctrl *ctrl);
 void nvme_kill_queues(struct nvme_ctrl *ctrl);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fbc71fac6f1e..a2f3ad105620 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2303,6 +2303,7 @@ static void nvme_reset_work(struct work_struct *work)
 	 */
 	if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
 		nvme_dev_disable(dev, false);
+	nvme_sync_queues(&dev->ctrl);
 
 	/*
 	 * Introduce CONNECTING state from nvme-fc/rdma transports to mark the
--

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH 2/2] nvme: pci: guarantee EH can make progress
  2018-04-26 16:24     ` Keith Busch
@ 2018-04-28  3:28       ` Ming Lei
  -1 siblings, 0 replies; 72+ messages in thread
From: Ming Lei @ 2018-04-28  3:28 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Jianchao Wang, Christoph Hellwig,
	Sagi Grimberg, linux-nvme

On Thu, Apr 26, 2018 at 10:24:43AM -0600, Keith Busch wrote:
> On Thu, Apr 26, 2018 at 08:39:56PM +0800, Ming Lei wrote:
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index 5d05a04f8e72..1e058deb4718 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -1265,6 +1265,20 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
> >  	struct nvme_command cmd;
> >  	u32 csts = readl(dev->bar + NVME_REG_CSTS);
> >  
> > +	/*
> > +	 * If error recovery is in-progress and this request needn't to
> > +	 * be retried, return BLK_EH_HANDLED immediately, so that error
> > +	 * handler kthread can always make progress since we still need
> > +	 * to send FAILFAST request to admin queue for handling error.
> > +	 */
> > +	spin_lock(&dev->eh_lock);
> > +	if (dev->eh_in_recovery && blk_noretry_request(req)) {
> > +		spin_unlock(&dev->eh_lock);
> > +		nvme_req(req)->status |= NVME_SC_DNR;
> > +		return BLK_EH_HANDLED;
> > +	}
> > +	spin_unlock(&dev->eh_lock);
> 
> This doesn't really look safe. Even if a command times out, the
> controller still owns that command, and calling it done while pci bus
> master enable is still set can cause memory corruption.

OK, that is one point I missed.

This issue can be handled by sending 'set host mem' in an async way in the
nvme_dev_disable() path, just like nvme_disable_io_queues().
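
For the disable path, something along these lines could work (a rough
sketch only, not the existing nvme_set_host_mem(); the helper name and
the end_io callback are placeholders, and no data buffer is needed here
since we only clear the enable bit):

static int nvme_set_host_mem_async(struct nvme_dev *dev, u32 bits,
                                   rq_end_io_fn *done)
{
        struct nvme_command c = { };
        struct request *req;

        c.features.opcode  = nvme_admin_set_features;
        c.features.fid     = cpu_to_le32(NVME_FEAT_HOST_MEM_BUF);
        c.features.dword11 = cpu_to_le32(bits);

        req = nvme_alloc_request(dev->ctrl.admin_q, &c, BLK_MQ_REQ_NOWAIT,
                                 NVME_QID_ANY);
        if (IS_ERR(req))
                return PTR_ERR(req);

        req->timeout = ADMIN_TIMEOUT;
        /* don't wait for completion, in the style of nvme_delete_queue() */
        blk_execute_rq_nowait(dev->ctrl.admin_q, NULL, req, false, done);
        return 0;
}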

Thanks,
Ming

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-04-27 17:51     ` Keith Busch
@ 2018-04-28  3:50       ` Ming Lei
  -1 siblings, 0 replies; 72+ messages in thread
From: Ming Lei @ 2018-04-28  3:50 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Jianchao Wang, Christoph Hellwig,
	Sagi Grimberg, linux-nvme

On Fri, Apr 27, 2018 at 11:51:57AM -0600, Keith Busch wrote:
> On Thu, Apr 26, 2018 at 08:39:55PM +0800, Ming Lei wrote:
> > +/*
> > + * This one is called after queues are quiesced, and no in-fligh timeout
> > + * and nvme interrupt handling.
> > + */
> > +static void nvme_pci_cancel_request(struct request *req, void *data,
> > +		bool reserved)
> > +{
> > +	/* make sure timed-out requests are covered too */
> > +	if (req->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) {
> > +		req->aborted_gstate = 0;
> > +		req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
> > +	}
> > +
> > +	nvme_cancel_request(req, data, reserved);
> > +}
> 
> I don't know about this. I feel like blk-mq owns these flags and LLD's
> shouldn't require such knowledge of their implementation details.

If the driver returns BLK_EH_NOT_HANDLED from .timeout(), the driver may
need to deal with this flag.

But it won't be necessary to use this flag here; it can be done by
always clearing 'req->aborted_gstate'.
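
That is, the cancel callback quoted above would shrink to something
like this (sketch):

static void nvme_pci_cancel_request(struct request *req, void *data,
		bool reserved)
{
	/* cover timed-out requests too, without touching RQF_* flags */
	req->aborted_gstate = 0;

	nvme_cancel_request(req, data, reserved);
}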

> 
> I understand how the problems are happening a bit better now. It used
> to be that blk-mq would lock an expired command one at a time, so when
> we had a batch of IO timeouts, the driver was able to complete all of
> them inside a single IO timeout handler.
> 
> That's not the case anymore, so the driver is called for every IO
> timeout even though if it reaped all the commands at once.

Actually that wasn't the case before either; even for the legacy path, one
.timeout() handles only one request.

> 
> IMO, the block layer should allow the driver to complete all timed out
> IOs at once rather than go through ->timeout() for each of them and
> without knowing about the request state details. Much of the problems
> would go away in that case.

That would need changes to every driver, which doesn't look easy to do, and
I'm not sure other drivers need it this way.

> 
> But I think much of this can be fixed by just syncing the queues in the
> probe work, and may be a more reasonable bug-fix for 4.17 rather than
> rewrite the entire timeout handler. What do you think of this patch? It's
> an update of one I posted a few months ago, but uses Jianchao's safe
> namespace list locking.
> 
> ---
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index a3771c5729f5..198b4469c3e2 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3562,12 +3562,25 @@ void nvme_start_queues(struct nvme_ctrl *ctrl)
>  	struct nvme_ns *ns;
>  
>  	down_read(&ctrl->namespaces_rwsem);
> -	list_for_each_entry(ns, &ctrl->namespaces, list)
> +	list_for_each_entry(ns, &ctrl->namespaces, list) {
>  		blk_mq_unquiesce_queue(ns->queue);
> +		blk_mq_kick_requeue_list(ns->queue);
> +	}
>  	up_read(&ctrl->namespaces_rwsem);
>  }
>  EXPORT_SYMBOL_GPL(nvme_start_queues);
>  
> +void nvme_sync_queues(struct nvme_ctrl *ctrl)
> +{
> +	struct nvme_ns *ns;
> +
> +	down_read(&ctrl->namespaces_rwsem);
> +	list_for_each_entry(ns, &ctrl->namespaces, list)
> +		blk_sync_queue(ns->queue);
> +	up_read(&ctrl->namespaces_rwsem);
> +}
> +EXPORT_SYMBOL_GPL(nvme_sync_queues);
> +
>  int nvme_reinit_tagset(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set)
>  {
>  	if (!ctrl->ops->reinit_request)
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 7ded7a51c430..e62273198725 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -402,6 +402,7 @@ int nvme_sec_submit(void *data, u16 spsp, u8 secp, void *buffer, size_t len,
>  void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
>  		union nvme_result *res);
>  
> +void nvme_sync_queues(struct nvme_ctrl *ctrl);
>  void nvme_stop_queues(struct nvme_ctrl *ctrl);
>  void nvme_start_queues(struct nvme_ctrl *ctrl);
>  void nvme_kill_queues(struct nvme_ctrl *ctrl);
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index fbc71fac6f1e..a2f3ad105620 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2303,6 +2303,7 @@ static void nvme_reset_work(struct work_struct *work)
>  	 */
>  	if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
>  		nvme_dev_disable(dev, false);
> +	nvme_sync_queues(&dev->ctrl);

This sync may race with a timed-out request, which may be handled
as BLK_EH_HANDLED or BLK_EH_RESET_TIMER, so the above queue sync can't
work reliably.

There are more issues besides the above one if disabling the controller
and resetting it are run in different contexts:

1) inside resetting, nvme_dev_disable() may be called, but that may
cause double completion of the previously timed-out request.

2) the timed-out request can't be covered by nvme_dev_disable(), and
this approach is too tricky and can easily cause trouble.

IMO, it is much easier to avoid the above issues by handling controller
recovery in one single thread. I will address all your comments and post
V3 for review.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-04-28  3:50       ` Ming Lei
@ 2018-04-28 13:35         ` Keith Busch
  -1 siblings, 0 replies; 72+ messages in thread
From: Keith Busch @ 2018-04-28 13:35 UTC (permalink / raw)
  To: Ming Lei
  Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, linux-block,
	Jianchao Wang, Christoph Hellwig

On Sat, Apr 28, 2018 at 11:50:17AM +0800, Ming Lei wrote:
> > I understand how the problems are happening a bit better now. It used
> > to be that blk-mq would lock an expired command one at a time, so when
> > we had a batch of IO timeouts, the driver was able to complete all of
> > them inside a single IO timeout handler.
> > 
> > That's not the case anymore, so the driver is called for every IO
> > timeout even though if it reaped all the commands at once.
> 
> Actually there isn't the case before, even for legacy path, one .timeout()
> handles one request only.

That's not quite what I was talking about.

Before, only the command that was about to be sent to the driver's
.timeout() was marked completed. The driver could (and did) complete
other timed-out commands in a single .timeout(), and the tag would
clear, so we could handle all timeouts in a single .timeout().

Now, blk-mq marks all timed out commands as aborted prior to calling
the driver's .timeout(). If the driver completes any of those commands,
the tag does not clear, so the driver's .timeout() just gets to be called
again for commands it already reaped.



> > IMO, the block layer should allow the driver to complete all timed out
> > IOs at once rather than go through ->timeout() for each of them and
> > without knowing about the request state details. Much of the problems
> > would go away in that case.
> 
> That need to change every driver, looks not easy to do, also not sure
> if other drivers need this way.
> 
> > 
> > But I think much of this can be fixed by just syncing the queues in the
> > probe work, and may be a more reasonable bug-fix for 4.17 rather than
> > rewrite the entire timeout handler. What do you think of this patch? It's
> > an update of one I posted a few months ago, but uses Jianchao's safe
> > namespace list locking.
> > 
> > ---
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index a3771c5729f5..198b4469c3e2 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -3562,12 +3562,25 @@ void nvme_start_queues(struct nvme_ctrl *ctrl)
> >  	struct nvme_ns *ns;
> >  
> >  	down_read(&ctrl->namespaces_rwsem);
> > -	list_for_each_entry(ns, &ctrl->namespaces, list)
> > +	list_for_each_entry(ns, &ctrl->namespaces, list) {
> >  		blk_mq_unquiesce_queue(ns->queue);
> > +		blk_mq_kick_requeue_list(ns->queue);
> > +	}
> >  	up_read(&ctrl->namespaces_rwsem);
> >  }
> >  EXPORT_SYMBOL_GPL(nvme_start_queues);
> >  
> > +void nvme_sync_queues(struct nvme_ctrl *ctrl)
> > +{
> > +	struct nvme_ns *ns;
> > +
> > +	down_read(&ctrl->namespaces_rwsem);
> > +	list_for_each_entry(ns, &ctrl->namespaces, list)
> > +		blk_sync_queue(ns->queue);
> > +	up_read(&ctrl->namespaces_rwsem);
> > +}
> > +EXPORT_SYMBOL_GPL(nvme_sync_queues);
> > +
> >  int nvme_reinit_tagset(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set)
> >  {
> >  	if (!ctrl->ops->reinit_request)
> > diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> > index 7ded7a51c430..e62273198725 100644
> > --- a/drivers/nvme/host/nvme.h
> > +++ b/drivers/nvme/host/nvme.h
> > @@ -402,6 +402,7 @@ int nvme_sec_submit(void *data, u16 spsp, u8 secp, void *buffer, size_t len,
> >  void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
> >  		union nvme_result *res);
> >  
> > +void nvme_sync_queues(struct nvme_ctrl *ctrl);
> >  void nvme_stop_queues(struct nvme_ctrl *ctrl);
> >  void nvme_start_queues(struct nvme_ctrl *ctrl);
> >  void nvme_kill_queues(struct nvme_ctrl *ctrl);
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index fbc71fac6f1e..a2f3ad105620 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -2303,6 +2303,7 @@ static void nvme_reset_work(struct work_struct *work)
> >  	 */
> >  	if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
> >  		nvme_dev_disable(dev, false);
> > +	nvme_sync_queues(&dev->ctrl);
> 
> This sync may be raced with one timed-out request, which may be handled
> as BLK_EH_HANDLED or BLK_EH_RESET_TIMER, so the above sync queues can't
> work reliably. 
> 
> There are more issues except for the above one if disabling controller
> and resetting it are run in different context:
> 
> 1) inside resetting, nvme_dev_disable() may be called, but this way
> may cause double completion on the previous timed-out request.
> 
> 2) the timed-out request can't be covered by nvme_dev_disable(), and
> this way is too tricky and easy to cause trouble.
> 
> IMO, it is much easier to avoid the above issues by handling controller
> recover in one single thread. I will address all your comments and post
> V3 for review.
> 
> 
> Thanks,
> Ming
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-04-27 14:57           ` Ming Lei
@ 2018-04-28 14:00             ` jianchao.wang
  -1 siblings, 0 replies; 72+ messages in thread
From: jianchao.wang @ 2018-04-28 14:00 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, linux-nvme, linux-block,
	Christoph Hellwig

Hi ming

On 04/27/2018 10:57 PM, Ming Lei wrote:
> I may not understand your point, once blk_sync_queue() returns, the
> timer itself is deactivated, meantime the synced .nvme_timeout() only
> returns EH_NOT_HANDLED before the deactivation.
> 
> That means this timer won't be expired any more, so could you explain
> a bit why timeout can come again after blk_sync_queue() returns

Please consider the following case:

blk_sync_queue
  -> del_timer_sync          
                          blk_mq_timeout_work
                            -> blk_mq_check_expired // return the timeout value
                            -> blk_mq_terminate_expired
                              -> .timeout //return EH_NOT_HANDLED
                            -> mod_timer // setup the timer again based on the result of blk_mq_check_expired
  -> cancel_work_sync
So after blk_sync_queue(), the timer may come back again, and then the timeout work.
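
For reference, the relevant part of blk_sync_queue() looks roughly like
this (simplified sketch, the rest of the helper is elided):

void blk_sync_queue(struct request_queue *q)
{
	/*
	 * The timer is stopped before the work is cancelled, so a
	 * timeout work that is already running can still call
	 * mod_timer(); cancel_work_sync() waits for that work to
	 * finish, but it does not undo the re-armed timer, which can
	 * fire later and queue the timeout work once more.
	 */
	del_timer_sync(&q->timeout);
	cancel_work_sync(&q->timeout_work);
}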

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-04-28 13:35         ` Keith Busch
@ 2018-04-28 14:31           ` jianchao.wang
  -1 siblings, 0 replies; 72+ messages in thread
From: jianchao.wang @ 2018-04-28 14:31 UTC (permalink / raw)
  To: Keith Busch, Ming Lei
  Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, linux-block,
	Christoph Hellwig

Hi Ming and Keith

Let me expand on the details a bit more here. :)

On 04/28/2018 09:35 PM, Keith Busch wrote:
>> Actually there isn't the case before, even for legacy path, one .timeout()
>> handles one request only.

Yes, .timeout should be invoked for every timed-out request, and in principle
.timeout should handle only that one request;
however, nvme_timeout will invoke nvme_dev_disable.

> That's not quite what I was talking about.
> 
> Before, only the command that was about to be sent to the driver's
> .timeout() was marked completed. The driver could (and did) compete
> other timed out commands in a single .timeout(), and the tag would
> clear, so we could hanlde all timeouts in a single .timeout().

I think Keith is saying that
before this new blk-mq timeout implementation, the logic of blk_mq_timeout_work was:

get _only_ _one_ timed-out request
mark it completed
invoke .timeout; in nvme, that is nvme_timeout
then nvme_dev_disable is invoked, and thus the other requests can be completed by blk_mq_complete_request
because they have not been marked completed

> 
> Now, blk-mq marks all timed out commands as aborted prior to calling
> the driver's .timeout(). If the driver completes any of those commands,
> the tag does not clear, so the driver's .timeout() just gets to be called
> again for commands it already reaped.
> 

After the new blk-mq timeout implementation:

set the aborted_gstate of _all_ the timed-out requests
invoke .timeout one by one
for the first timed-out request's .timeout, in nvme, that is nvme_timeout
nvme_dev_disable is invoked and tries to complete all the in-flight requests through blk_mq_complete_request
but timed-out requests whose aborted_gstate has been set cannot be handled by blk_mq_complete_request
so some requests are leaked by nvme_dev_disable
these residual timed-out requests will still be handled by blk_mq_timeout_work, which invokes .timeout for them one by one
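
Roughly, in code form (a heavily simplified sketch of the 4.17-era
blk_mq_timeout_work(); locking, SRCU and reference counting details
are dropped):

static void blk_mq_timeout_work(struct work_struct *work)
{
	struct request_queue *q =
		container_of(work, struct request_queue, timeout_work);
	struct blk_mq_timeout_data data = { };

	/* pass 1: set ->aborted_gstate for every expired request */
	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);

	if (data.nr_expired) {
		/* make ->aborted_gstate visible to completion paths */
		synchronize_rcu();
		/* pass 2: call ->timeout() for each expired request */
		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
	}

	/* re-arm the timer for the remaining in-flight requests */
	if (data.next_set)
		mod_timer(&q->timeout, data.next);
}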


Thanks
Jianchao


> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-04-28 13:35         ` Keith Busch
@ 2018-04-28 21:39           ` Ming Lei
  -1 siblings, 0 replies; 72+ messages in thread
From: Ming Lei @ 2018-04-28 21:39 UTC (permalink / raw)
  To: Keith Busch
  Cc: Ming Lei, Jens Axboe, linux-block, Sagi Grimberg, linux-nvme,
	Keith Busch, Jianchao Wang, Christoph Hellwig

On Sat, Apr 28, 2018 at 9:35 PM, Keith Busch
<keith.busch@linux.intel.com> wrote:
> On Sat, Apr 28, 2018 at 11:50:17AM +0800, Ming Lei wrote:
>> > I understand how the problems are happening a bit better now. It used
>> > to be that blk-mq would lock an expired command one at a time, so when
>> > we had a batch of IO timeouts, the driver was able to complete all of
>> > them inside a single IO timeout handler.
>> >
>> > That's not the case anymore, so the driver is called for every IO
>> > timeout even though if it reaped all the commands at once.
>>
>> Actually there isn't the case before, even for legacy path, one .timeout()
>> handles one request only.
>
> That's not quite what I was talking about.
>
> Before, only the command that was about to be sent to the driver's
> .timeout() was marked completed. The driver could (and did) compete
> other timed out commands in a single .timeout(), and the tag would
> clear, so we could hanlde all timeouts in a single .timeout().
>
> Now, blk-mq marks all timed out commands as aborted prior to calling
> the driver's .timeout(). If the driver completes any of those commands,
> the tag does not clear, so the driver's .timeout() just gets to be called
> again for commands it already reaped.

That won't happen, because the new timeout model marks the timed-out
requests as aborted first, then runs synchronize_rcu() before treating
these requests as really expired, and the RCU read lock is now held in
the normal completion handler (blk_mq_complete_request).

Yes, Bart is working towards that way, but there is still the same race
between the timeout handler (nvme_dev_disable()) and reset_work(), and
nothing changes wrt the timeout model:

- reset may take a while to complete because of nvme_wait_freeze(), and
a timeout can happen during resetting, so the reset may hang forever.
Even without nvme_wait_freeze(), it is in theory possible for a timeout
to happen during the reset work too.

Actually, for non-shutdown it isn't necessary to freeze the queues at all;
it is enough to just quiesce the queues to make the hardware happy for
recovery, and that has been part of my V2 patchset.

But it is really simple and clean to run the recovery (nvme_dev_disable
and reset_work) in one single context to avoid this race, and the two
parts should be done together to make our lives easier; that is why I
was trying to work in this direction.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-04-28 14:00             ` jianchao.wang
@ 2018-04-28 21:57               ` Ming Lei
  -1 siblings, 0 replies; 72+ messages in thread
From: Ming Lei @ 2018-04-28 21:57 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Ming Lei, Jens Axboe, linux-block, Sagi Grimberg, linux-nvme,
	Keith Busch, Christoph Hellwig

On Sat, Apr 28, 2018 at 10:00 PM, jianchao.wang
<jianchao.w.wang@oracle.com> wrote:
> Hi ming
>
> On 04/27/2018 10:57 PM, Ming Lei wrote:
>> I may not understand your point, once blk_sync_queue() returns, the
>> timer itself is deactivated, meantime the synced .nvme_timeout() only
>> returns EH_NOT_HANDLED before the deactivation.
>>
>> That means this timer won't be expired any more, so could you explain
>> a bit why timeout can come again after blk_sync_queue() returns
>
> Please consider the following case:
>
> blk_sync_queue
>   -> del_timer_sync
>                           blk_mq_timeout_work
>                             -> blk_mq_check_expired // return the timeout value
>                             -> blk_mq_terminate_expired
>                               -> .timeout //return EH_NOT_HANDLED
>                             -> mod_timer // setup the timer again based on the result of blk_mq_check_expired
>   -> cancel_work_sync
> So after the blk_sync_queue, the timer may come back again, then the timeout work.

OK, I was trying to avoid using blk_abort_request(), but it looks like we
may have to depend on it or something similar.

BTW, that means blk_sync_queue() has been broken, even for its uses
in blk_cleanup_queue().

Another approach is to introduce a percpu_ref, 'q->timeout_usage_counter',
for syncing timeouts; it seems a bit over-kill too, but is simpler in both
theory and implementation.
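
For illustration, a rough sketch of that idea (the field name comes from
above; nothing like this exists in the block layer today):

static void blk_mq_timeout_work(struct work_struct *work)
{
	struct request_queue *q =
		container_of(work, struct request_queue, timeout_work);

	if (!percpu_ref_tryget_live(&q->timeout_usage_counter))
		return;		/* syncing: no expiry scan, no mod_timer() */

	/* the normal expired-request handling runs here */

	percpu_ref_put(&q->timeout_usage_counter);
}

static void blk_sync_timeout(struct request_queue *q)
{
	percpu_ref_kill(&q->timeout_usage_counter); /* new tryget_live fails */
	del_timer_sync(&q->timeout);
	cancel_work_sync(&q->timeout_work);
	/* a completion fired from the ref's release callback would tell
	 * us when any work that did get a reference has fully drained */
}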

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-04-28 21:57               ` Ming Lei
@ 2018-04-28 22:27                 ` Ming Lei
  -1 siblings, 0 replies; 72+ messages in thread
From: Ming Lei @ 2018-04-28 22:27 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Ming Lei, Jens Axboe, linux-block, Sagi Grimberg, linux-nvme,
	Keith Busch, Christoph Hellwig

On Sun, Apr 29, 2018 at 5:57 AM, Ming Lei <tom.leiming@gmail.com> wrote:
> On Sat, Apr 28, 2018 at 10:00 PM, jianchao.wang
> <jianchao.w.wang@oracle.com> wrote:
>> Hi ming
>>
>> On 04/27/2018 10:57 PM, Ming Lei wrote:
>>> I may not understand your point, once blk_sync_queue() returns, the
>>> timer itself is deactivated, meantime the synced .nvme_timeout() only
>>> returns EH_NOT_HANDLED before the deactivation.
>>>
>>> That means this timer won't be expired any more, so could you explain
>>> a bit why timeout can come again after blk_sync_queue() returns
>>
>> Please consider the following case:
>>
>> blk_sync_queue
>>   -> del_timer_sync
>>                           blk_mq_timeout_work
>>                             -> blk_mq_check_expired // return the timeout value
>>                             -> blk_mq_terminate_expired
>>                               -> .timeout //return EH_NOT_HANDLED
>>                             -> mod_timer // setup the timer again based on the result of blk_mq_check_expired
>>   -> cancel_work_sync
>> So after the blk_sync_queue, the timer may come back again, then the timeout work.
>
> OK, I was trying to avoid to use blk_abort_request(), but looks we
> may have to depend on it or similar way.
>
> BTW, that means blk_sync_queue() has been broken, even though the uses
> in blk_cleanup_queue().
>
> Another approach is to introduce one perpcu_ref of
> 'q->timeout_usage_counter' for
> syncing timeout, seems a bit over-kill too, but simpler in both theory
> and implement.

Or one timeout_mutex would be enough.


Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-04-28 22:27                 ` Ming Lei
@ 2018-04-29  1:36                   ` Ming Lei
  -1 siblings, 0 replies; 72+ messages in thread
From: Ming Lei @ 2018-04-29  1:36 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Ming Lei, Jens Axboe, linux-block, Sagi Grimberg, linux-nvme,
	Keith Busch, Christoph Hellwig

On Sun, Apr 29, 2018 at 6:27 AM, Ming Lei <tom.leiming@gmail.com> wrote:
> On Sun, Apr 29, 2018 at 5:57 AM, Ming Lei <tom.leiming@gmail.com> wrote:
>> On Sat, Apr 28, 2018 at 10:00 PM, jianchao.wang
>> <jianchao.w.wang@oracle.com> wrote:
>>> Hi ming
>>>
>>> On 04/27/2018 10:57 PM, Ming Lei wrote:
>>>> I may not understand your point, once blk_sync_queue() returns, the
>>>> timer itself is deactivated, meantime the synced .nvme_timeout() only
>>>> returns EH_NOT_HANDLED before the deactivation.
>>>>
>>>> That means this timer won't be expired any more, so could you explain
>>>> a bit why timeout can come again after blk_sync_queue() returns
>>>
>>> Please consider the following case:
>>>
>>> blk_sync_queue
>>>   -> del_timer_sync
>>>                           blk_mq_timeout_work
>>>                             -> blk_mq_check_expired // return the timeout value
>>>                             -> blk_mq_terminate_expired
>>>                               -> .timeout //return EH_NOT_HANDLED
>>>                             -> mod_timer // setup the timer again based on the result of blk_mq_check_expired
>>>   -> cancel_work_sync
>>> So after the blk_sync_queue, the timer may come back again, then the timeout work.
>>
>> OK, I was trying to avoid to use blk_abort_request(), but looks we
>> may have to depend on it or similar way.
>>
>> BTW, that means blk_sync_queue() has been broken, even though the uses
>> in blk_cleanup_queue().
>>
>> Another approach is to introduce one perpcu_ref of
>> 'q->timeout_usage_counter' for
>> syncing timeout, seems a bit over-kill too, but simpler in both theory
>> and implement.
>
> Or one timout_mutex is enough.

Turns out it is SRCU.

-- 
Ming Lei

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-04-29  1:36                   ` Ming Lei
@ 2018-04-29  2:21                     ` jianchao.wang
  -1 siblings, 0 replies; 72+ messages in thread
From: jianchao.wang @ 2018-04-29  2:21 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, linux-nvme, Ming Lei,
	linux-block, Christoph Hellwig

Hi ming

On 04/29/2018 09:36 AM, Ming Lei wrote:
> On Sun, Apr 29, 2018 at 6:27 AM, Ming Lei <tom.leiming@gmail.com> wrote:
>> On Sun, Apr 29, 2018 at 5:57 AM, Ming Lei <tom.leiming@gmail.com> wrote:
>>> On Sat, Apr 28, 2018 at 10:00 PM, jianchao.wang
>>> <jianchao.w.wang@oracle.com> wrote:
>>>> Hi ming
>>>>
>>>> On 04/27/2018 10:57 PM, Ming Lei wrote:
>>>>> I may not understand your point, once blk_sync_queue() returns, the
>>>>> timer itself is deactivated, meantime the synced .nvme_timeout() only
>>>>> returns EH_NOT_HANDLED before the deactivation.
>>>>>
>>>>> That means this timer won't be expired any more, so could you explain
>>>>> a bit why timeout can come again after blk_sync_queue() returns
>>>>
>>>> Please consider the following case:
>>>>
>>>> blk_sync_queue
>>>>   -> del_timer_sync
>>>>                           blk_mq_timeout_work
>>>>                             -> blk_mq_check_expired // return the timeout value
>>>>                             -> blk_mq_terminate_expired
>>>>                               -> .timeout //return EH_NOT_HANDLED
>>>>                             -> mod_timer // setup the timer again based on the result of blk_mq_check_expired
>>>>   -> cancel_work_sync
>>>> So after the blk_sync_queue, the timer may come back again, then the timeout work.
>>>
>>> OK, I was trying to avoid using blk_abort_request(), but it looks like
>>> we may have to depend on it or on a similar approach.
>>>
>>> BTW, that means blk_sync_queue() has been broken, even for its use
>>> in blk_cleanup_queue().
>>>
>>> Another approach is to introduce one percpu_ref,
>>> 'q->timeout_usage_counter', for
>>> syncing timeout; it seems a bit of overkill too, but is simpler in both
>>> theory and implementation.
>>
>> Or one timeout_mutex is enough.
> 
> Turns out it is SRCU.
> 
After splitting the timeout path into two parts, the timer and the workqueue, if we don't drain the in-flight requests, the request_queue->timeout timer and the timeout work look like a 'chicken and egg' issue.
How about introducing a flag to disable triggering of the timeout work?
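
Just as an illustration, something along these lines (the flag and helper
names below are made up, not from any posted patch):

	/* in blk_mq_timeout_work() */
	if (blk_queue_timeout_quiesced(q))
		return;		/* neither run handlers nor rearm q->timeout */

	/* around the driver's recovery path */
	blk_queue_flag_set(QUEUE_FLAG_TIMEOUT_QUIESCED, q);
	del_timer_sync(&q->timeout);
	cancel_work_sync(&q->timeout_work);
	/* ... disable the controller and reap in-flight requests ... */
	blk_queue_flag_clear(QUEUE_FLAG_TIMEOUT_QUIESCED, q);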


Thanks
Jianchao

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 1/2] nvme: pci: simplify timeout handling
@ 2018-04-29  2:21                     ` jianchao.wang
  0 siblings, 0 replies; 72+ messages in thread
From: jianchao.wang @ 2018-04-29  2:21 UTC (permalink / raw)


Hi ming

On 04/29/2018 09:36 AM, Ming Lei wrote:
> On Sun, Apr 29, 2018@6:27 AM, Ming Lei <tom.leiming@gmail.com> wrote:
>> On Sun, Apr 29, 2018@5:57 AM, Ming Lei <tom.leiming@gmail.com> wrote:
>>> On Sat, Apr 28, 2018 at 10:00 PM, jianchao.wang
>>> <jianchao.w.wang@oracle.com> wrote:
>>>> Hi ming
>>>>
>>>> On 04/27/2018 10:57 PM, Ming Lei wrote:
>>>>> I may not understand your point, once blk_sync_queue() returns, the
>>>>> timer itself is deactivated, meantime the synced .nvme_timeout() only
>>>>> returns EH_NOT_HANDLED before the deactivation.
>>>>>
>>>>> That means this timer won't be expired any more, so could you explain
>>>>> a bit why timeout can come again after blk_sync_queue() returns
>>>>
>>>> Please consider the following case:
>>>>
>>>> blk_sync_queue
>>>>   -> del_timer_sync
>>>>                           blk_mq_timeout_work
>>>>                             -> blk_mq_check_expired // return the timeout value
>>>>                             -> blk_mq_terminate_expired
>>>>                               -> .timeout //return EH_NOT_HANDLED
>>>>                             -> mod_timer // setup the timer again based on the result of blk_mq_check_expired
>>>>   -> cancel_work_sync
>>>> So after the blk_sync_queue, the timer may come back again, then the timeout work.
>>>
>>> OK, I was trying to avoid using blk_abort_request(), but it looks like
>>> we may have to depend on it or on a similar approach.
>>>
>>> BTW, that means blk_sync_queue() has been broken, even for its use
>>> in blk_cleanup_queue().
>>>
>>> Another approach is to introduce one percpu_ref,
>>> 'q->timeout_usage_counter', for
>>> syncing timeout; it seems a bit of overkill too, but is simpler in both
>>> theory and implementation.
>>
>> Or one timeout_mutex is enough.
> 
> Turns out it is SRCU.
> 
After splitting the timeout path into two parts, the timer and the workqueue, if we don't drain the in-flight requests, the request_queue->timeout timer and the timeout work look like a 'chicken and egg' issue.
How about introducing a flag to disable triggering of the timeout work?


Thanks
Jianchao

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-04-29  2:21                     ` jianchao.wang
@ 2018-04-29 14:13                       ` Ming Lei
  -1 siblings, 0 replies; 72+ messages in thread
From: Ming Lei @ 2018-04-29 14:13 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, linux-nvme, Ming Lei,
	linux-block, Christoph Hellwig

On Sun, Apr 29, 2018 at 10:21 AM, jianchao.wang
<jianchao.w.wang@oracle.com> wrote:
> Hi ming
>
> On 04/29/2018 09:36 AM, Ming Lei wrote:
>> On Sun, Apr 29, 2018 at 6:27 AM, Ming Lei <tom.leiming@gmail.com> wrote:
>>> On Sun, Apr 29, 2018 at 5:57 AM, Ming Lei <tom.leiming@gmail.com> wrote:
>>>> On Sat, Apr 28, 2018 at 10:00 PM, jianchao.wang
>>>> <jianchao.w.wang@oracle.com> wrote:
>>>>> Hi ming
>>>>>
>>>>> On 04/27/2018 10:57 PM, Ming Lei wrote:
>>>>>> I may not understand your point, once blk_sync_queue() returns, the
>>>>>> timer itself is deactivated, meantime the synced .nvme_timeout() only
>>>>>> returns EH_NOT_HANDLED before the deactivation.
>>>>>>
>>>>>> That means this timer won't be expired any more, so could you explain
>>>>>> a bit why timeout can come again after blk_sync_queue() returns
>>>>>
>>>>> Please consider the following case:
>>>>>
>>>>> blk_sync_queue
>>>>>   -> del_timer_sync
>>>>>                           blk_mq_timeout_work
>>>>>                             -> blk_mq_check_expired // return the timeout value
>>>>>                             -> blk_mq_terminate_expired
>>>>>                               -> .timeout //return EH_NOT_HANDLED
>>>>>                             -> mod_timer // setup the timer again based on the result of blk_mq_check_expired
>>>>>   -> cancel_work_sync
>>>>> So after the blk_sync_queue, the timer may come back again, then the timeout work.
>>>>
>>>> OK, I was trying to avoid using blk_abort_request(), but it looks like
>>>> we may have to depend on it or on a similar approach.
>>>>
>>>> BTW, that means blk_sync_queue() has been broken, even for its use
>>>> in blk_cleanup_queue().
>>>>
>>>> Another approach is to introduce one percpu_ref,
>>>> 'q->timeout_usage_counter', for
>>>> syncing timeout; it seems a bit of overkill too, but is simpler in both
>>>> theory and implementation.
>>>
>>> Or one timeout_mutex is enough.
>>
>> Turns out it is SRCU.
>>
> After splitting the timeout path into two parts, the timer and the workqueue, if we don't drain the in-flight requests, the request_queue->timeout timer and the timeout work look like a 'chicken and egg' issue.
> How about introducing a flag to disable triggering of the timeout work?

Yes, that is the correct approach; no matter what kind of sync is used, the
flag is inevitable.

The timeout sync issue is fixed in this way; now I am thinking about the race
related to freezing the queue. Either freezing needs to be removed for the
non-shutdown case, or one flag has to be introduced to avoid the race.
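
For the first option, the rough idea in nvme_dev_disable() would be
something like this (only a sketch under that assumption, not the actual
patch):

	static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
	{
		/*
		 * Only a clean shutdown needs to drain I/O; for recovery it
		 * is enough to quiesce, so a racing reset can't get stuck
		 * waiting for the freeze to complete.
		 */
		if (shutdown)
			nvme_start_freeze(&dev->ctrl);

		nvme_stop_queues(&dev->ctrl);	/* quiesce only */

		/* ... delete HW queues, disable the controller, reap requests ... */
	}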


Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 1/2] nvme: pci: simplify timeout handling
@ 2018-04-29 14:13                       ` Ming Lei
  0 siblings, 0 replies; 72+ messages in thread
From: Ming Lei @ 2018-04-29 14:13 UTC (permalink / raw)


On Sun, Apr 29, 2018 at 10:21 AM, jianchao.wang
<jianchao.w.wang@oracle.com> wrote:
> Hi ming
>
> On 04/29/2018 09:36 AM, Ming Lei wrote:
>> On Sun, Apr 29, 2018@6:27 AM, Ming Lei <tom.leiming@gmail.com> wrote:
>>> On Sun, Apr 29, 2018@5:57 AM, Ming Lei <tom.leiming@gmail.com> wrote:
>>>> On Sat, Apr 28, 2018 at 10:00 PM, jianchao.wang
>>>> <jianchao.w.wang@oracle.com> wrote:
>>>>> Hi ming
>>>>>
>>>>> On 04/27/2018 10:57 PM, Ming Lei wrote:
>>>>>> I may not understand your point, once blk_sync_queue() returns, the
>>>>>> timer itself is deactivated, meantime the synced .nvme_timeout() only
>>>>>> returns EH_NOT_HANDLED before the deactivation.
>>>>>>
>>>>>> That means this timer won't be expired any more, so could you explain
>>>>>> a bit why timeout can come again after blk_sync_queue() returns
>>>>>
>>>>> Please consider the following case:
>>>>>
>>>>> blk_sync_queue
>>>>>   -> del_timer_sync
>>>>>                           blk_mq_timeout_work
>>>>>                             -> blk_mq_check_expired // return the timeout value
>>>>>                             -> blk_mq_terminate_expired
>>>>>                               -> .timeout //return EH_NOT_HANDLED
>>>>>                             -> mod_timer // setup the timer again based on the result of blk_mq_check_expired
>>>>>   -> cancel_work_sync
>>>>> So after the blk_sync_queue, the timer may come back again, then the timeout work.
>>>>
>>>> OK, I was trying to avoid using blk_abort_request(), but it looks like
>>>> we may have to depend on it or on a similar approach.
>>>>
>>>> BTW, that means blk_sync_queue() has been broken, even for its use
>>>> in blk_cleanup_queue().
>>>>
>>>> Another approach is to introduce one percpu_ref,
>>>> 'q->timeout_usage_counter', for
>>>> syncing timeout; it seems a bit of overkill too, but is simpler in both
>>>> theory and implementation.
>>>
>>> Or one timeout_mutex is enough.
>>
>> Turns out it is SRCU.
>>
> After splitting the timeout path into two parts, the timer and the workqueue, if we don't drain the in-flight requests, the request_queue->timeout timer and the timeout work look like a 'chicken and egg' issue.
> How about introducing a flag to disable triggering of the timeout work?

Yes, that is the correct approach; no matter what kind of sync is used, the
flag is inevitable.

The timeout sync issue is fixed in this way; now I am thinking about the race
related to freezing the queue. Either freezing needs to be removed for the
non-shutdown case, or one flag has to be introduced to avoid the race.


Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-04-28 21:39           ` Ming Lei
@ 2018-04-30 19:52             ` Keith Busch
  -1 siblings, 0 replies; 72+ messages in thread
From: Keith Busch @ 2018-04-30 19:52 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, Jens Axboe, linux-block, Sagi Grimberg, linux-nvme,
	Keith Busch, Jianchao Wang, Christoph Hellwig

On Sun, Apr 29, 2018 at 05:39:52AM +0800, Ming Lei wrote:
> On Sat, Apr 28, 2018 at 9:35 PM, Keith Busch
> <keith.busch@linux.intel.com> wrote:
> > On Sat, Apr 28, 2018 at 11:50:17AM +0800, Ming Lei wrote:
> >> > I understand how the problems are happening a bit better now. It used
> >> > to be that blk-mq would lock an expired command one at a time, so when
> >> > we had a batch of IO timeouts, the driver was able to complete all of
> >> > them inside a single IO timeout handler.
> >> >
> >> > That's not the case anymore, so the driver is called for every IO
> >> > timeout even if it reaped all the commands at once.
> >>
> >> Actually there isn't the case before, even for legacy path, one .timeout()
> >> handles one request only.
> >
> > That's not quite what I was talking about.
> >
> > Before, only the command that was about to be sent to the driver's
> > .timeout() was marked completed. The driver could (and did) complete
> > other timed out commands in a single .timeout(), and the tag would
> > clear, so we could handle all timeouts in a single .timeout().
> >
> > Now, blk-mq marks all timed out commands as aborted prior to calling
> > the driver's .timeout(). If the driver completes any of those commands,
> > the tag does not clear, so the driver's .timeout() just gets to be called
> > again for commands it already reaped.
> 
> That won't happen because new timeout model will mark aborted on timed-out
> request first, then run synchronize_rcu() before making these requests
> really expired, and now rcu lock is held in normal completion
> handler(blk_mq_complete_request).
> 
> Yes, Bart is working towards that way, but there is still the same race
> between timeout handler(nvme_dev_disable()) and reset_work(), and nothing
> changes wrt. the timeout model:

Yeah, the driver makes sure there are no possible outstanding commands at
the end of nvme_dev_disable. This should mean there's no timeout handler
running because there's no possible commands for that handler. But that's
not really the case anymore, so we had been inadvertently depending on
that behavior.

> - reset may take a while to complete because of nvme_wait_freeze(), and
> timeout can happen during resetting, then reset may hang forever. Even
> without nvme_wait_freeze(), it is possible for timeout to happen during
> reset work too in theory.
>
> Actually for non-shutdown, it isn't necessary to freeze queue at all, and it
> is enough to just quiesce queues to make hardware happy for recovery,
> that has been part of my V2 patchset.

When we freeze, we prevent IOs from entering contexts that may not be
valid on the other side of the reset. It's not very common for the
context count to change, but it can happen.

Anyway, will take a look at your series and catch up on the notes from
you and Jianchao.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 1/2] nvme: pci: simplify timeout handling
@ 2018-04-30 19:52             ` Keith Busch
  0 siblings, 0 replies; 72+ messages in thread
From: Keith Busch @ 2018-04-30 19:52 UTC (permalink / raw)


On Sun, Apr 29, 2018@05:39:52AM +0800, Ming Lei wrote:
> On Sat, Apr 28, 2018 at 9:35 PM, Keith Busch
> <keith.busch@linux.intel.com> wrote:
> > On Sat, Apr 28, 2018@11:50:17AM +0800, Ming Lei wrote:
> >> > I understand how the problems are happening a bit better now. It used
> >> > to be that blk-mq would lock an expired command one at a time, so when
> >> > we had a batch of IO timeouts, the driver was able to complete all of
> >> > them inside a single IO timeout handler.
> >> >
> >> > That's not the case anymore, so the driver is called for every IO
> >> > timeout even if it reaped all the commands at once.
> >>
> >> Actually there isn't the case before, even for legacy path, one .timeout()
> >> handles one request only.
> >
> > That's not quite what I was talking about.
> >
> > Before, only the command that was about to be sent to the driver's
> > .timeout() was marked completed. The driver could (and did) complete
> > other timed out commands in a single .timeout(), and the tag would
> > clear, so we could handle all timeouts in a single .timeout().
> >
> > Now, blk-mq marks all timed out commands as aborted prior to calling
> > the driver's .timeout(). If the driver completes any of those commands,
> > the tag does not clear, so the driver's .timeout() just gets to be called
> > again for commands it already reaped.
> 
> That won't happen because new timeout model will mark aborted on timed-out
> request first, then run synchronize_rcu() before making these requests
> really expired, and now rcu lock is held in normal completion
> handler(blk_mq_complete_request).
> 
> Yes, Bart is working towards that way, but there is still the same race
> between timeout handler(nvme_dev_disable()) and reset_work(), and nothing
> changes wrt. the timeout model:

Yeah, the driver makes sure there are no possible outstanding commands at
the end of nvme_dev_disable. This should mean there's no timeout handler
running because there's no possible commands for that handler. But that's
not really the case anymore, so we had been inadvertently depending on
that behavior.

> - reset may take a while to complete because of nvme_wait_freeze(), and
> timeout can happen during resetting, then reset may hang forever. Even
> without nvme_wait_freeze(), it is possible for timeout to happen during
> reset work too in theory.
>
> Actually for non-shutdown, it isn't necessary to freeze queue at all, and it
> is enough to just quiesce queues to make hardware happy for recovery,
> that has been part of my V2 patchset.

When we freeze, we prevent IOs from entering contexts that may not be
valid on the other side of the reset. It's not very common for the
context count to change, but it can happen.

Anyway, will take a look at your series and catch up on the notes from
you and Jianchao.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-04-30 19:52             ` Keith Busch
@ 2018-04-30 23:14               ` Ming Lei
  -1 siblings, 0 replies; 72+ messages in thread
From: Ming Lei @ 2018-04-30 23:14 UTC (permalink / raw)
  To: Keith Busch
  Cc: Ming Lei, Jens Axboe, linux-block, Sagi Grimberg, linux-nvme,
	Keith Busch, Jianchao Wang, Christoph Hellwig

On Mon, Apr 30, 2018 at 01:52:17PM -0600, Keith Busch wrote:
> On Sun, Apr 29, 2018 at 05:39:52AM +0800, Ming Lei wrote:
> > On Sat, Apr 28, 2018 at 9:35 PM, Keith Busch
> > <keith.busch@linux.intel.com> wrote:
> > > On Sat, Apr 28, 2018 at 11:50:17AM +0800, Ming Lei wrote:
> > >> > I understand how the problems are happening a bit better now. It used
> > >> > to be that blk-mq would lock an expired command one at a time, so when
> > >> > we had a batch of IO timeouts, the driver was able to complete all of
> > >> > them inside a single IO timeout handler.
> > >> >
> > >> > That's not the case anymore, so the driver is called for every IO
> > >> > timeout even if it reaped all the commands at once.
> > >>
> > >> Actually there isn't the case before, even for legacy path, one .timeout()
> > >> handles one request only.
> > >
> > > That's not quite what I was talking about.
> > >
> > > Before, only the command that was about to be sent to the driver's
> > > .timeout() was marked completed. The driver could (and did) complete
> > > other timed out commands in a single .timeout(), and the tag would
> > > clear, so we could handle all timeouts in a single .timeout().
> > >
> > > Now, blk-mq marks all timed out commands as aborted prior to calling
> > > the driver's .timeout(). If the driver completes any of those commands,
> > > the tag does not clear, so the driver's .timeout() just gets to be called
> > > again for commands it already reaped.
> > 
> > That won't happen because new timeout model will mark aborted on timed-out
> > request first, then run synchronize_rcu() before making these requests
> > really expired, and now rcu lock is held in normal completion
> > handler(blk_mq_complete_request).
> > 
> > Yes, Bart is working towards that way, but there is still the same race
> > between timeout handler(nvme_dev_disable()) and reset_work(), and nothing
> > changes wrt. the timeout model:
> 
> Yeah, the driver makes sure there are no possible outstanding commands at
> the end of nvme_dev_disable. This should mean there's no timeout handler
> running because there's no possible commands for that handler. But that's
> not really the case anymore, so we had been inadvertently depending on
> that behavior.

I guess we may not depend on that behavior, because the timeout work
is per request queue (per namespace), and timeouts from all namespace/admin
queues may happen at the same time; meanwhile each .timeout() may run
at a different time because of scheduling delay, and one of them may
cause nvme_dev_disable() to be called during resetting, not to mention the
case of a timeout triggered by reset_work() itself.

That means we may have to drain timeouts too even after Bart's patch is merged.

In short, there are several issues wrt. NVMe recovery:

1) timeout may be triggered in reset_work() by draining IO in
wait_freeze()

2) timeout still may be triggered by other queue, and nvme_dev_disable()
may be called during resetting which is scheduled by other queue's timeout

In both 1) and 2), queues can be quiesced and wait_freeze() in reset_work()
may never complete, then controller can't be recovered at all.

3) race related with start_freeze & unfreeze()


And it may be fixed by changing the model into the following two parts:

1) recovering controller:
- freeze queues
- nvme_dev_disable()
- resetting & setting up queues

2) post-reset or post-recovery
- wait for freezing & unfreezing

And make sure the #1 can always go on for recovering controller even
though that #2 is blocked by timeout.

If freezing can be removed, the #2 may not be necessary, but it may
cause more requests to be handled during recovering hardware, so it
is still reasonable to keep freezing as before.
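
Expressed as code, the intended split is roughly the following (a
structural sketch only, not code from the patchset):

	static void nvme_reset_work(struct work_struct *work)
	{
		struct nvme_dev *dev =
			container_of(work, struct nvme_dev, ctrl.reset_work);

		/* part 1: recover the controller; must always make progress */
		nvme_start_freeze(&dev->ctrl);
		nvme_dev_disable(dev, false);
		/* ... reset the controller, set up admin and IO queues ... */

		/* part 2: post-recovery; may block on timed-out requests */
		nvme_wait_freeze(&dev->ctrl);
		nvme_unfreeze(&dev->ctrl);
	}

Even if part 2 blocks because of a new timeout, part 1 can still be run
again by the next recovery.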

> 
> > - reset may take a while to complete because of nvme_wait_freeze(), and
> > timeout can happen during resetting, then reset may hang forever. Even
> > without nvme_wait_freeze(), it is possible for timeout to happen during
> > reset work too in theory.
> >
> > Actually for non-shutdown, it isn't necessary to freeze queue at all, and it
> > is enough to just quiesce queues to make hardware happy for recovery,
> > that has been part of my V2 patchset.
> 
> When we freeze, we prevent IOs from entering contexts that may not be
> valid on the other side of the reset. It's not very common for the
> context count to change, but it can happen.
> 
> Anyway, will take a look at your series and catch up on the notes from
> you and Jianchao.

The V2 has been posted; freezing isn't removed there, but is moved to
post-reset.

The main approach should be fine in V2, but there are still issues in it
(the change may break resets from other contexts, such as PCI reset, or
the freezing caused by update_nr_hw_queues), and the implementation can
be made simpler by just partitioning the reset work into two parts.

I am working on V3, but any comments are welcome on V2, especially about
the approach taken.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 1/2] nvme: pci: simplify timeout handling
@ 2018-04-30 23:14               ` Ming Lei
  0 siblings, 0 replies; 72+ messages in thread
From: Ming Lei @ 2018-04-30 23:14 UTC (permalink / raw)


On Mon, Apr 30, 2018@01:52:17PM -0600, Keith Busch wrote:
> On Sun, Apr 29, 2018@05:39:52AM +0800, Ming Lei wrote:
> > On Sat, Apr 28, 2018 at 9:35 PM, Keith Busch
> > <keith.busch@linux.intel.com> wrote:
> > > On Sat, Apr 28, 2018@11:50:17AM +0800, Ming Lei wrote:
> > >> > I understand how the problems are happening a bit better now. It used
> > >> > to be that blk-mq would lock an expired command one at a time, so when
> > >> > we had a batch of IO timeouts, the driver was able to complete all of
> > >> > them inside a single IO timeout handler.
> > >> >
> > >> > That's not the case anymore, so the driver is called for every IO
> > >> > timeout even if it reaped all the commands at once.
> > >>
> > >> Actually there isn't the case before, even for legacy path, one .timeout()
> > >> handles one request only.
> > >
> > > That's not quite what I was talking about.
> > >
> > > Before, only the command that was about to be sent to the driver's
> > > .timeout() was marked completed. The driver could (and did) complete
> > > other timed out commands in a single .timeout(), and the tag would
> > > clear, so we could handle all timeouts in a single .timeout().
> > >
> > > Now, blk-mq marks all timed out commands as aborted prior to calling
> > > the driver's .timeout(). If the driver completes any of those commands,
> > > the tag does not clear, so the driver's .timeout() just gets to be called
> > > again for commands it already reaped.
> > 
> > That won't happen because new timeout model will mark aborted on timed-out
> > request first, then run synchronize_rcu() before making these requests
> > really expired, and now rcu lock is held in normal completion
> > handler(blk_mq_complete_request).
> > 
> > Yes, Bart is working towards that way, but there is still the same race
> > between timeout handler(nvme_dev_disable()) and reset_work(), and nothing
> > changes wrt. the timeout model:
> 
> Yeah, the driver makes sure there are no possible outstanding commands at
> the end of nvme_dev_disable. This should mean there's no timeout handler
> running because there's no possible commands for that handler. But that's
> not really the case anymore, so we had been inadvertently depending on
> that behavior.

I guess we may not depend on that behavior, because the timeout work
is per request queue (per namespace), and timeouts from all namespace/admin
queues may happen at the same time; meanwhile each .timeout() may run
at a different time because of scheduling delay, and one of them may
cause nvme_dev_disable() to be called during resetting, not to mention the
case of a timeout triggered by reset_work() itself.

That means we may have to drain timeouts too even after Bart's patch is merged.

In short, there are several issues wrt. NVMe recovery:

1) timeout may be triggered in reset_work() by draining IO in
wait_freeze()

2) timeout still may be triggered by other queue, and nvme_dev_disable()
may be called during resetting which is scheduled by other queue's timeout

In both 1) and 2), queues can be quiesced and wait_freeze() in reset_work()
may never complete, then controller can't be recovered at all.

3) race related with start_freeze & unfreeze()


And it may be fixed by changing the model into the following two parts:

1) recovering controller:
- freeze queues
- nvme_dev_disable()
- resetting & setting up queues

2) post-reset or post-recovery
- wait for freezing & unfreezing

And make sure the #1 can always go on for recovering controller even
though that #2 is blocked by timeout.

If freezing can be removed, the #2 may not be necessary, but it may
cause more requests to be handled during recovering hardware, so it
is still reasonable to keep freezing as before.

> 
> > - reset may take a while to complete because of nvme_wait_freeze(), and
> > timeout can happen during resetting, then reset may hang forever. Even
> > without nvme_wait_freeze(), it is possible for timeout to happen during
> > reset work too in theory.
> >
> > Actually for non-shutdown, it isn't necessary to freeze queue at all, and it
> > is enough to just quiesce queues to make hardware happy for recovery,
> > that has been part of my V2 patchset.
> 
> When we freeze, we prevent IOs from entering contexts that may not be
> valid on the other side of the reset. It's not very common for the
> context count to change, but it can happen.
> 
> Anyway, will take a look at your series and catch up on the notes from
> you and Jianchao.

The V2 has been posted; freezing isn't removed there, but is moved to
post-reset.

The main approach should be fine in V2, but there are still issues in it
(the change may break resets from other contexts, such as PCI reset, or
the freezing caused by update_nr_hw_queues), and the implementation can
be made simpler by just partitioning the reset work into two parts.

I am working on V3, but any comments are welcome on V2, especially about
the approach taken.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-04-28  3:50       ` Ming Lei
@ 2018-05-08 15:30         ` Keith Busch
  -1 siblings, 0 replies; 72+ messages in thread
From: Keith Busch @ 2018-05-08 15:30 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Jianchao Wang, Christoph Hellwig,
	Sagi Grimberg, linux-nvme

On Sat, Apr 28, 2018 at 11:50:17AM +0800, Ming Lei wrote:
> This sync may be raced with one timed-out request, which may be handled
> as BLK_EH_HANDLED or BLK_EH_RESET_TIMER, so the above sync queues can't
> work reliably. 

Ming,

As proposed, that scenario is impossible to encounter. Resetting the
controller inline with the timeout reaps all the commands, and then
sets the controller state to RESETTING. While blk-mq may not allow the
driver to complete those requests, having the driver sync with the queues
will hold the controller in the reset state until blk-mq is done with
its timeout work; therefore, it is impossible for the NVMe driver to
return "BLK_EH_RESET_TIMER", and all commands will be completed through
nvme_timeout's BLK_EH_HANDLED exactly as desired.
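
In code terms the idea is roughly the check below in nvme_timeout() (a
simplified sketch of that reasoning, not the literal diff):

	switch (dev->ctrl.state) {
	case NVME_CTRL_RESETTING:
		/*
		 * The controller is already being recovered; the request
		 * will be reaped by nvme_dev_disable(), so just report it
		 * as handled.
		 */
		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
		return BLK_EH_HANDLED;
	default:
		break;
	}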

Could you please recheck my suggestion? The alternatives proposed are
far too risky for a 4.17 consideration, and I'm hoping we can stabilize
this behavior in the current release if possible.

Thanks,
Keith

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 1/2] nvme: pci: simplify timeout handling
@ 2018-05-08 15:30         ` Keith Busch
  0 siblings, 0 replies; 72+ messages in thread
From: Keith Busch @ 2018-05-08 15:30 UTC (permalink / raw)


On Sat, Apr 28, 2018@11:50:17AM +0800, Ming Lei wrote:
> This sync may be raced with one timed-out request, which may be handled
> as BLK_EH_HANDLED or BLK_EH_RESET_TIMER, so the above sync queues can't
> work reliably. 

Ming,

As proposed, that scenario is impossible to encounter. Resetting the
controller inline with the timeout reaps all the commands, and then
sets the controller state to RESETTING. While blk-mq may not allow the
driver to complete those requests, having the driver sync with the queues
will hold the controller in the reset state until blk-mq is done with
its timeout work; therefore, it is impossible for the NVMe driver to
return "BLK_EH_RESET_TIMER", and all commands will be completed through
nvme_timeout's BLK_EH_HANDLED exactly as desired.

Could you please recheck my suggestion? The alternatives proposed are
far too risky for a 4.17 consideration, and I'm hoping we can stabilize
this behavior in the current release if possible.

Thanks,
Keith

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-05-08 15:30         ` Keith Busch
@ 2018-05-10 20:52           ` Ming Lei
  -1 siblings, 0 replies; 72+ messages in thread
From: Ming Lei @ 2018-05-10 20:52 UTC (permalink / raw)
  To: Keith Busch
  Cc: Ming Lei, Jens Axboe, Sagi Grimberg, linux-nvme, linux-block,
	Jianchao Wang, Christoph Hellwig

Hi Keith,

On Tue, May 8, 2018 at 11:30 PM, Keith Busch <keith.busch@intel.com> wrote:
> On Sat, Apr 28, 2018 at 11:50:17AM +0800, Ming Lei wrote:
>> This sync may be raced with one timed-out request, which may be handled
>> as BLK_EH_HANDLED or BLK_EH_RESET_TIMER, so the above sync queues can't
>> work reliably.
>
> Ming,
>
> As proposed, that scenario is impossible to encounter. Resetting the
> controller inline with the timeout reaps all the commands, and then
> sets the controller state to RESETTING. While blk-mq may not allow the
> driver to complete those requests, having the driver sync with the queues
> will hold the controller in the reset state until blk-mq is done with
> its timeout work; therefore, it is impossible for the NVMe driver to
> return "BLK_EH_RESET_TIMER", and all commands will be completed through
> nvme_timeout's BLK_EH_HANDLED exactly as desired.

That isn't true for the multiple-namespace case: each request queue has its
own timeout work, and all of these timeout works can be triggered concurrently.

>
> Could you please recheck my suggestion? The alternatives proposed are
> far too risky for a 4.17 consideration, and I'm hoping we can stabilize
> this behavior in the current release if possible.

Bart's rework on timeout won't cover the case of multiple namespace, so
we have to sync timeout inside driver.

The approach taken in my V4 has been simpler, could you take a look at it?

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 1/2] nvme: pci: simplify timeout handling
@ 2018-05-10 20:52           ` Ming Lei
  0 siblings, 0 replies; 72+ messages in thread
From: Ming Lei @ 2018-05-10 20:52 UTC (permalink / raw)


Hi Keith,

On Tue, May 8, 2018@11:30 PM, Keith Busch <keith.busch@intel.com> wrote:
> On Sat, Apr 28, 2018@11:50:17AM +0800, Ming Lei wrote:
>> This sync may be raced with one timed-out request, which may be handled
>> as BLK_EH_HANDLED or BLK_EH_RESET_TIMER, so the above sync queues can't
>> work reliably.
>
> Ming,
>
> As proposed, that scenario is impossible to encounter. Resetting the
> controller inline with the timeout reaps all the commands, and then
> sets the controller state to RESETTING. While blk-mq may not allow the
> driver to complete those requests, having the driver sync with the queues
> will hold the controller in the reset state until blk-mq is done with
> its timeout work; therefore, it is impossible for the NVMe driver to
> return "BLK_EH_RESET_TIMER", and all commands will be completed through
> nvme_timeout's BLK_EH_HANDLED exactly as desired.

That isn't true for the multiple-namespace case: each request queue has its
own timeout work, and all of these timeout works can be triggered concurrently.

>
> Could you please recheck my suggestion? The alternatives proposed are
> far too risky for a 4.17 consideration, and I'm hoping we can stabilize
> this behavior in the current release if possible.

Bart's rework on timeout won't cover the case of multiple namespace, so
we have to sync timeout inside driver.

The approach taken in my V4 has been simpler, could you take a look at it?

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-05-10 20:52           ` Ming Lei
@ 2018-05-10 21:05             ` Keith Busch
  -1 siblings, 0 replies; 72+ messages in thread
From: Keith Busch @ 2018-05-10 21:05 UTC (permalink / raw)
  To: Ming Lei
  Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, Ming Lei,
	linux-block, Jianchao Wang, Christoph Hellwig

On Fri, May 11, 2018 at 04:52:11AM +0800, Ming Lei wrote:
> Hi Keith,
> 
> On Tue, May 8, 2018 at 11:30 PM, Keith Busch <keith.busch@intel.com> wrote:
> > On Sat, Apr 28, 2018 at 11:50:17AM +0800, Ming Lei wrote:
> >> This sync may be raced with one timed-out request, which may be handled
> >> as BLK_EH_HANDLED or BLK_EH_RESET_TIMER, so the above sync queues can't
> >> work reliably.
> >
> > Ming,
> >
> > As proposed, that scenario is impossible to encounter. Resetting the
> > controller inline with the timeout reaps all the commands, and then
> > sets the controller state to RESETTING. While blk-mq may not allow the
> > driver to complete those requests, having the driver sync with the queues
> > will hold the controller in the reset state until blk-mq is done with
> > its timeout work; therefore, it is impossible for the NVMe driver to
> > return "BLK_EH_RESET_TIMER", and all commands will be completed through
> > nvme_timeout's BLK_EH_HANDLED exactly as desired.
> 
> That isn't true for multiple namespace case,  each request queue has its
> own timeout work, and all these timeout work can be triggered concurrently.

The controller state is most certainly not per queue/namespace. It's
global to the controller. Once the reset is triggered, nvme_timeout can
only return EH_HANDLED.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 1/2] nvme: pci: simplify timeout handling
@ 2018-05-10 21:05             ` Keith Busch
  0 siblings, 0 replies; 72+ messages in thread
From: Keith Busch @ 2018-05-10 21:05 UTC (permalink / raw)


On Fri, May 11, 2018@04:52:11AM +0800, Ming Lei wrote:
> Hi Keith,
> 
> On Tue, May 8, 2018@11:30 PM, Keith Busch <keith.busch@intel.com> wrote:
> > On Sat, Apr 28, 2018@11:50:17AM +0800, Ming Lei wrote:
> >> This sync may be raced with one timed-out request, which may be handled
> >> as BLK_EH_HANDLED or BLK_EH_RESET_TIMER, so the above sync queues can't
> >> work reliably.
> >
> > Ming,
> >
> > As proposed, that scenario is impossible to encounter. Resetting the
> > controller inline with the timeout reaps all the commands, and then
> > sets the controller state to RESETTING. While blk-mq may not allow the
> > driver to complete those requests, having the driver sync with the queues
> > will hold the controller in the reset state until blk-mq is done with
> > its timeout work; therefore, it is impossible for the NVMe driver to
> > return "BLK_EH_RESET_TIMER", and all commands will be completed through
> > nvme_timeout's BLK_EH_HANDLED exactly as desired.
> 
> That isn't true for multiple namespace case,  each request queue has its
> own timeout work, and all these timeout work can be triggered concurrently.

The controller state is most certainly not per queue/namespace. It's
global to the controller. Once the reset is triggered, nvme_timeout can
only return EH_HANDLED.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-05-10 21:05             ` Keith Busch
@ 2018-05-10 21:10               ` Ming Lei
  -1 siblings, 0 replies; 72+ messages in thread
From: Ming Lei @ 2018-05-10 21:10 UTC (permalink / raw)
  To: Keith Busch
  Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, Ming Lei,
	linux-block, Jianchao Wang, Christoph Hellwig

On Fri, May 11, 2018 at 5:05 AM, Keith Busch
<keith.busch@linux.intel.com> wrote:
> On Fri, May 11, 2018 at 04:52:11AM +0800, Ming Lei wrote:
>> Hi Keith,
>>
>> On Tue, May 8, 2018 at 11:30 PM, Keith Busch <keith.busch@intel.com> wrote:
>> > On Sat, Apr 28, 2018 at 11:50:17AM +0800, Ming Lei wrote:
>> >> This sync may be raced with one timed-out request, which may be handled
>> >> as BLK_EH_HANDLED or BLK_EH_RESET_TIMER, so the above sync queues can't
>> >> work reliably.
>> >
>> > Ming,
>> >
>> > As proposed, that scenario is impossible to encounter. Resetting the
>> > controller inline with the timeout reaps all the commands, and then
>> > sets the controller state to RESETTING. While blk-mq may not allow the
>> > driver to complete those requests, having the driver sync with the queues
>> > will hold the controller in the reset state until blk-mq is done with
>> > its timeout work; therefore, it is impossible for the NVMe driver to
>> > return "BLK_EH_RESET_TIMER", and all commands will be completed through
>> > nvme_timeout's BLK_EH_HANDLED exactly as desired.
>>
>> That isn't true for multiple namespace case,  each request queue has its
>> own timeout work, and all these timeout work can be triggered concurrently.
>
> The controller state is most certainly not per queue/namespace. It's
> global to the controller. Once the reset is triggered, nvme_timeout can
> only return EH_HANDLED.

It is related with EH_HANDLED, please see the following case:

1) when req A from N1 is timed out, nvme_timeout() handles
it as EH_HANDLED: nvme_dev_disable() and reset is scheduled.

2) when req B from N2 is timed out, nvme_timeout() handles
it as EH_HANDLED, then nvme_dev_disable() is called exactly
when reset is in-progress, so queues become quiesced, and nothing
can move on in the resetting triggered by N1.

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 1/2] nvme: pci: simplify timeout handling
@ 2018-05-10 21:10               ` Ming Lei
  0 siblings, 0 replies; 72+ messages in thread
From: Ming Lei @ 2018-05-10 21:10 UTC (permalink / raw)


On Fri, May 11, 2018 at 5:05 AM, Keith Busch
<keith.busch@linux.intel.com> wrote:
> On Fri, May 11, 2018@04:52:11AM +0800, Ming Lei wrote:
>> Hi Keith,
>>
>> On Tue, May 8, 2018@11:30 PM, Keith Busch <keith.busch@intel.com> wrote:
>> > On Sat, Apr 28, 2018@11:50:17AM +0800, Ming Lei wrote:
>> >> This sync may be raced with one timed-out request, which may be handled
>> >> as BLK_EH_HANDLED or BLK_EH_RESET_TIMER, so the above sync queues can't
>> >> work reliably.
>> >
>> > Ming,
>> >
>> > As proposed, that scenario is impossible to encounter. Resetting the
>> > controller inline with the timeout reaps all the commands, and then
>> > sets the controller state to RESETTING. While blk-mq may not allow the
>> > driver to complete those requests, having the driver sync with the queues
>> > will hold the controller in the reset state until blk-mq is done with
>> > its timeout work; therefore, it is impossible for the NVMe driver to
>> > return "BLK_EH_RESET_TIMER", and all commands will be completed through
>> > nvme_timeout's BLK_EH_HANDLED exactly as desired.
>>
>> That isn't true for multiple namespace case,  each request queue has its
>> own timeout work, and all these timeout work can be triggered concurrently.
>
> The controller state is most certainly not per queue/namespace. It's
> global to the controller. Once the reset is triggered, nvme_timeout can
> only return EH_HANDLED.

It is related with EH_HANDLED, please see the following case:

1) when req A from N1 is timed out, nvme_timeout() handles
it as EH_HANDLED: nvme_dev_disable() and reset is scheduled.

2) when req B from N2 is timed out, nvme_timeout() handles
it as EH_HANDLED, then nvme_dev_disable() is called exactly
when reset is in-progress, so queues become quiesced, and nothing
can move on in the resetting triggered by N1.

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-05-10 21:10               ` Ming Lei
@ 2018-05-10 21:18                 ` Keith Busch
  -1 siblings, 0 replies; 72+ messages in thread
From: Keith Busch @ 2018-05-10 21:18 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Sagi Grimberg, linux-nvme, Ming Lei,
	Keith Busch, Jianchao Wang, Christoph Hellwig

On Fri, May 11, 2018 at 05:10:40AM +0800, Ming Lei wrote:
> On Fri, May 11, 2018 at 5:05 AM, Keith Busch
> <keith.busch@linux.intel.com> wrote:
> > On Fri, May 11, 2018 at 04:52:11AM +0800, Ming Lei wrote:
> >> Hi Keith,
> >>
> >> On Tue, May 8, 2018 at 11:30 PM, Keith Busch <keith.busch@intel.com> wrote:
> >> > On Sat, Apr 28, 2018 at 11:50:17AM +0800, Ming Lei wrote:
> >> >> This sync may be raced with one timed-out request, which may be handled
> >> >> as BLK_EH_HANDLED or BLK_EH_RESET_TIMER, so the above sync queues can't
> >> >> work reliably.
> >> >
> >> > Ming,
> >> >
> >> > As proposed, that scenario is impossible to encounter. Resetting the
> >> > controller inline with the timeout reaps all the commands, and then
> >> > sets the controller state to RESETTING. While blk-mq may not allow the
> >> > driver to complete those requests, having the driver sync with the queues
> >> > will hold the controller in the reset state until blk-mq is done with
> >> > its timeout work; therefore, it is impossible for the NVMe driver to
> >> > return "BLK_EH_RESET_TIMER", and all commands will be completed through
> >> > nvme_timeout's BLK_EH_HANDLED exactly as desired.
> >>
> >> That isn't true for multiple namespace case,  each request queue has its
> >> own timeout work, and all these timeout work can be triggered concurrently.
> >
> > The controller state is most certainly not per queue/namespace. It's
> > global to the controller. Once the reset is triggered, nvme_timeout can
> > only return EH_HANDLED.
> 
> It is related with EH_HANDLED, please see the following case:
> 
> 1) when req A from N1 is timed out, nvme_timeout() handles
> it as EH_HANDLED: nvme_dev_disable() and reset is scheduled.
> 
> 2) when req B from N2 is timed out, nvme_timeout() handles
> it as EH_HANDLED, then nvme_dev_disable() is called exactly
> when reset is in-progress, so queues become quiesced, and nothing
> can move on in the resetting triggered by N1.

Huh? The nvme_sync_queues ensures that doesn't happen. That was the
whole point.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 1/2] nvme: pci: simplify timeout handling
@ 2018-05-10 21:18                 ` Keith Busch
  0 siblings, 0 replies; 72+ messages in thread
From: Keith Busch @ 2018-05-10 21:18 UTC (permalink / raw)


On Fri, May 11, 2018@05:10:40AM +0800, Ming Lei wrote:
> On Fri, May 11, 2018 at 5:05 AM, Keith Busch
> <keith.busch@linux.intel.com> wrote:
> > On Fri, May 11, 2018@04:52:11AM +0800, Ming Lei wrote:
> >> Hi Keith,
> >>
> >> On Tue, May 8, 2018@11:30 PM, Keith Busch <keith.busch@intel.com> wrote:
> >> > On Sat, Apr 28, 2018@11:50:17AM +0800, Ming Lei wrote:
> >> >> This sync may be raced with one timed-out request, which may be handled
> >> >> as BLK_EH_HANDLED or BLK_EH_RESET_TIMER, so the above sync queues can't
> >> >> work reliably.
> >> >
> >> > Ming,
> >> >
> >> > As proposed, that scenario is impossible to encounter. Resetting the
> >> > controller inline with the timeout reaps all the commands, and then
> >> > sets the controller state to RESETTING. While blk-mq may not allow the
> >> > driver to complete those requests, having the driver sync with the queues
> >> > will hold the controller in the reset state until blk-mq is done with
> >> > its timeout work; therefore, it is impossible for the NVMe driver to
> >> > return "BLK_EH_RESET_TIMER", and all commands will be completed through
> >> > nvme_timeout's BLK_EH_HANDLED exactly as desired.
> >>
> >> That isn't true for multiple namespace case,  each request queue has its
> >> own timeout work, and all these timeout work can be triggered concurrently.
> >
> > The controller state is most certainly not per queue/namespace. It's
> > global to the controller. Once the reset is triggered, nvme_timeout can
> > only return EH_HANDLED.
> 
> It is related with EH_HANDLED, please see the following case:
> 
> 1) when req A from N1 is timed out, nvme_timeout() handles
> it as EH_HANDLED: nvme_dev_disable() and reset is scheduled.
> 
> 2) when req B from N2 is timed out, nvme_timeout() handles
> it as EH_HANDLED, then nvme_dev_disable() is called exactly
> when reset is in-progress, so queues become quiesced, and nothing
> can move on in the resetting triggered by N1.

Huh? The nvme_sync_queues ensures that doesn't happen. That was the
whole point.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-05-10 21:18                 ` Keith Busch
@ 2018-05-10 21:24                   ` Ming Lei
  -1 siblings, 0 replies; 72+ messages in thread
From: Ming Lei @ 2018-05-10 21:24 UTC (permalink / raw)
  To: Keith Busch
  Cc: Ming Lei, Jens Axboe, linux-block, Sagi Grimberg, linux-nvme,
	Keith Busch, Jianchao Wang, Christoph Hellwig

On Thu, May 10, 2018 at 03:18:29PM -0600, Keith Busch wrote:
> On Fri, May 11, 2018 at 05:10:40AM +0800, Ming Lei wrote:
> > On Fri, May 11, 2018 at 5:05 AM, Keith Busch
> > <keith.busch@linux.intel.com> wrote:
> > > On Fri, May 11, 2018 at 04:52:11AM +0800, Ming Lei wrote:
> > >> Hi Keith,
> > >>
> > >> On Tue, May 8, 2018 at 11:30 PM, Keith Busch <keith.busch@intel.com> wrote:
> > >> > On Sat, Apr 28, 2018 at 11:50:17AM +0800, Ming Lei wrote:
> > >> >> This sync may be raced with one timed-out request, which may be handled
> > >> >> as BLK_EH_HANDLED or BLK_EH_RESET_TIMER, so the above sync queues can't
> > >> >> work reliably.
> > >> >
> > >> > Ming,
> > >> >
> > >> > As proposed, that scenario is impossible to encounter. Resetting the
> > >> > controller inline with the timeout reaps all the commands, and then
> > >> > sets the controller state to RESETTING. While blk-mq may not allow the
> > >> > driver to complete those requests, having the driver sync with the queues
> > >> > will hold the controller in the reset state until blk-mq is done with
> > >> > its timeout work; therefore, it is impossible for the NVMe driver to
> > >> > return "BLK_EH_RESET_TIMER", and all commands will be completed through
> > >> > nvme_timeout's BLK_EH_HANDLED exactly as desired.
> > >>
> > >> That isn't true for multiple namespace case,  each request queue has its
> > >> own timeout work, and all these timeout work can be triggered concurrently.
> > >
> > > The controller state is most certainly not per queue/namespace. It's
> > > global to the controller. Once the reset is triggered, nvme_timeout can
> > > only return EH_HANDLED.
> > 
> > It is related with EH_HANDLED, please see the following case:
> > 
> > 1) when req A from N1 is timed out, nvme_timeout() handles
> > it as EH_HANDLED: nvme_dev_disable() and reset is scheduled.
> > 
> > 2) when req B from N2 is timed out, nvme_timeout() handles
> > it as EH_HANDLED, then nvme_dev_disable() is called exactly
> > when reset is in-progress, so queues become quiesced, and nothing
> > can move on in the resetting triggered by N1.
> 
> Huh? The nvme_sync_queues ensures that doesn't happen. That was the
> whole point.

Could you share me the link?

Firstly, the previous nvme_sync_queues() won't work reliably, so this
patch introduces blk_unquiesce_timeout() and blk_quiesce_timeout() for
this purpose.

Secondly, I remember that you only call nvme_sync_queues() at the
entry of nvme_reset_work(), but a timeout (either admin or normal IO)
can happen again during resetting; that is another race addressed by
this patchset, but not covered by your proposal.
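
For reference, the intended usage of those two helpers is roughly as
follows (illustrative only; the real signatures and locking are in the
patchset):

	/* before starting recovery */
	list_for_each_entry(ns, &ctrl->namespaces, list)
		blk_quiesce_timeout(ns->queue);
	blk_quiesce_timeout(ctrl->admin_q);

	nvme_dev_disable(dev, false);
	/* ... reset the controller and set up queues ... */

	/* once the controller can handle requests again */
	list_for_each_entry(ns, &ctrl->namespaces, list)
		blk_unquiesce_timeout(ns->queue);
	blk_unquiesce_timeout(ctrl->admin_q);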

Thanks,
Ming

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 1/2] nvme: pci: simplify timeout handling
@ 2018-05-10 21:24                   ` Ming Lei
  0 siblings, 0 replies; 72+ messages in thread
From: Ming Lei @ 2018-05-10 21:24 UTC (permalink / raw)


On Thu, May 10, 2018@03:18:29PM -0600, Keith Busch wrote:
> On Fri, May 11, 2018@05:10:40AM +0800, Ming Lei wrote:
> > On Fri, May 11, 2018 at 5:05 AM, Keith Busch
> > <keith.busch@linux.intel.com> wrote:
> > > On Fri, May 11, 2018@04:52:11AM +0800, Ming Lei wrote:
> > >> Hi Keith,
> > >>
> > >> On Tue, May 8, 2018@11:30 PM, Keith Busch <keith.busch@intel.com> wrote:
> > >> > On Sat, Apr 28, 2018@11:50:17AM +0800, Ming Lei wrote:
> > >> >> This sync may be raced with one timed-out request, which may be handled
> > >> >> as BLK_EH_HANDLED or BLK_EH_RESET_TIMER, so the above sync queues can't
> > >> >> work reliably.
> > >> >
> > >> > Ming,
> > >> >
> > >> > As proposed, that scenario is impossible to encounter. Resetting the
> > >> > controller inline with the timeout reaps all the commands, and then
> > >> > sets the controller state to RESETTING. While blk-mq may not allow the
> > >> > driver to complete those requests, having the driver sync with the queues
> > >> > will hold the controller in the reset state until blk-mq is done with
> > >> > its timeout work; therefore, it is impossible for the NVMe driver to
> > >> > return "BLK_EH_RESET_TIMER", and all commands will be completed through
> > >> > nvme_timeout's BLK_EH_HANDLED exactly as desired.
> > >>
> > >> That isn't true for multiple namespace case,  each request queue has its
> > >> own timeout work, and all these timeout work can be triggered concurrently.
> > >
> > > The controller state is most certainly not per queue/namespace. It's
> > > global to the controller. Once the reset is triggered, nvme_timeout can
> > > only return EH_HANDLED.
> > 
> > It is related with EH_HANDLED, please see the following case:
> > 
> > 1) when req A from N1 is timed out, nvme_timeout() handles
> > it as EH_HANDLED: nvme_dev_disable() and reset is scheduled.
> > 
> > 2) when req B from N2 is timed out, nvme_timeout() handles
> > it as EH_HANDLED, then nvme_dev_disable() is called exactly
> > when reset is in-progress, so queues become quiesced, and nothing
> > can move on in the resetting triggered by N1.
> 
> Huh? The nvme_sync_queues ensures that doesn't happen. That was the
> whole point.

Could you share me the link?

Firstly, the previous nvme_sync_queues() won't work reliably, so this
patch introduces blk_unquiesce_timeout() and blk_quiesce_timeout() for
this purpose.

Secondly, I remember that you only call nvme_sync_queues() at the
entry of nvme_reset_work(), but a timeout (either admin or normal IO)
can happen again during resetting; that is another race addressed by
this patchset, but not covered by your proposal.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-05-10 21:24                   ` Ming Lei
@ 2018-05-10 21:44                     ` Keith Busch
  -1 siblings, 0 replies; 72+ messages in thread
From: Keith Busch @ 2018-05-10 21:44 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Sagi Grimberg, Ming Lei, linux-nvme,
	Keith Busch, Jianchao Wang, Christoph Hellwig

On Fri, May 11, 2018 at 05:24:46AM +0800, Ming Lei wrote:
> Could you share me the link?

The diff was in this reply here:

http://lists.infradead.org/pipermail/linux-nvme/2018-April/017019.html

> Firstly, the previous nvme_sync_queues() won't work reliably, so this
> patch introduces blk_unquiesce_timeout() and blk_quiesce_timeout() for
> this purpose.
>
> Secondly, I remembered that you only call nvme_sync_queues() at the
> entry of nvme_reset_work(), but timeout(either admin or normal IO)
> can happen again during resetting, that is another race addressed by
> this patchset, but can't cover by your proposal.

I sync the queues at the beginning because it ensures there is not
a single in flight request for the entire controller (all namespaces
plus admin queue) before transitioning to the connecting state.

If a command times out during connecting state, we go to the dead state.
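
That is, roughly the following ordering at the top of the reset work (a
sketch of the idea only, assuming an nvme_sync_queues() helper as
proposed):

	nvme_sync_queues(&dev->ctrl);	/* no request left in flight anywhere */

	if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_CONNECTING))
		goto out;

	/*
	 * ... re-enable the controller; a command timing out from here on
	 * sends the controller to the DEAD state instead of another reset.
	 */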

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 1/2] nvme: pci: simplify timeout handling
@ 2018-05-10 21:44                     ` Keith Busch
  0 siblings, 0 replies; 72+ messages in thread
From: Keith Busch @ 2018-05-10 21:44 UTC (permalink / raw)


On Fri, May 11, 2018@05:24:46AM +0800, Ming Lei wrote:
> Could you share me the link?

The diff was in this reply here:

http://lists.infradead.org/pipermail/linux-nvme/2018-April/017019.html

> Firstly, the previous nvme_sync_queues() won't work reliably, so this
> patch introduces blk_unquiesce_timeout() and blk_quiesce_timeout() for
> this purpose.
>
> Secondly, I remembered that you only call nvme_sync_queues() at the
> entry of nvme_reset_work(), but timeout(either admin or normal IO)
> can happen again during resetting, that is another race addressed by
> this patchset, but can't cover by your proposal.

I sync the queues at the beginning because it ensures there is not
a single in flight request for the entire controller (all namespaces
plus admin queue) before transitioning to the connecting state.

If a command times out during connecting state, we go to the dead state.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-05-10 21:44                     ` Keith Busch
@ 2018-05-10 21:50                       ` Ming Lei
  -1 siblings, 0 replies; 72+ messages in thread
From: Ming Lei @ 2018-05-10 21:50 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Sagi Grimberg, Ming Lei, linux-nvme,
	Keith Busch, Jianchao Wang, Christoph Hellwig

On Thu, May 10, 2018 at 03:44:41PM -0600, Keith Busch wrote:
> On Fri, May 11, 2018 at 05:24:46AM +0800, Ming Lei wrote:
> > Could you share me the link?
> 
> The diff was in this reply here:
> 
> http://lists.infradead.org/pipermail/linux-nvme/2018-April/017019.html
> 
> > Firstly, the previous nvme_sync_queues() won't work reliably, so this
> > patch introduces blk_unquiesce_timeout() and blk_quiesce_timeout() for
> > this purpose.
> >
> > Secondly, I remembered that you only call nvme_sync_queues() at the
> > entry of nvme_reset_work(), but timeout(either admin or normal IO)
> > can happen again during resetting, that is another race addressed by
> > this patchset, but can't cover by your proposal.
> 
> I sync the queues at the beginning because it ensures there is not
> a single in flight request for the entire controller (all namespaces
> plus admin queue) before transitioning to the connecting state.

But it can't avoid the race I mentioned, since nvme_dev_disable() can
happen again during resetting.

> 
> If a command times out during connecting state, we go to the dead state.

That is too risky, since the IO during resetting isn't much different
from IOs submitted via other IO paths.

For example, in the case of blktests block/011, the controller shouldn't have
been put into the dead state, and this patchset (V4 & V5) covers this case well.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-05-10 21:44                     ` Keith Busch
@ 2018-05-10 21:53                       ` Ming Lei
  -1 siblings, 0 replies; 72+ messages in thread
From: Ming Lei @ 2018-05-10 21:53 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Sagi Grimberg, Ming Lei, linux-nvme,
	Keith Busch, Jianchao Wang, Christoph Hellwig

On Thu, May 10, 2018 at 03:44:41PM -0600, Keith Busch wrote:
> On Fri, May 11, 2018 at 05:24:46AM +0800, Ming Lei wrote:
> > Could you share me the link?
> 
> The diff was in this reply here:
> 
> http://lists.infradead.org/pipermail/linux-nvme/2018-April/017019.html
> 
> > Firstly, the previous nvme_sync_queues() won't work reliably, so this
> > patch introduces blk_unquiesce_timeout() and blk_quiesce_timeout() for
> > this purpose.
> >
> > Secondly, I remembered that you only call nvme_sync_queues() at the
> > entry of nvme_reset_work(), but timeout(either admin or normal IO)
> > can happen again during resetting, that is another race addressed by
> > this patchset, but can't cover by your proposal.
> 
> I sync the queues at the beginning because it ensures there is not
> a single in flight request for the entire controller (all namespaces
> plus admin queue) before transitioning to the connecting state.
> 
> If a command times out during connecting state, we go to the dead state.

Actually, it hangs forever in blk_mq_freeze_queue_wait() from
nvme_reset_dev(); the controller doesn't end up in the dead state.
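
For reference, the wait in question is tiny; roughly the 4.16-era
blk-mq code (paraphrased):

void blk_mq_freeze_queue_wait(struct request_queue *q)
{
	/* sleeps until every in-flight request has dropped its
	 * reference on q_usage_counter; with the controller not
	 * serving commands, outstanding requests never complete and
	 * this never returns */
	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
}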

Thanks,
Ming

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-05-10 21:18                 ` Keith Busch
@ 2018-05-10 22:03                   ` Ming Lei
  -1 siblings, 0 replies; 72+ messages in thread
From: Ming Lei @ 2018-05-10 22:03 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Sagi Grimberg, linux-nvme, Ming Lei,
	Keith Busch, Jianchao Wang, Christoph Hellwig

On Fri, May 11, 2018 at 5:18 AM, Keith Busch
<keith.busch@linux.intel.com> wrote:
> On Fri, May 11, 2018 at 05:10:40AM +0800, Ming Lei wrote:
>> On Fri, May 11, 2018 at 5:05 AM, Keith Busch
>> <keith.busch@linux.intel.com> wrote:
>> > On Fri, May 11, 2018 at 04:52:11AM +0800, Ming Lei wrote:
>> >> Hi Keith,
>> >>
>> >> On Tue, May 8, 2018 at 11:30 PM, Keith Busch <keith.busch@intel.com> wrote:
>> >> > On Sat, Apr 28, 2018 at 11:50:17AM +0800, Ming Lei wrote:
>> >> >> This sync may be raced with one timed-out request, which may be handled
>> >> >> as BLK_EH_HANDLED or BLK_EH_RESET_TIMER, so the above sync queues can't
>> >> >> work reliably.
>> >> >
>> >> > Ming,
>> >> >
>> >> > As proposed, that scenario is impossible to encounter. Resetting the
>> >> > controller inline with the timeout reaps all the commands, and then
>> >> > sets the controller state to RESETTING. While blk-mq may not allow the
>> >> > driver to complete those requests, having the driver sync with the queues
>> >> > will hold the controller in the reset state until blk-mq is done with
>> >> > its timeout work; therefore, it is impossible for the NVMe driver to
>> >> > return "BLK_EH_RESET_TIMER", and all commands will be completed through
>> >> > nvme_timeout's BLK_EH_HANDLED exactly as desired.
>> >>
>> >> That isn't true for multiple namespace case,  each request queue has its
>> >> own timeout work, and all these timeout work can be triggered concurrently.
>> >
>> > The controller state is most certainly not per queue/namespace. It's
>> > global to the controller. Once the reset is triggered, nvme_timeout can
>> > only return EH_HANDLED.
>>
>> It is related with EH_HANDLED, please see the following case:
>>
>> 1) when req A from N1 is timed out, nvme_timeout() handles
>> it as EH_HANDLED: nvme_dev_disable() and reset is scheduled.
>>
>> 2) when req B from N2 is timed out, nvme_timeout() handles
>> it as EH_HANDLED, then nvme_dev_disable() is called exactly
>> when reset is in-progress, so queues become quiesced, and nothing
>> can move on in the resetting triggered by N1.
>
> Huh? The nvme_sync_queues ensures that doesn't happen. That was the
> whole point.

Sorry, I forgot to mention: it isn't enough to simply sync timeouts
inside reset().

Another tricky thing is freeze & unfreeze: currently freeze is done in
nvme_dev_disable(), and unfreeze is done in nvme_reset_work(). That
means we have to make sure the two are paired, otherwise queues may be
kept frozen forever.

This issue is covered by my V4 & V5 too.

thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-05-10 22:03                   ` Ming Lei
@ 2018-05-10 22:43                     ` Keith Busch
  -1 siblings, 0 replies; 72+ messages in thread
From: Keith Busch @ 2018-05-10 22:43 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, linux-nvme, Ming Lei,
	linux-block, Jianchao Wang, Christoph Hellwig

On Fri, May 11, 2018 at 06:03:59AM +0800, Ming Lei wrote:
> Sorry, forget to mention, it isn't enough to simply sync timeout inside reset().
> 
> Another tricky thing is about freeze & unfreeze, now freeze is done in
> nvme_dev_disable(), and unfreeze is done in nvme_reset_work. That means
> we have to make sure both are paired, otherwise queues may be kept as
> freeze for ever.
> 
> This issue is covered by my V4 & V5 too.

That isn't an issue in my patch either.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-05-10 22:43                     ` Keith Busch
@ 2018-05-11  0:14                       ` Ming Lei
  -1 siblings, 0 replies; 72+ messages in thread
From: Ming Lei @ 2018-05-11  0:14 UTC (permalink / raw)
  To: Keith Busch
  Cc: Ming Lei, Jens Axboe, Keith Busch, Sagi Grimberg, linux-nvme,
	linux-block, Jianchao Wang, Christoph Hellwig

On Thu, May 10, 2018 at 04:43:57PM -0600, Keith Busch wrote:
> On Fri, May 11, 2018 at 06:03:59AM +0800, Ming Lei wrote:
> > Sorry, forget to mention, it isn't enough to simply sync timeout inside reset().
> > 
> > Another tricky thing is about freeze & unfreeze, now freeze is done in
> > nvme_dev_disable(), and unfreeze is done in nvme_reset_work. That means
> > we have to make sure both are paired, otherwise queues may be kept as
> > freeze for ever.
> > 
> > This issue is covered by my V4 & V5 too.
> 
> That it isn't an issue in my patch either.

When blk_sync_queue() is used in your patch to drain timeouts, several
nvme_timeout() instances may be drained; that means nvme_start_freeze()
may be called more than once from those nvme_timeout() instances, while
only one nvme_reset_work() finally runs, so the queues are kept frozen
forever even though the reset itself completes successfully.

So your patch won't fix this issue.
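
The underlying reason is that freezing is counted, not boolean;
simplified from the 4.16-era block/blk-mq.c:

void blk_freeze_queue_start(struct request_queue *q)
{
	/* only the first freezer kills q_usage_counter; later
	 * freezers just bump the depth */
	if (atomic_inc_return(&q->mq_freeze_depth) == 1)
		percpu_ref_kill(&q->q_usage_counter);
}

void blk_mq_unfreeze_queue(struct request_queue *q)
{
	/* the queue only becomes usable again once the depth drops
	 * back to zero */
	if (atomic_dec_return(&q->mq_freeze_depth) == 0) {
		percpu_ref_reinit(&q->q_usage_counter);
		wake_up_all(&q->mq_freeze_wq);
	}
}

/* two timeouts -> two nvme_dev_disable() -> nvme_start_freeze() twice,
 * so mq_freeze_depth reaches 2; the single nvme_reset_work() that runs
 * calls nvme_unfreeze() once, leaving the depth at 1 and the queues
 * frozen for good */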

All these problems are mentioned in the commit log of V4, and I
am proposing this approach to try to address them all.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/2] nvme: pci: simplify timeout handling
  2018-05-10 21:05             ` Keith Busch
@ 2018-05-11  2:10               ` Ming Lei
  -1 siblings, 0 replies; 72+ messages in thread
From: Ming Lei @ 2018-05-11  2:10 UTC (permalink / raw)
  To: Keith Busch
  Cc: Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, Ming Lei,
	linux-block, Jianchao Wang, Christoph Hellwig

On Fri, May 11, 2018 at 5:05 AM, Keith Busch
<keith.busch@linux.intel.com> wrote:
> On Fri, May 11, 2018 at 04:52:11AM +0800, Ming Lei wrote:
>> Hi Keith,
>>
>> On Tue, May 8, 2018 at 11:30 PM, Keith Busch <keith.busch@intel.com> wrote:
>> > On Sat, Apr 28, 2018 at 11:50:17AM +0800, Ming Lei wrote:
>> >> This sync may be raced with one timed-out request, which may be handled
>> >> as BLK_EH_HANDLED or BLK_EH_RESET_TIMER, so the above sync queues can't
>> >> work reliably.
>> >
>> > Ming,
>> >
>> > As proposed, that scenario is impossible to encounter. Resetting the
>> > controller inline with the timeout reaps all the commands, and then
>> > sets the controller state to RESETTING. While blk-mq may not allow the
>> > driver to complete those requests, having the driver sync with the queues
>> > will hold the controller in the reset state until blk-mq is done with
>> > its timeout work; therefore, it is impossible for the NVMe driver to
>> > return "BLK_EH_RESET_TIMER", and all commands will be completed through
>> > nvme_timeout's BLK_EH_HANDLED exactly as desired.
>>
>> That isn't true for multiple namespace case,  each request queue has its
>> own timeout work, and all these timeout work can be triggered concurrently.
>
> The controller state is most certainly not per queue/namespace. It's
> global to the controller. Once the reset is triggered, nvme_timeout can
> only return EH_HANDLED.

One exception is PCI error recovery, in which EH_RESET_TIMER may still
be returned at any time.

Also, two timeouts can happen at the same time from more than one NS,
just before resetting is started (before the state is updated to
NVME_CTRL_RESETTING).

Or one timeout is from the admin queue and another is from a NS; both
happen at the same time, still before the update to NVME_CTRL_RESETTING.

In the above two situations, one timeout can be handled as EH_HANDLED
and the other as EH_RESET_TIMER.

So it isn't enough to simply drain timeouts with blk_sync_queue().
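
The divergence comes from nvme_timeout() itself; condensed from the
4.17-era drivers/nvme/host/pci.c (logging, abort submission and the
PCI error-recovery early-out mentioned above are elided, shapes only):

static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
{
	/* ... */

	switch (dev->ctrl.state) {
	case NVME_CTRL_CONNECTING:
	case NVME_CTRL_RESETTING:
		/* a reset is already under way: tear down and let the
		 * reset path recover the command */
		nvme_dev_disable(dev, false);
		return BLK_EH_HANDLED;
	default:
		break;
	}

	if (!nvmeq->qid || iod->aborted) {
		/* admin command, or second timeout of the same command:
		 * disable the device and schedule a reset */
		nvme_dev_disable(dev, false);
		nvme_reset_ctrl(&dev->ctrl);
		return BLK_EH_HANDLED;
	}

	/* otherwise send an Abort and give the command one more period */
	return BLK_EH_RESET_TIMER;
}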


Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 72+ messages in thread

end of thread, other threads:[~2018-05-11  2:10 UTC | newest]

Thread overview: 72+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-26 12:39 [PATCH 0/2] nvme: pci: fix & improve timeout handling Ming Lei
2018-04-26 12:39 ` Ming Lei
2018-04-26 12:39 ` [PATCH 1/2] nvme: pci: simplify " Ming Lei
2018-04-26 12:39   ` Ming Lei
2018-04-26 15:07   ` jianchao.wang
2018-04-26 15:07     ` jianchao.wang
2018-04-26 15:57     ` Ming Lei
2018-04-26 15:57       ` Ming Lei
2018-04-26 16:16       ` Ming Lei
2018-04-26 16:16         ` Ming Lei
2018-04-27  1:37       ` jianchao.wang
2018-04-27  1:37         ` jianchao.wang
2018-04-27 14:57         ` Ming Lei
2018-04-27 14:57           ` Ming Lei
2018-04-28 14:00           ` jianchao.wang
2018-04-28 14:00             ` jianchao.wang
2018-04-28 21:57             ` Ming Lei
2018-04-28 21:57               ` Ming Lei
2018-04-28 22:27               ` Ming Lei
2018-04-28 22:27                 ` Ming Lei
2018-04-29  1:36                 ` Ming Lei
2018-04-29  1:36                   ` Ming Lei
2018-04-29  2:21                   ` jianchao.wang
2018-04-29  2:21                     ` jianchao.wang
2018-04-29 14:13                     ` Ming Lei
2018-04-29 14:13                       ` Ming Lei
2018-04-27 17:51   ` Keith Busch
2018-04-27 17:51     ` Keith Busch
2018-04-28  3:50     ` Ming Lei
2018-04-28  3:50       ` Ming Lei
2018-04-28 13:35       ` Keith Busch
2018-04-28 13:35         ` Keith Busch
2018-04-28 14:31         ` jianchao.wang
2018-04-28 14:31           ` jianchao.wang
2018-04-28 21:39         ` Ming Lei
2018-04-28 21:39           ` Ming Lei
2018-04-30 19:52           ` Keith Busch
2018-04-30 19:52             ` Keith Busch
2018-04-30 23:14             ` Ming Lei
2018-04-30 23:14               ` Ming Lei
2018-05-08 15:30       ` Keith Busch
2018-05-08 15:30         ` Keith Busch
2018-05-10 20:52         ` Ming Lei
2018-05-10 20:52           ` Ming Lei
2018-05-10 21:05           ` Keith Busch
2018-05-10 21:05             ` Keith Busch
2018-05-10 21:10             ` Ming Lei
2018-05-10 21:10               ` Ming Lei
2018-05-10 21:18               ` Keith Busch
2018-05-10 21:18                 ` Keith Busch
2018-05-10 21:24                 ` Ming Lei
2018-05-10 21:24                   ` Ming Lei
2018-05-10 21:44                   ` Keith Busch
2018-05-10 21:44                     ` Keith Busch
2018-05-10 21:50                     ` Ming Lei
2018-05-10 21:50                       ` Ming Lei
2018-05-10 21:53                     ` Ming Lei
2018-05-10 21:53                       ` Ming Lei
2018-05-10 22:03                 ` Ming Lei
2018-05-10 22:03                   ` Ming Lei
2018-05-10 22:43                   ` Keith Busch
2018-05-10 22:43                     ` Keith Busch
2018-05-11  0:14                     ` Ming Lei
2018-05-11  0:14                       ` Ming Lei
2018-05-11  2:10             ` Ming Lei
2018-05-11  2:10               ` Ming Lei
2018-04-26 12:39 ` [PATCH 2/2] nvme: pci: guarantee EH can make progress Ming Lei
2018-04-26 12:39   ` Ming Lei
2018-04-26 16:24   ` Keith Busch
2018-04-26 16:24     ` Keith Busch
2018-04-28  3:28     ` Ming Lei
2018-04-28  3:28       ` Ming Lei
