* [PATCHv4 0/4] nvme timeout updates
@ 2018-07-13 20:56 Keith Busch
  2018-07-13 20:56 ` [PATCHv4 1/4] nvme: Sync request queues on reset Keith Busch
                   ` (4 more replies)
  0 siblings, 5 replies; 35+ messages in thread
From: Keith Busch @ 2018-07-13 20:56 UTC (permalink / raw)


The majority of the v3 was either merged or no longer necessary.

Patch 1 is still the same as before.

The remaining patches 2-4 are setting up the nvme core to handle
unfreezing queues and dispatching IO from a context that is safe to
handle IO timeouts. The pci nvme is the only one making use of that at
the moment, but I suspect fabrics can make something of it.

The patch retrying a failed reset on an admin init timeout has been
dropped. If we really need to do this, I think more discussion will help
so we are creating less arbitrary rules.

Keith Busch (4):
  nvme: Sync request queues on reset
  nvme: Start controller in own work queue
  nvme: Introduce frozen controller state
  nvme-pci: Use controller start work to dispatch IO

 drivers/nvme/host/core.c | 54 ++++++++++++++++++++++++++++++++++++++++++++----
 drivers/nvme/host/nvme.h |  3 +++
 drivers/nvme/host/pci.c  | 20 ++++++------------
 3 files changed, 59 insertions(+), 18 deletions(-)

-- 
2.14.3

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 1/4] nvme: Sync request queues on reset
  2018-07-13 20:56 [PATCHv4 0/4] nvme timeout updates Keith Busch
@ 2018-07-13 20:56 ` Keith Busch
  2018-07-16  8:52   ` jianchao.wang
  2018-07-16 14:51   ` Sagi Grimberg
  2018-07-13 20:56 ` [PATCHv4 2/4] nvme: Start controller in own work queue Keith Busch
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 35+ messages in thread
From: Keith Busch @ 2018-07-13 20:56 UTC (permalink / raw)


This patch fixes races that occur with simultaneous controller
resets by synchronizing request queues prior to initializing the
controller. Without this, a thread may attempt disabling a controller
at the same time as we're trying to enable it.

Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/core.c | 21 +++++++++++++++++++--
 drivers/nvme/host/nvme.h |  1 +
 drivers/nvme/host/pci.c  |  1 +
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 46df030b2c3f..eeb97dbef7ff 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -101,6 +101,12 @@ static void nvme_ns_remove(struct nvme_ns *ns);
 static int nvme_revalidate_disk(struct gendisk *disk);
 static void nvme_put_subsystem(struct nvme_subsystem *subsys);
 
+static void nvme_start_queue(struct nvme_ns *ns)
+{
+	blk_mq_unquiesce_queue(ns->queue);
+	blk_mq_kick_requeue_list(ns->queue);
+}
+
 static void nvme_queue_scan(struct nvme_ctrl *ctrl)
 {
 	/*
@@ -3553,7 +3559,7 @@ void nvme_kill_queues(struct nvme_ctrl *ctrl)
 		blk_set_queue_dying(ns->queue);
 
 		/* Forcibly unquiesce queues to avoid blocking dispatch */
-		blk_mq_unquiesce_queue(ns->queue);
+		nvme_start_queue(ns);
 	}
 	up_read(&ctrl->namespaces_rwsem);
 }
@@ -3623,11 +3629,22 @@ void nvme_start_queues(struct nvme_ctrl *ctrl)
 
 	down_read(&ctrl->namespaces_rwsem);
 	list_for_each_entry(ns, &ctrl->namespaces, list)
-		blk_mq_unquiesce_queue(ns->queue);
+		nvme_start_queue(ns);
 	up_read(&ctrl->namespaces_rwsem);
 }
 EXPORT_SYMBOL_GPL(nvme_start_queues);
 
+void nvme_sync_queues(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	down_read(&ctrl->namespaces_rwsem);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		blk_sync_queue(ns->queue);
+	up_read(&ctrl->namespaces_rwsem);
+}
+EXPORT_SYMBOL_GPL(nvme_sync_queues);
+
 int __init nvme_core_init(void)
 {
 	int result = -ENOMEM;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 0c4a33df3b2f..51cf44ad86fd 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -409,6 +409,7 @@ int nvme_sec_submit(void *data, u16 spsp, u8 secp, void *buffer, size_t len,
 void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
 		volatile union nvme_result *res);
 
+void nvme_sync_queues(struct nvme_ctrl *ctrl);
 void nvme_stop_queues(struct nvme_ctrl *ctrl);
 void nvme_start_queues(struct nvme_ctrl *ctrl);
 void nvme_kill_queues(struct nvme_ctrl *ctrl);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index ba943f211687..b9b83a7a5a52 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2318,6 +2318,7 @@ static void nvme_reset_work(struct work_struct *work)
 	 */
 	if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
 		nvme_dev_disable(dev, false);
+	nvme_sync_queues(&dev->ctrl);
 
 	/*
 	 * Introduce CONNECTING state from nvme-fc/rdma transports to mark the
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCHv4 2/4] nvme: Start controller in own work queue
  2018-07-13 20:56 [PATCHv4 0/4] nvme timeout updates Keith Busch
  2018-07-13 20:56 ` [PATCHv4 1/4] nvme: Sync request queues on reset Keith Busch
@ 2018-07-13 20:56 ` Keith Busch
  2018-07-16 15:00   ` Sagi Grimberg
  2018-07-13 20:56 ` [PATCHv4 3/4] nvme: Introduce frozen controller state Keith Busch
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 35+ messages in thread
From: Keith Busch @ 2018-07-13 20:56 UTC (permalink / raw)


This moves the controller start from the reset_work to its own work.
This is preparing for the possibility of running IO when starting the
controller, which we shouldn't do from the reset_work since we need that
context to handle timeouts.

Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/core.c | 12 +++++++++++-
 drivers/nvme/host/nvme.h |  1 +
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index eeb97dbef7ff..6fed8a454843 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3418,6 +3418,7 @@ EXPORT_SYMBOL_GPL(nvme_complete_async_event);
 void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
 {
 	nvme_stop_keep_alive(ctrl);
+	flush_work(&ctrl->start_work);
 	flush_work(&ctrl->async_event_work);
 	flush_work(&ctrl->scan_work);
 	cancel_work_sync(&ctrl->fw_act_work);
@@ -3426,8 +3427,11 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_stop_ctrl);
 
-void nvme_start_ctrl(struct nvme_ctrl *ctrl)
+void nvme_start_work(struct work_struct *work)
 {
+	struct nvme_ctrl *ctrl =
+		container_of(work, struct nvme_ctrl, start_work);
+
 	if (ctrl->kato)
 		nvme_start_keep_alive(ctrl);
 
@@ -3438,6 +3442,11 @@ void nvme_start_ctrl(struct nvme_ctrl *ctrl)
 		nvme_start_queues(ctrl);
 	}
 }
+
+void nvme_start_ctrl(struct nvme_ctrl *ctrl)
+{
+	queue_work(nvme_wq, &ctrl->start_work);
+}
 EXPORT_SYMBOL_GPL(nvme_start_ctrl);
 
 void nvme_uninit_ctrl(struct nvme_ctrl *ctrl)
@@ -3485,6 +3494,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	ctrl->dev = dev;
 	ctrl->ops = ops;
 	ctrl->quirks = quirks;
+	INIT_WORK(&ctrl->start_work, nvme_start_work);
 	INIT_WORK(&ctrl->scan_work, nvme_scan_work);
 	INIT_WORK(&ctrl->async_event_work, nvme_async_event_work);
 	INIT_WORK(&ctrl->fw_act_work, nvme_fw_act_work);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 51cf44ad86fd..1dd26a9a1fdc 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -190,6 +190,7 @@ struct nvme_ctrl {
 	unsigned long quirks;
 	struct nvme_id_power_state psd[32];
 	struct nvme_effects_log *effects;
+	struct work_struct start_work;
 	struct work_struct scan_work;
 	struct work_struct async_event_work;
 	struct delayed_work ka_work;
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCHv4 3/4] nvme: Introduce frozen controller state
  2018-07-13 20:56 [PATCHv4 0/4] nvme timeout updates Keith Busch
  2018-07-13 20:56 ` [PATCHv4 1/4] nvme: Sync request queues on reset Keith Busch
  2018-07-13 20:56 ` [PATCHv4 2/4] nvme: Start controller in own work queue Keith Busch
@ 2018-07-13 20:56 ` Keith Busch
  2018-07-16  9:02   ` jianchao.wang
                     ` (2 more replies)
  2018-07-13 20:56 ` [PATCHv4 4/4] nvme-pci: Use controller start work to dispatch IO Keith Busch
  2018-07-19 19:48 ` [PATCHv4 0/4] nvme timeout updates Scott Bauer
  4 siblings, 3 replies; 35+ messages in thread
From: Keith Busch @ 2018-07-13 20:56 UTC (permalink / raw)


This patch creates a new controller state that transport specific drivers
can set their controller to prior to starting the controller to have the
start_work handle frozen request queues and update the hardware contexts.

Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/core.c | 21 ++++++++++++++++++++-
 drivers/nvme/host/nvme.h |  1 +
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 6fed8a454843..7ce339c3fcc4 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -288,11 +288,21 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 			break;
 		}
 		break;
+	case NVME_CTRL_FROZEN:
+		switch (old_state) {
+		case NVME_CTRL_CONNECTING:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
 	case NVME_CTRL_LIVE:
 		switch (old_state) {
 		case NVME_CTRL_NEW:
 		case NVME_CTRL_RESETTING:
 		case NVME_CTRL_CONNECTING:
+		case NVME_CTRL_FROZEN:
 			changed = true;
 			/* FALLTHRU */
 		default:
@@ -304,6 +314,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 		case NVME_CTRL_NEW:
 		case NVME_CTRL_LIVE:
 		case NVME_CTRL_ADMIN_ONLY:
+		case NVME_CTRL_FROZEN:
 			changed = true;
 			/* FALLTHRU */
 		default:
@@ -3435,16 +3446,24 @@ void nvme_start_work(struct work_struct *work)
 	if (ctrl->kato)
 		nvme_start_keep_alive(ctrl);
 
+	if (ctrl->state == NVME_CTRL_FROZEN) {
+		nvme_wait_freeze(ctrl);
+		blk_mq_update_nr_hw_queues(ctrl->tagset, ctrl->queue_count - 1);
+		nvme_unfreeze(ctrl);
+		if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_LIVE))
+			return;
+	}
+
 	if (ctrl->queue_count > 1) {
 		nvme_queue_scan(ctrl);
 		nvme_enable_aen(ctrl);
 		queue_work(nvme_wq, &ctrl->async_event_work);
-		nvme_start_queues(ctrl);
 	}
 }
 
 void nvme_start_ctrl(struct nvme_ctrl *ctrl)
 {
+	nvme_start_queues(ctrl);
 	queue_work(nvme_wq, &ctrl->start_work);
 }
 EXPORT_SYMBOL_GPL(nvme_start_ctrl);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 1dd26a9a1fdc..ca981e5eb6b3 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -134,6 +134,7 @@ enum nvme_ctrl_state {
 	NVME_CTRL_CONNECTING,
 	NVME_CTRL_DELETING,
 	NVME_CTRL_DEAD,
+	NVME_CTRL_FROZEN,
 };
 
 struct nvme_ctrl {
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCHv4 4/4] nvme-pci: Use controller start work to dispatch IO
  2018-07-13 20:56 [PATCHv4 0/4] nvme timeout updates Keith Busch
                   ` (2 preceding siblings ...)
  2018-07-13 20:56 ` [PATCHv4 3/4] nvme: Introduce frozen controller state Keith Busch
@ 2018-07-13 20:56 ` Keith Busch
  2018-07-19 19:48 ` [PATCHv4 0/4] nvme timeout updates Scott Bauer
  4 siblings, 0 replies; 35+ messages in thread
From: Keith Busch @ 2018-07-13 20:56 UTC (permalink / raw)


We don't want to wait for IO within the reset work because we need that
context to handle IO timeouts. This patch uses the new FROZEN state
to inform the controller start work to handle restarting IO queues and
waiting for their freeze to complete.

Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/pci.c | 19 +++++--------------
 1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b9b83a7a5a52..2e12c8a746c7 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2077,10 +2077,7 @@ static void nvme_disable_io_queues(struct nvme_dev *dev)
 	}
 }
 
-/*
- * return error value only when tagset allocation failed
- */
-static int nvme_dev_add(struct nvme_dev *dev)
+static enum nvme_ctrl_state nvme_dev_add(struct nvme_dev *dev)
 {
 	int ret;
 
@@ -2103,19 +2100,18 @@ static int nvme_dev_add(struct nvme_dev *dev)
 		if (ret) {
 			dev_warn(dev->ctrl.device,
 				"IO queues tagset allocation failed %d\n", ret);
-			return ret;
+			return NVME_CTRL_ADMIN_ONLY;
 		}
 		dev->ctrl.tagset = &dev->tagset;
 
 		nvme_dbbuf_set(dev);
 	} else {
-		blk_mq_update_nr_hw_queues(&dev->tagset, dev->online_queues - 1);
-
 		/* Free previously allocated queues that are no longer usable */
 		nvme_free_queues(dev, dev->online_queues);
+		return NVME_CTRL_FROZEN;
 	}
 
-	return 0;
+	return NVME_CTRL_LIVE;
 }
 
 static int nvme_pci_enable(struct nvme_dev *dev)
@@ -2391,12 +2387,7 @@ static void nvme_reset_work(struct work_struct *work)
 		nvme_remove_namespaces(&dev->ctrl);
 		new_state = NVME_CTRL_ADMIN_ONLY;
 	} else {
-		nvme_start_queues(&dev->ctrl);
-		nvme_wait_freeze(&dev->ctrl);
-		/* hit this only when allocate tagset fails */
-		if (nvme_dev_add(dev))
-			new_state = NVME_CTRL_ADMIN_ONLY;
-		nvme_unfreeze(&dev->ctrl);
+		new_state = nvme_dev_add(dev);
 	}
 
 	/*
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCHv4 1/4] nvme: Sync request queues on reset
  2018-07-13 20:56 ` [PATCHv4 1/4] nvme: Sync request queues on reset Keith Busch
@ 2018-07-16  8:52   ` jianchao.wang
  2018-07-16 10:39     ` Ming Lei
  2018-07-16 14:51   ` Sagi Grimberg
  1 sibling, 1 reply; 35+ messages in thread
From: jianchao.wang @ 2018-07-16  8:52 UTC (permalink / raw)


Hi Keith

On 07/14/2018 04:56 AM, Keith Busch wrote:
> This patch fixes races that occur with simultaneous controller
> resets by synchronizing request queues prior to initializing the
> controller. Without this, a thread may attempt disabling a controller
> at the same time as we're trying to enable it.

This issue is due to the previous blk-mq timeout mechanism, which would hold
all the timed out requests so that nvme_dev_disable could not grab them,
and when it returned, blk_mq_timeout_work could still be running.
But after commit 12f5b9314545 ("blk-mq: Remove generation seqeunce"),
nvme_dev_disable is able to grab all the requests, and when it returns
there will not be any requests left in blk_mq_timeout_work, so nvme_dev_disable
will not be invoked multiple times.
So this patch should be unnecessary. :)

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 3/4] nvme: Introduce frozen controller state
  2018-07-13 20:56 ` [PATCHv4 3/4] nvme: Introduce frozen controller state Keith Busch
@ 2018-07-16  9:02   ` jianchao.wang
  2018-07-16 11:09   ` Ming Lei
  2018-07-16 16:34   ` Sagi Grimberg
  2 siblings, 0 replies; 35+ messages in thread
From: jianchao.wang @ 2018-07-16  9:02 UTC (permalink / raw)


Hi Keith

On 07/14/2018 04:56 AM, Keith Busch wrote:
> @@ -134,6 +134,7 @@ enum nvme_ctrl_state {
>  	NVME_CTRL_CONNECTING,
>  	NVME_CTRL_DELETING,
>  	NVME_CTRL_DEAD,
> +	NVME_CTRL_FROZEN,
>  };

We also need to add a state name entry into the nvme_sysfs_show_state
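
Something along these lines, I think (sketch only, not compile tested; the
existing entries stay as they are):

	static const char *const state_name[] = {
		/* ... existing "new"/"live"/... entries unchanged ... */
		[NVME_CTRL_FROZEN]	= "frozen",	/* new entry for the proposed state */
	};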

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 1/4] nvme: Sync request queues on reset
  2018-07-16  8:52   ` jianchao.wang
@ 2018-07-16 10:39     ` Ming Lei
  2018-07-16 13:30       ` Keith Busch
  0 siblings, 1 reply; 35+ messages in thread
From: Ming Lei @ 2018-07-16 10:39 UTC (permalink / raw)


On Mon, Jul 16, 2018@04:52:39PM +0800, jianchao.wang wrote:
> Hi Keith
> 
> On 07/14/2018 04:56 AM, Keith Busch wrote:
> > This patch fixes races that occur with simultaneous controller
> > resets by synchronizing request queues prior to initializing the
> > controller. Without this, a thread may attempt disabling a controller
> > at the same time as we're trying to enable it.
> 
> This issue is due to the previous blk-mq timeout mechanism, which would hold
> all the timed out requests so that nvme_dev_disable could not grab them,
> and when it returned, blk_mq_timeout_work could still be running.
> But after commit 12f5b9314545 ("blk-mq: Remove generation seqeunce"),
> nvme_dev_disable is able to grab all the requests, and when it returns
> there will not be any requests left in blk_mq_timeout_work, so nvme_dev_disable
> will not be invoked multiple times.
> So this patch should be unnecessary. :)

There are multiple namespaces, and all may trigger timeout at the same
time, so looks the sync is still needed.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 3/4] nvme: Introduce frozen controller state
  2018-07-13 20:56 ` [PATCHv4 3/4] nvme: Introduce frozen controller state Keith Busch
  2018-07-16  9:02   ` jianchao.wang
@ 2018-07-16 11:09   ` Ming Lei
  2018-07-16 13:36     ` Keith Busch
  2018-07-16 16:34   ` Sagi Grimberg
  2 siblings, 1 reply; 35+ messages in thread
From: Ming Lei @ 2018-07-16 11:09 UTC (permalink / raw)


On Fri, Jul 13, 2018@02:56:08PM -0600, Keith Busch wrote:
> This patch creates a new controller state that transport specific drivers
> can set their controller to prior to starting the controller to have the
> start_work handle frozen request queues and update the hardware contexts.
> 
> Signed-off-by: Keith Busch <keith.busch at intel.com>
> ---
>  drivers/nvme/host/core.c | 21 ++++++++++++++++++++-
>  drivers/nvme/host/nvme.h |  1 +
>  2 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 6fed8a454843..7ce339c3fcc4 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -288,11 +288,21 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
>  			break;
>  		}
>  		break;
> +	case NVME_CTRL_FROZEN:
> +		switch (old_state) {
> +		case NVME_CTRL_CONNECTING:
> +			changed = true;
> +			/* FALLTHRU */
> +		default:
> +			break;
> +		}
> +		break;
>  	case NVME_CTRL_LIVE:
>  		switch (old_state) {
>  		case NVME_CTRL_NEW:
>  		case NVME_CTRL_RESETTING:
>  		case NVME_CTRL_CONNECTING:
> +		case NVME_CTRL_FROZEN:
>  			changed = true;
>  			/* FALLTHRU */
>  		default:
> @@ -304,6 +314,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
>  		case NVME_CTRL_NEW:
>  		case NVME_CTRL_LIVE:
>  		case NVME_CTRL_ADMIN_ONLY:
> +		case NVME_CTRL_FROZEN:
>  			changed = true;
>  			/* FALLTHRU */
>  		default:
> @@ -3435,16 +3446,24 @@ void nvme_start_work(struct work_struct *work)
>  	if (ctrl->kato)
>  		nvme_start_keep_alive(ctrl);
>  
> +	if (ctrl->state == NVME_CTRL_FROZEN) {
> +		nvme_wait_freeze(ctrl);
> +		blk_mq_update_nr_hw_queues(ctrl->tagset, ctrl->queue_count - 1);
> +		nvme_unfreeze(ctrl);
> +		if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_LIVE))
> +			return;
> +	}
> +

Now you move unfreezing into start work function, seems this way can't
guarantee that nvme_unfreeze() is strictly paired with nvme_start_freeze() done
in nvme_dev_disable().

In V3, this issue can be avoided by patch of '[PATCHv3 3/9] nvme: Move all IO out of
controller reset'.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 1/4] nvme: Sync request queues on reset
  2018-07-16 10:39     ` Ming Lei
@ 2018-07-16 13:30       ` Keith Busch
  2018-07-17  5:37         ` jianchao.wang
  0 siblings, 1 reply; 35+ messages in thread
From: Keith Busch @ 2018-07-16 13:30 UTC (permalink / raw)


On Mon, Jul 16, 2018@06:39:57PM +0800, Ming Lei wrote:
> On Mon, Jul 16, 2018@04:52:39PM +0800, jianchao.wang wrote:
> > Hi Keith
> > 
> > On 07/14/2018 04:56 AM, Keith Busch wrote:
> > > This patch fixes races that occur with simultaneous controller
> > > resets by synchronizing request queues prior to initializing the
> > > controller. Without this, a thread may attempt disabling a controller
> > > at the same time as we're trying to enable it.
> > 
> > This issue is due to the previous blk-mq timeout mechanism, which would hold
> > all the timed out requests so that nvme_dev_disable could not grab them,
> > and when it returned, blk_mq_timeout_work could still be running.
> > But after commit 12f5b9314545 ("blk-mq: Remove generation seqeunce"),
> > nvme_dev_disable is able to grab all the requests, and when it returns
> > there will not be any requests left in blk_mq_timeout_work, so nvme_dev_disable
> > will not be invoked multiple times.
> > So this patch should be unnecessary. :)
> 
> There are multiple namespaces, and all may trigger timeout at the same
> time, so looks the sync is still needed.

Yes, we still need this syncing because each namespace has a different
request queue with their own timeout work.

As an alternative, I looked into having timeout work per tagset rather
than per request-queue, but that turned into a more complicated change
than this patch.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 3/4] nvme: Introduce frozen controller state
  2018-07-16 11:09   ` Ming Lei
@ 2018-07-16 13:36     ` Keith Busch
  2018-07-17  1:23       ` Ming Lei
  0 siblings, 1 reply; 35+ messages in thread
From: Keith Busch @ 2018-07-16 13:36 UTC (permalink / raw)


On Mon, Jul 16, 2018@07:09:04PM +0800, Ming Lei wrote:
> On Fri, Jul 13, 2018@02:56:08PM -0600, Keith Busch wrote:
> > +	if (ctrl->state == NVME_CTRL_FROZEN) {
> > +		nvme_wait_freeze(ctrl);
> > +		blk_mq_update_nr_hw_queues(ctrl->tagset, ctrl->queue_count - 1);
> > +		nvme_unfreeze(ctrl);
> > +		if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_LIVE))
> > +			return;
> > +	}
> > +
> 
> Now you move unfreezing into start work function, seems this way can't
> guarantee that nvme_unfreeze() is strictly paired with nvme_start_freeze() done
> in nvme_dev_disable().

The state machine is supposed to guarantee this. We only start freeze
in certain states from nvme_dev_disable, and transition to frozen when
exiting those states. I agree it's not the easiest API to follow, and
I'll double check to see if it's correct. I think I have the
reset/connecting/frozen/live transitions okay for nvme-pci, but I need
to look again at the deleting/dead states.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 1/4] nvme: Sync request queues on reset
  2018-07-13 20:56 ` [PATCHv4 1/4] nvme: Sync request queues on reset Keith Busch
  2018-07-16  8:52   ` jianchao.wang
@ 2018-07-16 14:51   ` Sagi Grimberg
  2018-07-16 15:37     ` Keith Busch
  1 sibling, 1 reply; 35+ messages in thread
From: Sagi Grimberg @ 2018-07-16 14:51 UTC (permalink / raw)


Keith,

> This patch fixes races that occur with simultaneous controller
> resets by synchronizing request queues prior to initializing the
> controller. Without this, a thread may attempt disabling a controller
> at the same time as we're trying to enable it.

I'll need a little more help with this. This is due to the
fact that we disable the controller directly from the timeout
handler? If not, what context may disable when nvme_reset_work
is running?

I'm trying to understand what the sync should serialize and why it is not
serialized by the controller state machine.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 2/4] nvme: Start controller in own work queue
  2018-07-13 20:56 ` [PATCHv4 2/4] nvme: Start controller in own work queue Keith Busch
@ 2018-07-16 15:00   ` Sagi Grimberg
  2018-07-16 15:35     ` Keith Busch
  0 siblings, 1 reply; 35+ messages in thread
From: Sagi Grimberg @ 2018-07-16 15:00 UTC (permalink / raw)



> This moves the controller start from the reset_work to its own work.
> This is preparing for the possibility of running IO when starting the
> controller, which we shouldn't do from the reset_work since we need that
> context to handle timeouts.

What do you mean handle timeouts? the timeout handler is not invoked
from the reset_work thread.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 2/4] nvme: Start controller in own work queue
  2018-07-16 15:00   ` Sagi Grimberg
@ 2018-07-16 15:35     ` Keith Busch
  0 siblings, 0 replies; 35+ messages in thread
From: Keith Busch @ 2018-07-16 15:35 UTC (permalink / raw)


On Mon, Jul 16, 2018@06:00:38PM +0300, Sagi Grimberg wrote:
> 
> > This moves the controller start from the reset_work to its own work.
> > This is preparing for the possibility of running IO when starting the
> > controller, which we shouldn't do from the reset_work since we need that
> > context to handle timeouts.
> 
> What do you mean handle timeouts? the timeout handler is not invoked
> from the reset_work thread.

Not directly. If you dispatch blocking IO from the reset_work, the timeout
handler will eventually be invoked if the IO never returns. This series
is really just trying to get all the blocking IO out of the reset_work
because IO may requeue when the timeout handler disables the controller,
and we need the reset_work to restart the controller.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 1/4] nvme: Sync request queues on reset
  2018-07-16 14:51   ` Sagi Grimberg
@ 2018-07-16 15:37     ` Keith Busch
  2018-07-16 16:36       ` Sagi Grimberg
  0 siblings, 1 reply; 35+ messages in thread
From: Keith Busch @ 2018-07-16 15:37 UTC (permalink / raw)


On Mon, Jul 16, 2018@05:51:47PM +0300, Sagi Grimberg wrote:
> Keith,
> 
> > This patch fixes races that occur with simultaneous controller
> > resets by synchronizing request queues prior to initializing the
> > controller. Without this, a thread may attempt disabling a controller
> > at the same time as we're trying to enable it.
> 
> I'll need a little more help with this. This is due to the
> fact that we disable the controller directly from the timeout
> handler? If not, what context may disable when nvme_reset_work
> is running?
> 
> I'm trying to understand what the sync should serialize and why it is not
> serialized by the controller state machine.

The only reason we need this is because each namespace has its own
request queue with their own timeout work. We don't want all of these
scheduling multiple controller resets, so the sync here just ensures
that there is no active timeout work that's about to schedule another
reset while we're already resetting the controller.
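
For reference, nvme_sync_queues() from patch 1 boils down to blk_sync_queue()
on every namespace queue, which is roughly the following (paraphrased from
memory, not a verbatim copy of block/blk-core.c):

	void blk_sync_queue(struct request_queue *q)
	{
		/* stop the per-queue timeout timer and wait out a running timeout work */
		del_timer_sync(&q->timeout);
		cancel_work_sync(&q->timeout_work);
		/* (blk-mq queues also cancel their per-hctx run works here) */
	}

so once it returns there is no timeout work in flight for that namespace.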

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 3/4] nvme: Introduce frozen controller state
  2018-07-13 20:56 ` [PATCHv4 3/4] nvme: Introduce frozen controller state Keith Busch
  2018-07-16  9:02   ` jianchao.wang
  2018-07-16 11:09   ` Ming Lei
@ 2018-07-16 16:34   ` Sagi Grimberg
  2018-07-17 16:05     ` James Smart
  2 siblings, 1 reply; 35+ messages in thread
From: Sagi Grimberg @ 2018-07-16 16:34 UTC (permalink / raw)



> +	if (ctrl->state == NVME_CTRL_FROZEN) {
> +		nvme_wait_freeze(ctrl);
> +		blk_mq_update_nr_hw_queues(ctrl->tagset, ctrl->queue_count - 1);
> +		nvme_unfreeze(ctrl);
> +		if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_LIVE))
> +			return;
> +	}
> +

Having the transport drivers setup a state to indicate nvme-core to
handle it and change it again looks convoluted to me...

>   	if (ctrl->queue_count > 1) {
>   		nvme_queue_scan(ctrl);
>   		nvme_enable_aen(ctrl);
>   		queue_work(nvme_wq, &ctrl->async_event_work);
> -		nvme_start_queues(ctrl);
>   	}
>   }
>   
>   void nvme_start_ctrl(struct nvme_ctrl *ctrl)
>   {
> +	nvme_start_queues(ctrl);

You start the queues here?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 1/4] nvme: Sync request queues on reset
  2018-07-16 15:37     ` Keith Busch
@ 2018-07-16 16:36       ` Sagi Grimberg
  2018-07-16 17:12         ` Keith Busch
  0 siblings, 1 reply; 35+ messages in thread
From: Sagi Grimberg @ 2018-07-16 16:36 UTC (permalink / raw)



>> Keith,
>>
>>> This patch fixes races that occur with simultaneous controller
>>> resets by synchronizing request queues prior to initializing the
>>> controller. Without this, a thread may attempt disabling a controller
>>> at the same time as we're trying to enable it.
>>
>> I'll need a little more help with this. This is due to the
>> fact that we disable the controller directly from the timeout
>> handler? If not, what context may disable when nvme_reset_work
>> is running?
>>
>> I'm trying to understand what the sync should serialize and why it is not
>> serialized by the controller state machine.
> 
> The only reason we need this is because each namespace has its own
> request queue with their own timeout work. We don't want all of these
> scheduling multiple controller resets, so the sync here just ensures
> that there is no active timeout work that's about to schedule another
> reset while we're already resetting the controller.

But scheduling a reset while a reset is running should not succeed. You
should not be able to change state RESETTING -> RESETTING

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 1/4] nvme: Sync request queues on reset
  2018-07-16 16:36       ` Sagi Grimberg
@ 2018-07-16 17:12         ` Keith Busch
  2018-07-17 13:40           ` Christoph Hellwig
  2018-07-18 11:46           ` Sagi Grimberg
  0 siblings, 2 replies; 35+ messages in thread
From: Keith Busch @ 2018-07-16 17:12 UTC (permalink / raw)


On Mon, Jul 16, 2018@07:36:41PM +0300, Sagi Grimberg wrote:
> > The only reason we need this is because each namespace has its own
> > request queue with their own timeout work. We don't want all of these
> > scheduling multiple controller resets, so the sync here just ensures
> > that there is no active timeout work that's about to schedule another
> > reset while we're already resetting the controller.
> 
> But scheduling a reset while a reset is running should not succeed. You
> should not be able to change state RESETTING -> RESETTING

Timeout handlers call nvme_dev_disable prior to the reset schedule
attempt, which is the part that we want to prevent occurring concurrently
with an already scheduled reset.
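
For context, the escalation path in the pci driver's timeout handler is roughly
the following (paraphrased, not a verbatim quote of nvme_timeout() in pci.c):

	/*
	 * Reclaim the timed out command by disabling the controller, then
	 * schedule the reset.  Several namespaces' timeout works can reach
	 * this point at once, which is what the sync in patch 1 protects.
	 */
	nvme_dev_disable(dev, false);
	nvme_reset_ctrl(&dev->ctrl);
	return BLK_EH_DONE;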

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 3/4] nvme: Introduce frozen controller state
  2018-07-16 13:36     ` Keith Busch
@ 2018-07-17  1:23       ` Ming Lei
  2018-07-17  5:49         ` jianchao.wang
  2018-07-17 14:06         ` Keith Busch
  0 siblings, 2 replies; 35+ messages in thread
From: Ming Lei @ 2018-07-17  1:23 UTC (permalink / raw)


On Mon, Jul 16, 2018@07:36:40AM -0600, Keith Busch wrote:
> On Mon, Jul 16, 2018@07:09:04PM +0800, Ming Lei wrote:
> > On Fri, Jul 13, 2018@02:56:08PM -0600, Keith Busch wrote:
> > > +	if (ctrl->state == NVME_CTRL_FROZEN) {
> > > +		nvme_wait_freeze(ctrl);
> > > +		blk_mq_update_nr_hw_queues(ctrl->tagset, ctrl->queue_count - 1);
> > > +		nvme_unfreeze(ctrl);
> > > +		if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_LIVE))
> > > +			return;
> > > +	}
> > > +
> > 
> > Now you move unfreezing into start work function, seems this way can't
> > guarantee that nvme_unfreeze() is strictly paired with nvme_start_freeze() done
> > in nvme_dev_disable().
> 
> The state machine is supposed to guarantee this. We only start freeze
> in certain states from nvme_dev_disable, and transition to frozen when
> exiting those states. I agree it's not the easiest API to follow, and
> I'll double check to see if it's correct. I think I have the
> reset/connecting/frozen/live transitions okay for nvme-pci, but I need
> to look again at the deleting/dead states.

I am afraid that may not work well wrt. reset vs. timeout because
timeout still may happen during reset on admin queue, and nvme_dev_disable()
may call nvme_start_freeze() one more time, meantime reset_work won't be
scheduled.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 1/4] nvme: Sync request queues on reset
  2018-07-16 13:30       ` Keith Busch
@ 2018-07-17  5:37         ` jianchao.wang
  0 siblings, 0 replies; 35+ messages in thread
From: jianchao.wang @ 2018-07-17  5:37 UTC (permalink / raw)




On 07/16/2018 09:30 PM, Keith Busch wrote:
>>> On 07/14/2018 04:56 AM, Keith Busch wrote:
>>>> This patch fixes races that occur with simultaneous controller
>>>> resets by synchronizing request queues prior to initializing the
>>>> controller. Without this, a thread may attempt disabling a controller
>>>> at the same time as we're trying to enable it.
>>> This issue is due to the previous blk-mq timeout mechanism, which would hold
>>> all the timed out requests so that nvme_dev_disable could not grab them,
>>> and when it returned, blk_mq_timeout_work could still be running.
>>> But after commit 12f5b9314545 ("blk-mq: Remove generation seqeunce"),
>>> nvme_dev_disable is able to grab all the requests, and when it returns
>>> there will not be any requests left in blk_mq_timeout_work, so nvme_dev_disable
>>> will not be invoked multiple times.
>>> So this patch should be unnecessary. :)
>> There are multiple namespaces, and all may trigger timeout at the same
>> time, so looks the sync is still needed.
> Yes, we still need this syncing because each namespace has a different
> request queue with their own timeout work.
> 

Yes, got it. :)

If there are multiple namespaces and they trigger timeouts at the same time,
nvme_dev_disable could be invoked by multiple namespaces' timeout works.
An nvme_dev_disable coming from another namespace could race with the ctrl reset
work.

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 3/4] nvme: Introduce frozen controller state
  2018-07-17  1:23       ` Ming Lei
@ 2018-07-17  5:49         ` jianchao.wang
  2018-07-17  7:21           ` Ming Lei
  2018-07-17 14:06         ` Keith Busch
  1 sibling, 1 reply; 35+ messages in thread
From: jianchao.wang @ 2018-07-17  5:49 UTC (permalink / raw)




On 07/17/2018 09:23 AM, Ming Lei wrote:
> On Mon, Jul 16, 2018@07:36:40AM -0600, Keith Busch wrote:
>> On Mon, Jul 16, 2018@07:09:04PM +0800, Ming Lei wrote:
>>> On Fri, Jul 13, 2018@02:56:08PM -0600, Keith Busch wrote:
>>>> +	if (ctrl->state == NVME_CTRL_FROZEN) {
>>>> +		nvme_wait_freeze(ctrl);
>>>> +		blk_mq_update_nr_hw_queues(ctrl->tagset, ctrl->queue_count - 1);
>>>> +		nvme_unfreeze(ctrl);
>>>> +		if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_LIVE))
>>>> +			return;
>>>> +	}
>>>> +
>>>
>>> Now you move unfreezing into start work function, seems this way can't
>>> guarantee that nvme_unfreeze() is strictly paired with nvme_start_freeze() done
>>> in nvme_dev_disable().
>>
>> The state machine is supposed to guarantee this. We only start freeze
>> in certain states from nvme_dev_disable, and transition to frozen when
>> exiting those states. I agree it's not the easiest API to follow, and
>> I'll double check to see if it's correct. I think I have the
>> reset/connecting/frozen/live transitions okay for nvme-pci, but I need
>> to look again at the deleting/dead states.
> 
> I am afraid that may not work well wrt. reset vs. timeout because
> timeout still may happen during reset on admin queue, and nvme_dev_disable()
> may call nvme_start_freeze() one more time, meantime reset_work won't be
> scheduled.
> 

Hi Ming

In nvme_dev_disable, except for the shutdown case which will try to drain the IO,
what we want is to stop new IO from coming in. The freeze queue interfaces look too heavy:
after the freeze, we have to wait for the q_usage_counter to be drained.
Is there any chance to just stop new IO coming in, or just drain IO?
For example, use another per-cpu-ref to track the IO.

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 3/4] nvme: Introduce frozen controller state
  2018-07-17  5:49         ` jianchao.wang
@ 2018-07-17  7:21           ` Ming Lei
  2018-07-17  7:28             ` jianchao.wang
  0 siblings, 1 reply; 35+ messages in thread
From: Ming Lei @ 2018-07-17  7:21 UTC (permalink / raw)


On Tue, Jul 17, 2018@01:49:17PM +0800, jianchao.wang wrote:
> 
> 
> On 07/17/2018 09:23 AM, Ming Lei wrote:
> > On Mon, Jul 16, 2018@07:36:40AM -0600, Keith Busch wrote:
> >> On Mon, Jul 16, 2018@07:09:04PM +0800, Ming Lei wrote:
> >>> On Fri, Jul 13, 2018@02:56:08PM -0600, Keith Busch wrote:
> >>>> +	if (ctrl->state == NVME_CTRL_FROZEN) {
> >>>> +		nvme_wait_freeze(ctrl);
> >>>> +		blk_mq_update_nr_hw_queues(ctrl->tagset, ctrl->queue_count - 1);
> >>>> +		nvme_unfreeze(ctrl);
> >>>> +		if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_LIVE))
> >>>> +			return;
> >>>> +	}
> >>>> +
> >>>
> >>> Now you move unfreezing into start work function, seems this way can't
> >>> guarantee that nvme_unfreeze() is strictly paired with nvme_start_freeze() done
> >>> in nvme_dev_disable().
> >>
> >> The state machine is supposed to guarantee this. We only start freeze
> >> in certain states from nvme_dev_disable, and transition to frozen when
> >> exiting those states. I agree it's not the easiest API to follow, and
> >> I'll double check to see if it's correct. I think I have the
> >> reset/connecting/frozen/live transitions okay for nvme-pci, but I need
> >> to look again at the deleting/dead states.
> > 
> > I am afraid that may not work well wrt. reset vs. timeout because
> > timeout still may happen during reset on admin queue, and nvme_dev_disable()
> > may call nvme_start_freeze() one more time, meantime reset_work won't be
> > scheduled.
> > 
> 
> Hi Ming
> 
> In nvme_dev_disable, except for the shutdown case which will try to drain the IO,
> what we want is to stop new IO from coming in. The freeze queue interfaces look too heavy:
> after the freeze, we have to wait for the q_usage_counter to be drained.
> Is there any chance to just stop new IO coming in, or just drain IO?

nvme_start_freeze() is just for stopping new IO.

However, if you want to drain IO, new IO may have to be prevented from
entering the queue first, then that is nvme_start_freeze() + nvme_wait_freeze().

Thanks,
Ming

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 3/4] nvme: Introduce frozen controller state
  2018-07-17  7:21           ` Ming Lei
@ 2018-07-17  7:28             ` jianchao.wang
  2018-07-17 14:32               ` Keith Busch
  0 siblings, 1 reply; 35+ messages in thread
From: jianchao.wang @ 2018-07-17  7:28 UTC (permalink / raw)


Hi Ming

On 07/17/2018 03:21 PM, Ming Lei wrote:
> nvme_start_freeze() is just for stopping new IO.
> 
> However, if you want to drain IO, new IO may have to be prevented from
> entering the queue first, then that is nvme_start_freeze() + nvme_wait_freeze().

Yes, I mean, after we invoke blk_mq_freeze_queue_start, we have to invoke blk_mq_freeze_queue_wait
to drain the q->q_usage_counter, then invoke blk_mq_unfreeze_queue.

But in fact, in this scenario, we just need to stop new IO coming in, no need to drain the IO.
So I say, we may need an interface which just stops new IO coming in, with no need to drain IO before
allowing new IO again.

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 1/4] nvme: Sync request queues on reset
  2018-07-16 17:12         ` Keith Busch
@ 2018-07-17 13:40           ` Christoph Hellwig
  2018-07-17 14:54             ` Keith Busch
  2018-07-18 11:46           ` Sagi Grimberg
  1 sibling, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2018-07-17 13:40 UTC (permalink / raw)


On Mon, Jul 16, 2018@11:12:33AM -0600, Keith Busch wrote:
> On Mon, Jul 16, 2018@07:36:41PM +0300, Sagi Grimberg wrote:
> > > The only reason we need this is because each namespace has its own
> > > request queue with their own timeout work. We don't want all of these
> > > scheduling multiple controller resets, so the sync here just ensures
> > > that there is no active timeout work that's about to schedule another
> > > reset while we're already resetting the controller.
> > 
> > But scheduling a reset while a reset is running should not succeed. You
> > should not be able to change state RESETTING -> RESETTING
> 
> Timeout handlers call nvme_dev_disable prior to the reset schedule
> attempt, which is the part that we want to prevent occurring concurrently
> with an already scheduled reset.

Is there any good way we can get rid of these out of state machine
nvme_dev_disable calls?  It might not be easy, but I think it is
going to help us in the long run.

I also think your original idea of a single work_struct per tag set
for error handling might be a good idea.  This is similar to the
per-host eh thread SCSI had forever, so it might also help SCSI by
getting rid of that and running directly from the block timeout
context.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 3/4] nvme: Introduce frozen controller state
  2018-07-17  1:23       ` Ming Lei
  2018-07-17  5:49         ` jianchao.wang
@ 2018-07-17 14:06         ` Keith Busch
  1 sibling, 0 replies; 35+ messages in thread
From: Keith Busch @ 2018-07-17 14:06 UTC (permalink / raw)


On Tue, Jul 17, 2018@09:23:03AM +0800, Ming Lei wrote:
> On Mon, Jul 16, 2018@07:36:40AM -0600, Keith Busch wrote:
> > On Mon, Jul 16, 2018@07:09:04PM +0800, Ming Lei wrote:
> > > On Fri, Jul 13, 2018@02:56:08PM -0600, Keith Busch wrote:
> > > > +	if (ctrl->state == NVME_CTRL_FROZEN) {
> > > > +		nvme_wait_freeze(ctrl);
> > > > +		blk_mq_update_nr_hw_queues(ctrl->tagset, ctrl->queue_count - 1);
> > > > +		nvme_unfreeze(ctrl);
> > > > +		if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_LIVE))
> > > > +			return;
> > > > +	}
> > > > +
> > > 
> > > Now you move unfreezing into start work function, seems this way can't
> > > guarantee that nvme_unfreeze() is strictly paired with nvme_start_freeze() done
> > > in nvme_dev_disable().
> > 
> > The state machine is supposed to guarantee this. We only start freeze
> > in certain states from nvme_dev_disable, and transition to frozen when
> > exiting those states. I agree it's not the easiest API to follow, and
> > I'll double check to see if it's correct. I think I have the
> > reset/connecting/frozen/live transitions okay for nvme-pci, but I need
> > to look again at the deleting/dead states.
> 
> I am afraid that may not work well wrt. reset vs. timeout because
> timeout still may happen during reset on admin queue, and nvme_dev_disable()
> may call nvme_start_freeze() one more time, meantime reset_work won't be
> scheduled.

The scenario you're describing is going to kill and drain the queues, so
it doesn't matter.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 3/4] nvme: Introduce frozen controller state
  2018-07-17  7:28             ` jianchao.wang
@ 2018-07-17 14:32               ` Keith Busch
  2018-07-18  2:57                 ` jianchao.wang
  0 siblings, 1 reply; 35+ messages in thread
From: Keith Busch @ 2018-07-17 14:32 UTC (permalink / raw)


On Tue, Jul 17, 2018@03:28:33PM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 07/17/2018 03:21 PM, Ming Lei wrote:
> > nvme_start_freeze() is just for stopping new IO.
> > 
> > However, if you want to drain IO, new IO may have to be prevented from
> > being entering queue first, then that is nvme_start_freeze() + nvme_wait_freeze().
> 
> Yes, I mean, after we invoke blk_mq_freeze_queue_start, we have to invoke blk_mq_freeze_queue_wait
> to drain the q->q_usage_counter, then invoke blk_mq_unfreeze_queue.
> 
> But in fact, in this scenario, we just need to stop new IO coming in, no need to drain the IO.
> So I say, we may need an interface which just stops new IO coming in, with no need to drain IO before
> allowing new IO again.

I think what you are asking for is a way to unfreeze a queue that
doesn't have a zero ref count, but I think that breaks the percpu_ref.
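
For reference, blk_mq_unfreeze_queue() is roughly the following (paraphrased
from memory, not a verbatim copy):

	void blk_mq_unfreeze_queue(struct request_queue *q)
	{
		if (!atomic_dec_return(&q->mq_freeze_depth)) {
			/* assumes the killed q_usage_counter has already drained to zero */
			percpu_ref_reinit(&q->q_usage_counter);
			wake_up_all(&q->mq_freeze_wq);
		}
	}

so re-arming the ref while requests still hold references isn't something the
current interface is prepared for.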

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 1/4] nvme: Sync request queues on reset
  2018-07-17 13:40           ` Christoph Hellwig
@ 2018-07-17 14:54             ` Keith Busch
  0 siblings, 0 replies; 35+ messages in thread
From: Keith Busch @ 2018-07-17 14:54 UTC (permalink / raw)


On Tue, Jul 17, 2018@03:40:48PM +0200, Christoph Hellwig wrote:
> On Mon, Jul 16, 2018@11:12:33AM -0600, Keith Busch wrote:
> > On Mon, Jul 16, 2018@07:36:41PM +0300, Sagi Grimberg wrote:
> > > > The only reason we need this is because each namespace has its own
> > > > request queue with their own timeout work. We don't want all of these
> > > > scheduling multiple controller resets, so the sync here just ensures
> > > > that there is no active timeout work that's about to schedule another
> > > > reset while we're already resetting the controller.
> > > 
> > > But scheduling a reset while a reset is running should not succeed. You
> > > should not be able to change state RESETTING -> RESETTING
> > 
> > Timeout handlers call nvme_dev_disable prior to the reset schedule
> > attempt, which is the part that we want to prevent occurring concurrently
> > with an already scheduled reset.
> 
> Is there any good way we can get rid of these out of state machine
> nvme_dev_disable calls?  It might not be easy, but I think it is
> going to help us in the long run.

Possibly, I'll stare at this a bit more.
 
> I also think your original idea of a single work_struct per tag set
> for error handling might be a good idea.  This is similar to the
> per-host eh thread SCSI had forever, so it might also help SCSI by
> getting rid of that and running directly from the block timeout
> context.

Okay, I'll re-examine that option again. It gets a little messy with
the legacy request interface, but that's okay.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 3/4] nvme: Introduce frozen controller state
  2018-07-16 16:34   ` Sagi Grimberg
@ 2018-07-17 16:05     ` James Smart
  2018-07-17 16:17       ` Keith Busch
  0 siblings, 1 reply; 35+ messages in thread
From: James Smart @ 2018-07-17 16:05 UTC (permalink / raw)




On 7/16/2018 9:34 AM, Sagi Grimberg wrote:
>
>> +	if (ctrl->state == NVME_CTRL_FROZEN) {
>> +		nvme_wait_freeze(ctrl);
>> +		blk_mq_update_nr_hw_queues(ctrl->tagset, ctrl->queue_count - 1);
>> +		nvme_unfreeze(ctrl);
>> +		if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_LIVE))
>> +			return;
>> +	}
>> +
>
> Having the transport drivers setup a state to indicate nvme-core to
> handle it and change it again looks convoluted to me...
>

I'll second the comment.

-- james

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 3/4] nvme: Introduce frozen controller state
  2018-07-17 16:05     ` James Smart
@ 2018-07-17 16:17       ` Keith Busch
  2018-07-18 12:20         ` Sagi Grimberg
  0 siblings, 1 reply; 35+ messages in thread
From: Keith Busch @ 2018-07-17 16:17 UTC (permalink / raw)


On Tue, Jul 17, 2018@09:05:47AM -0700, James Smart wrote:
> 
> 
> On 7/16/2018 9:34 AM, Sagi Grimberg wrote:
> > 
> > > +	if (ctrl->state == NVME_CTRL_FROZEN) {
> > > +		nvme_wait_freeze(ctrl);
> > > +		blk_mq_update_nr_hw_queues(ctrl->tagset, ctrl->queue_count - 1);
> > > +		nvme_unfreeze(ctrl);
> > > +		if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_LIVE))
> > > +			return;
> > > +	}
> > > +
> > 
> > Having the transport drivers setup a state to indicate nvme-core to
> > handle it and change it again looks convoluted to me...
> > 
> 
> I'll second the comment.
> 
> -- james

You definitely do not need to use this state if it doesn't make sense
for your transport. We just need a context that is safe to dispatch
blocking IO when coming out of a reset. I can move more of this back
to pci if we want to provide a new nvme_ctrl_ops callback if you really
don't like a generic solution.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 3/4] nvme: Introduce frozen controller state
  2018-07-17 14:32               ` Keith Busch
@ 2018-07-18  2:57                 ` jianchao.wang
  0 siblings, 0 replies; 35+ messages in thread
From: jianchao.wang @ 2018-07-18  2:57 UTC (permalink / raw)



Hi Keith

Thanks for your kindly response :)

On 07/17/2018 10:32 PM, Keith Busch wrote:
> On Tue, Jul 17, 2018@03:28:33PM +0800, jianchao.wang wrote:
>> Hi Ming
>>
>> On 07/17/2018 03:21 PM, Ming Lei wrote:
>>> nvme_start_freeze() is just for stopping new IO.
>>>
>>> However, if you want to drain IO, new IO may have to be prevented from
>>> entering the queue first, then that is nvme_start_freeze() + nvme_wait_freeze().
>>
>> Yes, I mean, after we invoke blk_mq_freeze_queue_start, we have to invoke blk_mq_freeze_queue_wait
>> to drain the q->q_usage_counter, then invoke blk_mq_unfreeze_queue.
>>
>> But in fact, in this scenario, we just need to stop new IO coming in, no need to drain the IO.
>> So I say, we may need an interface which just stops new IO coming in, with no need to drain IO before
>> allowing new IO again.
> 
> I think what you are asking for is a way to unfreeze a queue that
> doesn't have a zero ref count, but I think that breaks the percpu_ref.

Yes.
IMO, this is actually a defect of the current freeze queue mechanism.
The q->q_usage_counter basically includes
 - the number of callers which enter into q->make_request_fn from generic_make_request
 - allocated requests
(just ignore other corners here)

When we freeze the request queue, no one would cross the blk_queue_enter in generic_make_request,
this is expected. We want to prevent new IO from coming in.

But when we unfreeze the request_queue, we have to wait for the q->q_usage_counter to be drained.
That means waiting for all the entered requests to be drained, and this is not what we want.

If we split the number of allocated requests away from the q->q_usage_counter, we will not
have to drain IOs when unfreezing the request_queue.
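
To make that concrete, the pattern under discussion looks like this when written
with the existing nvme wrappers (sketch only):

	nvme_start_freeze(ctrl);	/* kill q_usage_counter: new IO blocks in blk_queue_enter() */
	/* ... disable and later re-enable the controller ... */
	nvme_wait_freeze(ctrl);		/* wait for every entered request to drain - the costly part */
	nvme_unfreeze(ctrl);		/* re-init the percpu ref and let new IO in */

An interface that only stopped new submissions would let the last two steps go
away for the non-shutdown case.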

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 1/4] nvme: Sync request queues on reset
  2018-07-16 17:12         ` Keith Busch
  2018-07-17 13:40           ` Christoph Hellwig
@ 2018-07-18 11:46           ` Sagi Grimberg
  2018-07-18 13:52             ` Keith Busch
  1 sibling, 1 reply; 35+ messages in thread
From: Sagi Grimberg @ 2018-07-18 11:46 UTC (permalink / raw)



>> But scheduling a reset while a reset is running should not succeed. You
>> should not be able to change state RESETTING -> RESETTING
> 
> Timeout handlers call nvme_dev_disable prior to the reset schedule
> attempt, which is the part that we want to prevent occuring concurrently
> with an already scheduled reset.

So why not make the timeout handlers not call nvme_dev_disable? It looks
like this is causing us to jump through hoops to accommodate it.

I think I asked it a couple of times before, why isn't the timeout
handler simply either:
1. abort, or
2. queue a reset (and return BLK_EH_RESET_TIMER), or
3. fail early the I/O if it is coming from the reset context (and
    return BLK_EH_DONE)

*If* there is a rare race where the timed out I/O from (3) is completing
with the timeout handler concurrently, we can simply suspend that queue
alone.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 3/4] nvme: Introduce frozen controller state
  2018-07-17 16:17       ` Keith Busch
@ 2018-07-18 12:20         ` Sagi Grimberg
  2018-07-18 13:53           ` Keith Busch
  0 siblings, 1 reply; 35+ messages in thread
From: Sagi Grimberg @ 2018-07-18 12:20 UTC (permalink / raw)



>>> Having the transport drivers setup a state to indicate nvme-core to
>>> handle it and change it again looks convoluted to me...
>>>
>>
>> I'll second the comment.
>>
>> -- james
> 
> You definitely do not need to use this state if it doesn't make sense
> for your transport. We just need a context that is safe to dispatch
> blocking IO when coming out of a reset. I can move more of this back
> to pci if we want to provide a new nvme_ctrl_ops callback if you really
> don't like a generic solution.

I want stuff in the core that makes sense to all transports, and I
believe this area has much commonality between them. But currently,
the transports do different stuff in their timeout handler which
brings different sets of logic.

rdma simply queues a controller reset, never aborts and is probably
buggy in that area. fc does pretty much the same thing.

I suggest to make pci timeout handler to:
1. abort (if not aborted yet) and return BLK_EH_RESET_TIMER, or
2. queue a reset and return BLK_EH_RESET_TIMER, or
3. fail early the I/O if it is coming from the reset context (state == 
CONNECTING) and return BLK_EH_DONE.

Which at least in my mind makes sense to all transports so if we have it
we can maybe move it up to nvme-core?
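
In code, something like the following (just a sketch of the flow, not a tested
patch; "already_aborted" and nvme_submit_abort() are stand-ins for whatever
per-transport state and abort path a real patch would use):

	static enum blk_eh_timer_return nvme_timeout_sketch(struct nvme_ctrl *ctrl,
					struct request *req, bool already_aborted)
	{
		if (ctrl->state == NVME_CTRL_CONNECTING) {
			/* 3. IO issued by the reset/connect path: fail it early */
			nvme_req(req)->status = NVME_SC_ABORT_REQ;
			blk_mq_complete_request(req);
			return BLK_EH_DONE;
		}

		if (!already_aborted) {
			/* 1. first expiration: try a per-command abort */
			nvme_submit_abort(ctrl, req);	/* hypothetical helper */
			return BLK_EH_RESET_TIMER;
		}

		/* 2. the abort didn't help: queue a controller reset and wait again */
		nvme_reset_ctrl(ctrl);
		return BLK_EH_RESET_TIMER;
	}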

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 1/4] nvme: Sync request queues on reset
  2018-07-18 11:46           ` Sagi Grimberg
@ 2018-07-18 13:52             ` Keith Busch
  0 siblings, 0 replies; 35+ messages in thread
From: Keith Busch @ 2018-07-18 13:52 UTC (permalink / raw)


On Wed, Jul 18, 2018@02:46:55PM +0300, Sagi Grimberg wrote:
> 
> > > But scheduling a reset while a reset is running should not succeed. You
> > > should not be able to change state RESETTING -> RESETTING
> > 
> > Timeout handlers call nvme_dev_disable prior to the reset schedule
> > attempt, which is the part that we want to prevent occuring concurrently
> > with an already scheduled reset.
> 
> So why not make the timeout handlers not call nvme_dev_disable? It looks
> like this is causing us to jump through hoops to accommodate it.
> 
> I think I asked it a couple of times before, why isn't the timeout
> handler simply either:
> 1. abort, or
> 2. queue a reset (and return BLK_EH_RESET_TIMER), or
> 3. fail early the I/O if it is coming from the reset context (and
>    return BLK_EH_DONE)
> 
> *If* there is a rare race where the timed out I/O from (3) is completing
> with the timeout handler concurrently, we can simply suspend that queue
> alone.

You're just moving the hoop you have to jump through. Handling it inline
with the timeout handler reclaims lost commands in starting, deleting,
and live contexts. The existing reset_work can only reclaim from live.

If you want to create an intermediary work just to do the reset, you'll
first need to untangle the state machine to allow that to happen in a
deleting state, and that work needs to sync with these request queues
anyway, and then everyone else needs to sync with that new work we
created just to do something that is already handled in a work context
that we already sync with.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 3/4] nvme: Introduce frozen controller state
  2018-07-18 12:20         ` Sagi Grimberg
@ 2018-07-18 13:53           ` Keith Busch
  0 siblings, 0 replies; 35+ messages in thread
From: Keith Busch @ 2018-07-18 13:53 UTC (permalink / raw)


On Wed, Jul 18, 2018@03:20:53PM +0300, Sagi Grimberg wrote:
> 
> > > > Having the transport drivers setup a state to indicate nvme-core to
> > > > handle it and change it again looks convoluted to me...
> > > > 
> > > 
> > > I'll second the comment.
> > > 
> > > -- james
> > 
> > You definitely do not need to use this state if it doesn't make sense
> > for your transport. We just need a context that is safe to dispatch
> > blocking IO when coming out of a reset. I can move more of this back
> > to pci if we want to provide a new nvme_ctrl_ops callback if you really
> > don't like a generic solution.
> 
> I want stuff in the core that makes sense to all transports, and I
> believe this area has much commonality between them. But currently,
> the transports do different stuff in their timeout handler which
> brings different sets of logic.
> 
> rdma simply queues a controller reset, never aborts and is probably
> buggy in that area. fc does pretty much the same thing.
> 
> I suggest to make pci timeout handler to:
> 1. abort (if not aborted yet) and return BLK_EH_RESET_TIMER, or
> 2. queue a reset and return BLK_EH_RESET_TIMER, or
> 3. fail early the I/O if it is coming from the reset context (state ==
> CONNECTING) and return BLK_EH_DONE.

You can't just return IO early in the connecting state. You have to disable
the controller first, or you're going to get memory corruption.
 
> Which at least in my mind makes sense to all transports so if we have it
> we can maybe move it up to nvme-core?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCHv4 0/4] nvme timeout updates
  2018-07-13 20:56 [PATCHv4 0/4] nvme timeout updates Keith Busch
                   ` (3 preceding siblings ...)
  2018-07-13 20:56 ` [PATCHv4 4/4] nvme-pci: Use controller start work to dispatch IO Keith Busch
@ 2018-07-19 19:48 ` Scott Bauer
  4 siblings, 0 replies; 35+ messages in thread
From: Scott Bauer @ 2018-07-19 19:48 UTC (permalink / raw)


On Fri, Jul 13, 2018@02:56:05PM -0600, Keith Busch wrote:
> The majority of the v3 was either merged or no longer necessary.
> 
> Patch 1 is still the same as before.
> 
> The remaining patches 2-4 are setting up the nvme core to handle
> unfreezing queues and dispatching IO from a context that is safe to
> handle IO timeouts. The pci nvme is the only one making use of that at
> the moment, but I suspect fabrics can make something of it.
> 
> The patch retrying a failed reset on an admin init timeout has been
> dropped. If we really need to do this, I think more discussion will help
> so we are creating less arbitrary rules.
> 
> Keith Busch (4):
>   nvme: Sync request queues on reset
>   nvme: Start controller in own work queue
>   nvme: Introduce frozen controller state
>   nvme-pci: Use controller start work to dispatch IO
> 
>  drivers/nvme/host/core.c | 54 ++++++++++++++++++++++++++++++++++++++++++++----
>  drivers/nvme/host/nvme.h |  3 +++
>  drivers/nvme/host/pci.c  | 20 ++++++------------
>  3 files changed, 59 insertions(+), 18 deletions(-)
> 
> -- 
> 2.14.3

Thanks Keith!

Tested-by: Scott Bauer <scott.bauer at intel.com>

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread

Thread overview: 35+ messages
2018-07-13 20:56 [PATCHv4 0/4] nvme timeout updates Keith Busch
2018-07-13 20:56 ` [PATCHv4 1/4] nvme: Sync request queues on reset Keith Busch
2018-07-16  8:52   ` jianchao.wang
2018-07-16 10:39     ` Ming Lei
2018-07-16 13:30       ` Keith Busch
2018-07-17  5:37         ` jianchao.wang
2018-07-16 14:51   ` Sagi Grimberg
2018-07-16 15:37     ` Keith Busch
2018-07-16 16:36       ` Sagi Grimberg
2018-07-16 17:12         ` Keith Busch
2018-07-17 13:40           ` Christoph Hellwig
2018-07-17 14:54             ` Keith Busch
2018-07-18 11:46           ` Sagi Grimberg
2018-07-18 13:52             ` Keith Busch
2018-07-13 20:56 ` [PATCHv4 2/4] nvme: Start controller in own work queue Keith Busch
2018-07-16 15:00   ` Sagi Grimberg
2018-07-16 15:35     ` Keith Busch
2018-07-13 20:56 ` [PATCHv4 3/4] nvme: Introduce frozen controller state Keith Busch
2018-07-16  9:02   ` jianchao.wang
2018-07-16 11:09   ` Ming Lei
2018-07-16 13:36     ` Keith Busch
2018-07-17  1:23       ` Ming Lei
2018-07-17  5:49         ` jianchao.wang
2018-07-17  7:21           ` Ming Lei
2018-07-17  7:28             ` jianchao.wang
2018-07-17 14:32               ` Keith Busch
2018-07-18  2:57                 ` jianchao.wang
2018-07-17 14:06         ` Keith Busch
2018-07-16 16:34   ` Sagi Grimberg
2018-07-17 16:05     ` James Smart
2018-07-17 16:17       ` Keith Busch
2018-07-18 12:20         ` Sagi Grimberg
2018-07-18 13:53           ` Keith Busch
2018-07-13 20:56 ` [PATCHv4 4/4] nvme-pci: Use controller start work to dispatch IO Keith Busch
2018-07-19 19:48 ` [PATCHv4 0/4] nvme timeout updates Scott Bauer
