All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] Reset timeout for paused hardware
@ 2019-05-22 17:48 ` Keith Busch
  0 siblings, 0 replies; 30+ messages in thread
From: Keith Busch @ 2019-05-22 17:48 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, linux-nvme, linux-block
  Cc: Ming Lei, Keith Busch

Hardware may temporarily stop processing commands that have
been dispatched to it while activating new firmware. Some target
implementation's paused state time exceeds the default request expiry,
so any request dispatched before the driver could quiesce for the
hardware's paused state will time out, and handling this may interrupt
the firmware activation.

This two-part series provides a way for drivers to reset dispatched
requests' timeout deadline, then uses this new mechanism from the nvme
driver's fw activation work.

Keith Busch (2):
  blk-mq: provide way to reset rq timeouts
  nvme: reset request timeouts during fw activation

 block/blk-mq.c           | 30 ++++++++++++++++++++++++++++++
 drivers/nvme/host/core.c | 20 ++++++++++++++++++++
 include/linux/blk-mq.h   |  1 +
 3 files changed, 51 insertions(+)

-- 
2.14.4


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 0/2] Reset timeout for paused hardware
@ 2019-05-22 17:48 ` Keith Busch
  0 siblings, 0 replies; 30+ messages in thread
From: Keith Busch @ 2019-05-22 17:48 UTC (permalink / raw)


Hardware may temporarily stop processing commands that have
been dispatched to it while activating new firmware. Some target
implementation's paused state time exceeds the default request expiry,
so any request dispatched before the driver could quiesce for the
hardware's paused state will time out, and handling this may interrupt
the firmware activation.

This two-part series provides a way for drivers to reset dispatched
requests' timeout deadline, then uses this new mechanism from the nvme
driver's fw activation work.

Keith Busch (2):
  blk-mq: provide way to reset rq timeouts
  nvme: reset request timeouts during fw activation

 block/blk-mq.c           | 30 ++++++++++++++++++++++++++++++
 drivers/nvme/host/core.c | 20 ++++++++++++++++++++
 include/linux/blk-mq.h   |  1 +
 3 files changed, 51 insertions(+)

-- 
2.14.4

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 1/2] blk-mq: provide way to reset rq deadline
  2019-05-22 17:48 ` Keith Busch
@ 2019-05-22 17:48   ` Keith Busch
  -1 siblings, 0 replies; 30+ messages in thread
From: Keith Busch @ 2019-05-22 17:48 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, linux-nvme, linux-block
  Cc: Ming Lei, Keith Busch

If hardware has temporarily paused processing requests that its driver
has dispatched, the driver may need to halt timeout detection during that
temporary stoppage. When the hardware has un-paused request processing,
the driver needs a way to rebase all dispatched request dealines using
current jiffies so that time accrued during the paused state doesn't
count against that request.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 block/blk-mq.c         | 30 ++++++++++++++++++++++++++++++
 include/linux/blk-mq.h |  1 +
 2 files changed, 31 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 08a6248d8536..9d85903d4e0c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -960,6 +960,36 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	blk_queue_exit(q);
 }
 
+static bool blk_mq_reset_rq(struct blk_mq_hw_ctx *hctx, struct request *rq,
+			    void *priv, bool reserved)
+{
+	if (blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT)
+		blk_add_timer(rq);
+	return true;
+}
+
+/**
+ * blk_mq_reset_rqs - reset the timers on all inflight requests
+ *
+ * @q - request queue to iterate
+ *
+ * This is intended for use when a driver detects its hardware has temporarily
+ * paused processing commands. When the condition is initially detected, the
+ * driver should call blk_mq_quiesce_queue() to block new requests from
+ * dispatching, then blk_sync_queue() to halt any timeout work. When the
+ * hardware becomes operational again, the driver should call this function to
+ * reset previously dispatched request timers who's processing had been paused
+ * by the hardware, then call blk_mq_unquiesce_queue() to unblock future
+ * dispatch.
+ */
+void blk_mq_reset_rqs(struct request_queue *q)
+{
+	if (WARN_ON(!blk_queue_quiesced(q)))
+		return;
+	blk_mq_queue_tag_busy_iter(q, blk_mq_reset_rq, NULL);
+}
+EXPORT_SYMBOL_GPL(blk_mq_reset_rqs);
+
 struct flush_busy_ctx_data {
 	struct blk_mq_hw_ctx *hctx;
 	struct list_head *list;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 15d1aa53d96c..28c421ba5094 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -327,6 +327,7 @@ void blk_freeze_queue_start(struct request_queue *q);
 void blk_mq_freeze_queue_wait(struct request_queue *q);
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 				     unsigned long timeout);
+void blk_mq_reset_rqs(struct request_queue *q);
 
 int blk_mq_map_queues(struct blk_mq_queue_map *qmap);
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
-- 
2.14.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 1/2] blk-mq: provide way to reset rq deadline
@ 2019-05-22 17:48   ` Keith Busch
  0 siblings, 0 replies; 30+ messages in thread
From: Keith Busch @ 2019-05-22 17:48 UTC (permalink / raw)


If hardware has temporarily paused processing requests that its driver
has dispatched, the driver may need to halt timeout detection during that
temporary stoppage. When the hardware has un-paused request processing,
the driver needs a way to rebase all dispatched request dealines using
current jiffies so that time accrued during the paused state doesn't
count against that request.

Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 block/blk-mq.c         | 30 ++++++++++++++++++++++++++++++
 include/linux/blk-mq.h |  1 +
 2 files changed, 31 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 08a6248d8536..9d85903d4e0c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -960,6 +960,36 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	blk_queue_exit(q);
 }
 
+static bool blk_mq_reset_rq(struct blk_mq_hw_ctx *hctx, struct request *rq,
+			    void *priv, bool reserved)
+{
+	if (blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT)
+		blk_add_timer(rq);
+	return true;
+}
+
+/**
+ * blk_mq_reset_rqs - reset the timers on all inflight requests
+ *
+ * @q - request queue to iterate
+ *
+ * This is intended for use when a driver detects its hardware has temporarily
+ * paused processing commands. When the condition is initially detected, the
+ * driver should call blk_mq_quiesce_queue() to block new requests from
+ * dispatching, then blk_sync_queue() to halt any timeout work. When the
+ * hardware becomes operational again, the driver should call this function to
+ * reset previously dispatched request timers who's processing had been paused
+ * by the hardware, then call blk_mq_unquiesce_queue() to unblock future
+ * dispatch.
+ */
+void blk_mq_reset_rqs(struct request_queue *q)
+{
+	if (WARN_ON(!blk_queue_quiesced(q)))
+		return;
+	blk_mq_queue_tag_busy_iter(q, blk_mq_reset_rq, NULL);
+}
+EXPORT_SYMBOL_GPL(blk_mq_reset_rqs);
+
 struct flush_busy_ctx_data {
 	struct blk_mq_hw_ctx *hctx;
 	struct list_head *list;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 15d1aa53d96c..28c421ba5094 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -327,6 +327,7 @@ void blk_freeze_queue_start(struct request_queue *q);
 void blk_mq_freeze_queue_wait(struct request_queue *q);
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 				     unsigned long timeout);
+void blk_mq_reset_rqs(struct request_queue *q);
 
 int blk_mq_map_queues(struct blk_mq_queue_map *qmap);
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 2/2] nvme: reset request timeouts during fw activation
  2019-05-22 17:48 ` Keith Busch
@ 2019-05-22 17:48   ` Keith Busch
  -1 siblings, 0 replies; 30+ messages in thread
From: Keith Busch @ 2019-05-22 17:48 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, linux-nvme, linux-block
  Cc: Ming Lei, Keith Busch

The nvme controller may pause command processing during firmware
activation. The driver will quiesce queues during this time, but commands
dispatched prior to the notification will not be processed until the
hardware completes this activation.

We do not want those requests to time out while the hardware is in
this paused state as we don't expect those commands to complete during
this time, and that handling will interfere with the firmware activation
process.

In addition to quiescing the queues, halt timeout detection during the
paused state and reset the dispatched request deadlines when the hardware
exists that state. This action applies to IO and admin queues.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/nvme/host/core.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1b7c2afd84cb..37a9a66ada22 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -89,6 +89,7 @@ static dev_t nvme_chr_devt;
 static struct class *nvme_class;
 static struct class *nvme_subsys_class;
 
+static void nvme_reset_queues(struct nvme_ctrl *ctrl);
 static int nvme_revalidate_disk(struct gendisk *disk);
 static void nvme_put_subsystem(struct nvme_subsystem *subsys);
 static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
@@ -3605,6 +3606,11 @@ static void nvme_fw_act_work(struct work_struct *work)
 				msecs_to_jiffies(admin_timeout * 1000);
 
 	nvme_stop_queues(ctrl);
+	nvme_sync_queues(ctrl);
+
+	blk_mq_quiesce_queue(ctrl->admin_q);
+	blk_sync_queue(ctrl->admin_q);
+
 	while (nvme_ctrl_pp_status(ctrl)) {
 		if (time_after(jiffies, fw_act_timeout)) {
 			dev_warn(ctrl->device,
@@ -3618,7 +3624,12 @@ static void nvme_fw_act_work(struct work_struct *work)
 	if (ctrl->state != NVME_CTRL_LIVE)
 		return;
 
+	blk_mq_reset_rqs(ctrl->admin_q);
+	blk_mq_unquiesce_queue(ctrl->admin_q);
+
+	nvme_reset_queues(ctrl);
 	nvme_start_queues(ctrl);
+
 	/* read FW slot information to clear the AER */
 	nvme_get_fw_slot_info(ctrl);
 }
@@ -3901,6 +3912,15 @@ void nvme_start_queues(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_start_queues);
 
+static void nvme_reset_queues(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	down_read(&ctrl->namespaces_rwsem);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		blk_mq_reset_rqs(ns->queue);
+	up_read(&ctrl->namespaces_rwsem);
+}
 
 void nvme_sync_queues(struct nvme_ctrl *ctrl)
 {
-- 
2.14.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 2/2] nvme: reset request timeouts during fw activation
@ 2019-05-22 17:48   ` Keith Busch
  0 siblings, 0 replies; 30+ messages in thread
From: Keith Busch @ 2019-05-22 17:48 UTC (permalink / raw)


The nvme controller may pause command processing during firmware
activation. The driver will quiesce queues during this time, but commands
dispatched prior to the notification will not be processed until the
hardware completes this activation.

We do not want those requests to time out while the hardware is in
this paused state as we don't expect those commands to complete during
this time, and that handling will interfere with the firmware activation
process.

In addition to quiescing the queues, halt timeout detection during the
paused state and reset the dispatched request deadlines when the hardware
exists that state. This action applies to IO and admin queues.

Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/core.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1b7c2afd84cb..37a9a66ada22 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -89,6 +89,7 @@ static dev_t nvme_chr_devt;
 static struct class *nvme_class;
 static struct class *nvme_subsys_class;
 
+static void nvme_reset_queues(struct nvme_ctrl *ctrl);
 static int nvme_revalidate_disk(struct gendisk *disk);
 static void nvme_put_subsystem(struct nvme_subsystem *subsys);
 static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
@@ -3605,6 +3606,11 @@ static void nvme_fw_act_work(struct work_struct *work)
 				msecs_to_jiffies(admin_timeout * 1000);
 
 	nvme_stop_queues(ctrl);
+	nvme_sync_queues(ctrl);
+
+	blk_mq_quiesce_queue(ctrl->admin_q);
+	blk_sync_queue(ctrl->admin_q);
+
 	while (nvme_ctrl_pp_status(ctrl)) {
 		if (time_after(jiffies, fw_act_timeout)) {
 			dev_warn(ctrl->device,
@@ -3618,7 +3624,12 @@ static void nvme_fw_act_work(struct work_struct *work)
 	if (ctrl->state != NVME_CTRL_LIVE)
 		return;
 
+	blk_mq_reset_rqs(ctrl->admin_q);
+	blk_mq_unquiesce_queue(ctrl->admin_q);
+
+	nvme_reset_queues(ctrl);
 	nvme_start_queues(ctrl);
+
 	/* read FW slot information to clear the AER */
 	nvme_get_fw_slot_info(ctrl);
 }
@@ -3901,6 +3912,15 @@ void nvme_start_queues(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_start_queues);
 
+static void nvme_reset_queues(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	down_read(&ctrl->namespaces_rwsem);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		blk_mq_reset_rqs(ns->queue);
+	up_read(&ctrl->namespaces_rwsem);
+}
 
 void nvme_sync_queues(struct nvme_ctrl *ctrl)
 {
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] Reset timeout for paused hardware
  2019-05-22 17:48 ` Keith Busch
@ 2019-05-22 20:20   ` Bart Van Assche
  -1 siblings, 0 replies; 30+ messages in thread
From: Bart Van Assche @ 2019-05-22 20:20 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe, Christoph Hellwig, linux-nvme, linux-block
  Cc: Ming Lei

On 5/22/19 7:48 PM, Keith Busch wrote:
> Hardware may temporarily stop processing commands that have
> been dispatched to it while activating new firmware. Some target
> implementation's paused state time exceeds the default request expiry,
> so any request dispatched before the driver could quiesce for the
> hardware's paused state will time out, and handling this may interrupt
> the firmware activation.
> 
> This two-part series provides a way for drivers to reset dispatched
> requests' timeout deadline, then uses this new mechanism from the nvme
> driver's fw activation work.

Hi Keith,

Is it essential to modify the block layer to implement this behavior
change? Would it be possible to implement this behavior change by
modifying the NVMe driver only, e.g. by modifying the nvme_timeout()
function and by making that function return BLK_EH_RESET_TIMER while new
firmware is being activated?

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 0/2] Reset timeout for paused hardware
@ 2019-05-22 20:20   ` Bart Van Assche
  0 siblings, 0 replies; 30+ messages in thread
From: Bart Van Assche @ 2019-05-22 20:20 UTC (permalink / raw)


On 5/22/19 7:48 PM, Keith Busch wrote:
> Hardware may temporarily stop processing commands that have
> been dispatched to it while activating new firmware. Some target
> implementation's paused state time exceeds the default request expiry,
> so any request dispatched before the driver could quiesce for the
> hardware's paused state will time out, and handling this may interrupt
> the firmware activation.
> 
> This two-part series provides a way for drivers to reset dispatched
> requests' timeout deadline, then uses this new mechanism from the nvme
> driver's fw activation work.

Hi Keith,

Is it essential to modify the block layer to implement this behavior
change? Would it be possible to implement this behavior change by
modifying the NVMe driver only, e.g. by modifying the nvme_timeout()
function and by making that function return BLK_EH_RESET_TIMER while new
firmware is being activated?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] Reset timeout for paused hardware
  2019-05-22 20:20   ` Bart Van Assche
@ 2019-05-22 20:28     ` Keith Busch
  -1 siblings, 0 replies; 30+ messages in thread
From: Keith Busch @ 2019-05-22 20:28 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, linux-nvme,
	linux-block, Ming Lei

On Wed, May 22, 2019 at 10:20:45PM +0200, Bart Van Assche wrote:
> On 5/22/19 7:48 PM, Keith Busch wrote:
> > Hardware may temporarily stop processing commands that have
> > been dispatched to it while activating new firmware. Some target
> > implementation's paused state time exceeds the default request expiry,
> > so any request dispatched before the driver could quiesce for the
> > hardware's paused state will time out, and handling this may interrupt
> > the firmware activation.
> > 
> > This two-part series provides a way for drivers to reset dispatched
> > requests' timeout deadline, then uses this new mechanism from the nvme
> > driver's fw activation work.
> 
> Hi Keith,
> 
> Is it essential to modify the block layer to implement this behavior
> change? Would it be possible to implement this behavior change by
> modifying the NVMe driver only, e.g. by modifying the nvme_timeout()
> function and by making that function return BLK_EH_RESET_TIMER while new
> firmware is being activated?

Good question.

We can't just do this from nvme_timeout(), though. That introduces races
between timeout_work and fw_act_work if that fw work clears the
condition that timeout needs to observe to return RESET_TIMER.

Even if we avoid that race, the rq->deadline needs to be adjusted to
the current time after the h/w unpause because the time accumulated while
h/w halted itself should not be counted against the request.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 0/2] Reset timeout for paused hardware
@ 2019-05-22 20:28     ` Keith Busch
  0 siblings, 0 replies; 30+ messages in thread
From: Keith Busch @ 2019-05-22 20:28 UTC (permalink / raw)


On Wed, May 22, 2019@10:20:45PM +0200, Bart Van Assche wrote:
> On 5/22/19 7:48 PM, Keith Busch wrote:
> > Hardware may temporarily stop processing commands that have
> > been dispatched to it while activating new firmware. Some target
> > implementation's paused state time exceeds the default request expiry,
> > so any request dispatched before the driver could quiesce for the
> > hardware's paused state will time out, and handling this may interrupt
> > the firmware activation.
> > 
> > This two-part series provides a way for drivers to reset dispatched
> > requests' timeout deadline, then uses this new mechanism from the nvme
> > driver's fw activation work.
> 
> Hi Keith,
> 
> Is it essential to modify the block layer to implement this behavior
> change? Would it be possible to implement this behavior change by
> modifying the NVMe driver only, e.g. by modifying the nvme_timeout()
> function and by making that function return BLK_EH_RESET_TIMER while new
> firmware is being activated?

Good question.

We can't just do this from nvme_timeout(), though. That introduces races
between timeout_work and fw_act_work if that fw work clears the
condition that timeout needs to observe to return RESET_TIMER.

Even if we avoid that race, the rq->deadline needs to be adjusted to
the current time after the h/w unpause because the time accumulated while
h/w halted itself should not be counted against the request.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] Reset timeout for paused hardware
  2019-05-22 17:48 ` Keith Busch
@ 2019-05-23  3:29   ` Ming Lei
  -1 siblings, 0 replies; 30+ messages in thread
From: Ming Lei @ 2019-05-23  3:29 UTC (permalink / raw)
  To: Keith Busch; +Cc: Jens Axboe, Christoph Hellwig, linux-nvme, linux-block

On Wed, May 22, 2019 at 11:48:10AM -0600, Keith Busch wrote:
> Hardware may temporarily stop processing commands that have
> been dispatched to it while activating new firmware. Some target
> implementation's paused state time exceeds the default request expiry,
> so any request dispatched before the driver could quiesce for the
> hardware's paused state will time out, and handling this may interrupt
> the firmware activation.
> 
> This two-part series provides a way for drivers to reset dispatched
> requests' timeout deadline, then uses this new mechanism from the nvme
> driver's fw activation work.

Just wondering why not freeze IO queues before updating FW?


Thanks,
Ming

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 0/2] Reset timeout for paused hardware
@ 2019-05-23  3:29   ` Ming Lei
  0 siblings, 0 replies; 30+ messages in thread
From: Ming Lei @ 2019-05-23  3:29 UTC (permalink / raw)


On Wed, May 22, 2019@11:48:10AM -0600, Keith Busch wrote:
> Hardware may temporarily stop processing commands that have
> been dispatched to it while activating new firmware. Some target
> implementation's paused state time exceeds the default request expiry,
> so any request dispatched before the driver could quiesce for the
> hardware's paused state will time out, and handling this may interrupt
> the firmware activation.
> 
> This two-part series provides a way for drivers to reset dispatched
> requests' timeout deadline, then uses this new mechanism from the nvme
> driver's fw activation work.

Just wondering why not freeze IO queues before updating FW?


Thanks,
Ming

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] Reset timeout for paused hardware
  2019-05-23  3:29   ` Ming Lei
@ 2019-05-23  3:48     ` Keith Busch
  -1 siblings, 0 replies; 30+ messages in thread
From: Keith Busch @ 2019-05-23  3:48 UTC (permalink / raw)
  To: Ming Lei
  Cc: Keith Busch, Jens Axboe, linux-block, Christoph Hellwig, linux-nvme

On Wed, May 22, 2019, 9:29 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Wed, May 22, 2019 at 11:48:10AM -0600, Keith Busch wrote:
> > Hardware may temporarily stop processing commands that have
> > been dispatched to it while activating new firmware. Some target
> > implementation's paused state time exceeds the default request expiry,
> > so any request dispatched before the driver could quiesce for the
> > hardware's paused state will time out, and handling this may interrupt
> > the firmware activation.
> >
> > This two-part series provides a way for drivers to reset dispatched
> > requests' timeout deadline, then uses this new mechanism from the nvme
> > driver's fw activation work.
>
> Just wondering why not freeze IO queues before updating FW?


Yeah, that's a good question. A FW update may have been initiated out
of band or from another host entirely. The driver can't count on
preparing for hardware pausing command processing before it's
happened, but we'll always find out asynchronously after it's too late
to freeze.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 0/2] Reset timeout for paused hardware
@ 2019-05-23  3:48     ` Keith Busch
  0 siblings, 0 replies; 30+ messages in thread
From: Keith Busch @ 2019-05-23  3:48 UTC (permalink / raw)


On Wed, May 22, 2019, 9:29 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Wed, May 22, 2019@11:48:10AM -0600, Keith Busch wrote:
> > Hardware may temporarily stop processing commands that have
> > been dispatched to it while activating new firmware. Some target
> > implementation's paused state time exceeds the default request expiry,
> > so any request dispatched before the driver could quiesce for the
> > hardware's paused state will time out, and handling this may interrupt
> > the firmware activation.
> >
> > This two-part series provides a way for drivers to reset dispatched
> > requests' timeout deadline, then uses this new mechanism from the nvme
> > driver's fw activation work.
>
> Just wondering why not freeze IO queues before updating FW?


Yeah, that's a good question. A FW update may have been initiated out
of band or from another host entirely. The driver can't count on
preparing for hardware pausing command processing before it's
happened, but we'll always find out asynchronously after it's too late
to freeze.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] Reset timeout for paused hardware
  2019-05-22 20:28     ` Keith Busch
@ 2019-05-23 10:04       ` Bart Van Assche
  -1 siblings, 0 replies; 30+ messages in thread
From: Bart Van Assche @ 2019-05-23 10:04 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, linux-nvme, Ming Lei, Keith Busch,
	Christoph Hellwig

On 5/22/19 10:28 PM, Keith Busch wrote:
> On Wed, May 22, 2019 at 10:20:45PM +0200, Bart Van Assche wrote:
>> On 5/22/19 7:48 PM, Keith Busch wrote:
>>> Hardware may temporarily stop processing commands that have
>>> been dispatched to it while activating new firmware. Some target
>>> implementation's paused state time exceeds the default request expiry,
>>> so any request dispatched before the driver could quiesce for the
>>> hardware's paused state will time out, and handling this may interrupt
>>> the firmware activation.
>>>
>>> This two-part series provides a way for drivers to reset dispatched
>>> requests' timeout deadline, then uses this new mechanism from the nvme
>>> driver's fw activation work.
>>
>> Hi Keith,
>>
>> Is it essential to modify the block layer to implement this behavior
>> change? Would it be possible to implement this behavior change by
>> modifying the NVMe driver only, e.g. by modifying the nvme_timeout()
>> function and by making that function return BLK_EH_RESET_TIMER while new
>> firmware is being activated?
> 
> Good question.
> 
> We can't just do this from nvme_timeout(), though. That introduces races
> between timeout_work and fw_act_work if that fw work clears the
> condition that timeout needs to observe to return RESET_TIMER.
> 
> Even if we avoid that race, the rq->deadline needs to be adjusted to
> the current time after the h/w unpause because the time accumulated while
> h/w halted itself should not be counted against the request.

Hi Keith,

How about recording the time at which the firmware upgrade finished and
making nvme_timeout() return RESET_TIMER if either a firmware upgrade is
in progress or if it has finished less than (request timeout) seconds
ago? The reason I'm asking this is because I'm concerned that the
patches you posted would need a careful review to see whether or not
these trigger new race conditions.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 0/2] Reset timeout for paused hardware
@ 2019-05-23 10:04       ` Bart Van Assche
  0 siblings, 0 replies; 30+ messages in thread
From: Bart Van Assche @ 2019-05-23 10:04 UTC (permalink / raw)


On 5/22/19 10:28 PM, Keith Busch wrote:
> On Wed, May 22, 2019@10:20:45PM +0200, Bart Van Assche wrote:
>> On 5/22/19 7:48 PM, Keith Busch wrote:
>>> Hardware may temporarily stop processing commands that have
>>> been dispatched to it while activating new firmware. Some target
>>> implementation's paused state time exceeds the default request expiry,
>>> so any request dispatched before the driver could quiesce for the
>>> hardware's paused state will time out, and handling this may interrupt
>>> the firmware activation.
>>>
>>> This two-part series provides a way for drivers to reset dispatched
>>> requests' timeout deadline, then uses this new mechanism from the nvme
>>> driver's fw activation work.
>>
>> Hi Keith,
>>
>> Is it essential to modify the block layer to implement this behavior
>> change? Would it be possible to implement this behavior change by
>> modifying the NVMe driver only, e.g. by modifying the nvme_timeout()
>> function and by making that function return BLK_EH_RESET_TIMER while new
>> firmware is being activated?
> 
> Good question.
> 
> We can't just do this from nvme_timeout(), though. That introduces races
> between timeout_work and fw_act_work if that fw work clears the
> condition that timeout needs to observe to return RESET_TIMER.
> 
> Even if we avoid that race, the rq->deadline needs to be adjusted to
> the current time after the h/w unpause because the time accumulated while
> h/w halted itself should not be counted against the request.

Hi Keith,

How about recording the time at which the firmware upgrade finished and
making nvme_timeout() return RESET_TIMER if either a firmware upgrade is
in progress or if it has finished less than (request timeout) seconds
ago? The reason I'm asking this is because I'm concerned that the
patches you posted would need a careful review to see whether or not
these trigger new race conditions.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] Reset timeout for paused hardware
  2019-05-23  3:48     ` Keith Busch
@ 2019-05-23 10:13       ` Christoph Hellwig
  -1 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2019-05-23 10:13 UTC (permalink / raw)
  To: Keith Busch
  Cc: Ming Lei, Keith Busch, Jens Axboe, linux-block,
	Christoph Hellwig, linux-nvme

On Wed, May 22, 2019 at 09:48:10PM -0600, Keith Busch wrote:
> Yeah, that's a good question. A FW update may have been initiated out
> of band or from another host entirely. The driver can't count on
> preparing for hardware pausing command processing before it's
> happened, but we'll always find out asynchronously after it's too late
> to freeze.

I don't think that is the case at least for spec compliant devices.

From NVMe 1.3:

Figure 49: Asynchronous Event Information - Notice

1h		Firmware Activation Starting: The controller is starting
		a firmware activation process during which command
		processing is paused. Host software may use CSTS.PP to
		determine when command processing has resumed. To clear
		this event, host software reads the Firmware Slot
		Information log page.

So we are supposed to get an AEN before the device stops processing
commands.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 0/2] Reset timeout for paused hardware
@ 2019-05-23 10:13       ` Christoph Hellwig
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2019-05-23 10:13 UTC (permalink / raw)


On Wed, May 22, 2019@09:48:10PM -0600, Keith Busch wrote:
> Yeah, that's a good question. A FW update may have been initiated out
> of band or from another host entirely. The driver can't count on
> preparing for hardware pausing command processing before it's
> happened, but we'll always find out asynchronously after it's too late
> to freeze.

I don't think that is the case at least for spec compliant devices.

>From NVMe 1.3:

Figure 49: Asynchronous Event Information - Notice

1h		Firmware Activation Starting: The controller is starting
		a firmware activation process during which command
		processing is paused. Host software may use CSTS.PP to
		determine when command processing has resumed. To clear
		this event, host software reads the Firmware Slot
		Information log page.

So we are supposed to get an AEN before the device stops processing
commands.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/2] nvme: reset request timeouts during fw activation
  2019-05-22 17:48   ` Keith Busch
@ 2019-05-23 10:19     ` Ming Lei
  -1 siblings, 0 replies; 30+ messages in thread
From: Ming Lei @ 2019-05-23 10:19 UTC (permalink / raw)
  To: Keith Busch; +Cc: Jens Axboe, Christoph Hellwig, linux-nvme, linux-block

On Wed, May 22, 2019 at 11:48:12AM -0600, Keith Busch wrote:
> The nvme controller may pause command processing during firmware
> activation. The driver will quiesce queues during this time, but commands
> dispatched prior to the notification will not be processed until the
> hardware completes this activation.
> 
> We do not want those requests to time out while the hardware is in
> this paused state as we don't expect those commands to complete during
> this time, and that handling will interfere with the firmware activation
> process.
> 
> In addition to quiescing the queues, halt timeout detection during the
> paused state and reset the dispatched request deadlines when the hardware
> exists that state. This action applies to IO and admin queues.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  drivers/nvme/host/core.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 1b7c2afd84cb..37a9a66ada22 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -89,6 +89,7 @@ static dev_t nvme_chr_devt;
>  static struct class *nvme_class;
>  static struct class *nvme_subsys_class;
>  
> +static void nvme_reset_queues(struct nvme_ctrl *ctrl);
>  static int nvme_revalidate_disk(struct gendisk *disk);
>  static void nvme_put_subsystem(struct nvme_subsystem *subsys);
>  static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
> @@ -3605,6 +3606,11 @@ static void nvme_fw_act_work(struct work_struct *work)
>  				msecs_to_jiffies(admin_timeout * 1000);
>  
>  	nvme_stop_queues(ctrl);
> +	nvme_sync_queues(ctrl);
> +
> +	blk_mq_quiesce_queue(ctrl->admin_q);
> +	blk_sync_queue(ctrl->admin_q);

blk_sync_queue() may not halt timeout detection completely since the
timeout work may reset timer again.

Also reset still may come during activating FW, is that a problem?


Thanks,
Ming

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 2/2] nvme: reset request timeouts during fw activation
@ 2019-05-23 10:19     ` Ming Lei
  0 siblings, 0 replies; 30+ messages in thread
From: Ming Lei @ 2019-05-23 10:19 UTC (permalink / raw)


On Wed, May 22, 2019@11:48:12AM -0600, Keith Busch wrote:
> The nvme controller may pause command processing during firmware
> activation. The driver will quiesce queues during this time, but commands
> dispatched prior to the notification will not be processed until the
> hardware completes this activation.
> 
> We do not want those requests to time out while the hardware is in
> this paused state as we don't expect those commands to complete during
> this time, and that handling will interfere with the firmware activation
> process.
> 
> In addition to quiescing the queues, halt timeout detection during the
> paused state and reset the dispatched request deadlines when the hardware
> exists that state. This action applies to IO and admin queues.
> 
> Signed-off-by: Keith Busch <keith.busch at intel.com>
> ---
>  drivers/nvme/host/core.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 1b7c2afd84cb..37a9a66ada22 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -89,6 +89,7 @@ static dev_t nvme_chr_devt;
>  static struct class *nvme_class;
>  static struct class *nvme_subsys_class;
>  
> +static void nvme_reset_queues(struct nvme_ctrl *ctrl);
>  static int nvme_revalidate_disk(struct gendisk *disk);
>  static void nvme_put_subsystem(struct nvme_subsystem *subsys);
>  static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
> @@ -3605,6 +3606,11 @@ static void nvme_fw_act_work(struct work_struct *work)
>  				msecs_to_jiffies(admin_timeout * 1000);
>  
>  	nvme_stop_queues(ctrl);
> +	nvme_sync_queues(ctrl);
> +
> +	blk_mq_quiesce_queue(ctrl->admin_q);
> +	blk_sync_queue(ctrl->admin_q);

blk_sync_queue() may not halt timeout detection completely since the
timeout work may reset timer again.

Also reset still may come during activating FW, is that a problem?


Thanks,
Ming

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] Reset timeout for paused hardware
  2019-05-23 10:13       ` Christoph Hellwig
@ 2019-05-23 13:23         ` Keith Busch
  -1 siblings, 0 replies; 30+ messages in thread
From: Keith Busch @ 2019-05-23 13:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Ming Lei, Busch, Keith, Jens Axboe, linux-block, linux-nvme

On Thu, May 23, 2019 at 03:13:11AM -0700, Christoph Hellwig wrote:
> On Wed, May 22, 2019 at 09:48:10PM -0600, Keith Busch wrote:
> > Yeah, that's a good question. A FW update may have been initiated out
> > of band or from another host entirely. The driver can't count on
> > preparing for hardware pausing command processing before it's
> > happened, but we'll always find out asynchronously after it's too late
> > to freeze.
> 
> I don't think that is the case at least for spec compliant devices.
> 
> From NVMe 1.3:
> 
> Figure 49: Asynchronous Event Information - Notice
> 
> 1h		Firmware Activation Starting: The controller is starting
> 		a firmware activation process during which command
> 		processing is paused. Host software may use CSTS.PP to
> 		determine when command processing has resumed. To clear
> 		this event, host software reads the Firmware Slot
> 		Information log page.
> 
> So we are supposed to get an AEN before the device stops processing
> commands.

Hm, I read the same section, but conclude differently (and at least some
vendors did too). A spec compliant controller activating new firmware
without reset would stop processing commands and set CSTS.PP first,
then send the AEN. When the host is aware to poll Processing Paused,
the controller hasn't been processing new commands for some time.

Could you give some more detail on your interpretation?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 0/2] Reset timeout for paused hardware
@ 2019-05-23 13:23         ` Keith Busch
  0 siblings, 0 replies; 30+ messages in thread
From: Keith Busch @ 2019-05-23 13:23 UTC (permalink / raw)


On Thu, May 23, 2019@03:13:11AM -0700, Christoph Hellwig wrote:
> On Wed, May 22, 2019@09:48:10PM -0600, Keith Busch wrote:
> > Yeah, that's a good question. A FW update may have been initiated out
> > of band or from another host entirely. The driver can't count on
> > preparing for hardware pausing command processing before it's
> > happened, but we'll always find out asynchronously after it's too late
> > to freeze.
> 
> I don't think that is the case at least for spec compliant devices.
> 
> From NVMe 1.3:
> 
> Figure 49: Asynchronous Event Information - Notice
> 
> 1h		Firmware Activation Starting: The controller is starting
> 		a firmware activation process during which command
> 		processing is paused. Host software may use CSTS.PP to
> 		determine when command processing has resumed. To clear
> 		this event, host software reads the Firmware Slot
> 		Information log page.
> 
> So we are supposed to get an AEN before the device stops processing
> commands.

Hm, I read the same section, but conclude differently (and at least some
vendors did too). A spec compliant controller activating new firmware
without reset would stop processing commands and set CSTS.PP first,
then send the AEN. When the host is aware to poll Processing Paused,
the controller hasn't been processing new commands for some time.

Could you give some more detail on your interpretation?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/2] nvme: reset request timeouts during fw activation
  2019-05-23 10:19     ` Ming Lei
@ 2019-05-23 13:34       ` Keith Busch
  -1 siblings, 0 replies; 30+ messages in thread
From: Keith Busch @ 2019-05-23 13:34 UTC (permalink / raw)
  To: Ming Lei
  Cc: Busch, Keith, Jens Axboe, linux-block, Christoph Hellwig, linux-nvme

On Thu, May 23, 2019 at 03:19:54AM -0700, Ming Lei wrote:
> On Wed, May 22, 2019 at 11:48:12AM -0600, Keith Busch wrote:
> > @@ -3605,6 +3606,11 @@ static void nvme_fw_act_work(struct work_struct *work)
> >  				msecs_to_jiffies(admin_timeout * 1000);
> >  
> >  	nvme_stop_queues(ctrl);
> > +	nvme_sync_queues(ctrl);
> > +
> > +	blk_mq_quiesce_queue(ctrl->admin_q);
> > +	blk_sync_queue(ctrl->admin_q);
> 
> blk_sync_queue() may not halt timeout detection completely since the
> timeout work may reset timer again.

Doh! Didn't hit that in testing, but point taken.
 
> Also reset still may come during activating FW, is that a problem?

IO timeout and user initiated resets should be avoided. A state machine
addition may be useful here.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 2/2] nvme: reset request timeouts during fw activation
@ 2019-05-23 13:34       ` Keith Busch
  0 siblings, 0 replies; 30+ messages in thread
From: Keith Busch @ 2019-05-23 13:34 UTC (permalink / raw)


On Thu, May 23, 2019@03:19:54AM -0700, Ming Lei wrote:
> On Wed, May 22, 2019@11:48:12AM -0600, Keith Busch wrote:
> > @@ -3605,6 +3606,11 @@ static void nvme_fw_act_work(struct work_struct *work)
> >  				msecs_to_jiffies(admin_timeout * 1000);
> >  
> >  	nvme_stop_queues(ctrl);
> > +	nvme_sync_queues(ctrl);
> > +
> > +	blk_mq_quiesce_queue(ctrl->admin_q);
> > +	blk_sync_queue(ctrl->admin_q);
> 
> blk_sync_queue() may not halt timeout detection completely since the
> timeout work may reset timer again.

Doh! Didn't hit that in testing, but point taken.
 
> Also reset still may come during activating FW, is that a problem?

IO timeout and user initiated resets should be avoided. A state machine
addition may be useful here.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/2] nvme: reset request timeouts during fw activation
  2019-05-23 13:34       ` Keith Busch
@ 2019-05-23 14:07         ` Christoph Hellwig
  -1 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2019-05-23 14:07 UTC (permalink / raw)
  To: Keith Busch
  Cc: Ming Lei, Busch, Keith, Jens Axboe, linux-block,
	Christoph Hellwig, linux-nvme

On Thu, May 23, 2019 at 07:34:29AM -0600, Keith Busch wrote:
> Doh! Didn't hit that in testing, but point taken.
>  
> > Also reset still may come during activating FW, is that a problem?
> 
> IO timeout and user initiated resets should be avoided. A state machine
> addition may be useful here.

Yep.  It almost sounds like we'd want a PAUSED state where resets just
keep returning RESET_TIMER without any other action.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 2/2] nvme: reset request timeouts during fw activation
@ 2019-05-23 14:07         ` Christoph Hellwig
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2019-05-23 14:07 UTC (permalink / raw)


On Thu, May 23, 2019@07:34:29AM -0600, Keith Busch wrote:
> Doh! Didn't hit that in testing, but point taken.
>  
> > Also reset still may come during activating FW, is that a problem?
> 
> IO timeout and user initiated resets should be avoided. A state machine
> addition may be useful here.

Yep.  It almost sounds like we'd want a PAUSED state where resets just
keep returning RESET_TIMER without any other action.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] Reset timeout for paused hardware
  2019-05-23 13:23         ` Keith Busch
@ 2019-05-23 14:10           ` Christoph Hellwig
  -1 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2019-05-23 14:10 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Keith Busch, Ming Lei, Busch, Keith,
	Jens Axboe, linux-block, linux-nvme

On Thu, May 23, 2019 at 07:23:04AM -0600, Keith Busch wrote:
> > Figure 49: Asynchronous Event Information - Notice
> > 
> > 1h		Firmware Activation Starting: The controller is starting
> > 		a firmware activation process during which command
> > 		processing is paused. Host software may use CSTS.PP to
> > 		determine when command processing has resumed. To clear
> > 		this event, host software reads the Firmware Slot
> > 		Information log page.
> > 
> > So we are supposed to get an AEN before the device stops processing
> > commands.
> 
> Hm, I read the same section, but conclude differently (and at least some
> vendors did too). A spec compliant controller activating new firmware
> without reset would stop processing commands and set CSTS.PP first,
> then send the AEN. When the host is aware to poll Processing Paused,
> the controller hasn't been processing new commands for some time.
> 
> Could you give some more detail on your interpretation?

What would be the point of the AEN if it wasn't sent at exactly
the point when the controller stops acceppting command?

The wording is of course NVMe-like, but "The controller is starting a.."
pretty clear implies this is sent at the beginning of the paused
state, not the end.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 0/2] Reset timeout for paused hardware
@ 2019-05-23 14:10           ` Christoph Hellwig
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2019-05-23 14:10 UTC (permalink / raw)


On Thu, May 23, 2019@07:23:04AM -0600, Keith Busch wrote:
> > Figure 49: Asynchronous Event Information - Notice
> > 
> > 1h		Firmware Activation Starting: The controller is starting
> > 		a firmware activation process during which command
> > 		processing is paused. Host software may use CSTS.PP to
> > 		determine when command processing has resumed. To clear
> > 		this event, host software reads the Firmware Slot
> > 		Information log page.
> > 
> > So we are supposed to get an AEN before the device stops processing
> > commands.
> 
> Hm, I read the same section, but conclude differently (and at least some
> vendors did too). A spec compliant controller activating new firmware
> without reset would stop processing commands and set CSTS.PP first,
> then send the AEN. When the host is aware to poll Processing Paused,
> the controller hasn't been processing new commands for some time.
> 
> Could you give some more detail on your interpretation?

What would be the point of the AEN if it wasn't sent at exactly
the point when the controller stops acceppting command?

The wording is of course NVMe-like, but "The controller is starting a.."
pretty clear implies this is sent at the beginning of the paused
state, not the end.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] Reset timeout for paused hardware
  2019-05-23 14:10           ` Christoph Hellwig
@ 2019-05-23 14:19             ` Keith Busch
  -1 siblings, 0 replies; 30+ messages in thread
From: Keith Busch @ 2019-05-23 14:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Ming Lei, Busch, Keith, Jens Axboe, linux-block, linux-nvme

On Thu, May 23, 2019 at 04:10:54PM +0200, Christoph Hellwig wrote:
> On Thu, May 23, 2019 at 07:23:04AM -0600, Keith Busch wrote:
> > > Figure 49: Asynchronous Event Information - Notice
> > > 
> > > 1h		Firmware Activation Starting: The controller is starting
> > > 		a firmware activation process during which command
> > > 		processing is paused. Host software may use CSTS.PP to
> > > 		determine when command processing has resumed. To clear
> > > 		this event, host software reads the Firmware Slot
> > > 		Information log page.
> > > 
> > > So we are supposed to get an AEN before the device stops processing
> > > commands.
> > 
> > Hm, I read the same section, but conclude differently (and at least some
> > vendors did too). A spec compliant controller activating new firmware
> > without reset would stop processing commands and set CSTS.PP first,
> > then send the AEN. When the host is aware to poll Processing Paused,
> > the controller hasn't been processing new commands for some time.
> > 
> > Could you give some more detail on your interpretation?
> 
> What would be the point of the AEN if it wasn't sent at exactly
> the point when the controller stops acceppting command?
> 
> The wording is of course NVMe-like, but "The controller is starting a.."
> pretty clear implies this is sent at the beginning of the paused
> state, not the end.

Right, the controller starts the pause at time T1, and sends the AEN at
the same time. No new commands will be processed after T1 until the
firmware activate completes.

The host sees the AEN a short time later, time T2.

The time between T1 - T2, we may have sent many commands that are not
going to be processed, and we couldn't have known that when they were
submitted. The controller still owns those commands, and we just need
to adjust their deadlines once processing resumes.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 0/2] Reset timeout for paused hardware
@ 2019-05-23 14:19             ` Keith Busch
  0 siblings, 0 replies; 30+ messages in thread
From: Keith Busch @ 2019-05-23 14:19 UTC (permalink / raw)


On Thu, May 23, 2019@04:10:54PM +0200, Christoph Hellwig wrote:
> On Thu, May 23, 2019@07:23:04AM -0600, Keith Busch wrote:
> > > Figure 49: Asynchronous Event Information - Notice
> > > 
> > > 1h		Firmware Activation Starting: The controller is starting
> > > 		a firmware activation process during which command
> > > 		processing is paused. Host software may use CSTS.PP to
> > > 		determine when command processing has resumed. To clear
> > > 		this event, host software reads the Firmware Slot
> > > 		Information log page.
> > > 
> > > So we are supposed to get an AEN before the device stops processing
> > > commands.
> > 
> > Hm, I read the same section, but conclude differently (and at least some
> > vendors did too). A spec compliant controller activating new firmware
> > without reset would stop processing commands and set CSTS.PP first,
> > then send the AEN. When the host is aware to poll Processing Paused,
> > the controller hasn't been processing new commands for some time.
> > 
> > Could you give some more detail on your interpretation?
> 
> What would be the point of the AEN if it wasn't sent at exactly
> the point when the controller stops acceppting command?
> 
> The wording is of course NVMe-like, but "The controller is starting a.."
> pretty clear implies this is sent at the beginning of the paused
> state, not the end.

Right, the controller starts the pause at time T1, and sends the AEN at
the same time. No new commands will be processed after T1 until the
firmware activate completes.

The host sees the AEN a short time later, time T2.

The time between T1 - T2, we may have sent many commands that are not
going to be processed, and we couldn't have known that when they were
submitted. The controller still owns those commands, and we just need
to adjust their deadlines once processing resumes.

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2019-05-23 14:24 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-22 17:48 [PATCH 0/2] Reset timeout for paused hardware Keith Busch
2019-05-22 17:48 ` Keith Busch
2019-05-22 17:48 ` [PATCH 1/2] blk-mq: provide way to reset rq deadline Keith Busch
2019-05-22 17:48   ` Keith Busch
2019-05-22 17:48 ` [PATCH 2/2] nvme: reset request timeouts during fw activation Keith Busch
2019-05-22 17:48   ` Keith Busch
2019-05-23 10:19   ` Ming Lei
2019-05-23 10:19     ` Ming Lei
2019-05-23 13:34     ` Keith Busch
2019-05-23 13:34       ` Keith Busch
2019-05-23 14:07       ` Christoph Hellwig
2019-05-23 14:07         ` Christoph Hellwig
2019-05-22 20:20 ` [PATCH 0/2] Reset timeout for paused hardware Bart Van Assche
2019-05-22 20:20   ` Bart Van Assche
2019-05-22 20:28   ` Keith Busch
2019-05-22 20:28     ` Keith Busch
2019-05-23 10:04     ` Bart Van Assche
2019-05-23 10:04       ` Bart Van Assche
2019-05-23  3:29 ` Ming Lei
2019-05-23  3:29   ` Ming Lei
2019-05-23  3:48   ` Keith Busch
2019-05-23  3:48     ` Keith Busch
2019-05-23 10:13     ` Christoph Hellwig
2019-05-23 10:13       ` Christoph Hellwig
2019-05-23 13:23       ` Keith Busch
2019-05-23 13:23         ` Keith Busch
2019-05-23 14:10         ` Christoph Hellwig
2019-05-23 14:10           ` Christoph Hellwig
2019-05-23 14:19           ` Keith Busch
2019-05-23 14:19             ` Keith Busch

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.