All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] nvme: Improve processing paused support
@ 2019-05-24 20:20 Keith Busch
  2019-05-24 20:20 ` [PATCH 1/3] nvme-pci: reset timeout when processing is paused Keith Busch
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Keith Busch @ 2019-05-24 20:20 UTC (permalink / raw)


From: Keith Busch <kbusch@kernel.org>

Improve firmware activation over what we're currently doing. The main
thing is that we do not want to esacalate recovery while the current
status has CSTS.PP set as the recovery attempt may interfere with a
critical controller operation.

There still exists a race where the controller may clear CSTS Processing
Paused at the same time a stalled request times out. We would normally
want to reset the timer for that, but this is at least a state that can
handle error recovery without interrupting background operations, so
the only harm caused from encountering this unlikely event is a longer
latency back to ready.

My previous patch to address this used help from the block layer, but
had some issues. This series only involves the nvme driver and just
addresses the main concern without addressing the corner cases.

Keith Busch (3):
  nvme-pci: reset timeout when processing is paused
  nvme: rearm fw notification in admin only state
  nvme: quiesce admin queue for fw activation

 drivers/nvme/host/core.c | 9 ++++++++-
 drivers/nvme/host/pci.c  | 2 +-
 2 files changed, 9 insertions(+), 2 deletions(-)

-- 
2.14.4

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/3] nvme-pci: reset timeout when processing is paused
  2019-05-24 20:20 [PATCH 0/3] nvme: Improve processing paused support Keith Busch
@ 2019-05-24 20:20 ` Keith Busch
  2019-05-25 13:22   ` Minwoo Im
  2019-06-01  9:07   ` Christoph Hellwig
  2019-05-24 20:20 ` [PATCH 2/3] nvme: rearm fw notification in admin only state Keith Busch
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 12+ messages in thread
From: Keith Busch @ 2019-05-24 20:20 UTC (permalink / raw)


From: Keith Busch <kbusch@kernel.org>

Do not escalate request timeout handling if the controller has
temporary paused command processing. We do not expect requests to
complete during this transient state, so just reset the timer.

Signed-off-by: Keith Busch <kbusch at kernel.org>
---
 drivers/nvme/host/pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f562154551ce..101e20522374 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1263,7 +1263,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	 * the recovery mechanism will surely fail.
 	 */
 	mb();
-	if (pci_channel_offline(to_pci_dev(dev->dev)))
+	if (pci_channel_offline(to_pci_dev(dev->dev)) || (csts & NVME_CSTS_PP))
 		return BLK_EH_RESET_TIMER;
 
 	/*
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/3] nvme: rearm fw notification in admin only state
  2019-05-24 20:20 [PATCH 0/3] nvme: Improve processing paused support Keith Busch
  2019-05-24 20:20 ` [PATCH 1/3] nvme-pci: reset timeout when processing is paused Keith Busch
@ 2019-05-24 20:20 ` Keith Busch
  2019-06-01  9:08   ` Christoph Hellwig
  2019-05-24 20:20 ` [PATCH 3/3] nvme: quiesce admin queue for fw activation Keith Busch
  2019-05-31 23:32 ` [PATCH 0/3] nvme: Improve processing paused support Sagi Grimberg
  3 siblings, 1 reply; 12+ messages in thread
From: Keith Busch @ 2019-05-24 20:20 UTC (permalink / raw)


From: Keith Busch <kbusch@kernel.org>

Getting the firmware log is an admin command, so admin-only or live
states are valid to rearm the activation notice.

Signed-off-by: Keith Busch <kbusch at kernel.org>
---
 drivers/nvme/host/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1b7c2afd84cb..96dac2292897 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3615,7 +3615,8 @@ static void nvme_fw_act_work(struct work_struct *work)
 		msleep(100);
 	}
 
-	if (ctrl->state != NVME_CTRL_LIVE)
+	if (ctrl->state != NVME_CTRL_LIVE &&
+	    ctrl->state != NVME_CTRL_ADMIN_ONLY)
 		return;
 
 	nvme_start_queues(ctrl);
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 3/3] nvme: quiesce admin queue for fw activation
  2019-05-24 20:20 [PATCH 0/3] nvme: Improve processing paused support Keith Busch
  2019-05-24 20:20 ` [PATCH 1/3] nvme-pci: reset timeout when processing is paused Keith Busch
  2019-05-24 20:20 ` [PATCH 2/3] nvme: rearm fw notification in admin only state Keith Busch
@ 2019-05-24 20:20 ` Keith Busch
  2019-05-25 13:17   ` Minwoo Im
  2019-05-31 23:32 ` [PATCH 0/3] nvme: Improve processing paused support Sagi Grimberg
  3 siblings, 1 reply; 12+ messages in thread
From: Keith Busch @ 2019-05-24 20:20 UTC (permalink / raw)


From: Keith Busch <kbusch@kernel.org>

The controller's admin queue processing is paused during firmware
activation, just like IO queues. Quiesce these, and inform the syslog
this uncommon event is happening.

Signed-off-by: Keith Busch <kbusch at kernel.org>
---
 drivers/nvme/host/core.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 96dac2292897..5a6d27823f7f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3604,7 +3604,10 @@ static void nvme_fw_act_work(struct work_struct *work)
 		fw_act_timeout = jiffies +
 				msecs_to_jiffies(admin_timeout * 1000);
 
+	dev_info(ctrl->device, "processing paused for fw activation\n");
 	nvme_stop_queues(ctrl);
+	blk_mq_quiesce_queue(ctrl->admin_q);
+
 	while (nvme_ctrl_pp_status(ctrl)) {
 		if (time_after(jiffies, fw_act_timeout)) {
 			dev_warn(ctrl->device,
@@ -3619,7 +3622,10 @@ static void nvme_fw_act_work(struct work_struct *work)
 	    ctrl->state != NVME_CTRL_ADMIN_ONLY)
 		return;
 
+	dev_info(ctrl->device, "processing resumed\n");
+	blk_mq_unquiesce_queue(ctrl->admin_q);
 	nvme_start_queues(ctrl);
+
 	/* read FW slot information to clear the AER */
 	nvme_get_fw_slot_info(ctrl);
 }
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 3/3] nvme: quiesce admin queue for fw activation
  2019-05-24 20:20 ` [PATCH 3/3] nvme: quiesce admin queue for fw activation Keith Busch
@ 2019-05-25 13:17   ` Minwoo Im
  2019-05-25 14:07     ` Keith Busch
  0 siblings, 1 reply; 12+ messages in thread
From: Minwoo Im @ 2019-05-25 13:17 UTC (permalink / raw)


> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 96dac2292897..5a6d27823f7f 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3604,7 +3604,10 @@ static void nvme_fw_act_work(struct work_struct *work)
>  		fw_act_timeout = jiffies +
>  				msecs_to_jiffies(admin_timeout * 1000);
>  
> +	dev_info(ctrl->device, "processing paused for fw activation\n");
>  	nvme_stop_queues(ctrl);
> +	blk_mq_quiesce_queue(ctrl->admin_q);

Keith,

Can we have an warning here to indicate if device firmware has not set
the CSTS.PP yet.  In that case, the information message that you have
introduced here may be invalid.  It would be great if we check the
CSTS.PP first, and then print it out.

If it's not necessary to have it, please feel free to let me know.
Of course, device has to prepared the processing paused status,
though :)

> +
>  	while (nvme_ctrl_pp_status(ctrl)) {
>  		if (time_after(jiffies, fw_act_timeout)) {
>  			dev_warn(ctrl->device,
> @@ -3619,7 +3622,10 @@ static void nvme_fw_act_work(struct work_struct *work)
>  	    ctrl->state != NVME_CTRL_ADMIN_ONLY)
>  		return;
>  
> +	dev_info(ctrl->device, "processing resumed\n");

Can we print it out after admin_q, io_q are unquiesced ?

> +	blk_mq_unquiesce_queue(ctrl->admin_q);
>  	nvme_start_queues(ctrl);
> +
>  	/* read FW slot information to clear the AER */
>  	nvme_get_fw_slot_info(ctrl);
>  }

I am very okay to quiesce the admin_q here.

Thanks,

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/3] nvme-pci: reset timeout when processing is paused
  2019-05-24 20:20 ` [PATCH 1/3] nvme-pci: reset timeout when processing is paused Keith Busch
@ 2019-05-25 13:22   ` Minwoo Im
  2019-06-01  9:07   ` Christoph Hellwig
  1 sibling, 0 replies; 12+ messages in thread
From: Minwoo Im @ 2019-05-25 13:22 UTC (permalink / raw)


On 19-05-24 14:20:34, Keith Busch wrote:
> From: Keith Busch <kbusch at kernel.org>
> 
> Do not escalate request timeout handling if the controller has
> temporary paused command processing. We do not expect requests to
> complete during this transient state, so just reset the timer.
> 
> Signed-off-by: Keith Busch <kbusch at kernel.org>
> ---
>  drivers/nvme/host/pci.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index f562154551ce..101e20522374 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1263,7 +1263,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>  	 * the recovery mechanism will surely fail.
>  	 */
>  	mb();
> -	if (pci_channel_offline(to_pci_dev(dev->dev)))
> +	if (pci_channel_offline(to_pci_dev(dev->dev)) || (csts & NVME_CSTS_PP))
>  		return BLK_EH_RESET_TIMER;

Keith,

Is it okay to check CSTS.PP status without checking CC.EN ?  Can't we
just do the PP check like:

	if (pci_channel_offline(to_pci_dev(dev->dev)) ||
			nvme_ctrl_pp_status(&dev->ctrl))
		return BLK_EH_RESET_TIMER;

What do you think?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 3/3] nvme: quiesce admin queue for fw activation
  2019-05-25 13:17   ` Minwoo Im
@ 2019-05-25 14:07     ` Keith Busch
  2019-05-25 15:05       ` Minwoo Im
  0 siblings, 1 reply; 12+ messages in thread
From: Keith Busch @ 2019-05-25 14:07 UTC (permalink / raw)


On Sat, May 25, 2019@7:17 AM Minwoo Im <minwoo.im.dev@gmail.com> wrote:
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index 96dac2292897..5a6d27823f7f 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -3604,7 +3604,10 @@ static void nvme_fw_act_work(struct work_struct *work)
> >               fw_act_timeout = jiffies +
> >                               msecs_to_jiffies(admin_timeout * 1000);
> >
> > +     dev_info(ctrl->device, "processing paused for fw activation\n");
> >       nvme_stop_queues(ctrl);
> > +     blk_mq_quiesce_queue(ctrl->admin_q);
>
> Keith,
>
> Can we have an warning here to indicate if device firmware has not set
> the CSTS.PP yet.  In that case, the information message that you have
> introduced here may be invalid.  It would be great if we check the
> CSTS.PP first, and then print it out.
>
> If it's not necessary to have it, please feel free to let me know.
> Of course, device has to prepared the processing paused status,
> though :)

Well, between the fw notice AEN and when the work can finally check
CSTS.PP, we really have no way of knowing if the controller paused and
resumed before we observed that controller status, so a warning would
be a bit much. We're still quiescing, so the print is valid.

In case the controller did pause processing, and we hit the unusual
timeout race that does not see CSTS.PP, the new prints make it clear
that's what happened.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 3/3] nvme: quiesce admin queue for fw activation
  2019-05-25 14:07     ` Keith Busch
@ 2019-05-25 15:05       ` Minwoo Im
  0 siblings, 0 replies; 12+ messages in thread
From: Minwoo Im @ 2019-05-25 15:05 UTC (permalink / raw)


> >
> > Keith,
> >
> > Can we have an warning here to indicate if device firmware has not set
> > the CSTS.PP yet.  In that case, the information message that you have
> > introduced here may be invalid.  It would be great if we check the
> > CSTS.PP first, and then print it out.
> >
> > If it's not necessary to have it, please feel free to let me know.
> > Of course, device has to prepared the processing paused status,
> > though :)
> 
> Well, between the fw notice AEN and when the work can finally check
> CSTS.PP, we really have no way of knowing if the controller paused and
> resumed before we observed that controller status, so a warning would
> be a bit much. We're still quiescing, so the print is valid.

That makes sense.  The work scheduled from the AEN request cannot
guarantee whether firmware has paused/resumed or not.

How about this, we prints out that the queues have been quiesced
instead of "processing paused" which might make confusing between
firmware just sets the CSTS.PS and the work scheduled has just quiesced
the admin and io queues both.

Thanks for your kindly response, by the way.

> In case the controller did pause processing, and we hit the unusual
> timeout race that does not see CSTS.PP, the new prints make it clear
> that's what happened.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 0/3] nvme: Improve processing paused support
  2019-05-24 20:20 [PATCH 0/3] nvme: Improve processing paused support Keith Busch
                   ` (2 preceding siblings ...)
  2019-05-24 20:20 ` [PATCH 3/3] nvme: quiesce admin queue for fw activation Keith Busch
@ 2019-05-31 23:32 ` Sagi Grimberg
  2019-06-01  2:45   ` Minwoo Im
  3 siblings, 1 reply; 12+ messages in thread
From: Sagi Grimberg @ 2019-05-31 23:32 UTC (permalink / raw)


This looks sensible to me,

Minwoo, do you have any concerns or feedback? If not
can I get your review tag?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 0/3] nvme: Improve processing paused support
  2019-05-31 23:32 ` [PATCH 0/3] nvme: Improve processing paused support Sagi Grimberg
@ 2019-06-01  2:45   ` Minwoo Im
  0 siblings, 0 replies; 12+ messages in thread
From: Minwoo Im @ 2019-06-01  2:45 UTC (permalink / raw)


On 19-05-31 16:32:24, Sagi Grimberg wrote:
> This looks sensible to me,
> 
> Minwoo, do you have any concerns or feedback? If not
> can I get your review tag?

Actually this series are looking good to me.  But, I was just
concerning that if there is better print statement or safe access method
for CSTS.PP bit in the code.

But, the print message is not that critical so that I can make my review
on this.  I need to ask Keith about accessing CSTS.PP without seeing
CC.EN can be valid in some cases, by the way :).

For this series:

Reviewed-by: Minwoo Im <minwoo.im.dev at gmail.com>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/3] nvme-pci: reset timeout when processing is paused
  2019-05-24 20:20 ` [PATCH 1/3] nvme-pci: reset timeout when processing is paused Keith Busch
  2019-05-25 13:22   ` Minwoo Im
@ 2019-06-01  9:07   ` Christoph Hellwig
  1 sibling, 0 replies; 12+ messages in thread
From: Christoph Hellwig @ 2019-06-01  9:07 UTC (permalink / raw)


On Fri, May 24, 2019@02:20:34PM -0600, Keith Busch wrote:
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index f562154551ce..101e20522374 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1263,7 +1263,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>  	 * the recovery mechanism will surely fail.
>  	 */
>  	mb();
> -	if (pci_channel_offline(to_pci_dev(dev->dev)))
> +	if (pci_channel_offline(to_pci_dev(dev->dev)) || (csts & NVME_CSTS_PP))

I think we at least need a ratelimited printk when this happens so
people know they don't get timeouts because of CSTS.PP.

And maybe we need a timeout how long we allow the timeouts to extended
due to CSTS.PP, otherwise a buggy device that never clears it would
hold I/O hostage forever.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 2/3] nvme: rearm fw notification in admin only state
  2019-05-24 20:20 ` [PATCH 2/3] nvme: rearm fw notification in admin only state Keith Busch
@ 2019-06-01  9:08   ` Christoph Hellwig
  0 siblings, 0 replies; 12+ messages in thread
From: Christoph Hellwig @ 2019-06-01  9:08 UTC (permalink / raw)


On Fri, May 24, 2019@02:20:35PM -0600, Keith Busch wrote:
> +	if (ctrl->state != NVME_CTRL_LIVE &&
> +	    ctrl->state != NVME_CTRL_ADMIN_ONLY)

I think we really

 a) need a nvme_ctrl_is_live() helper encapsulating the above
 b) audit all checks for NVME_CTRL_LIVE if they really need to use
    this helper instead.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2019-06-01  9:08 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-24 20:20 [PATCH 0/3] nvme: Improve processing paused support Keith Busch
2019-05-24 20:20 ` [PATCH 1/3] nvme-pci: reset timeout when processing is paused Keith Busch
2019-05-25 13:22   ` Minwoo Im
2019-06-01  9:07   ` Christoph Hellwig
2019-05-24 20:20 ` [PATCH 2/3] nvme: rearm fw notification in admin only state Keith Busch
2019-06-01  9:08   ` Christoph Hellwig
2019-05-24 20:20 ` [PATCH 3/3] nvme: quiesce admin queue for fw activation Keith Busch
2019-05-25 13:17   ` Minwoo Im
2019-05-25 14:07     ` Keith Busch
2019-05-25 15:05       ` Minwoo Im
2019-05-31 23:32 ` [PATCH 0/3] nvme: Improve processing paused support Sagi Grimberg
2019-06-01  2:45   ` Minwoo Im

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.