linux-nvme.lists.infradead.org archive mirror
* [PATCH] NVMe: Split shutdown work
@ 2015-11-23 18:17 Keith Busch
  2015-11-24  7:34 ` Christoph Hellwig
  0 siblings, 1 reply; 7+ messages in thread
From: Keith Busch @ 2015-11-23 18:17 UTC (permalink / raw)


A command timeout during controller start should safely disable the
controller first. This requires decoupling shutdown work from start-up,
and special handling for timeouts that occur during start-up.

Setting the queue count has a special check for a disabled device in
order to set an appropriate return code, as this command may legitimately
fail. All other initialization commands do not need to distinguish
timeouts from command errors.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
This is based on Christoph's block tree:

  http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/nvme-req.9

 drivers/nvme/host/pci.c | 39 ++++++++++++++++++++++++++-------------
 1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index ff253c8..cc08fae 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -86,6 +86,7 @@ struct nvme_queue;
 static int nvme_reset(struct nvme_dev *dev);
 static void nvme_process_cq(struct nvme_queue *nvmeq);
 static void nvme_remove_dead_ctrl(struct nvme_dev *dev);
+static void nvme_dev_shutdown(struct nvme_dev *dev);
 
 struct async_cmd_info {
 	struct kthread_work work;
@@ -116,6 +117,7 @@ struct nvme_dev {
 	void __iomem *bar;
 	struct work_struct reset_work;
 	struct work_struct scan_work;
+	struct work_struct shutdown_work;
 	struct work_struct remove_work;
 	bool subsystem;
 	u32 page_size;
@@ -938,14 +940,14 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	struct nvme_command cmd;
 
 	/*
-	 * Just return an error to reset_work if we're already called from
-	 * probe context.  We don't worry about request / tag reuse as
-	 * reset_work will immeditalely unwind and remove the controller
-	 * once we fail a command.
+	 * Shutdown immediately if controller times out while starting. The
+	 * reset work will see the pci device disabled when it gets the forced
+	 * cancellation error.
 	 */
 	if (test_bit(NVME_CTRL_RESETTING, &dev->flags)) {
-		req->errors = NVME_SC_ABORT_REQ | NVME_SC_DNR;
-		return BLK_EH_HANDLED;
+		dev_warn(dev->dev, "timeout during initialization\n");
+		queue_work(nvme_workq, &dev->shutdown_work);
+		return BLK_EH_QUIESCED;
 	}
 
 	/*
@@ -1510,11 +1512,12 @@ static int set_queue_count(struct nvme_dev *dev, int count)
 
 	status = nvme_set_features(&dev->ctrl, NVME_FEAT_NUM_QUEUES, q_count, 0,
 								&result);
-	if (status < 0)
-		return status;
-	if (status > 0) {
-		dev_err(dev->dev, "Could not set queue count (%d)\n", status);
-		return 0;
+	if (status) {
+		bool enabled = pci_is_enabled(to_pci_dev(dev->dev));
+
+		dev_err(dev->dev, "Could not set queue count (%d) enabled:%d\n",
+								status, enabled);
+		return enabled ? 0 : -ENODEV;
 	}
 	return min(result & 0xffff, result >> 16) + 1;
 }
@@ -2080,6 +2083,12 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
 	kfree(dev);
 }
 
+static void nvme_shutdown_work(struct work_struct *work)
+{
+	struct nvme_dev *dev = container_of(work, struct nvme_dev, shutdown_work);
+	nvme_dev_shutdown(dev);
+}
+
 static void nvme_reset_work(struct work_struct *work)
 {
 	struct nvme_dev *dev = container_of(work, struct nvme_dev, reset_work);
@@ -2092,8 +2101,10 @@ static void nvme_reset_work(struct work_struct *work)
 	 * If we're called to reset a live controller first shut it down before
 	 * moving on.
 	 */
-	if (dev->bar)
-		nvme_dev_shutdown(dev);
+	if (dev->bar) {
+		queue_work(nvme_workq, &dev->shutdown_work);
+		flush_work(&dev->shutdown_work);
+	}
 
 	set_bit(NVME_CTRL_RESETTING, &dev->flags);
 
@@ -2250,6 +2261,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	INIT_LIST_HEAD(&dev->node);
 	INIT_WORK(&dev->scan_work, nvme_dev_scan);
 	INIT_WORK(&dev->reset_work, nvme_reset_work);
+	INIT_WORK(&dev->shutdown_work, nvme_shutdown_work);
 	INIT_WORK(&dev->remove_work, nvme_remove_dead_ctrl_work);
 
 	result = nvme_setup_prp_pools(dev);
@@ -2305,6 +2317,7 @@ static void nvme_remove(struct pci_dev *pdev)
 	nvme_remove_namespaces(&dev->ctrl);
 	nvme_uninit_ctrl(&dev->ctrl);
 	nvme_dev_shutdown(dev);
+	flush_work(&dev->shutdown_work);
 	nvme_dev_remove_admin(dev);
 	nvme_free_queues(dev, 0);
 	nvme_release_cmb(dev);
-- 
2.6.2.307.g37023ba


* [PATCH] NVMe: Split shutdown work
  2015-11-23 18:17 [PATCH] NVMe: Split shutdown work Keith Busch
@ 2015-11-24  7:34 ` Christoph Hellwig
  2015-11-24 15:14   ` Keith Busch
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2015-11-24  7:34 UTC (permalink / raw)


On Mon, Nov 23, 2015 at 11:17:45AM -0700, Keith Busch wrote:
> Setting the queue count has a special check for a disabled device in
> order to set an appropriate return code as this command may legitimately
> fail.

Btw, I've tried to understand this as the explanation didn't make sense
to me.  I think I finally found the culprit, and it's this text:

"This feature shall only be issued during initialization prior to creation
 of any I/O Submission and/or Completion Queues. The value allocated shall
 not change between resets."

To me this doesn't indicate that it can legitimately fail, but rather
that we must not even submit it during a reset or resume.  One more
reason for a proper state model...


* [PATCH] NVMe: Split shutdown work
  2015-11-24  7:34 ` Christoph Hellwig
@ 2015-11-24 15:14   ` Keith Busch
  2015-11-24 15:31     ` Christoph Hellwig
  0 siblings, 1 reply; 7+ messages in thread
From: Keith Busch @ 2015-11-24 15:14 UTC (permalink / raw)


On Mon, Nov 23, 2015 at 11:34:48PM -0800, Christoph Hellwig wrote:
> to me.  I think I finally found the culprit, and it's this text:
> 
> "This feature shall only be issued during initialization prior to creation
>  of any I/O Submission and/or Completion Queues. The value allocated shall
>  not change between resets."
> 
> To me this doesn't indicate that it can legitimately fail, but rather
> that we must not even submit it during a reset or resume.  One more
> reason for a proper state model...

Security locked drives may reject "set feature". Some of my drives in
manufacturing mode also fail it.

I think the driver is doing the right thing after every CC.EN transition
and before IO queue creation. The spec wording sounds odd, but it wouldn't
make sense if you weren't allowed to issue it for a reset or resume. That's
pretty much the same scenario as when the driver initially probes the device:
we don't know what state UEFI left it in.

I'll send a new revision that makes nvme_dev_shutdown reentrant so we
don't need the awkward "queue_work + flush_work".
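
Roughly something like this (completely untested sketch; shutdown_lock
would be a new mutex in struct nvme_dev):

static void nvme_dev_shutdown(struct nvme_dev *dev)
{
	/* serialize the timeout handler, reset work and pci remove paths */
	mutex_lock(&dev->shutdown_lock);
	if (pci_is_enabled(to_pci_dev(dev->dev))) {
		/* quiesce the queues and disable the controller here */
	}
	mutex_unlock(&dev->shutdown_lock);
}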


* [PATCH] NVMe: Split shutdown work
  2015-11-24 15:14   ` Keith Busch
@ 2015-11-24 15:31     ` Christoph Hellwig
  2015-11-24 16:13       ` Keith Busch
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2015-11-24 15:31 UTC (permalink / raw)


On Tue, Nov 24, 2015 at 03:14:06PM +0000, Keith Busch wrote:
> Security locked drives may reject "set feature". Some of my drives in
> manufacturing mode also fail it.

Is there any wording in the spec that allows this?  What number of
I/O queues will show up on these drives?  Allowing this failure to be
ignored is definitely black magic and needs long comments explaining
the why and how, or it will get broken accidentally again and again.

> I think the driver is doing the right thing after every CC.EN transition
> and before IO queue creation. The spec wording sounds odd, but it wouldn't
> make sense if you weren't allowed to issue it for a reset or resume. That's
> pretty much the same scenario as when the driver initially probes the device:
> we don't know what state UEFI left it in.

Hmm, true.  Might be a good idea to clarify the spec.

> I'll send a new revision that makes nvme_dev_shutdown reentrant so we
> don't need the awkward "queue_work + flush_work".

We need serialization not just of shutdown calls, but also of shutdown
vs reset.  Thinking about it aren't we doing the shutdown from the
pci_driver ->removal callback with my current branch?


* [PATCH] NVMe: Split shutdown work
  2015-11-24 15:31     ` Christoph Hellwig
@ 2015-11-24 16:13       ` Keith Busch
  2015-11-24 17:58         ` Christoph Hellwig
  0 siblings, 1 reply; 7+ messages in thread
From: Keith Busch @ 2015-11-24 16:13 UTC (permalink / raw)


On Tue, Nov 24, 2015 at 07:31:07AM -0800, Christoph Hellwig wrote:
> On Tue, Nov 24, 2015 at 03:14:06PM +0000, Keith Busch wrote:
> > Security locked drives may reject "set feature". Some of my drives in
> > manufacturing mode also fail it.
> 
> Is there any wording in the spec that allows this?  What number of
> I/O queues will show up on these drives?  Allowing this failure to be
> ignored is definitely black magic and needs long comments explaining
> the why and how, or it will get broken accidentally again and again.

Heh, my reasoning is focused a bit too narrowly. :)

Instead of examining a specific command's failure modes, can we agree
there is a difference in how we should handle a controller that responds
to initialization with failure status vs one that doesn't respond
at all? I don't want to rat hole commentary for an exceedingly rare
scenario, but it helps tremendously to have this distinction if it
happens.

> We need serialization not just of shutdown calls, but also of shutdown
> vs reset.  Thinking about it aren't we doing the shutdown from the
> pci_driver ->removal callback with my current branch?

There's actually lots of entry points to shutdown: system suspend,
shutdown, PCI-e Function Level Reset, NVMe Controller Level Reset, NVMe
Subsystem Reset Occurred/Controller Failure Status, and PCI removal. PCI
removal can happen from PCI-e hotplug event, driver requested, or user
requested.

I've never seen these events occur simultaneously in practice. There's
no handling for it, but we can fix it utilizing the new device flags.
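
Something along these lines, for example (rough sketch; the flag name and
helper are made up for illustration):

static void nvme_shutdown_ctrl_once(struct nvme_dev *dev)
{
	/*
	 * The first entry point (suspend, FLR, controller reset, removal,
	 * ...) to set the bit does the teardown; everyone else backs off.
	 */
	if (test_and_set_bit(NVME_CTRL_SHUTTING_DOWN, &dev->flags))
		return;
	nvme_dev_shutdown(dev);
	clear_bit(NVME_CTRL_SHUTTING_DOWN, &dev->flags);
}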


* [PATCH] NVMe: Split shutdown work
  2015-11-24 16:13       ` Keith Busch
@ 2015-11-24 17:58         ` Christoph Hellwig
  0 siblings, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2015-11-24 17:58 UTC (permalink / raw)


On Tue, Nov 24, 2015 at 04:13:02PM +0000, Keith Busch wrote:
> Heh, my reasoning is focused a bit too narrowly. :)
> 
> Instead of examining a specific command's failure modes, can we agree
> there is a difference in how we should handle a controller that responds
> to initialization with failure status vs one that doesn't respond
> at all? I don't want to rat hole commentary for an exceedingly rare
> scenario, but it helps tremendously to have this distinction if it
> happens.

If we ignore a failure from a command that per spec shouldn't fail, we
will need a comment either way.  But yes, handling an error return vs a
timeout differently makes sense independent of that.
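
For the queue count that would mean keeping something like the old split
between a negative errno and a positive NVMe status (illustration only):

	status = nvme_set_features(&dev->ctrl, NVME_FEAT_NUM_QUEUES, q_count,
					0, &result);
	if (status < 0)
		return status;	/* never got a successful completion: bail out */
	if (status > 0) {
		/* controller returned an NVMe error status: log and move on */
		dev_err(dev->dev, "Could not set queue count (%d)\n", status);
		return 0;
	}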

> 
> > We need serialization not just of shutdown calls, but also of shutdown
> > vs reset.  Thinking about it aren't we doing the shutdown from the
> > pci_driver ->removal callback with my current branch?
> 
> There's actually lots of entry points to shutdown: system suspend,
> shutdown, PCI-e Function Level Reset, NVMe Controller Level Reset, NVMe
> Subsystem Reset Occurred/Controller Failure Status, and PCI removal. PCI
> removal can happen from PCI-e hotplug event, driver requested, or user
> requested.
> 
> I've never seen these events occur simultaneously in practice. There's
> no handling for it, but we can fix it utilizing the new device flags.

Ok.


* [PATCH] NVMe: Split shutdown work
@ 2015-11-23 19:59 Christoph Hellwig
  0 siblings, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2015-11-23 19:59 UTC (permalink / raw)


Hi Keith,

thanks - we should indeed try to shut down the controller first.  The
code looks good in general, but a few comments:

> @@ -1510,11 +1512,12 @@ static int set_queue_count(struct nvme_dev *dev, int count)
>  
>  	status = nvme_set_features(&dev->ctrl, NVME_FEAT_NUM_QUEUES, q_count, 0,
>  								&result);
> -	if (status < 0)
> -		return status;
> -	if (status > 0) {
> -		dev_err(dev->dev, "Could not set queue count (%d)\n", status);
> -		return 0;
> +	if (status) {
> +		bool enabled = pci_is_enabled(to_pci_dev(dev->dev));
> +
> +		dev_err(dev->dev, "Could not set queue count (%d) enabled:%d\n",
> +								status, enabled);
> +		return enabled ? 0 : -ENODEV;
>  	}
>  	return min(result & 0xffff, result >> 16) + 1;
>  }

I would really like to move set_queue_count to common code, and in fact
a patch to do that is in my NVMe target/loop driver series.

Can we instead handle this in the caller, and to do that change
the signature so that it always returns the status from
nvme_set_features as the return value, and instead gets a
pass-by-reference argument for the queue count?
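
I.e. something like this (sketch only; the name is just a placeholder):

static int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
{
	u32 q_count = (*count - 1) | ((*count - 1) << 16);
	u32 result;
	int status;

	status = nvme_set_features(ctrl, NVME_FEAT_NUM_QUEUES, q_count, 0,
			&result);
	if (status)
		return status;	/* let the caller decide how fatal this is */

	*count = min(result & 0xffff, result >> 16) + 1;
	return 0;
}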

> +	if (dev->bar) {
> +		queue_work(nvme_workq, &dev->shutdown_work);
> +		flush_work(&dev->shutdown_work);
> +	}

These synchronous workqueue calls always make me a bit nervous.
This is fine for now, but I'd love to have a single-threaded workqueue
per controller that handles reset / shutdown and everything the kthread
handles in the long run.  I'm happy to do the work for that once your
patch is in.
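
Roughly this, somewhere in nvme_probe() (sketch; the wq field would be new):

	/*
	 * One ordered (single-threaded) workqueue per controller makes
	 * reset, shutdown and scan work serialize against each other.
	 */
	dev->wq = alloc_ordered_workqueue("nvme-%s", WQ_MEM_RECLAIM,
			dev_name(dev->dev));
	if (!dev->wq)
		return -ENOMEM;

	queue_work(dev->wq, &dev->reset_work);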


Thread overview: 7+ messages
2015-11-23 18:17 [PATCH] NVMe: Split shutdown work Keith Busch
2015-11-24  7:34 ` Christoph Hellwig
2015-11-24 15:14   ` Keith Busch
2015-11-24 15:31     ` Christoph Hellwig
2015-11-24 16:13       ` Keith Busch
2015-11-24 17:58         ` Christoph Hellwig
2015-11-23 19:59 Christoph Hellwig
