* nvme device timeout
@ 2016-03-28 16:44 Tim Mohlmann
  2016-03-28 17:24 ` Keith Busch
  0 siblings, 1 reply; 13+ messages in thread
From: Tim Mohlmann @ 2016-03-28 16:44 UTC (permalink / raw)


Hi,

During boot my nvme device is not showing up. I've found the following
dmesg output from nvme:
[    0.634251] nvme 0000:04:00.0: PCI INT A: no GSI
[   61.703015] nvme 0000:04:00.0: I/O 0 QID 0 timeout, disable controller
[   61.704028] nvme 0000:04:00.0: Identify Controller failed (-4)
[   61.705016] nvme 0000:04:00.0: Removing after probe failure status: -5

lspci does not list the device anymore.

This problem occurred after upgrading my kernel from 4.4.5 to 4.5-rc7.

I'm new to posting about bugs in the kernel, so please advise how I
can provide you with more meaningful info.

Regards,

Tim
[Attachment: dmesg, 54785 bytes: <http://lists.infradead.org/pipermail/linux-nvme/attachments/20160328/bf047afc/attachment-0001.obj>]


* nvme device timeout
  2016-03-28 16:44 nvme device timeout Tim Mohlmann
@ 2016-03-28 17:24 ` Keith Busch
  2016-03-28 22:30   ` Judy Brock-SSI
  0 siblings, 1 reply; 13+ messages in thread
From: Keith Busch @ 2016-03-28 17:24 UTC (permalink / raw)


On Mon, Mar 28, 2016 at 07:44:20PM +0300, Tim Mohlmann wrote:
> Hi,
> 
> During boot my nvme device is not showing up. I've found the following
> dmesg output from nvme:
> [    0.634251] nvme 0000:04:00.0: PCI INT A: no GSI
> [   61.703015] nvme 0000:04:00.0: I/O 0 QID 0 timeout, disable controller
> [   61.704028] nvme 0000:04:00.0: Identify Controller failed (-4)
> [   61.705016] nvme 0000:04:00.0: Removing after probe failure status: -5
> 
> lspci does not list the device anymore.

Thanks for mentioning that. Removing the device from the pci topology
might be a bit heavy handed. We really only need to unbind the driver
in this case (will send a different patch).
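
Roughly, the difference is the following (sketch only, not the patch I'll
send; these are just the generic PCI / driver-core helpers):

#include <linux/device.h>
#include <linux/pci.h>

/* What effectively happens today on a dead controller: the device is
 * taken out of the PCI topology, so it disappears from lspci. */
static void sketch_remove_from_topology(struct pci_dev *pdev)
{
	pci_stop_and_remove_bus_device_locked(pdev);
}

/* What should be enough: detach only the nvme driver and leave the
 * device enumerated, so lspci still shows it. */
static void sketch_unbind_driver_only(struct pci_dev *pdev)
{
	device_release_driver(&pdev->dev);
}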
 
> This problem occurred after upgrading my kernel from 4.4.5 to 4.5-rc7.

I'll venture a guess that you were unknowingly relying on the driver's
completion-polling feature, and that your controller doesn't support
legacy interrupts.

We used to poll the drive for completions immediately after controller
enable completed, so every command could have been polled. That was moved
a little later in initialization in 4.5 (and removed entirely in 4.6 ...).
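
For context, the polling here is the driver's kthread periodically reaping
every completion queue, so commands could complete even when no interrupt
ever fires. Roughly sketched (simplified from the 4.4/4.5 nvme_kthread;
helper name made up, reset and error handling elided):

/* Sketch only: drain all CQs without relying on an IRQ. */
static void nvme_poll_all_queues(struct nvme_dev *dev)
{
	int i;

	for (i = 0; i < dev->queue_count; i++) {
		struct nvme_queue *nvmeq = dev->queues[i];

		if (!nvmeq)
			continue;
		spin_lock_irq(&nvmeq->q_lock);
		nvme_process_cq(nvmeq);
		spin_unlock_irq(&nvmeq->q_lock);
	}
}

The test patch below just moves nvme_dev_list_add(), which is what starts
that kthread, ahead of the identify step, so those early admin commands can
still be polled.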

> I'm new to posting about bugs in the kernel, so please advise how I
> can provide you with more meaningful info.

Could you apply the test patch below (created against 4.5-stable),
and confirm if it still fails?

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 680f578..b9c989c 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1945,17 +1945,16 @@ static void nvme_reset_work(struct work_struct *work)
 	if (result)
 		goto out;
 
-	result = nvme_init_identify(&dev->ctrl);
+	dev->ctrl.event_limit = NVME_NR_AEN_COMMANDS;
+	result = nvme_dev_list_add(dev);
 	if (result)
 		goto out;
 
-	result = nvme_setup_io_queues(dev);
+	result = nvme_init_identify(&dev->ctrl);
 	if (result)
 		goto out;
 
-	dev->ctrl.event_limit = NVME_NR_AEN_COMMANDS;
-
-	result = nvme_dev_list_add(dev);
+	result = nvme_setup_io_queues(dev);
 	if (result)
 		goto out;
 
--


* nvme device timeout
  2016-03-28 17:24 ` Keith Busch
@ 2016-03-28 22:30   ` Judy Brock-SSI
  2016-03-28 22:54     ` Keith Busch
  0 siblings, 1 reply; 13+ messages in thread
From: Judy Brock-SSI @ 2016-03-28 22:30 UTC (permalink / raw)


Hi,


	>>We used to poll the drive for completions immediately after controller enable completed, so every command could have been polled. That was moved a little later in initialization in 4.5 (and removed entirely in 4.6 ...).

	>>  Removing the device from the pci topology might be a bit heavy handed. We really only need to unbind the driver in this case (will send a different patch).

Maybe I'm misunderstanding, but unbinding the driver in this case, instead of removing the device from the PCI topology, still doesn't achieve what polling for completions immediately after controller enable completes used to do, does it? I mean, Identify will still fail in such cases without the polling.

Why was that support removed and why not reinstate it?

Thanks,
Judy



* nvme device timeout
  2016-03-28 22:30   ` Judy Brock-SSI
@ 2016-03-28 22:54     ` Keith Busch
  2016-03-29  7:03       ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: Keith Busch @ 2016-03-28 22:54 UTC (permalink / raw)


Hi Judy,

There are two different things that I noticed from the tester's report,
but I may have inadvertently made them seem related.

First, the reporter says he doesn't see his device in 'lspci' after the
driver gave up on it. That happened because the driver requests removal
on controller failure, and I wanted to point out that that approach may
be overreaching the driver's responsibility. This has absolutely nothing
to do with polling for completions(*).

The polling feature was moved/removed because it shouldn't be necessary
and hides issues that we should be aware of. When it was originally
proposed to remove polling, I mentioned we may possibly discover some
controllers were unknowingly relying on it. I'm not sure yet if this
report is such an issue vs something else entirely.

As far as bringing the driver initiated polling back, we can talk about
that. Do you have a real need for this feature?


* Note, the "polling" I'm referring to should not be confused with block
  layer driven IO polling. That's a completely different feature, neither
  related to nor affected by this issue.


On Mon, Mar 28, 2016 at 10:30:57PM +0000, Judy Brock-SSI wrote:
> Hi,
> 
> 
> 	>>We used to poll the drive for completions immediately after controller enable completed, so every command could have been polled. That was moved a little later in initialization in 4.5 (and removed entirely in 4.6 ...).
> 
> 	>>  Removing the device from the pci topology might be a bit heavy handed. We really only need to unbind the driver in this case (will send a different patch).
> 
> Maybe I'm misunderstanding, but unbinding the driver in this case, instead of removing the device from the PCI topology, still doesn't achieve what polling for completions immediately after controller enable completes used to do, does it? I mean, Identify will still fail in such cases without the polling.
> 
> Why was that support removed and why not reinstate it?
> 
> Thanks,
> Judy


* nvme device timeout
  2016-03-28 22:54     ` Keith Busch
@ 2016-03-29  7:03       ` Christoph Hellwig
  2016-03-29 11:04         ` Judy Brock-SSI
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2016-03-29  7:03 UTC (permalink / raw)


On Mon, Mar 28, 2016 at 10:54:11PM +0000, Keith Busch wrote:
> As far as bringing the driver initiated polling back, we can talk about
> that. Do you have a real need for this feature?

IFF we have to do this it should be a quirk flag for specific
controllers that need a non-spec-compliant workaround.
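
Mechanically that is just another bit in the quirks enum plus a driver_data
entry in the PCI ID table. A sketch with made-up names and a bogus device ID
(nothing here is a real controller):

/* drivers/nvme/host/nvme.h -- hypothetical new quirk bit */
enum nvme_quirks {
	/* ... existing quirk bits ... */
	NVME_QUIRK_POLL_PROBE	= (1 << 2),	/* made up; use the next free bit */
};

/* drivers/nvme/host/pci.c -- tag only the affected controller */
static const struct pci_device_id nvme_id_table[] = {
	{ PCI_DEVICE(0x1234, 0x5678),		/* bogus vendor/device ID */
		.driver_data = NVME_QUIRK_POLL_PROBE, },
	/* ... existing entries ... */
	{ 0, }
};

The probe path would then fall back to polling only when
(dev->ctrl.quirks & NVME_QUIRK_POLL_PROBE) is set.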


* nvme device timeout
  2016-03-29  7:03       ` Christoph Hellwig
@ 2016-03-29 11:04         ` Judy Brock-SSI
  2016-03-29 11:06           ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: Judy Brock-SSI @ 2016-03-29 11:04 UTC (permalink / raw)


>>  When it was originally proposed to remove polling, I mentioned we may possibly discover some controllers were unknowingly relying on it. I'm not sure yet if this report is such an issue vs something else entirely.
> IFF we have to do this it should be a quirk flag for specific controllers that need a non-spec-compliant workaround.

I think controllers that need it won't know they need it (as Keith hypothesized, they will have been unknowingly relying on polling for completions during probe), so a quirk flag won't do a lot of good.
It seems to make sense to put it back in, to avoid breaking HW that does need it, if there is no compelling functional reason not to do so. Since 4.6 is still in the RC stage, maybe it could be put back into 4.6.




* nvme device timeout
  2016-03-29 11:04         ` Judy Brock-SSI
@ 2016-03-29 11:06           ` Christoph Hellwig
  2016-03-29 14:42             ` Matthew Wilcox
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2016-03-29 11:06 UTC (permalink / raw)


On Tue, Mar 29, 2016 at 11:04:51AM +0000, Judy Brock-SSI wrote:
> I think controllers that need it won't know they need it (as Keith hypothesized, they will have been unknowingly relying on polling for completions during probe), so a quirk flag won't do a lot of good.
> It seems to make sense to put it back in, to avoid breaking HW that does need it, if there is no compelling functional reason not to do so. Since 4.6 is still in the RC stage, maybe it could be put back into 4.6.

It's not about not knowing - these controllers are broken, plain and
simple, and need to be fixed.  It's the same as we do for all kinds of
USB crap in SCSI - we don't simply tolerate broken behavior but instead
require a quirk for it.


* nvme device timeout
  2016-03-29 11:06           ` Christoph Hellwig
@ 2016-03-29 14:42             ` Matthew Wilcox
  2016-03-29 16:52               ` Tim Mohlmann
  0 siblings, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2016-03-29 14:42 UTC (permalink / raw)


On Tue, Mar 29, 2016 at 04:06:59AM -0700, Christoph Hellwig wrote:
> On Tue, Mar 29, 2016 at 11:04:51AM +0000, Judy Brock-SSI wrote:
> > I think controllers that need it won't know they need it (as Keith hypothesized, they will have been unknowingly relying on polling for completions during probe), so a quirk flag won't do a lot of good.
> > It seems to make sense to put it back in, to avoid breaking HW that does need it, if there is no compelling functional reason not to do so. Since 4.6 is still in the RC stage, maybe it could be put back into 4.6.
> 
> It's not about not knowing - these controllers are broken, plain and
> simple, and need to be fixed.  It's the same as we do for all kinds of
> USB crap in SCSI - we don't simply tolerate broken behavior but instead
> require a quirk for it.

I doubt it's the controller.  Usually it's ACPI or how the PCI bridge
got wired up.


* nvme device timeout
  2016-03-29 14:42             ` Matthew Wilcox
@ 2016-03-29 16:52               ` Tim Mohlmann
  0 siblings, 0 replies; 13+ messages in thread
From: Tim Mohlmann @ 2016-03-29 16:52 UTC (permalink / raw)




On 03/29/2016 05:42 PM, Matthew Wilcox wrote:

> 
> I doubt it's the controller.  Usually it's ACPI or how the PCI bridge
> got wired up.
> 

This can be an issue, yes. This is a new laptop, based on the Intel
Skylake platform. Since I got it a month ago I have been walking through
kernel versions to get everything (display, network, sound, etc.) working.

With kernel 4.4, NVMe stopped working in UEFI mode, and with kernel 4.5
it stopped working completely. You will probably not be surprised if I
tell you that I have ACPI problems too, but they were of lower priority
on my list, e.g. battery information missing and poweroff not working.

In my original mail I attached a dmesg output. There are also some ACPI
table errors in it. Maybe you want to look at it? I'm willing to file a
bug report at kernel.org for ACPI, but I wouldn't know where to start.
It would be too vague to open a report saying: some things are not working.

* nvme device timeout
  2016-04-01 16:02   ` Derrick, Jonathan
@ 2016-04-01 16:25     ` Derrick, Jonathan
  0 siblings, 0 replies; 13+ messages in thread
From: Derrick, Jonathan @ 2016-04-01 16:25 UTC (permalink / raw)


>Acked-by: Jon Derrick <jonathan.derrick@intel.com>

For 4.5 stable of course :)


* nvme device timeout
  2016-03-29 14:38 ` Keith Busch
  2016-03-29 17:17   ` Tim Mohlmann
@ 2016-04-01 16:02   ` Derrick, Jonathan
  2016-04-01 16:25     ` Derrick, Jonathan
  1 sibling, 1 reply; 13+ messages in thread
From: Derrick, Jonathan @ 2016-04-01 16:02 UTC (permalink / raw)


Hi Keith, all,

One of our testers verified that this fixed a similar issue with their platform.

Needs a proper patch+log, but otherwise,
Acked-by: Jon Derrick <jonathan.derrick@intel.com>




* nvme device timeout
  2016-03-29 14:38 ` Keith Busch
@ 2016-03-29 17:17   ` Tim Mohlmann
  2016-04-01 16:02   ` Derrick, Jonathan
  1 sibling, 0 replies; 13+ messages in thread
From: Tim Mohlmann @ 2016-03-29 17:17 UTC (permalink / raw)


The last patch worked and made my NVMe device show up again, with UEFI
disabled. Tested against vanilla 4.5.0 sources.

As I wrote in other messages in this thread, before upgrading to 4.5 I
ran into problems with the NVMe device not showing up when UEFI boot is
enabled. I assume this is something completely different. Should I open
a new thread on this ML, file a bug at kernel.org, or use the LKML?

On 03/29/2016 05:38 PM, Keith Busch wrote:
> Thanks, good catch!
> 
> Have you confirmed that this controller is indeed relying on polling
> completions during early init, or do you want to try a fix for kthread
> polling? The test patch is appended below if you still need to confirm
> it.
> 
> If the controller really doesn't work with legacy interrupts, would it
> be possible to permanently set the PCI interrupt pin register to 0? You
> shouldn't advertise the capability if it doesn't work, and the driver
> would have used MSI-x if that was the case.
> 
> If we absolutely need to add a quirk, either polling or force using
> MSI-x sound like possible solutions.
> 
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 680f578..05eb152 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1371,12 +1371,6 @@ static int nvme_kthread(void *data)
>  			int i;
>  			u32 csts = readl(dev->bar + NVME_REG_CSTS);
>  
> -			/*
> -			 * Skip controllers currently under reset.
> -			 */
> -			if (work_pending(&dev->reset_work) || work_busy(&dev->reset_work))
> -				continue;
> -
>  			if ((dev->subsystem && (csts & NVME_CSTS_NSSRO)) ||
>  							csts & NVME_CSTS_CFS) {
>  				if (queue_work(nvme_workq, &dev->reset_work)) {
> @@ -1945,17 +1939,16 @@ static void nvme_reset_work(struct work_struct *work)
>  	if (result)
>  		goto out;
>  
> -	result = nvme_init_identify(&dev->ctrl);
> +	dev->ctrl.event_limit = NVME_NR_AEN_COMMANDS;
> +	result = nvme_dev_list_add(dev);
>  	if (result)
>  		goto out;
>  
> -	result = nvme_setup_io_queues(dev);
> +	result = nvme_init_identify(&dev->ctrl);
>  	if (result)
>  		goto out;
>  
> -	dev->ctrl.event_limit = NVME_NR_AEN_COMMANDS;
> -
> -	result = nvme_dev_list_add(dev);
> +	result = nvme_setup_io_queues(dev);
>  	if (result)
>  		goto out;
>  
> --
>


* nvme device timeout
       [not found] <1604554248.106551459228253752.JavaMail.weblogic@epmlwas03c>
@ 2016-03-29 14:38 ` Keith Busch
  2016-03-29 17:17   ` Tim Mohlmann
  2016-04-01 16:02   ` Derrick, Jonathan
  0 siblings, 2 replies; 13+ messages in thread
From: Keith Busch @ 2016-03-29 14:38 UTC (permalink / raw)



On Tue, Mar 29, 2016 at 05:10:53AM +0000, Amit Kumar Pandey wrote:
> I fear that test patch may not work for Tim's reported issue. Since nvme_kthread will not process any CQ until reset_work is completed, it may not be able to complete the identify cmd and will result in failure.
> 
> static int nvme_kthread(void *data)
> /*
> * Skip controllers currently under reset.
> */
>  if (work_pending(&dev->reset_work) || work_busy(&dev->reset_work))
>   continue;

Thanks, good catch!

Have you confirmed that this controller is indeed relying on polling
completions during early init, or do you want to try a fix for kthread
polling? The test patch is appended below if you still need to confirm
it.

If the controller really doesn't work with legacy interrupts, would it
be possible to permanently set the PCI interrupt pin register to 0? You
shouldn't advertise the capability if it doesn't work, and the driver
would have used MSI-x if that was the case.
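
For reference, checking what the function actually advertises is trivial; a
sketch (lspci -vv shows the same thing on its "Interrupt:" line):

	u8 pin;

	pci_read_config_byte(pdev, PCI_INTERRUPT_PIN, &pin);
	dev_info(&pdev->dev, "interrupt pin register: %d (0 = no INTx)\n", pin);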

If we absolutely need to add a quirk, either polling or force using
MSI-x sound like possible solutions.
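
Forcing MSI-X would just mean insisting on it for the admin queue instead of
ever falling back to INTx; roughly (sketch only, error handling elided):

	struct msix_entry entry = { .entry = 0 };

	/* Request exactly one MSI-X vector for the admin queue. */
	if (pci_enable_msix_range(pdev, &entry, 1, 1) < 0)
		return -ENODEV;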

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 680f578..05eb152 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1371,12 +1371,6 @@ static int nvme_kthread(void *data)
 			int i;
 			u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
-			/*
-			 * Skip controllers currently under reset.
-			 */
-			if (work_pending(&dev->reset_work) || work_busy(&dev->reset_work))
-				continue;
-
 			if ((dev->subsystem && (csts & NVME_CSTS_NSSRO)) ||
 							csts & NVME_CSTS_CFS) {
 				if (queue_work(nvme_workq, &dev->reset_work)) {
@@ -1945,17 +1939,16 @@ static void nvme_reset_work(struct work_struct *work)
 	if (result)
 		goto out;
 
-	result = nvme_init_identify(&dev->ctrl);
+	dev->ctrl.event_limit = NVME_NR_AEN_COMMANDS;
+	result = nvme_dev_list_add(dev);
 	if (result)
 		goto out;
 
-	result = nvme_setup_io_queues(dev);
+	result = nvme_init_identify(&dev->ctrl);
 	if (result)
 		goto out;
 
-	dev->ctrl.event_limit = NVME_NR_AEN_COMMANDS;
-
-	result = nvme_dev_list_add(dev);
+	result = nvme_setup_io_queues(dev);
 	if (result)
 		goto out;
 
--


Thread overview: 13+ messages
2016-03-28 16:44 nvme device timeout Tim Mohlmann
2016-03-28 17:24 ` Keith Busch
2016-03-28 22:30   ` Judy Brock-SSI
2016-03-28 22:54     ` Keith Busch
2016-03-29  7:03       ` Christoph Hellwig
2016-03-29 11:04         ` Judy Brock-SSI
2016-03-29 11:06           ` Christoph Hellwig
2016-03-29 14:42             ` Matthew Wilcox
2016-03-29 16:52               ` Tim Mohlmann
     [not found] <1604554248.106551459228253752.JavaMail.weblogic@epmlwas03c>
2016-03-29 14:38 ` Keith Busch
2016-03-29 17:17   ` Tim Mohlmann
2016-04-01 16:02   ` Derrick, Jonathan
2016-04-01 16:25     ` Derrick, Jonathan
