linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/2] PCI: NVMe reset quirk
@ 2018-07-24  0:12 Alex Williamson
  2018-07-24  0:13 ` [PATCH v2 1/2] PCI: Export pcie_has_flr() Alex Williamson
  2018-07-24  0:13 ` [PATCH v2 2/2] PCI: NVMe device specific reset quirk Alex Williamson
  0 siblings, 2 replies; 8+ messages in thread
From: Alex Williamson @ 2018-07-24  0:12 UTC (permalink / raw)
  To: linux-pci; +Cc: linux-kernel, linux-nvme

v2: Add bug link, use Samsung vendor ID, add spec references

As discussed in the 2nd patch, at least one NVMe controller sometimes
doesn't like being reset while enabled and another will timeout during
a subsequent re-enable if it happens too quickly after reset.
Introduce a device specific reset quirk for all NVMe class devices so
that we can try to get reliable behavior from them for device
assignment and any other users of the PCI subsystem reset interface.

Patches against current PCI next branch.  Thanks,

Alex

---

Alex Williamson (2):
      PCI: Export pcie_has_flr()
      PCI: NVMe device specific reset quirk


 drivers/pci/pci.c    |    3 +
 drivers/pci/quirks.c |  118 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/pci.h  |    1 
 3 files changed, 121 insertions(+), 1 deletion(-)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v2 1/2] PCI: Export pcie_has_flr()
  2018-07-24  0:12 [PATCH v2 0/2] PCI: NVMe reset quirk Alex Williamson
@ 2018-07-24  0:13 ` Alex Williamson
  2018-07-24  0:13 ` [PATCH v2 2/2] PCI: NVMe device specific reset quirk Alex Williamson
  1 sibling, 0 replies; 8+ messages in thread
From: Alex Williamson @ 2018-07-24  0:13 UTC (permalink / raw)
  To: linux-pci; +Cc: linux-kernel, linux-nvme

pcie_flr() suggests pcie_has_flr() to ensure that PCIe FLR support is
present prior to calling.  pcie_flr() is exported while pcie_has_flr()
is not.  Resolve this.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/pci/pci.c   |    3 ++-
 include/linux/pci.h |    1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 2bec76c9d9a7..52fe2d72a99c 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4071,7 +4071,7 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
  * Returns true if the device advertises support for PCIe function level
  * resets.
  */
-static bool pcie_has_flr(struct pci_dev *dev)
+bool pcie_has_flr(struct pci_dev *dev)
 {
 	u32 cap;
 
@@ -4081,6 +4081,7 @@ static bool pcie_has_flr(struct pci_dev *dev)
 	pcie_capability_read_dword(dev, PCI_EXP_DEVCAP, &cap);
 	return cap & PCI_EXP_DEVCAP_FLR;
 }
+EXPORT_SYMBOL_GPL(pcie_has_flr);
 
 /**
  * pcie_flr - initiate a PCIe function level reset
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 04c7ea6ed67b..bbe030d7814f 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1092,6 +1092,7 @@ u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev **limiting_dev,
 			     enum pci_bus_speed *speed,
 			     enum pcie_link_width *width);
 void pcie_print_link_status(struct pci_dev *dev);
+bool pcie_has_flr(struct pci_dev *dev);
 int pcie_flr(struct pci_dev *dev);
 int __pci_reset_function_locked(struct pci_dev *dev);
 int pci_reset_function(struct pci_dev *dev);


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v2 2/2] PCI: NVMe device specific reset quirk
  2018-07-24  0:12 [PATCH v2 0/2] PCI: NVMe reset quirk Alex Williamson
  2018-07-24  0:13 ` [PATCH v2 1/2] PCI: Export pcie_has_flr() Alex Williamson
@ 2018-07-24  0:13 ` Alex Williamson
  2018-07-24  0:40   ` Sinan Kaya
  2018-07-24  6:54   ` Christoph Hellwig
  1 sibling, 2 replies; 8+ messages in thread
From: Alex Williamson @ 2018-07-24  0:13 UTC (permalink / raw)
  To: linux-pci; +Cc: linux-kernel, linux-nvme

Take advantage of NVMe devices using a standard interface to quiesce
the controller prior to reset, including device specific delays before
and after that reset.  This resolves several NVMe device assignment
scenarios with two different vendors.  The Intel DC P3700 controller
has been shown to only work as a VM boot device on the initial VM
startup, failing after reset or reboot, and also fails to initialize
after hot-plug into a VM.  Adding a delay after FLR resolves these
cases.  The Samsung SM961/PM961 (960 EVO) sometimes fails to return
from FLR with the PCI config space reading back as -1.  A reproducible
instance of this behavior is resolved by clearing the enable bit in
the configuration register and waiting for the ready status to clear
(disabling the NVMe controller) prior to FLR.

As all NVMe devices make use of this standard interface and the NVMe
specification also requires PCIe FLR support, we can apply this quirk
to all devices with matching class code.

Link: https://bugzilla.redhat.com/show_bug.cgi?id=1592654
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/pci/quirks.c |  118 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 118 insertions(+)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index e72c8742aafa..bbd029e8d3ae 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -28,6 +28,7 @@
 #include <linux/platform_data/x86/apple.h>
 #include <linux/pm_runtime.h>
 #include <linux/switchtec.h>
+#include <linux/nvme.h>
 #include <asm/dma.h>	/* isa_dma_bridge_buggy */
 #include "pci.h"
 
@@ -3669,6 +3670,122 @@ static int reset_chelsio_generic_dev(struct pci_dev *dev, int probe)
 #define PCI_DEVICE_ID_INTEL_IVB_M_VGA      0x0156
 #define PCI_DEVICE_ID_INTEL_IVB_M2_VGA     0x0166
 
+/* NVMe controller needs delay before testing ready status */
+#define NVME_QUIRK_CHK_RDY_DELAY	(1 << 0)
+/* NVMe controller needs post-FLR delay */
+#define NVME_QUIRK_POST_FLR_DELAY	(1 << 1)
+
+static const struct pci_device_id nvme_reset_tbl[] = {
+	{ PCI_DEVICE(0x1bb1, 0x0100),   /* Seagate Nytro Flash Storage */
+		.driver_data = NVME_QUIRK_CHK_RDY_DELAY, },
+	{ PCI_DEVICE(0x1c58, 0x0003),   /* HGST adapter */
+		.driver_data = NVME_QUIRK_CHK_RDY_DELAY, },
+	{ PCI_DEVICE(0x1c58, 0x0023),   /* WDC SN200 adapter */
+		.driver_data = NVME_QUIRK_CHK_RDY_DELAY, },
+	{ PCI_DEVICE(0x1c5f, 0x0540),   /* Memblaze Pblaze4 adapter */
+		.driver_data = NVME_QUIRK_CHK_RDY_DELAY, },
+	{ PCI_DEVICE(PCI_VENDOR_ID_SAMSUNG, 0xa821),   /* Samsung PM1725 */
+		.driver_data = NVME_QUIRK_CHK_RDY_DELAY, },
+	{ PCI_DEVICE(PCI_VENDOR_ID_SAMSUNG, 0xa822),   /* Samsung PM1725a */
+		.driver_data = NVME_QUIRK_CHK_RDY_DELAY, },
+	{ PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x0953),   /* Intel DC P3700 */
+		.driver_data = NVME_QUIRK_POST_FLR_DELAY, },
+	{ PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xffffff) },
+	{ 0 }
+};
+
+/*
+ * The NVMe specification requires that controllers support PCIe FLR, but
+ * but some Samsung SM961/PM961 controllers fail to recover after FLR (-1
+ * config space) unless the device is quiesced prior to FLR.  Do this for
+ * all NVMe devices by disabling the controller before reset.  Some Intel
+ * controllers also require an additional post-FLR delay or else attempts
+ * to re-enable will timeout, do that here as well with heuristically
+ * determined delay value.  Also maintain the delay between disabling and
+ * checking ready status as used by the native NVMe driver.
+ *
+ * NVMe specification: https://nvmexpress.org/resources/specifications/
+ * Revision 1.0e:
+ *    Chapter 2: Required and optional PCI config registers
+ *    Chapter 3: NVMe control registers
+ *    Chapter 7.3: Reset behavior
+ */
+static int reset_nvme(struct pci_dev *dev, int probe)
+{
+	const struct pci_device_id *id;
+	void __iomem *bar;
+	u16 cmd;
+	u32 cfg;
+
+	id = pci_match_id(nvme_reset_tbl, dev);
+	if (!id || !pcie_has_flr(dev) || !pci_resource_start(dev, 0))
+		return -ENOTTY;
+
+	if (probe)
+		return 0;
+
+	bar = pci_iomap(dev, 0, NVME_REG_CC + sizeof(cfg));
+	if (!bar)
+		return -ENOTTY;
+
+	pci_read_config_word(dev, PCI_COMMAND, &cmd);
+	pci_write_config_word(dev, PCI_COMMAND, cmd | PCI_COMMAND_MEMORY);
+
+	cfg = readl(bar + NVME_REG_CC);
+
+	/* Disable controller if enabled */
+	if (cfg & NVME_CC_ENABLE) {
+		u64 cap = readq(bar + NVME_REG_CAP);
+		unsigned long timeout;
+
+		/*
+		 * Per nvme_disable_ctrl() skip shutdown notification as it
+		 * could complete commands to the admin queue.  We only intend
+		 * to quiesce the device before reset.
+		 */
+		cfg &= ~(NVME_CC_SHN_MASK | NVME_CC_ENABLE);
+
+		writel(cfg, bar + NVME_REG_CC);
+
+		/* A heuristic value, matches NVME_QUIRK_DELAY_AMOUNT */
+		if (id->driver_data & NVME_QUIRK_CHK_RDY_DELAY)
+			msleep(2300);
+
+		/* Cap register provides max timeout in 500ms increments */
+		timeout = ((NVME_CAP_TIMEOUT(cap) + 1) * HZ / 2) + jiffies;
+
+		for (;;) {
+			u32 status = readl(bar + NVME_REG_CSTS);
+
+			/* Ready status becomes zero on disable complete */
+			if (!(status & NVME_CSTS_RDY))
+				break;
+
+			msleep(100);
+
+			if (time_after(jiffies, timeout)) {
+				pci_warn(dev, "Timeout waiting for NVMe ready status to clear after disable\n");
+				break;
+			}
+		}
+	}
+
+	pci_iounmap(dev, bar);
+
+	/*
+	 * We could use the optional NVM Subsystem Reset here, hardware
+	 * supporting this is simply unavailable at the time of this code
+	 * to validate in comparison to PCIe FLR.  NVMe spec dictates that
+	 * NVMe devices shall implement PCIe FLR.
+	 */
+	pcie_flr(dev);
+
+	if (id->driver_data & NVME_QUIRK_POST_FLR_DELAY)
+		msleep(250); /* Heuristic based on Intel DC P3700 */
+
+	return 0;
+}
+
 static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
 	{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82599_SFP_VF,
 		 reset_intel_82599_sfp_virtfn },
@@ -3678,6 +3795,7 @@ static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
 		reset_ivb_igd },
 	{ PCI_VENDOR_ID_CHELSIO, PCI_ANY_ID,
 		reset_chelsio_generic_dev },
+	{ PCI_ANY_ID, PCI_ANY_ID, reset_nvme },
 	{ 0 }
 };
 


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 2/2] PCI: NVMe device specific reset quirk
  2018-07-24  0:13 ` [PATCH v2 2/2] PCI: NVMe device specific reset quirk Alex Williamson
@ 2018-07-24  0:40   ` Sinan Kaya
  2018-07-24  1:57     ` Alex Williamson
  2018-07-24  6:54   ` Christoph Hellwig
  1 sibling, 1 reply; 8+ messages in thread
From: Sinan Kaya @ 2018-07-24  0:40 UTC (permalink / raw)
  To: Alex Williamson, linux-pci; +Cc: linux-kernel, linux-nvme

On 7/23/2018 5:13 PM, Alex Williamson wrote:
> + * The NVMe specification requires that controllers support PCIe FLR, but
> + * but some Samsung SM961/PM961 controllers fail to recover after FLR (-1
> + * config space) unless the device is quiesced prior to FLR.

Does disabling the memory bit in PCI config space as part of the FLR 
reset function help? (like the very first thing)

Can we do that in the pcie_flr() function to cover other endpoint types
that might be pushing traffic while code is trying to do a reset?



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 2/2] PCI: NVMe device specific reset quirk
  2018-07-24  0:40   ` Sinan Kaya
@ 2018-07-24  1:57     ` Alex Williamson
  2018-07-24  2:20       ` Sinan Kaya
  0 siblings, 1 reply; 8+ messages in thread
From: Alex Williamson @ 2018-07-24  1:57 UTC (permalink / raw)
  To: Sinan Kaya; +Cc: linux-pci, linux-kernel, linux-nvme

On Mon, 23 Jul 2018 17:40:02 -0700
Sinan Kaya <okaya@kernel.org> wrote:

> On 7/23/2018 5:13 PM, Alex Williamson wrote:
> > + * The NVMe specification requires that controllers support PCIe FLR, but
> > + * but some Samsung SM961/PM961 controllers fail to recover after FLR (-1
> > + * config space) unless the device is quiesced prior to FLR.  
> 
> Does disabling the memory bit in PCI config space as part of the FLR 
> reset function help? (like the very first thing)

No, it does not.  I modified this to only clear PCI_COMMAND_MEMORY and
call pcie_flr(), the Samsung controller dies just as it did previously.
 
> Can we do that in the pcie_flr() function to cover other endpoint types
> that might be pushing traffic while code is trying to do a reset?

Do you mean PCI_COMMAND_MASTER rather than PCI_COMMAND_MEMORY?  I tried
that too, it doesn't work either.  I'm not really sure the theory
behind clearing memory, clearing busmaster to stop DMA seems like a
sane thing to do, but doesn't help here.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 2/2] PCI: NVMe device specific reset quirk
  2018-07-24  1:57     ` Alex Williamson
@ 2018-07-24  2:20       ` Sinan Kaya
  2018-07-24  3:16         ` Alex Williamson
  0 siblings, 1 reply; 8+ messages in thread
From: Sinan Kaya @ 2018-07-24  2:20 UTC (permalink / raw)
  To: Alex Williamson; +Cc: linux-pci, linux-kernel, linux-nvme

On 7/23/18, Alex Williamson <alex.williamson@redhat.com> wrote:
> On Mon, 23 Jul 2018 17:40:02 -0700
> Sinan Kaya <okaya@kernel.org> wrote:
>
>> On 7/23/2018 5:13 PM, Alex Williamson wrote:
>> > + * The NVMe specification requires that controllers support PCIe FLR,
>> > but
>> > + * but some Samsung SM961/PM961 controllers fail to recover after FLR
>> > (-1
>> > + * config space) unless the device is quiesced prior to FLR.
>>
>> Does disabling the memory bit in PCI config space as part of the FLR
>> reset function help? (like the very first thing)
>
> No, it does not.  I modified this to only clear PCI_COMMAND_MEMORY and
> call pcie_flr(), the Samsung controller dies just as it did previously.
>
>> Can we do that in the pcie_flr() function to cover other endpoint types
>> that might be pushing traffic while code is trying to do a reset?
>
> Do you mean PCI_COMMAND_MASTER rather than PCI_COMMAND_MEMORY?

Yes

>  I tried
> that too, it doesn't work either.  I'm not really sure the theory
> behind clearing memory, clearing busmaster to stop DMA seems like a
> sane thing to do, but doesn't help here.

Let me explain what I guessed. You might be able to fill in the blanks
where I am completely off.

We do vfio initiated flr reset immediately following guest machine
shutdown. The card could be fully enabled and pushing traffic to the
system at this moment.

I don't know if vfio does any device disable or not.

FLR is supposed to reset the endpoint but endpoint doesn't recover per
your report.

Having vendor specific reset routines for PCIE endpoints defeats the
purpose of FLR.

Since the adapter is fully functional, i suggested turning off bus
master and memory enable bits to stop endpoint from sending packets.

But, this is not helping either.

Those sleep statements looked very fragile to be honest.

I was curious if there is something else that we could do for other endpoints.

No objections otherwise.

>
> Alex
>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 2/2] PCI: NVMe device specific reset quirk
  2018-07-24  2:20       ` Sinan Kaya
@ 2018-07-24  3:16         ` Alex Williamson
  0 siblings, 0 replies; 8+ messages in thread
From: Alex Williamson @ 2018-07-24  3:16 UTC (permalink / raw)
  To: Sinan Kaya; +Cc: linux-pci, linux-kernel, linux-nvme

On Mon, 23 Jul 2018 19:20:41 -0700
Sinan Kaya <Okaya@kernel.org> wrote:

> On 7/23/18, Alex Williamson <alex.williamson@redhat.com> wrote:
> > On Mon, 23 Jul 2018 17:40:02 -0700
> > Sinan Kaya <okaya@kernel.org> wrote:
> >  
> >> On 7/23/2018 5:13 PM, Alex Williamson wrote:  
> >> > + * The NVMe specification requires that controllers support PCIe FLR,
> >> > but
> >> > + * but some Samsung SM961/PM961 controllers fail to recover after FLR
> >> > (-1
> >> > + * config space) unless the device is quiesced prior to FLR.  
> >>
> >> Does disabling the memory bit in PCI config space as part of the FLR
> >> reset function help? (like the very first thing)  
> >
> > No, it does not.  I modified this to only clear PCI_COMMAND_MEMORY and
> > call pcie_flr(), the Samsung controller dies just as it did previously.
> >  
> >> Can we do that in the pcie_flr() function to cover other endpoint types
> >> that might be pushing traffic while code is trying to do a reset?  
> >
> > Do you mean PCI_COMMAND_MASTER rather than PCI_COMMAND_MEMORY?  
> 
> Yes
> 
> >  I tried
> > that too, it doesn't work either.  I'm not really sure the theory
> > behind clearing memory, clearing busmaster to stop DMA seems like a
> > sane thing to do, but doesn't help here.  
> 
> Let me explain what I guessed. You might be able to fill in the blanks
> where I am completely off.
> 
> We do vfio initiated flr reset immediately following guest machine
> shutdown. The card could be fully enabled and pushing traffic to the
> system at this moment.
> 
> I don't know if vfio does any device disable or not.

Yes, pci_clear_master() is the very first thing we do in
vfio_pci_disable(), well before we try to reset the device.
 
> FLR is supposed to reset the endpoint but endpoint doesn't recover per
> your report.
> 
> Having vendor specific reset routines for PCIE endpoints defeats the
> purpose of FLR.
> 
> Since the adapter is fully functional, i suggested turning off bus
> master and memory enable bits to stop endpoint from sending packets.
> 
> But, this is not helping either.
> 
> Those sleep statements looked very fragile to be honest.
> 
> I was curious if there is something else that we could do for other endpoints.
> 
> No objections otherwise.

I certainly agree that it would be nice if FLR was more robust on these
devices, but if all devices behaved within the specs we wouldn't have
these quirks to start with ;)  Just as you're suggesting maybe we could
disable busmaster before FLR, which is reasonable but doesn't work
here, I'm basically moving that to a class specific action, quiesce the
controller at the NVMe level rather than PCI level.  Essentially that's
why I thought it reasonable to apply to all NVMe class devices rather
than create just a quirk that delays after FLR for Intel and another
that disables the NVMe controller just for Samsung.  Once I decide to
apply to the whole class, then I need to bring in the device specific
knowledge already found in the native nvme driver for the delay between
clearing the enable bit and checking the ready status bit.  If it's
fragile, then the bare metal nvme driver has the same frailty.  For the
delay I added, all I can say is that it works for me and improves the
usability of the device for this purpose.  I know that 200ms is too
low, ISTR the issue was fixed at 210-220ms, so 250ms provides some
headroom and I've not seen any issues there.  If we want to make it 500
or 1000ms, that's fine by me, I expect it'd work, it's just unnecessary
until we find devices that need longer delays.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 2/2] PCI: NVMe device specific reset quirk
  2018-07-24  0:13 ` [PATCH v2 2/2] PCI: NVMe device specific reset quirk Alex Williamson
  2018-07-24  0:40   ` Sinan Kaya
@ 2018-07-24  6:54   ` Christoph Hellwig
  1 sibling, 0 replies; 8+ messages in thread
From: Christoph Hellwig @ 2018-07-24  6:54 UTC (permalink / raw)
  To: Alex Williamson; +Cc: linux-pci, linux-kernel, linux-nvme

> As all NVMe devices make use of this standard interface and the NVMe
> specification also requires PCIe FLR support, we can apply this quirk
> to all devices with matching class code.

But not all NVMe devices require this quirk.  So please only quirk
devices that actually require it.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2018-07-24  6:54 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-24  0:12 [PATCH v2 0/2] PCI: NVMe reset quirk Alex Williamson
2018-07-24  0:13 ` [PATCH v2 1/2] PCI: Export pcie_has_flr() Alex Williamson
2018-07-24  0:13 ` [PATCH v2 2/2] PCI: NVMe device specific reset quirk Alex Williamson
2018-07-24  0:40   ` Sinan Kaya
2018-07-24  1:57     ` Alex Williamson
2018-07-24  2:20       ` Sinan Kaya
2018-07-24  3:16         ` Alex Williamson
2018-07-24  6:54   ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).