All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] PCI: Avoid FLR for AMD Matisse HD Audio and USB Controllers
@ 2020-05-16 13:37 Marcos Scriven
  2020-05-16 15:19 ` Marcos Scriven
  2020-05-18 16:17 ` Bjorn Helgaas
  0 siblings, 2 replies; 7+ messages in thread
From: Marcos Scriven @ 2020-05-16 13:37 UTC (permalink / raw)
  To: linux-pci

This patch fixes an FLR bug on the following two devices:

AMD Matisse HD Audio Controller 0x1487
AMD Matisse USB 3.0 Host Controller 0x149c

As there was already such a quirk for an Intel network device, I have
renamed that method and updated the comments, trying to make it
clearer what the specific original devices that were affected are
(based on the commit message this was original done:
https://git.thm.de/mhnn55/eco32-linux-ba/commit/f65fd1aa4f9881d5540192d11f7b8ed2fec936db).

I have ordered them by hex product ID.

I have verified this works on a X570 I AORUS PRO WIFI (rev. 1.0) motherboard.


From 651176ab164ae51e37d5bb86f5948da558744930 Mon Sep 17 00:00:00 2001
From: Marcos Scriven <marcos@scriven.org>
Date: Sat, 16 May 2020 14:23:26 +0100
Subject: [PATCH] PCI: Avoid FLR for:

    AMD Matisse HD Audio Controller 0x1487
    AMD Matisse USB 3.0 Host Controller 0x149c

These devices advertise a Function Level Reset (FLR) capability, but hang
when an FLR is triggered.

To reproduce the problem, attach the device to a VM, then detach and try to
attach again.

Add a quirk to prevent the use of FLR on these devices.

Signed-off-by: Marcos Scriven <marcos@scriven.org>
---
 drivers/pci/quirks.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 28c9a2409c50..ff310f0cac22 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -5129,13 +5129,23 @@ static void quirk_intel_qat_vf_cap(struct pci_dev *pdev)
 }
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x443, quirk_intel_qat_vf_cap);

-/* FLR may cause some 82579 devices to hang */
-static void quirk_intel_no_flr(struct pci_dev *dev)
+/*
+ * FLR may cause the following to devices to hang:
+ *
+ * AMD Starship/Matisse HD Audio Controller 0x1487
+ * AMD Matisse USB 3.0 Host Controller 0x149c
+ * Intel 82579LM Gigabit Ethernet Controller 0x1502
+ * Intel 82579V Gigabit Ethernet Controller 0x1503
+ *
+ */
+static void quirk_no_flr(struct pci_dev *dev)
 {
     dev->dev_flags |= PCI_DEV_FLAGS_NO_FLR_RESET;
 }
-DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1502, quirk_intel_no_flr);
-DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1503, quirk_intel_no_flr);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x1487, quirk_no_flr);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x149c, quirk_no_flr);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1502, quirk_no_flr);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1503, quirk_no_flr);

 static void quirk_no_ext_tags(struct pci_dev *pdev)
 {
--
2.25.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] PCI: Avoid FLR for AMD Matisse HD Audio and USB Controllers
  2020-05-16 13:37 [PATCH] PCI: Avoid FLR for AMD Matisse HD Audio and USB Controllers Marcos Scriven
@ 2020-05-16 15:19 ` Marcos Scriven
  2020-05-18 16:17 ` Bjorn Helgaas
  1 sibling, 0 replies; 7+ messages in thread
From: Marcos Scriven @ 2020-05-16 15:19 UTC (permalink / raw)
  To: linux-pci, bhelgaas

Apologies, I neglected to include Bjorn. Doing so now.

Marcos

On Sat, 16 May 2020 at 14:37, Marcos Scriven <marcos@scriven.org> wrote:
>
> This patch fixes an FLR bug on the following two devices:
>
> AMD Matisse HD Audio Controller 0x1487
> AMD Matisse USB 3.0 Host Controller 0x149c
>
> As there was already such a quirk for an Intel network device, I have
> renamed that method and updated the comments, trying to make it
> clearer what the specific original devices that were affected are
> (based on the commit message this was original done:
> https://git.thm.de/mhnn55/eco32-linux-ba/commit/f65fd1aa4f9881d5540192d11f7b8ed2fec936db).
>
> I have ordered them by hex product ID.
>
> I have verified this works on a X570 I AORUS PRO WIFI (rev. 1.0) motherboard.
>
>
> From 651176ab164ae51e37d5bb86f5948da558744930 Mon Sep 17 00:00:00 2001
> From: Marcos Scriven <marcos@scriven.org>
> Date: Sat, 16 May 2020 14:23:26 +0100
> Subject: [PATCH] PCI: Avoid FLR for:
>
>     AMD Matisse HD Audio Controller 0x1487
>     AMD Matisse USB 3.0 Host Controller 0x149c
>
> These devices advertise a Function Level Reset (FLR) capability, but hang
> when an FLR is triggered.
>
> To reproduce the problem, attach the device to a VM, then detach and try to
> attach again.
>
> Add a quirk to prevent the use of FLR on these devices.
>
> Signed-off-by: Marcos Scriven <marcos@scriven.org>
> ---
>  drivers/pci/quirks.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 28c9a2409c50..ff310f0cac22 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -5129,13 +5129,23 @@ static void quirk_intel_qat_vf_cap(struct pci_dev *pdev)
>  }
>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x443, quirk_intel_qat_vf_cap);
>
> -/* FLR may cause some 82579 devices to hang */
> -static void quirk_intel_no_flr(struct pci_dev *dev)
> +/*
> + * FLR may cause the following to devices to hang:
> + *
> + * AMD Starship/Matisse HD Audio Controller 0x1487
> + * AMD Matisse USB 3.0 Host Controller 0x149c
> + * Intel 82579LM Gigabit Ethernet Controller 0x1502
> + * Intel 82579V Gigabit Ethernet Controller 0x1503
> + *
> + */
> +static void quirk_no_flr(struct pci_dev *dev)
>  {
>      dev->dev_flags |= PCI_DEV_FLAGS_NO_FLR_RESET;
>  }
> -DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1502, quirk_intel_no_flr);
> -DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1503, quirk_intel_no_flr);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x1487, quirk_no_flr);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x149c, quirk_no_flr);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1502, quirk_no_flr);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1503, quirk_no_flr);
>
>  static void quirk_no_ext_tags(struct pci_dev *pdev)
>  {
> --
> 2.25.1

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] PCI: Avoid FLR for AMD Matisse HD Audio and USB Controllers
  2020-05-16 13:37 [PATCH] PCI: Avoid FLR for AMD Matisse HD Audio and USB Controllers Marcos Scriven
  2020-05-16 15:19 ` Marcos Scriven
@ 2020-05-18 16:17 ` Bjorn Helgaas
  2020-05-18 19:26   ` Marcos Scriven
  1 sibling, 1 reply; 7+ messages in thread
From: Bjorn Helgaas @ 2020-05-18 16:17 UTC (permalink / raw)
  To: Marcos Scriven; +Cc: linux-pci, Alex Williamson

[+cc Alex]

On Sat, May 16, 2020 at 02:37:23PM +0100, Marcos Scriven wrote:
> This patch fixes an FLR bug on the following two devices:
> 
> AMD Matisse HD Audio Controller 0x1487
> AMD Matisse USB 3.0 Host Controller 0x149c
> 
> As there was already such a quirk for an Intel network device, I have
> renamed that method and updated the comments, trying to make it
> clearer what the specific original devices that were affected are
> (based on the commit message this was original done:
> https://git.thm.de/mhnn55/eco32-linux-ba/commit/f65fd1aa4f9881d5540192d11f7b8ed2fec936db).
> 
> I have ordered them by hex product ID.
> 
> I have verified this works on a X570 I AORUS PRO WIFI (rev. 1.0) motherboard.

If we avoid FLR, is there another method used to reset these devices
between attachments to different VMs?  Does the lack of FLR mean we
can leak information between VMs?

Would additional delay after the FLR work around this, e.g., something
like 51ba09452d11 ("PCI: Delay after FLR of Intel DC P3700 NVMe")?

> From 651176ab164ae51e37d5bb86f5948da558744930 Mon Sep 17 00:00:00 2001
> From: Marcos Scriven <marcos@scriven.org>
> Date: Sat, 16 May 2020 14:23:26 +0100
> Subject: [PATCH] PCI: Avoid FLR for:
> 
>     AMD Matisse HD Audio Controller 0x1487
>     AMD Matisse USB 3.0 Host Controller 0x149c
> 
> These devices advertise a Function Level Reset (FLR) capability, but hang
> when an FLR is triggered.
> 
> To reproduce the problem, attach the device to a VM, then detach and try to
> attach again.
> 
> Add a quirk to prevent the use of FLR on these devices.
> 
> Signed-off-by: Marcos Scriven <marcos@scriven.org>
> ---
>  drivers/pci/quirks.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 28c9a2409c50..ff310f0cac22 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -5129,13 +5129,23 @@ static void quirk_intel_qat_vf_cap(struct pci_dev *pdev)
>  }
>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x443, quirk_intel_qat_vf_cap);
> 
> -/* FLR may cause some 82579 devices to hang */
> -static void quirk_intel_no_flr(struct pci_dev *dev)
> +/*
> + * FLR may cause the following to devices to hang:
> + *
> + * AMD Starship/Matisse HD Audio Controller 0x1487
> + * AMD Matisse USB 3.0 Host Controller 0x149c
> + * Intel 82579LM Gigabit Ethernet Controller 0x1502
> + * Intel 82579V Gigabit Ethernet Controller 0x1503
> + *
> + */
> +static void quirk_no_flr(struct pci_dev *dev)
>  {
>      dev->dev_flags |= PCI_DEV_FLAGS_NO_FLR_RESET;
>  }
> -DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1502, quirk_intel_no_flr);
> -DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1503, quirk_intel_no_flr);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x1487, quirk_no_flr);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x149c, quirk_no_flr);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1502, quirk_no_flr);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1503, quirk_no_flr);
> 
>  static void quirk_no_ext_tags(struct pci_dev *pdev)
>  {
> --
> 2.25.1

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] PCI: Avoid FLR for AMD Matisse HD Audio and USB Controllers
  2020-05-18 16:17 ` Bjorn Helgaas
@ 2020-05-18 19:26   ` Marcos Scriven
  2020-05-20  9:41     ` Marcos Scriven
  0 siblings, 1 reply; 7+ messages in thread
From: Marcos Scriven @ 2020-05-18 19:26 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, Alex Williamson

On Mon, 18 May 2020 at 17:17, Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> [+cc Alex]
>
> On Sat, May 16, 2020 at 02:37:23PM +0100, Marcos Scriven wrote:
> > This patch fixes an FLR bug on the following two devices:
> >
> > AMD Matisse HD Audio Controller 0x1487
> > AMD Matisse USB 3.0 Host Controller 0x149c
> >
> > As there was already such a quirk for an Intel network device, I have
> > renamed that method and updated the comments, trying to make it
> > clearer what the specific original devices that were affected are
> > (based on the commit message this was original done:
> > https://git.thm.de/mhnn55/eco32-linux-ba/commit/f65fd1aa4f9881d5540192d11f7b8ed2fec936db).
> >
> > I have ordered them by hex product ID.
> >
> > I have verified this works on a X570 I AORUS PRO WIFI (rev. 1.0) motherboard.
>
> If we avoid FLR, is there another method used to reset these devices
> between attachments to different VMs?  Does the lack of FLR mean we
> can leak information between VMs?
>
> Would additional delay after the FLR work around this, e.g., something
> like 51ba09452d11 ("PCI: Delay after FLR of Intel DC P3700 NVMe")?
>

Thanks for looking at this patch Bjorn.

To take your three points:

1. Certainly I can see those devices able to be passed back and forth
between host and guest multiple times, once this patch is applied.

2. I don't know the answer to that question; would appreciate guidance
on how to determine this. Do you mean perhaps some buffered data in
the USB controller, for instance?

3. I have not tried an additional delay. This is the logs I see when
the error is occurring:

[ 2423.556570] vfio-pci 0000:0c:00.3: not ready 1023ms after FLR; waiting
[ 2425.604526] vfio-pci 0000:0c:00.3: not ready 2047ms after FLR; waiting
[ 2428.804509] vfio-pci 0000:0c:00.3: not ready 4095ms after FLR; waiting
[ 2433.924409] vfio-pci 0000:0c:00.3: not ready 8191ms after FLR; waiting
[ 2443.140721] vfio-pci 0000:0c:00.3: not ready 16383ms after FLR; waiting
[ 2461.571944] vfio-pci 0000:0c:00.3: not ready 32767ms after FLR; waiting
[ 2496.387544] vfio-pci 0000:0c:00.3: not ready 65535ms after FLR; giving up

What makes this bug especially bad is the host never recovers, and
eventually hangs or crashes.

For reference, the delay example you're talking about is:

static int delay_250ms_after_flr(struct pci_dev *dev, int probe)
{
if (!pcie_has_flr(dev))
return -ENOTTY;

if (probe)
return 0;

pcie_flr(dev);

msleep(250);

return 0;
}

I don't know if it would work, but I will try it out and report back.

Marcos


> > From 651176ab164ae51e37d5bb86f5948da558744930 Mon Sep 17 00:00:00 2001
> > From: Marcos Scriven <marcos@scriven.org>
> > Date: Sat, 16 May 2020 14:23:26 +0100
> > Subject: [PATCH] PCI: Avoid FLR for:
> >
> >     AMD Matisse HD Audio Controller 0x1487
> >     AMD Matisse USB 3.0 Host Controller 0x149c
> >
> > These devices advertise a Function Level Reset (FLR) capability, but hang
> > when an FLR is triggered.
> >
> > To reproduce the problem, attach the device to a VM, then detach and try to
> > attach again.
> >
> > Add a quirk to prevent the use of FLR on these devices.
> >
> > Signed-off-by: Marcos Scriven <marcos@scriven.org>
> > ---
> >  drivers/pci/quirks.c | 18 ++++++++++++++----
> >  1 file changed, 14 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index 28c9a2409c50..ff310f0cac22 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -5129,13 +5129,23 @@ static void quirk_intel_qat_vf_cap(struct pci_dev *pdev)
> >  }
> >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x443, quirk_intel_qat_vf_cap);
> >
> > -/* FLR may cause some 82579 devices to hang */
> > -static void quirk_intel_no_flr(struct pci_dev *dev)
> > +/*
> > + * FLR may cause the following to devices to hang:
> > + *
> > + * AMD Starship/Matisse HD Audio Controller 0x1487
> > + * AMD Matisse USB 3.0 Host Controller 0x149c
> > + * Intel 82579LM Gigabit Ethernet Controller 0x1502
> > + * Intel 82579V Gigabit Ethernet Controller 0x1503
> > + *
> > + */
> > +static void quirk_no_flr(struct pci_dev *dev)
> >  {
> >      dev->dev_flags |= PCI_DEV_FLAGS_NO_FLR_RESET;
> >  }
> > -DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1502, quirk_intel_no_flr);
> > -DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1503, quirk_intel_no_flr);
> > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x1487, quirk_no_flr);
> > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x149c, quirk_no_flr);
> > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1502, quirk_no_flr);
> > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1503, quirk_no_flr);
> >
> >  static void quirk_no_ext_tags(struct pci_dev *pdev)
> >  {
> > --
> > 2.25.1

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] PCI: Avoid FLR for AMD Matisse HD Audio and USB Controllers
  2020-05-18 19:26   ` Marcos Scriven
@ 2020-05-20  9:41     ` Marcos Scriven
  2020-05-20 23:29       ` Bjorn Helgaas
  0 siblings, 1 reply; 7+ messages in thread
From: Marcos Scriven @ 2020-05-20  9:41 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, Alex Williamson

On Mon, 18 May 2020 at 20:26, Marcos Scriven <marcos@scriven.org> wrote:
>
> On Mon, 18 May 2020 at 17:17, Bjorn Helgaas <helgaas@kernel.org> wrote:
> >
> > [+cc Alex]
> >
> > On Sat, May 16, 2020 at 02:37:23PM +0100, Marcos Scriven wrote:
> > > This patch fixes an FLR bug on the following two devices:
> > >
> > > AMD Matisse HD Audio Controller 0x1487
> > > AMD Matisse USB 3.0 Host Controller 0x149c
> > >
> > > As there was already such a quirk for an Intel network device, I have
> > > renamed that method and updated the comments, trying to make it
> > > clearer what the specific original devices that were affected are
> > > (based on the commit message this was original done:
> > > https://git.thm.de/mhnn55/eco32-linux-ba/commit/f65fd1aa4f9881d5540192d11f7b8ed2fec936db).
> > >
> > > I have ordered them by hex product ID.
> > >
> > > I have verified this works on a X570 I AORUS PRO WIFI (rev. 1.0) motherboard.
> >
> > If we avoid FLR, is there another method used to reset these devices
> > between attachments to different VMs?  Does the lack of FLR mean we
> > can leak information between VMs?
> >
> > Would additional delay after the FLR work around this, e.g., something
> > like 51ba09452d11 ("PCI: Delay after FLR of Intel DC P3700 NVMe")?
> >
>
> Thanks for looking at this patch Bjorn.
>
> To take your three points:
>
> 1. Certainly I can see those devices able to be passed back and forth
> between host and guest multiple times, once this patch is applied.
>
> 2. I don't know the answer to that question; would appreciate guidance
> on how to determine this. Do you mean perhaps some buffered data in
> the USB controller, for instance?
>
> 3. I have not tried an additional delay. This is the logs I see when
> the error is occurring:
>
> [ 2423.556570] vfio-pci 0000:0c:00.3: not ready 1023ms after FLR; waiting
> [ 2425.604526] vfio-pci 0000:0c:00.3: not ready 2047ms after FLR; waiting
> [ 2428.804509] vfio-pci 0000:0c:00.3: not ready 4095ms after FLR; waiting
> [ 2433.924409] vfio-pci 0000:0c:00.3: not ready 8191ms after FLR; waiting
> [ 2443.140721] vfio-pci 0000:0c:00.3: not ready 16383ms after FLR; waiting
> [ 2461.571944] vfio-pci 0000:0c:00.3: not ready 32767ms after FLR; waiting
> [ 2496.387544] vfio-pci 0000:0c:00.3: not ready 65535ms after FLR; giving up
>
> What makes this bug especially bad is the host never recovers, and
> eventually hangs or crashes.
>
> For reference, the delay example you're talking about is:
>
> static int delay_250ms_after_flr(struct pci_dev *dev, int probe)
> {
> if (!pcie_has_flr(dev))
> return -ENOTTY;
>
> if (probe)
> return 0;
>
> pcie_flr(dev);
>
> msleep(250);
>
> return 0;
> }
>
> I don't know if it would work, but I will try it out and report back.
>
> Marcos
>
>

Bjorn/Alex

I have just tried the alternate approach of adding a 250ms delay to
the function level reset - this unfortunately results in the same
broken behaviour, with the host itself never recovering.

[   76.905410] vfio-pci 0000:0d:00.3: not ready 1023ms after FLR; waiting
[   79.018014] vfio-pci 0000:0d:00.3: not ready 2047ms after FLR; waiting
[   82.089390] vfio-pci 0000:0d:00.3: not ready 4095ms after FLR; waiting
[   87.209416] vfio-pci 0000:0d:00.3: not ready 8191ms after FLR; waiting
[   96.425440] vfio-pci 0000:0d:00.3: not ready 16383ms after FLR; waiting
[  114.615491] vfio-pci 0000:0d:00.3: not ready 32767ms after FLR; waiting
[  149.417712] vfio-pci 0000:0d:00.3: not ready 65535ms after FLR; giving up

I also tried a full second, to no avail.

What would be the next step in proceeding with the original patch please?

How can I allay your concerns on data leaking between VMs?

Thanks for your help with the patch.

Marcos

> > > From 651176ab164ae51e37d5bb86f5948da558744930 Mon Sep 17 00:00:00 2001
> > > From: Marcos Scriven <marcos@scriven.org>
> > > Date: Sat, 16 May 2020 14:23:26 +0100
> > > Subject: [PATCH] PCI: Avoid FLR for:
> > >
> > >     AMD Matisse HD Audio Controller 0x1487
> > >     AMD Matisse USB 3.0 Host Controller 0x149c
> > >
> > > These devices advertise a Function Level Reset (FLR) capability, but hang
> > > when an FLR is triggered.
> > >
> > > To reproduce the problem, attach the device to a VM, then detach and try to
> > > attach again.
> > >
> > > Add a quirk to prevent the use of FLR on these devices.
> > >
> > > Signed-off-by: Marcos Scriven <marcos@scriven.org>
> > > ---
> > >  drivers/pci/quirks.c | 18 ++++++++++++++----
> > >  1 file changed, 14 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > > index 28c9a2409c50..ff310f0cac22 100644
> > > --- a/drivers/pci/quirks.c
> > > +++ b/drivers/pci/quirks.c
> > > @@ -5129,13 +5129,23 @@ static void quirk_intel_qat_vf_cap(struct pci_dev *pdev)
> > >  }
> > >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x443, quirk_intel_qat_vf_cap);
> > >
> > > -/* FLR may cause some 82579 devices to hang */
> > > -static void quirk_intel_no_flr(struct pci_dev *dev)
> > > +/*
> > > + * FLR may cause the following to devices to hang:
> > > + *
> > > + * AMD Starship/Matisse HD Audio Controller 0x1487
> > > + * AMD Matisse USB 3.0 Host Controller 0x149c
> > > + * Intel 82579LM Gigabit Ethernet Controller 0x1502
> > > + * Intel 82579V Gigabit Ethernet Controller 0x1503
> > > + *
> > > + */
> > > +static void quirk_no_flr(struct pci_dev *dev)
> > >  {
> > >      dev->dev_flags |= PCI_DEV_FLAGS_NO_FLR_RESET;
> > >  }
> > > -DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1502, quirk_intel_no_flr);
> > > -DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1503, quirk_intel_no_flr);
> > > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x1487, quirk_no_flr);
> > > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x149c, quirk_no_flr);
> > > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1502, quirk_no_flr);
> > > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1503, quirk_no_flr);
> > >
> > >  static void quirk_no_ext_tags(struct pci_dev *pdev)
> > >  {
> > > --
> > > 2.25.1

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] PCI: Avoid FLR for AMD Matisse HD Audio and USB Controllers
  2020-05-20  9:41     ` Marcos Scriven
@ 2020-05-20 23:29       ` Bjorn Helgaas
  2020-05-21  7:05         ` Marcos Scriven
  0 siblings, 1 reply; 7+ messages in thread
From: Bjorn Helgaas @ 2020-05-20 23:29 UTC (permalink / raw)
  To: Marcos Scriven; +Cc: linux-pci, Alex Williamson

On Wed, May 20, 2020 at 10:41:03AM +0100, Marcos Scriven wrote:
> On Mon, 18 May 2020 at 20:26, Marcos Scriven <marcos@scriven.org> wrote:
> >
> > On Mon, 18 May 2020 at 17:17, Bjorn Helgaas <helgaas@kernel.org> wrote:
> > >
> > > [+cc Alex]
> > >
> > > On Sat, May 16, 2020 at 02:37:23PM +0100, Marcos Scriven wrote:
> > > > This patch fixes an FLR bug on the following two devices:
> > > >
> > > > AMD Matisse HD Audio Controller 0x1487
> > > > AMD Matisse USB 3.0 Host Controller 0x149c
> > > >
> > > > As there was already such a quirk for an Intel network device, I have
> > > > renamed that method and updated the comments, trying to make it
> > > > clearer what the specific original devices that were affected are
> > > > (based on the commit message this was original done:
> > > > https://git.thm.de/mhnn55/eco32-linux-ba/commit/f65fd1aa4f9881d5540192d11f7b8ed2fec936db).
> > > >
> > > > I have ordered them by hex product ID.
> > > >
> > > > I have verified this works on a X570 I AORUS PRO WIFI (rev. 1.0) motherboard.
> > >
> > > If we avoid FLR, is there another method used to reset these devices
> > > between attachments to different VMs?  Does the lack of FLR mean we
> > > can leak information between VMs?
> > >
> > > Would additional delay after the FLR work around this, e.g., something
> > > like 51ba09452d11 ("PCI: Delay after FLR of Intel DC P3700 NVMe")?
> > >
> >
> > Thanks for looking at this patch Bjorn.
> >
> > To take your three points:
> >
> > 1. Certainly I can see those devices able to be passed back and forth
> > between host and guest multiple times, once this patch is applied.
> >
> > 2. I don't know the answer to that question; would appreciate guidance
> > on how to determine this. Do you mean perhaps some buffered data in
> > the USB controller, for instance?
> >
> > 3. I have not tried an additional delay. This is the logs I see when
> > the error is occurring:
> >
> > [ 2423.556570] vfio-pci 0000:0c:00.3: not ready 1023ms after FLR; waiting
> > [ 2425.604526] vfio-pci 0000:0c:00.3: not ready 2047ms after FLR; waiting
> > [ 2428.804509] vfio-pci 0000:0c:00.3: not ready 4095ms after FLR; waiting
> > [ 2433.924409] vfio-pci 0000:0c:00.3: not ready 8191ms after FLR; waiting
> > [ 2443.140721] vfio-pci 0000:0c:00.3: not ready 16383ms after FLR; waiting
> > [ 2461.571944] vfio-pci 0000:0c:00.3: not ready 32767ms after FLR; waiting
> > [ 2496.387544] vfio-pci 0000:0c:00.3: not ready 65535ms after FLR; giving up
> >
> > What makes this bug especially bad is the host never recovers, and
> > eventually hangs or crashes.
> >
> > For reference, the delay example you're talking about is:
> >
> > static int delay_250ms_after_flr(struct pci_dev *dev, int probe)
> > {
> > if (!pcie_has_flr(dev))
> > return -ENOTTY;
> >
> > if (probe)
> > return 0;
> >
> > pcie_flr(dev);
> >
> > msleep(250);
> >
> > return 0;
> > }
> >
> > I don't know if it would work, but I will try it out and report back.
> >
> > Marcos
> >
> >
> 
> Bjorn/Alex
> 
> I have just tried the alternate approach of adding a 250ms delay to
> the function level reset - this unfortunately results in the same
> broken behaviour, with the host itself never recovering.
> 
> [   76.905410] vfio-pci 0000:0d:00.3: not ready 1023ms after FLR; waiting
> [   79.018014] vfio-pci 0000:0d:00.3: not ready 2047ms after FLR; waiting
> [   82.089390] vfio-pci 0000:0d:00.3: not ready 4095ms after FLR; waiting
> [   87.209416] vfio-pci 0000:0d:00.3: not ready 8191ms after FLR; waiting
> [   96.425440] vfio-pci 0000:0d:00.3: not ready 16383ms after FLR; waiting
> [  114.615491] vfio-pci 0000:0d:00.3: not ready 32767ms after FLR; waiting
> [  149.417712] vfio-pci 0000:0d:00.3: not ready 65535ms after FLR; giving up
> 
> I also tried a full second, to no avail.
> 
> What would be the next step in proceeding with the original patch please?

Implementation of FLR is "strongly recommended" by the spec but is
optional.  So I don't see a problem with just avoiding it via your
patch.

I applied it to pci/virtualization for v5.8, thanks!

Bjorn

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] PCI: Avoid FLR for AMD Matisse HD Audio and USB Controllers
  2020-05-20 23:29       ` Bjorn Helgaas
@ 2020-05-21  7:05         ` Marcos Scriven
  0 siblings, 0 replies; 7+ messages in thread
From: Marcos Scriven @ 2020-05-21  7:05 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, Alex Williamson

On Thu, 21 May 2020 at 00:29, Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Wed, May 20, 2020 at 10:41:03AM +0100, Marcos Scriven wrote:
> > On Mon, 18 May 2020 at 20:26, Marcos Scriven <marcos@scriven.org> wrote:
> > >
> > > On Mon, 18 May 2020 at 17:17, Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > >
> > > > [+cc Alex]
> > > >
> > > > On Sat, May 16, 2020 at 02:37:23PM +0100, Marcos Scriven wrote:
> > > > > This patch fixes an FLR bug on the following two devices:
> > > > >
> > > > > AMD Matisse HD Audio Controller 0x1487
> > > > > AMD Matisse USB 3.0 Host Controller 0x149c
> > > > >
> > > > > As there was already such a quirk for an Intel network device, I have
> > > > > renamed that method and updated the comments, trying to make it
> > > > > clearer what the specific original devices that were affected are
> > > > > (based on the commit message this was original done:
> > > > > https://git.thm.de/mhnn55/eco32-linux-ba/commit/f65fd1aa4f9881d5540192d11f7b8ed2fec936db).
> > > > >
> > > > > I have ordered them by hex product ID.
> > > > >
> > > > > I have verified this works on a X570 I AORUS PRO WIFI (rev. 1.0) motherboard.
> > > >
> > > > If we avoid FLR, is there another method used to reset these devices
> > > > between attachments to different VMs?  Does the lack of FLR mean we
> > > > can leak information between VMs?
> > > >
> > > > Would additional delay after the FLR work around this, e.g., something
> > > > like 51ba09452d11 ("PCI: Delay after FLR of Intel DC P3700 NVMe")?
> > > >
> > >
> > > Thanks for looking at this patch Bjorn.
> > >
> > > To take your three points:
> > >
> > > 1. Certainly I can see those devices able to be passed back and forth
> > > between host and guest multiple times, once this patch is applied.
> > >
> > > 2. I don't know the answer to that question; would appreciate guidance
> > > on how to determine this. Do you mean perhaps some buffered data in
> > > the USB controller, for instance?
> > >
> > > 3. I have not tried an additional delay. This is the logs I see when
> > > the error is occurring:
> > >
> > > [ 2423.556570] vfio-pci 0000:0c:00.3: not ready 1023ms after FLR; waiting
> > > [ 2425.604526] vfio-pci 0000:0c:00.3: not ready 2047ms after FLR; waiting
> > > [ 2428.804509] vfio-pci 0000:0c:00.3: not ready 4095ms after FLR; waiting
> > > [ 2433.924409] vfio-pci 0000:0c:00.3: not ready 8191ms after FLR; waiting
> > > [ 2443.140721] vfio-pci 0000:0c:00.3: not ready 16383ms after FLR; waiting
> > > [ 2461.571944] vfio-pci 0000:0c:00.3: not ready 32767ms after FLR; waiting
> > > [ 2496.387544] vfio-pci 0000:0c:00.3: not ready 65535ms after FLR; giving up
> > >
> > > What makes this bug especially bad is the host never recovers, and
> > > eventually hangs or crashes.
> > >
> > > For reference, the delay example you're talking about is:
> > >
> > > static int delay_250ms_after_flr(struct pci_dev *dev, int probe)
> > > {
> > > if (!pcie_has_flr(dev))
> > > return -ENOTTY;
> > >
> > > if (probe)
> > > return 0;
> > >
> > > pcie_flr(dev);
> > >
> > > msleep(250);
> > >
> > > return 0;
> > > }
> > >
> > > I don't know if it would work, but I will try it out and report back.
> > >
> > > Marcos
> > >
> > >
> >
> > Bjorn/Alex
> >
> > I have just tried the alternate approach of adding a 250ms delay to
> > the function level reset - this unfortunately results in the same
> > broken behaviour, with the host itself never recovering.
> >
> > [   76.905410] vfio-pci 0000:0d:00.3: not ready 1023ms after FLR; waiting
> > [   79.018014] vfio-pci 0000:0d:00.3: not ready 2047ms after FLR; waiting
> > [   82.089390] vfio-pci 0000:0d:00.3: not ready 4095ms after FLR; waiting
> > [   87.209416] vfio-pci 0000:0d:00.3: not ready 8191ms after FLR; waiting
> > [   96.425440] vfio-pci 0000:0d:00.3: not ready 16383ms after FLR; waiting
> > [  114.615491] vfio-pci 0000:0d:00.3: not ready 32767ms after FLR; waiting
> > [  149.417712] vfio-pci 0000:0d:00.3: not ready 65535ms after FLR; giving up
> >
> > I also tried a full second, to no avail.
> >
> > What would be the next step in proceeding with the original patch please?
>
> Implementation of FLR is "strongly recommended" by the spec but is
> optional.  So I don't see a problem with just avoiding it via your
> patch.
>
> I applied it to pci/virtualization for v5.8, thanks!
>
> Bjorn

Hi Bjorn

Wonderful news! Thanks again for your help.

Marcos

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-05-21  7:06 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-16 13:37 [PATCH] PCI: Avoid FLR for AMD Matisse HD Audio and USB Controllers Marcos Scriven
2020-05-16 15:19 ` Marcos Scriven
2020-05-18 16:17 ` Bjorn Helgaas
2020-05-18 19:26   ` Marcos Scriven
2020-05-20  9:41     ` Marcos Scriven
2020-05-20 23:29       ` Bjorn Helgaas
2020-05-21  7:05         ` Marcos Scriven

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.