linux-kernel.vger.kernel.org archive mirror
* Re: [PATCH] pci: Add a few new IDs for Intel GPU "spurious interrupt" quirk
       [not found] <1537974841-29928-1-git-send-email-bmeng.cn@gmail.com>
@ 2018-09-26 16:57 ` Bjorn Helgaas
  2018-09-27  2:10   ` Bin Meng
  0 siblings, 1 reply; 11+ messages in thread
From: Bjorn Helgaas @ 2018-09-26 16:57 UTC
  To: Bin Meng
  Cc: Bjorn Helgaas, linux-pci, Thomas Jarosch, stable, Jani Nikula,
	Joonas Lahtinen, Rodrigo Vivi, intel-gfx, dri-devel,
	linux-kernel

[+cc Intel DRM maintainers, etc]

On Wed, Sep 26, 2018 at 08:14:01AM -0700, Bin Meng wrote:
> Add more PCI IDs to the Intel GPU "spurious interrupt" quirk table,
> which are known to break.

Do you have a reference for this?  Any public bug reports, bugzilla,
Intel spec reference or errata?  "Which are known to break" is pretty
vague.

> See commit f67fd55fa96f ("PCI: Add quirk for still enabled interrupts
> on Intel Sandy Bridge GPUs"), and commit 7c82126a94e6 ("PCI: Add new
> ID for Intel GPU "spurious interrupt" quirk") for some history.
> 
> Based on current findings, it is highly possible that all Intel
> 1st/2nd/3rd generation Core processors' IGD has such quirk.

Can you include a reference to these "current findings"?  I assume you
have bug reports that include the device IDs you're adding?  If not,
how did you build this list of new IDs?

The function comment added by f67fd55fa96f ("PCI: Add quirk for still
enabled interrupts on Intel Sandy Bridge GPUs") suggests that this is
actually a BIOS issue, not a hardware erratum, i.e., I don't see
anything there that suggests a hardware defect.

But there must be a hole somewhere -- the kernel can't be expected to
disable interrupts in device-specific ways when there's no driver
loaded.  Maybe it's simply a BIOS defect or maybe there's some
interrupt or _PRT-related setup we're missing.

> Signed-off-by: Bin Meng <bmeng.cn@gmail.com>
> Cc: <stable@vger.kernel.org> # v3.4+
> ---
> 
>  drivers/pci/quirks.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 6bc27b7..c0673a7 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3190,7 +3190,11 @@ static void disable_igfx_irq(struct pci_dev *dev)
>  
>  	pci_iounmap(dev, regs);
>  }
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0042, disable_igfx_irq);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0046, disable_igfx_irq);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x004a, disable_igfx_irq);
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0102, disable_igfx_irq);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0106, disable_igfx_irq);
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x010a, disable_igfx_irq);
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0152, disable_igfx_irq);
>  
> -- 
> 2.7.4
> 

* Re: [PATCH] pci: Add a few new IDs for Intel GPU "spurious interrupt" quirk
  2018-09-26 16:57 ` [PATCH] pci: Add a few new IDs for Intel GPU "spurious interrupt" quirk Bjorn Helgaas
@ 2018-09-27  2:10   ` Bin Meng
  2018-10-03 20:12     ` Bjorn Helgaas
  0 siblings, 1 reply; 11+ messages in thread
From: Bin Meng @ 2018-09-27  2:10 UTC
  To: helgaas
  Cc: Bjorn Helgaas, linux-pci, Thomas Jarosch, stable, jani.nikula,
	joonas.lahtinen, rodrigo.vivi, intel-gfx, dri-devel,
	linux-kernel

Hi Bjorn,

On Thu, Sep 27, 2018 at 12:57 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> [+cc Intel DRM maintainers, etc]
>
> On Wed, Sep 26, 2018 at 08:14:01AM -0700, Bin Meng wrote:
> > Add more PCI IDs to the Intel GPU "spurious interrupt" quirk table,
> > which are known to break.
>
> Do you have a reference for this?  Any public bug reports, bugzilla,
> Intel spec reference or errata?  "Which are known to break" is pretty
> vague.
>

Sorry, I used the wrong words and should have been clearer. These
devices have been verified to be broken. The test I used is very
simple: just unplug the VGA cable and plug it in again, and "spurious
interrupt" will be seen on the interrupt line of the IGD device. I am
not aware of any public bugs filed with Intel, nor have I seen any
errata from Intel.

> > See commit f67fd55fa96f ("PCI: Add quirk for still enabled interrupts
> > on Intel Sandy Bridge GPUs"), and commit 7c82126a94e6 ("PCI: Add new
> > ID for Intel GPU "spurious interrupt" quirk") for some history.
> >
> > Based on current findings, it is highly possible that all Intel
> > 1st/2nd/3rd generation Core processors' IGD has such quirk.
>
> Can you include a reference to these "current findings"?  I assume you
> have bug reports that include the device IDs you're adding?  If not,
> how did you build this list of new IDs?
>

By "current findings" I mean that, given the IDs we have here plus the
previous one added by Thomas, it's highly possible this VGA BIOS bug
exists in every 1st/2nd/3rd generation Core processor.

> The function comment added by f67fd55fa96f ("PCI: Add quirk for still
> enabled interrupts on Intel Sandy Bridge GPUs") suggests that this is
> actually a BIOS issue, not a hardware erratum, i.e., I don't see
> anything there that suggests a hardware defect.
>
> But there must be a hole somewhere -- the kernel can't be expected to
> disable interrupts in device-specific ways when there's no driver
> loaded.  Maybe it's simply a BIOS defect or maybe there's some
> interrupt or _PRT-related setup we're missing.
>

It's purely a VGA BIOS bug, not a system BIOS bug or a _PRT issue. The
VGA BIOS forgot to turn off the interrupt on these devices.

> > Signed-off-by: Bin Meng <bmeng.cn@gmail.com>
> > Cc: <stable@vger.kernel.org> # v3.4+
> > ---
> >
> >  drivers/pci/quirks.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index 6bc27b7..c0673a7 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -3190,7 +3190,11 @@ static void disable_igfx_irq(struct pci_dev *dev)
> >
> >       pci_iounmap(dev, regs);
> >  }
> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0042, disable_igfx_irq);
> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0046, disable_igfx_irq);
> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x004a, disable_igfx_irq);
> >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0102, disable_igfx_irq);
> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0106, disable_igfx_irq);
> >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x010a, disable_igfx_irq);
> >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0152, disable_igfx_irq);
> >
> > --

Regards,
Bin

* Re: [PATCH] pci: Add a few new IDs for Intel GPU "spurious interrupt" quirk
  2018-09-27  2:10   ` Bin Meng
@ 2018-10-03 20:12     ` Bjorn Helgaas
  2018-10-08  9:44       ` Bin Meng
  0 siblings, 1 reply; 11+ messages in thread
From: Bjorn Helgaas @ 2018-10-03 20:12 UTC
  To: Bin Meng
  Cc: Bjorn Helgaas, linux-pci, Thomas Jarosch, stable, jani.nikula,
	joonas.lahtinen, rodrigo.vivi, intel-gfx, dri-devel,
	linux-kernel

On Thu, Sep 27, 2018 at 10:10:07AM +0800, Bin Meng wrote:
> On Thu, Sep 27, 2018 at 12:57 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Wed, Sep 26, 2018 at 08:14:01AM -0700, Bin Meng wrote:
> > > Add more PCI IDs to the Intel GPU "spurious interrupt" quirk table,
> > > which are known to break.
> >
> > Do you have a reference for this?  Any public bug reports, bugzilla,
> > Intel spec reference or errata?  "Which are known to break" is pretty
> > vague.
> 
> Sorry I used wrong words and should have been clearer. These devices
> are validated to be broken. The test I used is very simple, just
> unplug the VGA cable and plug it again, and "spurious interrupt" will
> be seen on the interrupt line of the IGD device. I was not aware of
> any public bugs filed to Intel, nor seen any errata from Intel.

The original commit, f67fd55fa96f ("PCI: Add quirk for still enabled
interrupts on Intel Sandy Bridge GPUs"), says some systems "crash"
(not sure if that means an oops or an actual crash that requires a
reboot) and on other systems, Linux disables the shared interrupt
line.  I assume disabling the interrupt line keeps devices using that
line from working, but does not directly cause a crash.

What specific symptom do you see here?  I think it might be useful to
collect details, e.g., dmesg logs, /proc/interrupts contents, output
of "sudo lspci -vv", etc., for the systems you're quirking here.  I'm
hoping we can eventually figure out a solution that doesn't require a
quirk for every new GPU, and maybe that info will help find it.

> > > See commit f67fd55fa96f ("PCI: Add quirk for still enabled interrupts
> > > on Intel Sandy Bridge GPUs"), and commit 7c82126a94e6 ("PCI: Add new
> > > ID for Intel GPU "spurious interrupt" quirk") for some history.
> > >
> > > Based on current findings, it is highly possible that all Intel
> > > 1st/2nd/3rd generation Core processors' IGD has such quirk.
> >
> > Can you include a reference to these "current findings"?  I assume you
> > have bug reports that include the device IDs you're adding?  If not,
> > how did you build this list of new IDs?
> 
> By "current findings" I mean given the IDs we have here, plus previous
> one added by Thomas, it's highly possible this VGA BIOS bug exists in
> every 1st/2nd/3rd generation Core processors.
> 
> > The function comment added by f67fd55fa96f ("PCI: Add quirk for still
> > enabled interrupts on Intel Sandy Bridge GPUs") suggests that this is
> > actually a BIOS issue, not a hardware erratum, i.e., I don't see
> > anything there that suggests a hardware defect.
> >
> > But there must be a hole somewhere -- the kernel can't be expected to
> > disable interrupts in device-specific ways when there's no driver
> > loaded.  Maybe it's simply a BIOS defect or maybe there's some
> > interrupt or _PRT-related setup we're missing.
> 
> It's a pure VGA BIOS bug, not the BIOS bug or _PRT etc. The VGA BIOS
> forgot to turn off the interrupt on these devices.

If this is a VGA BIOS defect, it's not very likely that it will
magically be fixed for all new Intel GPUs, so in effect it sounds like
we need to update this list of quirks in Linux every time a new Intel
GPU comes out.  That prospect is a little daunting.

Do you happen to know if Windows has the same problem?  I.e., if you
boot an old version of Windows with a new GPU, and unplug the VGA
cable, does Windows crash?  If Windows can figure out how to handle
that situation gracefully, Linux should be able to do it, too.

> > > Signed-off-by: Bin Meng <bmeng.cn@gmail.com>
> > > Cc: <stable@vger.kernel.org> # v3.4+
> > > ---
> > >
> > >  drivers/pci/quirks.c | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > >
> > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > > index 6bc27b7..c0673a7 100644
> > > --- a/drivers/pci/quirks.c
> > > +++ b/drivers/pci/quirks.c
> > > @@ -3190,7 +3190,11 @@ static void disable_igfx_irq(struct pci_dev *dev)
> > >
> > >       pci_iounmap(dev, regs);
> > >  }
> > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0042, disable_igfx_irq);
> > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0046, disable_igfx_irq);
> > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x004a, disable_igfx_irq);
> > >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0102, disable_igfx_irq);
> > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0106, disable_igfx_irq);
> > >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x010a, disable_igfx_irq);
> > >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0152, disable_igfx_irq);
> > >
> > > --
> 
> Regards,
> Bin

* Re: [PATCH] pci: Add a few new IDs for Intel GPU "spurious interrupt" quirk
  2018-10-03 20:12     ` Bjorn Helgaas
@ 2018-10-08  9:44       ` Bin Meng
  2018-10-08 10:06         ` David Laight
  2018-10-09 17:01         ` Bjorn Helgaas
  0 siblings, 2 replies; 11+ messages in thread
From: Bin Meng @ 2018-10-08  9:44 UTC
  To: helgaas
  Cc: Bjorn Helgaas, linux-pci, Thomas Jarosch, stable, jani.nikula,
	joonas.lahtinen, rodrigo.vivi, intel-gfx, dri-devel,
	linux-kernel

Hi Bjorn,

On Thu, Oct 4, 2018 at 4:12 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Thu, Sep 27, 2018 at 10:10:07AM +0800, Bin Meng wrote:
> > On Thu, Sep 27, 2018 at 12:57 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Wed, Sep 26, 2018 at 08:14:01AM -0700, Bin Meng wrote:
> > > > Add more PCI IDs to the Intel GPU "spurious interrupt" quirk table,
> > > > which are known to break.
> > >
> > > Do you have a reference for this?  Any public bug reports, bugzilla,
> > > Intel spec reference or errata?  "Which are known to break" is pretty
> > > vague.
> >
> > Sorry I used wrong words and should have been clearer. These devices
> > are validated to be broken. The test I used is very simple, just
> > unplug the VGA cable and plug it again, and "spurious interrupt" will
> > be seen on the interrupt line of the IGD device. I was not aware of
> > any public bugs filed to Intel, nor seen any errata from Intel.
>
> The original commit, f67fd55fa96f ("PCI: Add quirk for still enabled
> interrupts on Intel Sandy Bridge GPUs"), says some systems "crash"
> (not sure if that means an oops or an actual crash that requires a
> reboot) and on other systems, Linux disables the shared interrupt
> line.  I assume disabling the interrupt line keeps devices using that
> line from working, but does not directly cause a crash.
>

Correct, disabling the shared interrupt line keeps all devices using
that line from working, which is the current kernel's behavior without
this quirk handling: it disables the (shared) interrupt line after
100,000+ generated interrupts. The side effect is that other devices
become unusable after that (e.g., USB devices which share the same
interrupt line with the Intel GPU). That's why the original commit,
f67fd55fa96f ("PCI: Add quirk for still enabled interrupts on Intel
Sandy Bridge GPUs"), disables the GPU's interrupt directly, which
should really be done by the VGA BIOS itself (a buggy VBIOS!).

> What specific symptom do you see here?  I think it might be useful to
> collect details, e.g., dmesg logs, /proc/interrupts contents, output
> of "sudo lspci -vv", etc., for the systems you're quirking here.  I'm
> hoping we can eventually figure out a solution that doesn't require a
> quirk for every new GPU, and maybe that info will help find it.
>

The symptom was described briefly in the original commit f67fd55fa96f
too: the kernel disables the (shared) interrupt line after 100,000+
generated interrupts (this can be observed via /proc/interrupts).
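
The accounting behind that symptom can be sketched roughly as follows.
This is a user-space illustration, not the kernel's code: the struct,
the function name note_irq() and the exact thresholds (a window of
100,000 interrupts, of which more than 99,900 were claimed by no
handler) are assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed thresholds, modeled on the behavior described above. */
#define IRQ_WINDOW      100000u  /* interrupts per evaluation window */
#define UNHANDLED_LIMIT  99900u  /* more than this many unclaimed => give up */

/* Hypothetical per-line bookkeeping. */
struct irq_line {
    uint32_t count;      /* interrupts seen in the current window */
    uint32_t unhandled;  /* of those, how many no handler claimed */
    bool disabled;       /* set once the line has been shut off */
};

/* Account for one interrupt; returns true if the line is now disabled. */
static bool note_irq(struct irq_line *line, bool handled)
{
    if (line->disabled)
        return true;

    if (!handled)
        line->unhandled++;

    if (++line->count < IRQ_WINDOW)
        return false;

    /* Window complete: decide, then start a fresh window. */
    if (line->unhandled > UNHANDLED_LIMIT)
        line->disabled = true;

    line->count = 0;
    line->unhandled = 0;
    return line->disabled;
}
```

With no driver bound to the GPU, every interrupt it raises counts as
unhandled, so the shared line is shut off at the end of the first
window, and any other device on the same line stops working with it.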

> > > > See commit f67fd55fa96f ("PCI: Add quirk for still enabled interrupts
> > > > on Intel Sandy Bridge GPUs"), and commit 7c82126a94e6 ("PCI: Add new
> > > > ID for Intel GPU "spurious interrupt" quirk") for some history.
> > > >
> > > > Based on current findings, it is highly possible that all Intel
> > > > 1st/2nd/3rd generation Core processors' IGD has such quirk.
> > >
> > > Can you include a reference to these "current findings"?  I assume you
> > > have bug reports that include the device IDs you're adding?  If not,
> > > how did you build this list of new IDs?
> >
> > By "current findings" I mean given the IDs we have here, plus previous
> > one added by Thomas, it's highly possible this VGA BIOS bug exists in
> > every 1st/2nd/3rd generation Core processors.
> >
> > > The function comment added by f67fd55fa96f ("PCI: Add quirk for still
> > > enabled interrupts on Intel Sandy Bridge GPUs") suggests that this is
> > > actually a BIOS issue, not a hardware erratum, i.e., I don't see
> > > anything there that suggests a hardware defect.
> > >
> > > But there must be a hole somewhere -- the kernel can't be expected to
> > > disable interrupts in device-specific ways when there's no driver
> > > loaded.  Maybe it's simply a BIOS defect or maybe there's some
> > > interrupt or _PRT-related setup we're missing.
> >
> > It's a pure VGA BIOS bug, not the BIOS bug or _PRT etc. The VGA BIOS
> > forgot to turn off the interrupt on these devices.
>
> If this is a VGA BIOS defect, it's not very likely that it will
> magically be fixed for all new Intel GPUs, so in effect it sounds like
> we need to update this list of quirks in Linux every time a new Intel
> GPU comes out.  That prospect is a little daunting.
>

I don't have a newer Intel board at hand for testing right now. I can
try to locate one. But as I said, it's highly possible that at least
all 1st/2nd/3rd generation Core processors are affected. Maybe we can
add all the known GPU devices of 1st/2nd/3rd generation Core
processors together for now? For newer GPUs, let's wait until someone
reports the issue again?

> Do you happen to know if Windows has the same problem?  I.e., if you
> boot an old version of Windows with a new GPU, and unplug the VGA
> cable, does Windows crash?  If Windows can figure out how to handle
> that situation gracefully, Linux should be able to do it, too.
>

I suspect Windows cannot handle it either. Without GPU awareness, the
interrupt line is simply left on, no driver claims the device, and
that will cause issues. I can test this.

> > > > Signed-off-by: Bin Meng <bmeng.cn@gmail.com>
> > > > Cc: <stable@vger.kernel.org> # v3.4+
> > > > ---
> > > >
> > > >  drivers/pci/quirks.c | 4 ++++
> > > >  1 file changed, 4 insertions(+)
> > > >
> > > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > > > index 6bc27b7..c0673a7 100644
> > > > --- a/drivers/pci/quirks.c
> > > > +++ b/drivers/pci/quirks.c
> > > > @@ -3190,7 +3190,11 @@ static void disable_igfx_irq(struct pci_dev *dev)
> > > >
> > > >       pci_iounmap(dev, regs);
> > > >  }
> > > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0042, disable_igfx_irq);
> > > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0046, disable_igfx_irq);
> > > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x004a, disable_igfx_irq);
> > > >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0102, disable_igfx_irq);
> > > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0106, disable_igfx_irq);
> > > >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x010a, disable_igfx_irq);
> > > >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0152, disable_igfx_irq);
> > > >
> > > > --

Regards,
Bin

* RE: [PATCH] pci: Add a few new IDs for Intel GPU "spurious interrupt" quirk
  2018-10-08  9:44       ` Bin Meng
@ 2018-10-08 10:06         ` David Laight
  2018-10-08 12:34           ` Bin Meng
  2018-10-09 17:01         ` Bjorn Helgaas
  1 sibling, 1 reply; 11+ messages in thread
From: David Laight @ 2018-10-08 10:06 UTC
  To: 'Bin Meng', helgaas
  Cc: Bjorn Helgaas, linux-pci, Thomas Jarosch, stable, jani.nikula,
	joonas.lahtinen, rodrigo.vivi, intel-gfx, dri-devel,
	linux-kernel

From: Bin Meng
> Sent: 08 October 2018 10:44
...
> Correct, disable the shared interrupt line keeps all devices using
> that line from working, which is current kernel's behavior w/o this
> quirk handling: it disables the (shared) interrupt line after 100.000+
> generated interrupts. But the side effect is that other devices become
> unusable after that (eg: USB devices which share the same interrupt
> line with the Intel GPU). That's why the original commit, f67fd55fa96f
> ("PCI: Add quirk for still enabled interrupts on Intel Sandy Bridge
> GPUs") disables the GPU's interrupt directly, which should really be
> done by the VGA BIOS itself (a buggy VBIOS!).

Shouldn't the kernel just disable all PCI(e) interrupts by writing
1 to the config space control register bit during grope?
Can it ever be right for this to be set?

Apart from VGA the 'bus master' bit also needs to be clear.

ISTR some very early PCI systems which failed to reset the PCI
bus during reboot - at least the 'bus master' bit remained
set for an ethernet card.
On a private LAN the OS got reinstalled and rebooted without
using all the ethernet receive buffers and then died because
a receive frame got written into 'random' memory.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

* Re: [PATCH] pci: Add a few new IDs for Intel GPU "spurious interrupt" quirk
  2018-10-08 10:06         ` David Laight
@ 2018-10-08 12:34           ` Bin Meng
  2018-10-08 13:27             ` David Laight
  0 siblings, 1 reply; 11+ messages in thread
From: Bin Meng @ 2018-10-08 12:34 UTC
  To: David.Laight
  Cc: helgaas, Bjorn Helgaas, linux-pci, Thomas Jarosch, stable,
	jani.nikula, joonas.lahtinen, rodrigo.vivi, intel-gfx, dri-devel,
	linux-kernel

Hi David,

On Mon, Oct 8, 2018 at 6:06 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Bin Meng
> > Sent: 08 October 2018 10:44
> ...
> > Correct, disable the shared interrupt line keeps all devices using
> > that line from working, which is current kernel's behavior w/o this
> > quirk handling: it disables the (shared) interrupt line after 100.000+
> > generated interrupts. But the side effect is that other devices become
> > unusable after that (eg: USB devices which share the same interrupt
> > line with the Intel GPU). That's why the original commit, f67fd55fa96f
> > ("PCI: Add quirk for still enabled interrupts on Intel Sandy Bridge
> > GPUs") disables the GPU's interrupt directly, which should really be
> > done by the VGA BIOS itself (a buggy VBIOS!).
>
> Shouldn't the kernel just disable all PCI(e) interrupts by writing
> 1 to the config space control register bit during grope?
> Can it ever by right for this to be set?
>

Do you mean the PCI_COMMAND_INTX_DISABLE bit of the command register
in the configuration space? Setting this bit could indeed disable the
INTx interrupt, but it does not work for all PCI devices, as this bit
was only introduced in PCI spec v2.3.
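
As a minimal sketch of what setting that bit amounts to: the command
register sits at offset 0x04 of configuration space and INTx Disable is
bit 10 (0x400). The helper name intx_set() is hypothetical; it only
computes the new register value and leaves the actual config-space
read and write to the caller.

```c
#include <stdint.h>

#define PCI_COMMAND               0x04   /* command register offset */
#define PCI_COMMAND_INTX_DISABLE  0x400  /* bit 10: INTx emulation disable */

/*
 * Hypothetical helper: given the current command register value,
 * return the value to write back so that INTx is enabled or disabled.
 * All other bits (I/O, memory, bus master, ...) are left untouched.
 */
static uint16_t intx_set(uint16_t command, int enable)
{
    if (enable)
        return (uint16_t)(command & ~PCI_COMMAND_INTX_DISABLE);
    return (uint16_t)(command | PCI_COMMAND_INTX_DISABLE);
}
```

Because it is a 'disable' bit, a device that predates v2.3 and ignores
the write simply keeps asserting its interrupt line, which is the
limitation described above.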

> Apart from VGA the 'bus master' bit also needs to be clear.
>
> ISTR some very early PCI systems which failed to reset the PCI
> bus during reboot - at least the 'bus master' bit remained
> set for an ethernet card.
> On a private LAN the OS got reinstalled and rebooted without
> using all the ethernet receive buffers and then died because
> a receive frame got written into 'random' memory.

Regards,
Bin

* RE: [PATCH] pci: Add a few new IDs for Intel GPU "spurious interrupt" quirk
  2018-10-08 12:34           ` Bin Meng
@ 2018-10-08 13:27             ` David Laight
  0 siblings, 0 replies; 11+ messages in thread
From: David Laight @ 2018-10-08 13:27 UTC
  To: 'Bin Meng'
  Cc: helgaas, Bjorn Helgaas, linux-pci, Thomas Jarosch, stable,
	jani.nikula, joonas.lahtinen, rodrigo.vivi, intel-gfx, dri-devel,
	linux-kernel

From: Bin Meng
> Sent: 08 October 2018 13:34
> Hi David,
> 
> On Mon, Oct 8, 2018 at 6:06 PM David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Bin Meng
> > > Sent: 08 October 2018 10:44
> > ...
> > > Correct, disable the shared interrupt line keeps all devices using
> > > that line from working, which is current kernel's behavior w/o this
> > > quirk handling: it disables the (shared) interrupt line after 100.000+
> > > generated interrupts. But the side effect is that other devices become
> > > unusable after that (eg: USB devices which share the same interrupt
> > > line with the Intel GPU). That's why the original commit, f67fd55fa96f
> > > ("PCI: Add quirk for still enabled interrupts on Intel Sandy Bridge
> > > GPUs") disables the GPU's interrupt directly, which should really be
> > > done by the VGA BIOS itself (a buggy VBIOS!).
> >
> > Shouldn't the kernel just disable all PCI(e) interrupts by writing
> > 1 to the config space control register bit during grope?
> > Can it ever by right for this to be set?
> >
> 
> Do you mean PCI_COMMAND_INTX_DISABLE bit of the command register in
> the configuration space? Setting this bit indeed could disable the
> INTx interrupt, but it does not work for all PCI devices as this bit
> was introduced in PCI spec v2.3.

That's the one I was thinking of.
If it was introduced in v2.3 it explains why it is a 'disable' bit.

The v2.2 spec I just found doesn't seem to say anything about the
'reserved' bits. I guess the values are ignored (and probably read
back as zeros).

In any case it should be implemented by the VGA devices in question.
I guess the kernel should also ensure that MSI and MSI-X interrupts
are also all disabled.
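
If reserved bits really do read back as zero on older devices, one way
to find out whether a given device implements the bit is to toggle it
and read it back. The sketch below models that against a made-up
device structure; struct fake_dev, cmd_write() and
intx_disable_supported() are hypothetical names for illustration, and
a real check would go through config-space accessors instead.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_COMMAND_INTX_DISABLE 0x400  /* bit 10 of the command register */

/* Hypothetical model of a device's command register. */
struct fake_dev {
    uint16_t command;        /* current register contents */
    uint16_t writable_mask;  /* bits the device actually implements */
};

/* Writes to unimplemented bits are dropped, so they read back as zero. */
static void cmd_write(struct fake_dev *d, uint16_t value)
{
    d->command = value & d->writable_mask;
}

/* Toggle the INTx Disable bit, see whether it stuck, then restore. */
static bool intx_disable_supported(struct fake_dev *d)
{
    uint16_t orig = d->command;
    bool toggled;

    cmd_write(d, (uint16_t)(orig ^ PCI_COMMAND_INTX_DISABLE));
    toggled = ((d->command ^ orig) & PCI_COMMAND_INTX_DISABLE) != 0;
    cmd_write(d, orig);

    return toggled;
}
```

A probe like this only covers INTx; MSI and MSI-X are controlled
through their own capability structures and would need separate
handling.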

	David


* Re: [PATCH] pci: Add a few new IDs for Intel GPU "spurious interrupt" quirk
  2018-10-08  9:44       ` Bin Meng
  2018-10-08 10:06         ` David Laight
@ 2018-10-09 17:01         ` Bjorn Helgaas
  2018-10-10  8:00           ` Thomas Jarosch
  2018-10-11  7:11           ` Bin Meng
  1 sibling, 2 replies; 11+ messages in thread
From: Bjorn Helgaas @ 2018-10-09 17:01 UTC
  To: Bin Meng
  Cc: Bjorn Helgaas, linux-pci, Thomas Jarosch, stable, jani.nikula,
	joonas.lahtinen, rodrigo.vivi, intel-gfx, dri-devel,
	linux-kernel

On Mon, Oct 08, 2018 at 05:44:08PM +0800, Bin Meng wrote:
> On Thu, Oct 4, 2018 at 4:12 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Thu, Sep 27, 2018 at 10:10:07AM +0800, Bin Meng wrote:
> > > On Thu, Sep 27, 2018 at 12:57 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > On Wed, Sep 26, 2018 at 08:14:01AM -0700, Bin Meng wrote:
> > > > > Add more PCI IDs to the Intel GPU "spurious interrupt" quirk table,
> > > > > which are known to break.
> > > >
> > > > Do you have a reference for this?  Any public bug reports, bugzilla,
> > > > Intel spec reference or errata?  "Which are known to break" is pretty
> > > > vague.
> > >
> > > Sorry I used wrong words and should have been clearer. These devices
> > > are validated to be broken. The test I used is very simple, just
> > > unplug the VGA cable and plug it again, and "spurious interrupt" will
> > > be seen on the interrupt line of the IGD device. I was not aware of
> > > any public bugs filed to Intel, nor seen any errata from Intel.
> >
> > The original commit, f67fd55fa96f ("PCI: Add quirk for still enabled
> > interrupts on Intel Sandy Bridge GPUs"), says some systems "crash"
> > (not sure if that means an oops or an actual crash that requires a
> > reboot) and on other systems, Linux disables the shared interrupt
> > line.  I assume disabling the interrupt line keeps devices using that
> > line from working, but does not directly cause a crash.
> >
> 
> Correct, disable the shared interrupt line keeps all devices using
> that line from working, which is current kernel's behavior w/o this
> quirk handling: it disables the (shared) interrupt line after 100.000+
> generated interrupts. But the side effect is that other devices become
> unusable after that (eg: USB devices which share the same interrupt
> line with the Intel GPU). That's why the original commit, f67fd55fa96f
> ("PCI: Add quirk for still enabled interrupts on Intel Sandy Bridge
> GPUs") disables the GPU's interrupt directly, which should really be
> done by the VGA BIOS itself (a buggy VBIOS!).
> 
> > What specific symptom do you see here?  I think it might be useful to
> > collect details, e.g., dmesg logs, /proc/interrupts contents, output
> > of "sudo lspci -vv", etc., for the systems you're quirking here.  I'm
> > hoping we can eventually figure out a solution that doesn't require a
> > quirk for every new GPU, and maybe that info will help find it.
> 
> The symptom was described briefly in the original commit f67fd55fa96f
> too, that disables the (shared) interrupt line after 100.000+
> generated interrupts (can be observed via /proc/interrupts).
> 
> > > > > See commit f67fd55fa96f ("PCI: Add quirk for still enabled interrupts
> > > > > on Intel Sandy Bridge GPUs"), and commit 7c82126a94e6 ("PCI: Add new
> > > > > ID for Intel GPU "spurious interrupt" quirk") for some history.
> > > > >
> > > > > Based on current findings, it is highly possible that all Intel
> > > > > 1st/2nd/3rd generation Core processors' IGD has such quirk.
> > > >
> > > > Can you include a reference to these "current findings"?  I assume you
> > > > have bug reports that include the device IDs you're adding?  If not,
> > > > how did you build this list of new IDs?
> > >
> > > By "current findings" I mean given the IDs we have here, plus previous
> > > one added by Thomas, it's highly possible this VGA BIOS bug exists in
> > > every 1st/2nd/3rd generation Core processors.
> > >
> > > > The function comment added by f67fd55fa96f ("PCI: Add quirk for still
> > > > enabled interrupts on Intel Sandy Bridge GPUs") suggests that this is
> > > > actually a BIOS issue, not a hardware erratum, i.e., I don't see
> > > > anything there that suggests a hardware defect.
> > > >
> > > > But there must be a hole somewhere -- the kernel can't be expected to
> > > > disable interrupts in device-specific ways when there's no driver
> > > > loaded.  Maybe it's simply a BIOS defect or maybe there's some
> > > > interrupt or _PRT-related setup we're missing.
> > >
> > > It's a pure VGA BIOS bug, not the BIOS bug or _PRT etc. The VGA BIOS
> > > forgot to turn off the interrupt on these devices.
> >
> > If this is a VGA BIOS defect, it's not very likely that it will
> > magically be fixed for all new Intel GPUs, so in effect it sounds like
> > we need to update this list of quirks in Linux every time a new Intel
> > GPU comes out.  That prospect is a little daunting.
> 
> I don't have a relatively newer Intel board at hand for testing right
> now. I can try to locate one. But as I said, it's highly possible at
> least all 1st/2nd/3rd generation Core processors are affected.

> Maybe
> we can add all these known GPU devices of  1st/2nd/3rd generation Core
> processors all together for now? For newer GPUs, let's wait until
> someone reports the issue again?

This is exactly my point: we don't want to have to wait for somebody
to report an issue for every new GPU.  That (a) is a maintenance
headache and, more importantly, (b) prevents an old kernel from
running on new hardware.  (b) is important to distros because nobody
wants to qualify and release a new kernel just to add a new device ID.

Bottom line is that I think I'm going to have to apply this patch, but
I want to get off this train in the future, so now is the time to find
a better solution.

> > Do you happen to know if Windows has the same problem?  I.e., if you
> > boot an old version of Windows with a new GPU, and unplug the VGA
> > cable, does Windows crash?  If Windows can figure out how to handle
> > that situation gracefully, Linux should be able to do it, too.
> 
> I suspect Windows cannot handle it too. Without the GPU awareness, the
> interrupt line is simply on and no driver claims the devices and will
> cause issues. I can test this.

If you could test this, that would be great.  I would be quite
surprised if Windows crashed when you unplug the VGA cable.

What I'm wondering is if there's some different way we could manage
the IOAPICs or maybe disable interrupts at the PCI device level as
David suggests.  If something like that could be done we wouldn't need
quirks for every new device.

It's possible we could learn something by running Windows on qemu and
tracing its PCI config accesses to see whether it sets the
PCI_COMMAND_INTX_DISABLE bit or something.

> > > > > Signed-off-by: Bin Meng <bmeng.cn@gmail.com>
> > > > > Cc: <stable@vger.kernel.org> # v3.4+
> > > > > ---
> > > > >
> > > > >  drivers/pci/quirks.c | 4 ++++
> > > > >  1 file changed, 4 insertions(+)
> > > > >
> > > > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > > > > index 6bc27b7..c0673a7 100644
> > > > > --- a/drivers/pci/quirks.c
> > > > > +++ b/drivers/pci/quirks.c
> > > > > @@ -3190,7 +3190,11 @@ static void disable_igfx_irq(struct pci_dev *dev)
> > > > >
> > > > >       pci_iounmap(dev, regs);
> > > > >  }
> > > > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0042, disable_igfx_irq);
> > > > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0046, disable_igfx_irq);
> > > > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x004a, disable_igfx_irq);
> > > > >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0102, disable_igfx_irq);
> > > > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0106, disable_igfx_irq);
> > > > >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x010a, disable_igfx_irq);
> > > > >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0152, disable_igfx_irq);
> > > > >
> > > > > --
> 
> Regards,
> Bin


* Re: [PATCH] pci: Add a few new IDs for Intel GPU "spurious interrupt" quirk
  2018-10-09 17:01         ` Bjorn Helgaas
@ 2018-10-10  8:00           ` Thomas Jarosch
  2018-10-11  7:11           ` Bin Meng
  1 sibling, 0 replies; 11+ messages in thread
From: Thomas Jarosch @ 2018-10-10  8:00 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bin Meng, Bjorn Helgaas, linux-pci, stable, jani.nikula,
	joonas.lahtinen, rodrigo.vivi, intel-gfx, dri-devel,
	linux-kernel

Hello all,

On Tuesday, 9 October 2018 19:01:58 CEST Bjorn Helgaas wrote:
> > > Do you happen to know if Windows has the same problem?  I.e., if you
> > > boot an old version of Windows with a new GPU, and unplug the VGA
> > > cable, does Windows crash?  If Windows can figure out how to handle
> > > that situation gracefully, Linux should be able to do it, too.
> > 
> > I suspect Windows cannot handle it too. Without the GPU awareness, the
> > interrupt line is simply on and no driver claims the devices and will
> > cause issues. I can test this.
> 
> If you could test this, that would be great.  I would be quite
> surprised if Windows crashed when you unplug the VGA cable.

[original patch author from 2012 chiming in here]

When doing a test on Windows, I guess it's vital to use a generic VGA / VESA 
driver, since the Intel GPU driver will modify the interrupt configuration 
registers. Also, I don't know whether Windows has logic similar to Linux's 
to disable an interrupt line after XX unhandled interrupts.
Personally I'm unsure if the Windows test is worth the time.
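
For reference, the Linux logic mentioned above can be modeled roughly as
follows. This is a simplified user-space sketch, not kernel code: the
struct and function names are made up for illustration, and the
thresholds (a window of 100000 interrupts, disabling when more than
99900 of them went unhandled) are what I recall from
kernel/irq/spurious.c, so treat them as an assumption. The real code
also takes timing into account, which is left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define IRQ_WINDOW      100000  /* interrupts per evaluation window (assumed) */
#define UNHANDLED_LIMIT  99900  /* disable if more than this went unhandled */

/* Hypothetical per-line bookkeeping. */
struct irq_line {
	unsigned int count;      /* interrupts seen in the current window */
	unsigned int unhandled;  /* of those, how many nobody claimed */
	bool disabled;           /* set once the line has been shut off */
};

/* Account for one interrupt; 'handled' says whether a driver claimed it. */
static void note_interrupt_model(struct irq_line *line, bool handled)
{
	assert(line != NULL);

	if (line->disabled)
		return;

	if (!handled)
		line->unhandled++;

	/* Only evaluate once a full window has been collected. */
	if (++line->count < IRQ_WINDOW)
		return;

	/* Nearly everything unhandled: shut off the (possibly shared) line. */
	if (line->unhandled > UNHANDLED_LIMIT)
		line->disabled = true;

	line->count = 0;
	line->unhandled = 0;
}
```

Because the decision is made per interrupt line, every device sharing
the line with the GPU loses its interrupts once the line is disabled.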

Back in 2012 I tried to report this as a video BIOS bug via the Intel Open 
Source Technology Center. There was not much interest in reporting the issue 
to the BIOS developers. I have private emails from other people with similar 
problems on "Kenosha Pass" based boards. One of them talked to an actual Intel 
BIOS engineer in 2014, but still that didn't get anything fixed.

Regarding a generic fix: The problem is testing on real hardware.
Some Intel GPUs have different configuration registers. The Intel GPU driver 
could be used as a template, but writing a fix without actual hardware to 
verify it is unsafe (it might mess up the system in unexpected ways).

Related ideas from Daniel Vetter in 2014:
https://lists.freedesktop.org/archives/intel-gfx/2014-August/050448.html

Cheers,
Thomas





* Re: [PATCH] pci: Add a few new IDs for Intel GPU "spurious interrupt" quirk
  2018-10-09 17:01         ` Bjorn Helgaas
  2018-10-10  8:00           ` Thomas Jarosch
@ 2018-10-11  7:11           ` Bin Meng
  2018-10-11 16:13             ` Bjorn Helgaas
  1 sibling, 1 reply; 11+ messages in thread
From: Bin Meng @ 2018-10-11  7:11 UTC (permalink / raw)
  To: helgaas
  Cc: Bjorn Helgaas, linux-pci, Thomas Jarosch, stable, jani.nikula,
	joonas.lahtinen, rodrigo.vivi, intel-gfx, dri-devel,
	linux-kernel

Hi Bjorn,

On Wed, Oct 10, 2018 at 1:02 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Mon, Oct 08, 2018 at 05:44:08PM +0800, Bin Meng wrote:
> > On Thu, Oct 4, 2018 at 4:12 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Thu, Sep 27, 2018 at 10:10:07AM +0800, Bin Meng wrote:
> > > > On Thu, Sep 27, 2018 at 12:57 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > On Wed, Sep 26, 2018 at 08:14:01AM -0700, Bin Meng wrote:
> > > > > > Add more PCI IDs to the Intel GPU "spurious interrupt" quirk table,
> > > > > > which are known to break.
> > > > >
> > > > > Do you have a reference for this?  Any public bug reports, bugzilla,
> > > > > Intel spec reference or errata?  "Which are known to break" is pretty
> > > > > vague.
> > > >
> > > > Sorry I used wrong words and should have been clearer. These devices
> > > > are validated to be broken. The test I used is very simple, just
> > > > unplug the VGA cable and plug it again, and "spurious interrupt" will
> > > > be seen on the interrupt line of the IGD device. I was not aware of
> > > > any public bugs filed to Intel, nor seen any errata from Intel.
> > >
> > > The original commit, f67fd55fa96f ("PCI: Add quirk for still enabled
> > > interrupts on Intel Sandy Bridge GPUs"), says some systems "crash"
> > > (not sure if that means an oops or an actual crash that requires a
> > > reboot) and on other systems, Linux disables the shared interrupt
> > > line.  I assume disabling the interrupt line keeps devices using that
> > > line from working, but does not directly cause a crash.
> > >
> >
> > Correct, disable the shared interrupt line keeps all devices using
> > that line from working, which is current kernel's behavior w/o this
> > quirk handling: it disables the (shared) interrupt line after 100.000+
> > generated interrupts. But the side effect is that other devices become
> > unusable after that (eg: USB devices which share the same interrupt
> > line with the Intel GPU). That's why the original commit, f67fd55fa96f
> > ("PCI: Add quirk for still enabled interrupts on Intel Sandy Bridge
> > GPUs") disables the GPU's interrupt directly, which should really be
> > done by the VGA BIOS itself (a buggy VBIOS!).
> >
> > > What specific symptom do you see here?  I think it might be useful to
> > > collect details, e.g., dmesg logs, /proc/interrupts contents, output
> > > of "sudo lspci -vv", etc., for the systems you're quirking here.  I'm
> > > hoping we can eventually figure out a solution that doesn't require a
> > > quirk for every new GPU, and maybe that info will help find it.
> >
> > The symptom was described briefly in the original commit f67fd55fa96f
> > too, that disables the (shared) interrupt line after 100.000+
> > generated interrupts (can be observed via /proc/interrupts).
> >
> > > > > > See commit f67fd55fa96f ("PCI: Add quirk for still enabled interrupts
> > > > > > on Intel Sandy Bridge GPUs"), and commit 7c82126a94e6 ("PCI: Add new
> > > > > > ID for Intel GPU "spurious interrupt" quirk") for some history.
> > > > > >
> > > > > > Based on current findings, it is highly possible that all Intel
> > > > > > 1st/2nd/3rd generation Core processors' IGD has such quirk.
> > > > >
> > > > > Can you include a reference to these "current findings"?  I assume you
> > > > > have bug reports that include the device IDs you're adding?  If not,
> > > > > how did you build this list of new IDs?
> > > >
> > > > By "current findings" I mean given the IDs we have here, plus previous
> > > > one added by Thomas, it's highly possible this VGA BIOS bug exists in
> > > > every 1st/2nd/3rd generation Core processors.
> > > >
> > > > > The function comment added by f67fd55fa96f ("PCI: Add quirk for still
> > > > > enabled interrupts on Intel Sandy Bridge GPUs") suggests that this is
> > > > > actually a BIOS issue, not a hardware erratum, i.e., I don't see
> > > > > anything there that suggests a hardware defect.
> > > > >
> > > > > But there must be a hole somewhere -- the kernel can't be expected to
> > > > > disable interrupts in device-specific ways when there's no driver
> > > > > loaded.  Maybe it's simply a BIOS defect or maybe there's some
> > > > > interrupt or _PRT-related setup we're missing.
> > > >
> > > > It's a pure VGA BIOS bug, not the BIOS bug or _PRT etc. The VGA BIOS
> > > > forgot to turn off the interrupt on these devices.
> > >
> > > If this is a VGA BIOS defect, it's not very likely that it will
> > > magically be fixed for all new Intel GPUs, so in effect it sounds like
> > > we need to update this list of quirks in Linux every time a new Intel
> > > GPU comes out.  That prospect is a little daunting.
> >
> > I don't have a relatively newer Intel board at hand for testing right
> > now. I can try to locate one. But as I said, it's highly possible at
> > least all 1st/2nd/3rd generation Core processors are affected.
>
> > Maybe
> > we can add all these known GPU devices of  1st/2nd/3rd generation Core
> > processors all together for now? For newer GPUs, let's wait until
> > someone reports the issue again?
>
> This is exactly my point: we don't want to have to wait for somebody
> to report an issue for every new GPU.  That (a) is a maintenance
> headache and, more importantly, (b) prevents an old kernel from
> running on new hardware.  (b) is important to distros because nobody
> wants to qualify and release a new kernel just to add a new device ID.
>
> Bottom line is that I think I'm going to have to apply this patch, but
> I want to get off this train in the future, so now is the time to find
> a better solution.
>
> > > Do you happen to know if Windows has the same problem?  I.e., if you
> > > boot an old version of Windows with a new GPU, and unplug the VGA
> > > cable, does Windows crash?  If Windows can figure out how to handle
> > > that situation gracefully, Linux should be able to do it, too.
> >
> > I suspect Windows cannot handle it too. Without the GPU awareness, the
> > interrupt line is simply on and no driver claims the devices and will
> > cause issues. I can test this.
>
> If you could test this, that would be great.  I would be quite
> surprised if Windows crashed when you unplug the VGA cable.
>

For the record, I installed Windows 7 on one of the affected boards.
The Intel GPU driver is not installed, so Windows is using the
standard VGA driver. Unplugging and replugging the VGA cable does not
crash Windows, nor did I notice anything abnormal. Since I have no idea
how Windows handles spurious interrupts, I cannot tell whether it does
anything special in the background to make things appear "normal".

> What I'm wondering is if there's some different way we could manage
> the IOAPICs or maybe disable interrupts at the PCI device level as
> David suggests.  If something like that could be done we wouldn't need
> quirks for every new device.
>
> It's possible we could learn something by running Windows on qemu and
> tracing its PCI config accesses to see whether it sets the
> PCI_COMMAND_INTX_DISABLE bit or something.

Good idea.

Regards,
Bin


* Re: [PATCH] pci: Add a few new IDs for Intel GPU "spurious interrupt" quirk
  2018-10-11  7:11           ` Bin Meng
@ 2018-10-11 16:13             ` Bjorn Helgaas
  0 siblings, 0 replies; 11+ messages in thread
From: Bjorn Helgaas @ 2018-10-11 16:13 UTC (permalink / raw)
  To: Bin Meng
  Cc: Bjorn Helgaas, linux-pci, Thomas Jarosch, stable, jani.nikula,
	joonas.lahtinen, rodrigo.vivi, intel-gfx, dri-devel,
	linux-kernel

On Thu, Oct 11, 2018 at 03:11:01PM +0800, Bin Meng wrote:
> On Wed, Oct 10, 2018 at 1:02 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Mon, Oct 08, 2018 at 05:44:08PM +0800, Bin Meng wrote:
> > > On Thu, Oct 4, 2018 at 4:12 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > On Thu, Sep 27, 2018 at 10:10:07AM +0800, Bin Meng wrote:
> > > > > On Thu, Sep 27, 2018 at 12:57 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > > On Wed, Sep 26, 2018 at 08:14:01AM -0700, Bin Meng wrote:
> > > > > > > Add more PCI IDs to the Intel GPU "spurious interrupt" quirk table,
> > > > > > > which are known to break.
> > > > > >
> > > > > > Do you have a reference for this?  Any public bug reports, bugzilla,
> > > > > > Intel spec reference or errata?  "Which are known to break" is pretty
> > > > > > vague.
> > > > >
> > > > > Sorry I used wrong words and should have been clearer. These devices
> > > > > are validated to be broken. The test I used is very simple, just
> > > > > unplug the VGA cable and plug it again, and "spurious interrupt" will
> > > > > be seen on the interrupt line of the IGD device. I was not aware of
> > > > > any public bugs filed to Intel, nor seen any errata from Intel.
> > > >
> > > > The original commit, f67fd55fa96f ("PCI: Add quirk for still enabled
> > > > interrupts on Intel Sandy Bridge GPUs"), says some systems "crash"
> > > > (not sure if that means an oops or an actual crash that requires a
> > > > reboot) and on other systems, Linux disables the shared interrupt
> > > > line.  I assume disabling the interrupt line keeps devices using that
> > > > line from working, but does not directly cause a crash.
> > > >
> > >
> > > Correct, disable the shared interrupt line keeps all devices using
> > > that line from working, which is current kernel's behavior w/o this
> > > quirk handling: it disables the (shared) interrupt line after 100.000+
> > > generated interrupts. But the side effect is that other devices become
> > > unusable after that (eg: USB devices which share the same interrupt
> > > line with the Intel GPU). That's why the original commit, f67fd55fa96f
> > > ("PCI: Add quirk for still enabled interrupts on Intel Sandy Bridge
> > > GPUs") disables the GPU's interrupt directly, which should really be
> > > done by the VGA BIOS itself (a buggy VBIOS!).
> > >
> > > > What specific symptom do you see here?  I think it might be useful to
> > > > collect details, e.g., dmesg logs, /proc/interrupts contents, output
> > > > of "sudo lspci -vv", etc., for the systems you're quirking here.  I'm
> > > > hoping we can eventually figure out a solution that doesn't require a
> > > > quirk for every new GPU, and maybe that info will help find it.
> > >
> > > The symptom was described briefly in the original commit f67fd55fa96f
> > > too, that disables the (shared) interrupt line after 100.000+
> > > generated interrupts (can be observed via /proc/interrupts).
> > >
> > > > > > > See commit f67fd55fa96f ("PCI: Add quirk for still enabled interrupts
> > > > > > > on Intel Sandy Bridge GPUs"), and commit 7c82126a94e6 ("PCI: Add new
> > > > > > > ID for Intel GPU "spurious interrupt" quirk") for some history.
> > > > > > >
> > > > > > > Based on current findings, it is highly possible that all Intel
> > > > > > > 1st/2nd/3rd generation Core processors' IGD has such quirk.
> > > > > >
> > > > > > Can you include a reference to these "current findings"?  I assume you
> > > > > > have bug reports that include the device IDs you're adding?  If not,
> > > > > > how did you build this list of new IDs?
> > > > >
> > > > > By "current findings" I mean given the IDs we have here, plus previous
> > > > > one added by Thomas, it's highly possible this VGA BIOS bug exists in
> > > > > every 1st/2nd/3rd generation Core processors.
> > > > >
> > > > > > The function comment added by f67fd55fa96f ("PCI: Add quirk for still
> > > > > > enabled interrupts on Intel Sandy Bridge GPUs") suggests that this is
> > > > > > actually a BIOS issue, not a hardware erratum, i.e., I don't see
> > > > > > anything there that suggests a hardware defect.
> > > > > >
> > > > > > But there must be a hole somewhere -- the kernel can't be expected to
> > > > > > disable interrupts in device-specific ways when there's no driver
> > > > > > loaded.  Maybe it's simply a BIOS defect or maybe there's some
> > > > > > interrupt or _PRT-related setup we're missing.
> > > > >
> > > > > It's a pure VGA BIOS bug, not the BIOS bug or _PRT etc. The VGA BIOS
> > > > > forgot to turn off the interrupt on these devices.
> > > >
> > > > If this is a VGA BIOS defect, it's not very likely that it will
> > > > magically be fixed for all new Intel GPUs, so in effect it sounds like
> > > > we need to update this list of quirks in Linux every time a new Intel
> > > > GPU comes out.  That prospect is a little daunting.
> > >
> > > I don't have a relatively newer Intel board at hand for testing right
> > > now. I can try to locate one. But as I said, it's highly possible at
> > > least all 1st/2nd/3rd generation Core processors are affected.
> >
> > > Maybe
> > > we can add all these known GPU devices of  1st/2nd/3rd generation Core
> > > processors all together for now? For newer GPUs, let's wait until
> > > someone reports the issue again?
> >
> > This is exactly my point: we don't want to have to wait for somebody
> > to report an issue for every new GPU.  That (a) is a maintenance
> > headache and, more importantly, (b) prevents an old kernel from
> > running on new hardware.  (b) is important to distros because nobody
> > wants to qualify and release a new kernel just to add a new device ID.
> >
> > Bottom line is that I think I'm going to have to apply this patch, but
> > I want to get off this train in the future, so now is the time to find
> > a better solution.
> >
> > > > Do you happen to know if Windows has the same problem?  I.e., if you
> > > > boot an old version of Windows with a new GPU, and unplug the VGA
> > > > cable, does Windows crash?  If Windows can figure out how to handle
> > > > that situation gracefully, Linux should be able to do it, too.
> > >
> > > I suspect Windows cannot handle it too. Without the GPU awareness, the
> > > interrupt line is simply on and no driver claims the devices and will
> > > cause issues. I can test this.
> >
> > If you could test this, that would be great.  I would be quite
> > surprised if Windows crashed when you unplug the VGA cable.
> >
> 
> For the record, I installed Windows 7 on one of the affected boards.
> The Intel GPU driver is not installed, so Windows is using the
> standard VGA driver. Unplugging and replugging the VGA cable does not
> crash Windows, nor did I notice anything abnormal. Since I have no idea
> how Windows handles spurious interrupts, I cannot tell whether it does
> anything special in the background to make things appear "normal".

Thanks a lot for testing this.  That's a very good clue that we can
make Linux handle this gracefully, too, without having to add
device IDs for every new GPU.

> > What I'm wondering is if there's some different way we could manage
> > the IOAPICs or maybe disable interrupts at the PCI device level as
> > David suggests.  If something like that could be done we wouldn't need
> > quirks for every new device.
> >
> > It's possible we could learn something by running Windows on qemu and
> > tracing its PCI config accesses to see whether it sets the
> > PCI_COMMAND_INTX_DISABLE bit or something.

I think we should explore using PCI_COMMAND_INTX_DISABLE.  Old devices
won't support it, so we might need the quirk for them, but new devices
should support it.
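
A minimal sketch of the "try the bit, fall back to the quirk" idea,
written against a simulated device so it can be run in user space.
Everything named fake_* is hypothetical, as is try_disable_intx(); only
the bit value is the standard one.  The assumption is that a device
which does not implement the bit treats it as reserved and reads it
back as zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_COMMAND_INTX_DISABLE 0x400  /* bit 10 of the command register */

/* Hypothetical simulated device: only bits in writable_mask can change. */
struct fake_pci_dev {
	uint16_t command;        /* current command register contents */
	uint16_t writable_mask;  /* command bits this device implements */
};

static uint16_t fake_read_command(const struct fake_pci_dev *dev)
{
	return dev->command;
}

static void fake_write_command(struct fake_pci_dev *dev, uint16_t val)
{
	/* Unimplemented bits keep their old value, like reserved bits. */
	dev->command = (uint16_t)((dev->command & ~dev->writable_mask) |
				  (val & dev->writable_mask));
}

/*
 * Try to mask INTx at the PCI level.  Returns true if the bit stuck;
 * false means the caller still needs a device-specific quirk.
 */
static bool try_disable_intx(struct fake_pci_dev *dev)
{
	uint16_t cmd;

	assert(dev != NULL);

	cmd = fake_read_command(dev);
	fake_write_command(dev, cmd | PCI_COMMAND_INTX_DISABLE);

	/* Read back to see whether the device accepted the bit. */
	return (fake_read_command(dev) & PCI_COMMAND_INTX_DISABLE) != 0;
}
```

Whether the affected GPUs actually stop asserting the interrupt when
this bit is set would still have to be verified on real hardware.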

Bjorn


Thread overview: 11+ messages
     [not found] <1537974841-29928-1-git-send-email-bmeng.cn@gmail.com>
2018-09-26 16:57 ` [PATCH] pci: Add a few new IDs for Intel GPU "spurious interrupt" quirk Bjorn Helgaas
2018-09-27  2:10   ` Bin Meng
2018-10-03 20:12     ` Bjorn Helgaas
2018-10-08  9:44       ` Bin Meng
2018-10-08 10:06         ` David Laight
2018-10-08 12:34           ` Bin Meng
2018-10-08 13:27             ` David Laight
2018-10-09 17:01         ` Bjorn Helgaas
2018-10-10  8:00           ` Thomas Jarosch
2018-10-11  7:11           ` Bin Meng
2018-10-11 16:13             ` Bjorn Helgaas
