linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] PCI: Blacklist AMD Stoney GPU devices for ATS
@ 2017-03-28 12:16 Joerg Roedel
  2017-03-28 20:18 ` Deucher, Alexander
  2017-04-04 16:43 ` Bjorn Helgaas
  0 siblings, 2 replies; 13+ messages in thread
From: Joerg Roedel @ 2017-03-28 12:16 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, linux-kernel, Joerg Roedel, Daniel Drake, Alexander Deucher

From: Joerg Roedel <jroedel@suse.de>

ATS is broken on these devices. Under invalidation load, the
GPU does not reply to invalidations anymore, causing
Completion-wait loop timeouts on the AMD IOMMU driver side.
Fix it by not enabling ATS on these devices.

Note that below mentioned commit is not broken, it just
triggers the issue because it might cause invalidation
storms on devices.

Fixes: b1516a14657a ('iommu/amd: Implement flush queue')
Reported-by: Daniel Drake <drake@endlessm.com>
Cc: Daniel Drake <drake@endlessm.com>
Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/pci/ats.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c
index eeb9fb2..711bdb2 100644
--- a/drivers/pci/ats.c
+++ b/drivers/pci/ats.c
@@ -17,10 +17,18 @@
 
 #include "pci.h"
 
+static const struct pci_device_id broken_ats_tbl[] = {
+	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, 0x98e4) }, /* AMD Stoney GPU part */
+	{ 0 }
+};
+
 void pci_ats_init(struct pci_dev *dev)
 {
 	int pos;
 
+	if (pci_match_id(broken_ats_tbl, dev))
+		return;
+
 	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS);
 	if (!pos)
 		return;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* RE: [PATCH] PCI: Blacklist AMD Stoney GPU devices for ATS
  2017-03-28 12:16 [PATCH] PCI: Blacklist AMD Stoney GPU devices for ATS Joerg Roedel
@ 2017-03-28 20:18 ` Deucher, Alexander
  2017-03-28 20:28   ` Joerg Roedel
  2017-04-04 16:43 ` Bjorn Helgaas
  1 sibling, 1 reply; 13+ messages in thread
From: Deucher, Alexander @ 2017-03-28 20:18 UTC (permalink / raw)
  To: 'Joerg Roedel', Bjorn Helgaas
  Cc: linux-pci, linux-kernel, Joerg Roedel, Daniel Drake, Nath, Arindam

> -----Original Message-----
> From: Joerg Roedel [mailto:joro@8bytes.org]
> Sent: Tuesday, March 28, 2017 8:17 AM
> To: Bjorn Helgaas
> Cc: linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; Joerg Roedel;
> Daniel Drake; Deucher, Alexander
> Subject: [PATCH] PCI: Blacklist AMD Stoney GPU devices for ATS
> 
> From: Joerg Roedel <jroedel@suse.de>
> 
> ATS is broken on these devices. Under invalidation load, the
> GPU does not reply to invalidations anymore, causing
> Completion-wait loop timeouts on the AMD IOMMU driver side.
> Fix it by not enabling ATS on these devices.
> 
> Note that below mentioned commit is not broken, it just
> triggers the issue because it might cause invalidation
> storms on devices.
> 
> Fixes: b1516a14657a ('iommu/amd: Implement flush queue')
> Reported-by: Daniel Drake <drake@endlessm.com>
> Cc: Daniel Drake <drake@endlessm.com>
> Cc: Alexander Deucher <Alexander.Deucher@amd.com>
> Signed-off-by: Joerg Roedel <jroedel@suse.de>

Did you see Arindam's patch from yesterday[1]?  Not sure which is the proper fix, maybe both?

Alex

[1] - https://lists.freedesktop.org/archives/amd-gfx/2017-March/006862.html

> ---
>  drivers/pci/ats.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c
> index eeb9fb2..711bdb2 100644
> --- a/drivers/pci/ats.c
> +++ b/drivers/pci/ats.c
> @@ -17,10 +17,18 @@
> 
>  #include "pci.h"
> 
> +static const struct pci_device_id broken_ats_tbl[] = {
> +	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, 0x98e4) }, /* AMD Stoney GPU
> part */
> +	{ 0 }
> +};
> +
>  void pci_ats_init(struct pci_dev *dev)
>  {
>  	int pos;
> 
> +	if (pci_match_id(broken_ats_tbl, dev))
> +		return;
> +
>  	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS);
>  	if (!pos)
>  		return;
> --
> 1.9.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI: Blacklist AMD Stoney GPU devices for ATS
  2017-03-28 20:18 ` Deucher, Alexander
@ 2017-03-28 20:28   ` Joerg Roedel
  2017-03-28 20:37     ` Deucher, Alexander
  0 siblings, 1 reply; 13+ messages in thread
From: Joerg Roedel @ 2017-03-28 20:28 UTC (permalink / raw)
  To: Deucher, Alexander
  Cc: 'Joerg Roedel',
	Bjorn Helgaas, linux-pci, linux-kernel, Daniel Drake, Nath,
	Arindam

On Tue, Mar 28, 2017 at 08:18:26PM +0000, Deucher, Alexander wrote:
> > -----Original Message-----
> > From: Joerg Roedel [mailto:joro@8bytes.org]
> > Sent: Tuesday, March 28, 2017 8:17 AM
> > To: Bjorn Helgaas
> > Cc: linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; Joerg Roedel;
> > Daniel Drake; Deucher, Alexander
> > Subject: [PATCH] PCI: Blacklist AMD Stoney GPU devices for ATS
> > 
> > From: Joerg Roedel <jroedel@suse.de>
> > 
> > ATS is broken on these devices. Under invalidation load, the
> > GPU does not reply to invalidations anymore, causing
> > Completion-wait loop timeouts on the AMD IOMMU driver side.
> > Fix it by not enabling ATS on these devices.
> > 
> > Note that below mentioned commit is not broken, it just
> > triggers the issue because it might cause invalidation
> > storms on devices.
> > 
> > Fixes: b1516a14657a ('iommu/amd: Implement flush queue')
> > Reported-by: Daniel Drake <drake@endlessm.com>
> > Cc: Daniel Drake <drake@endlessm.com>
> > Cc: Alexander Deucher <Alexander.Deucher@amd.com>
> > Signed-off-by: Joerg Roedel <jroedel@suse.de>
> 
> Did you see Arindam's patch from yesterday[1]?  Not sure which is the proper fix, maybe both?

Arindam's patch makes sense on its own, but not as a fix for this issue.
It lowers the invalidation load on the GPU, but there are still ways to
trigger a high invalidation rate on the device. So it might hide the
issue, but not fix it.

We need to disable ATS on the device if it doesn't work reliably.



	Joerg

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [PATCH] PCI: Blacklist AMD Stoney GPU devices for ATS
  2017-03-28 20:28   ` Joerg Roedel
@ 2017-03-28 20:37     ` Deucher, Alexander
  2017-03-28 20:56       ` 'Joerg Roedel'
  0 siblings, 1 reply; 13+ messages in thread
From: Deucher, Alexander @ 2017-03-28 20:37 UTC (permalink / raw)
  To: 'Joerg Roedel'
  Cc: 'Joerg Roedel',
	Bjorn Helgaas, linux-pci, linux-kernel, Daniel Drake, Nath,
	Arindam

> -----Original Message-----
> From: Joerg Roedel [mailto:jroedel@suse.de]
> Sent: Tuesday, March 28, 2017 4:29 PM
> To: Deucher, Alexander
> Cc: 'Joerg Roedel'; Bjorn Helgaas; linux-pci@vger.kernel.org; linux-
> kernel@vger.kernel.org; Daniel Drake; Nath, Arindam
> Subject: Re: [PATCH] PCI: Blacklist AMD Stoney GPU devices for ATS
> 
> On Tue, Mar 28, 2017 at 08:18:26PM +0000, Deucher, Alexander wrote:
> > > -----Original Message-----
> > > From: Joerg Roedel [mailto:joro@8bytes.org]
> > > Sent: Tuesday, March 28, 2017 8:17 AM
> > > To: Bjorn Helgaas
> > > Cc: linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; Joerg
> Roedel;
> > > Daniel Drake; Deucher, Alexander
> > > Subject: [PATCH] PCI: Blacklist AMD Stoney GPU devices for ATS
> > >
> > > From: Joerg Roedel <jroedel@suse.de>
> > >
> > > ATS is broken on these devices. Under invalidation load, the
> > > GPU does not reply to invalidations anymore, causing
> > > Completion-wait loop timeouts on the AMD IOMMU driver side.
> > > Fix it by not enabling ATS on these devices.
> > >
> > > Note that below mentioned commit is not broken, it just
> > > triggers the issue because it might cause invalidation
> > > storms on devices.
> > >
> > > Fixes: b1516a14657a ('iommu/amd: Implement flush queue')
> > > Reported-by: Daniel Drake <drake@endlessm.com>
> > > Cc: Daniel Drake <drake@endlessm.com>
> > > Cc: Alexander Deucher <Alexander.Deucher@amd.com>
> > > Signed-off-by: Joerg Roedel <jroedel@suse.de>
> >
> > Did you see Arindam's patch from yesterday[1]?  Not sure which is the
> proper fix, maybe both?
> 
> Arindam's patch makes sense on its own, but not as a fix for this issue.
> It lowers the invalidation load on the GPU, but there are still ways to
> trigger a high invalidation rate on the device. So it might hide the
> issue, but not fix it.
> 
> We need to disable ATS on the device if it doesn't work reliably.

The question is, could the problem stem from flushing an entity that didn't request it, or should that not matter?  I guess it shouldn't matter otherwise we'd see this on other platforms like Carrizo as well.

Alex

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI: Blacklist AMD Stoney GPU devices for ATS
  2017-03-28 20:37     ` Deucher, Alexander
@ 2017-03-28 20:56       ` 'Joerg Roedel'
  2017-03-28 21:13         ` Deucher, Alexander
  0 siblings, 1 reply; 13+ messages in thread
From: 'Joerg Roedel' @ 2017-03-28 20:56 UTC (permalink / raw)
  To: Deucher, Alexander
  Cc: 'Joerg Roedel',
	Bjorn Helgaas, linux-pci, linux-kernel, Daniel Drake, Nath,
	Arindam

On Tue, Mar 28, 2017 at 08:37:01PM +0000, Deucher, Alexander wrote:
> The question is, could the problem stem from flushing an entity that
> didn't request it, or should that not matter?  I guess it shouldn't
> matter otherwise we'd see this on other platforms like Carrizo as
> well.

What do you mean by "didn't request it"? The IOMMU driver only sends
io/tlb invalidations to devices it enabled ATS on.



	Joerg

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [PATCH] PCI: Blacklist AMD Stoney GPU devices for ATS
  2017-03-28 20:56       ` 'Joerg Roedel'
@ 2017-03-28 21:13         ` Deucher, Alexander
  2017-03-28 22:26           ` 'Joerg Roedel'
  0 siblings, 1 reply; 13+ messages in thread
From: Deucher, Alexander @ 2017-03-28 21:13 UTC (permalink / raw)
  To: 'Joerg Roedel'
  Cc: 'Joerg Roedel',
	Bjorn Helgaas, linux-pci, linux-kernel, Daniel Drake, Nath,
	Arindam

> -----Original Message-----
> From: 'Joerg Roedel' [mailto:jroedel@suse.de]
> Sent: Tuesday, March 28, 2017 4:56 PM
> To: Deucher, Alexander
> Cc: 'Joerg Roedel'; Bjorn Helgaas; linux-pci@vger.kernel.org; linux-
> kernel@vger.kernel.org; Daniel Drake; Nath, Arindam
> Subject: Re: [PATCH] PCI: Blacklist AMD Stoney GPU devices for ATS
> 
> On Tue, Mar 28, 2017 at 08:37:01PM +0000, Deucher, Alexander wrote:
> > The question is, could the problem stem from flushing an entity that
> > didn't request it, or should that not matter?  I guess it shouldn't
> > matter otherwise we'd see this on other platforms like Carrizo as
> > well.
> 
> What do you mean by "didn't request it"? The IOMMU driver only sends
> io/tlb invalidations to devices it enabled ATS on.

If I understand Arindam's patch correctly, it only flushes TLB entries for domains in the flush queue whereas the previous behavior was to flush all domains.  If there was no TLB flush in the queue for that domain, could flushing it cause a problem?  Sorry, I'm not too familiar with the IOMMU code or ATS.

Alex

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI: Blacklist AMD Stoney GPU devices for ATS
  2017-03-28 21:13         ` Deucher, Alexander
@ 2017-03-28 22:26           ` 'Joerg Roedel'
  2017-03-29  7:15             ` Nath, Arindam
  2017-03-29 16:21             ` Deucher, Alexander
  0 siblings, 2 replies; 13+ messages in thread
From: 'Joerg Roedel' @ 2017-03-28 22:26 UTC (permalink / raw)
  To: Deucher, Alexander
  Cc: 'Joerg Roedel',
	Bjorn Helgaas, linux-pci, linux-kernel, Daniel Drake, Nath,
	Arindam

On Tue, Mar 28, 2017 at 09:13:23PM +0000, Deucher, Alexander wrote:
> If I understand Arindam's patch correctly, it only flushes TLB entries
> for domains in the flush queue whereas the previous behavior was to
> flush all domains.  If there was no TLB flush in the queue for that
> domain, could flushing it cause a problem?

No, that can't cause a problem. An io/tlb flush for the device is just a
message that the device should invalidate its own tlb. The device can't
know and doesn't need to know whether the page-tables it used to fill
the tlb really changed.

As it looks, the problem we are seeing here is that we are sending a
large amount of these requests to the GPU device, and wait for its
completion every time. This shouldn't be a problem for ATS devices, but
the GPU here seems to fail at some point and doesn't answer to the
invalidation request anymore, causing the completion-wait loop timeouts.

Arindam's patch makes the high flush-frequency less likely, but it can
still happen, depending on how the GPU is used. So its the best to
keep ATS disabled on the device as it doesn't work correctly and we risk
running in the same problem again when we leave it enabled and just make
the trigger less likely.


	Joerg

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [PATCH] PCI: Blacklist AMD Stoney GPU devices for ATS
  2017-03-28 22:26           ` 'Joerg Roedel'
@ 2017-03-29  7:15             ` Nath, Arindam
  2017-03-29  9:42               ` 'Joerg Roedel'
  2017-03-29 16:21             ` Deucher, Alexander
  1 sibling, 1 reply; 13+ messages in thread
From: Nath, Arindam @ 2017-03-29  7:15 UTC (permalink / raw)
  To: 'Joerg Roedel', Deucher, Alexander
  Cc: 'Joerg Roedel',
	Bjorn Helgaas, linux-pci, linux-kernel, Daniel Drake

>-----Original Message-----
>From: 'Joerg Roedel' [mailto:jroedel@suse.de]
>Sent: Wednesday, March 29, 2017 3:56 AM
>To: Deucher, Alexander
>Cc: 'Joerg Roedel'; Bjorn Helgaas; linux-pci@vger.kernel.org; linux-
>kernel@vger.kernel.org; Daniel Drake; Nath, Arindam
>Subject: Re: [PATCH] PCI: Blacklist AMD Stoney GPU devices for ATS
>
>On Tue, Mar 28, 2017 at 09:13:23PM +0000, Deucher, Alexander wrote:
>> If I understand Arindam's patch correctly, it only flushes TLB entries
>> for domains in the flush queue whereas the previous behavior was to
>> flush all domains.  If there was no TLB flush in the queue for that
>> domain, could flushing it cause a problem?
>
>No, that can't cause a problem. An io/tlb flush for the device is just a
>message that the device should invalidate its own tlb. The device can't
>know and doesn't need to know whether the page-tables it used to fill
>the tlb really changed.

Joerg, as per my limited understanding of ATS, the ATC will respond to invalidation requests after making sure there are no in-flight DMA transactions with the address requested by IOMMU to be invalidated. Now since the IOMMU was sending invalidate command to GPU even though there was no explicit page unmapping request from the graphics subsystem, we _might_ end up in a situation where the ATC takes longer than the invalidation timeout to respond to IOMMU.

With the patch I provided, since only those domains who have actually requested for unmapping pages have been added to the flush queue, we send TLB invalidation commands to only those specific domains. This avoids sending invalidation command to GPU ATC every single flush.

I do agree that the Stoney might have some issue which causes it not to be able complete the invalidation command in time since we are not observing the issue on CZ and other ASICs.

Thanks,
Arindam

>
>As it looks, the problem we are seeing here is that we are sending a
>large amount of these requests to the GPU device, and wait for its
>completion every time. This shouldn't be a problem for ATS devices, but
>the GPU here seems to fail at some point and doesn't answer to the
>invalidation request anymore, causing the completion-wait loop timeouts.
>
>Arindam's patch makes the high flush-frequency less likely, but it can
>still happen, depending on how the GPU is used. So its the best to
>keep ATS disabled on the device as it doesn't work correctly and we risk
>running in the same problem again when we leave it enabled and just make
>the trigger less likely.
>
>
>	Joerg

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI: Blacklist AMD Stoney GPU devices for ATS
  2017-03-29  7:15             ` Nath, Arindam
@ 2017-03-29  9:42               ` 'Joerg Roedel'
  2017-03-29  9:47                 ` Nath, Arindam
  0 siblings, 1 reply; 13+ messages in thread
From: 'Joerg Roedel' @ 2017-03-29  9:42 UTC (permalink / raw)
  To: Nath, Arindam
  Cc: Deucher, Alexander, 'Joerg Roedel',
	Bjorn Helgaas, linux-pci, linux-kernel, Daniel Drake

Hi Arindam,

On Wed, Mar 29, 2017 at 07:15:42AM +0000, Nath, Arindam wrote:
> Joerg, as per my limited understanding of ATS, the ATC will respond to
> invalidation requests after making sure there are no in-flight DMA
> transactions with the address requested by IOMMU to be invalidated.
> Now since the IOMMU was sending invalidate command to GPU even though
> there was no explicit page unmapping request from the graphics
> subsystem, we _might_ end up in a situation where the ATC takes longer
> than the invalidation timeout to respond to IOMMU.

The maximum wait-time in the loop is 100ms. This should be more than
enough for the ATC to complete any in-flight transaction and flush its
internal TLB.

If that is not enough, there is almost certainly something wrong with
the hardware.



	Joerg

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [PATCH] PCI: Blacklist AMD Stoney GPU devices for ATS
  2017-03-29  9:42               ` 'Joerg Roedel'
@ 2017-03-29  9:47                 ` Nath, Arindam
  0 siblings, 0 replies; 13+ messages in thread
From: Nath, Arindam @ 2017-03-29  9:47 UTC (permalink / raw)
  To: 'Joerg Roedel'
  Cc: Deucher, Alexander, 'Joerg Roedel',
	Bjorn Helgaas, linux-pci, linux-kernel, Daniel Drake

>-----Original Message-----
>From: 'Joerg Roedel' [mailto:jroedel@suse.de]
>Sent: Wednesday, March 29, 2017 3:12 PM
>To: Nath, Arindam
>Cc: Deucher, Alexander; 'Joerg Roedel'; Bjorn Helgaas; linux-
>pci@vger.kernel.org; linux-kernel@vger.kernel.org; Daniel Drake
>Subject: Re: [PATCH] PCI: Blacklist AMD Stoney GPU devices for ATS
>
>Hi Arindam,
>
>On Wed, Mar 29, 2017 at 07:15:42AM +0000, Nath, Arindam wrote:
>> Joerg, as per my limited understanding of ATS, the ATC will respond to
>> invalidation requests after making sure there are no in-flight DMA
>> transactions with the address requested by IOMMU to be invalidated.
>> Now since the IOMMU was sending invalidate command to GPU even
>though
>> there was no explicit page unmapping request from the graphics
>> subsystem, we _might_ end up in a situation where the ATC takes longer
>> than the invalidation timeout to respond to IOMMU.
>
>The maximum wait-time in the loop is 100ms. This should be more than
>enough for the ATC to complete any in-flight transaction and flush its
>internal TLB.
>
>If that is not enough, there is almost certainly something wrong with
>the hardware.

Actually what you said is correct. Before creating the patch, I had experimented with 1s and 10s invalidation timeout values, but none of them helped.

Thanks,
Arindam

>
>
>
>	Joerg

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [PATCH] PCI: Blacklist AMD Stoney GPU devices for ATS
  2017-03-28 22:26           ` 'Joerg Roedel'
  2017-03-29  7:15             ` Nath, Arindam
@ 2017-03-29 16:21             ` Deucher, Alexander
  1 sibling, 0 replies; 13+ messages in thread
From: Deucher, Alexander @ 2017-03-29 16:21 UTC (permalink / raw)
  To: 'Joerg Roedel'
  Cc: 'Joerg Roedel',
	Bjorn Helgaas, linux-pci, linux-kernel, Daniel Drake, Nath,
	Arindam

> -----Original Message-----
> From: 'Joerg Roedel' [mailto:jroedel@suse.de]
> Sent: Tuesday, March 28, 2017 6:26 PM
> To: Deucher, Alexander
> Cc: 'Joerg Roedel'; Bjorn Helgaas; linux-pci@vger.kernel.org; linux-
> kernel@vger.kernel.org; Daniel Drake; Nath, Arindam
> Subject: Re: [PATCH] PCI: Blacklist AMD Stoney GPU devices for ATS
> 
> On Tue, Mar 28, 2017 at 09:13:23PM +0000, Deucher, Alexander wrote:
> > If I understand Arindam's patch correctly, it only flushes TLB entries
> > for domains in the flush queue whereas the previous behavior was to
> > flush all domains.  If there was no TLB flush in the queue for that
> > domain, could flushing it cause a problem?
> 
> No, that can't cause a problem. An io/tlb flush for the device is just a
> message that the device should invalidate its own tlb. The device can't
> know and doesn't need to know whether the page-tables it used to fill
> the tlb really changed.
> 
> As it looks, the problem we are seeing here is that we are sending a
> large amount of these requests to the GPU device, and wait for its
> completion every time. This shouldn't be a problem for ATS devices, but
> the GPU here seems to fail at some point and doesn't answer to the
> invalidation request anymore, causing the completion-wait loop timeouts.
> 
> Arindam's patch makes the high flush-frequency less likely, but it can
> still happen, depending on how the GPU is used. So its the best to
> keep ATS disabled on the device as it doesn't work correctly and we risk
> running in the same problem again when we leave it enabled and just make
> the trigger less likely.

Thanks for clarifying.  The patch is:
Acked-by: Alex Deucher <alexander.deucher@amd.com>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI: Blacklist AMD Stoney GPU devices for ATS
  2017-03-28 12:16 [PATCH] PCI: Blacklist AMD Stoney GPU devices for ATS Joerg Roedel
  2017-03-28 20:18 ` Deucher, Alexander
@ 2017-04-04 16:43 ` Bjorn Helgaas
  2017-04-07 12:34   ` Joerg Roedel
  1 sibling, 1 reply; 13+ messages in thread
From: Bjorn Helgaas @ 2017-04-04 16:43 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Bjorn Helgaas, linux-pci, linux-kernel, Joerg Roedel,
	Daniel Drake, Alexander Deucher

Hi Joerg,

On Tue, Mar 28, 2017 at 02:16:44PM +0200, Joerg Roedel wrote:
> From: Joerg Roedel <jroedel@suse.de>
> 
> ATS is broken on these devices. Under invalidation load, the
> GPU does not reply to invalidations anymore, causing
> Completion-wait loop timeouts on the AMD IOMMU driver side.
> Fix it by not enabling ATS on these devices.
> 
> Note that below mentioned commit is not broken, it just
> triggers the issue because it might cause invalidation
> storms on devices.
> 
> Fixes: b1516a14657a ('iommu/amd: Implement flush queue')
> Reported-by: Daniel Drake <drake@endlessm.com>
> Cc: Daniel Drake <drake@endlessm.com>
> Cc: Alexander Deucher <Alexander.Deucher@amd.com>
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  drivers/pci/ats.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c
> index eeb9fb2..711bdb2 100644
> --- a/drivers/pci/ats.c
> +++ b/drivers/pci/ats.c
> @@ -17,10 +17,18 @@
>  
>  #include "pci.h"
>  
> +static const struct pci_device_id broken_ats_tbl[] = {
> +	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, 0x98e4) }, /* AMD Stoney GPU part */
> +	{ 0 }
> +};
> +
>  void pci_ats_init(struct pci_dev *dev)
>  {
>  	int pos;
>  
> +	if (pci_match_id(broken_ats_tbl, dev))
> +		return;

This is fine functionally, but from a stylistic point of view, I guess
I would prefer to have it implemented in drivers/pci/quirks.c just to
have some consistency in how we work around device defects.

>  	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS);
>  	if (!pos)
>  		return;
> -- 
> 1.9.1
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI: Blacklist AMD Stoney GPU devices for ATS
  2017-04-04 16:43 ` Bjorn Helgaas
@ 2017-04-07 12:34   ` Joerg Roedel
  0 siblings, 0 replies; 13+ messages in thread
From: Joerg Roedel @ 2017-04-07 12:34 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Joerg Roedel, Bjorn Helgaas, linux-pci, linux-kernel,
	Daniel Drake, Alexander Deucher

Hi Bjorn,

On Tue, Apr 04, 2017 at 11:43:11AM -0500, Bjorn Helgaas wrote:
> > +static const struct pci_device_id broken_ats_tbl[] = {
> > +	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, 0x98e4) }, /* AMD Stoney GPU part */

Just found out that the affected GPU uses PCI_VENDOR_ID_ATI, fixed that
too.

> > +	{ 0 }
> > +};
> > +
> >  void pci_ats_init(struct pci_dev *dev)
> > 
> >  	int pos;
> >  
> > +	if (pci_match_id(broken_ats_tbl, dev))
> > +		return;
> 
> This is fine functionally, but from a stylistic point of view, I guess
> I would prefer to have it implemented in drivers/pci/quirks.c just to
> have some consistency in how we work around device defects.

Done, I send a new patch shortly.



	Joerg

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2017-04-07 12:34 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-28 12:16 [PATCH] PCI: Blacklist AMD Stoney GPU devices for ATS Joerg Roedel
2017-03-28 20:18 ` Deucher, Alexander
2017-03-28 20:28   ` Joerg Roedel
2017-03-28 20:37     ` Deucher, Alexander
2017-03-28 20:56       ` 'Joerg Roedel'
2017-03-28 21:13         ` Deucher, Alexander
2017-03-28 22:26           ` 'Joerg Roedel'
2017-03-29  7:15             ` Nath, Arindam
2017-03-29  9:42               ` 'Joerg Roedel'
2017-03-29  9:47                 ` Nath, Arindam
2017-03-29 16:21             ` Deucher, Alexander
2017-04-04 16:43 ` Bjorn Helgaas
2017-04-07 12:34   ` Joerg Roedel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).