From: "Merger, Edgar [AUTOSOL/MAS/AUGS]" <Edgar.Merger@emerson.com>
To: "Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"Huang, Ray" <Ray.Huang@amd.com>,
	"Kuehling, Felix" <Felix.Kuehling@amd.com>
Cc: Will Deacon <will@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	"iommu@lists.linux-foundation.org"
	<iommu@lists.linux-foundation.org>,
	Bjorn Helgaas <bhelgaas@google.com>,
	Joerg Roedel <jroedel@suse.de>,
	"Zhu, Changfeng" <Changfeng.Zhu@amd.com>
Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken
Date: Wed, 25 Nov 2020 10:03:48 +0000
Message-ID: <CY4PR10MB13022501A57CC02FF5BC632B89FA0@CY4PR10MB1302.namprd10.prod.outlook.com>
In-Reply-To: <CY4PR10MB13029B38D31936622E4CA62389FA0@CY4PR10MB1302.namprd10.prod.outlook.com>

I also have other problems with this unit when the IOMMU is enabled and pci=noats is not set as a kernel parameter:

[ 2004.265906] amdgpu 0000:0b:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx (-110).
[ 2004.266024] [drm:amdgpu_device_delayed_init_work_handler [amdgpu]] *ERROR* ib ring test failed (-110).
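
For reference, the (-110) in these messages is a negated Linux errno value, and 110 is ETIMEDOUT on Linux, so the IB test timed out. A minimal Python sketch for pulling such codes out of log lines follows; the small lookup table and the helper name decode_errors are illustrative assumptions, not part of any driver tooling.

```python
import re

# Linux errno values (asm-generic numbering); only the ones of interest here.
LINUX_ERRNO = {22: "EINVAL", 110: "ETIMEDOUT"}

def decode_errors(log_line):
    """Return a name for every '(-N)' style error code found in a kernel log line."""
    codes = re.findall(r"\(-(\d+)\)", log_line)
    return [LINUX_ERRNO.get(int(c), f"errno {c}") for c in codes]

line = ("[ 2004.265906] amdgpu 0000:0b:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] "
        "*ERROR* IB test failed on gfx (-110).")
print(decode_errors(line))  # ['ETIMEDOUT']
```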

-----Original Message-----
From: Merger, Edgar [AUTOSOL/MAS/AUGS] 
Sent: Wednesday, 25 November 2020 10:16
To: 'Deucher, Alexander' <Alexander.Deucher@amd.com>; 'Huang, Ray' <Ray.Huang@amd.com>; 'Kuehling, Felix' <Felix.Kuehling@amd.com>
Cc: 'Will Deacon' <will@kernel.org>; 'linux-kernel@vger.kernel.org' <linux-kernel@vger.kernel.org>; 'linux-pci@vger.kernel.org' <linux-pci@vger.kernel.org>; 'iommu@lists.linux-foundation.org' <iommu@lists.linux-foundation.org>; 'Bjorn Helgaas' <bhelgaas@google.com>; 'Joerg Roedel' <jroedel@suse.de>; 'Zhu, Changfeng' <Changfeng.Zhu@amd.com>
Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

Remark: 

Systems with R1305G APU (which show the issue) have the following VGA-Controller:
0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Picasso (rev cf)

Systems with V1404I APU (which do not show the issue) have the following VGA-Controller:
0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series] (rev 83)

"rev cf" vs. "rev 83" is probably what you were referring to with PCI Revision ID.

-----Original Message-----
From: Merger, Edgar [AUTOSOL/MAS/AUGS]
Sent: Wednesday, 25 November 2020 07:05
To: 'Deucher, Alexander' <Alexander.Deucher@amd.com>; Huang, Ray <Ray.Huang@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>
Cc: Will Deacon <will@kernel.org>; linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org; iommu@lists.linux-foundation.org; Bjorn Helgaas <bhelgaas@google.com>; Joerg Roedel <jroedel@suse.de>; Zhu, Changfeng <Changfeng.Zhu@amd.com>
Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

I see that problem only on systems that use an R1305G APU.

sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info

shows

VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 50, firmware version: 0x000000a3
PFP feature version: 50, firmware version: 0x000000bb
CE feature version: 50, firmware version: 0x0000004f
RLC feature version: 1, firmware version: 0x00000049
RLC SRLC feature version: 1, firmware version: 0x00000001
RLC SRLG feature version: 1, firmware version: 0x00000001
RLC SRLS feature version: 1, firmware version: 0x00000001
MEC feature version: 50, firmware version: 0x000001b5
MEC2 feature version: 50, firmware version: 0x000001b5
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 0, firmware version: 0x21000030
TA XGMI feature version: 0, firmware version: 0x00000000
TA RAS feature version: 0, firmware version: 0x00000000
SMC feature version: 0, firmware version: 0x00002527
SDMA0 feature version: 41, firmware version: 0x000000a9
VCN feature version: 0, firmware version: 0x0110901c
DMCU feature version: 0, firmware version: 0x00000001
VBIOS version: 113-RAVEN2-117

We are also using the V1404I APU on the same boards, and I haven't seen the issue on those boards.

These boards give me slightly different info from sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info:
 
VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 47, firmware version: 0x000000a2
PFP feature version: 47, firmware version: 0x000000b9
CE feature version: 47, firmware version: 0x0000004e
RLC feature version: 1, firmware version: 0x00000213
RLC SRLC feature version: 1, firmware version: 0x00000001
RLC SRLG feature version: 1, firmware version: 0x00000001
RLC SRLS feature version: 1, firmware version: 0x00000001
MEC feature version: 47, firmware version: 0x000001ab
MEC2 feature version: 47, firmware version: 0x000001ab
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 0, firmware version: 0x21000013
TA XGMI feature version: 0, firmware version: 0x00000000
TA RAS feature version: 0, firmware version: 0x00000000
SMC feature version: 0, firmware version: 0x00001e5b
SDMA0 feature version: 41, firmware version: 0x000000a9
VCN feature version: 0, firmware version: 0x0110901c
DMCU feature version: 0, firmware version: 0x00000000
VBIOS version: 113-RAVEN-116
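
To make the differences between the two amdgpu_firmware_info dumps easier to spot, they can be parsed and compared mechanically. The following is a minimal Python sketch, assuming the "NAME feature version: N, firmware version: 0x..." layout shown above; the helper names parse_firmware_info and changed are illustrative, and the two strings are excerpts copied from the dumps rather than the full output.

```python
import re

# One entry looks like: "ME feature version: 50, firmware version: 0x000000a3"
ENTRY = re.compile(
    r"([A-Z][A-Z0-9 ]*?) feature version: (\d+), firmware version: (0x[0-9a-f]+)"
)

def parse_firmware_info(text):
    """Map component name -> (feature version, firmware version)."""
    return {name: (int(feat), int(fw, 16)) for name, feat, fw in ENTRY.findall(text)}

def changed(a, b):
    """Sorted names whose (feature, firmware) pair differs between two parsed dumps."""
    return sorted(k for k in a.keys() & b.keys() if a[k] != b[k])

# Excerpts from the R1305G and V1404I dumps above.
r1305g = """
ME feature version: 50, firmware version: 0x000000a3
PFP feature version: 50, firmware version: 0x000000bb
CE feature version: 50, firmware version: 0x0000004f
RLC feature version: 1, firmware version: 0x00000049
MEC feature version: 50, firmware version: 0x000001b5
SMC feature version: 0, firmware version: 0x00002527
SDMA0 feature version: 41, firmware version: 0x000000a9
VCN feature version: 0, firmware version: 0x0110901c
DMCU feature version: 0, firmware version: 0x00000001
"""
v1404i = """
ME feature version: 47, firmware version: 0x000000a2
PFP feature version: 47, firmware version: 0x000000b9
CE feature version: 47, firmware version: 0x0000004e
RLC feature version: 1, firmware version: 0x00000213
MEC feature version: 47, firmware version: 0x000001ab
SMC feature version: 0, firmware version: 0x00001e5b
SDMA0 feature version: 41, firmware version: 0x000000a9
VCN feature version: 0, firmware version: 0x0110901c
DMCU feature version: 0, firmware version: 0x00000000
"""

diff = changed(parse_firmware_info(r1305g), parse_firmware_info(v1404i))
print(diff)  # ['CE', 'DMCU', 'ME', 'MEC', 'PFP', 'RLC', 'SMC']
```

In this excerpt only SDMA0 and VCN are identical on both APUs; the SMC firmware differs (0x00002527 vs. 0x00001e5b).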




00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 IOMMU
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 PCIe GPP Bridge [6:0]
00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Zeppelin Switch Upstream (PCIE SW.US)
00:01.4 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 PCIe GPP Bridge [6:0]
00:01.5 PCI bridge: Advanced Micro Devices, Inc. [AMD] Zeppelin Switch Upstream (PCIE SW.US)
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Internal PCIe GPP Bridge 0 to Bus A
00:08.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Internal PCIe GPP Bridge 0 to Bus B
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 61)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 7
01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0e)
01:00.1 Serial controller: Realtek Semiconductor Co., Ltd. Device 816a (rev 0e)
01:00.2 Serial controller: Realtek Semiconductor Co., Ltd. Device 816b (rev 0e)
01:00.3 IPMI Interface: Realtek Semiconductor Co., Ltd. Device 816c (rev 0e)
01:00.4 USB controller: Realtek Semiconductor Co., Ltd. Device 816d (rev 0e)
02:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
03:00.0 PCI bridge: Pericom Semiconductor PI7C9X2G608GP PCIe2 6-Port/8-Lane Packet Switch
04:01.0 PCI bridge: Pericom Semiconductor PI7C9X2G608GP PCIe2 6-Port/8-Lane Packet Switch
04:02.0 PCI bridge: Pericom Semiconductor PI7C9X2G608GP PCIe2 6-Port/8-Lane Packet Switch
04:03.0 PCI bridge: Pericom Semiconductor PI7C9X2G608GP PCIe2 6-Port/8-Lane Packet Switch
04:04.0 PCI bridge: Pericom Semiconductor PI7C9X2G608GP PCIe2 6-Port/8-Lane Packet Switch
04:05.0 PCI bridge: Pericom Semiconductor PI7C9X2G608GP PCIe2 6-Port/8-Lane Packet Switch
06:00.0 Serial controller: Asix Electronics Corporation Device 9100
06:00.1 Serial controller: Asix Electronics Corporation Device 9100
07:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
0a:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Picasso (rev cf)
0b:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Raven/Raven2/Fenghuang HDMI/DP Audio Controller
0b:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor
0b:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Raven2 USB 3.1
0b:00.5 Multimedia controller: Advanced Micro Devices, Inc. [AMD] Raven/Raven2/FireFlight/Renoir Audio Processor
0b:00.7 Non-VGA unclassified device: Advanced Micro Devices, Inc. [AMD] Raven/Raven2/Renoir Non-Sensor Fusion Hub KMDF driver
0c:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 61)

PCI Revision ID is 06 I believe. Got that from this lspci -xx

00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Zeppelin Switch Upstream (PCIE SW.US)
00: 22 10 5d 14 07 04 10 00 00 00 04 06 10 00 81 00
10: 00 00 00 00 00 00 00 00 00 02 02 00 f1 01 00 00
20: e0 fc e0 fc f1 ff 01 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 50 00 00 00 00 00 00 00 ff 00 12 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 01 58 03 c8 00 00 00 00 10 a0 42 01 22 80 00 00
60: 1f 29 00 00 13 38 73 03 42 00 11 30 00 00 04 00
70: 00 00 40 01 18 00 01 00 00 00 00 00 bf 01 70 00
80: 06 00 00 00 0e 00 00 00 03 00 01 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 05 c0 81 00 00 00 e0 fe 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 0d c8 00 00 22 10 34 12 08 00 03 a8 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 4c 8a 05 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
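
In the standard PCI configuration header the Revision ID is the single byte at offset 0x08, and offsets 0x09 to 0x0b hold the class code. Read that way, the 06 in the first row of this dump is the base class (0x06, bridge) and the revision byte is 00; the dump is also for the bridge at 00:01.2 rather than the GPU at 0b:00.0. A minimal Python sketch that decodes the first row of an lspci -xx dump follows; the helper name parse_config_header is an illustrative assumption.

```python
def parse_config_header(first_row):
    """Decode a few standard header fields from the '00:' row of an lspci -xx dump."""
    offset, _, hexbytes = first_row.partition(":")
    if int(offset, 16) != 0:
        raise ValueError("expected the row that starts at offset 00")
    b = bytes.fromhex(hexbytes)  # whitespace between byte pairs is skipped
    return {
        "vendor": int.from_bytes(b[0x00:0x02], "little"),  # 16-bit little-endian
        "device": int.from_bytes(b[0x02:0x04], "little"),
        "revision": b[0x08],
        "prog_if": b[0x09],
        "subclass": b[0x0A],
        "base_class": b[0x0B],
        "header_type": b[0x0E],
    }

row = "00: 22 10 5d 14 07 04 10 00 00 00 04 06 10 00 81 00"
hdr = parse_config_header(row)
print(f"vendor={hdr['vendor']:04x} device={hdr['device']:04x} "
      f"rev={hdr['revision']:02x} class={hdr['base_class']:02x}{hdr['subclass']:02x}")
# vendor=1022 device=145d rev=00 class=0604
```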

-----Original Message-----
From: Deucher, Alexander <Alexander.Deucher@amd.com>
Sent: Tuesday, 24 November 2020 16:06
To: Merger, Edgar [AUTOSOL/MAS/AUGS] <Edgar.Merger@emerson.com>; Huang, Ray <Ray.Huang@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>
Cc: Will Deacon <will@kernel.org>; linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org; iommu@lists.linux-foundation.org; Bjorn Helgaas <bhelgaas@google.com>; Joerg Roedel <jroedel@suse.de>; Zhu, Changfeng <Changfeng.Zhu@amd.com>
Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

[AMD Public Use]

> -----Original Message-----
> From: Merger, Edgar [AUTOSOL/MAS/AUGS] <Edgar.Merger@emerson.com>
> Sent: Tuesday, November 24, 2020 2:29 AM
> To: Huang, Ray <Ray.Huang@amd.com>; Kuehling, Felix 
> <Felix.Kuehling@amd.com>
> Cc: Will Deacon <will@kernel.org>; Deucher, Alexander 
> <Alexander.Deucher@amd.com>; linux-kernel@vger.kernel.org; linux- 
> pci@vger.kernel.org; iommu@lists.linux-foundation.org; Bjorn Helgaas 
> <bhelgaas@google.com>; Joerg Roedel <jroedel@suse.de>; Zhu, Changfeng 
> <Changfeng.Zhu@amd.com>
> Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as 
> broken
> 
> Module Version : PiccasoCpu 10
> AGESA Version   : PiccasoPI 100A
> 
> I did not try to enter the system in any other way (like via ssh) than 
> via Desktop.

You can get this information from the amdgpu driver, e.g. sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info. Also, what is the PCI revision ID of your chip (from lspci)? And are you seeing this only on specific versions of the sbios?

Thanks,

Alex


> 
> -----Original Message-----
> From: Huang Rui <ray.huang@amd.com>
> Sent: Tuesday, 24 November 2020 07:43
> To: Kuehling, Felix <Felix.Kuehling@amd.com>
> Cc: Will Deacon <will@kernel.org>; Deucher, Alexander 
> <Alexander.Deucher@amd.com>; linux-kernel@vger.kernel.org; linux- 
> pci@vger.kernel.org; iommu@lists.linux-foundation.org; Bjorn Helgaas 
> <bhelgaas@google.com>; Merger, Edgar [AUTOSOL/MAS/AUGS] 
> <Edgar.Merger@emerson.com>; Joerg Roedel <jroedel@suse.de>; Changfeng 
> Zhu <changfeng.zhu@amd.com>
> Subject: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken
> 
> On Tue, Nov 24, 2020 at 06:51:11AM +0800, Kuehling, Felix wrote:
> > On 2020-11-23 5:33 p.m., Will Deacon wrote:
> > > On Mon, Nov 23, 2020 at 09:04:14PM +0000, Deucher, Alexander wrote:
> > >> [AMD Public Use]
> > >>
> > >>> -----Original Message-----
> > >>> From: Will Deacon <will@kernel.org>
> > >>> Sent: Monday, November 23, 2020 8:44 AM
> > >>> To: linux-kernel@vger.kernel.org
> > >>> Cc: linux-pci@vger.kernel.org; iommu@lists.linux-foundation.org; 
> > >>> Will Deacon <will@kernel.org>; Bjorn Helgaas 
> > >>> <bhelgaas@google.com>; Deucher, Alexander 
> > >>> <Alexander.Deucher@amd.com>; Edgar Merger 
> > >>> <Edgar.Merger@emerson.com>; Joerg Roedel <jroedel@suse.de>
> > >>> Subject: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken
> > >>>
> > >>> Edgar Merger reports that the AMD Raven GPU does not work 
> > >>> reliably on his system when the IOMMU is enabled:
> > >>>
> > >>>    | [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1, emitted seq=3
> > >>>    | [...]
> > >>>    | amdgpu 0000:0b:00.0: GPU reset begin!
> > >>>    | AMD-Vi: Completion-Wait loop timed out
> > >>>    | iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x38edc0970]
> > >>>
> > >>> This is indicative of a hardware/platform configuration issue
> > >>> so, since disabling ATS has been shown to resolve the problem,
> > >>> add a quirk to match this particular device while Edgar
> > >>> follows-up with AMD for more information.
> > >>>
> > >>> Cc: Bjorn Helgaas <bhelgaas@google.com>
> > >>> Cc: Alex Deucher <alexander.deucher@amd.com>
> > >>> Reported-by: Edgar Merger <Edgar.Merger@emerson.com>
> > >>> Suggested-by: Joerg Roedel <jroedel@suse.de>
> > >>> Link: https://lore.kernel.org/linux-iommu/MWHPR10MB1310F042A30661D4158520B589FC0@MWHPR10MB1310.namprd10.prod.outlook.com
> > >>> Signed-off-by: Will Deacon <will@kernel.org>
> > >>> ---
> > >>>
> > >>> Hi all,
> > >>>
> > >>> Since Joerg is away at the moment, I'm posting this to try to 
> > >>> make some progress with the thread in the Link: tag.
> > >> + Felix
> > >>
> > >> What system is this?  Can you provide more details?  Does a sbios 
> > >> update fix this?  Disabling ATS for all Ravens will break GPU 
> > >> compute for a lot of people.  I'd prefer to just black list this 
> > >> particular system (e.g., just SSIDs or revision) if possible.
> >
> > +Ray
> >
> > There are already many systems where the IOMMU is disabled in the 
> > BIOS, or the CRAT table reporting the APU compute capabilities is 
> > broken. Ray has been working on a fallback to make APUs behave like 
> > dGPUs on such systems. That should also cover this case where ATS is 
> > blacklisted. That said, it affects the programming model, because we 
> > don't support the unified and coherent memory model on dGPUs like we 
> > do on APUs with IOMMUv2. So it would be good to make the conditions 
> > for this workaround as narrow as possible.
> 
> Yes, besides the comments from Alex and Felix, may we get your 
> firmware version (SMC firmware which is from SBIOS) and device id?
> 
> > >>>    | [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1, emitted seq=3
> 
> It looks like only the gfx ib test passed, and it fails to launch the desktop, am I right?
> 
> We would like to see whether it is Raven, Raven kicker (new Raven), or 
> Picasso. On our side, per the internal test result, we didn't see a 
> similar issue on the Raven kicker and Picasso platforms.
> 
> Thanks,
> Ray
> 
> >
> > These are the relevant changes in KFD and Thunk for reference:
> >
> > ### KFD ###
> >
> > commit 914913ab04dfbcd0226ecb6bc99d276832ea2908
> > Author: Huang Rui <ray.huang@amd.com>
> > Date:   Tue Aug 18 14:54:23 2020 +0800
> >
> >      drm/amdkfd: implement the dGPU fallback path for apu (v6)
> >
> >      We still have a few iommu issues which need to address, so force raven
> >      as "dgpu" path for the moment.
> >
> >      This is to add the fallback path to bypass IOMMU if IOMMU v2 is disabled
> >      or ACPI CRAT table not correct.
> >
> >      v2: Use ignore_crat parameter to decide whether it will go with IOMMUv2.
> >      v3: Align with existed thunk, don't change the way of raven, only renoir
> >          will use "dgpu" path by default.
> >      v4: don't update global ignore_crat in the driver, and revise fallback
> >          function if CRAT is broken.
> >      v5: refine acpi crat good but no iommu support case, and rename the
> >          title.
> >      v6: fix the issue of dGPU initialized firstly, just modify the report
> >          value in the node_show().
> >
> >      Signed-off-by: Huang Rui <ray.huang@amd.com>
> >      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
> >      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> >
> > ### Thunk ###
> >
> > commit e32482fa4b9ca398c8bdc303920abfd672592764
> > Author: Huang Rui <ray.huang@amd.com>
> > Date:   Tue Aug 18 18:54:05 2020 +0800
> >
> >      libhsakmt: remove is_dgpu flag in the hsa_gfxip_table
> >
> >      Whether use dgpu path will check the props which exposed from kernel.
> >      We won't need hard code in the ASIC table.
> >
> >      Signed-off-by: Huang Rui <ray.huang@amd.com>
> >      Change-Id: I0c018a26b219914a41197ff36dbec7a75945d452
> >
> > commit 7c60f6d912034aa67ed27b47a29221422423f5cc
> > Author: Huang Rui <ray.huang@amd.com>
> > Date:   Thu Jul 30 10:22:23 2020 +0800
> >
> >      libhsakmt: implement the method that using flag which exposed by kfd to configure is_dgpu
> >
> >      KFD already implemented the fallback path for APU. Thunk will use flag
> >      which exposed by kfd to configure is_dgpu instead of hardcode before.
> >
> >      Signed-off-by: Huang Rui <ray.huang@amd.com>
> >      Change-Id: I445f6cf668f9484dd06cd9ae1bb3cfe7428ec7eb
> >
> > Regards,
> >    Felix
> >
> >
> > > Cheers, Alex. I'll have to defer to Edgar for the details, as my 
> > > understanding from the original thread over at:
> > >
> > >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__nam11.safelinks.p
> rotection.outlook.com_-3Furl-3Dhttps-253A-252F-252Fur&d=DwIFAw&c=jOURT
> kCZzT8tVB5xPEYIm3YJGoxoTaQsQPzPKJGaWbo&r=BJxhacqqa4K1PJGm6_-862rdSP13_
> P6LVp7j_9l1xmg&m=MMI_EgCqeOX4EvIftpL7agRxJ-udp1CLokf2QWuzFgE&s=IPZRolk
> y3TYlbWPsOkY37MbDdzwhc1b_LaE6JkaOkOo&e=
> > > ldefense.proofpoint.com%2Fv2%2Furl%3Fu%3Dhttps-
> 3A__lore.kernel.org&a
> > >
> mp;data=04%7C01%7CAlexander.Deucher%40amd.com%7C6d5fa241f963469
> 2c039
> > >
> 08d8904a942c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C63741
> 79972
> > >
> 72974427%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoi
> V2luMzI
> > >
> iLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=iKTPucGQqcRXET
> QZiQz
> > > j90WdJeCYDytdZHJ1ZiUyR%2FM%3D&amp;reserved=0
> > > _linux-2Diommu_MWHPR10MB1310CDB6829DDCF5EA84A14689150-
> 40MWHPR10MB131
> > >
> 0.namprd10.prod.outlook.com_&d=DwIDAw&c=jOURTkCZzT8tVB5xPEYIm3Y
> JGoxo
> > > TaQsQPzPKJGaWbo&r=BJxhacqqa4K1PJGm6_-
> 862rdSP13_P6LVp7j_9l1xmg&m=lNXu
> > >
> 2xwvyxEZ3PzoVmXMBXXS55jsmfDicuQFJqkIOH4&s=dsAVVJbD7gJIj3ctZpnnU
> 60y21
> > > ijWZmZ8xmOK1cO_O0&e=
> > >
> > > is that this is a board developed by his company.
> > >
> > > Edgar -- please can you answer Alex's questions?
> > >
> > > Will

WARNING: multiple messages have this Message-ID (diff)
From: "Merger, Edgar [AUTOSOL/MAS/AUGS]" <Edgar.Merger@emerson.com>
To: "Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"Huang, Ray" <Ray.Huang@amd.com>,
	"Kuehling, Felix" <Felix.Kuehling@amd.com>
Cc: Joerg Roedel <jroedel@suse.de>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"iommu@lists.linux-foundation.org"
	<iommu@lists.linux-foundation.org>,
	"Zhu, Changfeng" <Changfeng.Zhu@amd.com>,
	Bjorn Helgaas <bhelgaas@google.com>,
	Will Deacon <will@kernel.org>
Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken
Date: Wed, 25 Nov 2020 10:03:48 +0000	[thread overview]
Message-ID: <CY4PR10MB13022501A57CC02FF5BC632B89FA0@CY4PR10MB1302.namprd10.prod.outlook.com> (raw)
In-Reply-To: <CY4PR10MB13029B38D31936622E4CA62389FA0@CY4PR10MB1302.namprd10.prod.outlook.com>

I do have also other problems with this unit, when IOMMU is enabled and pci=noats is not set as kernel parameter.

[ 2004.265906] amdgpu 0000:0b:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx (-110).
[ 2004.266024] [drm:amdgpu_device_delayed_init_work_handler [amdgpu]] *ERROR* ib ring test failed (-110).

-----Original Message-----
From: Merger, Edgar [AUTOSOL/MAS/AUGS] 
Sent: Mittwoch, 25. November 2020 10:16
To: 'Deucher, Alexander' <Alexander.Deucher@amd.com>; 'Huang, Ray' <Ray.Huang@amd.com>; 'Kuehling, Felix' <Felix.Kuehling@amd.com>
Cc: 'Will Deacon' <will@kernel.org>; 'linux-kernel@vger.kernel.org' <linux-kernel@vger.kernel.org>; 'linux-pci@vger.kernel.org' <linux-pci@vger.kernel.org>; 'iommu@lists.linux-foundation.org' <iommu@lists.linux-foundation.org>; 'Bjorn Helgaas' <bhelgaas@google.com>; 'Joerg Roedel' <jroedel@suse.de>; 'Zhu, Changfeng' <Changfeng.Zhu@amd.com>
Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

Remark: 

Systems with R1305G APU (which show the issue) have the following VGA-Controller:
0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Picasso (rev cf)

Systems with V1404I APU (which do not show the issue) have the following VGA-Controller:
0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series] (rev 83)

"rev cf" vs. "ref 83" is probably what you where referring to with PCI Revision ID.

-----Original Message-----
From: Merger, Edgar [AUTOSOL/MAS/AUGS]
Sent: Mittwoch, 25. November 2020 07:05
To: 'Deucher, Alexander' <Alexander.Deucher@amd.com>; Huang, Ray <Ray.Huang@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>
Cc: Will Deacon <will@kernel.org>; linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org; iommu@lists.linux-foundation.org; Bjorn Helgaas <bhelgaas@google.com>; Joerg Roedel <jroedel@suse.de>; Zhu, Changfeng <Changfeng.Zhu@amd.com>
Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

I see that problem only on systems that use a R1305G APU

sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info

shows

VCE feature version: 0, firmware version: 0x00000000 UVD feature version: 0, firmware version: 0x00000000 MC feature version: 0, firmware version: 0x00000000 ME feature version: 50, firmware version: 0x000000a3 PFP feature version: 50, firmware version: 0x000000bb CE feature version: 50, firmware version: 0x0000004f RLC feature version: 1, firmware version: 0x00000049 RLC SRLC feature version: 1, firmware version: 0x00000001 RLC SRLG feature version: 1, firmware version: 0x00000001 RLC SRLS feature version: 1, firmware version: 0x00000001 MEC feature version: 50, firmware version: 0x000001b5
MEC2 feature version: 50, firmware version: 0x000001b5 SOS feature version: 0, firmware version: 0x00000000 ASD feature version: 0, firmware version: 0x21000030 TA XGMI feature version: 0, firmware version: 0x00000000 TA RAS feature version: 0, firmware version: 0x00000000 SMC feature version: 0, firmware version: 0x00002527
SDMA0 feature version: 41, firmware version: 0x000000a9 VCN feature version: 0, firmware version: 0x0110901c DMCU feature version: 0, firmware version: 0x00000001 VBIOS version: 113-RAVEN2-117

We are also using V1404I APU on the same boards and I haven´t seen the issue on those boards

These boards give me slightly different info: sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
 
VCE feature version: 0, firmware version: 0x00000000 UVD feature version: 0, firmware version: 0x00000000 MC feature version: 0, firmware version: 0x00000000 ME feature version: 47, firmware version: 0x000000a2 PFP feature version: 47, firmware version: 0x000000b9 CE feature version: 47, firmware version: 0x0000004e RLC feature version: 1, firmware version: 0x00000213 RLC SRLC feature version: 1, firmware version: 0x00000001 RLC SRLG feature version: 1, firmware version: 0x00000001 RLC SRLS feature version: 1, firmware version: 0x00000001 MEC feature version: 47, firmware version: 0x000001ab
MEC2 feature version: 47, firmware version: 0x000001ab SOS feature version: 0, firmware version: 0x00000000 ASD feature version: 0, firmware version: 0x21000013 TA XGMI feature version: 0, firmware version: 0x00000000 TA RAS feature version: 0, firmware version: 0x00000000 SMC feature version: 0, firmware version: 0x00001e5b
SDMA0 feature version: 41, firmware version: 0x000000a9 VCN feature version: 0, firmware version: 0x0110901c DMCU feature version: 0, firmware version: 0x00000000 VBIOS version: 113-RAVEN-116




00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 IOMMU
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 PCIe GPP Bridge [6:0]
00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Zeppelin Switch Upstream (PCIE SW.US)
00:01.4 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 PCIe GPP Bridge [6:0]
00:01.5 PCI bridge: Advanced Micro Devices, Inc. [AMD] Zeppelin Switch Upstream (PCIE SW.US)
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Internal PCIe GPP Bridge 0 to Bus A
00:08.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Internal PCIe GPP Bridge 0 to Bus B
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 61)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 7
01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0e)
01:00.1 Serial controller: Realtek Semiconductor Co., Ltd. Device 816a (rev 0e)
01:00.2 Serial controller: Realtek Semiconductor Co., Ltd. Device 816b (rev 0e)
01:00.3 IPMI Interface: Realtek Semiconductor Co., Ltd. Device 816c (rev 0e)
01:00.4 USB controller: Realtek Semiconductor Co., Ltd. Device 816d (rev 0e)
02:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
03:00.0 PCI bridge: Pericom Semiconductor PI7C9X2G608GP PCIe2 6-Port/8-Lane Packet Switch
04:01.0 PCI bridge: Pericom Semiconductor PI7C9X2G608GP PCIe2 6-Port/8-Lane Packet Switch
04:02.0 PCI bridge: Pericom Semiconductor PI7C9X2G608GP PCIe2 6-Port/8-Lane Packet Switch
04:03.0 PCI bridge: Pericom Semiconductor PI7C9X2G608GP PCIe2 6-Port/8-Lane Packet Switch
04:04.0 PCI bridge: Pericom Semiconductor PI7C9X2G608GP PCIe2 6-Port/8-Lane Packet Switch
04:05.0 PCI bridge: Pericom Semiconductor PI7C9X2G608GP PCIe2 6-Port/8-Lane Packet Switch
06:00.0 Serial controller: Asix Electronics Corporation Device 9100
06:00.1 Serial controller: Asix Electronics Corporation Device 9100
07:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
0a:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Picasso (rev cf)
0b:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Raven/Raven2/Fenghuang HDMI/DP Audio Controller
0b:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor
0b:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Raven2 USB 3.1
0b:00.5 Multimedia controller: Advanced Micro Devices, Inc. [AMD] Raven/Raven2/FireFlight/Renoir Audio Processor
0b:00.7 Non-VGA unclassified device: Advanced Micro Devices, Inc. [AMD] Raven/Raven2/Renoir Non-Sensor Fusion Hub KMDF driver
0c:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 61)

PCI Revision ID is 06 I believe. Got that from this lspci -xx

00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Zeppelin Switch Upstream (PCIE SW.US)
00: 22 10 5d 14 07 04 10 00 00 00 04 06 10 00 81 00
10: 00 00 00 00 00 00 00 00 00 02 02 00 f1 01 00 00
20: e0 fc e0 fc f1 ff 01 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 50 00 00 00 00 00 00 00 ff 00 12 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 01 58 03 c8 00 00 00 00 10 a0 42 01 22 80 00 00
60: 1f 29 00 00 13 38 73 03 42 00 11 30 00 00 04 00
70: 00 00 40 01 18 00 01 00 00 00 00 00 bf 01 70 00
80: 06 00 00 00 0e 00 00 00 03 00 01 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 05 c0 81 00 00 00 e0 fe 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 0d c8 00 00 22 10 34 12 08 00 03 a8 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 4c 8a 05 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

-----Original Message-----
From: Deucher, Alexander <Alexander.Deucher@amd.com>
Sent: Dienstag, 24. November 2020 16:06
To: Merger, Edgar [AUTOSOL/MAS/AUGS] <Edgar.Merger@emerson.com>; Huang, Ray <Ray.Huang@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>
Cc: Will Deacon <will@kernel.org>; linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org; iommu@lists.linux-foundation.org; Bjorn Helgaas <bhelgaas@google.com>; Joerg Roedel <jroedel@suse.de>; Zhu, Changfeng <Changfeng.Zhu@amd.com>
Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

[AMD Public Use]

> -----Original Message-----
> From: Merger, Edgar [AUTOSOL/MAS/AUGS] <Edgar.Merger@emerson.com>
> Sent: Tuesday, November 24, 2020 2:29 AM
> To: Huang, Ray <Ray.Huang@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>
> Cc: Will Deacon <will@kernel.org>; Deucher, Alexander <Alexander.Deucher@amd.com>; linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org; iommu@lists.linux-foundation.org; Bjorn Helgaas <bhelgaas@google.com>; Joerg Roedel <jroedel@suse.de>; Zhu, Changfeng <Changfeng.Zhu@amd.com>
> Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken
> 
> Module Version : PiccasoCpu 10
> AGESA Version   : PiccasoPI 100A
> 
> I did not try to enter the system in any other way (like via ssh) than 
> via Desktop.

You can get this information from the amdgpu driver.  E.g., sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info .  Also what is the PCI revision id of your chip (from lspci)?  Also are you just seeing this on specific versions of the sbios?

Thanks,

Alex
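
The PCI revision ID asked about above, along with the subsystem IDs that a narrower quirk could match on, can also be read from sysfs without parsing lspci output. The following is a minimal Python sketch under the assumption that the kernel exposes the usual `vendor`, `device`, `subsystem_vendor`, `subsystem_device` and `revision` attribute files for each PCI function; the function name `read_pci_ids` and the `root` parameter are illustrative only.

```python
from pathlib import Path

PCI_ATTRS = ("vendor", "device", "subsystem_vendor", "subsystem_device", "revision")


def read_pci_ids(bdf: str, root: str = "/sys/bus/pci/devices") -> dict:
    """Read the identification attributes sysfs exposes for one PCI function."""
    dev = Path(root) / bdf
    # Each attribute file holds a single hex string such as "0x1002\n".
    return {name: int((dev / name).read_text().strip(), 16) for name in PCI_ATTRS}


if __name__ == "__main__":
    bdf = "0000:0b:00.0"  # the iGPU address that appears in the logs in this thread
    if (Path("/sys/bus/pci/devices") / bdf).is_dir():
        for name, value in read_pci_ids(bdf).items():
            print(f"{name}: {value:#06x}")
    else:
        print(f"{bdf} is not present on this machine")
```

The `root` parameter exists only so the function can be pointed at a copy of the sysfs tree, for example one collected from an affected board.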


> 
> -----Original Message-----
> From: Huang Rui <ray.huang@amd.com>
> Sent: Tuesday, 24 November 2020 07:43
> To: Kuehling, Felix <Felix.Kuehling@amd.com>
> Cc: Will Deacon <will@kernel.org>; Deucher, Alexander <Alexander.Deucher@amd.com>; linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org; iommu@lists.linux-foundation.org; Bjorn Helgaas <bhelgaas@google.com>; Merger, Edgar [AUTOSOL/MAS/AUGS] <Edgar.Merger@emerson.com>; Joerg Roedel <jroedel@suse.de>; Changfeng Zhu <changfeng.zhu@amd.com>
> Subject: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken
> 
> On Tue, Nov 24, 2020 at 06:51:11AM +0800, Kuehling, Felix wrote:
> > On 2020-11-23 5:33 p.m., Will Deacon wrote:
> > > On Mon, Nov 23, 2020 at 09:04:14PM +0000, Deucher, Alexander wrote:
> > >> [AMD Public Use]
> > >>
> > >>> -----Original Message-----
> > >>> From: Will Deacon <will@kernel.org>
> > >>> Sent: Monday, November 23, 2020 8:44 AM
> > >>> To: linux-kernel@vger.kernel.org
> > >>> Cc: linux-pci@vger.kernel.org; iommu@lists.linux-foundation.org; 
> > >>> Will Deacon <will@kernel.org>; Bjorn Helgaas 
> > >>> <bhelgaas@google.com>; Deucher, Alexander 
> > >>> <Alexander.Deucher@amd.com>; Edgar Merger 
> > >>> <Edgar.Merger@emerson.com>; Joerg Roedel <jroedel@suse.de>
> > >>> Subject: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken
> > >>>
> > >>> Edgar Merger reports that the AMD Raven GPU does not work 
> > >>> reliably on his system when the IOMMU is enabled:
> > >>>
> > > >>>    | [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1, emitted seq=3
> > > >>>    | [...]
> > > >>>    | amdgpu 0000:0b:00.0: GPU reset begin!
> > > >>>    | AMD-Vi: Completion-Wait loop timed out
> > > >>>    | iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x38edc0970]
> > >>>
> > >>> This is indicative of a hardware/platform configuration issue 
> > >>> so, since disabling ATS has been shown to resolve the problem, 
> > >>> add a quirk to match this particular device while Edgar 
> > > >>> follows-up with AMD for more information.
> > >>>
> > >>> Cc: Bjorn Helgaas <bhelgaas@google.com>
> > >>> Cc: Alex Deucher <alexander.deucher@amd.com>
> > >>> Reported-by: Edgar Merger <Edgar.Merger@emerson.com>
> > >>> Suggested-by: Joerg Roedel <jroedel@suse.de>
> > > >>> Link: https://lore.kernel.org/linux-iommu/MWHPR10MB1310F042A30661D4158520B589FC0@MWHPR10MB1310.namprd10.prod.outlook.com
> > >>> Signed-off-by: Will Deacon <will@kernel.org>
> > >>> ---
> > >>>
> > >>> Hi all,
> > >>>
> > >>> Since Joerg is away at the moment, I'm posting this to try to 
> > >>> make some progress with the thread in the Link: tag.
> > >> + Felix
> > >>
> > >> What system is this?  Can you provide more details?  Does a sbios 
> > >> update fix this?  Disabling ATS for all Ravens will break GPU 
> > >> compute for a lot of people.  I'd prefer to just black list this 
> > >> particular system (e.g., just SSIDs or revision) if possible.
> >
> > +Ray
> >
> > There are already many systems where the IOMMU is disabled in the 
> > BIOS, or the CRAT table reporting the APU compute capabilities is 
> > broken. Ray has been working on a fallback to make APUs behave like 
> > dGPUs on such systems. That should also cover this case where ATS is 
> > blacklisted. That said, it affects the programming model, because we 
> > don't support the unified and coherent memory model on dGPUs like we 
> > do on APUs with IOMMUv2. So it would be good to make the conditions 
> > for this workaround as narrow as possible.
> 
> Yes, besides the comments from Alex and Felix, may we get your 
> firmware version (SMC firmware which is from SBIOS) and device id?
> 
> > > >>>    | [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1, emitted seq=3
> 
> It looks like only the gfx IB test passed and the desktop fails to launch, am I right?
> 
> We would like to see whether it is Raven, Raven kicker (new Raven), or 
> Picasso. On our side, per the internal test results, we didn't see a 
> similar issue on the Raven kicker and Picasso platforms.
> 
> Thanks,
> Ray
> 
> >
> > These are the relevant changes in KFD and Thunk for reference:
> >
> > ### KFD ###
> >
> > commit 914913ab04dfbcd0226ecb6bc99d276832ea2908
> > Author: Huang Rui <ray.huang@amd.com>
> > Date:   Tue Aug 18 14:54:23 2020 +0800
> >
> >      drm/amdkfd: implement the dGPU fallback path for apu (v6)
> >
> >      We still have a few iommu issues which need to address, so force raven
> >      as "dgpu" path for the moment.
> >
> >      This is to add the fallback path to bypass IOMMU if IOMMU v2 is disabled
> >      or ACPI CRAT table not correct.
> >
> >      v2: Use ignore_crat parameter to decide whether it will go with IOMMUv2.
> >      v3: Align with existed thunk, don't change the way of raven, only renoir
> >          will use "dgpu" path by default.
> >      v4: don't update global ignore_crat in the driver, and revise fallback
> >          function if CRAT is broken.
> >      v5: refine acpi crat good but no iommu support case, and rename the
> >          title.
> >      v6: fix the issue of dGPU initialized firstly, just modify the report
> >          value in the node_show().
> >
> >      Signed-off-by: Huang Rui <ray.huang@amd.com>
> >      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
> >      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> >
> > ### Thunk ###
> >
> > commit e32482fa4b9ca398c8bdc303920abfd672592764
> > Author: Huang Rui <ray.huang@amd.com>
> > Date:   Tue Aug 18 18:54:05 2020 +0800
> >
> >      libhsakmt: remove is_dgpu flag in the hsa_gfxip_table
> >
> >      Whether use dgpu path will check the props which exposed from kernel.
> >      We won't need hard code in the ASIC table.
> >
> >      Signed-off-by: Huang Rui <ray.huang@amd.com>
> >      Change-Id: I0c018a26b219914a41197ff36dbec7a75945d452
> >
> > commit 7c60f6d912034aa67ed27b47a29221422423f5cc
> > Author: Huang Rui <ray.huang@amd.com>
> > Date:   Thu Jul 30 10:22:23 2020 +0800
> >
> >      libhsakmt: implement the method that using flag which exposed by kfd to configure is_dgpu
> >
> >      KFD already implemented the fallback path for APU. Thunk will use flag
> >      which exposed by kfd to configure is_dgpu instead of hardcode before.
> >
> >      Signed-off-by: Huang Rui <ray.huang@amd.com>
> >      Change-Id: I445f6cf668f9484dd06cd9ae1bb3cfe7428ec7eb
> >
> > Regards,
> >    Felix
> >
> >
> > > Cheers, Alex. I'll have to defer to Edgar for the details, as my 
> > > understanding from the original thread over at:
> > >
> > >
> > > > https://lore.kernel.org/linux-iommu/MWHPR10MB1310CDB6829DDCF5EA84A14689150@MWHPR10MB1310.namprd10.prod.outlook.com/
> > >
> > > is that this is a board developed by his company.
> > >
> > > Edgar -- please can you answer Alex's questions?
> > >
> > > Will
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Thread overview:
2020-11-23 13:44 [PATCH] PCI: Mark AMD Raven iGPU ATS as broken Will Deacon
2020-11-23 21:04 ` Deucher, Alexander
2020-11-23 22:33   ` Will Deacon
2020-11-23 22:51     ` Felix Kuehling
2020-11-24  6:43       ` Huang Rui
2020-11-24  7:28         ` [EXTERNAL] " Merger, Edgar [AUTOSOL/MAS/AUGS]
2020-11-24 15:05           ` Deucher, Alexander
2020-11-25  6:05             ` Merger, Edgar [AUTOSOL/MAS/AUGS]
2020-11-25  9:16               ` Merger, Edgar [AUTOSOL/MAS/AUGS]
2020-11-25 10:03                 ` Merger, Edgar [AUTOSOL/MAS/AUGS] [this message]
2020-11-25 16:07                   ` Deucher, Alexander
2020-11-26  9:24                     ` Merger, Edgar [AUTOSOL/MAS/AUGS]
2020-11-30 18:36                       ` Deucher, Alexander
2020-12-07  4:53                         ` Merger, Edgar [AUTOSOL/MAS/AUGS]
2020-12-08  8:23                           ` Merger, Edgar [AUTOSOL/MAS/AUGS]
2020-12-09  7:59                             ` Merger, Edgar [AUTOSOL/MAS/AUGS]
2020-12-09 14:23                               ` Deucher, Alexander
2020-12-10 10:48                                 ` Merger, Edgar [AUTOSOL/MAS/AUGS]
2020-12-10 15:36                                   ` Deucher, Alexander
2020-12-10 16:25                                     ` Bjorn Helgaas
2020-11-24  5:32     ` Merger, Edgar [AUTOSOL/MAS/AUGS]
