Re: [Bug 216373] New: Uncorrected errors reported for AMD GPU

linux-pci.vger.kernel.org archive mirror

* Re: [Bug 216373] New: Uncorrected errors reported for AMD GPU
       [not found] <bug-216373-41252@https.bugzilla.kernel.org/>
@ 2022-08-18 20:38 ` Bjorn Helgaas
  2022-08-19  7:05   ` Christian König
                     ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Bjorn Helgaas @ 2022-08-18 20:38 UTC (permalink / raw)
  To: Alex Deucher, Christian König, Xinhui Pan
  Cc: David Airlie, Daniel Vetter, Tom Seewald, Stefan Roese,
	Kai-Heng Feng, regressions, linux-pci, amd-gfx

[Adding amdgpu folks]

On Wed, Aug 17, 2022 at 11:45:15PM +0000, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=216373
> 
>             Bug ID: 216373
>            Summary: Uncorrected errors reported for AMD GPU
>     Kernel Version: v6.0-rc1
>         Regression: No
> ...

I marked this as a regression in bugzilla.

> Hardware:
> CPU: Intel i7-12700K (Alder Lake)
> GPU: AMD RX 6700 XT [1002:73df]
> Motherboard: ASUS Prime Z690-A
> 
> Problem:
> After upgrading to v6.0-rc1 the kernel is now reporting uncorrected PCI errors
> for my GPU.

Thank you very much for the report and for taking the trouble to
bisect it and test Kai-Heng's patch!

I suspect that booting with "pci=noaer" should be a temporary
workaround for this issue.  If it works, can you add that to the
bugzilla for anybody else who trips over this?

> I have bisected this issue to: [8795e182b02dc87e343c79e73af6b8b7f9c5e635]
> PCI/portdrv: Don't disable AER reporting in get_port_device_capability()
> Reverting that commit causes the errors to cease.

I suspect the errors still occur, but we just don't notice and log
them.

> I have also tried Kai-Heng Feng's patch[1] which seems to resolve a similar
> problem, but it did not fix my issue.
> 
> [1]
> https://lore.kernel.org/linux-pci/20220706123244.18056-1-kai.heng.feng@canonical.com/
>
> dmesg snippet:
> 
> pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received:
> 0000:03:00.0
> amdgpu 0000:03:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal),
> type=Transaction Layer, (Requester ID)
> amdgpu 0000:03:00.0:   device [1002:73df] error status/mask=00100000/00000000
> amdgpu 0000:03:00.0:    [20] UnsupReq               (First)
> amdgpu 0000:03:00.0: AER:   TLP Header: 40000001 0000000f 95e7f000 00000000

I think the TLP header decodes to:

  0x40000001 = 0100 0000 ... 0000 0001 binary
  0x0000000f = 0000 0000 ... 0000 1111 binary

  Fmt           010b                 3 DW header with data
  Type          0000b  010 0 0000    MWr Memory Write Request
  Length        00 0000 0001b        1 DW
  Requester ID  0x0000               00:00.0
  Tag           0x00
  Last DW BE    0000b                must be zero for 1 DW write
  First DW BE   1111b                all 4 bytes in DW enabled
  Address       0x95e7f000
  Data          0x00000000

So I think this is a 32-bit write of zero to PCI bus address
0x95e7f000.
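
For anyone who wants to repeat the decode on another AER log, here is a
minimal sketch of the same field extraction in C.  The struct and the
helper name tlp_decode() are made up for illustration and are not kernel
APIs; the bit positions are the ones used in the table above for a 3 DW
memory request header.

```c
#include <stdint.h>

/* Fields of a 3 DW memory request header, as in the decode above. */
struct tlp_hdr {
	uint8_t  fmt;		/* DW0 bits 31:29 */
	uint8_t  type;		/* DW0 bits 28:24 */
	uint16_t length;	/* DW0 bits 9:0, raw length field in DWs */
	uint16_t requester_id;	/* DW1 bits 31:16 */
	uint8_t  tag;		/* DW1 bits 15:8 */
	uint8_t  last_be;	/* DW1 bits 7:4 */
	uint8_t  first_be;	/* DW1 bits 3:0 */
	uint32_t address;	/* DW2 bits 31:2; low two bits masked off */
};

/* Hypothetical helper: split the first three logged DWs into fields. */
static struct tlp_hdr tlp_decode(const uint32_t dw[3])
{
	struct tlp_hdr h;

	h.fmt          = (dw[0] >> 29) & 0x7;
	h.type         = (dw[0] >> 24) & 0x1f;
	h.length       = dw[0] & 0x3ff;
	h.requester_id = (dw[1] >> 16) & 0xffff;
	h.tag          = (dw[1] >> 8) & 0xff;
	h.last_be      = (dw[1] >> 4) & 0xf;
	h.first_be     = dw[1] & 0xf;
	h.address      = dw[2] & ~0x3u;

	return h;
}
```

Feeding it the three DWs 40000001 0000000f 95e7f000 from the log gives
Fmt 010b, Type 00000b, Length 1, First DW BE 1111b and address
0x95e7f000, which matches the table.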

Your dmesg log says:

  pci 0000:02:00.0: PCI bridge to [bus 03]
  pci 0000:02:00.0:   bridge window [mem 0x95e00000-0x95ffffff]
  pci 0000:03:00.0: reg 0x24: [mem 0x95e00000-0x95efffff]
  [drm] register mmio base: 0x95E00000

So this looks like a write to the device's BAR 5.  I don't see a PCI
reason why this should fail.  Maybe there's some amdgpu reason?
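
The address check above can be written down as a small sketch.  The
struct and function names are hypothetical, chosen only for this
example; the ranges are the inclusive [mem start-end] ranges printed in
dmesg, and "reg 0x24" maps to BAR 5 because BAR 0 sits at config offset
0x10 and each BAR register is 4 bytes wide.

```c
#include <stdbool.h>
#include <stdint.h>

/* An inclusive address range, as printed in dmesg: [mem start-end]. */
struct mem_range {
	uint64_t start;
	uint64_t end;
};

static bool range_contains(struct mem_range r, uint64_t addr)
{
	return addr >= r.start && addr <= r.end;
}

/* True if addr lies in both the bridge window and the device BAR. */
static bool write_hits_bar(struct mem_range window, struct mem_range bar,
			   uint64_t addr)
{
	return range_contains(window, addr) && range_contains(bar, addr);
}

/* Config space offset of a BAR register -> BAR number (BAR 0 is at 0x10). */
static int bar_index(unsigned int cfg_reg)
{
	return (int)((cfg_reg - 0x10) / 4);
}
```

With the window 0x95e00000-0x95ffffff and the BAR 0x95e00000-0x95efffff,
the address 0x95e7f000 passes both checks, so from a pure address
decoding point of view the write is routed to the GPU.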

Bjorn

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Bug 216373] New: Uncorrected errors reported for AMD GPU
  2022-08-18 20:38 ` [Bug 216373] New: Uncorrected errors reported for AMD GPU Bjorn Helgaas
@ 2022-08-19  7:05   ` Christian König
  2022-08-19  8:33     ` Lazar, Lijo
  2022-08-19 17:13   ` Bjorn Helgaas
  2022-08-23 16:01   ` [Bug 216373] New: Uncorrected errors reported for AMD GPU #forregzbot Thorsten Leemhuis
  2 siblings, 1 reply; 18+ messages in thread
From: Christian König @ 2022-08-19  7:05 UTC (permalink / raw)
  To: Bjorn Helgaas, Alex Deucher, Xinhui Pan
  Cc: David Airlie, Daniel Vetter, Tom Seewald, Stefan Roese,
	Kai-Heng Feng, regressions, linux-pci, amd-gfx

Hi Bjorn,

Am 18.08.22 um 22:38 schrieb Bjorn Helgaas:
> [Adding amdgpu folks]
>
> On Wed, Aug 17, 2022 at 11:45:15PM +0000, bugzilla-daemon@kernel.org wrote:
>> https://bugzilla.kernel.org/show_bug.cgi?id=216373
>>
>>              Bug ID: 216373
>>             Summary: Uncorrected errors reported for AMD GPU
>>      Kernel Version: v6.0-rc1
>>          Regression: No
>> ...
> I marked this as a regression in bugzilla.
>
>> Hardware:
>> CPU: Intel i7-12700K (Alder Lake)
>> GPU: AMD RX 6700 XT [1002:73df]
>> Motherboard: ASUS Prime Z690-A
>>
>> Problem:
>> After upgrading to v6.0-rc1 the kernel is now reporting uncorrected PCI errors
>> for my GPU.
> Thank you very much for the report and for taking the trouble to
> bisect it and test Kai-Heng's patch!
>
> I suspect that booting with "pci=noaer" should be a temporary
> workaround for this issue.  If it works, can you add that to the
> bugzilla for anybody else who trips over this?
>
>> I have bisected this issue to: [8795e182b02dc87e343c79e73af6b8b7f9c5e635]
>> PCI/portdrv: Don't disable AER reporting in get_port_device_capability()
>> Reverting that commit causes the errors to cease.
> I suspect the errors still occur, but we just don't notice and log
> them.
>
>> I have also tried Kai-Heng Feng's patch[1] which seems to resolve a similar
>> problem, but it did not fix my issue.
>>
>> [1]
>> https://lore.kernel.org/linux-pci/20220706123244.18056-1-kai.heng.feng@canonical.com/
>>
>> dmesg snippet:
>>
>> pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received:
>> 0000:03:00.0
>> amdgpu 0000:03:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal),
>> type=Transaction Layer, (Requester ID)
>> amdgpu 0000:03:00.0:   device [1002:73df] error status/mask=00100000/00000000
>> amdgpu 0000:03:00.0:    [20] UnsupReq               (First)
>> amdgpu 0000:03:00.0: AER:   TLP Header: 40000001 0000000f 95e7f000 00000000
> I think the TLP header decodes to:
>
>    0x40000001 = 0100 0000 ... 0000 0001 binary
>    0x0000000f = 0000 0000 ... 0000 1111 binary
>
>    Fmt           010b                 3 DW header with data
>    Type          0000b  010 0 0000    MWr Memory Write Request
>    Length        00 0000 0001b        1 DW
>    Requester ID  0x0000               00:00.0
>    Tag           0x00
>    Last DW BE    0000b                must be zero for 1 DW write
>    First DW BE   1111b                all 4 bytes in DW enabled
>    Address       0x95e7f000
>    Data          0x00000000
>
> So I think this is a 32-bit write of zero to PCI bus address
> 0x95e7f000.
>
> Your dmesg log says:
>
>    pci 0000:02:00.0: PCI bridge to [bus 03]
>    pci 0000:02:00.0:   bridge window [mem 0x95e00000-0x95ffffff]
>    pci 0000:03:00.0: reg 0x24: [mem 0x95e00000-0x95efffff]
>    [drm] register mmio base: 0x95E00000
>
> So this looks like a write to the device's BAR 5.  I don't see a PCI
> reason why this should fail.  Maybe there's some amdgpu reason?

Well, I have seen a couple of boards where stuff like that happened, but
in my experience there was always some hardware problem behind it.

From my understanding, what essentially happens is that a write doesn't
make it to the device (e.g. because transmission errors can't be
corrected).

It's quite likely that the write is then either dropped and doesn't
matter that much (just clearing the framebuffer, for example), or
repeated, and because of this everything still seems to work fine.

Either way, I suggest trying this with some other hardware
configuration, e.g. put the GPU in another system and see if it still
gives the same issues, or put another GPU into this system.

Regards,
Christian.


>
> Bjorn



* Re: [Bug 216373] New: Uncorrected errors reported for AMD GPU
  2022-08-19  7:05   ` Christian König
@ 2022-08-19  8:33     ` Lazar, Lijo
  2022-08-19 11:04       ` Bjorn Helgaas
  0 siblings, 1 reply; 18+ messages in thread
From: Lazar, Lijo @ 2022-08-19  8:33 UTC (permalink / raw)
  To: Christian König, Bjorn Helgaas, Alex Deucher, Xinhui Pan
  Cc: regressions, David Airlie, linux-pci, amd-gfx, Tom Seewald,
	Kai-Heng Feng, Daniel Vetter, Stefan Roese



On 8/19/2022 12:35 PM, Christian König wrote:
> Hi Bjorn,
> 
> Am 18.08.22 um 22:38 schrieb Bjorn Helgaas:
>> [Adding amdgpu folks]
>>
>> On Wed, Aug 17, 2022 at 11:45:15PM +0000, bugzilla-daemon@kernel.org 
>> wrote:
>>> https://bugzilla.kernel.org/show_bug.cgi?id=216373
>>>
>>>
>>>              Bug ID: 216373
>>>             Summary: Uncorrected errors reported for AMD GPU
>>>      Kernel Version: v6.0-rc1
>>>          Regression: No
>>> ...
>> I marked this as a regression in bugzilla.
>>
>>> Hardware:
>>> CPU: Intel i7-12700K (Alder Lake)
>>> GPU: AMD RX 6700 XT [1002:73df]
>>> Motherboard: ASUS Prime Z690-A
>>>
>>> Problem:
>>> After upgrading to v6.0-rc1 the kernel is now reporting uncorrected 
>>> PCI errors
>>> for my GPU.
>> Thank you very much for the report and for taking the trouble to
>> bisect it and test Kai-Heng's patch!
>>
>> I suspect that booting with "pci=noaer" should be a temporary
>> workaround for this issue.  If it works, can you add that to the
>> bugzilla for anybody else who trips over this?
>>
>>> I have bisected this issue to: 
>>> [8795e182b02dc87e343c79e73af6b8b7f9c5e635]
>>> PCI/portdrv: Don't disable AER reporting in get_port_device_capability()
>>> Reverting that commit causes the errors to cease.
>> I suspect the errors still occur, but we just don't notice and log
>> them.
>>
>>> I have also tried Kai-Heng Feng's patch[1] which seems to resolve a 
>>> similar
>>> problem, but it did not fix my issue.
>>>
>>> [1]
>>> https://lore.kernel.org/linux-pci/20220706123244.18056-1-kai.heng.feng@canonical.com/
>>>
>>>
>>> dmesg snippet:
>>>
>>> pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error 
>>> received:
>>> 0000:03:00.0
>>> amdgpu 0000:03:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal),
>>> type=Transaction Layer, (Requester ID)
>>> amdgpu 0000:03:00.0:   device [1002:73df] error 
>>> status/mask=00100000/00000000
>>> amdgpu 0000:03:00.0:    [20] UnsupReq               (First)
>>> amdgpu 0000:03:00.0: AER:   TLP Header: 40000001 0000000f 95e7f000 
>>> 00000000
>> I think the TLP header decodes to:
>>
>>    0x40000001 = 0100 0000 ... 0000 0001 binary
>>    0x0000000f = 0000 0000 ... 0000 1111 binary
>>
>>    Fmt           010b                 3 DW header with data
>>    Type          0000b  010 0 0000    MWr Memory Write Request
>>    Length        00 0000 0001b        1 DW
>>    Requester ID  0x0000               00:00.0
>>    Tag           0x00
>>    Last DW BE    0000b                must be zero for 1 DW write
>>    First DW BE   1111b                all 4 bytes in DW enabled
>>    Address       0x95e7f000
>>    Data          0x00000000
>>
>> So I think this is a 32-bit write of zero to PCI bus address
>> 0x95e7f000.
>>
>> Your dmesg log says:
>>
>>    pci 0000:02:00.0: PCI bridge to [bus 03]
>>    pci 0000:02:00.0:   bridge window [mem 0x95e00000-0x95ffffff]
>>    pci 0000:03:00.0: reg 0x24: [mem 0x95e00000-0x95efffff]
>>    [drm] register mmio base: 0x95E00000
>>
>> So this looks like a write to the device's BAR 5.  I don't see a PCI
>> reason why this should fail.  Maybe there's some amdgpu reason?
> 
> Well, I have seen a couple of boards where stuff like that happened, but
> in my experience there was always some hardware problem behind it.
> 
> From my understanding, what essentially happens is that a write doesn't
> make it to the device (e.g. because transmission errors can't be
> corrected).
> 
> It's quite likely that the write is then either dropped and doesn't
> matter that much (just clearing the framebuffer, for example), or
> repeated, and because of this everything still seems to work fine.
> 
> Either way, I suggest trying this with some other hardware
> configuration, e.g. put the GPU in another system and see if it still
> gives the same issues, or put another GPU into this system.
> 

Or, it could be amdgpu or some other software component -

register mmio base: 0x95E00000
Address       0x95e7f000

0x95e7f000 indicates a CPU access to register offset 0x7F000
(0x95e7f000 - 0x95E00000). This doesn't look like a valid register
offset for this chip (device [1002:73df]). Any other clues in dmesg?
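
The offset arithmetic can be sketched as follows.  Both helper names are
hypothetical, and the dword index assumes that registers are numbered in
32-bit units, i.e. that a register number is multiplied by 4 to get the
byte offset from the mmio base.

```c
#include <stdint.h>

/* Byte offset of a bus address from the start of the register BAR. */
static uint32_t mmio_byte_offset(uint64_t bus_addr, uint64_t mmio_base)
{
	return (uint32_t)(bus_addr - mmio_base);
}

/* Assumption: registers are indexed in dwords, so index = offset / 4. */
static uint32_t mmio_dword_index(uint32_t byte_offset)
{
	return byte_offset / 4;
}
```

For the address in the AER log, 0x95e7f000 minus the base 0x95E00000 is
a byte offset of 0x7f000, which would be dword register index 0x1fc00.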

Thanks,
Lijo


> Regards,
> Christian.
> 
> 
>>
>> Bjorn
> 


* Re: [Bug 216373] New: Uncorrected errors reported for AMD GPU
  2022-08-19  8:33     ` Lazar, Lijo
@ 2022-08-19 11:04       ` Bjorn Helgaas
  0 siblings, 0 replies; 18+ messages in thread
From: Bjorn Helgaas @ 2022-08-19 11:04 UTC (permalink / raw)
  To: Lazar, Lijo
  Cc: Christian König, Alex Deucher, Xinhui Pan, regressions,
	David Airlie, linux-pci, amd-gfx, Tom Seewald, Kai-Heng Feng,
	Daniel Vetter, Stefan Roese

On Fri, Aug 19, 2022 at 02:03:59PM +0530, Lazar, Lijo wrote:

> Or, it could be amdgpu or some other software component -
> 
> register mmio base: 0x95E00000
> Address       0x95e7f000
> 
> 0x95e7f000 indicates a CPU access to register offset 0x7F000
> (0x95e7f000 - 0x95E00000). This doesn't look like a valid register
> offset for this chip (device [1002:73df]). Any other clues in dmesg?

The complete dmesg is at
https://bugzilla.kernel.org/attachment.cgi?id=301596


* Re: [Bug 216373] New: Uncorrected errors reported for AMD GPU
  2022-08-18 20:38 ` [Bug 216373] New: Uncorrected errors reported for AMD GPU Bjorn Helgaas
  2022-08-19  7:05   ` Christian König
@ 2022-08-19 17:13   ` Bjorn Helgaas
  2022-08-19 19:07     ` Bjorn Helgaas
  2022-08-23 16:01   ` [Bug 216373] New: Uncorrected errors reported for AMD GPU #forregzbot Thorsten Leemhuis
  2 siblings, 1 reply; 18+ messages in thread
From: Bjorn Helgaas @ 2022-08-19 17:13 UTC (permalink / raw)
  To: Tom Seewald
  Cc: David Airlie, Daniel Vetter, Tom Seewald, Stefan Roese,
	Kai-Heng Feng, regressions, linux-pci, amd-gfx, Alex Deucher,
	Christian König, Xinhui Pan, Lijo Lazar

On Thu, Aug 18, 2022 at 03:38:12PM -0500, Bjorn Helgaas wrote:
> [Adding amdgpu folks]
> 
> On Wed, Aug 17, 2022 at 11:45:15PM +0000, bugzilla-daemon@kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=216373
> > 
> >             Bug ID: 216373
> >            Summary: Uncorrected errors reported for AMD GPU
> >     Kernel Version: v6.0-rc1
> >         Regression: No

Tom, thanks for trying out "pci=noaer".  Hopefully we won't need the
workaround for long.

Could I trouble you to try the debug patch below and see if we get any
stack trace clues in dmesg when the error happens?  I'm sure the
experts would have a better approach, but I'm amdgpu-illiterate, so 
this is all I can do :)

Bjorn

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index c4a6fe3070b6..fc34c66776bc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -130,6 +130,14 @@ const char *amdgpu_asic_name[] = {
 	"LAST",
 };
 
+void check_write(uint32_t v, void __iomem *base, uint32_t offset)
+{
+	if (offset == 0x7f000) {
+		pr_err("** writing %#010x to %px\n", v, base + offset);
+		dump_stack();
+	}
+}
+
 /**
  * DOC: pcie_replay_count
  *
@@ -512,9 +520,10 @@ void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t value)
 	if (amdgpu_device_skip_hw_access(adev))
 		return;
 
-	if (offset < adev->rmmio_size)
+	if (offset < adev->rmmio_size) {
+		check_write(value, adev->rmmio, offset);
 		writeb(value, adev->rmmio + offset);
-	else
+	} else
 		BUG();
 }
 
@@ -542,6 +551,7 @@ void amdgpu_device_wreg(struct amdgpu_device *adev,
 			amdgpu_kiq_wreg(adev, reg, v);
 			up_read(&adev->reset_domain->sem);
 		} else {
+			check_write(v, adev->rmmio, reg * 4);
 			writel(v, ((void __iomem *)adev->rmmio) + (reg * 4));
 		}
 	} else {
@@ -574,6 +584,7 @@ void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev,
 	} else if ((reg * 4) >= adev->rmmio_size) {
 		adev->pcie_wreg(adev, reg * 4, v);
 	} else {
+		check_write(v, adev->rmmio, reg * 4);
 		writel(v, ((void __iomem *)adev->rmmio) + (reg * 4));
 	}
 }
@@ -689,6 +700,7 @@ u32 amdgpu_device_indirect_rreg(struct amdgpu_device *adev,
 	pcie_index_offset = (void __iomem *)adev->rmmio + pcie_index * 4;
 	pcie_data_offset = (void __iomem *)adev->rmmio + pcie_data * 4;
 
+	check_write(reg_addr, adev->rmmio, pcie_index * 4);
 	writel(reg_addr, pcie_index_offset);
 	readl(pcie_index_offset);
 	r = readl(pcie_data_offset);
@@ -721,10 +733,12 @@ u64 amdgpu_device_indirect_rreg64(struct amdgpu_device *adev,
 	pcie_data_offset = (void __iomem *)adev->rmmio + pcie_data * 4;
 
 	/* read low 32 bits */
+	check_write(reg_addr, adev->rmmio, pcie_index * 4);
 	writel(reg_addr, pcie_index_offset);
 	readl(pcie_index_offset);
 	r = readl(pcie_data_offset);
 	/* read high 32 bits */
+	check_write(reg_addr + 4, adev->rmmio, pcie_index * 4);
 	writel(reg_addr + 4, pcie_index_offset);
 	readl(pcie_index_offset);
 	r |= ((u64)readl(pcie_data_offset) << 32);
@@ -755,8 +769,10 @@ void amdgpu_device_indirect_wreg(struct amdgpu_device *adev,
 	pcie_index_offset = (void __iomem *)adev->rmmio + pcie_index * 4;
 	pcie_data_offset = (void __iomem *)adev->rmmio + pcie_data * 4;
 
+	check_write(reg_addr, adev->rmmio, pcie_index * 4);
 	writel(reg_addr, pcie_index_offset);
 	readl(pcie_index_offset);
+	check_write(reg_data, adev->rmmio, pcie_data * 4);
 	writel(reg_data, pcie_data_offset);
 	readl(pcie_data_offset);
 	spin_unlock_irqrestore(&adev->pcie_idx_lock, flags);
@@ -785,13 +801,17 @@ void amdgpu_device_indirect_wreg64(struct amdgpu_device *adev,
 	pcie_data_offset = (void __iomem *)adev->rmmio + pcie_data * 4;
 
 	/* write low 32 bits */
+	check_write(reg_addr, adev->rmmio, pcie_index * 4);
 	writel(reg_addr, pcie_index_offset);
 	readl(pcie_index_offset);
+	check_write((u32)(reg_data & 0xffffffffULL), adev->rmmio, pcie_data * 4);
 	writel((u32)(reg_data & 0xffffffffULL), pcie_data_offset);
 	readl(pcie_data_offset);
 	/* write high 32 bits */
+	check_write(reg_addr + 4, adev->rmmio, pcie_index * 4);
 	writel(reg_addr + 4, pcie_index_offset);
 	readl(pcie_index_offset);
+	check_write((u32)(reg_data >> 32), adev->rmmio, pcie_data * 4);
 	writel((u32)(reg_data >> 32), pcie_data_offset);
 	readl(pcie_data_offset);
 	spin_unlock_irqrestore(&adev->pcie_idx_lock, flags);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 9be57389301b..b552d7c27ec0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -36,6 +36,8 @@
 #include "soc15.h"
 #include "nv.h"
 
+extern void check_write(uint32_t v, void __iomem *base, uint32_t offset);
+
 #define POPULATE_UCODE_INFO(vf2pf_info, ucode, ver) \
 	do { \
 		vf2pf_info->ucode_info[ucode].id = ucode; \
@@ -900,11 +902,15 @@ static u32 amdgpu_virt_rlcg_reg_rw(struct amdgpu_device *adev, u32 offset, u32 v
 
 	if (offset == reg_access_ctrl->grbm_cntl) {
 		/* if the target reg offset is grbm_cntl, write to scratch_reg2 */
+		check_write(v, adev->rmmio, 4 * reg_access_ctrl->scratch_reg2);
 		writel(v, scratch_reg2);
+		check_write(v, adev->rmmio, offset * 4);
 		writel(v, ((void __iomem *)adev->rmmio) + (offset * 4));
 	} else if (offset == reg_access_ctrl->grbm_idx) {
 		/* if the target reg offset is grbm_idx, write to scratch_reg3 */
+		check_write(v, adev->rmmio, 4 * reg_access_ctrl->scratch_reg3);
 		writel(v, scratch_reg3);
+		check_write(v, adev->rmmio, offset * 4);
 		writel(v, ((void __iomem *)adev->rmmio) + (offset * 4));
 	} else {
 		/*
@@ -913,10 +919,14 @@ static u32 amdgpu_virt_rlcg_reg_rw(struct amdgpu_device *adev, u32 offset, u32 v
 		 * SCRATCH_REG1[19:0]	= address in dword
 		 * SCRATCH_REG1[26:24]	= Error reporting
 		 */
+		check_write(v, adev->rmmio, 4 * reg_access_ctrl->scratch_reg0);
 		writel(v, scratch_reg0);
+		check_write(offset | flag, adev->rmmio, 4 * reg_access_ctrl->scratch_reg1);
 		writel((offset | flag), scratch_reg1);
-		if (reg_access_ctrl->spare_int)
+		if (reg_access_ctrl->spare_int) {
+			check_write(1, adev->rmmio, 4 * reg_access_ctrl->spare_int);
 			writel(1, spare_int);
+		}
 
 		for (i = 0; i < timeout; i++) {
 			tmp = readl(scratch_reg1);


* Re: [Bug 216373] New: Uncorrected errors reported for AMD GPU
  2022-08-19 17:13   ` Bjorn Helgaas
@ 2022-08-19 19:07     ` Bjorn Helgaas
  2022-08-20  7:52       ` Lazar, Lijo
  0 siblings, 1 reply; 18+ messages in thread
From: Bjorn Helgaas @ 2022-08-19 19:07 UTC (permalink / raw)
  To: Tom Seewald
  Cc: Lijo Lazar, regressions, David Airlie, linux-pci, Xinhui Pan,
	amd-gfx, Kai-Heng Feng, Daniel Vetter, Alex Deucher,
	Stefan Roese, Christian König

On Fri, Aug 19, 2022 at 12:13:03PM -0500, Bjorn Helgaas wrote:
> On Thu, Aug 18, 2022 at 03:38:12PM -0500, Bjorn Helgaas wrote:
> > [Adding amdgpu folks]
> > 
> > On Wed, Aug 17, 2022 at 11:45:15PM +0000, bugzilla-daemon@kernel.org wrote:
> > > https://bugzilla.kernel.org/show_bug.cgi?id=216373
> > > 
> > >             Bug ID: 216373
> > >            Summary: Uncorrected errors reported for AMD GPU
> > >     Kernel Version: v6.0-rc1
> > >         Regression: No
> 
> Tom, thanks for trying out "pci=noaer".  Hopefully we won't need the
> workaround for long.
> 
> Could I trouble you to try the debug patch below and see if we get any
> stack trace clues in dmesg when the error happens?  I'm sure the
> experts would have a better approach, but I'm amdgpu-illiterate, so 
> this is all I can do :)

Thanks for doing this, Tom!  For everybody else, Tom attached a dmesg
log to the bugzilla: https://bugzilla.kernel.org/attachment.cgi?id=301606

Lots of traces of the form:

  amdgpu_device_wreg.part.0.cold+0xb/0x17 [amdgpu]
  amdgpu_gart_invalidate_tlb+0x22/0x60 [amdgpu]
  gmc_v10_0_hw_init+0x44/0x180 [amdgpu]

  amdgpu_device_wreg.part.0.cold+0xb/0x17 [amdgpu]
  gmc_v10_0_hw_init+0xa8/0x180 [amdgpu]

  amdgpu_device_wreg.part.0.cold+0xb/0x17 [amdgpu]
  gmc_v10_0_flush_gpu_tlb+0x35/0x280 [amdgpu]
  amdgpu_gart_invalidate_tlb+0x46/0x60 [amdgpu]
  gmc_v10_0_hw_init+0x44/0x180 [amdgpu]

I tried connecting the dots but I gave up chasing all the function
pointers.


* Re: [Bug 216373] New: Uncorrected errors reported for AMD GPU
  2022-08-19 19:07     ` Bjorn Helgaas
@ 2022-08-20  7:52       ` Lazar, Lijo
  2022-08-23 17:04         ` Tom Seewald
  0 siblings, 1 reply; 18+ messages in thread
From: Lazar, Lijo @ 2022-08-20  7:52 UTC (permalink / raw)
  To: Bjorn Helgaas, Tom Seewald
  Cc: regressions, David Airlie, linux-pci, Xinhui Pan, amd-gfx,
	Kai-Heng Feng, Daniel Vetter, Alex Deucher, Stefan Roese,
	Christian König



On 8/20/2022 12:37 AM, Bjorn Helgaas wrote:
> On Fri, Aug 19, 2022 at 12:13:03PM -0500, Bjorn Helgaas wrote:
>> On Thu, Aug 18, 2022 at 03:38:12PM -0500, Bjorn Helgaas wrote:
>>> [Adding amdgpu folks]
>>>
>>> On Wed, Aug 17, 2022 at 11:45:15PM +0000, bugzilla-daemon@kernel.org wrote:
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=216373
>>>>
>>>>              Bug ID: 216373
>>>>             Summary: Uncorrected errors reported for AMD GPU
>>>>      Kernel Version: v6.0-rc1
>>>>          Regression: No
>>
>> Tom, thanks for trying out "pci=noaer".  Hopefully we won't need the
>> workaround for long.
>>
>> Could I trouble you to try the debug patch below and see if we get any
>> stack trace clues in dmesg when the error happens?  I'm sure the
>> experts would have a better approach, but I'm amdgpu-illiterate, so
>> this is all I can do :)
> 
> Thanks for doing this, Tom!  For everybody else, Tom attached a dmesg
> log to the bugzilla: https://bugzilla.kernel.org/attachment.cgi?id=301606
> 
> Lots of traces of the form:
> 
>    amdgpu_device_wreg.part.0.cold+0xb/0x17 [amdgpu]
>    amdgpu_gart_invalidate_tlb+0x22/0x60 [amdgpu]
>    gmc_v10_0_hw_init+0x44/0x180 [amdgpu]
> 
>    amdgpu_device_wreg.part.0.cold+0xb/0x17 [amdgpu]
>    gmc_v10_0_hw_init+0xa8/0x180 [amdgpu]
> 
>    amdgpu_device_wreg.part.0.cold+0xb/0x17 [amdgpu]
>    gmc_v10_0_flush_gpu_tlb+0x35/0x280 [amdgpu]
>    amdgpu_gart_invalidate_tlb+0x46/0x60 [amdgpu]
>    gmc_v10_0_hw_init+0x44/0x180 [amdgpu]
> 
> I tried connecting the dots but I gave up chasing all the function
> pointers.
> 

Missed the remap part, the offset is here -

https://elixir.bootlin.com/linux/v6.0-rc1/source/drivers/gpu/drm/amd/amdgpu/nv.c#L680 


The trace is coming from *_flush_hdp.

You may also check if *_remap_hdp_registers() is getting called. It is 
done in nbio_vx_y files, most likely this one for your device -
https://elixir.bootlin.com/linux/v6.0-rc1/source/drivers/gpu/drm/amd/amdgpu/nbio_v2_3.c#L68
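
As a rough cross-check of the offset: to the best of my recollection
(an assumption that should be verified against the nv.c line linked
above), the remap hole is placed one page below the 0x80000 mark in the
register BAR.  The function below is a hypothetical stand-in for that
definition, not a copy of the driver code.

```c
#include <stdint.h>

/*
 * Assumption, not taken from this thread: the HDP remap hole is the
 * last page below offset 0x80000 in the register BAR.
 */
static uint32_t reg_hole_offset(uint32_t page_size)
{
	return 0x80000u - page_size;
}
```

With 4 KiB pages this gives 0x80000 - 0x1000 = 0x7f000, the same offset
as the faulting write, which is consistent with the write targeting the
remapped HDP registers.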

Thanks,
Lijo



* Re: [Bug 216373] New: Uncorrected errors reported for AMD GPU
  2022-08-20  7:52       ` Lazar, Lijo
@ 2022-08-23 17:04         ` Tom Seewald
  2022-08-24  5:10           ` Lazar, Lijo
  0 siblings, 1 reply; 18+ messages in thread
From: Tom Seewald @ 2022-08-23 17:04 UTC (permalink / raw)
  To: Lazar, Lijo
  Cc: Bjorn Helgaas, regressions, David Airlie, linux-pci, Xinhui Pan,
	amd-gfx, Kai-Heng Feng, Daniel Vetter, Alex Deucher,
	Stefan Roese, Christian König

On Sat, Aug 20, 2022 at 2:53 AM Lazar, Lijo <lijo.lazar@amd.com> wrote:
>
> Missed the remap part, the offset is here -
>
> https://elixir.bootlin.com/linux/v6.0-rc1/source/drivers/gpu/drm/amd/amdgpu/nv.c#L680
>
>
> The trace is coming from *_flush_hdp.
>
> You may also check if *_remap_hdp_registers() is getting called. It is
> done in nbio_vx_y files, most likely this one for your device -
> https://elixir.bootlin.com/linux/v6.0-rc1/source/drivers/gpu/drm/amd/amdgpu/nbio_v2_3.c#L68
>
> Thanks,
> Lijo

Hi Lijo,

I would be happy to test any patches that you think would shed some
light on this.


* Re: [Bug 216373] New: Uncorrected errors reported for AMD GPU
  2022-08-23 17:04         ` Tom Seewald
@ 2022-08-24  5:10           ` Lazar, Lijo
  2022-08-24 14:45             ` Tom Seewald
  2022-08-25 15:05             ` Felix Kuehling
  0 siblings, 2 replies; 18+ messages in thread
From: Lazar, Lijo @ 2022-08-24  5:10 UTC (permalink / raw)
  To: Tom Seewald
  Cc: Bjorn Helgaas, regressions, David Airlie, linux-pci, Xinhui Pan,
	amd-gfx, Kai-Heng Feng, Daniel Vetter, Alex Deucher,
	Stefan Roese, Christian König

[-- Attachment #1: Type: text/plain, Size: 1597 bytes --]



On 8/23/2022 10:34 PM, Tom Seewald wrote:
> On Sat, Aug 20, 2022 at 2:53 AM Lazar, Lijo <lijo.lazar@amd.com> wrote:
>>
>> Missed the remap part, the offset is here -
>>
>> https://elixir.bootlin.com/linux/v6.0-rc1/source/drivers/gpu/drm/amd/amdgpu/nv.c#L680
>>
>>
>> The trace is coming from *_flush_hdp.
>>
>> You may also check if *_remap_hdp_registers() is getting called. It is
>> done in nbio_vx_y files, most likely this one for your device -
>> https://elixir.bootlin.com/linux/v6.0-rc1/source/drivers/gpu/drm/amd/amdgpu/nbio_v2_3.c#L68
>>
>> Thanks,
>> Lijo
> 
> Hi Lijo,
> 
> I would be happy to test any patches that you think would shed some
> light on this.
> 
Unfortunately, I don't have any NV platforms to test on. Attached is an
untested patch based on your trace logs.

Thanks,
Lijo

[-- Attachment #2: remap_hdp.diff --]
[-- Type: text/plain, Size: 2918 bytes --]

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index d7eb23b8d692..743a3ac909ad 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2376,6 +2376,15 @@ static int amdgpu_device_ip_init(struct amdgpu_device *adev)
 				DRM_ERROR("amdgpu_vram_scratch_init failed %d\n", r);
 				goto init_failed;
 			}
+
+			/* remap HDP registers to a hole in mmio space,
+			 * for the purpose of exposing those registers
+			 * to process space. This is needed for any early HDP
+			 * flush operation
+			 */
+			if (adev->nbio.funcs->remap_hdp_registers && !amdgpu_sriov_vf(adev))
+				adev->nbio.funcs->remap_hdp_registers(adev);
+
 			r = adev->ip_blocks[i].version->funcs->hw_init((void *)adev);
 			if (r) {
 				DRM_ERROR("hw_init %d failed %d\n", i, r);
diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
index b3fba8dea63c..3ac7fef74277 100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/nv.c
@@ -1032,12 +1032,6 @@ static int nv_common_hw_init(void *handle)
 	nv_program_aspm(adev);
 	/* setup nbio registers */
 	adev->nbio.funcs->init_registers(adev);
-	/* remap HDP registers to a hole in mmio space,
-	 * for the purpose of expose those registers
-	 * to process space
-	 */
-	if (adev->nbio.funcs->remap_hdp_registers && !amdgpu_sriov_vf(adev))
-		adev->nbio.funcs->remap_hdp_registers(adev);
 	/* enable the doorbell aperture */
 	nv_enable_doorbell_aperture(adev, true);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index fde6154f2009..a0481e37d7cf 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -1240,12 +1240,6 @@ static int soc15_common_hw_init(void *handle)
 	soc15_program_aspm(adev);
 	/* setup nbio registers */
 	adev->nbio.funcs->init_registers(adev);
-	/* remap HDP registers to a hole in mmio space,
-	 * for the purpose of expose those registers
-	 * to process space
-	 */
-	if (adev->nbio.funcs->remap_hdp_registers && !amdgpu_sriov_vf(adev))
-		adev->nbio.funcs->remap_hdp_registers(adev);
 
 	/* enable the doorbell aperture */
 	soc15_enable_doorbell_aperture(adev, true);
diff --git a/drivers/gpu/drm/amd/amdgpu/soc21.c b/drivers/gpu/drm/amd/amdgpu/soc21.c
index 1ff7fc7bb340..60a1cf03fddc 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc21.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc21.c
@@ -643,12 +643,6 @@ static int soc21_common_hw_init(void *handle)
 	soc21_program_aspm(adev);
 	/* setup nbio registers */
 	adev->nbio.funcs->init_registers(adev);
-	/* remap HDP registers to a hole in mmio space,
-	 * for the purpose of expose those registers
-	 * to process space
-	 */
-	if (adev->nbio.funcs->remap_hdp_registers)
-		adev->nbio.funcs->remap_hdp_registers(adev);
 	/* enable the doorbell aperture */
 	soc21_enable_doorbell_aperture(adev, true);
 


* Re: [Bug 216373] New: Uncorrected errors reported for AMD GPU
  2022-08-24  5:10           ` Lazar, Lijo
@ 2022-08-24 14:45             ` Tom Seewald
  2022-08-25  6:40               ` Stefan Roese
  2022-08-25 15:05             ` Felix Kuehling
  1 sibling, 1 reply; 18+ messages in thread
From: Tom Seewald @ 2022-08-24 14:45 UTC (permalink / raw)
  To: Lazar, Lijo
  Cc: Bjorn Helgaas, regressions, David Airlie, linux-pci, Xinhui Pan,
	amd-gfx, Kai-Heng Feng, Daniel Vetter, Alex Deucher,
	Stefan Roese, Christian König

On Wed, Aug 24, 2022 at 12:11 AM Lazar, Lijo <lijo.lazar@amd.com> wrote:
> Unfortunately, I don't have any NV platforms to test. Attached is an
> 'untested-patch' based on your trace logs.
>
> Thanks,
> Lijo

Thank you for the patch. It applied cleanly to v6.0-rc2 and after
booting that kernel I no longer see any messages about PCI errors. I
have uploaded a dmesg log to the bug report:
https://bugzilla.kernel.org/attachment.cgi?id=301642


* Re: [Bug 216373] New: Uncorrected errors reported for AMD GPU
  2022-08-24 14:45             ` Tom Seewald
@ 2022-08-25  6:40               ` Stefan Roese
  2022-08-25  7:34                 ` Christian König
  0 siblings, 1 reply; 18+ messages in thread
From: Stefan Roese @ 2022-08-25  6:40 UTC (permalink / raw)
  To: Tom Seewald, Lazar, Lijo
  Cc: Bjorn Helgaas, regressions, David Airlie, linux-pci, Xinhui Pan,
	amd-gfx, Kai-Heng Feng, Daniel Vetter, Alex Deucher,
	Christian König

On 24.08.22 16:45, Tom Seewald wrote:
> On Wed, Aug 24, 2022 at 12:11 AM Lazar, Lijo <lijo.lazar@amd.com> wrote:
>> Unfortunately, I don't have any NV platforms to test. Attached is an
>> 'untested-patch' based on your trace logs.
>>
>> Thanks,
>> Lijo
> 
> Thank you for the patch. It applied cleanly to v6.0-rc2 and after
> booting that kernel I no longer see any messages about PCI errors. I
> have uploaded a dmesg log to the bug report:
> https://bugzilla.kernel.org/attachment.cgi?id=301642

I did not follow this thread in depth, but FWICT the bug is solved now
with this patch. So is it correct that the now fully enabled AER
support in the PCI subsystem in v6.0 helped detect a bug in the AMD
GPU driver?

Thanks,
Stefan


* Re: [Bug 216373] New: Uncorrected errors reported for AMD GPU
  2022-08-25  6:40               ` Stefan Roese
@ 2022-08-25  7:34                 ` Christian König
  2022-08-25  7:54                   ` Lazar, Lijo
  0 siblings, 1 reply; 18+ messages in thread
From: Christian König @ 2022-08-25  7:34 UTC (permalink / raw)
  To: Stefan Roese, Tom Seewald, Lazar, Lijo
  Cc: Bjorn Helgaas, regressions, David Airlie, linux-pci, Xinhui Pan,
	amd-gfx, Kai-Heng Feng, Daniel Vetter, Alex Deucher

Am 25.08.22 um 08:40 schrieb Stefan Roese:
> On 24.08.22 16:45, Tom Seewald wrote:
>> On Wed, Aug 24, 2022 at 12:11 AM Lazar, Lijo <lijo.lazar@amd.com> wrote:
>>> Unfortunately, I don't have any NV platforms to test. Attached is an
>>> 'untested-patch' based on your trace logs.
>>>
>>> Thanks,
>>> Lijo
>>
>> Thank you for the patch. It applied cleanly to v6.0-rc2 and after
>> booting that kernel I no longer see any messages about PCI errors. I
>> have uploaded a dmesg log to the bug report:
>> https://bugzilla.kernel.org/attachment.cgi?id=301642
>>
>
> I did not follow this thread in depth, but FWICT the bug is solved now
> with this patch. So is it correct that the now fully enabled AER
> support in the PCI subsystem in v6.0 helped detect a bug in the AMD
> GPU driver?

It looks like it, but I'm not 100% sure about the rationale behind it.

Lijo can you explain more on this?

Thanks,
Christian.

>
> Thanks,
> Stefan



* Re: [Bug 216373] New: Uncorrected errors reported for AMD GPU
  2022-08-25  7:34                 ` Christian König
@ 2022-08-25  7:54                   ` Lazar, Lijo
  2022-08-25  8:18                     ` Christian König
  0 siblings, 1 reply; 18+ messages in thread
From: Lazar, Lijo @ 2022-08-25  7:54 UTC (permalink / raw)
  To: Christian König, Stefan Roese, Tom Seewald
  Cc: Bjorn Helgaas, regressions, David Airlie, linux-pci, Xinhui Pan,
	amd-gfx, Kai-Heng Feng, Daniel Vetter, Alex Deucher



On 8/25/2022 1:04 PM, Christian König wrote:
> Am 25.08.22 um 08:40 schrieb Stefan Roese:
>> On 24.08.22 16:45, Tom Seewald wrote:
>>> On Wed, Aug 24, 2022 at 12:11 AM Lazar, Lijo <lijo.lazar@amd.com> wrote:
>>>> Unfortunately, I don't have any NV platforms to test. Attached is an
>>>> 'untested-patch' based on your trace logs.
>>>>
>>>> Thanks,
>>>> Lijo
>>>
>>> Thank you for the patch. It applied cleanly to v6.0-rc2 and after
>>> booting that kernel I no longer see any messages about PCI errors. I
>>> have uploaded a dmesg log to the bug report:
>>> https://bugzilla.kernel.org/attachment.cgi?id=301642
>>>
>>
>> I did not follow this thread in depth, but FWICT the bug is solved now
>> with this patch. So is it correct that the now fully enabled AER
>> support in the PCI subsystem in v6.0 helped detect a bug in the AMD
>> GPU driver?
> 
> It looks like it, but I'm not 100% sure about the rationale behind it.
> 
> Lijo can you explain more on this?
> 

From the trace, during gmc hw_init it takes this route -

gart_enable -> amdgpu_gtt_mgr_recover -> amdgpu_gart_invalidate_tlb -> 
amdgpu_device_flush_hdp -> amdgpu_asic_flush_hdp (non-ring based HDP flush)

HDP flush is done using remapped offset which is MMIO_REG_HOLE_OFFSET 
(0x80000 - PAGE_SIZE)

WREG32_NO_KIQ((adev->rmmio_remap.reg_offset + 
KFD_MMIO_REMAP_HDP_MEM_FLUSH_CNTL) >> 2, 0);

However, the remapping is not yet done at this point. It's done at a 
later point during common block initialization. Access to the unmapped 
offset '(0x80000 - PAGE_SIZE)' seems to come back as unsupported request 
and reported through AER.

In the patch, I just moved the remapping before gmc block initialization.
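
As a rough illustration of the ordering problem, here is a toy model in
plain C. It is not the amdgpu code: every toy_* name is made up for this
sketch, 4 KiB pages are assumed, and the sub-offset of the flush register
within the remapped page is ignored. A flush that targets the remapped
offset only succeeds once the remap has been programmed:

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SIZE            4096u  /* assumption: 4 KiB pages */
#define TOY_MMIO_REG_HOLE_OFFSET (0x80000u - TOY_PAGE_SIZE)

struct toy_dev {
	bool hdp_remapped;         /* has the remap been programmed yet? */
	unsigned int unsup_req;    /* writes that hit the unmapped hole */
	uint32_t last_bad_offset;  /* offset of the most recent bad write */
};

static void toy_remap_hdp_registers(struct toy_dev *d)
{
	d->hdp_remapped = true;
}

/* Non-ring HDP flush: modelled as a register write to the remapped offset. */
static void toy_flush_hdp(struct toy_dev *d)
{
	if (!d->hdp_remapped) {
		d->unsup_req++;
		d->last_bad_offset = TOY_MMIO_REG_HOLE_OFFSET;
	}
}

/* gmc hw_init ends up flushing HDP while enabling GART. */
static void toy_gmc_hw_init(struct toy_dev *d)
{
	toy_flush_hdp(d);
}

/* Before the patch, the remap was only done in the common block's hw_init. */
static void toy_common_hw_init(struct toy_dev *d)
{
	if (!d->hdp_remapped)
		toy_remap_hdp_registers(d);
}

/* remap_early == true models the order after the patch. */
static struct toy_dev toy_init(bool remap_early)
{
	struct toy_dev d = { false, 0, 0 };

	if (remap_early)
		toy_remap_hdp_registers(&d);
	toy_gmc_hw_init(&d);
	toy_common_hw_init(&d);
	return d;
}
```

In this model toy_init(false) counts one unsupported request at offset
0x7f000, while toy_init(true) counts none.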

Thanks,
Lijo

> Thanks,
> Christian.
> 
>>
>> Thanks,
>> Stefan
> 


* Re: [Bug 216373] New: Uncorrected errors reported for AMD GPU
  2022-08-25  7:54                   ` Lazar, Lijo
@ 2022-08-25  8:18                     ` Christian König
  2022-08-25 17:48                       ` Bjorn Helgaas
  0 siblings, 1 reply; 18+ messages in thread
From: Christian König @ 2022-08-25  8:18 UTC (permalink / raw)
  To: Lazar, Lijo, Stefan Roese, Tom Seewald
  Cc: Bjorn Helgaas, regressions, David Airlie, linux-pci, Xinhui Pan,
	amd-gfx, Kai-Heng Feng, Daniel Vetter, Alex Deucher

Am 25.08.22 um 09:54 schrieb Lazar, Lijo:
>
>
> On 8/25/2022 1:04 PM, Christian König wrote:
>> Am 25.08.22 um 08:40 schrieb Stefan Roese:
>>> On 24.08.22 16:45, Tom Seewald wrote:
>>>> On Wed, Aug 24, 2022 at 12:11 AM Lazar, Lijo <lijo.lazar@amd.com> 
>>>> wrote:
>>>>> Unfortunately, I don't have any NV platforms to test. Attached is an
>>>>> 'untested-patch' based on your trace logs.
>>>>>
>>>>> Thanks,
>>>>> Lijo
>>>>
>>>> Thank you for the patch. It applied cleanly to v6.0-rc2 and after
>>>> booting that kernel I no longer see any messages about PCI errors. I
>>>> have uploaded a dmesg log to the bug report:
>>>> https://bugzilla.kernel.org/attachment.cgi?id=301642
>>>>
>>>
>>> I did not follow this thread in depth, but FWICT the bug is solved now
>>> with this patch. So is it correct that the now fully enabled AER
>>> support in the PCI subsystem in v6.0 helped detect a bug in the AMD
>>> GPU driver?
>>
>> It looks like it, but I'm not 100% sure about the rationale behind it.
>>
>> Lijo can you explain more on this?
>>
>
> From the trace, during gmc hw_init it takes this route -
>
> gart_enable -> amdgpu_gtt_mgr_recover -> amdgpu_gart_invalidate_tlb -> 
> amdgpu_device_flush_hdp -> amdgpu_asic_flush_hdp (non-ring based HDP 
> flush)
>
> HDP flush is done using remapped offset which is MMIO_REG_HOLE_OFFSET 
> (0x80000 - PAGE_SIZE)
>
> WREG32_NO_KIQ((adev->rmmio_remap.reg_offset + 
> KFD_MMIO_REMAP_HDP_MEM_FLUSH_CNTL) >> 2, 0);
>
> However, the remapping is not yet done at this point. It's done at a 
> later point during common block initialization. Access to the unmapped 
> offset '(0x80000 - PAGE_SIZE)' seems to come back as unsupported 
> request and reported through AER.

That's interesting behavior. So far AER always indicated some kind of 
transmission error.

If that also happens for accesses to unmapped areas of the MMIO BAR, then
we need to keep that in mind.

Thanks,
Christian.

>
> In the patch, I just moved the remapping before gmc block initialization.
>
> Thanks,
> Lijo
>
>> Thanks,
>> Christian.
>>
>>>
>>> Thanks,
>>> Stefan
>>



* Re: [Bug 216373] New: Uncorrected errors reported for AMD GPU
  2022-08-25  8:18                     ` Christian König
@ 2022-08-25 17:48                       ` Bjorn Helgaas
  2022-08-26  7:10                         ` Christian König
  0 siblings, 1 reply; 18+ messages in thread
From: Bjorn Helgaas @ 2022-08-25 17:48 UTC (permalink / raw)
  To: Christian König
  Cc: Lazar, Lijo, Stefan Roese, Tom Seewald, regressions,
	David Airlie, linux-pci, Xinhui Pan, amd-gfx, Kai-Heng Feng,
	Daniel Vetter, Alex Deucher

On Thu, Aug 25, 2022 at 10:18:28AM +0200, Christian König wrote:
> Am 25.08.22 um 09:54 schrieb Lazar, Lijo:
> > On 8/25/2022 1:04 PM, Christian König wrote:
> > > Am 25.08.22 um 08:40 schrieb Stefan Roese:
> > > > On 24.08.22 16:45, Tom Seewald wrote:
> > > > > On Wed, Aug 24, 2022 at 12:11 AM Lazar, Lijo
> > > > > <lijo.lazar@amd.com> wrote:
> > > > > > Unfortunately, I don't have any NV platforms to test. Attached is an
> > > > > > 'untested-patch' based on your trace logs.
> > > > > ...
> > > > 
> > > > I did not follow this thread in depth, but FWICT the bug is solved now
> > > > with this patch. So is it correct that the now fully enabled AER
> > > > support in the PCI subsystem in v6.0 helped detect a bug in the AMD
> > > > GPU driver?
> > > 
> > > It looks like it, but I'm not 100% sure about the rationale behind it.
> > > 
> > > Lijo can you explain more on this?
> > 
> > From the trace, during gmc hw_init it takes this route -
> > 
> > gart_enable -> amdgpu_gtt_mgr_recover -> amdgpu_gart_invalidate_tlb ->
> > amdgpu_device_flush_hdp -> amdgpu_asic_flush_hdp (non-ring based HDP
> > flush)
> > 
> > HDP flush is done using remapped offset which is MMIO_REG_HOLE_OFFSET
> > (0x80000 - PAGE_SIZE)
> > 
> > WREG32_NO_KIQ((adev->rmmio_remap.reg_offset +
> > KFD_MMIO_REMAP_HDP_MEM_FLUSH_CNTL) >> 2, 0);
> > 
> > However, the remapping is not yet done at this point. It's done at a
> > later point during common block initialization. Access to the unmapped
> > offset '(0x80000 - PAGE_SIZE)' seems to come back as unsupported request
> > and reported through AER.
> 
> That's interesting behavior. So far AER always indicated some kind of
> transmission error.
> 
> If that also happens for accesses to unmapped areas of the MMIO BAR, then
> we need to keep that in mind.

AER can log many different kinds of errors, some related to hardware
issues and some related to software.

PCI writes are normally posted and get no response, so AER is the main
way to find out about writes to unimplemented addresses.

Reads do get a response, of course, and reads to unimplemented
addresses cause errors that most hardware turns into a ~0 data return
(in addition to reporting via AER if enabled).
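
A small sketch of that difference, using a made-up in-memory "BAR" rather
than real hardware (toy_bar, toy_read32, toy_write32 and the four-register
size are all assumptions for illustration). The caller of a write to an
unimplemented register sees nothing; only the error counter, standing in
for AER, records it. A read of the same register returns ~0:

```c
#include <stdint.h>

#define TOY_NUM_REGS 4u  /* hypothetical: only 4 registers are implemented */

struct toy_bar {
	uint32_t regs[TOY_NUM_REGS];
	unsigned int aer_unsup_req;  /* stand-in for AER logging */
};

/* Posted write: no completion, so the caller gets no status back. */
static void toy_write32(struct toy_bar *b, uint32_t idx, uint32_t val)
{
	if (idx >= TOY_NUM_REGS) {
		b->aer_unsup_req++;  /* visible only through error reporting */
		return;
	}
	b->regs[idx] = val;
}

/* Non-posted read: a failed read is turned into an all-ones value. */
static uint32_t toy_read32(struct toy_bar *b, uint32_t idx)
{
	if (idx >= TOY_NUM_REGS) {
		b->aer_unsup_req++;
		return UINT32_MAX;
	}
	return b->regs[idx];
}
```

Note that an all-ones read is only a hint, since a register may legitimately
hold ~0.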

Bjorn


* Re: [Bug 216373] New: Uncorrected errors reported for AMD GPU
  2022-08-25 17:48                       ` Bjorn Helgaas
@ 2022-08-26  7:10                         ` Christian König
  0 siblings, 0 replies; 18+ messages in thread
From: Christian König @ 2022-08-26  7:10 UTC (permalink / raw)
  To: Bjorn Helgaas, Christian König
  Cc: Xinhui Pan, regressions, David Airlie, linux-pci, Lazar, Lijo,
	amd-gfx, Tom Seewald, Kai-Heng Feng, Daniel Vetter, Alex Deucher,
	Stefan Roese

Am 25.08.22 um 19:48 schrieb Bjorn Helgaas:
> On Thu, Aug 25, 2022 at 10:18:28AM +0200, Christian König wrote:
>> Am 25.08.22 um 09:54 schrieb Lazar, Lijo:
>>> On 8/25/2022 1:04 PM, Christian König wrote:
>>>> Am 25.08.22 um 08:40 schrieb Stefan Roese:
>>>>> On 24.08.22 16:45, Tom Seewald wrote:
>>>>>> On Wed, Aug 24, 2022 at 12:11 AM Lazar, Lijo
>>>>>> <lijo.lazar@amd.com> wrote:
>>>>>>> Unfortunately, I don't have any NV platforms to test. Attached is an
>>>>>>> 'untested-patch' based on your trace logs.
>>>>>> ...
>>>>> I did not follow this thread in depth, but FWICT the bug is solved now
>>>>> with this patch. So is it correct that the now fully enabled AER
>>>>> support in the PCI subsystem in v6.0 helped detect a bug in the AMD
>>>>> GPU driver?
>>>> It looks like it, but I'm not 100% sure about the rationale behind it.
>>>>
>>>> Lijo can you explain more on this?
>>> From the trace, during gmc hw_init it takes this route -
>>>
>>> gart_enable -> amdgpu_gtt_mgr_recover -> amdgpu_gart_invalidate_tlb ->
>>> amdgpu_device_flush_hdp -> amdgpu_asic_flush_hdp (non-ring based HDP
>>> flush)
>>>
>>> HDP flush is done using remapped offset which is MMIO_REG_HOLE_OFFSET
>>> (0x80000 - PAGE_SIZE)
>>>
>>> WREG32_NO_KIQ((adev->rmmio_remap.reg_offset +
>>> KFD_MMIO_REMAP_HDP_MEM_FLUSH_CNTL) >> 2, 0);
>>>
>>> However, the remapping is not yet done at this point. It's done at a
>>> later point during common block initialization. Access to the unmapped
>>> offset '(0x80000 - PAGE_SIZE)' seems to come back as unsupported request
>>> and reported through AER.
>> That's interesting behavior. So far AER always indicated some kind of
>> transmission error.
>>
>> If that also happens for accesses to unmapped areas of the MMIO BAR, then
>> we need to keep that in mind.
> AER can log many different kinds of errors, some related to hardware
> issues and some related to software.
>
> PCI writes are normally posted and get no response, so AER is the main
> way to find out about writes to unimplemented addresses.
>
> Reads do get a response, of course, and reads to unimplemented
> addresses cause errors that most hardware turns into a ~0 data return
> (in addition to reporting via AER if enabled).

The issue is that previous hardware generations reported this through a 
device specific interrupt.

It's nice to see that this is finally standardized. I'm just wondering 
if we could retire our hardware specific interrupt handler for this as well.

Christian.

>
> Bjorn



* Re: [Bug 216373] New: Uncorrected errors reported for AMD GPU
  2022-08-24  5:10           ` Lazar, Lijo
  2022-08-24 14:45             ` Tom Seewald
@ 2022-08-25 15:05             ` Felix Kuehling
  1 sibling, 0 replies; 18+ messages in thread
From: Felix Kuehling @ 2022-08-25 15:05 UTC (permalink / raw)
  To: Lazar, Lijo, Tom Seewald
  Cc: regressions, David Airlie, linux-pci, Xinhui Pan, amd-gfx,
	Kai-Heng Feng, Bjorn Helgaas, Daniel Vetter, Alex Deucher,
	Stefan Roese, Christian König


Am 2022-08-24 um 01:10 schrieb Lazar, Lijo:
>
>
> On 8/23/2022 10:34 PM, Tom Seewald wrote:
>> On Sat, Aug 20, 2022 at 2:53 AM Lazar, Lijo <lijo.lazar@amd.com> wrote:
>>>
>>> Missed the remap part, the offset is here -
>>>
>>> https://elixir.bootlin.com/linux/v6.0-rc1/source/drivers/gpu/drm/amd/amdgpu/nv.c#L680 
>>>
>>>
>>>
>>> The trace is coming from *_flush_hdp.
>>>
>>> You may also check if *_remap_hdp_registers() is getting called. It is
>>> done in nbio_vx_y files, most likely this one for your device -
>>> https://elixir.bootlin.com/linux/v6.0-rc1/source/drivers/gpu/drm/amd/amdgpu/nbio_v2_3.c#L68 
>>>
>>>
>>> Thanks,
>>> Lijo
>>
>> Hi Lijo,
>>
>> I would be happy to test any patches that you think would shed some
>> light on this.
>>
> Unfortunately, I don't have any NV platforms to test. Attached is an 
> 'untested-patch' based on your trace logs.
Hi Lijo,

I like that the patch also removes some code duplication. Can you check 
that this doesn't break GFXv8 GPUs? You may need to add a NULL-check for 
adev->nbio.funcs to the if-condition.
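
Something along these lines is what I have in mind. This is only a
standalone sketch with stand-in types (toy_adev, toy_nbio_funcs and the
is_sriov_vf field are made up here, they are not the real amdgpu
structures), and it assumes nbio.funcs can be NULL on ASICs that have no
NBIO block:

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_adev;

struct toy_nbio_funcs {
	void (*remap_hdp_registers)(struct toy_adev *adev);
};

struct toy_adev {
	struct {
		const struct toy_nbio_funcs *funcs;  /* may be NULL */
	} nbio;
	bool is_sriov_vf;
	int remap_calls;  /* bookkeeping for this sketch only */
};

static void toy_remap(struct toy_adev *adev)
{
	adev->remap_calls++;
}

static const struct toy_nbio_funcs toy_funcs = {
	.remap_hdp_registers = toy_remap,
};

/* Check the funcs table itself before dereferencing it. */
static void toy_maybe_remap_hdp(struct toy_adev *adev)
{
	if (adev->nbio.funcs &&
	    adev->nbio.funcs->remap_hdp_registers &&
	    !adev->is_sriov_vf)
		adev->nbio.funcs->remap_hdp_registers(adev);
}
```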

Regards,
   Felix


>
> Thanks,
> Lijo


* Re: [Bug 216373] New: Uncorrected errors reported for AMD GPU #forregzbot
  2022-08-18 20:38 ` [Bug 216373] New: Uncorrected errors reported for AMD GPU Bjorn Helgaas
  2022-08-19  7:05   ` Christian König
  2022-08-19 17:13   ` Bjorn Helgaas
@ 2022-08-23 16:01   ` Thorsten Leemhuis
  2 siblings, 0 replies; 18+ messages in thread
From: Thorsten Leemhuis @ 2022-08-23 16:01 UTC (permalink / raw)
  To: regressions; +Cc: linux-pci, amd-gfx

TWIMC: this mail is primarily sent for documentation purposes and for
regzbot, my Linux kernel regression tracking bot. These mails usually
contain '#forregzbot' in the subject, to make them easy to spot and filter.

[TLDR: I'm adding this regression report to the list of tracked
regressions; all text from me you find below is based on a few template
paragraphs you might have encountered already in similar form.]

Hi, this is your Linux kernel regression tracker.

On 18.08.22 22:38, Bjorn Helgaas wrote:
> [Adding amdgpu folks]
> 
> On Wed, Aug 17, 2022 at 11:45:15PM +0000, bugzilla-daemon@kernel.org wrote:
>> https://bugzilla.kernel.org/show_bug.cgi?id=216373
>>
>>             Bug ID: 216373
>>            Summary: Uncorrected errors reported for AMD GPU
>>     Kernel Version: v6.0-rc1
>>         Regression: No
>> ...
> 
> I marked this as a regression in bugzilla.
> 
>> Hardware:
>> CPU: Intel i7-12700K (Alder Lake)
>> GPU: AMD RX 6700 XT [1002:73df]
>> Motherboard: ASUS Prime Z690-A
>>
>> Problem:
>> After upgrading to v6.0-rc1 the kernel is now reporting uncorrected PCI errors
>> for my GPU.
> 
> Thank you very much for the report and for taking the trouble to
> bisect it and test Kai-Heng's patch!
> 
> I suspect that booting with "pci=noaer" should be a temporary
> workaround for this issue.  If it is, can you add that to the bugzilla
> for anybody else who trips over this?
> 
>> I have bisected this issue to: [8795e182b02dc87e343c79e73af6b8b7f9c5e635]
>> PCI/portdrv: Don't disable AER reporting in get_port_device_capability()
>> Reverting that commit causes the errors to cease.
> 
> I suspect the errors still occur, but we just don't notice and log
> them.
> 
>> I have also tried Kai-Heng Feng's patch[1] which seems to resolve a similar
>> problem, but it did not fix my issue.
>>
>> [1]
>> https://lore.kernel.org/linux-pci/20220706123244.18056-1-kai.heng.feng@canonical.com/
>>
>> dmesg snippet:
>>
>> pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received:
>> 0000:03:00.0
>> amdgpu 0000:03:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal),
>> type=Transaction Layer, (Requester ID)
>> amdgpu 0000:03:00.0:   device [1002:73df] error status/mask=00100000/00000000
>> amdgpu 0000:03:00.0:    [20] UnsupReq               (First)
>> amdgpu 0000:03:00.0: AER:   TLP Header: 40000001 0000000f 95e7f000 00000000
> 
> I think the TLP header decodes to:
> 
>   0x40000001 = 0100 0000 ... 0000 0001 binary
>   0x0000000f = 0000 0000 ... 0000 1111 binary
> 
>   Fmt           010b                 3 DW header with data
>   Type          0000b  010 0 0000    MWr Memory Write Request
>   Length        00 0000 0001b        1 DW
>   Requester ID  0x0000               00:00.0
>   Tag           0x00
>   Last DW BE    0000b                must be zero for 1 DW write
>   First DW BE   1111b                all 4 bytes in DW enabled
>   Address       0x95e7f000
>   Data          0x00000000
> 
> So I think this is a 32-bit write of zero to PCI bus address
> 0x95e7f000.
> 
> Your dmesg log says:
> 
>   pci 0000:02:00.0: PCI bridge to [bus 03]
>   pci 0000:02:00.0:   bridge window [mem 0x95e00000-0x95ffffff]
>   pci 0000:03:00.0: reg 0x24: [mem 0x95e00000-0x95efffff]
>   [drm] register mmio base: 0x95E00000
> 
> So this looks like a write to the device's BAR 5.  I don't see a PCI
> reason why this should fail.  Maybe there's some amdgpu reason?
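
For reference, the decode quoted above can be reproduced with a few shifts
and masks. This is a sketch for the 3 DW memory request layout only; the
struct and function names are made up for illustration, and the bit
positions follow the field list in the quoted text:

```c
#include <stdint.h>

/* Fields of a 3 DW memory request header, as listed in the decode above. */
struct tlp_mem3dw {
	uint8_t  fmt;           /* DW0 bits 31:29 */
	uint8_t  type;          /* DW0 bits 28:24 */
	uint16_t length_dw;     /* DW0 bits 9:0, payload length in DWs */
	uint16_t requester_id;  /* DW1 bits 31:16 */
	uint8_t  tag;           /* DW1 bits 15:8 */
	uint8_t  last_be;       /* DW1 bits 7:4 */
	uint8_t  first_be;      /* DW1 bits 3:0 */
	uint32_t address;       /* DW2 with the two low bits masked off */
};

static struct tlp_mem3dw tlp_decode_mem3dw(const uint32_t dw[3])
{
	struct tlp_mem3dw h;

	h.fmt          = (uint8_t)((dw[0] >> 29) & 0x7);
	h.type         = (uint8_t)((dw[0] >> 24) & 0x1f);
	h.length_dw    = (uint16_t)(dw[0] & 0x3ff);
	h.requester_id = (uint16_t)(dw[1] >> 16);
	h.tag          = (uint8_t)((dw[1] >> 8) & 0xff);
	h.last_be      = (uint8_t)((dw[1] >> 4) & 0xf);
	h.first_be     = (uint8_t)(dw[1] & 0xf);
	h.address      = dw[2] & ~UINT32_C(0x3);
	return h;
}

/* Offset of the target address within a BAR that starts at bar_base. */
static uint32_t tlp_bar_offset(const struct tlp_mem3dw *h, uint32_t bar_base)
{
	return h->address - bar_base;
}
```

Fed the first three header DWs from the log (40000001 0000000f 95e7f000)
and the MMIO base 0x95E00000, this gives Fmt 010b, Type 00000b, a length of
1 DW and a BAR offset of 0x7f000. With 4 KiB pages that is 0x80000 -
PAGE_SIZE.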

I'd like to add this to the tracking to ensure it's not forgotten.

#regzbot introduced: v5.19..v6.0-rc1 ^
https://bugzilla.kernel.org/show_bug.cgi?id=216373
#regzbot title: pci or amdgpu: Uncorrected errors reported for AMD GPU

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.


end of thread, other threads:[~2022-08-26  7:11 UTC | newest]

Thread overview: 18+ messages
     [not found] <bug-216373-41252@https.bugzilla.kernel.org/>
2022-08-18 20:38 ` [Bug 216373] New: Uncorrected errors reported for AMD GPU Bjorn Helgaas
2022-08-19  7:05   ` Christian König
2022-08-19  8:33     ` Lazar, Lijo
2022-08-19 11:04       ` Bjorn Helgaas
2022-08-19 17:13   ` Bjorn Helgaas
2022-08-19 19:07     ` Bjorn Helgaas
2022-08-20  7:52       ` Lazar, Lijo
2022-08-23 17:04         ` Tom Seewald
2022-08-24  5:10           ` Lazar, Lijo
2022-08-24 14:45             ` Tom Seewald
2022-08-25  6:40               ` Stefan Roese
2022-08-25  7:34                 ` Christian König
2022-08-25  7:54                   ` Lazar, Lijo
2022-08-25  8:18                     ` Christian König
2022-08-25 17:48                       ` Bjorn Helgaas
2022-08-26  7:10                         ` Christian König
2022-08-25 15:05             ` Felix Kuehling
2022-08-23 16:01   ` [Bug 216373] New: Uncorrected errors reported for AMD GPU #forregzbot Thorsten Leemhuis
