* Help debug amdgpu faults
@ 2022-11-22 11:53 Robert Beckett
  2022-11-22 14:12 ` Alex Deucher
  0 siblings, 1 reply; 4+ messages in thread
From: Robert Beckett @ 2022-11-22 11:53 UTC (permalink / raw)
  To: amd-gfx, Christian König
  Cc: Adrián Martínez Larumbe, Daniel Stone, Dmitrii Osipenko


Hi,


Does anyone know of any documentation, or can anyone provide advice, on
debugging amdgpu fault reports?


e.g:

Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:8 vmid:1 pasid:32769, for process vkcube pid 999 thread vkcube pid 999)
Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: in page starting at address 0x0000800100700000 from client 0x1b (UTCL2)
Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00101A10
Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: Faulty UTCL2 client ID: SDMA0 (0xd)
Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: MORE_FAULTS: 0x0
Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: WALKER_ERROR: 0x0
Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: PERMISSION_FAULTS: 0x1
Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: MAPPING_ERROR: 0x0
Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: RW: 0x0



See https://gitlab.freedesktop.org/drm/amd/-/issues/2267 for more context.

We have a complicated setup involving rendering then blitting to virtio-gpu exported dmabufs, with plenty of hacks in the mesa and xwayland stacks, so we are considering this our issue to debug, and not an issue with the driver at this point.
However, having debugged all the interesting parts leading to these faults, I am unable to decode the fault report to correlate to a buffer.

In the fault report, what address space is the address from?
Given that the fault handler shifts the reported address up by 12, I assume it is a 4K pfn, which makes me assume it is a physical address. Is this correct?
If so, is that a VRAM PA or a host system memory PA?
Is there any documentation for the other fields reported, like the fault status?

I'd appreciate any advice you could give to help us debug further.

Thanks

Bob


* Re: Help debug amdgpu faults
  2022-11-22 11:53 Help debug amdgpu faults Robert Beckett
@ 2022-11-22 14:12 ` Alex Deucher
  2022-11-23  7:50   ` Khatri, Sunil
  0 siblings, 1 reply; 4+ messages in thread
From: Alex Deucher @ 2022-11-22 14:12 UTC (permalink / raw)
  To: Robert Beckett
  Cc: Dmitrii Osipenko, Adrián Martínez Larumbe,
	Christian König, amd-gfx, Daniel Stone

On Tue, Nov 22, 2022 at 6:53 AM Robert Beckett
<bob.beckett@collabora.com> wrote:
>
> Hi,
>
>
> does anyone know any documentation, or can provide advice on debugging amdgpu fault reports?

This is a GPU page fault, so it refers to the GPU virtual address
space of the application.  Each process (well, each fd really) gets its
own GPU virtual address space into which system memory, system MMIO
space, or VRAM can be mapped.  The user mode drivers control their GPU
virtual address space.

>
>
> e.g:
>
> Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:8 vmid:1 pasid:32769, for process vkcube pid 999 thread vkcube pid 999)

This is the process that caused the fault.

> Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu:   in page starting at address 0x0000800100700000 from client 0x1b (UTCL2)

This is the virtual address that faulted.
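
As a rough aid for collecting these fields from a log, here is a minimal Python sketch. The two regular expressions are modelled only on the log lines quoted in this thread, so they are an assumption and may not match the wording used by other kernel versions; `parse_fault` is a hypothetical helper, not part of any existing tool.

```python
import re

# Patterns modelled only on the two log lines quoted in this thread;
# other kernel versions may word these lines differently.
FAULT_RE = re.compile(
    r"\[(?P<hub>\w+)\] page fault \(src_id:(?P<src_id>\d+) ring:(?P<ring>\d+) "
    r"vmid:(?P<vmid>\d+) pasid:(?P<pasid>\d+), for process (?P<process>\S+) "
    r"pid (?P<pid>\d+)"
)
ADDR_RE = re.compile(
    r"in page starting at address (?P<addr>0x[0-9a-fA-F]+) "
    r"from client (?P<client>0x[0-9a-fA-F]+)"
)


def parse_fault(log_text):
    """Return a dict of the fields found in an amdgpu page-fault report."""
    # Mail clients and terminals may wrap log lines, so collapse whitespace.
    flat = " ".join(log_text.split())
    fields = {}
    m = FAULT_RE.search(flat)
    if m:
        fields.update(m.groupdict())  # values stay as strings
    m = ADDR_RE.search(flat)
    if m:
        fields["addr"] = int(m.group("addr"), 16)      # GPU virtual address
        fields["client"] = int(m.group("client"), 16)
    return fields


SAMPLE = """amdgpu 0000:01:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:8
vmid:1 pasid:32769, for process vkcube pid 999 thread vkcube pid 999)
amdgpu 0000:01:00.0: amdgpu:   in page starting at address
0x0000800100700000 from client 0x1b (UTCL2)"""

info = parse_fault(SAMPLE)
print(info["process"], info["pid"], hex(info["addr"]))
```

The extracted address is a GPU virtual address in the faulting process's address space, so it is only meaningful next to that process's own mappings.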

> Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00101A10
> Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu:          Faulty UTCL2 client ID: SDMA0 (0xd)

The fault came from the SDMA engine.

> Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu:          MORE_FAULTS: 0x0
> Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu:          WALKER_ERROR: 0x0
> Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu:          PERMISSION_FAULTS: 0x1

The page was not marked as valid in the GPU page table.

> Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu:          MAPPING_ERROR: 0x0
> Nov 21 19:17:06 fedora kernel: amdgpu 0000:01:00.0: amdgpu:          RW: 0x0

SDMA attempted to read an invalid page.
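
For reference, the decoded lines above can be reproduced from the raw status word. The following Python sketch assumes a bit layout taken from my reading of the GC 10.x register masks in the amdgpu headers; that layout is not stated anywhere in this thread and may differ on other ASIC generations. The only check available here is that it yields the same values the kernel printed for 0x00101A10.

```python
# Bit layout ASSUMED from the GC 10.x register masks in the amdgpu
# headers; it is not confirmed in this thread and may vary per ASIC.
FIELDS = {
    # name: (shift, width in bits)
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),   # UTCL2 client ID, 0xd is reported as SDMA0 above
    "RW": (18, 1),   # 0 = read, 1 = write
}


def decode_fault_status(status):
    """Extract each field from a GCVM_L2_PROTECTION_FAULT_STATUS value."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


decoded = decode_fault_status(0x00101A10)
for name, value in decoded.items():
    print(f"{name}: {value:#x}")
```

With the status value from this report, the sketch gives PERMISSION_FAULTS 0x1, CID 0xd and RW 0x0, with the remaining fields zero, matching the kernel's own decode.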

>
>
>
> see https://gitlab.freedesktop.org/drm/amd/-/issues/2267 for more context.
>
> We have a complicated setup involving rendering then blitting to virtio-gpu exported dmabufs, with plenty of hacks in the mesa and xwayland stacks, so we are considering this our issue to debug, and not an issue with the driver at this point.
> However, having debugged all the interesting parts leading to these faults, I am unable to decode the fault report to correlate to a buffer.
>
> in the fault report, what address space is the address from?
> given that the fault handler shifts the reported address up by 12, I assume it is a 4K pfn which makes me assume a physical address is this correct?
> if so, is that a vram pa or a host system memory pa?
> is there any documentation for the other fields reported like the fault status etc?

See the comments above.  There is some kernel doc as well:
https://docs.kernel.org/gpu/amdgpu/driver-core.html#amdgpu-virtual-memory

>
> I'd appreciate any advice you could give to help us debug further.

Some operation you are doing in the user mode driver is reading an
invalid page.  It is possibly reading past the end of a buffer, or
something is misaligned.  Compare the faulting GPU address to the GPU
virtual address space in the application and you should be able to
track down what is happening.
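
That comparison could be sketched roughly as follows in Python. The buffer names, addresses and sizes are hypothetical, standing in for whatever the user mode driver logs when it maps a buffer into the GPU virtual address space; only the faulting address comes from the report above. Since the report is page-granular, the sketch tests overlap with the whole 4 KiB page rather than a single address.

```python
PAGE_SIZE = 0x1000  # the fault report names a 4 KiB page, not a byte


def find_buffers(fault_addr, buffers):
    """Return (buffers overlapping the faulting page,
    nearest buffer ending at or below that page)."""
    page_start = fault_addr & ~(PAGE_SIZE - 1)
    page_end = page_start + PAGE_SIZE
    hits = [b for b in buffers
            if b["va"] < page_end and b["va"] + b["size"] > page_start]
    below = [b for b in buffers if b["va"] + b["size"] <= page_start]
    # A buffer ending just below the fault hints at a read past its end.
    nearest = max(below, key=lambda b: b["va"] + b["size"], default=None)
    return hits, nearest


# HYPOTHETICAL mappings, e.g. logged by the user mode driver at map time.
buffers = [
    {"name": "render_target", "va": 0x800100000000, "size": 0x400000},
    {"name": "blit_src", "va": 0x800100400000, "size": 0x300000},
]

hits, nearest = find_buffers(0x0000800100700000, buffers)
if hits:
    print("fault inside:", [b["name"] for b in hits])
elif nearest is not None:
    gap = 0x0000800100700000 - (nearest["va"] + nearest["size"])
    print(f"no mapping; {nearest['name']} ends {gap:#x} bytes below the fault")
```

In this made-up layout no mapping covers the faulting page and `blit_src` ends exactly where the fault starts, which is the pattern you would expect from a copy that runs past the end of its source buffer.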

Alex

>
> Thanks
>
> Bob
>


* RE: Help debug amdgpu faults
  2022-11-22 14:12 ` Alex Deucher
@ 2022-11-23  7:50   ` Khatri, Sunil
  2022-11-23 15:48     ` Alex Deucher
  0 siblings, 1 reply; 4+ messages in thread
From: Khatri, Sunil @ 2022-11-23  7:50 UTC (permalink / raw)
  To: Alex Deucher, Robert Beckett
  Cc: amd-gfx, Adrián Martínez Larumbe, Koenig, Christian,
	Dmitrii Osipenko, Daniel Stone


[AMD Official Use Only - General]

Hello Alex, Robert

I am facing similar issues on Chrome. Are there any tools in the Linux environment that can help debug issues such as page faults or kernel panics caused by invalid pointer accesses?

I have used tools like a ramdump parser, which can take the ramdump after a crash and check a lot of static data in memory; even the page tables can be checked by walking through them manually. We used to load the kernel symbols along with the ramdump to go line by line.

I would appreciate it if you could point to some documentation or tools already used by the Linux graphics teams, for either UMD or KMD drivers, so that the Chrome team can also use them to debug issues.

Regards
Sunil Khatri 


* Re: Help debug amdgpu faults
  2022-11-23  7:50   ` Khatri, Sunil
@ 2022-11-23 15:48     ` Alex Deucher
  0 siblings, 0 replies; 4+ messages in thread
From: Alex Deucher @ 2022-11-23 15:48 UTC (permalink / raw)
  To: Khatri, Sunil
  Cc: Robert Beckett, Daniel Stone, Adrián Martínez Larumbe,
	amd-gfx, Dmitrii Osipenko, Koenig, Christian

On Wed, Nov 23, 2022 at 2:50 AM Khatri, Sunil <Sunil.Khatri@amd.com> wrote:
>
> [AMD Official Use Only - General]
>
> Hello Alex, Robert
>
> I am facing similar issues on Chrome. Are there any tools in the Linux environment that can help debug issues such as page faults or kernel panics caused by invalid pointer accesses?
>
> I have used tools like a ramdump parser, which can take the ramdump after a crash and check a lot of static data in memory; even the page tables can be checked by walking through them manually. We used to load the kernel symbols along with the ramdump to go line by line.
>
> I would appreciate it if you could point to some documentation or tools already used by the Linux graphics teams, for either UMD or KMD drivers, so that the Chrome team can also use them to debug issues.
>

UMR has a number of tools for dumping GPU page tables and debugging page faults:
https://gitlab.freedesktop.org/tomstdenis/umr

Alex


