* [Bug 108854] [polaris11] - Failed GPU reset after hang
@ 2018-11-24 20:41 bugzilla-daemon
  2018-12-01 18:17 ` bugzilla-daemon
                   ` (19 more replies)
  0 siblings, 20 replies; 21+ messages in thread
From: bugzilla-daemon @ 2018-11-24 20:41 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 1680 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=108854

            Bug ID: 108854
           Summary: [polaris11] - Failed GPU reset after hang
           Product: DRI
           Version: DRI git
          Hardware: x86-64 (AMD64)
                OS: Linux (All)
            Status: NEW
          Severity: normal
          Priority: medium
         Component: DRM/AMDgpu
          Assignee: dri-devel@lists.freedesktop.org
          Reporter: tseewald@gmail.com

Created attachment 142604
  --> https://bugs.freedesktop.org/attachment.cgi?id=142604&action=edit
dmesg showing the hang and failed gpu reset

Problem:

While running RuneLite [1] with GPU acceleration enabled, the system hangs
after several minutes of seemingly normal operation. Once the GPU hangs, it
attempts to reset itself but fails with the following message:

[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

This hang causes the system to lock up and ssh is the only access possible.
There is no graphical corruption; the displays are simply frozen.
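
The -125 in the error message above is a negated kernel errno value. A minimal
sketch for decoding such values follows; the helper name decode_kernel_errno is
hypothetical, and the mapping assumes the x86-64 Linux errno table, where 125
is ECANCELED.

```python
import errno
import os


def decode_kernel_errno(value, table=None):
    """Return (symbolic name, description) for a kernel errno such as -125.

    `table` defaults to the errno table of the platform running this script,
    which is only meaningful if it matches the kernel that logged the value.
    """
    table = errno.errorcode if table is None else table
    code = abs(value)
    name = table.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # On x86-64 Linux, 125 maps to ECANCELED.
    print(decode_kernel_errno(-125))
```

The sign is dropped before the lookup because kernel code returns errors as
negative numbers, while the errno table is indexed by the positive value.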

System Information:

GPU: POLARIS11 - RX 560 4GB (1002:67ff)
Mesa: 18.0.5
X11: 1.19.6
Firmware files should be the latest as I've pulled them from agd5f's repo [2].

Kernel parameters: "quiet splash scsi_mod.use_blk_mq=1 apparmor=2
security=apparmor amdgpu.gpu_recovery=1 spectre_v2=off"
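
To confirm which of these parameters the running kernel actually received,
/proc/cmdline can be parsed. This is a minimal sketch; the helper name
parse_cmdline is hypothetical, and the fallback string is simply the parameter
list quoted above.

```python
import shlex


def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in shlex.split(cmdline):
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            text = f.read()
    except OSError:
        # Not on Linux (or /proc unavailable): use the parameters from this report.
        text = ("quiet splash scsi_mod.use_blk_mq=1 apparmor=2 "
                "security=apparmor amdgpu.gpu_recovery=1 spectre_v2=off")
    params = parse_cmdline(text)
    print("amdgpu.gpu_recovery =", params.get("amdgpu.gpu_recovery"))
```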

I have reproduced this issue on:

4.20-rc3
amd-staging-drm-next (as of commit 1179994039abc10aab0d2f0ecfc4c65dfbd77438)

[1] https://github.com/runelite/runelite
[2] https://people.freedesktop.org/~agd5f/radeon_ucode/

-- 
You are receiving this mail because:
You are the assignee for the bug.

[-- Attachment #1.2: Type: text/html, Size: 3274 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug 108854] [polaris11] - Failed GPU reset after hang
  2018-11-24 20:41 [Bug 108854] [polaris11] - Failed GPU reset after hang bugzilla-daemon
@ 2018-12-01 18:17 ` bugzilla-daemon
  2018-12-08 18:20 ` bugzilla-daemon
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: bugzilla-daemon @ 2018-12-01 18:17 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 1883 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=108854

--- Comment #1 from Tom Seewald <tseewald@gmail.com> ---
I can confirm this is still happening on 4.20-rc4, as well as with more
up-to-date userspace software.

libdrm: 3.27.0
Mesa: 18.2.4

The hangs can be reliably reproduced at least as far back as kernel 4.15, so I
am not confident I can bisect this.

Here is a dump of my card's firmware information in case I missed an update.

# cat /sys/kernel/debug/dri/1/amdgpu_firmware_info

VCE feature version: 0, firmware version: 0x34040300
UVD feature version: 0, firmware version: 0x01821000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 47, firmware version: 0x000000a2
PFP feature version: 47, firmware version: 0x000000f0
CE feature version: 47, firmware version: 0x00000089
RLC feature version: 1, firmware version: 0x00000035
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
MEC feature version: 47, firmware version: 0x000002cb
MEC2 feature version: 47, firmware version: 0x000002cb
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 0, firmware version: 0x00000000
SMC feature version: 0, firmware version: 0x001d0900
SDMA0 feature version: 31, firmware version: 0x00000036
SDMA1 feature version: 0, firmware version: 0x00000036
VCN feature version: 0, firmware version: 0x00000000
DMCU feature version: 0, firmware version: 0x00000000
VBIOS version: 113-C98121-M01
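
A dump like the one above is easier to compare across firmware updates once it
is parsed. The following is a minimal sketch that assumes the line format shown
here ("<block> feature version: N, firmware version: 0x..."); the function
names are hypothetical, and lines that do not match (such as the VBIOS line)
are skipped.

```python
import re

LINE_RE = re.compile(
    r"^(?P<block>.+?) feature version: (?P<feature>\d+), "
    r"firmware version: (?P<fw>0x[0-9a-fA-F]+)$"
)


def parse_firmware_info(text):
    """Map each firmware block name to (feature_version, firmware_version)."""
    info = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            info[m.group("block")] = (int(m.group("feature")),
                                      int(m.group("fw"), 16))
    return info


def nonzero_blocks(info):
    """Names of blocks that report a non-zero firmware version."""
    return sorted(name for name, (_, fw) in info.items() if fw != 0)


if __name__ == "__main__":
    sample = """\
VCE feature version: 0, firmware version: 0x34040300
MC feature version: 0, firmware version: 0x00000000
ME feature version: 47, firmware version: 0x000000a2
VBIOS version: 113-C98121-M01
"""
    print(nonzero_blocks(parse_firmware_info(sample)))
```

Comparing the dictionaries returned by parse_firmware_info before and after a
firmware update shows whether any block's version changed.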

Would umr[1] be useful here? I have not used it before, so I'd need some
guidance on what arguments would produce output relevant to this hang.

Any help is appreciated.

[1] https://cgit.freedesktop.org/amd/umr/


* [Bug 108854] [polaris11] - Failed GPU reset after hang
  2018-11-24 20:41 [Bug 108854] [polaris11] - Failed GPU reset after hang bugzilla-daemon
  2018-12-01 18:17 ` bugzilla-daemon
@ 2018-12-08 18:20 ` bugzilla-daemon
  2018-12-08 18:27 ` bugzilla-daemon
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: bugzilla-daemon @ 2018-12-08 18:20 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 329 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=108854

--- Comment #2 from Tom Seewald <tseewald@gmail.com> ---
Created attachment 142754
  --> https://bugs.freedesktop.org/attachment.cgi?id=142754&action=edit
dmesg of 4.20-rc5 with drm.debug=0xe


* [Bug 108854] [polaris11] - Failed GPU reset after hang
  2018-11-24 20:41 [Bug 108854] [polaris11] - Failed GPU reset after hang bugzilla-daemon
  2018-12-01 18:17 ` bugzilla-daemon
  2018-12-08 18:20 ` bugzilla-daemon
@ 2018-12-08 18:27 ` bugzilla-daemon
  2019-01-14 16:54 ` bugzilla-daemon
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: bugzilla-daemon @ 2018-12-08 18:27 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 4028 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=108854

--- Comment #3 from Tom Seewald <tseewald@gmail.com> ---
Installed the new Polaris firmware released on December 3rd; however, that
doesn't appear to affect my card, as the content of
/sys/kernel/debug/dri/1/amdgpu_firmware_info is unchanged.

Upgraded to Mesa 18.3.0 from 18.2.4 - no change.

Added dmesg of 4.20-rc5 with drm.debug=0xe, showing the hang. It now prints
backtraces of hung kernel tasks rather than "[drm:amdgpu_cs_ioctl [amdgpu]]
*ERROR* Failed to initialize parser -125!".

I've also included the power management information before and after the GPU
hang.

/sys/kernel/debug/dri/1/amdgpu_pm_info *before* GPU hang:

Clock Gating Flags Mask: 0x3fbcf
        Graphics Medium Grain Clock Gating: On
        Graphics Medium Grain memory Light Sleep: On
        Graphics Coarse Grain Clock Gating: On
        Graphics Coarse Grain memory Light Sleep: On
        Graphics Coarse Grain Tree Shader Clock Gating: Off
        Graphics Coarse Grain Tree Shader Light Sleep: Off
        Graphics Command Processor Light Sleep: On
        Graphics Run List Controller Light Sleep: On
        Graphics 3D Coarse Grain Clock Gating: Off
        Graphics 3D Coarse Grain memory Light Sleep: Off
        Memory Controller Light Sleep: On
        Memory Controller Medium Grain Clock Gating: On
        System Direct Memory Access Light Sleep: Off
        System Direct Memory Access Medium Grain Clock Gating: On
        Bus Interface Medium Grain Clock Gating: Off
        Bus Interface Light Sleep: On
        Unified Video Decoder Medium Grain Clock Gating: On
        Video Compression Engine Medium Grain Clock Gating: On
        Host Data Path Light Sleep: On
        Host Data Path Medium Grain Clock Gating: On
        Digital Right Management Medium Grain Clock Gating: Off
        Digital Right Management Light Sleep: Off
        Rom Medium Grain Clock Gating: On
        Data Fabric Medium Grain Clock Gating: Off

GFX Clocks and Power:
        1750 MHz (MCLK)
        1196 MHz (SCLK)
        387 MHz (PSTATE_SCLK)
        625 MHz (PSTATE_MCLK)
        993 mV (VDDGFX)
        20.30 W (average GPU)

GPU Temperature: 38 C
GPU Load: 0 %

UVD: Disabled

VCE: Disabled


/sys/kernel/debug/dri/1/amdgpu_pm_info *after* GPU hang:

Clock Gating Flags Mask: 0x6400
        Graphics Medium Grain Clock Gating: Off
        Graphics Medium Grain memory Light Sleep: Off
        Graphics Coarse Grain Clock Gating: Off
        Graphics Coarse Grain memory Light Sleep: Off
        Graphics Coarse Grain Tree Shader Clock Gating: Off
        Graphics Coarse Grain Tree Shader Light Sleep: Off
        Graphics Command Processor Light Sleep: Off
        Graphics Run List Controller Light Sleep: Off
        Graphics 3D Coarse Grain Clock Gating: Off
        Graphics 3D Coarse Grain memory Light Sleep: Off
        Memory Controller Light Sleep: Off
        Memory Controller Medium Grain Clock Gating: Off
        System Direct Memory Access Light Sleep: On
        System Direct Memory Access Medium Grain Clock Gating: Off
        Bus Interface Medium Grain Clock Gating: Off
        Bus Interface Light Sleep: Off
        Unified Video Decoder Medium Grain Clock Gating: On
        Video Compression Engine Medium Grain Clock Gating: On
        Host Data Path Light Sleep: Off
        Host Data Path Medium Grain Clock Gating: Off
        Digital Right Management Medium Grain Clock Gating: Off
        Digital Right Management Light Sleep: Off
        Rom Medium Grain Clock Gating: Off
        Data Fabric Medium Grain Clock Gating: Off

GFX Clocks and Power:
        1750 MHz (MCLK)
        1196 MHz (SCLK)
        387 MHz (PSTATE_SCLK)
        625 MHz (PSTATE_MCLK)
        993 mV (VDDGFX)
        28.186 W (average GPU)

GPU Temperature: 42 C
GPU Load: 100 %

UVD: Disabled

VCE: Disabled
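
The two dumps above differ mainly in their clock gating flags. The following is
a minimal sketch for listing exactly which flags changed, assuming the
"Name: On" / "Name: Off" line format shown here; the function names are
hypothetical.

```python
def parse_gating_flags(text):
    """Collect 'Name: On|Off' lines from an amdgpu_pm_info dump into a dict."""
    flags = {}
    for line in text.splitlines():
        name, sep, state = line.strip().rpartition(": ")
        if sep and state in ("On", "Off"):
            flags[name] = (state == "On")
    return flags


def changed_flags(before, after):
    """Return {name: (before, after)} for every flag whose state differs."""
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }


if __name__ == "__main__":
    before = parse_gating_flags("""
        Graphics Medium Grain Clock Gating: On
        System Direct Memory Access Light Sleep: Off
        Bus Interface Medium Grain Clock Gating: Off
    """)
    after = parse_gating_flags("""
        Graphics Medium Grain Clock Gating: Off
        System Direct Memory Access Light Sleep: On
        Bus Interface Medium Grain Clock Gating: Off
    """)
    for name, (old, new) in changed_flags(before, after).items():
        print(f"{name}: {'On' if old else 'Off'} -> {'On' if new else 'Off'}")
```

Lines such as "GPU Load: 0 %" or "UVD: Disabled" are ignored because their
values are neither "On" nor "Off".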


* [Bug 108854] [polaris11] - Failed GPU reset after hang
  2018-11-24 20:41 [Bug 108854] [polaris11] - Failed GPU reset after hang bugzilla-daemon
                   ` (2 preceding siblings ...)
  2018-12-08 18:27 ` bugzilla-daemon
@ 2019-01-14 16:54 ` bugzilla-daemon
  2019-01-14 16:55 ` bugzilla-daemon
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: bugzilla-daemon @ 2019-01-14 16:54 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 343 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=108854

--- Comment #4 from Tom Seewald <tseewald@gmail.com> ---
Created attachment 143107
  --> https://bugs.freedesktop.org/attachment.cgi?id=143107&action=edit
amd-staging-drm-next dmesg as of January 14th 2019


* [Bug 108854] [polaris11] - Failed GPU reset after hang
  2018-11-24 20:41 [Bug 108854] [polaris11] - Failed GPU reset after hang bugzilla-daemon
                   ` (3 preceding siblings ...)
  2019-01-14 16:54 ` bugzilla-daemon
@ 2019-01-14 16:55 ` bugzilla-daemon
  2019-01-14 16:55 ` bugzilla-daemon
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: bugzilla-daemon @ 2019-01-14 16:55 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 330 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=108854

--- Comment #5 from Tom Seewald <tseewald@gmail.com> ---
Created attachment 143108
  --> https://bugs.freedesktop.org/attachment.cgi?id=143108&action=edit
UMR wave dump as of January 14th 2019


* [Bug 108854] [polaris11] - Failed GPU reset after hang
  2018-11-24 20:41 [Bug 108854] [polaris11] - Failed GPU reset after hang bugzilla-daemon
                   ` (4 preceding siblings ...)
  2019-01-14 16:55 ` bugzilla-daemon
@ 2019-01-14 16:55 ` bugzilla-daemon
  2019-01-14 16:57 ` bugzilla-daemon
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: bugzilla-daemon @ 2019-01-14 16:55 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 334 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=108854

--- Comment #6 from Tom Seewald <tseewald@gmail.com> ---
Created attachment 143109
  --> https://bugs.freedesktop.org/attachment.cgi?id=143109&action=edit
UMR gfx ring dump as of January 14th 2019


* [Bug 108854] [polaris11] - Failed GPU reset after hang
  2018-11-24 20:41 [Bug 108854] [polaris11] - Failed GPU reset after hang bugzilla-daemon
                   ` (5 preceding siblings ...)
  2019-01-14 16:55 ` bugzilla-daemon
@ 2019-01-14 16:57 ` bugzilla-daemon
  2019-01-14 17:05 ` bugzilla-daemon
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: bugzilla-daemon @ 2019-01-14 16:57 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 305 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=108854

--- Comment #7 from Tom Seewald <tseewald@gmail.com> ---
Created attachment 143110
  --> https://bugs.freedesktop.org/attachment.cgi?id=143110&action=edit
UMR gpu info


* [Bug 108854] [polaris11] - Failed GPU reset after hang
  2018-11-24 20:41 [Bug 108854] [polaris11] - Failed GPU reset after hang bugzilla-daemon
                   ` (6 preceding siblings ...)
  2019-01-14 16:57 ` bugzilla-daemon
@ 2019-01-14 17:05 ` bugzilla-daemon
  2019-01-14 17:42 ` bugzilla-daemon
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: bugzilla-daemon @ 2019-01-14 17:05 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 1177 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=108854

--- Comment #8 from Tom Seewald <tseewald@gmail.com> ---
I've reproduced this issue on amd-staging-drm-next and have attached a UMR wave
and gfx ring dump, along with a new dmesg.  To clarify, this issue also
prevents me from rebooting/shutting down my computer, and I am forced to hold
the power button.

Here are the version strings of the relevant software I'm running:

Kernel: amd-staging-drm-next (commit: d2d07f246b126b23d02af0603b83866a3c3e2483)
Mesa: 18.3.1
Xorg: 1.19.6
UMR: 016bc2e93af2cac7a9bd790f7fcacb1ffdadc819

This is my first attempt at using UMR to get information about this system
hang.  I'm essentially just copying what Andrey Grodzovsky suggested in a
previous thread[0].

Here are the umr commands used to gather the information:

Waves dump: umr -i 1 -O verbose,halt_waves -wa

GFX ring dump: umr -i 1 -O verbose,follow -R gfx[.]

GFX info: umr -i 1 -e

I've attached the output of these to the bugzilla report.
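
For repeat captures, the three commands above can be wrapped in a small script.
This is a minimal sketch, not an official umr workflow: the function names and
output file names are hypothetical, and the only umr arguments used are the
ones quoted in this comment. umr typically needs root privileges to read the
debugfs files.

```python
import shlex
import shutil
import subprocess


def umr_commands(instance=1):
    """Return the three umr invocations from this comment as argv lists."""
    base = ["umr", "-i", str(instance)]
    return {
        "waves": base + ["-O", "verbose,halt_waves", "-wa"],
        "gfx_ring": base + ["-O", "verbose,follow", "-R", "gfx[.]"],
        "gpu_info": base + ["-e"],
    }


def capture(instance=1, prefix="umr"):
    """Run each command and save stdout and stderr to <prefix>_<name>.txt."""
    if shutil.which("umr") is None:
        raise RuntimeError("umr not found on PATH")
    for name, argv in umr_commands(instance).items():
        result = subprocess.run(argv, capture_output=True, text=True)
        with open(f"{prefix}_{name}.txt", "w") as out:
            out.write(result.stdout)
            # Keep stderr as well, since diagnostics may be printed there.
            out.write(result.stderr)


if __name__ == "__main__":
    # Only print the commands here; call capture() on the affected machine.
    for name, argv in umr_commands(1).items():
        print(f"{name}: {shlex.join(argv)}")
```

Passing the arguments as a list avoids shell globbing of "gfx[.]".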

[0] https://lists.freedesktop.org/archives/amd-gfx/2018-December/029790.html


* [Bug 108854] [polaris11] - Failed GPU reset after hang
  2018-11-24 20:41 [Bug 108854] [polaris11] - Failed GPU reset after hang bugzilla-daemon
                   ` (7 preceding siblings ...)
  2019-01-14 17:05 ` bugzilla-daemon
@ 2019-01-14 17:42 ` bugzilla-daemon
  2019-01-14 17:43 ` bugzilla-daemon
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: bugzilla-daemon @ 2019-01-14 17:42 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 257 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=108854

--- Comment #9 from Tom Seewald <tseewald@gmail.com> ---
I temporarily upgraded to Xorg 1.20, and the issue still occurs.


* [Bug 108854] [polaris11] - Failed GPU reset after hang
  2018-11-24 20:41 [Bug 108854] [polaris11] - Failed GPU reset after hang bugzilla-daemon
                   ` (8 preceding siblings ...)
  2019-01-14 17:42 ` bugzilla-daemon
@ 2019-01-14 17:43 ` bugzilla-daemon
  2019-01-23  0:20 ` [Bug 108854] [polaris11] - GPU Hang - ring gfx timeout bugzilla-daemon
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: bugzilla-daemon @ 2019-01-14 17:43 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 330 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=108854

--- Comment #10 from Tom Seewald <tseewald@gmail.com> ---
Created attachment 143113
  --> https://bugs.freedesktop.org/attachment.cgi?id=143113&action=edit
dmesg with xorg 1.20, kernel 5.0-rc2


* [Bug 108854] [polaris11] - GPU Hang - ring gfx timeout
  2018-11-24 20:41 [Bug 108854] [polaris11] - Failed GPU reset after hang bugzilla-daemon
                   ` (9 preceding siblings ...)
  2019-01-14 17:43 ` bugzilla-daemon
@ 2019-01-23  0:20 ` bugzilla-daemon
  2019-01-25 17:24 ` bugzilla-daemon
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: bugzilla-daemon @ 2019-01-23  0:20 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 454 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=108854

Tom Seewald <tseewald@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[polaris11] - Failed GPU    |[polaris11] - GPU Hang -
                   |reset after hang            |ring gfx timeout


* [Bug 108854] [polaris11] - GPU Hang - ring gfx timeout
  2018-11-24 20:41 [Bug 108854] [polaris11] - Failed GPU reset after hang bugzilla-daemon
                   ` (10 preceding siblings ...)
  2019-01-23  0:20 ` [Bug 108854] [polaris11] - GPU Hang - ring gfx timeout bugzilla-daemon
@ 2019-01-25 17:24 ` bugzilla-daemon
  2019-01-25 17:26 ` bugzilla-daemon
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: bugzilla-daemon @ 2019-01-25 17:24 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 396 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=108854

--- Comment #11 from Alex Deucher <alexdeucher@gmail.com> ---
The reset was actually successful.  The problem is, userspace components need
to be aware of the reset and recreate their contexts.  As a workaround, you can
kill the problematic app or restart X.


* [Bug 108854] [polaris11] - GPU Hang - ring gfx timeout
  2018-11-24 20:41 [Bug 108854] [polaris11] - Failed GPU reset after hang bugzilla-daemon
                   ` (11 preceding siblings ...)
  2019-01-25 17:24 ` bugzilla-daemon
@ 2019-01-25 17:26 ` bugzilla-daemon
  2019-02-10  1:01 ` bugzilla-daemon
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: bugzilla-daemon @ 2019-01-25 17:26 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 641 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=108854

Alex Deucher <alexdeucher@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|DRM/AMDgpu                  |Drivers/Gallium/radeonsi
            Product|DRI                         |Mesa
         QA Contact|                            |dri-devel@lists.freedesktop
                   |                            |.org
            Version|DRI git                     |unspecified


* [Bug 108854] [polaris11] - GPU Hang - ring gfx timeout
  2018-11-24 20:41 [Bug 108854] [polaris11] - Failed GPU reset after hang bugzilla-daemon
                   ` (12 preceding siblings ...)
  2019-01-25 17:26 ` bugzilla-daemon
@ 2019-02-10  1:01 ` bugzilla-daemon
  2019-02-21 17:56 ` bugzilla-daemon
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: bugzilla-daemon @ 2019-02-10  1:01 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 935 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=108854

--- Comment #12 from Tom Seewald <tseewald@gmail.com> ---
(In reply to Alex Deucher from comment #11)
> The reset was actually successful.  The problem is, userspace components
> need to be aware of the reset and recreate their contexts.  As a workaround,
> you can kill the problematic app or restart X.

Hmm, but then why will the machine not restart unless I use sysrq keys? I would
think a userspace issue wouldn't cause hung kernel tasks like that.

I'm also curious why this program is causing the GPU to reset to begin with;
I have not seen others reporting issues on other platforms with this program.

Is this ring gfx timeout purely a problem with userspace?

e.g.
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled
seq=32203, emitted seq=32205
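
The timeout message above can be pulled out of a dmesg log mechanically. This
is a minimal sketch that assumes the message format shown here and treats the
two sequence numbers as counters of completed and submitted work, so their
difference is the number of submissions still outstanding; the function name
and the "pending" field are hypothetical.

```python
import re

# \s+ is used between words so the pattern also matches when the log line
# has been wrapped, as it is in this email.
TIMEOUT_RE = re.compile(
    r"ring\s+(?P<ring>\S+)\s+timeout,\s+signaled\s+seq=(?P<signaled>\d+),"
    r"\s+emitted\s+seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(log_text):
    """Return ring name and sequence numbers from the first timeout message."""
    m = TIMEOUT_RE.search(log_text)
    if m is None:
        return None
    signaled = int(m.group("signaled"))
    emitted = int(m.group("emitted"))
    return {
        "ring": m.group("ring"),
        "signaled": signaled,
        "emitted": emitted,
        "pending": emitted - signaled,
    }


if __name__ == "__main__":
    sample = ("[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
              "signaled\nseq=32203, emitted seq=32205")
    print(parse_ring_timeout(sample))
```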


* [Bug 108854] [polaris11] - GPU Hang - ring gfx timeout
  2018-11-24 20:41 [Bug 108854] [polaris11] - Failed GPU reset after hang bugzilla-daemon
                   ` (13 preceding siblings ...)
  2019-02-10  1:01 ` bugzilla-daemon
@ 2019-02-21 17:56 ` bugzilla-daemon
  2019-02-21 19:21 ` bugzilla-daemon
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: bugzilla-daemon @ 2019-02-21 17:56 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 325 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=108854

--- Comment #13 from Tom St Denis <tom.stdenis@amd.com> ---
The wave dump seems to be empty... Is that the complete output?  Was there
anything printed to stderr (like there are no waves)?


* [Bug 108854] [polaris11] - GPU Hang - ring gfx timeout
  2018-11-24 20:41 [Bug 108854] [polaris11] - Failed GPU reset after hang bugzilla-daemon
                   ` (14 preceding siblings ...)
  2019-02-21 17:56 ` bugzilla-daemon
@ 2019-02-21 19:21 ` bugzilla-daemon
  2019-02-22 17:03 ` bugzilla-daemon
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: bugzilla-daemon @ 2019-02-21 19:21 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 1144 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=108854

--- Comment #14 from Tom Seewald <tseewald@gmail.com> ---
(In reply to Tom St Denis from comment #13)
> The wave dump seems to be empty... Is that the complete output?  Was there
> anything printed to stderr (like there are no waves)?

Yes, it says "no active waves!", so it makes sense that it is empty.  Is there
something else you'd like me to try?

Currently I'm running "umr -i 1 -O verbose,halt_waves -wa" immediately after I
see the "ring gfx timeout" in dmesg.  I also just rebuilt UMR so I should be up
to date.

Some potentially good news though, after upgrading from mesa 18.3.1 to 18.3.3,
I have not been able to reproduce the issue. On mesa 18.3.1 and earlier I can
reproduce it in under 20 seconds (I did so today on the latest
amd-staging-drm-next), and I have tested mesa 18.3.3 for about an hour now.

But I believe this is still something to look into as user space should
probably not be able to hang the entire system, even if the user is running an
older version of Mesa.


* [Bug 108854] [polaris11] - GPU Hang - ring gfx timeout
  2018-11-24 20:41 [Bug 108854] [polaris11] - Failed GPU reset after hang bugzilla-daemon
                   ` (15 preceding siblings ...)
  2019-02-21 19:21 ` bugzilla-daemon
@ 2019-02-22 17:03 ` bugzilla-daemon
  2019-02-22 18:15 ` bugzilla-daemon
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: bugzilla-daemon @ 2019-02-22 17:03 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 459 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=108854

--- Comment #15 from Tom St Denis <tom.stdenis@amd.com> ---
If you can't reproduce on a newer version of mesa then it's "been fixed" :-)

Blocking shutdown is simply due to the device deinit being blocked because the
device is not in an operational state.  Not much to be done from a driver point
of view I don't think.


* [Bug 108854] [polaris11] - GPU Hang - ring gfx timeout
  2018-11-24 20:41 [Bug 108854] [polaris11] - Failed GPU reset after hang bugzilla-daemon
                   ` (16 preceding siblings ...)
  2019-02-22 17:03 ` bugzilla-daemon
@ 2019-02-22 18:15 ` bugzilla-daemon
  2019-02-22 18:31 ` bugzilla-daemon
  2019-02-22 18:46 ` bugzilla-daemon
  19 siblings, 0 replies; 21+ messages in thread
From: bugzilla-daemon @ 2019-02-22 18:15 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 1620 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=108854

--- Comment #16 from Tom Seewald <tseewald@gmail.com> ---
(In reply to Tom St Denis from comment #15)
> If you can't reproduce on a newer version of mesa then it's "been fixed" :-)

My (probably incorrect) understanding is roughly this:

    +-----------------+
1.) |   Application   |
    +--------+--------+
             |
             | Possibly sending bad commands/calls to Mesa
             |
             v
    +--------+--------+
2.) |      Mesa       |
    +--------+--------+
             |
             | Passing on bad calls from the application
             |     or
             | There is a bug in Mesa itself where it is sending bad
             | calls/commands to the kernel
             v
    +--------+--------+
3.) |  Kernel/amdgpu  |
    +--------+--------+
             |
             | amdgpu puts the physical device in a bad state due to bad
             | commands from Mesa
             v
    +--------+--------+
4.) |       GPU       |
    +--------+--------+

Given that mesa 18.3.3+ "fixes" the issue, it sounds like a specific case of
mesa sending garbage to the kernel (step 2 to 3) has been fixed.

But in general shouldn't the kernel driver (ideally) be able to handle mesa
passing malformed/bad commands rather than freezing the device (step 3 to 4)? 
I understand not every case can be covered, and I also understand that GPU
resets need to be supported in user space for seamless recovery, but shouldn't
the driver "unstick" itself enough so the computer can be rebooted normally?

Thanks for your time and patience.

* [Bug 108854] [polaris11] - GPU Hang - ring gfx timeout
  2018-11-24 20:41 [Bug 108854] [polaris11] - Failed GPU reset after hang bugzilla-daemon
                   ` (17 preceding siblings ...)
  2019-02-22 18:15 ` bugzilla-daemon
@ 2019-02-22 18:31 ` bugzilla-daemon
  2019-02-22 18:46 ` bugzilla-daemon
  19 siblings, 0 replies; 21+ messages in thread
From: bugzilla-daemon @ 2019-02-22 18:31 UTC (permalink / raw)
  To: dri-devel


https://bugs.freedesktop.org/show_bug.cgi?id=108854

--- Comment #17 from Alex Deucher <alexdeucher@gmail.com> ---
(In reply to Tom Seewald from comment #16)

> But in general shouldn't the kernel driver (ideally) be able to handle mesa
> passing malformed/bad commands rather than freezing the device (step 3 to
> 4)?  I understand not every case can be covered, and I also understand that
> GPU resets need to be supported in user space for seamless recovery, but
> shouldn't the driver "unstick" itself enough so the computer can be rebooted
> normally?

This is generally not bad data from mesa per se.  There's not really a good
way to validate whether every combination of state sent to the GPU is valid.
There are hundreds of registers and state buffers that the GPU uses to process
the 3D pipeline.  It's impossible to test every combination of state and
dispatch and ordering.  The hangs are generally due to a deadlock in the hw due
to a bad interaction of states set by the application.  E.g., some hw block is
waiting on a signal from another hw block which won't get sent because the user
sent another state update which stops that signal.
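
To put a rough number on the validation point above, here is a back-of-the-envelope
sketch. The figures are assumptions chosen for illustration only (100 registers,
2 settings each, one billion checks per second); they are not taken from any
hardware documentation, and the count ignores dispatch ordering, which would
only make it larger.

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365


def state_combinations(num_registers: int, values_per_register: int) -> int:
    """Number of distinct register-state combinations, ignoring ordering."""
    return values_per_register ** num_registers


def years_to_check(combinations: int, checks_per_second: int) -> int:
    """Whole years needed to test every combination at a fixed rate."""
    return combinations // (checks_per_second * SECONDS_PER_YEAR)


# Assumed figures: 100 registers, 2 settings each, 1e9 checks per second.
combos = state_combinations(100, 2)
years = years_to_check(combos, 10**9)
```

Even with these deliberately small assumptions the exhaustive check would take
on the order of 10^13 years, which is why exhaustive validation is not practical.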

The GPU reset should generally be able to recover the GPU, but in some cases you
may end up with a deadlock in sw in the kernel somewhere.
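
The hang-then-reset sequence described above can be modeled with a toy sketch.
This is not the amdgpu implementation: the names `run_job`, `signal`, and
`fence` are hypothetical, and Python threads merely stand in for hardware
blocks. One block waits on a signal from another, a state update suppresses
that signal, and a watchdog notices that the job never completed within its
timeout.

```python
import threading


def run_job(suppress_signal: bool, timeout_s: float = 0.2) -> str:
    """Toy model of a job whose completion depends on an inter-block signal."""
    signal = threading.Event()  # stands in for the signal between hw blocks
    fence = threading.Event()   # stands in for the job-completion fence

    def producer() -> None:
        # A later state update (hypothetical) stops the signal being sent.
        if not suppress_signal:
            signal.set()

    def consumer() -> None:
        # Real hardware would wait forever; the wait is bounded here only
        # so that the toy thread always exits.
        if signal.wait(timeout=timeout_s * 2):
            fence.set()

    threads = [
        threading.Thread(target=consumer),
        threading.Thread(target=producer),
    ]
    for t in threads:
        t.start()

    # Watchdog: if the fence does not signal in time, treat it as a hang.
    completed = fence.wait(timeout=timeout_s)

    for t in threads:
        t.join()
    return "completed" if completed else "timeout: reset needed"


if __name__ == "__main__":
    print(run_job(suppress_signal=False))
    print(run_job(suppress_signal=True))
```

The watchdog only detects that the job stalled; whether the subsequent reset
succeeds is a separate question, which is the case this bug report ran into.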

* [Bug 108854] [polaris11] - GPU Hang - ring gfx timeout
  2018-11-24 20:41 [Bug 108854] [polaris11] - Failed GPU reset after hang bugzilla-daemon
                   ` (18 preceding siblings ...)
  2019-02-22 18:31 ` bugzilla-daemon
@ 2019-02-22 18:46 ` bugzilla-daemon
  19 siblings, 0 replies; 21+ messages in thread
From: bugzilla-daemon @ 2019-02-22 18:46 UTC (permalink / raw)
  To: dri-devel


https://bugs.freedesktop.org/show_bug.cgi?id=108854

Tom Seewald <tseewald@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED

--- Comment #18 from Tom Seewald <tseewald@gmail.com> ---
Thanks Tom and Alex, I'll trust your judgement on this.
