* [Bug 109403] amdgpu randomly hangs while streaming or when CPU is busy on X399 with TR 1950X
From: bugzilla-daemon @ 2019-01-21 10:36 UTC
  To: dri-devel


https://bugs.freedesktop.org/show_bug.cgi?id=109403

            Bug ID: 109403
           Summary: amdgpu randomly hangs while streaming or when CPU is
                    busy on X399 with TR 1950X
           Product: DRI
           Version: unspecified
          Hardware: x86-64 (AMD64)
                OS: Linux (All)
            Status: NEW
          Severity: normal
          Priority: medium
         Component: DRM/AMDgpu
          Assignee: dri-devel@lists.freedesktop.org
          Reporter: 1@provod.gl

I've been experiencing random GPU hangs since I upgraded to Threadripper about
a year ago.

Specs:
- Motherboard: ASUS Prime X399-A, all BIOS versions from stock through the
current 0808
- CPU: Threadripper 1950X, 32 threads
- GPU: MSI Radeon RX Vega 64 Air Boost 8G OC (the hangs also happened with an
ASUS R9 Fury X in the same machine; that GPU was generally stable in the
previous box)
- Displays:
   - 2x DELL U2412M 1920x1200x60 (DP)
   - 1x ASUS MG279Q 2560x1440x144 (DP)
- Kernel versions: 4.20, 5.0-rc2 (this has been happening since at least 4.14;
earlier versions weren't tried).
- linux-firmware: 20181218
- Mesa: 18.3.1
- X: 1.20.3
- libdrm: 2.4.96
- Possibly relevant kernel options: amd_iommu=on
vfio-pci.ids=10de:1005,10de:0e1a,1912:0014,1106:3483 iommu=pt
vfio-pci.disable_vga=1 hpet=disable nohpet amdgpu.ppfeaturemask=0xfffd7fff
amdgpu.gpu_recovery=1 pcie_aspm=off
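
For reference, the amdgpu.ppfeaturemask value above can be decoded into the
bit positions it clears. This is a minimal Python sketch: the helper name
cleared_bits is hypothetical, and it only reports bit numbers. What each bit
means is defined by the amdgpu driver's PP_FEATURE_MASK values, which may
differ between kernel versions and are not reproduced here.

```python
def cleared_bits(mask: int, width: int = 32) -> list[int]:
    """Return the positions of the bits that are 0 in a width-bit mask."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]


# Value taken from the kernel command line in this report.
mask = 0xfffd7fff
print(f"{mask:#010x} clears bits {cleared_bits(mask)}")
```

Every bit left at 1 keeps the corresponding powerplay feature enabled, so this
mask disables exactly the features assigned to the reported bit positions.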

The problem usually manifests like this:
1. The screen suddenly freezes (sometimes it is possible to move the mouse
cursor for a few seconds, but it eventually freezes too)
2. The GPU fan speeds up and remains high
3. Every process that talks to the GPU freezes and becomes impossible to kill.
4. I can still SSH into the machine, and everything besides the GPU works ok.
5. dmesg contains a message like this:
                [Jan21 00:03] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
gfx timeout, signaled seq=17188686, emitted seq=17188689
                [  +0.000032] [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
Process information: process X pid 9315 thread X:cs0 pid 9335
        or with a bit more stuff happening before:
                [Jan18 19:43] amdgpu 0000:44:00.0: [gfxhub] VMC page fault
(src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225
thread superposit:cs0 pid 11308)
                [  +0.000003] amdgpu 0000:44:00.0:   in page starting at
address 0x0000800010607000 from 27
                [  +0.000002] amdgpu 0000:44:00.0:
VM_L2_PROTECTION_FAULT_STATUS:0x0060153D
                [  +0.000005] amdgpu 0000:44:00.0: [gfxhub] VMC page fault
(src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225
thread superposit:cs0 pid 11308)
                [  +0.000002] amdgpu 0000:44:00.0:   in page starting at
address 0x0000800010609000 from 27
                [  +0.000001] amdgpu 0000:44:00.0:
VM_L2_PROTECTION_FAULT_STATUS:0x00000000
                [  +0.000004] amdgpu 0000:44:00.0: [gfxhub] VMC page fault
(src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225
thread superposit:cs0 pid 11308)
                [  +0.000001] amdgpu 0000:44:00.0:   in page starting at
address 0x0000800010607000 from 27
                [  +0.000002] amdgpu 0000:44:00.0:
VM_L2_PROTECTION_FAULT_STATUS:0x00000000
                [  +0.000004] amdgpu 0000:44:00.0: [gfxhub] VMC page fault
(src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225
thread superposit:cs0 pid 11308)
                [  +0.000001] amdgpu 0000:44:00.0:   in page starting at
address 0x0000800010609000 from 27
                [  +0.000001] amdgpu 0000:44:00.0:
VM_L2_PROTECTION_FAULT_STATUS:0x00000000
                [  +0.000004] amdgpu 0000:44:00.0: [gfxhub] VMC page fault
(src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225
thread superposit:cs0 pid 11308)
                [  +0.000002] amdgpu 0000:44:00.0:   in page starting at
address 0x0000800010607000 from 27
                [  +0.000001] amdgpu 0000:44:00.0:
VM_L2_PROTECTION_FAULT_STATUS:0x00000000
                [  +0.000004] amdgpu 0000:44:00.0: [gfxhub] VMC page fault
(src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225
thread superposit:cs0 pid 11308)
                [  +0.000001] amdgpu 0000:44:00.0:   in page starting at
address 0x0000800010609000 from 27
                [  +0.000001] amdgpu 0000:44:00.0:
VM_L2_PROTECTION_FAULT_STATUS:0x00000000
                [  +0.000004] amdgpu 0000:44:00.0: [gfxhub] VMC page fault
(src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225
thread superposit:cs0 pid 11308)
                [  +0.000001] amdgpu 0000:44:00.0:   in page starting at
address 0x0000800010607000 from 27
                [  +0.000001] amdgpu 0000:44:00.0:
VM_L2_PROTECTION_FAULT_STATUS:0x00000000
                [  +0.000004] amdgpu 0000:44:00.0: [gfxhub] VMC page fault
(src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225
thread superposit:cs0 pid 11308)
                [  +0.000002] amdgpu 0000:44:00.0:   in page starting at
address 0x0000800010609000 from 27
                [  +0.000001] amdgpu 0000:44:00.0:
VM_L2_PROTECTION_FAULT_STATUS:0x00000000
                [  +0.000004] amdgpu 0000:44:00.0: [gfxhub] VMC page fault
(src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225
thread superposit:cs0 pid 11308)
                [  +0.000001] amdgpu 0000:44:00.0:   in page starting at
address 0x0000800010607000 from 27
                [  +0.000001] amdgpu 0000:44:00.0:
VM_L2_PROTECTION_FAULT_STATUS:0x00000000
                [  +0.000004] amdgpu 0000:44:00.0: [gfxhub] VMC page fault
(src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225
thread superposit:cs0 pid 11308)
                [  +0.000001] amdgpu 0000:44:00.0:   in page starting at
address 0x0000800010609000 from 27
                [  +0.000001] amdgpu 0000:44:00.0:
VM_L2_PROTECTION_FAULT_STATUS:0x00000000
                [Jan18 19:44] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
gfx timeout, signaled seq=40554, emitted seq=40556
                [  +0.000047] [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
Process information: process superposition pid 11225 thread superposit:cs0 pid
11308
6. amdgpu reports near 100% cpu usage and high power draw, even if it was
completely idle before the freeze.
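
The "ring ... timeout" messages in step 5 can be pulled out of a saved dmesg
log mechanically. This is a minimal Python sketch, assuming each message sits
on a single unwrapped line and that the two seq values are fence sequence
numbers, so their difference is the number of submissions still outstanding.
The names TIMEOUT_RE and parse_ring_timeout are hypothetical.

```python
import re

# Matches the text printed by amdgpu_job_timedout as quoted in this report.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, signaled seq=(?P<signaled>\d+), "
    r"emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line: str):
    """Return (ring, signaled, emitted, pending), or None if no match."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    signaled = int(match["signaled"])
    emitted = int(match["emitted"])
    return match["ring"], signaled, emitted, emitted - signaled


sample = ("[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
          "signaled seq=17188686, emitted seq=17188689")
print(parse_ring_timeout(sample))
```

Running this over each line of a log makes it easy to see which ring timed out
and how far it was behind in every hang.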

If I enable amdgpu.gpu_recovery, then it tries to reset the GPU but fails most
of the time:
                [  +0.000005] amdgpu 0000:44:00.0: GPU reset begin!
                [ +10.230091] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR*
[CRTC:51:crtc-2] hw_done or flip_done timed out
                (there are no further logs)
        (I've seen it successfully reset the GPU only *once*, and that obviously
required an X restart)

These freezes happen pretty much randomly:
- Sometimes the GPU remains stable for weeks
- It will generally remain stable while just playing games or running
benchmarks like Unigine Superposition for many hours
- There have been a couple of freezes while just watching YouTube in Firefox
and doing nothing else
- It will sometimes freeze with GPU being completely idle (but outputs on),
while CPU is at 100%
- It will sometimes freeze when opening shadertoy shaders. Not specific ones,
just randomly.
- It will likely freeze within 1-2 hours of streaming using OBS:
                - XSHM is used to grab 2560x1440 screen at 60fps
                - image downscaled to 1080p60 using whatever OBS does
                - a bunch of minor stuff added to the frame
                - software encoding using x264 medium preset resulting in
10-30% CPU load
        - It can freeze both during live shader programming (with the GPU at
100% under heavy path-tracing compute) and while just editing text in vim.
        - It is still pretty random: sometimes it remains stable through a week of
almost daily 2-4 hour streams, but on some days it will freeze 2-3 times
within one evening.

This would suggest a hardware issue, but strangely enough I have never
experienced this problem on Windows using the same PC. This also prevents me
from filing an RMA, because there's no plausible way to reproduce the issue.

Other hardware is stable:
- CPU being 100% busy compiling some huge C++ codebases for hours remains
stable
- a memtest run lasting many hours doesn't show any errors
- there's also an NVIDIA GPU installed in this machine that is passed
through to Windows running under qemu. This GPU is also stable under any load.
        - although it was throwing PCI AER errors into dmesg (without any other
symptoms). This is believed to be a benign X399 issue, and is suppressed using
the pcie_aspm=off kernel parameter
- Loading the entire system to 100% (simultaneously running GPU benchmarks on
the host and the VM, and also compiling something on the CPU) generally doesn't
trigger the issue. Adding OBS to that likely does.
- Three different PSUs were used on this system, no behaviour difference.

Other things:
- Power management on Linux is significantly different from that on Windows.
        - on Windows idle means idle: all clocks and voltages are as low as pp
allows, power draw is ~20W
        - on Linux even idle (nothing is feeding the GPU any work) will have
sclk at 3 (1138MHz 1000mV) and mclk at 3 (max, 945MHz 1100mV), power draw is
40W
- I am unable to dump the BIOS of this card properly on Linux:
        - Both /sys/kernel/debug/dri/0/amdgpu_vbios and
/sys/class/drm/card0/device/rom are truncated at 60928 bytes
        - Contents are different from what I could dump on Windows, e.g.:
                @@ -1,6 +1,6 @@
                -00000000: 55aa 77e9 eb02 0000 0000 0000 0000 0000  U.w.............
                -00000010: 0000 0000 0000 0000 9c02 0000 0000 4942  ..............IB
                -00000020: 4d9d ac8a 0000 0000 0000 0000 0000 0004  M...............
                +00000000: 55aa 77e9 eb02 0000 00c0 0000 0000 0000  U.w.............
                +00000010: 0000 0000 0044 0000 9c02 0000 0000 4942  .....D........IB
                +00000020: 4d43 ac8a 0000 0000 0000 0000 0000 0004  MC..............
                 00000030: 2037 3631 3239 3535 3230 0000 0000 0000   761295520......
                 00000040: 0000 0000 0000 0000 7402 0000 0000 0000  ........t.......
                 00000050: 3132 2f31 322f 3137 2030 313a 3237 0000  12/12/17 01:27..
                @@ -38,13 +38,13 @@
                 00000250: 315f 4d42 415f 4131 5f48 424d 5f38 4742  1_MBA_A1_HBM_8GB
                 00000260: 5f56 3336 3831 305c 636f 6e66 6967 2e68  _V36810\config.h
                 00000270: 0000 0090 2800 0202 4154 4f4d 00c0 eb03  ....(...ATOM....
                -00000280: 1802 c102 6c01 1e04 0000 0000 6214 8036  ....l.......b..6
                +00000280: 1802 c102 6c01 1e04 0000 0030 6214 8036  ....l......0b..6
- Under/over-volting doesn't work: any change to any of the default voltages,
however insignificant, results in severe throttling, see
https://github.com/RadeonOpenCompute/ROCm/issues/681
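
The VBIOS comparison described above can be repeated without a hex diff tool.
This is a minimal Python sketch, assuming both dumps have already been saved
as ordinary files. The function names compare_dumps and compare_files are
hypothetical, and the sample bytes are the first 16-byte row of each side of
the diff quoted in this report.

```python
def compare_dumps(a: bytes, b: bytes):
    """Return (len(a), len(b), offsets that differ within the common prefix)."""
    common = min(len(a), len(b))
    diffs = [i for i in range(common) if a[i] != b[i]]
    return len(a), len(b), diffs


def compare_files(path_a: str, path_b: str):
    """Read two dump files (paths supplied by the caller) and compare them."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return compare_dumps(fa.read(), fb.read())


# First row of the "-" and "+" sides of the diff above.
row_minus = bytes.fromhex("55aa77e9eb0200000000000000000000")
row_plus = bytes.fromhex("55aa77e9eb02000000c0000000000000")
print(compare_dumps(row_minus, row_plus))
```

Reporting both lengths alongside the differing offsets shows truncation and
content changes in one pass, which are the two problems noted for this card.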

Is there anything else I could try?
Is there a way to collect more info?

Links to (probably, superficially) similar problems:
- https://bugs.freedesktop.org/show_bug.cgi?id=105733
- https://bugs.freedesktop.org/show_bug.cgi?id=105819
- https://bugs.freedesktop.org/show_bug.cgi?id=109022
- https://bugs.freedesktop.org/show_bug.cgi?id=105251
- https://bugs.freedesktop.org/show_bug.cgi?id=108493
- https://github.com/RadeonOpenCompute/ROCm/issues/348

-- 
You are receiving this mail because:
You are the assignee for the bug.

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


* [Bug 109403] amdgpu randomly hangs while streaming or when CPU is busy on X399 with TR 1950X
From: bugzilla-daemon @ 2019-01-21 10:50 UTC
  To: dri-devel


https://bugs.freedesktop.org/show_bug.cgi?id=109403

Ivan Avdeev <1@provod.gl> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |1@provod.gl

--- Comment #1 from Ivan Avdeev <1@provod.gl> ---
Created attachment 143176
  --> https://bugs.freedesktop.org/attachment.cgi?id=143176&action=edit
dmesg-5.0-rc2


* [Bug 109403] amdgpu randomly hangs while streaming or when CPU is busy on X399 with TR 1950X
From: bugzilla-daemon @ 2019-01-29 21:50 UTC
  To: dri-devel


https://bugs.freedesktop.org/show_bug.cgi?id=109403

--- Comment #2 from Chris <htfreaq@gmail.com> ---
I wonder if this is related to your motherboard. I have an ASUS Zenith with a
1950X, 128GB RAM and a Vega 64 LC that has been on kernels 4.20 through
5.0-rc4, the latter of which I'm currently on. I have no kernel parameters in
my grub file, only the default 'splash quiet'. I have never run into hangs
while gaming, using YouTube, using OBS or compiling the Linux kernel. Just
thought I would share my similar configuration. I can only suggest trying to
update and/or downgrade your BIOS.


* [Bug 109403] amdgpu randomly hangs while streaming or when CPU is busy on X399 with TR 1950X
From: bugzilla-daemon @ 2019-01-31 16:01 UTC
  To: dri-devel


https://bugs.freedesktop.org/show_bug.cgi?id=109403

--- Comment #3 from Andrey Grodzovsky <andrey.grodzovsky@amd.com> ---
Hey, can you check if adding amdgpu.vm_debug=1 makes the VMC page faults
disappear?

Regarding the hang you see during GPU reset: please provide dmesg for it,
captured with the drm.debug=0xff command line parameter. You should probably
also open a separate ticket for this specific issue.
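
Before collecting the requested logs, it may help to confirm that both
parameters actually reached the running kernel. This is a minimal Python
sketch, assuming the parameters are passed on the kernel command line and
contain no quoted spaces. The names WANTED, parse_cmdline, missing_params and
read_cmdline are hypothetical.

```python
# Parameters requested in this comment.
WANTED = {"amdgpu.vm_debug": "1", "drm.debug": "0xff"}


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value or None}."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def missing_params(cmdline: str, wanted: dict) -> dict:
    """Return the wanted parameters that are absent or set differently."""
    have = parse_cmdline(cmdline)
    return {k: v for k, v in wanted.items() if have.get(k) != v}


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read()


sample = "quiet splash amdgpu.gpu_recovery=1 drm.debug=0xff"
print(missing_params(sample, WANTED))
```

On the affected machine, missing_params(read_cmdline(), WANTED) returning an
empty dict indicates both options were present at boot.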


* [Bug 109403] amdgpu randomly hangs while streaming or when CPU is busy on X399 with TR 1950X
From: bugzilla-daemon @ 2019-11-19  9:11 UTC
  To: dri-devel


https://bugs.freedesktop.org/show_bug.cgi?id=109403

Martin Peres <martin.peres@free.fr> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |MOVED

--- Comment #4 from Martin Peres <martin.peres@free.fr> ---
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been
closed from further activity.

You can subscribe and participate further through the new bug through this link
to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/682.
