* [Bug 203111] New: Unrecoverable GPU crash with DiRT 4
@ 2019-03-30  9:29 bugzilla-daemon
  2019-04-01 16:02 ` [Bug 203111] " bugzilla-daemon
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: bugzilla-daemon @ 2019-03-30  9:29 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=203111

            Bug ID: 203111
           Summary: Unrecoverable GPU crash with DiRT 4
           Product: Drivers
           Version: 2.5
    Kernel Version: 5.0.4
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Video(DRI - non Intel)
          Assignee: drivers_video-dri@kernel-bugs.osdl.org
          Reporter: v10lator@myway.de
        Regression: No

I just played the Linux version of DiRT 4 and after some rounds of driving the
screen froze. The game sound was still there but the keyboard didn't respond to
any input. So I decided to try to SSH to the PC and see the logs. This is what I
found:

> [52700.498697] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, signaled
> seq=1423558, emitted seq=1423560
> [52700.498702] [drm:amdgpu_job_timedout] *ERROR* Process information: process
> Dirt4 pid 10332 thread WebViewRenderer pid 10391
> [52700.498705] amdgpu 0000:01:00.0: GPU reset begin!
> [52710.728397] [drm:amdgpu_dm_atomic_check] *ERROR* [CRTC:47:crtc-0] hw_done
> or flip_done timed out
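
For anyone triaging similar reports, here is a minimal sketch of pulling the
ring name and the two sequence numbers out of the amdgpu_job_timedout line.
The regular expression and the helper name parse_ring_timeout are
illustrative assumptions, not part of any kernel or mesa tool, and reading
"emitted - signaled" as the number of fences still outstanding is an
interpretation rather than something stated in this thread.

```python
import re

# Sample line copied from the report, with the mail line-wrapping undone.
LINE = ("[52700.498697] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, "
        "signaled seq=1423558, emitted seq=1423560")

# Hypothetical pattern for this one message format; other kernel versions
# may word the message differently.
PATTERN = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[drm:amdgpu_job_timedout\].*?"
    r"ring (?P<ring>\S+) timeout, signaled seq=(?P<signaled>\d+), "
    r"emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return the parsed fields of a ring timeout line, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    signaled = int(match.group("signaled"))
    emitted = int(match.group("emitted"))
    return {
        "timestamp": float(match.group("ts")),
        "ring": match.group("ring"),
        "signaled": signaled,
        "emitted": emitted,
        # Assumed meaning: fences emitted but not yet signaled.
        "pending": emitted - signaled,
    }


if __name__ == "__main__":
    print(parse_ring_timeout(LINE))
```

In all three logs quoted in this thread the emitted sequence number is two
ahead of the signaled one.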

After some time sound stopped and the log showed:

> [52873.699280] WARNING: CPU: 2 PID: 4034 at kernel/kthread.c:529
> kthread_park+0x67/0x78
> [52873.699283] Modules linked in: nfsd
> [52873.699287] CPU: 2 PID: 4034 Comm: TaskSchedulerFo Not tainted 5.0.4 #1
> [52873.699288] Hardware name: To be filled by O.E.M. To be filled by
> O.E.M./SABERTOOTH 990FX R2.0, BIOS 2901 05/04/2016
> [52873.699290] RIP: 0010:kthread_park+0x67/0x78
> [52873.699291] Code: 18 e8 9d 78 aa 00 be 40 00 00 00 48 89 df e8 60 72 00 00
> 48 85 c0 74 1b 31 c0 5b 5d c3 0f 0b eb ae 0f 0b b8 da ff ff ff eb f0 <0f> 0b
> b8 f0 ff ff ff eb e7 0f 0b eb e3 0f 1f 40 00 f6 47 26 20 74
> [52873.699293] RSP: 0018:ffffa0144460fb78 EFLAGS: 00210202
> [52873.699294] RAX: 0000000000000004 RBX: ffff9155631210c0 RCX:
> 0000000000000000
> [52873.699295] RDX: ffff9155ef427428 RSI: ffff9155631210c0 RDI:
> ffff9155ef9bbfc0
> [52873.699296] RBP: ffff9155f013b8a0 R08: ffff9155f2a97480 R09:
> ffff9155f2a94a00
> [52873.699297] R10: 0000d46d0abbfe3a R11: 000033d8b581bc78 R12:
> ffff9155ef422790
> [52873.699298] R13: ffff9155a2f83c00 R14: 0000000000000202 R15:
> dead000000000100
> [52873.699299] FS:  00007fc756cff700(0000) GS:ffff9155f2a80000(0000)
> knlGS:0000000000000000
> [52873.699301] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [52873.699302] CR2: 00007fc7650b8070 CR3: 0000000322b86000 CR4:
> 00000000000406e0
> [52873.699302] Call Trace:
> [52873.699307]  drm_sched_entity_fini+0x32/0x180
> [52873.699309]  amdgpu_vm_fini+0xa8/0x520
> [52873.699311]  ? idr_destroy+0x78/0xc0
> [52873.699313]  amdgpu_driver_postclose_kms+0x14c/0x268
> [52873.699316]  drm_file_free.part.7+0x21a/0x2f8
> [52873.699318]  drm_release+0xa5/0x120
> [52873.699320]  __fput+0x9a/0x1c8
> [52873.699323]  task_work_run+0x8a/0xb0
> [52873.699325]  do_exit+0x2b5/0xb30
> [52873.699326]  do_group_exit+0x35/0x98
> [52873.699328]  get_signal+0xbd/0x690
> [52873.699331]  ? _raw_spin_unlock+0xd/0x20
> [52873.699333]  ? do_signal+0x2b/0x6b8
> [52873.699335]  ? __x64_sys_futex+0x137/0x178
> [52873.699337]  ? exit_to_usermode_loop+0x46/0xa0
> [52873.699338]  ? do_syscall_64+0x14c/0x178
> [52873.699339]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [52873.699341] ---[ end trace 1e1efc0508ef22df ]---
> [52875.619562] [drm] Skip scheduling IBs!
> [52875.625247] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser
> -125!
> [52885.826983] [drm:drm_atomic_helper_wait_for_flip_done] *ERROR*
> [CRTC:47:crtc-0] flip_done timed out
> [52896.066581] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR*
> [CRTC:47:crtc-0] flip_done timed out
> [52906.306280] [drm:drm_atomic_helper_wait_for_dependencies] *ERROR*
> [PLANE:45:plane-5] flip_done timed out

I tried to soft reboot through SSH but it didn't work, so in the end I had to
hard reset by removing power. This is on a Radeon RX 580.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

* [Bug 203111] Unrecoverable GPU crash with DiRT 4
  2019-03-30  9:29 [Bug 203111] New: Unrecoverable GPU crash with DiRT 4 bugzilla-daemon
@ 2019-04-01 16:02 ` bugzilla-daemon
  2019-04-02  7:38 ` bugzilla-daemon
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: bugzilla-daemon @ 2019-04-01 16:02 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=203111

Alex Deucher (alexdeucher@gmail.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |alexdeucher@gmail.com

--- Comment #1 from Alex Deucher (alexdeucher@gmail.com) ---
This is probably a mesa bug.  I'd suggest trying a new version of mesa or
filing a mesa bug.

* [Bug 203111] Unrecoverable GPU crash with DiRT 4
  2019-03-30  9:29 [Bug 203111] New: Unrecoverable GPU crash with DiRT 4 bugzilla-daemon
  2019-04-01 16:02 ` [Bug 203111] " bugzilla-daemon
@ 2019-04-02  7:38 ` bugzilla-daemon
  2019-04-05 18:44 ` bugzilla-daemon
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: bugzilla-daemon @ 2019-04-02  7:38 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=203111

Thomas (v10lator@myway.de) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |INVALID

--- Comment #2 from Thomas (v10lator@myway.de) ---
(In reply to Alex Deucher from comment #1)
> This is probably a mesa bug.  I'd suggest trying a new version of mesa

That helped, thank you.

* [Bug 203111] Unrecoverable GPU crash with DiRT 4
  2019-03-30  9:29 [Bug 203111] New: Unrecoverable GPU crash with DiRT 4 bugzilla-daemon
  2019-04-01 16:02 ` [Bug 203111] " bugzilla-daemon
  2019-04-02  7:38 ` bugzilla-daemon
@ 2019-04-05 18:44 ` bugzilla-daemon
  2019-04-05 20:31 ` bugzilla-daemon
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: bugzilla-daemon @ 2019-04-05 18:44 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=203111

Thomas (v10lator@myway.de) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|INVALID                     |---

--- Comment #3 from Thomas (v10lator@myway.de) ---
(In reply to Alex Deucher from comment #1)
> I'd suggest trying a new version of mesa

I closed this too quickly: it crashes with newer mesa, too, just
(subjectively) less frequently.

Here's a log from mesa 19.0.1:

> [178793.032358] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, signaled
> seq=12332054, emitted seq=12332056
> [178793.032362] [drm:amdgpu_job_timedout] *ERROR* Process information:
> process Dirt4 pid 31348 thread WebViewRenderer pid 31422
> [178793.032365] amdgpu 0000:01:00.0: GPU reset begin!
> [178803.262008] [drm:amdgpu_dm_atomic_check] *ERROR* [CRTC:47:crtc-0] hw_done
> or flip_done timed out

And from git (26e161b1e9):

> [ 7819.095648] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, signaled
> seq=2652771, emitted seq=2652773
> [ 7819.095652] [drm:amdgpu_job_timedout] *ERROR* Process information: process
> Dirt4 pid 3075 thread WebViewRenderer pid 3152
> [ 7819.095655] amdgpu 0000:01:00.0: GPU reset begin!
> [ 7829.315220] [drm:amdgpu_dm_atomic_check] *ERROR* [CRTC:47:crtc-0] hw_done
> or flip_done timed out

Not sure if the log is shorter because of the new mesa or the new kernel
(updated from 5.0.4 to 5.0.5).

Are you sure this could be a mesa bug? Just asking cause for me a hanging
kernel sounds like a kernel bug.

* [Bug 203111] Unrecoverable GPU crash with DiRT 4
  2019-03-30  9:29 [Bug 203111] New: Unrecoverable GPU crash with DiRT 4 bugzilla-daemon
                   ` (2 preceding siblings ...)
  2019-04-05 18:44 ` bugzilla-daemon
@ 2019-04-05 20:31 ` bugzilla-daemon
  2019-04-05 21:15 ` bugzilla-daemon
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: bugzilla-daemon @ 2019-04-05 20:31 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=203111

--- Comment #4 from Alex Deucher (alexdeucher@gmail.com) ---
(In reply to Thomas from comment #3)
> 
> Are you sure this could be a mesa bug? Just asking cause for me a hanging
> kernel sounds like a kernel bug.

Likely a mesa bug.  Mesa submits gfx/video/compute jobs to the kernel driver. 
If there are subtle bugs in those jobs, the GPU can hang.  The kernel driver
can reset the GPU, but the display server needs to catch the reset and properly
re-initialize its context and buffers.  At the moment, none of the display
servers do this so you need to restart them after a GPU reset.

The:
[drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
error is because userspace tried to submit more work to the kernel after a
reset without re-initializing its context, so the kernel rejects it.
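
As a small aid for reading such messages, the sketch below decodes a
kernel-style negative return value into an errno name. It assumes the -125
in the log follows the usual negative-errno convention; on Linux, errno 125
is ECANCELED. The helper name describe_kernel_error is hypothetical, and
the numeric values of errno constants differ between operating systems.

```python
import errno
import os


def describe_kernel_error(code):
    """Map a negative kernel-style return code to (errno name, message)."""
    number = abs(code)
    # errno.errorcode maps numeric values to names on the running platform.
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


if __name__ == "__main__":
    # Run this on Linux to decode the value from the amdgpu_cs_ioctl message.
    name, message = describe_kernel_error(-125)
    print(name, "-", message)
```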

* [Bug 203111] Unrecoverable GPU crash with DiRT 4
  2019-03-30  9:29 [Bug 203111] New: Unrecoverable GPU crash with DiRT 4 bugzilla-daemon
                   ` (3 preceding siblings ...)
  2019-04-05 20:31 ` bugzilla-daemon
@ 2019-04-05 21:15 ` bugzilla-daemon
  2019-04-05 21:15 ` bugzilla-daemon
  2019-04-09  1:39 ` bugzilla-daemon
  6 siblings, 0 replies; 8+ messages in thread
From: bugzilla-daemon @ 2019-04-05 21:15 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=203111

--- Comment #5 from Thomas (v10lator@myway.de) ---
Thanks a lot for the detailed answer. I'm still not sure if I understand
everything correctly (shouldn't the kernel driver validate the command stream
from userspace/mesa and stop bad things before they hit hardware / hang the
GPU?) but I'll close this now and check for or open a new mesa bug report
tomorrow (I really need sleep now).

Damn, if this wouldn't be the wrong place I would ask for more details about
your last reply (the thing about the display servers not catching up with the
GPU reset - aren't there drivers which perform GPU resets just nice under X11
already? What about Wayland?). It's so freaking nice, I bet I would learn a lot
if we would continue the discussion... Anyway, thanks again for explaining and
sorry for me going a bit off topic in this reply.


One last thing... It's extremely off topic but I already derailed this reply and
it has to be told: Thank you Alex for being the guy you are. I bet AMD doesn't
pay you to explain technical details to stupid end users like me but that's
very appreciated. You're a hero, keep on rockin'!

* [Bug 203111] Unrecoverable GPU crash with DiRT 4
  2019-03-30  9:29 [Bug 203111] New: Unrecoverable GPU crash with DiRT 4 bugzilla-daemon
                   ` (4 preceding siblings ...)
  2019-04-05 21:15 ` bugzilla-daemon
@ 2019-04-05 21:15 ` bugzilla-daemon
  2019-04-09  1:39 ` bugzilla-daemon
  6 siblings, 0 replies; 8+ messages in thread
From: bugzilla-daemon @ 2019-04-05 21:15 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=203111

Thomas (v10lator@myway.de) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|---                         |INVALID

* [Bug 203111] Unrecoverable GPU crash with DiRT 4
  2019-03-30  9:29 [Bug 203111] New: Unrecoverable GPU crash with DiRT 4 bugzilla-daemon
                   ` (5 preceding siblings ...)
  2019-04-05 21:15 ` bugzilla-daemon
@ 2019-04-09  1:39 ` bugzilla-daemon
  6 siblings, 0 replies; 8+ messages in thread
From: bugzilla-daemon @ 2019-04-09  1:39 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=203111

--- Comment #6 from Alex Deucher (alexdeucher@gmail.com) ---
(In reply to Thomas from comment #5)
> Thanks a lot for the detailed answer. I'm still not sure if I understand
> everything correctly (shouldn't the kernel driver validate the command
> stream from userspace/mesa and stop bad things before they hit hardware /
> hang the GPU?) 

It's not really feasible.  For one, it adds a lot of CPU overhead.  There is
also so much state in the 3D pipeline it's nearly impossible to validate all of
the possible cases that could cause a hang.  In some cases, you may not even
know that a particular combination is bad until it gets hit.

> 
> Damn, if this wouldn't be the wrong place I would ask for more details about
> your last reply (the thing about the display servers not catching up with
> the GPU reset - aren't there drivers which perform GPU resets just nice
> under X11 already? What about Wayland?). It's so freaking nice, I bet I
> would learn a lot if we would continue the discussion... Anyway, thanks again
> for explaining and sorry for me going a bit off topic in this reply.

I'm not sure if other drivers silently reset the GPU when they encounter a
hang.  It's generally easier to deal with on integrated GPUs since they operate
on system memory.  On dGPUs, the contents of vram might be lost after a GPU
reset as the memory controller is reset.  If vram is lost, the application that
is running needs to reload its vram state.  Also for reliability, applications
should really be made aware of a GPU reset so they can validate their data. 
E.g., you don't want a scientific application to silently get bad data because
the GPU was reset silently in the background.

> 
> 
> One last thing... It's extremely off topic but I already derailed this reply
> and it has to be told: Thank you Alex for being the guy you are. I bet AMD
> doesn't pay you to explain technical details to stupid end users like me but
> that's very appreciated. You're a hero, keep on rockin'!

Thanks!  Glad to help.
