From mboxrd@z Thu Jan 1 00:00:00 1970
From: bugzilla-daemon@freedesktop.org
Subject: [Bug 91278] Tonga GPU lock/reset fail with Unigine Valley
Date: Thu, 09 Jul 2015 10:10:12 +0000
Message-ID:
Field | Value |
---|---|
Bug ID | 91278 |
Summary | Tonga GPU lock/reset fail with Unigine Valley |
Product | DRI |
Version | XOrg git |
Hardware | Other |
OS | All |
Status | NEW |
Severity | normal |
Priority | medium |
Component | DRM/AMDgpu |
Assignee | dri-devel@lists.freedesktop.org |
Reporter | adf.lists@gmail.com |
R9 285 kernel agd5f amdgpu with or without patches from
https://bugs.freedesktop.org/show_bug.cgi?id=91141
mesa is agd5f with a few patches from mainline to build with current llvm.
ddx is git against older xorg.
Simpler games like openarena don't lock.
Valley settings ultra, 8xAA, fullscreen 1920x1080.
Doesn't show in this log but I've also seen some
[drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-35)
around the lock/reset on previous tests.
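For reference, the number in parentheses at the end of these messages looks like a negated errno value. Below is a minimal Python sketch for naming such codes, assuming the driver prints plain negated errno values and assuming the standard Linux numbering (35 is EDEADLK, 22 is EINVAL). The table is hard-coded because Python's own errno module follows the numbering of the host platform, and the helper name `errno_name` is made up for this illustration.

```python
import re

# Assumption: Linux (asm-generic) errno numbering; other platforms differ.
LINUX_ERRNO = {
    11: "EAGAIN",
    12: "ENOMEM",
    16: "EBUSY",
    22: "EINVAL",
    35: "EDEADLK",
}

# Matches a trailing "(-N)" such as the one in "Couldn't update BO_VA (-35)".
CODE_RE = re.compile(r"\((-\d+)\)\s*$")


def errno_name(line):
    """Return the errno name for a trailing '(-N)' in a log line, or None."""
    m = CODE_RE.search(line)
    if not m:
        return None
    return LINUX_ERRNO.get(-int(m.group(1)), "unknown")


if __name__ == "__main__":
    msg = "[drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-35)"
    print(errno_name(msg))
```

If that assumption holds, the (-35) above would be EDEADLK, which would fit a lockup being detected while the ioctl was in flight, but that reading is a guess and not something the log states.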
Created attachment 117013 [details]
xorg-log for ref (not from lock)
I've of course tried various things since reporting - Valley doesn't always lock instantly. Unreal 4.5 Elemental got halfway before locking.

Perhaps more interesting, I managed to get a reset/failed resume while just browsing - of course I've done a lot of browsing without issue so far. The difference this time was that I had a huge ffmpeg/x265 encode going - it was using all my memory (8 Gig, and swap had been used a bit), so it's possible memory pressure plays a role - or maybe it's just a red herring :-)

I haven't managed to get a reset running timedemos on openarena or xonotic so far - will try with memory pressure as time allows.

The reset when browsing -

-rw-rw-r-- 1 andy andy 153K Jun 13 00:04 hacky-fix.jpeg
[ 8052.101670] amdgpu 0000:01:00.0: GPU lockup (waiting for 0x000000000000f019 last fence id 0x000000000000f018 on ring 9)
[ 8052.101672] amdgpu 0000:01:00.0: failed to sync rings (-35)
[ 8052.108912] amdgpu 0000:01:00.0: Saved 9216 dwords of commands on ring 9.
[ 8052.108929] amdgpu 0000:01:00.0: GPU softreset: 0x00000100
[ 8052.108930] amdgpu 0000:01:00.0: GRBM_STATUS=0x00003028
[ 8052.108932] amdgpu 0000:01:00.0: GRBM_STATUS2=0x00000008
[ 8052.108934] amdgpu 0000:01:00.0: GRBM_STATUS_SE0=0x00000006
[ 8052.108935] amdgpu 0000:01:00.0: GRBM_STATUS_SE1=0x00000006
[ 8052.108937] amdgpu 0000:01:00.0: GRBM_STATUS_SE2=0x00000006
[ 8052.108938] amdgpu 0000:01:00.0: GRBM_STATUS_SE3=0x00000006
[ 8052.108940] amdgpu 0000:01:00.0: SRBM_STATUS=0x20020240
[ 8052.108941] amdgpu 0000:01:00.0: SRBM_STATUS2=0x00000080
[ 8052.108943] amdgpu 0000:01:00.0: SDMA0_STATUS_REG = 0x76DEED57
[ 8052.108945] amdgpu 0000:01:00.0: SDMA1_STATUS_REG = 0x46DEED57
[ 8052.108946] amdgpu 0000:01:00.0: CP_STAT = 0x00000000
[ 8052.108948] amdgpu 0000:01:00.0: CP_STALLED_STAT1 = 0x00000c00
[ 8052.108949] amdgpu 0000:01:00.0: CP_STALLED_STAT2 = 0x00000000
[ 8052.108951] amdgpu 0000:01:00.0: CP_STALLED_STAT3 = 0x00000000
[ 8052.108953] amdgpu 0000:01:00.0: CP_CPF_BUSY_STAT = 0x00000000
[ 8052.108954] amdgpu 0000:01:00.0: CP_CPF_STALLED_STAT1 = 0x00000000
[ 8052.108956] amdgpu 0000:01:00.0: CP_CPF_STATUS = 0x00000000
[ 8052.108957] amdgpu 0000:01:00.0: CP_CPC_BUSY_STAT = 0x00000000
[ 8052.108959] amdgpu 0000:01:00.0: CP_CPC_STALLED_STAT1 = 0x00000000
[ 8052.108961] amdgpu 0000:01:00.0: CP_CPC_STATUS = 0x00000000
[ 8052.108962] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 8052.108964] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 8052.109078] amdgpu 0000:01:00.0: SRBM_SOFT_RESET=0x00000400
[ 8052.110233] amdgpu 0000:01:00.0: GRBM_STATUS=0x00003028
[ 8052.110235] amdgpu 0000:01:00.0: GRBM_STATUS2=0x00000008
[ 8052.110236] amdgpu 0000:01:00.0: GRBM_STATUS_SE0=0x00000006
[ 8052.110238] amdgpu 0000:01:00.0: GRBM_STATUS_SE1=0x00000006
[ 8052.110239] amdgpu 0000:01:00.0: GRBM_STATUS_SE2=0x00000006
[ 8052.110241] amdgpu 0000:01:00.0: GRBM_STATUS_SE3=0x00000006
[ 8052.110242] amdgpu 0000:01:00.0: SRBM_STATUS=0x20020040
[ 8052.110244] amdgpu 0000:01:00.0: SRBM_STATUS2=0x00000080
[ 8052.110245] amdgpu 0000:01:00.0: SDMA0_STATUS_REG = 0x76DEED57
[ 8052.110247] amdgpu 0000:01:00.0: SDMA1_STATUS_REG = 0x46DEED57
[ 8052.110248] amdgpu 0000:01:00.0: CP_STAT = 0x00000000
[ 8052.110250] amdgpu 0000:01:00.0: CP_STALLED_STAT1 = 0x00000c00
[ 8052.110252] amdgpu 0000:01:00.0: CP_STALLED_STAT2 = 0x00000000
[ 8052.110253] amdgpu 0000:01:00.0: CP_STALLED_STAT3 = 0x00000000
[ 8052.110255] amdgpu 0000:01:00.0: CP_CPF_BUSY_STAT = 0x00000000
[ 8052.110256] amdgpu 0000:01:00.0: CP_CPF_STALLED_STAT1 = 0x00000000
[ 8052.110258] amdgpu 0000:01:00.0: CP_CPF_STATUS = 0x00000000
[ 8052.110259] amdgpu 0000:01:00.0: CP_CPC_BUSY_STAT = 0x00000000
[ 8052.110261] amdgpu 0000:01:00.0: CP_CPC_STALLED_STAT1 = 0x00000000
[ 8052.110262] amdgpu 0000:01:00.0: CP_CPC_STATUS = 0x00000000
[ 8052.110282] amdgpu 0000:01:00.0: GPU reset succeeded, trying to resume
[ 8052.110289] [drm] probing gen 2 caps for device 1002:5a16 = 31cd02/0
[ 8052.111446] [drm] PCIE GART of 2048M enabled (table at 0x0000000000040000).
[ 8052.113940] [drm] ring test on 0 succeeded in 10 usecs
[ 8053.856277] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 1 test failed (scratch(0xC040)=0xCAFEDEAD)
[ 8054.049187] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 2 test failed (scratch(0xC040)=0xCAFEDEAD)
[ 8054.242101] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 3 test failed (scratch(0xC040)=0xCAFEDEAD)
[ 8054.435020] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 4 test failed (scratch(0xC040)=0xCAFEDEAD)
[ 8054.627925] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 5 test failed (scratch(0xC040)=0xCAFEDEAD)
[ 8054.820839] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 6 test failed (scratch(0xC040)=0xCAFEDEAD)
[ 8055.013737] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 7 test failed (scratch(0xC040)=0xCAFEDEAD)
[ 8055.206669] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 8 test failed (scratch(0xC040)=0xCAFEDEAD)
[ 8055.313826] [drm:sdma_v3_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 9 test failed (0xCAFEDEAD)
[ 8055.319862] amdgpu 0000:01:00.0: GPU reset failed
[ 8055.320787] amdgpu 0000:01:00.0: couldn't schedule ib
[ 8055.320806] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-22)
[ 8055.320831] amdgpu 0000:01:00.0: couldn't schedule ib
[ 8055.320841] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-22)
[ 8055.320854] amdgpu 0000:01:00.0: couldn't schedule ib
[ 8055.320863] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-22)
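
The log above prints the same set of status registers twice, once before and once after the soft reset, so a quick way to see what the reset changed is to diff the two dumps. A minimal Python sketch follows, assuming each register line has the form "amdgpu <pci address>: NAME=0x..." (with or without spaces around the equals sign) and that a new dump starts whenever GRBM_STATUS is printed. The function names are made up for this illustration and are not part of any driver tooling.

```python
import re

# Matches lines like "amdgpu 0000:01:00.0: SRBM_STATUS=0x20020240"
# and "amdgpu 0000:01:00.0: CP_STAT = 0x00000000".
REG_RE = re.compile(r"amdgpu [0-9a-f:.]+: ([A-Z0-9_]+)\s*=\s*(0x[0-9A-Fa-f]+)")


def split_dumps(lines):
    """Group register lines into dumps; GRBM_STATUS marks the start of a dump."""
    dumps = []
    for line in lines:
        m = REG_RE.search(line)
        if not m:
            continue  # not a NAME=value register line
        name, value = m.group(1), int(m.group(2), 16)
        if name == "GRBM_STATUS" or not dumps:
            dumps.append({})
        dumps[-1][name] = value
    return dumps


def changed_registers(before, after):
    """Return {name: (old, new)} for registers present in both dumps that differ."""
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }


if __name__ == "__main__":
    import sys

    dumps = split_dumps(sys.stdin)
    if len(dumps) >= 2:
        for name, (old, new) in changed_registers(dumps[0], dumps[1]).items():
            print(f"{name}: {old:#010x} -> {new:#010x}")
```

Reading the two dumps by eye, SRBM_STATUS (0x20020240 before, 0x20020040 after) appears to be the only register whose value differs; what that bit means is not something this sketch tries to decode.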
I tested Xonotic with load over the weekend and it did lock after about 10 minutes - but I then tested without any memory pressure and still managed to lock, though it did take longer. I got another lock while browsing today - there was nothing else happening at the time, but a few minutes earlier I had compiled llvm and mesa.
What | Removed | Added |
---|---|---|
CC | master.homer@gmail.com |
Created attachment 117739 [details]
Kernel log of hang
Have been getting the same hangs, though I get it while just using the computer
normally, or even while it was idle.
Using Ubuntu vivid with kernel 4.2-rc7 from Ubuntu mainline with the oibaf ppa
and a self-compiled xf86-video-amdgpu module.
(In reply to Mathias Tillman from comment #5)
> Have been getting the same hangs, though I get it while just using the
> computer normally, or even while it was idle.

The symptoms may be similar, but since the circumstances differ, please file your own report.
Created attachment 117964 [details]
hung task with current agd5f drm-next-4.3
Valley does sometimes get further with newer gits - I have recently got all the
way through the scenes. It does still lock though.
Attached is a hung task trace with current agd5f drm-next-4.3, libdrm, mesa and
a recentish llvm.
What | Removed | Added |
---|---|---|
CC | edward.ocallaghan@koparo.com |
*** Bug 92087 has been marked as a duplicate of this bug. ***
Created attachment 118448 [details]
apitrace of hang
Not exactly sure how useful it is, but I have attached an excerpt of an
apitrace of the unigine valley demo. I had to cut out most of it, due to the
size of it - I've run it several times, and the size has always been >500MB.
The last thing in the trace was always glXSwapBuffers, so the excerpt consists
of the contents between the last and the next to last glXSwapBuffers lines.
Mathias, so you can reliably reproduce the hang by replaying that apitrace?
(In reply to Michel Dänzer from comment #10)
> Mathias, so you can reliably reproduce the hang by replaying that apitrace?

Unfortunately I haven't been able to replay the apitrace properly - I just get a black screen with a bunch of errors in the output about not supporting GLSL 1.50, program not supported etc.
I will keep trying though.
(In reply to Michel Dänzer from comment #10)
> Mathias, so you can reliably reproduce the hang by replaying that apitrace?

Okay, I have been able to replay the trace by compiling apitrace from source, and renaming glretrace to valley_x64.
I can reproduce the hang by replaying, but not in a very useful way as the hang happens on different frames on each replay, so I wouldn't really call it reproducible at this point.
I will see what different options glretrace gives me, to see if I can find some kind of common denominator between the hangs.
Sorry for triple posting, but I have some more info that may or may not be useful. When enabling verbose output from glretrace, I can see that the next to last operation before a hang is always glBindBuffer (I've run it 10 times), even though the rest of the output is very different. This, however, only happens when double buffer visuals are enabled; if I disable them using --sb, the output is the same as above, ending with glXSwapBuffers. Not sure if this is a coincidence or not, but it seemed interesting to me.
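
The comparison described here (which call comes next to last before each hang) can be tallied across saved replay logs. A minimal Python sketch follows, assuming each verbose glretrace line starts with a call number, optionally a thread marker such as "@1", and then the function name followed by "(". That line format is an assumption for this illustration, the sample lines are made up, and the helper names are hypothetical.

```python
import re
from collections import Counter

# Assumed line shape: "<call number> [@<thread>] <function>(<args>)".
CALL_RE = re.compile(r"^\d+\s+(?:@\d+\s+)?(\w+)\(")


def tail_calls(lines, n=2):
    """Return the names of the last n calls found in one replay log."""
    calls = [m.group(1) for m in map(CALL_RE.match, lines) if m]
    return calls[-n:]


def next_to_last_histogram(runs):
    """Count which call was next to last in each replay log."""
    hist = Counter()
    for lines in runs:
        tail = tail_calls(lines, 2)
        if len(tail) == 2:
            hist[tail[0]] += 1
    return hist


if __name__ == "__main__":
    # Made-up sample of two replay logs, only to show the shape of the input.
    run = [
        "10 glBindBuffer(target = GL_ARRAY_BUFFER, buffer = 3)",
        "11 glXSwapBuffers(dpy = 0x1, drawable = 2)",
    ]
    print(next_to_last_histogram([run, run]).most_common())
```

If one call dominates the histogram over many hung replays while the rest of each log differs, that would support the observation; ten runs is a small sample, so it could still be a coincidence.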
FWIW I don't think the Valley code/shaders/whatever itself triggers this. I can run valley for > an hour depending on luck/state of my box.

One way to get a long run for me is to go into mem sleep, come out, and run valley while nothing else is running. Based on only a few runs like this, I haven't locked it yet. Randomly starting valley after my PC has been in use all day may lock within 10 seconds. I am not saying mem sleep cures all locks, it just seems to make them harder to hit.

After running valley for an hour one time, I moved on to Unreal Elemental: several runs, no lock. I then tried Unreal Atlantis and eventually got it to hang with that, though it ran through the scripted bit and I had to start flying around interactively before the hang. Unreal Atlantis is a bit different/annoying. Different in that it requests more vram than I have, and annoying as it makes a 1.8 gig cache file under $HOME/.config.

This mem sleep observation may be luck. I also haven't tested some variants yet, like vblank_mode=0 or cpufreq_ondemand settings. I run valley all maxed fullscreen - so yet more variables versus what others may be running it with.
These patches may help:
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20150928/302380.html
http://lists.freedesktop.org/archives/mesa-dev/2015-September/095718.html
(In reply to Alex Deucher from comment #15)
> These patches may help:
> http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20150928/302380.html
> http://lists.freedesktop.org/archives/mesa-dev/2015-September/095718.html

Afraid not, applied both of them and it still hangs.
One interesting thing is that I was using mesa from the oibaf ppa which compiles against llvm 3.6. While using that I haven't been able to replay my apitrace of valley once - it always hangs before it finishes. However, I compiled mesa against llvm 3.7 (one compiled from source with the patch, and one from llvm's apt repository) and 3.8, and it gets much further now - I've been able to replay the trace three times without a hang, though it does ultimately hang unfortunately.
Haven't had time yet to get a hang with the patches. Yesterday, without them, I hung, rebooted, did the memsleep, then spent the rest of the day trying to lock valley and unreal but couldn't. For the whole day, the only logging I got was a few hundred of -

Sep 29 18:10:47 ph4 kernel: VM fault (0x04, vmid 4) at page 1529213, read from 'TC6' (0x54433600) (72)
Sep 29 18:10:49 ph4 kernel: amdgpu 0000:01:00.0: GPU fault detected: 146 0x0be84804
Sep 29 18:10:49 ph4 kernel: amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0017557D
Sep 29 18:10:49 ph4 kernel: amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08048004
Sep 29 18:10:49 ph4 kernel: VM fault (0x04, vmid 4) at page 1529213, read from 'TC6' (0x54433600) (72)

Last thing, I applied the patches to a couple of days old llvm and mesa gits.

This morning I ran valley from a power off boot after a bit of browsing/mail (yesterday this hung). Only a quick test, which I stopped; it looked OK, but in dmesg I have >10k of -

[ 1792.292640] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0918c404
[ 1792.292643] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00136F23
[ 1792.292644] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x060C4004
[ 1792.292646] VM fault (0x04, vmid 3) at page 1273635, read from 'TC4' (0x54433400) (196)
[ 1792.292650] amdgpu 0000:01:00.0: GPU fault detected: 146 0x09184404
[ 1792.292651] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 1792.292652] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 1792.292654] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0)
[ 1792.292658] amdgpu 0000:01:00.0: GPU fault detected: 146 0x09188404
[ 1792.292659] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 1792.292660] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 1792.292661] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0)
[ 1792.292666] amdgpu 0000:01:00.0: GPU fault detected: 146 0x09180404
[ 1792.292667] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 1792.292668] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 1792.292669] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0)
[ 1792.375515] amdgpu 0000:01:00.0: GPU fault detected: 146 0x09188404
[ 1792.375518] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00136F23
[ 1792.375519] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06084004
[ 1792.375521] VM fault (0x04, vmid 3) at page 1273635, read from 'TC10' (0x54433130) (132)
[ 1792.375526] amdgpu 0000:01:00.0: GPU fault detected: 146 0x09184404
[ 1792.375527] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 1792.375528] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 1792.375530] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0)
[ 1792.375534] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0918c404
[ 1792.375535] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 1792.375536] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 1792.375538] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0)
[ 1792.375542] amdgpu 0000:01:00.0: GPU fault detected: 146 0x09180404
[ 1792.375543] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 1792.375544] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 1792.375546] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0)
[ 1792.432272] amdgpu 0000:01:00.0: GPU fault detected: 146 0x09184404
[ 1792.432276] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00136F23
[ 1792.432277] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06044004
[ 1792.432280] VM fault (0x04, vmid 3) at page 1273635, read from 'TC7' (0x54433700) (68)
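
The "VM fault" summary lines above appear to be a decoded view of the two registers printed just before them. Below is a minimal Python sketch of that decoding. The bit positions are an assumption inferred only from the values quoted in this thread (for example status 0x060C4004 next to "0x04, vmid 3 ... (196)"), not checked against the driver headers, and the write case is a guess since every fault quoted here is a read. The function names are made up for this illustration.

```python
def decode_fault(addr, status):
    """Decode VM_CONTEXT1_PROTECTION_FAULT_ADDR/STATUS (layout is assumed)."""
    return {
        "protections": status & 0xFF,        # first number in "VM fault (0x04, ...)"
        "client_id": (status >> 12) & 0xFF,  # trailing number, e.g. (196)
        "access": "write" if (status >> 24) & 1 else "read",
        "vmid": (status >> 25) & 0xF,
        "page": addr,                        # printed in decimal by the kernel
    }


def client_tag(raw):
    """Turn the packed client value, e.g. 0x54433400, into its ASCII tag."""
    return raw.to_bytes(4, "big").rstrip(b"\x00").decode("ascii")


if __name__ == "__main__":
    print(decode_fault(0x00136F23, 0x060C4004))
    print(client_tag(0x54433400))
```

With that layout, ADDR 0x00136F23 and STATUS 0x060C4004 give protections 0x04, vmid 3, page 1273635 and client id 196, and 0x54433400 reads as the ASCII tag TC4, which matches the summary line printed at 1792.292646. The all-zero triples that follow each real fault decode to vmid 0 and page 0, so they carry no address information of their own.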
So after last update I ran valley again briefly and saw a few vmfaults, then a longer run and got thousands again.

Without touching anything else, did echo mem >/sys/power/state and then woke up.

10 minute run of valley has produced zero faults.
(In reply to Andy Furniss from comment #18)
> So after last update I ran valley again briefly and saw a few vmfaults, then
> a longer run and got thousands again.
>
> without touching anything else did echo mem >/sys/power/state and then woke
> up.
>
> 10 minute run of valley has produced zero faults.

Further test from power off, nothing else running apart from X/fluxbox: short run of valley, no faults. Reran valley for a bit longer and got thousands. Did memsleep, ran valley, no faults, but after about 10 minutes it hung.
(In reply to Andy Furniss from comment #19)
> (In reply to Andy Furniss from comment #18)
> > So after last update I ran valley again briefly and saw a few vmfaults, then
> > a longer run and got thousands again.
> >
> > without touching anything else did echo mem >/sys/power/state and then woke
> > up.
> >
> > 10 minute run of valley has produced zero faults.
>
> Further test from power off, nothing else running apart from X/fluxbox: short
> run of valley, no faults. Reran valley for a bit longer and got thousands.
> Did memsleep, ran valley, no faults, but after about 10 minutes it hung.

Do you get those GPU faults in the log even when there's no hang? I haven't checked dmesg while running valley myself, but I do know they always appear when a hang has happened (I'm using ssh to grab dmesg while it's hung).
Dmesg is sometimes completely filled with GPU faults, other times it's just a few. I ran it a few minutes ago and only got this:

[ 1737.984328] amdgpu 0000:01:00.0: GPU fault detected: 146 0x08804804
[ 1737.984338] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00100110
[ 1737.984343] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A048004
[ 1737.984348] VM fault (0x04, vmid 5) at page 1048848, read from 'TC6' (0x54433600) (72)
[ 1737.984355] amdgpu 0000:01:00.0: GPU fault detected: 146 0x08804004
[ 1737.984359] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 1737.984363] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 1737.984366] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0)
[ 1737.984374] amdgpu 0000:01:00.0: GPU fault detected: 146 0x08800804
[ 1737.984378] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 1737.984381] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[ 1737.984384] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0)
(In reply to Mathias Tillman from comment #20)
> Do you get those GPU faults in the log even when there's no hang?

Yes and I can also hang without getting any.
(In reply to Andy Furniss from comment #21)
> (In reply to Mathias Tillman from comment #20)
>
> > Do you get those GPU faults in the log even when there's no hang?
>
> Yes and I can also hang without getting any.

Actually, I think I've seen hangs without the GPU faults too now that I think about it. Makes you wonder if the GPU faults are related at all, or if this is something else entirely.