[Bug 205585] New: [Regression] [amdgpu] AMD Vega 64 GPU invalid access and EEH under load

* [Bug 205585] New: [Regression] [amdgpu] AMD Vega 64 GPU invalid access and EEH under load
From: bugzilla-daemon @ 2019-11-20  1:18 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=205585

            Bug ID: 205585
           Summary: [Regression] [amdgpu] AMD Vega 64 GPU invalid access
                    and EEH under load
           Product: Drivers
           Version: 2.5
    Kernel Version: 5.4-rc7
          Hardware: PPC-64
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: high
          Priority: P1
         Component: Video(DRI - non Intel)
          Assignee: drivers_video-dri@kernel-bugs.osdl.org
          Reporter: tpearson@raptorengineering.com
        Regression: No

Created attachment 285977
  --> https://bugzilla.kernel.org/attachment.cgi?id=285977&action=edit
backtrace

When using amdgpu with kernel 5.4-rc5 through 5.4-rc7, we are seeing invalid
DMA under load with the Vega 64.  This issue did not occur on 5.3 or earlier. 
The invalid DMA causes an EEH and knocks the GPU offline until a reboot.

Trace attached.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

* [Bug 205585] [Regression] [amdgpu] AMD Vega 64 GPU invalid access and EEH under load
From: bugzilla-daemon @ 2019-11-20 15:21 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=205585

Alex Deucher (alexdeucher@gmail.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |alexdeucher@gmail.com

--- Comment #1 from Alex Deucher (alexdeucher@gmail.com) ---
Can you bisect?

* [Bug 205585] [Regression] [amdgpu] AMD Vega 64 GPU invalid access and EEH under load
From: bugzilla-daemon @ 2019-11-20 20:26 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=205585

--- Comment #2 from Timothy Pearson (tpearson@raptorengineering.com) ---
I am travelling now but can bisect when back at the lab next week.

* [Bug 205585] [Regression] [amdgpu] AMD Vega 64 GPU invalid access and EEH under load
From: bugzilla-daemon @ 2019-11-29  7:45 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=205585

--- Comment #3 from Timothy Pearson (tpearson@raptorengineering.com) ---
Just had a chance to test on 5.4.0, still fails (haven't had a chance to bisect
yet; I suspect it's more related to the 64-bit enablement on POWER in 5.4 than
anything else).

The EEH is quite strange; the PEST register decodes as:
MMIO
CFG Read
Other Transaction Type
An MMIO Load, MMIO I/O Write, or other transaction returned from the PCIe link
with a status of Unsupported Request (UR)
Failure address: 0x000000000000

Full trace

[20341.276752702,3] PHB#0033[8:3]: PHB Freeze/Fence detected !
[20341.276848173,3] PHB#0033[8:3]:             PCI FIR=2000000000000000
[20341.276900504,3] PHB#0033[8:3]:         PCI FIR WOF=2000000000000000
[20341.276939625,3] PHB#0033[8:3]:            NEST FIR=0000800000000000
[20341.276979866,3] PHB#0033[8:3]:        NEST FIR WOF=0000800000000000
[20341.277023394,3] PHB#0033[8:3]:            ERR RPT0=0000000000000001
[20341.277068184,3] PHB#0033[8:3]:            ERR RPT1=0000000000000000
[20341.277110812,3] PHB#0033[8:3]:             AIB ERR=0000200000000000
[20341.277830701,3] PHB#0033[8:3]:                  brdgCtl = 00000002
[20341.277906614,3] PHB#0033[8:3]:             deviceStatus = 00000020
[20341.277946469,3] PHB#0033[8:3]:               slotStatus = 00402000
[20341.277981186,3] PHB#0033[8:3]:               linkStatus = e9010008
[20341.278025974,3] PHB#0033[8:3]:             devCmdStatus = 00100107
[20341.278068859,3] PHB#0033[8:3]:             devSecStatus = 00000000
[20341.278109829,3] PHB#0033[8:3]:          rootErrorStatus = 00000000
[20341.278149196,3] PHB#0033[8:3]:          corrErrorStatus = 00000000
[20341.278190145,3] PHB#0033[8:3]:        uncorrErrorStatus = 00000000
[20341.278223684,3] PHB#0033[8:3]:                   devctl = 00000020
[20341.278276525,3] PHB#0033[8:3]:                  devStat = 00000000
[20341.278314241,3] PHB#0033[8:3]:                  tlpHdr1 = 00000000
[20341.278356746,3] PHB#0033[8:3]:                  tlpHdr2 = 00000000
[20341.278397163,3] PHB#0033[8:3]:                  tlpHdr3 = 00000000
[20341.278440709,3] PHB#0033[8:3]:                  tlpHdr4 = 00000000
[20341.278478424,3] PHB#0033[8:3]:                 sourceId = 00000000
[20341.278516547,3] PHB#0033[8:3]:                     nFir = 0000800000000000
[20341.278555975,3] PHB#0033[8:3]:                 nFirMask = 0030001c00000000
[20341.278598653,3] PHB#0033[8:3]:                  nFirWOF = 0000800000000000
[20341.278642004,3] PHB#0033[8:3]:                 phbPlssr = 0000001800000000
[20341.278686870,3] PHB#0033[8:3]:                   phbCsr = 0000001800000000
[20341.278731874,3] PHB#0033[8:3]:                   lemFir = 0004000100000100
[20341.278776158,3] PHB#0033[8:3]:             lemErrorMask = 0000000000000000
[20341.278815229,3] PHB#0033[8:3]:                   lemWOF = 0000000100000000
[20341.278857015,3] PHB#0033[8:3]:           phbErrorStatus = 000005a000000000
[20341.278909821,3] PHB#0033[8:3]:      phbFirstErrorStatus = 0000002000000000
[20341.278951950,3] PHB#0033[8:3]:             phbErrorLog0 = 2148000098000240
[20341.278999524,3] PHB#0033[8:3]:             phbErrorLog1 = a008400000000000
[20341.279042839,3] PHB#0033[8:3]:        phbTxeErrorStatus = 0000200000000000
[20341.279081676,3] PHB#0033[8:3]:   phbTxeFirstErrorStatus = 0000200000000000
[20341.279120945,3] PHB#0033[8:3]:          phbTxeErrorLog0 = 4000000000000000
[20341.279160833,3] PHB#0033[8:3]:          phbTxeErrorLog1 = 0000000000000000
[20341.279207802,3] PHB#0033[8:3]:     phbRxeArbErrorStatus = 0000000000000000
[20341.279254658,3] PHB#0033[8:3]: phbRxeArbFrstErrorStatus = 0000000000000000
[20341.279297181,3] PHB#0033[8:3]:       phbRxeArbErrorLog0 = 0000000000000000
[20341.279334227,3] PHB#0033[8:3]:       phbRxeArbErrorLog1 = 0000000000000000
[20341.279376968,3] PHB#0033[8:3]:     phbRxeMrgErrorStatus = 0000000000000001
[20341.279420726,3] PHB#0033[8:3]: phbRxeMrgFrstErrorStatus = 0000000000000001
[20341.279469009,3] PHB#0033[8:3]:       phbRxeMrgErrorLog0 = 0000000000000000
[20341.279512839,3] PHB#0033[8:3]:       phbRxeMrgErrorLog1 = 0000000000000000
[20341.279561496,3] PHB#0033[8:3]:     phbRxeTceErrorStatus = 0000000000000000
[20341.279604696,3] PHB#0033[8:3]: phbRxeTceFrstErrorStatus = 0000000000000000
[20341.279645952,3] PHB#0033[8:3]:       phbRxeTceErrorLog0 = 0000000000000000
[20341.279685644,3] PHB#0033[8:3]:       phbRxeTceErrorLog1 = 0000000000000000
[20341.279731458,3] PHB#0033[8:3]:        phbPblErrorStatus = 0000000000000800
[20341.279778323,3] PHB#0033[8:3]:   phbPblFirstErrorStatus = 0000000000000800
[20341.279825433,3] PHB#0033[8:3]:          phbPblErrorLog0 = 0000000000000000
[20341.279866852,3] PHB#0033[8:3]:          phbPblErrorLog1 = 00000000028de410
[20341.279903104,3] PHB#0033[8:3]:      phbPcieDlpErrorLog1 = 0000000000000000
[20341.279942888,3] PHB#0033[8:3]:      phbPcieDlpErrorLog2 = 0000000000000000
[20341.279984925,3] PHB#0033[8:3]:    phbPcieDlpErrorStatus = 0000000000000000
[20341.280033282,3] PHB#0033[8:3]:       phbRegbErrorStatus = 0010001000000000
[20341.280080310,3] PHB#0033[8:3]:  phbRegbFirstErrorStatus = 0000001000000000
[20341.280126330,3] PHB#0033[8:3]:         phbRegbErrorLog0 = 4800003c00000000
[20341.280173657,3] PHB#0033[8:3]:         phbRegbErrorLog1 = 0000000000000200
[20341.280218925,3] PHB#0033[8:3]:                PEST[1ff] = 3740002a01000000 0000000000000000
[ 1580.231935] EEH: PHB#33 failure detected, location: N/A
[ 1580.231958] EEH: Frozen PHB#33-PE#0 detected
[ 1580.231969] EEH: Call Trace:
[ 1580.231983] EEH: [00000000741e7c92] __eeh_send_failure_event+0x78/0x150
[ 1580.232006] EEH: [0000000019c0a3ea] eeh_dev_check_failure+0x1d8/0x6b0
[ 1580.232019] EEH: [00000000d1114f7e] eeh_check_failure+0x98/0x100
[ 1580.232080] EEH: [0000000026fdad67] amdgpu_mm_rreg+0x20c/0x250 [amdgpu]
[ 1580.232134] EEH: [0000000087736ee4] vi_flush_hdp+0xa0/0xc0 [amdgpu]
[ 1580.232191] EEH: [000000000b00465e] amdgpu_gart_bind+0x78/0x140 [amdgpu]
[ 1580.232247] EEH: [00000000e410157a] amdgpu_ttm_gart_bind+0x124/0x140 [amdgpu]
[ 1580.232295] EEH: [0000000027696b17] amdgpu_ttm_alloc_gart+0x19c/0x230 [amdgpu]
[ 1580.232350] EEH: [00000000abff626d] amdgpu_vm_sdma_map_table+0x4c/0x70 [amdgpu]
[ 1580.232411] EEH: [000000003babc62e] amdgpu_vm_clear_bo+0x188/0x460 [amdgpu]
[ 1580.232460] EEH: [000000003135d9d5] amdgpu_vm_update_ptes+0x300/0x5f0 [amdgpu]
[ 1580.232513] EEH: [00000000a9b62a4c] amdgpu_vm_bo_update_mapping+0x100/0x140 [amdgpu]
[ 1580.232565] EEH: [00000000c53ee852] amdgpu_vm_bo_update+0x348/0x8a0 [amdgpu]
[ 1580.232614] EEH: [00000000e468e987] amdgpu_gem_va_ioctl+0x5c4/0x620 [amdgpu]
[ 1580.232644] EEH: [000000002c0a19e7] drm_ioctl_kernel+0xfc/0x180 [drm]
[ 1580.232671] EEH: [000000005cb0f244] drm_ioctl+0x238/0x480 [drm]
[ 1580.232725] EEH: [00000000b812c3a6] amdgpu_drm_ioctl+0x70/0xd0 [amdgpu]
[ 1580.232749] EEH: [000000004de566d7] do_vfs_ioctl+0xe0/0xac0
[ 1580.232770] EEH: [0000000045206404] ksys_ioctl+0xc4/0x110
[ 1580.232782] EEH: [000000001e273b3a] sys_ioctl+0x28/0x80
[ 1580.232804] EEH: [00000000aa248bf4] system_call+0x5c/0x68
[ 1580.232834] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
[ 1580.232880] EEH: Notify device drivers to shutdown
[ 1580.232911] EEH: Beginning: 'error_detected(IO frozen)'
[ 1580.232933] PCI 0033:00:00.0#01fe: EEH: no driver
[ 1580.232935] PCI 0033:01:00.0#0000: EEH: driver not EEH aware
[ 1580.232957] PCI 0033:01:00.1#0000: EEH: driver not EEH aware
[ 1580.232970] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'none'
[ 1580.232998] EEH: Collect temporary log
[ 1580.233008] PHB4 PHB#51 Diag-data (Version: 1)
[ 1580.233018] brdgCtl:    00000002
[ 1580.233028] RootSts:    00000020 00402000 e9010008 00100107 00000000
[ 1580.233040] nFir:       0000800000000000 0030001c00000000 0000800000000000
[ 1580.233062] PhbSts:     0000001800000000 0000001800000000
[ 1580.233082] Lem:        0004000100000100 0000000000000000 0000000100000000
[ 1580.233104] PhbErr:     000005a000000000 0000002000000000 2148000098000240 a008400000000000
[ 1580.233136] PhbTxeErr:  0000200000000000 0000200000000000 4000000000000000 0000000000000000
[ 1580.233169] RxeMrgErr:  0000000000000001 0000000000000001 0000000000000000 0000000000000000
[ 1580.233192] PblErr:     0000000000000800 0000000000000800 0000000000000000 00000000028de410
[ 1580.233225] RegbErr:    0010001000000000 0000001000000000 4800003c00000000 0000000000000200
[ 1580.233259] EEH: Reset with hotplug activity
[ 1580.891352] snd_hda_codec_hdmi hdaudioC0D0: Unable to sync register 0x2f0d00. -5
[ 1590.340025] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=7463, emitted seq=7465
[ 1590.340117] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
[ 1590.340172] amdgpu 0033:01:00.0: GPU reset begin!
[ 1590.350000] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=325761, emitted seq=325763
[ 1590.350057] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process hyperspace pid 4160 thread hyperspace:cs0 pid 4161
[ 1590.350089] amdgpu 0033:01:00.0: GPU reset begin!
[ 1590.350108] [drm] Bailing on TDR for s_job:4f608, as another already in progress
[ 1590.350923] amdgpu: [powerplay]
[ 1590.350923]  last message was failed ret is 65535
[ 1590.350949] amdgpu: [powerplay]
[ 1590.350949]  failed to send message 261 ret is 65535
[ 1590.350971] amdgpu: [powerplay]
[ 1590.350971]  last message was failed ret is 65535
[ 1590.350983] amdgpu: [powerplay]
[ 1590.350983]  failed to send message 261 ret is 65535
[ 1590.350996] amdgpu: [powerplay]
[ 1590.350996]  last message was failed ret is 65535
[ 1590.351017] amdgpu: [powerplay]
[ 1590.351017]  failed to send message 261 ret is 65535
[ 1590.351030] amdgpu: [powerplay]
[ 1590.351030]  last message was failed ret is 65535
[ 1590.351064] amdgpu: [powerplay]
[ 1590.351064]  failed to send message 261 ret is 65535
[ 1590.351096] amdgpu: [powerplay]
[ 1590.351096]  last message was failed ret is 65535
[ 1590.351127] amdgpu: [powerplay]
[ 1590.351127]  failed to send message 261 ret is 65535
[ 1590.351158] amdgpu: [powerplay]
[ 1590.351158]  last message was failed ret is 65535
[ 1590.351202] amdgpu: [powerplay]
[ 1590.351202]  failed to send message 261 ret is 65535
[ 1590.351224] amdgpu: [powerplay]
[ 1590.351224]  last message was failed ret is 65535
[ 1590.351236] amdgpu: [powerplay]
[ 1590.351236]  failed to send message 261 ret is 65535
[ 1590.351251] amdgpu: [powerplay]
[ 1590.351251]  last message was failed ret is 65535
[ 1590.351272] amdgpu: [powerplay]
[ 1590.351272]  failed to send message 261 ret is 65535
[ 1590.351303] amdgpu: [powerplay]
[ 1590.351303]  last message was failed ret is 65535
[ 1590.351324] amdgpu: [powerplay]
[ 1590.351324]  failed to send message 261 ret is 65535
[ 1590.351356] amdgpu: [powerplay]
[ 1590.351356]  last message was failed ret is 65535
[ 1590.351378] amdgpu: [powerplay]
[ 1590.351378]  failed to send message 261 ret is 65535
[ 1590.351410] amdgpu: [powerplay]
[ 1590.351410]  last message was failed ret is 65535
[ 1590.351441] amdgpu: [powerplay]
[ 1590.351441]  failed to send message 261 ret is 65535
[ 1590.351463] amdgpu: [powerplay]
[ 1590.351463]  last message was failed ret is 65535
[ 1590.351485] amdgpu: [powerplay]
[ 1590.351485]  failed to send message 261 ret is 65535
[ 1590.351520] amdgpu: [powerplay]
[ 1590.351520]  last message was failed ret is 65535
[ 1590.351541] amdgpu: [powerplay]
[ 1590.351541]  failed to send message 261 ret is 65535
[ 1590.351572] amdgpu: [powerplay]
[ 1590.351572]  last message was failed ret is 65535
[ 1590.351603] amdgpu: [powerplay]
[ 1590.351603]  failed to send message 261 ret is 65535
[ 1590.351634] amdgpu: [powerplay]
[ 1590.351634]  last message was failed ret is 65535
[ 1590.351666] amdgpu: [powerplay]
[ 1590.351666]  failed to send message 261 ret is 65535
[ 1590.351698] amdgpu: [powerplay]
[ 1590.351698]  last message was failed ret is 65535
[ 1590.351730] amdgpu: [powerplay]
[ 1590.351730]  failed to send message 261 ret is 65535
[ 1590.351761] amdgpu: [powerplay]
[ 1590.351761]  last message was failed ret is 65535
[ 1590.351795] amdgpu: [powerplay]
[ 1590.351795]  failed to send message 261 ret is 65535
[ 1590.351980] amdgpu: [powerplay]
[ 1590.351980]  last message was failed ret is 65535
[ 1590.352014] amdgpu: [powerplay]
[ 1590.352014]  failed to send message 306 ret is 65535
[ 1590.352039] amdgpu: [powerplay]
[ 1590.352039]  last message was failed ret is 65535
[ 1590.352080] amdgpu: [powerplay]
[ 1590.352080]  failed to send message 5e ret is 65535
[ 1590.352103] amdgpu: [powerplay]
[ 1590.352103]  last message was failed ret is 65535
[ 1590.352134] amdgpu: [powerplay]
[ 1590.352134]  failed to send message 145 ret is 65535
[ 1590.352156] amdgpu: [powerplay]
[ 1590.352156]  last message was failed ret is 65535
[ 1590.352190] amdgpu: [powerplay]
[ 1590.352190]  failed to send message 146 ret is 65535
[ 1590.352225] amdgpu: [powerplay]
[ 1590.352225]  last message was failed ret is 65535
[ 1590.352271] amdgpu: [powerplay]
[ 1590.352271]  failed to send message 148 ret is 65535
[ 1590.352292] amdgpu: [powerplay]
[ 1590.352292]  last message was failed ret is 65535
[ 1590.352304] amdgpu: [powerplay]
[ 1590.352304]  failed to send message 145 ret is 65535
[ 1590.352339] amdgpu: [powerplay]
[ 1590.352339]  last message was failed ret is 65535
[ 1590.352370] amdgpu: [powerplay]
[ 1590.352370]  failed to send message 146 ret is 65535
[ 1590.383835] [drm] REG_WAIT timeout 10us * 3000 tries -
dce110_stream_encoder_dp_blank line:956
[ 1590.383875] ------------[ cut here ]------------
[ 1590.383912] WARNING: CPU: 48 PID: 1214 at
drivers/gpu/drm/amd/amdgpu/../display/dc/dc_helper.c:332
generic_reg_wait+0x214/0x230 [amdgpu]
[ 1590.383945] Modules linked in: i2c_dev uinput amdgpu snd_usb_audio
drm_vram_helper snd_usbmidi_lib gpu_sched ttm snd_rawmidi snd_seq_device ses mc
drm_kms_helper snd_hda_codec_hdmi enclosure joydev sd_mod evdev
scsi_transport_sas drm snd_hda_intel sg snd_hda_codec
drm_panel_orientation_quirks snd_hda_core syscopyarea sysfillrect ecb snd_hwdep
aacraid sysimgblt fb_sys_fops snd_pcm nvme nvme_core xts i2c_algo_bit snd_timer
snd soundcore ctr cbc ofpart vmx_crypto ipmi_powernv ipmi_devintf powernv_flash
gf128mul mtd ipmi_msghandler opal_prd at24 binfmt_misc parport_pc lp parport
ip_tables x_tables autofs4 nfsv3 nfs_acl nfs lockd grace sunrpc fscache raid10
raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
hid_generic usbhid hid raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath
linear md_mod xhci_pci xhci_hcd usbcore tg3 libphy
[ 1590.384181] CPU: 48 PID: 1214 Comm: kworker/48:2 Not tainted 5.4.0 #5
[ 1590.384194] Workqueue: events drm_sched_job_timedout [gpu_sched]
[ 1590.384205] NIP:  c00800000888505c LR: c00800000888504c CTR:
c000000000715d70
[ 1590.384238] REGS: c0000007dd55ec40 TRAP: 0700   Not tainted  (5.4.0)
[ 1590.384257] MSR:  9000000002029033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE>  CR:
28224228  XER: 00000000
[ 1590.384284] CFAR: c0000000001b66f4 IRQMASK: 0
[ 1590.384284] GPR00: c00800000888504c c0000007dd55eed0 c0080000089f5000
0000000000000052
[ 1590.384284] GPR04: c0000007fdd1ce18 c0000007fdda5858 0000000000000490
c0000007fffc9000
[ 1590.384284] GPR08: 0000000000000007 0000000000000000 00000007fced0000
9000000002001033
[ 1590.384284] GPR12: 0000000000004000 c0000007fffc9000 c000200715000000
c0000007eff449c0
[ 1590.384284] GPR16: c0000007dc7a6000 c0000007def45300 0000000000000000
00000000000003bc
[ 1590.384284] GPR20: c0080000088f6470 0000000000000000 0000000000004ea4
0000000000010000
[ 1590.384284] GPR24: 0000000000000000 c00800000890ca90 c0000007a9e40680
0000000000000bb8
[ 1590.384284] GPR28: 0000000000000010 0000000000000bb8 000000000000000a
0000000000000bb9
[ 1590.384414] NIP [c00800000888505c] generic_reg_wait+0x214/0x230 [amdgpu]
[ 1590.384450] LR [c00800000888504c] generic_reg_wait+0x204/0x230 [amdgpu]
[ 1590.384467] Call Trace:
[ 1590.384499] [c0000007dd55eed0] [c00800000888504c]
generic_reg_wait+0x204/0x230 [amdgpu] (unreliable)
[ 1590.384548] [c0000007dd55efa0] [c00800000882caec]
dce110_stream_encoder_dp_blank+0x104/0x170 [amdgpu]
[ 1590.384601] [c0000007dd55f030] [c00800000885a07c]
dce110_blank_stream+0xf4/0x120 [amdgpu]
[ 1590.384632] [c0000007dd55f060] [c0080000088743bc]
core_link_disable_stream+0x64/0x420 [amdgpu]
[ 1590.384692] [c0000007dd55f140] [c008000008857dbc]
dce110_reset_hw_ctx_wrap+0xf4/0x2e0 [amdgpu]
[ 1590.384745] [c0000007dd55f200] [c00800000885a2e0]
dce110_apply_ctx_to_hw+0x58/0x600 [amdgpu]
[ 1590.384797] [c0000007dd55f2d0] [c00800000886dcec]
dc_commit_state+0x3d4/0x820 [amdgpu]
[ 1590.384853] [c0000007dd55f400] [c0080000087fe94c]
amdgpu_dm_atomic_commit_tail+0x3c4/0x19a8 [amdgpu]
[ 1590.384888] [c0000007dd55f700] [c008000007d93fb0] commit_tail+0xf8/0x1f0
[drm_kms_helper]
[ 1590.384912] [c0000007dd55f740] [c008000007d942a8]
drm_atomic_helper_commit+0x1e0/0x1f0 [drm_kms_helper]
[ 1590.384951] [c0000007dd55f780] [c0080000087fbac8]
amdgpu_dm_atomic_commit+0x110/0x140 [amdgpu]
[ 1590.384992] [c0000007dd55f7e0] [c0080000079ce2cc]
drm_atomic_commit+0x74/0xa0 [drm]
[ 1590.385016] [c0000007dd55f850] [c008000007d94768]
drm_atomic_helper_disable_all+0x290/0x2b0 [drm_kms_helper]
[ 1590.385044] [c0000007dd55f8a0] [c008000007d949dc]
drm_atomic_helper_suspend+0x154/0x1a0 [drm_kms_helper]
[ 1590.385094] [c0000007dd55f920] [c0080000087f717c] dm_suspend+0x44/0xa0
[amdgpu]
[ 1590.385124] [c0000007dd55f950] [c008000008621e2c]
amdgpu_device_ip_suspend_phase1+0xe4/0x190 [amdgpu]
[ 1590.385163] [c0000007dd55f9d0] [c008000008623ddc]
amdgpu_device_ip_suspend+0x44/0xe0 [amdgpu]
[ 1590.385192] [c0000007dd55fa10] [c00800000888de54]
amdgpu_device_pre_asic_reset+0x248/0x28c [amdgpu]
[ 1590.385230] [c0000007dd55fab0] [c00800000888e7b8]
amdgpu_device_gpu_recover+0x2f0/0xb4c [amdgpu]
[ 1590.385268] [c0000007dd55fb90] [c008000008779f3c]
amdgpu_job_timedout+0x124/0x170 [amdgpu]
[ 1590.385290] [c0000007dd55fc30] [c008000007651244]
drm_sched_job_timedout+0x6c/0x110 [gpu_sched]
[ 1590.385336] [c0000007dd55fc70] [c000000000154ee0]
process_one_work+0x260/0x520
[ 1590.385379] [c0000007dd55fd10] [c000000000155228] worker_thread+0x88/0x5f0
[ 1590.385400] [c0000007dd55fdb0] [c00000000015f21c] kthread+0x19c/0x1b0
[ 1590.385430] [c0000007dd55fe20] [c00000000000bd54]
ret_from_kernel_thread+0x5c/0x68
[ 1590.385463] Instruction dump:
[ 1590.385480] 4bfffed4 3c620000 e8633ab8 7e679b78 7e86a378 7f65db78 7fc4f378
4800f091
[ 1590.385513] e8410018 813a0020 2f890001 419eff7c <0fe00000> 4bffff74 60000000
60000000
[ 1590.385546] ---[ end trace 59567a2f8b8649ed ]---
[ 1591.478349] PCI 0033:01:00.0#0000: EEH: 2100000 reads ignored for recovering device at location=CPU2 Slot1 (16x) driver=amdgpu
[ 1591.478370] PCI 0033:01:00.0#0000: EEH: Might be infinite loop in amdgpu driver
[ 1591.478382] CPU: 48 PID: 1214 Comm: kworker/48:2 Tainted: G        W         5.4.0 #5
[ 1591.478405] Workqueue: events drm_sched_job_timedout [gpu_sched]
[ 1591.478414] Call Trace:
[ 1591.478422] [c0000007dd55e940] [c000000000a9ccc8] dump_stack+0xbc/0x104
(unreliable)
[ 1591.478434] [c0000007dd55e980] [c00000000003e788]
eeh_dev_check_failure+0x598/0x6b0
[ 1591.478455] [c0000007dd55ea30] [c00000000003eb08]
eeh_check_failure+0x98/0x100
[ 1591.478491] [c0000007dd55ea70] [c008000008622744] amdgpu_mm_rreg+0x20c/0x250
[amdgpu]
[ 1591.478539] [c0000007dd55eac0] [c0080000086298f4] cail_reg_read+0x2c/0x50
[amdgpu]
[ 1591.478577] [c0000007dd55eae0] [c00800000863255c]
atom_get_src_int+0x104/0xa00 [amdgpu]
[ 1591.478615] [c0000007dd55eb90] [c008000008633e30] atom_op_test+0xd8/0x1d0
[amdgpu]
[ 1591.478660] [c0000007dd55ec20] [c008000008636a2c]
amdgpu_atom_execute_table_locked+0x204/0x3e0 [amdgpu]
[ 1591.478701] [c0000007dd55ed20] [c008000008636d30]
atom_op_calltable+0x128/0x1e0 [amdgpu]
[ 1591.478740] [c0000007dd55eda0] [c008000008636a2c]
amdgpu_atom_execute_table_locked+0x204/0x3e0 [amdgpu]
[ 1591.478770] [c0000007dd55eea0] [c008000008636e58]
amdgpu_atom_execute_table+0x70/0xb0 [amdgpu]
[ 1591.478829] [c0000007dd55eee0] [c008000008810f30]
transmitter_control_v1_6+0x128/0x220 [amdgpu]
[ 1591.478887] [c0000007dd55ef40] [c00800000880c410]
bios_parser_transmitter_control+0x38/0x70 [amdgpu]
[ 1591.478944] [c0000007dd55ef60] [c00800000882f678]
dce110_link_encoder_disable_output+0xd0/0x1c0 [amdgpu]
[ 1591.478997] [c0000007dd55f020] [c00800000887cbfc]
dp_disable_link_phy+0xa4/0x1d0 [amdgpu]
[ 1591.479029] [c0000007dd55f060] [c008000008874488]
core_link_disable_stream+0x130/0x420 [amdgpu]
[ 1591.479082] [c0000007dd55f140] [c008000008857dbc]
dce110_reset_hw_ctx_wrap+0xf4/0x2e0 [amdgpu]
[ 1591.479134] [c0000007dd55f200] [c00800000885a2e0]
dce110_apply_ctx_to_hw+0x58/0x600 [amdgpu]
[ 1591.479186] [c0000007dd55f2d0] [c00800000886dcec]
dc_commit_state+0x3d4/0x820 [amdgpu]
[ 1591.479241] [c0000007dd55f400] [c0080000087fe94c]
amdgpu_dm_atomic_commit_tail+0x3c4/0x19a8 [amdgpu]
[ 1591.479280] [c0000007dd55f700] [c008000007d93fb0] commit_tail+0xf8/0x1f0
[drm_kms_helper]
[ 1591.479325] [c0000007dd55f740] [c008000007d942a8]
drm_atomic_helper_commit+0x1e0/0x1f0 [drm_kms_helper]
[ 1591.479381] [c0000007dd55f780] [c0080000087fbac8]
amdgpu_dm_atomic_commit+0x110/0x140 [amdgpu]
[ 1591.479419] [c0000007dd55f7e0] [c0080000079ce2cc]
drm_atomic_commit+0x74/0xa0 [drm]
[ 1591.479445] [c0000007dd55f850] [c008000007d94768]
drm_atomic_helper_disable_all+0x290/0x2b0 [drm_kms_helper]
[ 1591.479484] [c0000007dd55f8a0] [c008000007d949dc]
drm_atomic_helper_suspend+0x154/0x1a0 [drm_kms_helper]
[ 1591.479542] [c0000007dd55f920] [c0080000087f717c] dm_suspend+0x44/0xa0
[amdgpu]
[ 1591.479589] [c0000007dd55f950] [c008000008621e2c]
amdgpu_device_ip_suspend_phase1+0xe4/0x190 [amdgpu]
[ 1591.479640] [c0000007dd55f9d0] [c008000008623ddc]
amdgpu_device_ip_suspend+0x44/0xe0 [amdgpu]
[ 1591.479674] [c0000007dd55fa10] [c00800000888de54]
amdgpu_device_pre_asic_reset+0x248/0x28c [amdgpu]
[ 1591.479712] [c0000007dd55fab0] [c00800000888e7b8]
amdgpu_device_gpu_recover+0x2f0/0xb4c [amdgpu]
[ 1591.479769] [c0000007dd55fb90] [c008000008779f3c]
amdgpu_job_timedout+0x124/0x170 [amdgpu]
[ 1591.479815] [c0000007dd55fc30] [c008000007651244]
drm_sched_job_timedout+0x6c/0x110 [gpu_sched]
[ 1591.479860] [c0000007dd55fc70] [c000000000154ee0]
process_one_work+0x260/0x520
[ 1591.479903] [c0000007dd55fd10] [c000000000155228] worker_thread+0x88/0x5f0
[ 1591.479923] [c0000007dd55fdb0] [c00000000015f21c] kthread+0x19c/0x1b0
[ 1591.479953] [c0000007dd55fe20] [c00000000000bd54]
ret_from_kernel_thread+0x5c/0x68
[ 1592.584699] PCI 0033:01:00.0#0000: EEH: 4200000 reads ignored for recovering device at location=CPU2 Slot1 (16x) driver=amdgpu
[ 1592.584723] PCI 0033:01:00.0#0000: EEH: Might be infinite loop in amdgpu driver
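
For anyone triaging a trace like the one above, here is a minimal sketch (not part of the original report) of how the `EEH:` call-trace lines could be split into symbol, offset, size, and module, for example to find the first driver frame under the EEH check. The regular expression and the names `Frame` and `parse_eeh_frame` are assumptions made for illustration, and the sketch expects each frame on a single line, with any mail wrapping already rejoined.

```python
import re
from typing import NamedTuple, Optional


class Frame(NamedTuple):
    symbol: str
    offset: int            # offset into the function, in bytes
    size: int              # size of the function, in bytes
    module: Optional[str]  # None for frames in the core kernel


# Hypothetical pattern for lines such as:
#   [ 1580.232080] EEH: [0000000026fdad67] amdgpu_mm_rreg+0x20c/0x250 [amdgpu]
_FRAME_RE = re.compile(
    r"EEH: \[[0-9a-f]+\] (?P<sym>[\w.]+)"
    r"\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?: \[(?P<mod>\w+)\])?"
)


def parse_eeh_frame(line: str) -> Optional[Frame]:
    """Return the parsed frame, or None if the line is not a call-trace entry."""
    m = _FRAME_RE.search(line)
    if m is None:
        return None
    return Frame(m["sym"], int(m["off"], 16), int(m["size"], 16), m["mod"])


# Lines copied from the trace above (single-line form).
sample = [
    "[ 1580.232006] EEH: [0000000019c0a3ea] eeh_dev_check_failure+0x1d8/0x6b0",
    "[ 1580.232080] EEH: [0000000026fdad67] amdgpu_mm_rreg+0x20c/0x250 [amdgpu]",
    "[ 1580.232134] EEH: [0000000087736ee4] vi_flush_hdp+0xa0/0xc0 [amdgpu]",
    "[ 1580.232880] EEH: Notify device drivers to shutdown",
]

frames = [f for f in map(parse_eeh_frame, sample) if f is not None]
first_driver_frame = next(f for f in frames if f.module == "amdgpu")
print(first_driver_frame.symbol)  # amdgpu_mm_rreg
```

The non-frame `EEH:` status line is skipped because it has no bracketed address after the prefix.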

* [Bug 205585] [Regression] [amdgpu] AMD Vega 64 GPU invalid access and EEH under load
From: bugzilla-daemon @ 2019-11-29  8:25 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=205585

--- Comment #4 from Timothy Pearson (tpearson@raptorengineering.com) ---
Stack decodes to:

arch/powerpc/include/asm/eeh.h:403 [if (EEH_POSSIBLE_ERROR(val, u32))]
drivers/gpu/drm/amd/amdgpu/vi.c:913 [RREG32(mmHDP_MEM_COHERENCY_FLUSH_CNTL)]
drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c:340 [amdgpu_asic_flush_hdp(adev, NULL)]
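
As a rough illustration of the first decoded frame (a Python model, not the kernel source): to the best of my understanding, the `EEH_POSSIBLE_ERROR(val, u32)` check treats an MMIO read that returns all ones for its width as a possible error, provided EEH is enabled, and only then queries the platform for a frozen PE. The function name `possible_eeh_error` and its parameters are hypothetical.

```python
def possible_eeh_error(val: int, width_bits: int, eeh_enabled: bool = True) -> bool:
    """Model of the check: an all-ones read at this width is a possible EEH error."""
    all_ones = (1 << width_bits) - 1
    return eeh_enabled and val == all_ones


print(possible_eeh_error(0xFFFFFFFF, 32))                     # True
print(possible_eeh_error(0x00000001, 32))                     # False
print(possible_eeh_error(0xFFFFFFFF, 32, eeh_enabled=False))  # False
```

For comparison, the `ret is 65535` values in the powerplay messages of the earlier trace equal 0xFFFF, which is all ones at 16 bits; the thread does not say whether they come from all-ones register reads.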

* [Bug 205585] [Regression] [amdgpu] AMD Vega 64 GPU invalid access and EEH under load
From: bugzilla-daemon @ 2019-11-29 20:16 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=205585

--- Comment #5 from Alex Deucher (alexdeucher@gmail.com) ---
This doesn't look related to the first one.  The first one is a vega10 asic
according to the description; the second one is from an older VI asic.
mmHDP_MEM_COHERENCY_FLUSH_CNTL is a register that the driver uses to flush and
invalidate the cache on the framebuffer BAR (for CPU access to the
framebuffer).  This particular code path has been in the driver for years.

* [Bug 205585] [Regression] [amdgpu] AMD Vega 64 GPU invalid access and EEH under load
From: bugzilla-daemon @ 2019-11-29 21:24 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=205585

--- Comment #6 from Timothy Pearson (tpearson@raptorengineering.com) ---
Yes, my fault, sorry about that -- this was a different box that, unbeknownst
to me, had a different GPU (note to self: check lspci next time before
decoding a trace).

To top it off, this particular fault seems to be related to a faulty GPU --
letting it cool overnight fixes the problems temporarily.  I still need to
verify the Vega is failing on 5.4.0, as one of the patches leading up to 5.4.0
resolved the similar software lockup I had been seeing on this Polaris card.

* [Bug 205585] [Regression] [amdgpu] AMD Vega 64 GPU invalid access and EEH under load
From: bugzilla-daemon @ 2019-11-30  0:36 UTC (permalink / raw)
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=205585

Timothy Pearson (tpearson@raptorengineering.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |PATCH_ALREADY_AVAILABLE

--- Comment #7 from Timothy Pearson (tpearson@raptorengineering.com) ---
Thus far I have not been able to reproduce on 5.4.0 stable.  At this point I'm
going to assume it was fixed somewhere in the rc merge process.
