linux-kernel.vger.kernel.org archive mirror
* amdgpu errors (VM fault / GPU fault detected) with 5.19 merge window snapshots
@ 2022-05-27  9:00 Michal Kubecek
  2022-05-27 12:44 ` (REGRESSION bisected) " Michal Kubecek
  0 siblings, 1 reply; 10+ messages in thread
From: Michal Kubecek @ 2022-05-27  9:00 UTC
  To: amd-gfx; +Cc: linux-kernel


Hello,

while testing 5.19 merge window snapshots (commits babf0bb978e3 and
7e284070abe5), I keep getting errors like below. I have not seen them
with 5.18 final or older.

------------------------------------------------------------------------
[  247.150333] gmc_v8_0_process_interrupt: 46 callbacks suppressed
[  247.150336] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x00020802 for process firefox pid 6101 thread firefox:cs0 pid 6116
[  247.150339] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00107800
[  247.150340] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0D008002
[  247.150341] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 6, pasid 32780) at page 1079296, write from 'TC2' (0x54433200) (8)
[  248.866434] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x00120802 for process firefox pid 6101 thread firefox:cs0 pid 6116
[  248.866438] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00138802
[  248.866439] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0D008002
[  248.866440] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 6, pasid 32780) at page 1280002, write from 'TC2' (0x54433200) (8)
[  248.866775] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x000a0802 for process firefox pid 6101 thread firefox:cs0 pid 6116
[  248.866776] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00138C01
[  248.866777] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0D008002
[  248.866777] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 6, pasid 32780) at page 1281025, write from 'TC2' (0x54433200) (8)
[  248.866884] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x00020802 for process firefox pid 6101 thread firefox:cs0 pid 6116
[  248.866885] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00139400
[  248.866885] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0D008002
[  248.866886] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 6, pasid 32780) at page 1283072, write from 'TC2' (0x54433200) (8)
[  248.866939] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x00020802 for process firefox pid 6101 thread firefox:cs0 pid 6116
[  248.866940] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00139800
[  248.866940] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0D008002
[  248.866941] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 6, pasid 32780) at page 1284096, write from 'TC2' (0x54433200) (8)
[  248.867000] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x00020802 for process firefox pid 6101 thread firefox:cs0 pid 6116
[  248.867001] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00139C00
[  248.867001] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0D008002
[  248.867002] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 6, pasid 32780) at page 1285120, write from 'TC2' (0x54433200) (8)
[  248.879700] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x00020802 for process firefox pid 6101 thread firefox:cs0 pid 6116
[  248.879704] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0013D600
[  248.879705] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03008002
[  248.879706] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 1, pasid 32780) at page 1299968, write from 'TC2' (0x54433200) (8)
[  248.883086] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x000a0802 for process firefox pid 6101 thread firefox:cs0 pid 6116
[  248.883088] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0013EE01
[  248.883088] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03008002
[  248.883089] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 1, pasid 32780) at page 1306113, write from 'TC2' (0x54433200) (8)
[  249.191811] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x00020802 for process firefox pid 6101 thread firefox:cs0 pid 6116
[  249.191815] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00142C00
[  249.191816] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0D008002
[  249.191817] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 6, pasid 32780) at page 1321984, write from 'TC2' (0x54433200) (8)
[  249.193491] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x00085202 for process firefox pid 6101 thread firefox:cs0 pid 6116
[  249.193493] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00142C01
[  249.193493] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C052002
[  249.193494] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 6, pasid 32780) at page 1321985, read from 'CBC1' (0x43424331) (82)
[  249.925909] amdgpu 0000:0c:00.0: amdgpu: IH ring buffer overflow (0x000844C0, 0x00004A00, 0x000044D0)
[  250.434986] [drm] Fence fallback timer expired on ring sdma0
[  466.621568] gmc_v8_0_process_interrupt: 122 callbacks suppressed
[  466.621573] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x000a0802 for process firefox pid 6101 thread firefox:cs0 pid 6116
[  466.621575] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00161401
[  466.621575] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B008002
[  466.621576] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32780) at page 1446913, write from 'TC2' (0x54433200) (8)
[ 1044.915401] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x00020802 for process firefox pid 6101 thread firefox:cs0 pid 6116
[ 1044.915405] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0016BE00
[ 1044.915406] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x07008002
[ 1044.915407] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 3, pasid 32780) at page 1490432, write from 'TC2' (0x54433200) (8)
[ 1059.900168] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x00020802 for process firefox pid 6101 thread firefox:cs0 pid 6116
[ 1059.900172] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0016F600
[ 1059.900173] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x09008002
[ 1059.900174] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 4, pasid 32780) at page 1504768, write from 'TC2' (0x54433200) (8)
[ 3972.123585] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x0ac20402 for process stellarium pid 8057 thread stellarium:cs0 pid 8060
[ 3972.123589] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010D958
[ 3972.123590] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x09004002
[ 3972.123591] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 4, pasid 32781) at page 1104216, write from 'TC3' (0x54433300) (4)
[ 3972.123644] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x0ada0802 for process stellarium pid 8057 thread stellarium:cs0 pid 8060
[ 3972.123645] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010D95B
[ 3972.123646] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x09008002
[ 3972.123646] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 4, pasid 32781) at page 1104219, write from 'TC2' (0x54433200) (8)
[ 3972.124308] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x02024802 for process stellarium pid 8057 thread stellarium:cs0 pid 8060
[ 3972.124309] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010DA40
[ 3972.124309] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x09048002
[ 3972.124310] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 4, pasid 32781) at page 1104448, write from 'TC0' (0x54433000) (72)
[ 3972.124993] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x0ac06202 for process stellarium pid 8057 thread stellarium:cs0 pid 8060
[ 3972.124994] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010D958
[ 3972.124995] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08062002
[ 3972.124995] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 4, pasid 32781) at page 1104216, read from 'CBC0' (0x43424330) (98)
[ 3972.124999] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x0ac02202 for process stellarium pid 8057 thread stellarium:cs0 pid 8060
[ 3972.124999] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010D843
[ 3972.125000] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x09050002
[ 3972.125000] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 4, pasid 32781) at page 1103939, write from 'CB1' (0x43423100) (80)
[ 3972.125004] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x0ac01202 for process stellarium pid 8057 thread stellarium:cs0 pid 8060
[ 3972.125005] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010D891
[ 3972.125005] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x09010002
[ 3972.125005] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 4, pasid 32781) at page 1104017, write from 'CB3' (0x43423300) (16)
[ 3972.125009] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x0ac05202 for process stellarium pid 8057 thread stellarium:cs0 pid 8060
[ 3972.125010] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010D8CB
[ 3972.125010] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x09050002
[ 3972.125011] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 4, pasid 32781) at page 1104075, write from 'CB1' (0x43423100) (80)
[ 3972.125015] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x002a1002 for process stellarium pid 8057 thread stellarium:cs0 pid 8060
[ 3972.125015] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010D831
[ 3972.125016] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x09060002
[ 3972.125016] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 4, pasid 32781) at page 1103921, write from 'CB0' (0x43423000) (96)
[ 3972.125020] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x002a2002 for process stellarium pid 8057 thread stellarium:cs0 pid 8060
[ 3972.125021] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010D8B0
[ 3972.125021] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x09050002
[ 3972.125021] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 4, pasid 32781) at page 1104048, write from 'CB1' (0x43423100) (80)
[ 3972.129482] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x0ac04802 for process stellarium pid 8057 thread stellarium:cs0 pid 8060
[ 3972.129483] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010D958
[ 3972.129484] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08048002
[ 3972.129484] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 4, pasid 32781) at page 1104216, read from 'TC0' (0x54433000) (72)
[ 3979.889515] gmc_v8_0_process_interrupt: 530 callbacks suppressed
[ 3979.889519] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x02020802 for process stellarium pid 8057 thread stellarium:cs0 pid 8060
[ 3979.889522] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00143C40
[ 3979.889523] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x05008002
[ 3979.889523] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 2, pasid 32781) at page 1326144, write from 'TC2' (0x54433200) (8)
[ 3979.889975] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x02001202 for process stellarium pid 8057 thread stellarium:cs0 pid 8060
[ 3979.889976] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00143C40
[ 3979.889977] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x04012002
[ 3979.889977] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 2, pasid 32781) at page 1326144, read from 'CBC3' (0x43424333) (18)
[ 3979.889982] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x02002202 for process stellarium pid 8057 thread stellarium:cs0 pid 8060
[ 3979.889983] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00143C05
[ 3979.889983] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x05020002
[ 3979.889984] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 2, pasid 32781) at page 1326085, write from 'CB2' (0x43423200) (32)
[ 3979.889988] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x02005202 for process stellarium pid 8057 thread stellarium:cs0 pid 8060
[ 3979.889989] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00143C07
[ 3979.889989] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x05050002
[ 3979.889990] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 2, pasid 32781) at page 1326087, write from 'CB1' (0x43423100) (80)
[ 3979.889994] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x02006202 for process stellarium pid 8057 thread stellarium:cs0 pid 8060
[ 3979.889995] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00143C15
[ 3979.889995] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x05010002
[ 3979.889995] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 2, pasid 32781) at page 1326101, write from 'CB3' (0x43423300) (16)
[ 3979.890000] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x002a2002 for process stellarium pid 8057 thread stellarium:cs0 pid 8060
[ 3979.890001] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00143C19
[ 3979.890001] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x05010002
[ 3979.890002] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 2, pasid 32781) at page 1326105, write from 'CB3' (0x43423300) (16)
[ 3979.890006] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x002a1002 for process stellarium pid 8057 thread stellarium:cs0 pid 8060
[ 3979.890007] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00143C30
[ 3979.890007] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x05010002
[ 3979.890007] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 2, pasid 32781) at page 1326128, write from 'CB3' (0x43423300) (16)
[ 3979.890012] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x000a2002 for process stellarium pid 8057 thread stellarium:cs0 pid 8060
[ 3979.890013] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00143C32
[ 3979.890013] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x05010002
[ 3979.890014] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 2, pasid 32781) at page 1326130, write from 'CB3' (0x43423300) (16)
[ 3979.890017] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x000a1002 for process stellarium pid 8057 thread stellarium:cs0 pid 8060
[ 3979.890018] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00143C28
[ 3979.890018] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x05010002
[ 3979.890019] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 2, pasid 32781) at page 1326120, write from 'CB3' (0x43423300) (16)
[ 3979.891937] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x02000802 for process stellarium pid 8057 thread stellarium:cs0 pid 8060
[ 3979.891937] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00143C40
[ 3979.891938] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x04008002
[ 3979.891938] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 2, pasid 32781) at page 1326144, read from 'TC2' (0x54433200) (8)
[ 4062.912573] gmc_v8_0_process_interrupt: 2 callbacks suppressed
[ 4062.912578] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x00004802 for process stellarium pid 8057 thread stellarium:cs0 pid 8060
[ 4062.912580] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000800
[ 4062.912581] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06048002
[ 4062.912582] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 3, pasid 32781) at page 2048, read from 'TC0' (0x54433000) (72)
------------------------------------------------------------------------
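
The numbers in each "VM fault" line above can be recomputed from the two register dumps that precede it. The following is a minimal sketch, not the driver's code: the bit positions are an assumption inferred from the log lines in this report (they reproduce every vmid, read/write flag and client id shown), and the function name `decode_fault` is hypothetical.

```python
# Sketch only: field positions are inferred from the log lines in this
# thread, not copied from the amdgpu register headers.
def decode_fault(addr: int, status: int, mc_client: int) -> dict:
    """Decode one GPU fault report into the values printed on the 'VM fault' line."""
    return {
        # The FAULT_ADDR register value is printed in decimal as the page number.
        "page": addr,
        # Low byte: protection bits (0x02 in every fault above).
        "protections": status & 0xFF,
        # Bits 12..19: numeric memory client id, e.g. 8 or 82.
        "client_id": (status >> 12) & 0xFF,
        # Bit 24: set for a write access, clear for a read.
        "write": bool((status >> 24) & 0x1),
        # Bits 25..28: VM id.
        "vmid": (status >> 25) & 0xF,
        # The client name is four ASCII bytes, padded with NUL.
        "client": mc_client.to_bytes(4, "big").rstrip(b"\x00").decode("ascii"),
    }
```

For example, `decode_fault(0x00142C01, 0x0C052002, 0x43424331)` yields page 1321985, vmid 6, a read access, and client 'CBC1' with id 82, which agrees with the log line at timestamp 249.193494.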

There does not seem to be any immediate problem with graphics,
but when running commit babf0bb978e3, there seemed to be a noticeable
lag in some operations, e.g. when moving a window or repainting a large
part of the terminal window in konsole (no idea if it's related).

My GPU is a Radeon Pro WX 2100 (1002:6995). What other information should
I collect to help debug the issue?

Michal Kubecek



* (REGRESSION bisected) Re: amdgpu errors (VM fault / GPU fault detected) with 5.19 merge window snapshots
  2022-05-27  9:00 amdgpu errors (VM fault / GPU fault detected) with 5.19 merge window snapshots Michal Kubecek
@ 2022-05-27 12:44 ` Michal Kubecek
  2022-06-01 14:55   ` Alex Deucher
  2022-06-02 13:58   ` Alex Deucher
  0 siblings, 2 replies; 10+ messages in thread
From: Michal Kubecek @ 2022-05-27 12:44 UTC
  To: amd-gfx; +Cc: Christian König, Felix Kuehling, Alex Deucher, linux-kernel


On Fri, May 27, 2022 at 11:00:39AM +0200, Michal Kubecek wrote:
> Hello,
> 
> while testing 5.19 merge window snapshots (commits babf0bb978e3 and
> 7e284070abe5), I keep getting errors like below. I have not seen them
> with 5.18 final or older.
> 
> ------------------------------------------------------------------------
> [  247.150333] gmc_v8_0_process_interrupt: 46 callbacks suppressed
> [  247.150336] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x00020802 for process firefox pid 6101 thread firefox:cs0 pid 6116
> [  247.150339] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00107800
> [  247.150340] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0D008002
> [  247.150341] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 6, pasid 32780) at page 1079296, write from 'TC2' (0x54433200) (8)
[...]
> [  249.925909] amdgpu 0000:0c:00.0: amdgpu: IH ring buffer overflow (0x000844C0, 0x00004A00, 0x000044D0)
> [  250.434986] [drm] Fence fallback timer expired on ring sdma0
> [  466.621568] gmc_v8_0_process_interrupt: 122 callbacks suppressed
[...]
> ------------------------------------------------------------------------
> 
> There does not seem to be any apparent immediate problem with graphics
> but when running commit babf0bb978e3, there seemed to be a noticeable
> lag in some operations, e.g. when moving a window or repainting large
> part of the terminal window in konsole (no idea if it's related).
> 
> My GPU is Radeon Pro WX 2100 (1002:6995). What other information should
> I collect to help debugging the issue?

Bisected to commit 5255e146c99a ("drm/amdgpu: rework TLB flushing").
There seem to be later commits depending on it so I did not test
a revert on top of current mainline.

I should also mention that most commits tested as "bad" during the
bisect behaved much worse than current mainline (errors starting as
early as sddm, visibly damaged screen content, sometimes even
crashes). But all of them logged messages similar to those above in
the kernel log.

Michal Kubecek



* Re: (REGRESSION bisected) Re: amdgpu errors (VM fault / GPU fault detected) with 5.19 merge window snapshots
  2022-05-27 12:44 ` (REGRESSION bisected) " Michal Kubecek
@ 2022-06-01 14:55   ` Alex Deucher
  2022-06-01 14:59     ` Christian König
  2022-06-02 13:58   ` Alex Deucher
  1 sibling, 1 reply; 10+ messages in thread
From: Alex Deucher @ 2022-06-01 14:55 UTC
  To: Michal Kubecek, Yang, Philip
  Cc: amd-gfx list, Alex Deucher, Felix Kuehling, Christian König, LKML

On Fri, May 27, 2022 at 8:58 AM Michal Kubecek <mkubecek@suse.cz> wrote:
>
> On Fri, May 27, 2022 at 11:00:39AM +0200, Michal Kubecek wrote:
> > Hello,
> >
> > while testing 5.19 merge window snapshots (commits babf0bb978e3 and
> > 7e284070abe5), I keep getting errors like below. I have not seen them
> > with 5.18 final or older.
> >
> > ------------------------------------------------------------------------
> > [  247.150333] gmc_v8_0_process_interrupt: 46 callbacks suppressed
> > [  247.150336] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x00020802 for process firefox pid 6101 thread firefox:cs0 pid 6116
> > [  247.150339] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00107800
> > [  247.150340] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0D008002
> > [  247.150341] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 6, pasid 32780) at page 1079296, write from 'TC2' (0x54433200) (8)
> [...]
> > [  249.925909] amdgpu 0000:0c:00.0: amdgpu: IH ring buffer overflow (0x000844C0, 0x00004A00, 0x000044D0)
> > [  250.434986] [drm] Fence fallback timer expired on ring sdma0
> > [  466.621568] gmc_v8_0_process_interrupt: 122 callbacks suppressed
> [...]
> > ------------------------------------------------------------------------
> >
> > There does not seem to be any apparent immediate problem with graphics
> > but when running commit babf0bb978e3, there seemed to be a noticeable
> > lag in some operations, e.g. when moving a window or repainting large
> > part of the terminal window in konsole (no idea if it's related).
> >
> > My GPU is Radeon Pro WX 2100 (1002:6995). What other information should
> > I collect to help debugging the issue?
>
> Bisected to commit 5255e146c99a ("drm/amdgpu: rework TLB flushing").
> There seem to be later commits depending on it so I did not test
> a revert on top of current mainline.
>

@Christian Koenig, @Yang, Philip Any ideas?  I think there were some
fix ups for this.  Maybe those just haven't hit the tree yet?

Alex


> I should also mention that most commits tested as "bad" during the
> bisect did behave much worse than current mainline (errors starting as
> early as with sddm, visibly damaged screen content, sometimes even
> crashes). But all of them issued messages similar to those above into
> kernel log.
>
> Michal Kubecek


* Re: (REGRESSION bisected) Re: amdgpu errors (VM fault / GPU fault detected) with 5.19 merge window snapshots
  2022-06-01 14:55   ` Alex Deucher
@ 2022-06-01 14:59     ` Christian König
  0 siblings, 0 replies; 10+ messages in thread
From: Christian König @ 2022-06-01 14:59 UTC
  To: Alex Deucher, Michal Kubecek, Yang, Philip
  Cc: Alex Deucher, Felix Kuehling, Christian König, amd-gfx list, LKML

Am 01.06.22 um 16:55 schrieb Alex Deucher:
> On Fri, May 27, 2022 at 8:58 AM Michal Kubecek <mkubecek@suse.cz> wrote:
>> On Fri, May 27, 2022 at 11:00:39AM +0200, Michal Kubecek wrote:
>>> Hello,
>>>
>>> while testing 5.19 merge window snapshots (commits babf0bb978e3 and
>>> 7e284070abe5), I keep getting errors like below. I have not seen them
>>> with 5.18 final or older.
>>>
>>> ------------------------------------------------------------------------
>>> [  247.150333] gmc_v8_0_process_interrupt: 46 callbacks suppressed
>>> [  247.150336] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x00020802 for process firefox pid 6101 thread firefox:cs0 pid 6116
>>> [  247.150339] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00107800
>>> [  247.150340] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0D008002
>>> [  247.150341] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 6, pasid 32780) at page 1079296, write from 'TC2' (0x54433200) (8)
>> [...]
>>> [  249.925909] amdgpu 0000:0c:00.0: amdgpu: IH ring buffer overflow (0x000844C0, 0x00004A00, 0x000044D0)
>>> [  250.434986] [drm] Fence fallback timer expired on ring sdma0
>>> [  466.621568] gmc_v8_0_process_interrupt: 122 callbacks suppressed
>> [...]
>>> ------------------------------------------------------------------------
>>>
>>> There does not seem to be any apparent immediate problem with graphics
>>> but when running commit babf0bb978e3, there seemed to be a noticeable
>>> lag in some operations, e.g. when moving a window or repainting large
>>> part of the terminal window in konsole (no idea if it's related).
>>>
>>> My GPU is Radeon Pro WX 2100 (1002:6995). What other information should
>>> I collect to help debugging the issue?
>> Bisected to commit 5255e146c99a ("drm/amdgpu: rework TLB flushing").
>> There seem to be later commits depending on it so I did not test
>> a revert on top of current mainline.
>>
> @Christian Koenig, @Yang, Philip Any ideas?  I think there were some
> fix ups for this.  Maybe those just haven't hit the tree yet?

I need to double check, but as far as I know we have fixed all the fallout.

Could be that something didn't go upstream because it came too late for
the merge window.

Christian.

>
> Alex
>
>
>> I should also mention that most commits tested as "bad" during the
>> bisect did behave much worse than current mainline (errors starting as
>> early as with sddm, visibly damaged screen content, sometimes even
>> crashes). But all of them issued messages similar to those above into
>> kernel log.
>>
>> Michal Kubecek



* Re: (REGRESSION bisected) Re: amdgpu errors (VM fault / GPU fault detected) with 5.19 merge window snapshots
  2022-05-27 12:44 ` (REGRESSION bisected) " Michal Kubecek
  2022-06-01 14:55   ` Alex Deucher
@ 2022-06-02 13:58   ` Alex Deucher
  2022-06-02 14:22     ` Michal Kubecek
  1 sibling, 1 reply; 10+ messages in thread
From: Alex Deucher @ 2022-06-02 13:58 UTC
  To: Michal Kubecek, Yang, Philip
  Cc: amd-gfx list, Alex Deucher, Felix Kuehling, Christian König, LKML

On Fri, May 27, 2022 at 8:58 AM Michal Kubecek <mkubecek@suse.cz> wrote:
>
> On Fri, May 27, 2022 at 11:00:39AM +0200, Michal Kubecek wrote:
> > Hello,
> >
> > while testing 5.19 merge window snapshots (commits babf0bb978e3 and
> > 7e284070abe5), I keep getting errors like below. I have not seen them
> > with 5.18 final or older.
> >
> > ------------------------------------------------------------------------
> > [  247.150333] gmc_v8_0_process_interrupt: 46 callbacks suppressed
> > [  247.150336] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x00020802 for process firefox pid 6101 thread firefox:cs0 pid 6116
> > [  247.150339] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00107800
> > [  247.150340] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0D008002
> > [  247.150341] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 6, pasid 32780) at page 1079296, write from 'TC2' (0x54433200) (8)
> [...]
> > [  249.925909] amdgpu 0000:0c:00.0: amdgpu: IH ring buffer overflow (0x000844C0, 0x00004A00, 0x000044D0)
> > [  250.434986] [drm] Fence fallback timer expired on ring sdma0
> > [  466.621568] gmc_v8_0_process_interrupt: 122 callbacks suppressed
> [...]
> > ------------------------------------------------------------------------
> >
> > There does not seem to be any apparent immediate problem with graphics
> > but when running commit babf0bb978e3, there seemed to be a noticeable
> > lag in some operations, e.g. when moving a window or repainting large
> > part of the terminal window in konsole (no idea if it's related).
> >
> > My GPU is Radeon Pro WX 2100 (1002:6995). What other information should
> > I collect to help debugging the issue?
>
> Bisected to commit 5255e146c99a ("drm/amdgpu: rework TLB flushing").
> There seem to be later commits depending on it so I did not test
> a revert on top of current mainline.
>
> I should also mention that most commits tested as "bad" during the
> bisect did behave much worse than current mainline (errors starting as
> early as with sddm, visibly damaged screen content, sometimes even
> crashes). But all of them issued messages similar to those above into
> kernel log.

Can you verify that the kernel you tested has this patch:
https://cgit.freedesktop.org/drm/drm/commit/?id=5be323562c6a699d38430bc068a3fd192be8ed0d

Alex


* Re: (REGRESSION bisected) Re: amdgpu errors (VM fault / GPU fault detected) with 5.19 merge window snapshots
  2022-06-02 13:58   ` Alex Deucher
@ 2022-06-02 14:22     ` Michal Kubecek
  2022-06-03 15:49       ` Alex Deucher
  0 siblings, 1 reply; 10+ messages in thread
From: Michal Kubecek @ 2022-06-02 14:22 UTC
  To: Alex Deucher
  Cc: Yang, Philip, amd-gfx list, Alex Deucher, Felix Kuehling,
	Christian König, LKML


On Thu, Jun 02, 2022 at 09:58:22AM -0400, Alex Deucher wrote:
> On Fri, May 27, 2022 at 8:58 AM Michal Kubecek <mkubecek@suse.cz> wrote:
> > On Fri, May 27, 2022 at 11:00:39AM +0200, Michal Kubecek wrote:
> > > Hello,
> > >
> > > while testing 5.19 merge window snapshots (commits babf0bb978e3 and
> > > 7e284070abe5), I keep getting errors like below. I have not seen them
> > > with 5.18 final or older.
> > >
> > > ------------------------------------------------------------------------
> > > [  247.150333] gmc_v8_0_process_interrupt: 46 callbacks suppressed
> > > [  247.150336] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x00020802 for process firefox pid 6101 thread firefox:cs0 pid 6116
> > > [  247.150339] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00107800
> > > [  247.150340] amdgpu 0000:0c:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0D008002
> > > [  247.150341] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 6, pasid 32780) at page 1079296, write from 'TC2' (0x54433200) (8)
> > [...]
> > > [  249.925909] amdgpu 0000:0c:00.0: amdgpu: IH ring buffer overflow (0x000844C0, 0x00004A00, 0x000044D0)
> > > [  250.434986] [drm] Fence fallback timer expired on ring sdma0
> > > [  466.621568] gmc_v8_0_process_interrupt: 122 callbacks suppressed
> > [...]
> > > ------------------------------------------------------------------------
> > >
> > > There does not seem to be any apparent immediate problem with graphics
> > > but when running commit babf0bb978e3, there seemed to be a noticeable
> > > lag in some operations, e.g. when moving a window or repainting large
> > > part of the terminal window in konsole (no idea if it's related).
> > >
> > > My GPU is Radeon Pro WX 2100 (1002:6995). What other information should
> > > I collect to help debugging the issue?
> >
> > Bisected to commit 5255e146c99a ("drm/amdgpu: rework TLB flushing").
> > There seem to be later commits depending on it so I did not test
> > a revert on top of current mainline.
> >
> > I should also mention that most commits tested as "bad" during the
> > bisect did behave much worse than current mainline (errors starting as
> > early as with sddm, visibly damaged screen content, sometimes even
> > crashes). But all of them issued messages similar to those above into
> > kernel log.
> 
> Can you verify that the kernel you tested has this patch:
> https://cgit.freedesktop.org/drm/drm/commit/?id=5be323562c6a699d38430bc068a3fd192be8ed0d

Yes, both of them:

mike@lion:~/work/git/kernel-upstream> git merge-base --is-ancestor 5be323562c6a babf0bb978e3 && echo yes
yes

(7e284070abe5 is a later mainline snapshot so it also contains
5be323562c6a)

But it's likely that commit 5be323562c6a fixed most of the problem and
only some corner case was left, as most bisect steps had many more error
messages and some even crashed before I was able to log into KDE.
Compared to that, the mainline snapshots show far fewer errors, no
distorted picture and no crash; on the other hand, applications like
firefox or stellarium seem to trigger the errors quite consistently.

Michal

* Re: (REGRESSION bisected) Re: amdgpu errors (VM fault / GPU fault detected) with 5.19 merge window snapshots
  2022-06-02 14:22     ` Michal Kubecek
@ 2022-06-03 15:49       ` Alex Deucher
  2022-06-03 17:23         ` Michal Kubecek
  2022-06-05 22:00         ` Michal Kubecek
  0 siblings, 2 replies; 10+ messages in thread
From: Alex Deucher @ 2022-06-03 15:49 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: Yang, Philip, amd-gfx list, Alex Deucher, Felix Kuehling,
	Christian König, LKML

On Thu, Jun 2, 2022 at 10:22 AM Michal Kubecek <mkubecek@suse.cz> wrote:
>
> [...]
>
> But it's likely that commit 5be323562c6a fixed most of the problem and
> only some corner case was left, as most bisect steps had many more error
> messages and some even crashed before I was able to log into KDE.
> Compared to that, the mainline snapshots show far fewer errors, no
> distorted picture and no crash; on the other hand, applications like
> firefox or stellarium seem to trigger the errors quite consistently.

This patch should help:
https://patchwork.freedesktop.org/patch/488258/

Alex

>
> Michal

* Re: (REGRESSION bisected) Re: amdgpu errors (VM fault / GPU fault detected) with 5.19 merge window snapshots
  2022-06-03 15:49       ` Alex Deucher
@ 2022-06-03 17:23         ` Michal Kubecek
  2022-06-05 22:00         ` Michal Kubecek
  1 sibling, 0 replies; 10+ messages in thread
From: Michal Kubecek @ 2022-06-03 17:23 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Yang, Philip, amd-gfx list, Alex Deucher, Felix Kuehling,
	Christian König, LKML

On Fri, Jun 03, 2022 at 11:49:31AM -0400, Alex Deucher wrote:
> [...]
> 
> This patch should help:
> https://patchwork.freedesktop.org/patch/488258/

It seems to help: I'm running a kernel built with this patch on top of
mainline commit 50fd82b3a9a9 (current head) and I haven't seen any
errors yet. I'll give it some more time and report back.

Michal

* Re: (REGRESSION bisected) Re: amdgpu errors (VM fault / GPU fault detected) with 5.19 merge window snapshots
  2022-06-03 15:49       ` Alex Deucher
  2022-06-03 17:23         ` Michal Kubecek
@ 2022-06-05 22:00         ` Michal Kubecek
  2022-06-06 10:25           ` Christian König
  1 sibling, 1 reply; 10+ messages in thread
From: Michal Kubecek @ 2022-06-05 22:00 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Yang, Philip, amd-gfx list, Alex Deucher, Felix Kuehling,
	Christian König, LKML

On Fri, Jun 03, 2022 at 11:49:31AM -0400, Alex Deucher wrote:
> [...]
> 
> This patch should help:
> https://patchwork.freedesktop.org/patch/488258/

After ~48 hours with this patch, still no apparent issues.

Tested-by: Michal Kubecek <mkubecek@suse.cz>

Michal

* Re: (REGRESSION bisected) Re: amdgpu errors (VM fault / GPU fault detected) with 5.19 merge window snapshots
  2022-06-05 22:00         ` Michal Kubecek
@ 2022-06-06 10:25           ` Christian König
  0 siblings, 0 replies; 10+ messages in thread
From: Christian König @ 2022-06-06 10:25 UTC (permalink / raw)
  To: Michal Kubecek, Alex Deucher
  Cc: Yang, Philip, amd-gfx list, Alex Deucher, Felix Kuehling, LKML

Am 06.06.22 um 00:00 schrieb Michal Kubecek:
> [SNIP]
>> This patch should help:
>> https://patchwork.freedesktop.org/patch/488258/
> After ~48 hours with this patch, still no apparent issues.
>
> Tested-by: Michal Kubecek <mkubecek@suse.cz>

Thanks, this could be optimized for gfx8 a bit if anybody is interested
in a typing exercise.

E.g. on gfx8 we only need the TLB flush when "start" or "end" is not
aligned to 8 entries.

I don't have time to test this, but it should be trivial to implement.

Christian.
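
The gfx8 condition described above could be sketched roughly as follows.
This is only an illustration under stated assumptions, not code from the
patch or from the amdgpu driver: the helper name gfx8_needs_tlb_flush, the
constant GFX8_PTE_GROUP and the convention that "end" is exclusive (one
past the last updated entry) are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: gfx8 handles page table entries in groups of 8. */
#define GFX8_PTE_GROUP 8u

/*
 * Hypothetical helper: returns true when the updated range [start, end)
 * does not begin and end on an 8-entry boundary, i.e. when a TLB flush
 * would still be needed according to the rule quoted above.
 */
static bool gfx8_needs_tlb_flush(uint64_t start, uint64_t end)
{
	return (start % GFX8_PTE_GROUP) != 0 || (end % GFX8_PTE_GROUP) != 0;
}
```

With these assumptions, a range such as [8, 16) would skip the flush,
while [8, 13) or [3, 16) would still require one.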

>
> Michal


Thread overview: 10+ messages
2022-05-27  9:00 amdgpu errors (VM fault / GPU fault detected) with 5.19 merge window snapshots Michal Kubecek
2022-05-27 12:44 ` (REGRESSION bisected) " Michal Kubecek
2022-06-01 14:55   ` Alex Deucher
2022-06-01 14:59     ` Christian König
2022-06-02 13:58   ` Alex Deucher
2022-06-02 14:22     ` Michal Kubecek
2022-06-03 15:49       ` Alex Deucher
2022-06-03 17:23         ` Michal Kubecek
2022-06-05 22:00         ` Michal Kubecek
2022-06-06 10:25           ` Christian König
