* [Bug][5.19-rc0] Between commits fdaf9a5840ac and babf0bb978e3 GPU stopped entering in graphic mode.
@ 2022-06-28  9:21 Mikhail Gavrilov
  2022-07-07  0:20 ` Mikhail Gavrilov
  0 siblings, 1 reply; 10+ messages in thread
From: Mikhail Gavrilov @ 2022-06-28  9:21 UTC
  To: amd-gfx list, Linux List Kernel Mailing

Hi guys.
Somewhere between commits fdaf9a5840ac and babf0bb978e3 the GPU stopped
entering graphics mode; instead I see a black screen with a steadily
lit cursor. Demonstration: https://youtu.be/rGL4LsHMae4
The kernel log contains reports of hung tasks:
[  149.363465] rfkill: input handler disabled
[  249.072478] INFO: task (brt-dbus):1645 blocked for more than 122 seconds.
[  249.072515]       Tainted: G        W    L   --------  ---
5.19.0-0.rc0.20220526gitbabf0bb978e3.4.fc37.x86_64 #1
[  249.072520] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  249.072524] task:(brt-dbus)      state:D stack:14384 pid: 1645
ppid:     1 flags:0x00000002
[  249.072536] Call Trace:
[  249.072540]  <TASK>
[  249.072551]  __schedule+0x492/0x1640
[  249.072560]  ? lock_is_held_type+0xe8/0x140
[  249.072569]  ? find_held_lock+0x32/0x80
[  249.072584]  schedule+0x4e/0xb0
[  249.072591]  schedule_preempt_disabled+0x14/0x20
[  249.072597]  __mutex_lock+0x423/0x890
[  249.072608]  ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.072818]  ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.073010]  amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.073207]  amdgpu_flush+0x25/0x40 [amdgpu]
[  249.074088]  filp_close+0x31/0x70
[  249.074097]  __close_range+0x130/0x320
[  249.074108]  __x64_sys_close_range+0x13/0x20
[  249.074113]  do_syscall_64+0x5b/0x80
[  249.074120]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.074127]  ? do_syscall_64+0x67/0x80
[  249.074135]  ? do_syscall_64+0x67/0x80
[  249.074140]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.074147]  ? do_syscall_64+0x67/0x80
[  249.074154]  ? lock_is_held_type+0xe8/0x140
[  249.074164]  ? asm_exc_page_fault+0x27/0x30
[  249.074171]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.074178]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  249.074184] RIP: 0033:0x7fd71f54f97b
[  249.074208] RSP: 002b:00007fffc8e752a8 EFLAGS: 00000246 ORIG_RAX:
00000000000001b4
[  249.074215] RAX: ffffffffffffffda RBX: 00007fffc8e752b0 RCX: 00007fd71f54f97b
[  249.074220] RDX: 0000000000000000 RSI: 00000000ffffffff RDI: 0000000000000027
[  249.074224] RBP: 00007fffc8e75330 R08: 0000000000000000 R09: 00007fffc8e75380
[  249.074228] R10: 00007fffc8e751f0 R11: 0000000000000246 R12: 0000000000000002
[  249.074232] R13: 00007fffc8e75340 R14: 0000000000000000 R15: 0000000000000002
[  249.074252]  </TASK>
[  249.074261] INFO: task (ostnamed):1718 blocked for more than 122 seconds.
[  249.074266]       Tainted: G        W    L   --------  ---
5.19.0-0.rc0.20220526gitbabf0bb978e3.4.fc37.x86_64 #1
[  249.074285] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  249.074289] task:(ostnamed)      state:D stack:14552 pid: 1718
ppid:     1 flags:0x00000006
[  249.074299] Call Trace:
[  249.074302]  <TASK>
[  249.074310]  __schedule+0x492/0x1640
[  249.074316]  ? lock_is_held_type+0xe8/0x140
[  249.074324]  ? find_held_lock+0x32/0x80
[  249.074339]  schedule+0x4e/0xb0
[  249.074346]  schedule_preempt_disabled+0x14/0x20
[  249.074352]  __mutex_lock+0x423/0x890
[  249.074361]  ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.074564]  ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.074754]  amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.074950]  amdgpu_flush+0x25/0x40 [amdgpu]
[  249.075133]  filp_close+0x31/0x70
[  249.075140]  __close_range+0x130/0x320
[  249.075150]  __x64_sys_close_range+0x13/0x20
[  249.075154]  do_syscall_64+0x5b/0x80
[  249.075164]  ? lock_is_held_type+0xe8/0x140
[  249.075175]  ? do_syscall_64+0x67/0x80
[  249.075180]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.075187]  ? do_syscall_64+0x67/0x80
[  249.075194]  ? lock_is_held_type+0xe8/0x140
[  249.075204]  ? asm_exc_page_fault+0x27/0x30
[  249.075210]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.075217]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  249.075222] RIP: 0033:0x7fd71f54f97b
[  249.075231] RSP: 002b:00007fffc8e752a8 EFLAGS: 00000246 ORIG_RAX:
00000000000001b4
[  249.075237] RAX: ffffffffffffffda RBX: 00007fffc8e752b0 RCX: 00007fd71f54f97b
[  249.075241] RDX: 0000000000000000 RSI: 00000000000000b9 RDI: 0000000000000027
[  249.075245] RBP: 00007fffc8e75330 R08: 0000000000000000 R09: 00007fffc8e75380
[  249.075249] R10: 00007fffc8e751f0 R11: 0000000000000246 R12: 0000000000000004
[  249.075253] R13: 00007fffc8e75340 R14: 0000000000000000 R15: 0000000000000003
[  249.075289]  </TASK>
[  249.075294] INFO: task (pcscd):1749 blocked for more than 122 seconds.
[  249.075298]       Tainted: G        W    L   --------  ---
5.19.0-0.rc0.20220526gitbabf0bb978e3.4.fc37.x86_64 #1
[  249.075302] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  249.075306] task:(pcscd)         state:D stack:14256 pid: 1749
ppid:     1 flags:0x00000002
[  249.075314] Call Trace:
[  249.075318]  <TASK>
[  249.075325]  __schedule+0x492/0x1640
[  249.075331]  ? lock_is_held_type+0xe8/0x140
[  249.075339]  ? find_held_lock+0x32/0x80
[  249.075353]  schedule+0x4e/0xb0
[  249.075360]  schedule_preempt_disabled+0x14/0x20
[  249.075365]  __mutex_lock+0x423/0x890
[  249.075375]  ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.075574]  ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.075764]  amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.075960]  amdgpu_flush+0x25/0x40 [amdgpu]
[  249.076143]  filp_close+0x31/0x70
[  249.076150]  __close_range+0x130/0x320
[  249.076160]  __x64_sys_close_range+0x13/0x20
[  249.076164]  do_syscall_64+0x5b/0x80
[  249.076169]  ? do_syscall_64+0x67/0x80
[  249.076175]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.076182]  ? do_syscall_64+0x67/0x80
[  249.076188]  ? do_syscall_64+0x67/0x80
[  249.076194]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.076201]  ? do_syscall_64+0x67/0x80
[  249.076206]  ? do_syscall_64+0x67/0x80
[  249.076211]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.076218]  ? do_syscall_64+0x67/0x80
[  249.076223]  ? lock_is_held_type+0xe8/0x140
[  249.076233]  ? asm_exc_page_fault+0x27/0x30
[  249.076239]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.076246]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  249.076251] RIP: 0033:0x7fd71f54f97b
[  249.076259] RSP: 002b:00007fffc8e752a8 EFLAGS: 00000246 ORIG_RAX:
00000000000001b4
[  249.076265] RAX: ffffffffffffffda RBX: 00007fffc8e752b0 RCX: 00007fd71f54f97b
[  249.076287] RDX: 0000000000000000 RSI: 00000000ffffffff RDI: 000000000000004c
[  249.076291] RBP: 00007fffc8e75330 R08: 0000000000000000 R09: 00007fffc8e75380
[  249.076295] R10: 00007fffc8e751f0 R11: 0000000000000246 R12: 0000000000000003
[  249.076300] R13: 00007fffc8e75340 R14: 0000000000000000 R15: 0000000000000003
[  249.076319]  </TASK>
[  249.076323]
               Showing all locks held in the system:
[  249.076335] 1 lock held by khungtaskd/183:
[  249.076340]  #0: ffffffff84169060 (rcu_read_lock){....}-{1:2}, at:
debug_show_all_locks+0x15/0x16b
[  249.076364] 3 locks held by systemd-journal/868:
[  249.076376] 3 locks held by gnome-shell/1626:
[  249.076380]  #0: ffff9f2b248e4680
(&sig->cred_guard_mutex){+.+.}-{3:3}, at: bprm_execve+0x3c/0x880
[  249.076394]  #1: ffff9f2b248e4728
(&sig->exec_update_lock){++++}-{3:3}, at: begin_new_exec+0x384/0xcc0
[  249.076407]  #2: ffff9f2b3a95ec58 (&mgr->lock#3){+.+.}-{3:3}, at:
amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.076609] 1 lock held by (brt-dbus)/1645:
[  249.076613]  #0: ffff9f2b3a95ec58 (&mgr->lock#3){+.+.}-{3:3}, at:
amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.076814] 1 lock held by (ostnamed)/1718:
[  249.076818]  #0: ffff9f2b3a95ec58 (&mgr->lock#3){+.+.}-{3:3}, at:
amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.077018] 1 lock held by (pcscd)/1749:
[  249.077022]  #0: ffff9f2b3a95ec58 (&mgr->lock#3){+.+.}-{3:3}, at:
amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]

[  249.077226] =============================================

[  335.093113] kworker/dying (297) used greatest stack depth: 11608 bytes left
[  335.093254] kworker/dying (241) used greatest stack depth: 11360 bytes left
Full kernel log is here: https://pastebin.com/0YHs6wyB
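
The "Showing all locks held" dump above is easier to read when grouped
by lock address. A minimal sketch, assuming the log was saved unwrapped
(one dmesg record per line, unlike the mail-wrapped copy above); the
function name locks_by_task is hypothetical. Note that lockdep can list
a task against a lock it is still trying to take, so several tasks on
one address usually means one owner plus waiters, not several owners.

```shell
#!/bin/sh
# Group tasks by the lock address printed in a lockdep "locks held" dump.
# Reads unwrapped dmesg text on stdin and prints one line per lock:
#   <lock address> <lock name>: <task/pid> <task/pid> ...
locks_by_task() {
    awk '
        # Header record: "[ts] N lock(s) held by NAME/PID:"
        /locks? held by / {
            task = $0
            sub(/.* held by /, "", task)
            sub(/:$/, "", task)
            next
        }
        # Lock record: "[ts]  #K: ADDRESS (LOCKNAME){state}, at: FUNCTION"
        / #[0-9]+: / && task != "" {
            rest = $0
            sub(/.* #[0-9]+: /, "", rest)
            split(rest, f, " ")
            addr = f[1]
            name = f[2]
            sub(/[{].*/, "", name)      # drop the lockdep state suffix
            key = addr " " name
            if (!(key in tasks)) order[++n] = key
            tasks[key] = tasks[key] " " task
        }
        END {
            for (i = 1; i <= n; i++) print order[i] ":" tasks[order[i]]
        }
    '
}
```

Usage would be something like `dmesg | locks_by_task`; in the dump
above, the address ffff9f2b3a95ec58 (&mgr->lock#3) is the one shared by
gnome-shell and the three hung tasks.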

Naturally, I tried to find the problematic commit with git bisect. It
was the longest bisect of my life: I had to build the kernel 565
times and it took three weeks. That is why I am only writing now
rather than immediately. The most annoying part is that those three
weeks look wasted, because the exact commit was never found. My bisect
log can be found here: https://pastebin.com/AhLMNfyv

If you open it you will see a lot of skipped steps. At those steps the
kernel hangs during boot with the following messages on screen:
[drm] amdgpu kernel modesetting enabled.
amdgpu: Ignoring ACPI CRAT on non-APU system
amdgpu: Virtual CRAT table created for CPU
amdgpu: Topology: Add CPU node
Here is a photo of the boot screen:
https://i.postimg.cc/DwVbYP4b/IMG-20220525-130140.jpg

And the following trace is written to the log:
[    8.173558] [drm] amdgpu kernel modesetting enabled.
[    8.196766] amdgpu: Ignoring ACPI CRAT on non-APU system
[    8.196846] amdgpu: Virtual CRAT table created for CPU
[    8.197015] amdgpu: Topology: Add CPU node
[    8.201791] Console: switching to colour dummy device 80x25
[    8.215200] page:00000000b17305fd refcount:0 mapcount:0
mapping:0000000000000000 index:0x0 pfn:0x1029c00
[    8.215224] head:00000000b17305fd order:0 compound_mapcount:-6459
compound_pincount:0
[    8.215243] flags: 0x17ffffc0010000(head|node=0|zone=2|lastcpupid=0x1fffff)
[    8.215261] raw: 0017ffffc0010000 ffffe6c480a70008 ffffe6c480a70008
0000000000000000
[    8.215279] raw: 0000000000000000 0000000000000000 00000000ffffffff
0000000000000000
[    8.215296] page dumped because: VM_BUG_ON_PAGE(compound &&
compound_order(page) != order)
[    8.215324] ------------[ cut here ]------------
[    8.215340] kernel BUG at mm/page_alloc.c:1329!
[    8.215358] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[    8.215363] CPU: 20 PID: 584 Comm: systemd-udevd Tainted: G
W         5.18.0-rc1-004-c6ed9f66eb70aeaac9998bd3552ada740d90e20c+
#357
[    8.215370] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022
[    8.215375] RIP: 0010:free_pcp_prepare+0x455/0x650
[    8.215381] Code: ff ff 48 8b 43 48 a8 01 0f 84 48 ff ff ff 48 83
e8 01 48 39 c3 0f 84 3b ff ff ff 48 c7 c6 08 f0 85 aa 48 89 df e8 5b
cb fc ff <0f> 0b 4c 89 ef 48 89 14 24 41 83 c6 01 e8 b9 ed ff ff 48 8b
14 24
[    8.215390] RSP: 0018:ffffbb7dc23779d8 EFLAGS: 00010296
[    8.215394] RAX: 000000000000004e RBX: ffffe6c480a70000 RCX: 0000000000000000
[    8.215399] RDX: 0000000000000001 RSI: ffffffffaa89db77 RDI: 00000000ffffffff
[    8.215402] RBP: 0000000000000009 R08: 0000000000000000 R09: ffffbb7dc23777c0
[    8.215406] R10: 0000000000000003 R11: ffffa08bae1fefe8 R12: 0000000000000000
[    8.215410] R13: ffffa07c817eadc0 R14: 00000000fffffe00 R15: ffffe6c480a70000
[    8.215414] FS:  00007f35b2f1ab40(0000) GS:ffffa08b5d200000(0000)
knlGS:0000000000000000
[    8.215419] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    8.215422] CR2: 00005631caec1878 CR3: 000000017d09c000 CR4: 0000000000350ee0
[    8.215427] Call Trace:
[    8.215429]  <TASK>
[    8.215431]  ? find_held_lock+0x32/0x80
[    8.215436]  free_unref_page+0x25/0x280
[    8.215440]  __vunmap+0x261/0x3d0
[    8.215444]  drm_fbdev_cleanup+0x6b/0xc0
[    8.215449]  drm_fbdev_fb_destroy+0x15/0x30
[    8.215453]  unregister_framebuffer+0x2e/0x40
[    8.215458]  drm_client_dev_unregister+0x6e/0xe0
[    8.215464]  drm_dev_unregister+0x34/0x90
[    8.215467]  drm_dev_unplug+0x24/0x40
[    8.215471]  simpledrm_remove+0x11/0x20
[    8.215475]  platform_remove+0x1f/0x40
[    8.215479]  device_release_driver_internal+0x1b8/0x220
[    8.215484]  bus_remove_device+0xef/0x160
[    8.215488]  device_del+0x18c/0x3f0
[    8.215492]  platform_device_del.part.0+0x13/0x70
[    8.215496]  platform_device_unregister+0x1c/0x30
[    8.215500]  drm_aperture_detach_drivers+0xa3/0xd0
[    8.215505]  drm_aperture_remove_conflicting_pci_framebuffers+0x3f/0x70
[    8.215511]  amdgpu_pci_probe+0x126/0x3c0 [amdgpu]
[    8.215672]  local_pci_probe+0x41/0x80
[    8.215677]  pci_device_probe+0xaa/0x200
[    8.215681]  really_probe+0x1a0/0x370
[    8.215685]  __driver_probe_device+0xfb/0x170
[    8.215689]  driver_probe_device+0x1f/0x90
[    8.215693]  __driver_attach+0xbe/0x1a0
[    8.215697]  ? __device_attach_driver+0xe0/0xe0
[    8.215701]  bus_for_each_dev+0x65/0x90
[    8.215705]  bus_add_driver+0x150/0x1f0
[    8.215709]  driver_register+0x89/0xd0
[    8.215713]  ? 0xffffffffc044e000
[    8.215719]  do_one_initcall+0x69/0x350
[    8.215724]  ? do_init_module+0x22/0x260
[    8.215728]  ? rcu_read_lock_sched_held+0x3b/0x70
[    8.215732]  ? trace_kmalloc+0x3b/0x100
[    8.215737]  ? kmem_cache_alloc_trace+0x1eb/0x3a0
[    8.215742]  do_init_module+0x4a/0x260
[    8.215745]  __do_sys_finit_module+0x93/0xf0
[    8.215751]  do_syscall_64+0x3a/0x80
[    8.215756]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[    8.215761] RIP: 0033:0x7f35b3acb62d
[    8.215765] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e
fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24
08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c3 c7 0c 00 f7 d8 64 89
01 48
[    8.215773] RSP: 002b:00007ffc39f6ef68 EFLAGS: 00000246 ORIG_RAX:
0000000000000139
[    8.215778] RAX: ffffffffffffffda RBX: 00005631cae55830 RCX: 00007f35b3acb62d
[    8.215782] RDX: 0000000000000000 RSI: 00005631cae6ceb0 RDI: 0000000000000011
[    8.215786] RBP: 00005631cae6ceb0 R08: 0000000000000000 R09: 00007f35b3b98c80
[    8.215790] R10: 0000000000000011 R11: 0000000000000246 R12: 0000000000020000
[    8.215794] R13: 00005631cae74660 R14: 0000000000000000 R15: 00005631cae805d0
[    8.215800]  </TASK>
[    8.215801] Modules linked in: amdgpu(+) drm_ttm_helper ttm
crct10dif_pclmul crc32_pclmul iommu_v2 crc32c_intel gpu_sched ucsi_ccg
nvme drm_buddy typec_ucsi ghash_clmulni_intel igb ccp drm_dp_helper
typec sp5100_tco nvme_core dca wmi ip6_tables ip_tables ipmi_devintf
ipmi_msghandler fuse
[    8.215825] ---[ end trace 0000000000000000 ]---
[    8.215828] RIP: 0010:free_pcp_prepare+0x455/0x650
[    8.215832] Code: ff ff 48 8b 43 48 a8 01 0f 84 48 ff ff ff 48 83
e8 01 48 39 c3 0f 84 3b ff ff ff 48 c7 c6 08 f0 85 aa 48 89 df e8 5b
cb fc ff <0f> 0b 4c 89 ef 48 89 14 24 41 83 c6 01 e8 b9 ed ff ff 48 8b
14 24
[    8.215841] RSP: 0018:ffffbb7dc23779d8 EFLAGS: 00010296
[    8.215844] RAX: 000000000000004e RBX: ffffe6c480a70000 RCX: 0000000000000000
[    8.215848] RDX: 0000000000000001 RSI: ffffffffaa89db77 RDI: 00000000ffffffff
[    8.215852] RBP: 0000000000000009 R08: 0000000000000000 R09: ffffbb7dc23777c0
[    8.215856] R10: 0000000000000003 R11: ffffa08bae1fefe8 R12: 0000000000000000
[    8.215860] R13: ffffa07c817eadc0 R14: 00000000fffffe00 R15: ffffe6c480a70000
[    8.215864] FS:  00007f35b2f1ab40(0000) GS:ffffa08b5d200000(0000)
knlGS:0000000000000000
[    8.215875] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    8.215879] CR2: 00005631caec1878 CR3: 000000017d09c000 CR4: 0000000000350ee0
[    8.216344] systemd-udevd (584) used greatest stack depth: 12776 bytes left
Full kernel log is here: https://pastebin.com/rDAjKpSg
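
For reference, the skip steps described above map onto a documented
feature of git: a script passed to "git bisect run" can exit with
status 125 to mark a commit as untestable, which has the same effect as
"git bisect skip". The sketch below shows this on a throwaway
repository. The repository, the marker files (unbootable, regression)
and the check script are all hypothetical stand-ins; a real kernel
bisect would have to build, boot and test each commit instead.

```shell
#!/bin/sh
# Throwaway demonstration of "git bisect run" with untestable commits.
# Repo layout, file names and the check script are hypothetical.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email bisect@example.invalid
git config user.name "Bisect Demo"

# Ten commits: 4 and 5 stand in for kernels that hang at boot,
# and the regression is introduced in commit 7.
for i in 1 2 3 4 5 6 7 8 9 10; do
    echo "$i" > version
    if [ "$i" -eq 4 ] || [ "$i" -eq 5 ]; then
        echo yes > unbootable
    else
        rm -f unbootable
    fi
    if [ "$i" -ge 7 ]; then
        echo yes > regression
    fi
    git add -A
    git commit -q -m "commit $i"
done

# The check script lives outside the work tree so checkouts do not touch it.
check=$(mktemp)
cat > "$check" <<'EOF'
#!/bin/sh
if [ -e unbootable ]; then exit 125; fi   # untestable: treated as a skip
if [ -e regression ]; then exit 1; fi     # bad
exit 0                                    # good
EOF

git bisect start HEAD HEAD~9 > /dev/null   # bad = commit 10, good = commit 1
git bisect run sh "$check" > /dev/null
first_bad=$(git log -1 --format=%s refs/bisect/bad)
echo "first bad: $first_bad"
```

Here the untestable commits are not adjacent to the regression, so the
bisect still ends on a single commit. When skipped commits sit right
next to the good/bad boundary, git can only report a range of
candidates, not one commit.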

Please help me get rid of the bug that crashes systemd-udevd so I can
find the exact commit that caused the GPU hang.

Or, based on the trace of the hung process, help fix the problem.

Thank you all in advance.

UPD:
I am still observing the issue in rc1 through rc4 :(

My hardware specs:
GPU: 6900XT
CPU: 3950X
M/B: ROG Strix X570-I Gaming
RAM: 64GB
SSD: Intel Optane 905P


-- 
Best Regards,
Mike Gavrilov.


* Re: [Bug][5.19-rc0] Between commits fdaf9a5840ac and babf0bb978e3 GPU stopped entering in graphic mode.
  2022-06-28  9:21 [Bug][5.19-rc0] Between commits fdaf9a5840ac and babf0bb978e3 GPU stopped entering in graphic mode Mikhail Gavrilov
@ 2022-07-07  0:20 ` Mikhail Gavrilov
  2022-07-07  9:50   ` Christian König
  2022-07-07 10:10   ` Thomas Zimmermann
  0 siblings, 2 replies; 10+ messages in thread
From: Mikhail Gavrilov @ 2022-07-07  0:20 UTC
  To: amd-gfx list, Linux List Kernel Mailing, Christian König,
	tzimmermann

On Tue, Jun 28, 2022 at 2:21 PM Mikhail Gavrilov
<mikhail.v.gavrilov@gmail.com> wrote:
>

Christian, can you look at why
drm_aperture_remove_conflicting_pci_framebuffers causes this kernel bug
on my machine?

[    6.822385] amdgpu: Ignoring ACPI CRAT on non-APU system
[    6.822462] amdgpu: Virtual CRAT table created for CPU
[    6.822654] amdgpu: Topology: Add CPU node
[    6.827643] Console: switching to colour dummy device 80x25
[    6.845504] BUG: kernel NULL pointer dereference, address: 0000000000000038
[    6.845509] #PF: supervisor read access in kernel mode
[    6.845512] #PF: error_code(0x0000) - not-present page
[    6.845515] PGD 0 P4D 0
[    6.845518] Oops: 0000 [#1] PREEMPT SMP NOPTI
[    6.845522] CPU: 27 PID: 612 Comm: systemd-udevd Tainted: G
W        --------  ---
5.19.0-0.rc5.20220705gitc1084b6c5620.40.fc37.x86_64 #1
[    6.845528] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022
[    6.845533] RIP: 0010:kernfs_find_and_get_ns+0x11/0x70
[    6.845539] Code: 78 e8 c3 fa 31 00 48 85 c0 75 e1 eb 93 66 66 2e
0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41 55 49 89 d5 41 54 49 89
f4 55 53 <48> 8b 47 38 48 89 fb 48 85 c0 48 0f 44 c7 48 8b a8 80 00 00
00 48
[    6.845546] RSP: 0018:ffffa98c022f3aa0 EFLAGS: 00010246
[    6.845550] RAX: 0000000000000000 RBX: ffffffffaf52c3c0 RCX: ffff9e150147b640
[    6.845553] RDX: 0000000000000000 RSI: ffffffffaf52c508 RDI: 0000000000000000
[    6.845557] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000249249d4
[    6.845560] R10: 0000000000000001 R11: 0000000000000000 R12: ffffffffaf52c508
[    6.845563] R13: 0000000000000000 R14: ffff9e157aa93900 R15: 0000000000000000
[    6.845567] FS:  00007fabaafbf680(0000) GS:ffff9e23e6a00000(0000)
knlGS:0000000000000000
[    6.845571] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    6.845574] CR2: 0000000000000038 CR3: 000000017cb56000 CR4: 0000000000350ee0
[    6.845578] Call Trace:
[    6.845579]  <TASK>
[    6.845582]  sysfs_unmerge_group+0x18/0x60
[    6.845585]  dpm_sysfs_remove+0x20/0x60
[    6.845590]  device_del+0xa4/0x3f0
[    6.845594]  platform_device_del.part.0+0x13/0x70
[    6.845599]  platform_device_unregister+0x1c/0x30
[    6.845602]  sysfb_disable+0x2d/0x60
[    6.845605]  remove_conflicting_framebuffers+0x1b/0xc0
[    6.845610]  remove_conflicting_pci_framebuffers+0xce/0x120
[    6.845614]  drm_aperture_remove_conflicting_pci_framebuffers+0x57/0x80
[    6.845620]  amdgpu_pci_probe+0xcb/0x360 [amdgpu]
[    6.845760]  local_pci_probe+0x41/0x80
[    6.845764]  pci_device_probe+0xaa/0x210
[    6.845768]  really_probe+0x1bf/0x390
[    6.845771]  __driver_probe_device+0xfc/0x170
[    6.845775]  driver_probe_device+0x1f/0x90
[    6.845778]  __driver_attach+0xbf/0x1b0
[    6.845782]  ? __device_attach_driver+0xe0/0xe0
[    6.845785]  bus_for_each_dev+0x65/0x90
[    6.845789]  bus_add_driver+0x15c/0x200
[    6.845792]  driver_register+0x89/0xe0
[    6.845796]  ? 0xffffffffc0c8d000
[    6.845801]  do_one_initcall+0x69/0x350
[    6.845806]  ? rcu_read_lock_sched_held+0x3c/0x70
[    6.845810]  ? trace_kmalloc+0x3c/0x100
[    6.845814]  ? kmem_cache_alloc_trace+0x1e8/0x350
[    6.845818]  do_init_module+0x4a/0x200
[    6.845822]  __do_sys_init_module+0x13a/0x190
[    6.845827]  do_syscall_64+0x5b/0x80
[    6.845832]  ? asm_exc_page_fault+0x27/0x30
[    6.845835]  ? lockdep_hardirqs_on+0x7d/0x100
[    6.845839]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[    6.845842] RIP: 0033:0x7fababb7463e
[    6.845845] Code: 48 8b 0d e5 57 0c 00 f7 d8 64 89 01 48 83 c8 ff
c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00
00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b2 57 0c 00 f7 d8 64 89
01 48
[    6.845852] RSP: 002b:00007ffc6a6c9658 EFLAGS: 00000246 ORIG_RAX:
00000000000000af
[    6.845857] RAX: ffffffffffffffda RBX: 00005620deef53f0 RCX: 00007fababb7463e
[    6.845860] RDX: 00005620deeb2df0 RSI: 00000000010bfac6 RDI: 00007faba943e010
[    6.845864] RBP: 00005620deeb2df0 R08: 00005620deef4880 R09: 0000000000000000
[    6.845867] R10: 0000000000000005 R11: 0000000000000246 R12: 0000000000020000
[    6.845870] R13: 00005620deeb5330 R14: 0000000000000000 R15: 00005620deef0410
[    6.845875]  </TASK>
[    6.845877] Modules linked in: amdgpu(+) drm_ttm_helper ttm
iommu_v2 crct10dif_pclmul gpu_sched crc32_pclmul crc32c_intel
drm_buddy drm_display_helper ucsi_ccg nvme igb typec_ucsi
ghash_clmulni_intel ccp cec typec sp5100_tco nvme_core dca wmi
ip6_tables ip_tables ipmi_devintf ipmi_msghandler fuse
[    6.845898] CR2: 0000000000000038
[    6.845900] ---[ end trace 0000000000000000 ]---


$ /usr/src/kernels/5.19.0-0.rc5.20220705gitc1084b6c5620.40.fc37.x86_64/scripts/faddr2line
/lib/debug/lib/modules/5.19.0-0.rc5.20220705gitc1084b6c5620.40.fc37.x86_64/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko.debug
amdgpu_pci_probe+0xcb
amdgpu_pci_probe+0xcb/0x360:
amdgpu_pci_probe at
/usr/src/debug/kernel-5.19-rc5-49-gc1084b6c5620/linux-5.19.0-0.rc5.20220705gitc1084b6c5620.40.fc37.x86_64/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:2061


$ cat -s -n /usr/src/debug/kernel-5.19-rc5-49-gc1084b6c5620/linux-5.19.0-0.rc5.20220705gitc1084b6c5620.40.fc37.x86_64/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
| head -2071 | tail -20
  2052 "Use radeon.cik_support=0 amdgpu.cik_support=1 to override.\n"
  2053 );
  2054 return -ENODEV;
  2055 }
  2056 }
  2057 #endif
  2058
  2059 /* Get rid of things like offb */
  2060 ret = drm_aperture_remove_conflicting_pci_framebuffers(pdev,
&amdgpu_kms_driver);
  2061 if (ret)
  2062 return ret;
  2063
  2064 adev = devm_drm_dev_alloc(&pdev->dev, &amdgpu_kms_driver,
typeof(*adev), ddev);
  2065 if (IS_ERR(adev))
  2066 return PTR_ERR(adev);
  2067
  2068 adev->dev  = &pdev->dev;
  2069 adev->pdev = pdev;
  2070 ddev = adev_to_drm(adev);

$ git blame -L 2052,2070 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
Blaming lines: 100% (19/19), done.
984d7a929ad68 (Hans de Goede     2019-10-10 18:28:17 +0200 2052)
                 dev_info(&pdev->dev,
984d7a929ad68 (Hans de Goede     2019-10-10 18:28:17 +0200 2053)
                          "Use radeon.cik_support=0
amdgpu.cik_support=1 to override.\n"
984d7a929ad68 (Hans de Goede     2019-10-10 18:28:17 +0200 2054)
                         );
984d7a929ad68 (Hans de Goede     2019-10-10 18:28:17 +0200 2055)
                 return -ENODEV;
984d7a929ad68 (Hans de Goede     2019-10-10 18:28:17 +0200 2056)
         }
984d7a929ad68 (Hans de Goede     2019-10-10 18:28:17 +0200 2057)        }
984d7a929ad68 (Hans de Goede     2019-10-10 18:28:17 +0200 2058) #endif
984d7a929ad68 (Hans de Goede     2019-10-10 18:28:17 +0200 2059)
d38ceaf99ed01 (Alex Deucher      2015-04-20 16:55:21 -0400 2060)
 /* Get rid of things like offb */
97c9bfe3f6605 (Thomas Zimmermann 2021-06-29 15:58:33 +0200 2061)
 ret = drm_aperture_remove_conflicting_pci_framebuffers(pdev,
&amdgpu_kms_driver);
d38ceaf99ed01 (Alex Deucher      2015-04-20 16:55:21 -0400 2062)        if (ret)
d38ceaf99ed01 (Alex Deucher      2015-04-20 16:55:21 -0400 2063)
         return ret;
d38ceaf99ed01 (Alex Deucher      2015-04-20 16:55:21 -0400 2064)
5088d6572e8ff (Luben Tuikov      2020-11-04 11:04:25 +0100 2065)
 adev = devm_drm_dev_alloc(&pdev->dev, &amdgpu_kms_driver,
typeof(*adev), ddev);
df2ce4596c044 (Luben Tuikov      2020-09-18 15:25:04 +0200 2066)
 if (IS_ERR(adev))
df2ce4596c044 (Luben Tuikov      2020-09-18 15:25:04 +0200 2067)
         return PTR_ERR(adev);
8aba21b75136c (Luben Tuikov      2020-08-14 20:41:55 -0400 2068)
8aba21b75136c (Luben Tuikov      2020-08-14 20:41:55 -0400 2069)
 adev->dev  = &pdev->dev;
8aba21b75136c (Luben Tuikov      2020-08-14 20:41:55 -0400 2070)
 adev->pdev = pdev;

Thomas, you recently changed this line. Can you tell why we are
hitting a kernel Oops here?

Full kernel log (5.19-rc5): https://pastebin.com/5Ag804bd

-- 
Best Regards,
Mike Gavrilov.


* Re: [Bug][5.19-rc0] Between commits fdaf9a5840ac and babf0bb978e3 GPU stopped entering in graphic mode.
  2022-07-07  0:20 ` Mikhail Gavrilov
@ 2022-07-07  9:50   ` Christian König
  2022-07-09 12:10       ` Mikhail Gavrilov
  2022-07-07 10:10   ` Thomas Zimmermann
  1 sibling, 1 reply; 10+ messages in thread
From: Christian König @ 2022-07-07  9:50 UTC
  To: Mikhail Gavrilov, amd-gfx list, Linux List Kernel Mailing, tzimmermann

On 07.07.22 at 02:20, Mikhail Gavrilov wrote:
> On Tue, Jun 28, 2022 at 2:21 PM Mikhail Gavrilov
> <mikhail.v.gavrilov@gmail.com> wrote:
> Christian can you look why
> drm_aperture_remove_conflicting_pci_framebuffers cause this kernel bug
> on my machine?

That looks like a problem outside of the amdgpu driver.

What happens is that during load amdgpu asks whatever driver
(vesafb, vgafb or efifb) is currently handling the framebuffer to
unload. That unload in turn now crashes for some reason.

My best suggestion is to try to bisect this.
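
To see which framebuffer driver is registered before amdgpu takes
over, /proc/fb can be read; it lists one registered framebuffer per
line as an index followed by the driver's id string. A minimal sketch
(the helper name fb_drivers and its optional file argument are
hypothetical, added only so the parsing can be tried on a saved copy):

```shell
#!/bin/sh
# Print the registered framebuffer devices and their driver id strings.
# Reads /proc/fb by default, or a saved copy passed as the first argument.
fb_drivers() {
    while read -r idx name; do
        # Each line looks like "<index> <driver id>", e.g. "0 EFI VGA".
        if [ -n "$idx" ]; then
            printf 'fb%s: %s\n' "$idx" "$name"
        fi
    done < "${1:-/proc/fb}"
}
```

Run from an early boot shell or with amdgpu blacklisted, this shows
the driver that amdgpu would later ask to unload.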

Regards,
Christian.




* Re: [Bug][5.19-rc0] Between commits fdaf9a5840ac and babf0bb978e3 GPU stopped entering in graphic mode.
  2022-07-07  0:20 ` Mikhail Gavrilov
  2022-07-07  9:50   ` Christian König
@ 2022-07-07 10:10   ` Thomas Zimmermann
  1 sibling, 0 replies; 10+ messages in thread
From: Thomas Zimmermann @ 2022-07-07 10:10 UTC (permalink / raw)
  To: Mikhail Gavrilov, amd-gfx list, Linux List Kernel Mailing,
	Christian König



Hi

On 07.07.22 at 02:20, Mikhail Gavrilov wrote:
> On Tue, Jun 28, 2022 at 2:21 PM Mikhail Gavrilov
> <mikhail.v.gavrilov@gmail.com> wrote:
>>
> 
> Christian, can you look at why
> drm_aperture_remove_conflicting_pci_framebuffers causes this kernel bug
> on my machine?

Thanks for reporting. This bug has been fixed in

 
https://cgit.freedesktop.org/drm/drm/commit/?h=drm-fixes&id=ee7a69aa38d87a3bbced7b8245c732c05ed0c6ec

The patch should reach mainline next week or so.

Best regards
Thomas


-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev



* Re: [Bug][5.19-rc0] Between commits fdaf9a5840ac and babf0bb978e3 GPU stopped entering in graphic mode.
  2022-07-07  9:50   ` Christian König
@ 2022-07-09 12:10       ` Mikhail Gavrilov
  0 siblings, 0 replies; 10+ messages in thread
From: Mikhail Gavrilov @ 2022-07-09 12:10 UTC (permalink / raw)
  To: Christian König; +Cc: amd-gfx list, Linux List Kernel Mailing, tzimmermann

On Thu, Jul 7, 2022 at 2:50 PM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> On 07.07.22 at 02:20, Mikhail Gavrilov wrote:
> > On Tue, Jun 28, 2022 at 2:21 PM Mikhail Gavrilov
> > <mikhail.v.gavrilov@gmail.com> wrote:
> > Christian, can you look at why
> > drm_aperture_remove_conflicting_pci_framebuffers causes this kernel bug
> > on my machine?
>
> That looks like a problem outside of the amdgpu driver.
>
> What happens is that, during load, amdgpu asks whatever driver
> (vesafb, vgafb or efifb) currently handles the framebuffer to unload.
> This unload, in turn, now crashes for some reason.
>
> My best suggestion is to try to bisect this.

Hi Christian,
as described in my initial post, I already tried to bisect the issue.
It is very problematic because I see different symptoms at each step.
If I mark the steps that show different symptoms as skipped, the
bisection ends with a long list of possible commits.
Here is the bisect log from my initial post: https://pastebin.com/AhLMNfyv
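
A minimal sketch of how the mixed symptoms could be mapped to bisect
verdicts, assuming the kernel log of each test boot is saved to a file
after the hung-task timeout has had time to fire. The function name
classify_boot_log and the sample file names are hypothetical; the grep
patterns are copied from the logs in this thread. A boot that shows the
hung tasks counts as "bad", a boot that dies with an unrelated BUG or
oops counts as "skip", and anything else counts as "good".

```shell
#!/bin/sh
# Hypothetical helper: print the git-bisect verdict for one saved kernel log.
classify_boot_log() {
    log=$1
    # The regression being hunted: tasks hung in amdgpu_ctx_mgr_entity_flush.
    if grep -q 'blocked for more than' "$log" &&
       grep -q 'amdgpu_ctx_mgr_entity_flush' "$log"; then
        echo bad
    # Any other crash hides the result for this commit, so it is skipped.
    elif grep -Eq 'kernel BUG at|BUG: kernel NULL pointer dereference|Oops:' "$log"; then
        echo skip
    else
        echo good
    fi
}

# Demo with three tiny sample logs built from lines seen in this thread.
tmp=$(mktemp -d)
printf '%s\n' \
    'INFO: task (brt-dbus):1634 blocked for more than 122 seconds.' \
    ' amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]' > "$tmp/hung.log"
printf '%s\n' 'kernel BUG at mm/page_alloc.c:1329!' > "$tmp/oops.log"
printf '%s\n' 'rfkill: input handler disabled' > "$tmp/clean.log"

for name in hung oops clean; do
    printf '%s: %s\n' "$name" "$(classify_boot_log "$tmp/$name.log")"
done
rm -rf "$tmp"
```

The printed word can be passed straight to git, for example
git bisect "$(classify_boot_log dmesg.txt)", where dmesg.txt is a
placeholder for the saved log. The same mapping works with git bisect run,
where exit code 125 means "skip".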

If you want me to finish the bisection, please help me fix
this oops:
[    8.291177] page:00000000af2b6334 refcount:0 mapcount:0
mapping:0000000000000000 index:0x0 pfn:0x102a000
[    8.291202] head:00000000af2b6334 order:0 compound_mapcount:-1226
compound_pincount:0
[    8.291221] flags: 0x17ffffc0010000(head|node=0|zone=2|lastcpupid=0x1fffff)
[    8.291239] raw: 0017ffffc0010000 fffffb35c0a80008 fffffb35c0a80008
0000000000000000
[    8.291257] raw: 0000000000000000 0000000000000000 00000000ffffffff
0000000000000000
[    8.291275] page dumped because: VM_BUG_ON_PAGE(compound &&
compound_order(page) != order)
[    8.291298] ------------[ cut here ]------------
[    8.291309] kernel BUG at mm/page_alloc.c:1329!
[    8.291324] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[    8.291328] CPU: 8 PID: 599 Comm: systemd-udevd Not tainted
5.18.0-rc2-003-790b45f1bc6736a8dd48ba5731b6871e0217311e+ #361
[    8.291333] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022
[    8.291338] RIP: 0010:free_pcp_prepare+0x58d/0x5a0
[    8.291343] Code: c6 18 a2 85 a7 e8 d3 b7 fc ff 0f 0b 31 f6 48 89
df e8 97 cf 06 00 e9 29 ff ff ff 48 c7 c6 00 f1 85 a7 48 89 df e8 b3
b7 fc ff <0f> 0b 48 c7 c6 58 92 85 a7 e8 a5 b7 fc ff 0f 0b 0f 1f 00 0f
1f 44
[    8.291351] RSP: 0018:ffffb07c023ab9d8 EFLAGS: 00010296
[    8.291354] RAX: 000000000000004e RBX: fffffb35c0a80000 RCX: 0000000000000000
[    8.291358] RDX: 0000000000000001 RSI: ffffffffa789dbaf RDI: 00000000ffffffff
[    8.291361] RBP: 0000000000000009 R08: 0000000000000000 R09: ffffb07c023ab7c0
[    8.291365] R10: 0000000000000003 R11: ffff92ee2e2fffe8 R12: 0000000000000000
[    8.291368] R13: ffff92ee2a55d180 R14: 00000000fffffe00 R15: fffffb35c0a80000
[    8.291371] FS:  00007f80aa398680(0000) GS:ffff92edda200000(0000)
knlGS:0000000000000000
[    8.291376] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    8.291379] CR2: 00007f80aa38e616 CR3: 000000017d726000 CR4: 0000000000350ee0
[    8.291382] Call Trace:
[    8.291384]  <TASK>
[    8.291386]  ? find_held_lock+0x32/0x80
[    8.291391]  free_unref_page+0x25/0x2a0
[    8.291395]  __vunmap+0x261/0x3d0
[    8.291399]  drm_fbdev_cleanup+0x6b/0xc0
[    8.291403]  drm_fbdev_fb_destroy+0x15/0x30
[    8.291407]  unregister_framebuffer+0x2e/0x40
[    8.291411]  drm_client_dev_unregister+0x6e/0xe0
[    8.291416]  drm_dev_unregister+0x34/0x90
[    8.291419]  drm_dev_unplug+0x24/0x40
[    8.291422]  simpledrm_remove+0x11/0x20
[    8.291426]  platform_remove+0x1f/0x40
[    8.291429]  device_release_driver_internal+0x1b8/0x220
[    8.291433]  bus_remove_device+0xef/0x160
[    8.291437]  device_del+0x18c/0x3f0
[    8.291440]  platform_device_del.part.0+0x13/0x70
[    8.291444]  platform_device_unregister+0x1c/0x30
[    8.291447]  drm_aperture_detach_drivers+0xa3/0xd0
[    8.291452]  drm_aperture_remove_conflicting_pci_framebuffers+0x3f/0x70
[    8.291457]  amdgpu_pci_probe+0x126/0x3c0 [amdgpu]
[    8.291599]  local_pci_probe+0x41/0x80
[    8.291604]  pci_device_probe+0xaa/0x200
[    8.291607]  really_probe+0x1a0/0x370
[    8.291611]  __driver_probe_device+0xfb/0x170
[    8.291615]  driver_probe_device+0x1f/0x90
[    8.291618]  __driver_attach+0xbe/0x1a0
[    8.291622]  ? __device_attach_driver+0xe0/0xe0
[    8.291625]  bus_for_each_dev+0x65/0x90
[    8.291629]  bus_add_driver+0x150/0x1f0
[    8.291632]  driver_register+0x89/0xd0
[    8.291636]  ? 0xffffffffc067b000
[    8.291641]  do_one_initcall+0x69/0x350
[    8.291645]  ? do_init_module+0x22/0x260
[    8.291650]  ? rcu_read_lock_sched_held+0x3b/0x70
[    8.291654]  ? trace_kmalloc+0x3b/0x100
[    8.291658]  ? kmem_cache_alloc_trace+0x1eb/0x3a0
[    8.291662]  do_init_module+0x4a/0x260
[    8.291666]  __do_sys_finit_module+0x93/0xf0
[    8.291673]  do_syscall_64+0x3a/0x80
[    8.291677]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[    8.291681] RIP: 0033:0x7f80aaf4507d
[    8.291685] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e
fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24
08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 dd 0c 00 f7 d8 64 89
01 48
[    8.291694] RSP: 002b:00007ffe43973ce8 EFLAGS: 00000246 ORIG_RAX:
0000000000000139
[    8.291699] RAX: ffffffffffffffda RBX: 000055ce603a3fe0 RCX: 00007f80aaf4507d
[    8.291702] RDX: 0000000000000000 RSI: 000055ce60395ac0 RDI: 0000000000000011
[    8.291706] RBP: 000055ce60395ac0 R08: 0000000000000000 R09: 00007f80ab013c80
[    8.291709] R10: 0000000000000011 R11: 0000000000000246 R12: 0000000000020000
[    8.291713] R13: 000055ce60387c30 R14: 0000000000000000 R15: 000055ce6038ede0
[    8.291718]  </TASK>
[    8.291719] Modules linked in: amdgpu(+) drm_ttm_helper ttm
crct10dif_pclmul crc32_pclmul iommu_v2 crc32c_intel gpu_sched ucsi_ccg
typec_ucsi nvme drm_buddy igb ccp ghash_clmulni_intel typec
drm_dp_helper sp5100_tco nvme_core dca wmi ip6_tables ip_tables
ipmi_devintf ipmi_msghandler fuse
[    8.291740] ---[ end trace 0000000000000000 ]---


> Thanks for reporting. This bug has been fixed in
>
>
> https://cgit.freedesktop.org/drm/drm/commit/?h=drm-fixes&id=ee7a69aa38d87a3bbced7b8245c732c05ed0c6ec
>
> The patch should reach mainline next week or so.

Hi Thomas,
thanks for the patch, it fixes the oops.
But it does not fix the initial issue, where a lot of processes are
blocked on the mutex taken in amdgpu_ctx_mgr_entity_flush:
[  249.491425] INFO: task (brt-dbus):1634 blocked for more than 122 seconds.
[  249.491520]       Tainted: G        W    L   --------  ---
5.19.0-0.rc5.20220707git9f09069cde34.43.fc37.x86_64 #1
[  249.491526] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  249.491529] task:(brt-dbus)      state:D stack:14504 pid: 1634
ppid:     1 flags:0x00000002
[  249.491541] Call Trace:
[  249.491545]  <TASK>
[  249.491556]  __schedule+0x492/0x1620
[  249.491565]  ? lock_is_held_type+0xe8/0x140
[  249.491575]  ? find_held_lock+0x32/0x80
[  249.491590]  schedule+0x4e/0xb0
[  249.491597]  schedule_preempt_disabled+0x14/0x20
[  249.491603]  __mutex_lock+0x423/0x890
[  249.491609]  ? __lock_acquire+0x387/0x1ee0
[  249.491618]  ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.491849]  ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.492040]  amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.492237]  amdgpu_flush+0x25/0x40 [amdgpu]
[  249.492420]  filp_close+0x31/0x70
[  249.492429]  __close_range+0x1f3/0x490
[  249.492441]  __x64_sys_close_range+0x13/0x20
[  249.492446]  do_syscall_64+0x5b/0x80
[  249.492452]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.492459]  ? do_syscall_64+0x67/0x80
[  249.492467]  ? asm_exc_page_fault+0x27/0x30
[  249.492473]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.492480]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  249.492486] RIP: 0033:0x7f789d4c23cb
[  249.492512] RSP: 002b:00007ffe10773198 EFLAGS: 00000246 ORIG_RAX:
00000000000001b4
[  249.492518] RAX: ffffffffffffffda RBX: 00007ffe107731a0 RCX: 00007f789d4c23cb
[  249.492523] RDX: 0000000000000000 RSI: 00000000ffffffff RDI: 0000000000000027
[  249.492527] RBP: 00007ffe10773220 R08: 0000000000000000 R09: 00007ffe10773270
[  249.492531] R10: 00007ffe107730e0 R11: 0000000000000246 R12: 0000000000000002
[  249.492534] R13: 00007ffe10773230 R14: 0000000000000000 R15: 0000000000000002
[  249.492555]  </TASK>
[  249.492559] INFO: task (time-dir):1640 blocked for more than 122 seconds.
[  249.492564]       Tainted: G        W    L   --------  ---
5.19.0-0.rc5.20220707git9f09069cde34.43.fc37.x86_64 #1
[  249.492568] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  249.492571] task:(time-dir)      state:D stack:14504 pid: 1640
ppid:     1 flags:0x00000002
[  249.492580] Call Trace:
[  249.492584]  <TASK>
[  249.492592]  __schedule+0x492/0x1620
[  249.492597]  ? lock_is_held_type+0xe8/0x140
[  249.492605]  ? find_held_lock+0x32/0x80
[  249.492620]  schedule+0x4e/0xb0
[  249.492627]  schedule_preempt_disabled+0x14/0x20
[  249.492632]  __mutex_lock+0x423/0x890
[  249.492638]  ? __lock_acquire+0x387/0x1ee0
[  249.492646]  ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.492859]  ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.493049]  amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.493245]  amdgpu_flush+0x25/0x40 [amdgpu]
[  249.493428]  filp_close+0x31/0x70
[  249.493436]  __close_range+0x1f3/0x490
[  249.493447]  __x64_sys_close_range+0x13/0x20
[  249.493452]  do_syscall_64+0x5b/0x80
[  249.493457]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.493465]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  249.493470] RIP: 0033:0x7f789d4c23cb
[  249.493478] RSP: 002b:00007ffe10773198 EFLAGS: 00000246 ORIG_RAX:
00000000000001b4
[  249.493484] RAX: ffffffffffffffda RBX: 00007ffe107731a0 RCX: 00007f789d4c23cb
[  249.493488] RDX: 0000000000000000 RSI: 00000000ffffffff RDI: 0000000000000027
[  249.493492] RBP: 00007ffe10773220 R08: 0000000000000000 R09: 00007ffe10773270
[  249.493496] R10: 00007ffe107730e0 R11: 0000000000000246 R12: 0000000000000002
[  249.493499] R13: 00007ffe10773230 R14: 0000000000000000 R15: 0000000000000002
[  249.493519]  </TASK>
[  249.493528]
               Showing all locks held in the system:
[  249.493537] 1 lock held by khungtaskd/182:
[  249.493542]  #0: ffffffffba168e20 (rcu_read_lock){....}-{1:2}, at:
debug_show_all_locks+0x15/0x16b
[  249.493565] 1 lock held by systemd-journal/879:
[  249.493575] 3 locks held by gnome-shell/1633:
[  249.493579]  #0: ffff9bd4be4f8c00
(&sig->cred_guard_mutex){+.+.}-{3:3}, at: bprm_execve+0x3c/0x880
[  249.493593]  #1: ffff9bd4be4f8ca8
(&sig->exec_update_lock){++++}-{3:3}, at: begin_new_exec+0x384/0xca0
[  249.493607]  #2: ffff9bd441f20c58 (&mgr->lock#3){+.+.}-{3:3}, at:
amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.493807] 1 lock held by (brt-dbus)/1634:
[  249.493811]  #0: ffff9bd441f20c58 (&mgr->lock#3){+.+.}-{3:3}, at:
amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.494025] 1 lock held by (time-dir)/1640:
[  249.494029]  #0: ffff9bd441f20c58 (&mgr->lock#3){+.+.}-{3:3}, at:
amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.494229] 1 lock held by (ostnamed)/1723:
[  249.494233]  #0: ffff9bd441f20c58 (&mgr->lock#3){+.+.}-{3:3}, at:
amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.494432] 1 lock held by (pcscd)/1748:
[  249.494436]  #0: ffff9bd441f20c58 (&mgr->lock#3){+.+.}-{3:3}, at:
amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]

[  249.494639] =============================================

Here is the pastebin from my initial post: https://pastebin.com/0YHs6wyB
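
The "Showing all locks held" dump can be condensed to show which tasks
are involved with the same lock. A minimal sketch, assuming the dump is
fed on stdin in the wrapped form shown in this mail; the function name
summarize_contended_locks is hypothetical. It also rests on the
assumption that lockdep already lists a lock for a task that is still
waiting in __mutex_lock, so the output names everyone queued on the
lock, not only its current owner.

```shell
#!/bin/sh
# Hypothetical helper: group tasks from a lockdep "locks held" dump by lock address.
summarize_contended_locks() {
    awk '
    # "N lock(s) held by NAME/PID:" starts a new task section.
    /locks? held by / {
        task = $NF
        sub(/:$/, "", task)
        next
    }
    # " #N: ADDRESS ..." names one lock of the current task.
    / #[0-9]+: / {
        for (i = 1; i < NF; i++) {
            if ($i ~ /^#[0-9]+:$/) {
                addr = $(i + 1)
                count[addr]++
                tasks[addr] = tasks[addr] " " task
                break
            }
        }
    }
    # Report only locks that appear under more than one task.
    END {
        for (addr in count)
            if (count[addr] > 1)
                print addr ": " count[addr] " tasks:" tasks[addr]
    }'
}

# Demo on an excerpt of the dump above.
summarize_contended_locks <<'EOF'
[  249.493575] 3 locks held by gnome-shell/1633:
[  249.493579]  #0: ffff9bd4be4f8c00
(&sig->cred_guard_mutex){+.+.}-{3:3}, at: bprm_execve+0x3c/0x880
[  249.493593]  #1: ffff9bd4be4f8ca8
(&sig->exec_update_lock){++++}-{3:3}, at: begin_new_exec+0x384/0xca0
[  249.493607]  #2: ffff9bd441f20c58 (&mgr->lock#3){+.+.}-{3:3}, at:
amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.493807] 1 lock held by (brt-dbus)/1634:
[  249.493811]  #0: ffff9bd441f20c58 (&mgr->lock#3){+.+.}-{3:3}, at:
amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.494025] 1 lock held by (time-dir)/1640:
[  249.494029]  #0: ffff9bd441f20c58 (&mgr->lock#3){+.+.}-{3:3}, at:
amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
EOF
```

On this excerpt the only lock shared by several tasks is
ffff9bd441f20c58, listed for gnome-shell/1633, (brt-dbus)/1634 and
(time-dir)/1640.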


* Re: [Bug][5.19-rc0] Between commits fdaf9a5840ac and babf0bb978e3 GPU stopped entering in graphic mode.
@ 2022-07-09 12:10       ` Mikhail Gavrilov
  0 siblings, 0 replies; 10+ messages in thread
From: Mikhail Gavrilov @ 2022-07-09 12:10 UTC (permalink / raw)
  To: Christian König; +Cc: tzimmermann, Linux List Kernel Mailing, amd-gfx list

On Thu, Jul 7, 2022 at 2:50 PM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> On 07.07.22 at 02:20, Mikhail Gavrilov wrote:
> > On Tue, Jun 28, 2022 at 2:21 PM Mikhail Gavrilov
> > <mikhail.v.gavrilov@gmail.com> wrote:
> > Christian, can you look at why
> > drm_aperture_remove_conflicting_pci_framebuffers causes this kernel bug
> > on my machine?
>
> That looks like a problem outside of the amdgpu driver.
>
> What happens is that, during load, amdgpu asks whatever driver
> (vesafb, vgafb or efifb) currently handles the framebuffer to unload.
> This unload, in turn, now crashes for some reason.
>
> My best suggestion is to try to bisect this.

Hi Christian,
as described in my initial post, I already tried to bisect the issue.
It is very problematic because I see different symptoms at each step.
If I mark the steps that show different symptoms as skipped, the
bisection ends with a long list of possible commits.
Here is the bisect log from my initial post: https://pastebin.com/AhLMNfyv

If you want me to finish the bisection, please help me fix
this oops:
[    8.291177] page:00000000af2b6334 refcount:0 mapcount:0
mapping:0000000000000000 index:0x0 pfn:0x102a000
[    8.291202] head:00000000af2b6334 order:0 compound_mapcount:-1226
compound_pincount:0
[    8.291221] flags: 0x17ffffc0010000(head|node=0|zone=2|lastcpupid=0x1fffff)
[    8.291239] raw: 0017ffffc0010000 fffffb35c0a80008 fffffb35c0a80008
0000000000000000
[    8.291257] raw: 0000000000000000 0000000000000000 00000000ffffffff
0000000000000000
[    8.291275] page dumped because: VM_BUG_ON_PAGE(compound &&
compound_order(page) != order)
[    8.291298] ------------[ cut here ]------------
[    8.291309] kernel BUG at mm/page_alloc.c:1329!
[    8.291324] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[    8.291328] CPU: 8 PID: 599 Comm: systemd-udevd Not tainted
5.18.0-rc2-003-790b45f1bc6736a8dd48ba5731b6871e0217311e+ #361
[    8.291333] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022
[    8.291338] RIP: 0010:free_pcp_prepare+0x58d/0x5a0
[    8.291343] Code: c6 18 a2 85 a7 e8 d3 b7 fc ff 0f 0b 31 f6 48 89
df e8 97 cf 06 00 e9 29 ff ff ff 48 c7 c6 00 f1 85 a7 48 89 df e8 b3
b7 fc ff <0f> 0b 48 c7 c6 58 92 85 a7 e8 a5 b7 fc ff 0f 0b 0f 1f 00 0f
1f 44
[    8.291351] RSP: 0018:ffffb07c023ab9d8 EFLAGS: 00010296
[    8.291354] RAX: 000000000000004e RBX: fffffb35c0a80000 RCX: 0000000000000000
[    8.291358] RDX: 0000000000000001 RSI: ffffffffa789dbaf RDI: 00000000ffffffff
[    8.291361] RBP: 0000000000000009 R08: 0000000000000000 R09: ffffb07c023ab7c0
[    8.291365] R10: 0000000000000003 R11: ffff92ee2e2fffe8 R12: 0000000000000000
[    8.291368] R13: ffff92ee2a55d180 R14: 00000000fffffe00 R15: fffffb35c0a80000
[    8.291371] FS:  00007f80aa398680(0000) GS:ffff92edda200000(0000)
knlGS:0000000000000000
[    8.291376] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    8.291379] CR2: 00007f80aa38e616 CR3: 000000017d726000 CR4: 0000000000350ee0
[    8.291382] Call Trace:
[    8.291384]  <TASK>
[    8.291386]  ? find_held_lock+0x32/0x80
[    8.291391]  free_unref_page+0x25/0x2a0
[    8.291395]  __vunmap+0x261/0x3d0
[    8.291399]  drm_fbdev_cleanup+0x6b/0xc0
[    8.291403]  drm_fbdev_fb_destroy+0x15/0x30
[    8.291407]  unregister_framebuffer+0x2e/0x40
[    8.291411]  drm_client_dev_unregister+0x6e/0xe0
[    8.291416]  drm_dev_unregister+0x34/0x90
[    8.291419]  drm_dev_unplug+0x24/0x40
[    8.291422]  simpledrm_remove+0x11/0x20
[    8.291426]  platform_remove+0x1f/0x40
[    8.291429]  device_release_driver_internal+0x1b8/0x220
[    8.291433]  bus_remove_device+0xef/0x160
[    8.291437]  device_del+0x18c/0x3f0
[    8.291440]  platform_device_del.part.0+0x13/0x70
[    8.291444]  platform_device_unregister+0x1c/0x30
[    8.291447]  drm_aperture_detach_drivers+0xa3/0xd0
[    8.291452]  drm_aperture_remove_conflicting_pci_framebuffers+0x3f/0x70
[    8.291457]  amdgpu_pci_probe+0x126/0x3c0 [amdgpu]
[    8.291599]  local_pci_probe+0x41/0x80
[    8.291604]  pci_device_probe+0xaa/0x200
[    8.291607]  really_probe+0x1a0/0x370
[    8.291611]  __driver_probe_device+0xfb/0x170
[    8.291615]  driver_probe_device+0x1f/0x90
[    8.291618]  __driver_attach+0xbe/0x1a0
[    8.291622]  ? __device_attach_driver+0xe0/0xe0
[    8.291625]  bus_for_each_dev+0x65/0x90
[    8.291629]  bus_add_driver+0x150/0x1f0
[    8.291632]  driver_register+0x89/0xd0
[    8.291636]  ? 0xffffffffc067b000
[    8.291641]  do_one_initcall+0x69/0x350
[    8.291645]  ? do_init_module+0x22/0x260
[    8.291650]  ? rcu_read_lock_sched_held+0x3b/0x70
[    8.291654]  ? trace_kmalloc+0x3b/0x100
[    8.291658]  ? kmem_cache_alloc_trace+0x1eb/0x3a0
[    8.291662]  do_init_module+0x4a/0x260
[    8.291666]  __do_sys_finit_module+0x93/0xf0
[    8.291673]  do_syscall_64+0x3a/0x80
[    8.291677]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[    8.291681] RIP: 0033:0x7f80aaf4507d
[    8.291685] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e
fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24
08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 dd 0c 00 f7 d8 64 89
01 48
[    8.291694] RSP: 002b:00007ffe43973ce8 EFLAGS: 00000246 ORIG_RAX:
0000000000000139
[    8.291699] RAX: ffffffffffffffda RBX: 000055ce603a3fe0 RCX: 00007f80aaf4507d
[    8.291702] RDX: 0000000000000000 RSI: 000055ce60395ac0 RDI: 0000000000000011
[    8.291706] RBP: 000055ce60395ac0 R08: 0000000000000000 R09: 00007f80ab013c80
[    8.291709] R10: 0000000000000011 R11: 0000000000000246 R12: 0000000000020000
[    8.291713] R13: 000055ce60387c30 R14: 0000000000000000 R15: 000055ce6038ede0
[    8.291718]  </TASK>
[    8.291719] Modules linked in: amdgpu(+) drm_ttm_helper ttm
crct10dif_pclmul crc32_pclmul iommu_v2 crc32c_intel gpu_sched ucsi_ccg
typec_ucsi nvme drm_buddy igb ccp ghash_clmulni_intel typec
drm_dp_helper sp5100_tco nvme_core dca wmi ip6_tables ip_tables
ipmi_devintf ipmi_msghandler fuse
[    8.291740] ---[ end trace 0000000000000000 ]---


> Thanks for reporting. This bug has been fixed in
>
>
> https://cgit.freedesktop.org/drm/drm/commit/?h=drm-fixes&id=ee7a69aa38d87a3bbced7b8245c732c05ed0c6ec
>
> The patch should reach mainline next week or so.

Hi Thomas,
thanks for the patch, it fixes the oops.
But it does not fix the initial issue, where a lot of processes are
blocked on the mutex taken in amdgpu_ctx_mgr_entity_flush:
[  249.491425] INFO: task (brt-dbus):1634 blocked for more than 122 seconds.
[  249.491520]       Tainted: G        W    L   --------  ---
5.19.0-0.rc5.20220707git9f09069cde34.43.fc37.x86_64 #1
[  249.491526] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  249.491529] task:(brt-dbus)      state:D stack:14504 pid: 1634
ppid:     1 flags:0x00000002
[  249.491541] Call Trace:
[  249.491545]  <TASK>
[  249.491556]  __schedule+0x492/0x1620
[  249.491565]  ? lock_is_held_type+0xe8/0x140
[  249.491575]  ? find_held_lock+0x32/0x80
[  249.491590]  schedule+0x4e/0xb0
[  249.491597]  schedule_preempt_disabled+0x14/0x20
[  249.491603]  __mutex_lock+0x423/0x890
[  249.491609]  ? __lock_acquire+0x387/0x1ee0
[  249.491618]  ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.491849]  ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.492040]  amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.492237]  amdgpu_flush+0x25/0x40 [amdgpu]
[  249.492420]  filp_close+0x31/0x70
[  249.492429]  __close_range+0x1f3/0x490
[  249.492441]  __x64_sys_close_range+0x13/0x20
[  249.492446]  do_syscall_64+0x5b/0x80
[  249.492452]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.492459]  ? do_syscall_64+0x67/0x80
[  249.492467]  ? asm_exc_page_fault+0x27/0x30
[  249.492473]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.492480]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  249.492486] RIP: 0033:0x7f789d4c23cb
[  249.492512] RSP: 002b:00007ffe10773198 EFLAGS: 00000246 ORIG_RAX:
00000000000001b4
[  249.492518] RAX: ffffffffffffffda RBX: 00007ffe107731a0 RCX: 00007f789d4c23cb
[  249.492523] RDX: 0000000000000000 RSI: 00000000ffffffff RDI: 0000000000000027
[  249.492527] RBP: 00007ffe10773220 R08: 0000000000000000 R09: 00007ffe10773270
[  249.492531] R10: 00007ffe107730e0 R11: 0000000000000246 R12: 0000000000000002
[  249.492534] R13: 00007ffe10773230 R14: 0000000000000000 R15: 0000000000000002
[  249.492555]  </TASK>
[  249.492559] INFO: task (time-dir):1640 blocked for more than 122 seconds.
[  249.492564]       Tainted: G        W    L   --------  ---
5.19.0-0.rc5.20220707git9f09069cde34.43.fc37.x86_64 #1
[  249.492568] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  249.492571] task:(time-dir)      state:D stack:14504 pid: 1640
ppid:     1 flags:0x00000002
[  249.492580] Call Trace:
[  249.492584]  <TASK>
[  249.492592]  __schedule+0x492/0x1620
[  249.492597]  ? lock_is_held_type+0xe8/0x140
[  249.492605]  ? find_held_lock+0x32/0x80
[  249.492620]  schedule+0x4e/0xb0
[  249.492627]  schedule_preempt_disabled+0x14/0x20
[  249.492632]  __mutex_lock+0x423/0x890
[  249.492638]  ? __lock_acquire+0x387/0x1ee0
[  249.492646]  ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.492859]  ? amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.493049]  amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.493245]  amdgpu_flush+0x25/0x40 [amdgpu]
[  249.493428]  filp_close+0x31/0x70
[  249.493436]  __close_range+0x1f3/0x490
[  249.493447]  __x64_sys_close_range+0x13/0x20
[  249.493452]  do_syscall_64+0x5b/0x80
[  249.493457]  ? lockdep_hardirqs_on+0x7d/0x100
[  249.493465]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  249.493470] RIP: 0033:0x7f789d4c23cb
[  249.493478] RSP: 002b:00007ffe10773198 EFLAGS: 00000246 ORIG_RAX:
00000000000001b4
[  249.493484] RAX: ffffffffffffffda RBX: 00007ffe107731a0 RCX: 00007f789d4c23cb
[  249.493488] RDX: 0000000000000000 RSI: 00000000ffffffff RDI: 0000000000000027
[  249.493492] RBP: 00007ffe10773220 R08: 0000000000000000 R09: 00007ffe10773270
[  249.493496] R10: 00007ffe107730e0 R11: 0000000000000246 R12: 0000000000000002
[  249.493499] R13: 00007ffe10773230 R14: 0000000000000000 R15: 0000000000000002
[  249.493519]  </TASK>
[  249.493528]
               Showing all locks held in the system:
[  249.493537] 1 lock held by khungtaskd/182:
[  249.493542]  #0: ffffffffba168e20 (rcu_read_lock){....}-{1:2}, at:
debug_show_all_locks+0x15/0x16b
[  249.493565] 1 lock held by systemd-journal/879:
[  249.493575] 3 locks held by gnome-shell/1633:
[  249.493579]  #0: ffff9bd4be4f8c00
(&sig->cred_guard_mutex){+.+.}-{3:3}, at: bprm_execve+0x3c/0x880
[  249.493593]  #1: ffff9bd4be4f8ca8
(&sig->exec_update_lock){++++}-{3:3}, at: begin_new_exec+0x384/0xca0
[  249.493607]  #2: ffff9bd441f20c58 (&mgr->lock#3){+.+.}-{3:3}, at:
amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.493807] 1 lock held by (brt-dbus)/1634:
[  249.493811]  #0: ffff9bd441f20c58 (&mgr->lock#3){+.+.}-{3:3}, at:
amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.494025] 1 lock held by (time-dir)/1640:
[  249.494029]  #0: ffff9bd441f20c58 (&mgr->lock#3){+.+.}-{3:3}, at:
amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.494229] 1 lock held by (ostnamed)/1723:
[  249.494233]  #0: ffff9bd441f20c58 (&mgr->lock#3){+.+.}-{3:3}, at:
amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]
[  249.494432] 1 lock held by (pcscd)/1748:
[  249.494436]  #0: ffff9bd441f20c58 (&mgr->lock#3){+.+.}-{3:3}, at:
amdgpu_ctx_mgr_entity_flush+0x32/0xc0 [amdgpu]

[  249.494639] =============================================
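
The lock dump above is easier to read when tasks are grouped by lock address. The following is a minimal sketch, not part of the original report: the function name `group_by_lock` and both regular expressions are assumptions made for illustration, and the parser only targets the line shapes visible in this log (timestamps in brackets, mail-wrapped lines).

```python
import re
from collections import defaultdict

# Hypothetical patterns, written only against the lines shown in this dump.
HOLDER = re.compile(r"(\d+) locks? held by (.+?)/(\d+):")
LOCK = re.compile(r"#(\d+):\s+([0-9a-f]{16})\s+\(([^)]*)\)")


def group_by_lock(dump: str) -> dict:
    """Map (lock address, lock name) to the tasks listed against it."""
    # Drop "[  249.493537]" style timestamps, then undo mail line wrapping.
    text = re.sub(r"\[\s*\d+\.\d+\]", " ", dump)
    text = " ".join(text.split())

    tasks = defaultdict(list)
    # re.split with three groups yields: prefix, count, name, pid, body, ...
    parts = HOLDER.split(text)
    for i in range(1, len(parts), 4):
        name, pid, body = parts[i + 1], parts[i + 2], parts[i + 3]
        for _index, addr, lock in LOCK.findall(body):
            tasks[(addr, lock)].append(f"{name}/{pid}")
    return dict(tasks)


if __name__ == "__main__":
    import sys

    for (addr, lock), names in group_by_lock(sys.stdin.read()).items():
        print(f"{addr} {lock}: {', '.join(names)}")
```

Run on the dump above, this groups gnome-shell and the four blocked helper tasks under the single address ffff9bd441f20c58 (&mgr->lock#3).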

Here is pastebin from initial post: https://pastebin.com/0YHs6wyB

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Bug][5.19-rc0] Between commits fdaf9a5840ac and babf0bb978e3 GPU stopped entering in graphic mode.
  2022-07-09 12:10       ` Mikhail Gavrilov
@ 2022-07-13 12:38         ` Mikhail Gavrilov
  -1 siblings, 0 replies; 10+ messages in thread
From: Mikhail Gavrilov @ 2022-07-13 12:38 UTC (permalink / raw)
  To: Christian König; +Cc: amd-gfx list, Linux List Kernel Mailing, tzimmermann

On Sat, Jul 9, 2022 at 5:10 PM Mikhail Gavrilov
<mikhail.v.gavrilov@gmail.com> wrote:

> Hi Christian,
> if you read my initial post, you will see that I tried to bisect the
> issue. But it is very problematic because at each step I see different
> symptoms. And if I mark the steps with different symptoms as skip, we
> end up with a lot of possible commits:
> Here is my bisect from initial post: https://pastebin.com/AhLMNfyv

> [    8.291298] ------------[ cut here ]------------
> [    8.291309] kernel BUG at mm/page_alloc.c:1329!
> [    8.291324] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> [    8.291328] CPU: 8 PID: 599 Comm: systemd-udevd Not tainted
> 5.18.0-rc2-003-790b45f1bc6736a8dd48ba5731b6871e0217311e+ #361
> [    8.291333] Hardware name: System manufacturer System Product
> Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022
> [    8.291338] RIP: 0010:free_pcp_prepare+0x58d/0x5a0

The 5.19 release is coming soon, and I have not had a working kernel
newer than commit fdaf9a5840ac on any machine (all of my machines have
AMD graphics).

Bisecting with the mutex issue marked as "bad" and every other
non-working state marked as "skip" did not lead to anything useful.
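
The classification rule described above (mutex hang is "bad", any other failure is "skip") can be sketched as a small helper. This is an illustration under stated assumptions, not the procedure actually used: the function name `classify` and the log patterns are hypothetical and taken only from the messages quoted in this thread. The return values follow the `git bisect run` convention, where 0 means good, 125 means skip, and other codes from 1 to 127 mean bad.

```python
import re

# Exit codes as understood by "git bisect run".
GOOD, BAD, SKIP = 0, 1, 125

# Hypothetical symptom patterns, copied from the logs in this thread.
MUTEX_HANG = re.compile(r"blocked for more than \d+ seconds")
FLUSH_FRAME = "amdgpu_ctx_mgr_entity_flush"
OTHER_CRASH = re.compile(r"kernel BUG at|invalid opcode")


def classify(dmesg: str) -> int:
    """Return a bisect verdict for the kernel log of one test boot."""
    # The original symptom: a hung task stuck in the amdgpu flush path.
    if MUTEX_HANG.search(dmesg) and FLUSH_FRAME in dmesg:
        return BAD
    # Any other crash hides whether the original bug is present.
    if OTHER_CRASH.search(dmesg):
        return SKIP
    return GOOD


if __name__ == "__main__":
    import sys

    sys.exit(classify(sys.stdin.read()))
```

Since each step here needs a reboot into the freshly built kernel, the verdict would have to be fed back by hand with git bisect good, bad or skip rather than through an unattended `git bisect run`.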

Marking every commit on which the kernel does not work as "bad" does
not lead to anything good either.
Here is the log of that attempt:
$ git bisect log
git bisect start
# status: waiting for both good and bad commits
# good: [fdaf9a5840acaab18694a19e0eb0aa51162eeeed] Merge tag
'folio-5.19' of git://git.infradead.org/users/willy/pagecache
git bisect good fdaf9a5840acaab18694a19e0eb0aa51162eeeed
# status: waiting for bad commit, 1 good commit known
# bad: [babf0bb978e3c9fce6c4eba6b744c8754fd43d8e] Merge tag
'xfs-5.19-for-linus' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
git bisect bad babf0bb978e3c9fce6c4eba6b744c8754fd43d8e

# 01 - good: [86c87bea6b42100c67418af690919c44de6ede6e] Merge tag
'devicetree-for-5.19' of
git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
git bisect good 86c87bea6b42100c67418af690919c44de6ede6e

# 02 - observed initial problem with mutex
# bad: [43ab20c599f4dc4c3972a8386ef4ca3943b5f9cd] drm/i915/gt: Fix
build error without CONFIG_PM
git bisect bad 43ab20c599f4dc4c3972a8386ef4ca3943b5f9cd

# 03 - observed invalid opcode: 0000 [#1] PREEMPT SMP NOPTI - RIP:
0010:free_pcp_prepare+0x58d/0x5a0
# bad: [790b45f1bc6736a8dd48ba5731b6871e0217311e] drm/i915/bios: Parse
the seamless DRRS min refresh rate
git bisect bad 790b45f1bc6736a8dd48ba5731b6871e0217311e

# 04 - observed invalid opcode: 0000 [#1] PREEMPT SMP NOPTI - RIP:
0010:free_pcp_prepare+0x455/0x650
# bad: [c6ed9f66eb70aeaac9998bd3552ada740d90e20c]
drm/nouveau/gr/gf100-: change gf108_gr_fwif from global to static
git bisect bad c6ed9f66eb70aeaac9998bd3552ada740d90e20c

# 05 good: [3123109284176b1532874591f7c81f3837bbdc17] Linux 5.18-rc1
git bisect good 3123109284176b1532874591f7c81f3837bbdc17

# 06 good: [711c7adc4687250deb550ee8a6994203f817b2ca] drm: exynos:
dsi: Use drm panel_bridge API
git bisect good 711c7adc4687250deb550ee8a6994203f817b2ca

# 07 - observed invalid opcode: 0000 [#1] PREEMPT SMP NOPTI - RIP:
0010:free_pcp_prepare+0x35e/0x410
# bad: [047a1b877ed48098bed71fcfb1d4891e1b54441d] dma-buf &
drm/amdgpu: remove dma_resv workaround
git bisect bad 047a1b877ed48098bed71fcfb1d4891e1b54441d

# 08 good: [644704740b8282c9ee9483a38666ee4a4561c37c] drm/amdgpu: use
dma_resv_for_each_fence for CS workaround v2
git bisect good 644704740b8282c9ee9483a38666ee4a4561c37c

# 09 - observed invalid opcode: 0000 [#1] PREEMPT SMP NOPTI - RIP:
0010:free_pcp_prepare+0x35e/0x410
# bad: [61fe0ab26e36998cebec48805d6873e31f0d79d7] drm/gma500: fix a
missing break in psb_intel_crtc_mode_set
git bisect bad 61fe0ab26e36998cebec48805d6873e31f0d79d7

# 10 good: [1c3b2a27def609473ed13b1cd668cb10deab49b4] drm/nouveau/clk:
Fix an incorrect NULL check on list iterator
git bisect good 1c3b2a27def609473ed13b1cd668cb10deab49b4

# 11 - observed invalid opcode: 0000 [#1] PREEMPT SMP NOPTI - RIP:
0010:free_pcp_prepare+0x35e/0x410
# bad: [aa46154355e1e81ef746470d2e88bdb283508bff] drm/ingenic: Add
ingenic_drm_bridge_atomic_enable and disable
git bisect bad aa46154355e1e81ef746470d2e88bdb283508bff

# 12 good: [71d637823cac7748079a912e0373476c7cf6f985] dma-buf: finally
make dma_resv_excl_fence private v2
git bisect good 71d637823cac7748079a912e0373476c7cf6f985

# 13 - observed invalid opcode: 0000 [#1] PREEMPT SMP NOPTI - RIP:
0010:free_pcp_prepare+0x35e/0x410
# bad: [33f2069fb6a9c2d6509accc39521d3f4d6369576] drm/nouveau: support
more than one write fence in fenv50_wndw_prepare_fb
git bisect bad 33f2069fb6a9c2d6509accc39521d3f4d6369576

# 14 - observed invalid opcode: 0000 [#1] PREEMPT SMP NOPTI - RIP:
0010:free_pcp_prepare+0x35e/0x410
# bad: [9cbbd694a58bdf24def2462276514c90cab7cf80] Merge drm/drm-next
into drm-misc-next
git bisect bad 9cbbd694a58bdf24def2462276514c90cab7cf80

# first bad commit: [9cbbd694a58bdf24def2462276514c90cab7cf80] Merge
drm/drm-next into drm-misc-next


We need an alternative way to find the problem. Otherwise the kernel
will be released in a non-working state.

-- 
Best Regards,
Mike Gavrilov.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Bug][5.19-rc0] Between commits fdaf9a5840ac and babf0bb978e3 GPU stopped entering in graphic mode.
  2022-07-13 12:38         ` Mikhail Gavrilov
@ 2022-07-18 23:52           ` Mikhail Gavrilov
  -1 siblings, 0 replies; 10+ messages in thread
From: Mikhail Gavrilov @ 2022-07-18 23:52 UTC (permalink / raw)
  To: Christian König; +Cc: amd-gfx list, Linux List Kernel Mailing, tzimmermann

On Wed, Jul 13, 2022 at 5:38 PM Mikhail Gavrilov
<mikhail.v.gavrilov@gmail.com> wrote:
> # first bad commit: [9cbbd694a58bdf24def2462276514c90cab7cf80] Merge
> drm/drm-next into drm-misc-next
>

I don't know whom to thank, but the issue disappeared in 5.19-rc7.

-- 
Best Regards,
Mike Gavrilov.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2022-07-18 23:52 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-28  9:21 [Bug][5.19-rc0] Between commits fdaf9a5840ac and babf0bb978e3 GPU stopped entering in graphic mode Mikhail Gavrilov
2022-07-07  0:20 ` Mikhail Gavrilov
2022-07-07  9:50   ` Christian König
2022-07-09 12:10     ` Mikhail Gavrilov
2022-07-09 12:10       ` Mikhail Gavrilov
2022-07-13 12:38       ` Mikhail Gavrilov
2022-07-13 12:38         ` Mikhail Gavrilov
2022-07-18 23:52         ` Mikhail Gavrilov
2022-07-18 23:52           ` Mikhail Gavrilov
2022-07-07 10:10   ` Thomas Zimmermann
