linux-kernel.vger.kernel.org archive mirror
* Command "clinfo" causes BUG: kernel NULL pointer dereference, address: 0000000000000008 on driver amdgpu
@ 2022-07-18 23:50 Mikhail Gavrilov
  2022-07-19  2:33 ` Chen, Guchun
  0 siblings, 1 reply; 5+ messages in thread
From: Mikhail Gavrilov @ 2022-07-18 23:50 UTC (permalink / raw)
  To: amd-gfx list, Linux List Kernel Mailing, Christian König

Hi guys, I am continuing to test 5.19-rc7 and found a bug.
Running the "clinfo" command causes "BUG: kernel NULL pointer
dereference, address: 0000000000000008" in the amdgpu driver.

Here is the trace:
[ 1320.203332] BUG: kernel NULL pointer dereference, address: 0000000000000008
[ 1320.203338] #PF: supervisor read access in kernel mode
[ 1320.203340] #PF: error_code(0x0000) - not-present page
[ 1320.203341] PGD 0 P4D 0
[ 1320.203344] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 1320.203346] CPU: 5 PID: 1226 Comm: kworker/5:2 Tainted: G W L -------- --- 5.19.0-0.rc7.53.fc37.x86_64+debug #1
[ 1320.203348] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022
[ 1320.203350] Workqueue: events delayed_fput
[ 1320.203354] RIP: 0010:dma_resv_add_fence+0x5a/0x2d0
[ 1320.203358] Code: 85 c0 0f 84 43 02 00 00 8d 50 01 09 c2 0f 88 47 02 00 00 8b 15 73 10 99 01 49 8d 45 70 48 89 44 24 10 85 d2 0f 85 05 02 00 00 <49> 8b 44 24 08 48 3d 80 93 53 97 0f 84 06 01 00 00 48 3d 20 93 53
[ 1320.203360] RSP: 0018:ffffaf4cc1adfc68 EFLAGS: 00010246
[ 1320.203362] RAX: ffff976660408208 RBX: ffff975f545f2000 RCX: 0000000000000000
[ 1320.203363] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff976660408198
[ 1320.203364] RBP: ffff976806f6e800 R08: 0000000000000000 R09: 0000000000000000
[ 1320.203366] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[ 1320.203367] R13: ffff976660408198 R14: ffff975f545f2000 R15: ffff976660408198
[ 1320.203368] FS: 0000000000000000(0000) GS:ffff976de1200000(0000) knlGS:0000000000000000
[ 1320.203370] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1320.203371] CR2: 0000000000000008 CR3: 00000007fb31c000 CR4: 0000000000350ee0
[ 1320.203372] Call Trace:
[ 1320.203374] <TASK>
[ 1320.203378] amdgpu_amdkfd_gpuvm_destroy_cb+0x5d/0x1e0 [amdgpu]
[ 1320.203516] amdgpu_vm_fini+0x2f/0x4e0 [amdgpu]
[ 1320.203625] ? mutex_destroy+0x21/0x50
[ 1320.203629] amdgpu_driver_postclose_kms+0x1da/0x2b0 [amdgpu]
[ 1320.203734] drm_file_free.part.0+0x20d/0x260
[ 1320.203738] drm_release+0x6a/0x120
[ 1320.203741] __fput+0xab/0x270
[ 1320.203743] delayed_fput+0x1f/0x30
[ 1320.203745] process_one_work+0x2a0/0x600
[ 1320.203749] worker_thread+0x4f/0x3a0
[ 1320.203751] ? process_one_work+0x600/0x600
[ 1320.203753] kthread+0xf5/0x120
[ 1320.203755] ? kthread_complete_and_exit+0x20/0x20
[ 1320.203758] ret_from_fork+0x22/0x30
[ 1320.203764] </TASK>
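
For readers following the trace, here is a minimal sketch of how the faulting instruction can be read out of the "Code:" line. The byte marked with <..> is where RIP pointed at the time of the fault. The decoder below is an illustration written only for this one encoding (REX.W 8B with a SIB byte and an 8-bit displacement), not a general disassembler; the kernel tree's scripts/decodecode is the usual tool for this. The names faulting_bytes and decode_mov_load are hypothetical helpers, and the register values are copied from the oops.

```python
# Values copied from the oops above.
CODE = ("85 c0 0f 84 43 02 00 00 8d 50 01 09 c2 0f 88 47 02 00 00 8b 15 73 10 "
        "99 01 49 8d 45 70 48 89 44 24 10 85 d2 0f 85 05 02 00 00 <49> 8b 44 "
        "24 08 48 3d 80 93 53 97 0f 84 06 01 00 00 48 3d 20 93 53")
R12 = 0x0000000000000000
CR2 = 0x0000000000000008

REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
         "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def faulting_bytes(code_line):
    """Return the bytes starting at the one marked <xx> (the faulting RIP)."""
    tokens = code_line.split()
    start = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    return bytes(int(t.strip("<>"), 16) for t in tokens[start:])


def decode_mov_load(insn):
    """Decode only 'REX.W 8B /r' with a SIB byte and an 8-bit displacement.

    Returns (dest_reg, base_reg, displacement); anything else raises.
    """
    rex, opcode, modrm, sib, disp8 = insn[:5]
    if (rex & 0xF0) != 0x40 or not (rex & 0x08) or opcode != 0x8B:
        raise ValueError("not a 64-bit MOV r64, r/m64")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or rm != 4:
        raise ValueError("expected [SIB + disp8] addressing")
    index, base = (sib >> 3) & 7, sib & 7
    if index != 4 or (rex & 0x02):
        raise ValueError("index register present; not handled here")
    dest = REG64[reg | ((rex & 0x04) << 1)]       # REX.R extends the reg field
    base_reg = REG64[base | ((rex & 0x01) << 3)]  # REX.B extends the SIB base
    disp = disp8 - 256 if disp8 & 0x80 else disp8  # sign-extend disp8
    return dest, base_reg, disp


dest, base, disp = decode_mov_load(faulting_bytes(CODE))
print(f"mov {disp:#x}(%{base}),%{dest}")  # AT&T syntax: mov 0x8(%r12),%rax
assert base == "r12" and R12 + disp == CR2
```

The load reads 8 bytes past the address in R12, and R12 is zero, which matches the reported fault address (CR2) of 0000000000000008. That is consistent with a NULL structure pointer whose member at offset 8 is being read inside dma_resv_add_fence; which pointer that is cannot be determined from the trace alone.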

Full kernel log is here:
https://pastebin.com/EeKh2LEr

About an hour later, after many "BUG: workqueue lockup" messages, the
GPU hung completely.

I will be glad to test patches that fix this bug.

-- 
Best Regards,
Mike Gavrilov.


* RE: Command "clinfo" causes BUG: kernel NULL pointer dereference, address: 0000000000000008 on driver amdgpu
  2022-07-18 23:50 Command "clinfo" causes BUG: kernel NULL pointer dereference, address: 0000000000000008 on driver amdgpu Mikhail Gavrilov
@ 2022-07-19  2:33 ` Chen, Guchun
  2022-07-19  8:40   ` Mike Lothian
  0 siblings, 1 reply; 5+ messages in thread
From: Chen, Guchun @ 2022-07-19  2:33 UTC (permalink / raw)
  To: Mikhail Gavrilov, amd-gfx list, Linux List Kernel Mailing,
	Christian König

Patch https://patchwork.freedesktop.org/series/106024/ should fix this.

Regards,
Guchun



* Re: Command "clinfo" causes BUG: kernel NULL pointer dereference, address: 0000000000000008 on driver amdgpu
  2022-07-19  2:33 ` Chen, Guchun
@ 2022-07-19  8:40   ` Mike Lothian
  2022-07-19 11:26     ` Mikhail Gavrilov
  0 siblings, 1 reply; 5+ messages in thread
From: Mike Lothian @ 2022-07-19  8:40 UTC (permalink / raw)
  To: Chen, Guchun
  Cc: Mikhail Gavrilov, amd-gfx list, Linux List Kernel Mailing,
	Christian König

I was told that this patch replaces the patch you mentioned:
https://patchwork.freedesktop.org/series/106078/ and that it is the
one that will hopefully land in Linus's tree.

On Tue, 19 Jul 2022 at 03:33, Chen, Guchun <Guchun.Chen@amd.com> wrote:
>
> Patch https://patchwork.freedesktop.org/series/106024/ should fix this.
>
> Regards,
> Guchun
>


* Re: Command "clinfo" causes BUG: kernel NULL pointer dereference, address: 0000000000000008 on driver amdgpu
  2022-07-19  8:40   ` Mike Lothian
@ 2022-07-19 11:26     ` Mikhail Gavrilov
  2022-07-19 17:05       ` Mikhail Gavrilov
  0 siblings, 1 reply; 5+ messages in thread
From: Mikhail Gavrilov @ 2022-07-19 11:26 UTC (permalink / raw)
  To: Mike Lothian
  Cc: Chen, Guchun, amd-gfx list, Linux List Kernel Mailing,
	Christian König

On Tue, Jul 19, 2022 at 1:40 PM Mike Lothian <mike@fireburn.co.uk> wrote:
>
> I was told that this patch replaces the patch you mentioned:
> https://patchwork.freedesktop.org/series/106078/ and that it is the
> one that will hopefully land in Linus's tree.
>

Great, I confirm that both patches solve the issue.
As I understand it, the second patch [1] is the more correct one and
should be merged into 5.19 soon, right?

And since we are talking about clinfo, I have a question.
Has anyone else encountered the problem that, on configurations with
two GPUs, clinfo hangs in a loop while fully occupying one processor
core? In my case, one GPU is integrated in the Renoir processor and the
other is a discrete AMD Radeon 6800M. The BIOS offers no option to turn
off the integrated GPU, so there is no way to check this configuration
with each GPU separately. In the kernel log there is no error, so it is
most likely a user-space issue, but I am not sure about it.

The clinfo backtrace is here [2].

[1] https://patchwork.freedesktop.org/series/106078/
[2] https://pastebin.com/wv5iGibi

-- 
Best Regards,
Mike Gavrilov.


* Re: Command "clinfo" causes BUG: kernel NULL pointer dereference, address: 0000000000000008 on driver amdgpu
  2022-07-19 11:26     ` Mikhail Gavrilov
@ 2022-07-19 17:05       ` Mikhail Gavrilov
  0 siblings, 0 replies; 5+ messages in thread
From: Mikhail Gavrilov @ 2022-07-19 17:05 UTC (permalink / raw)
  To: Mike Lothian
  Cc: Chen, Guchun, amd-gfx list, Linux List Kernel Mailing,
	Christian König

On Tue, Jul 19, 2022 at 4:26 PM Mikhail Gavrilov
<mikhail.v.gavrilov@gmail.com> wrote:
> In the kernel log there is no error, so it is most likely a user-space issue, but I am not
> sure about it.

But I am confused by these messages in the kernel log:
[ 1962.000909] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 1962.000912] amdgpu: Failed to evict process queues
[ 1962.000918] amdgpu: Failed to quiesce KFD
[ 1966.010395] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 1966.010406] amdgpu: Resetting wave fronts (cpsch) on dev 00000000b40e7982
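
As a small aid for reading these messages, the sketch below parses dmesg-style lines and reports how far apart the two "Queue preemption time out" messages are. It only does arithmetic on the timestamps quoted above and makes no claim about what the interval means inside the driver. The helper names parse and timeout_gaps are hypothetical, and the log lines are copied here with the mail client's line wraps undone.

```python
import re
from decimal import Decimal

# Log lines copied from the message above.
LOG = """\
[ 1962.000909] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 1962.000912] amdgpu: Failed to evict process queues
[ 1962.000918] amdgpu: Failed to quiesce KFD
[ 1966.010395] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 1966.010406] amdgpu: Resetting wave fronts (cpsch) on dev 00000000b40e7982
"""

# "[ seconds.microseconds] message"
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def parse(log):
    """Yield (timestamp, message) pairs from dmesg-style lines."""
    for line in log.splitlines():
        m = LINE_RE.match(line)
        if m:
            # Decimal keeps the microsecond timestamps exact.
            yield Decimal(m.group(1)), m.group(2)


def timeout_gaps(log, needle="Queue preemption time out"):
    """Return the time between consecutive messages containing needle."""
    stamps = [ts for ts, msg in parse(log) if needle in msg]
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


print(timeout_gaps(LOG))  # [Decimal('4.009486')]
```

In this excerpt the second preemption timeout follows the first by about 4 seconds, and the wave-front reset is logged immediately after it.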


-- 
Best Regards,
Mike Gavrilov.


