* [amdgpu] deadlock
@ 2021-02-03  8:33 Daniel Gomez
  2021-02-03  8:36 ` Christian König
  2021-02-03 14:37 ` Christian König
  0 siblings, 2 replies; 20+ messages in thread
From: Daniel Gomez @ 2021-02-03  8:33 UTC (permalink / raw)
  To: amd-gfx, dri-devel; +Cc: linux-kernel, alexander.deucher, christian.koenig

Hi all,

I'm hitting a deadlock with the mainline amdgpu driver when running two
OpenCL applications in parallel. So far we've been able to reproduce it
easily by running clinfo and MatrixMultiplication (from the AMD
opencl-samples) at the same time. The opencl-samples are quite old, so if
you have any other suggestion for testing I'd be happy to try it as well.

How to replicate the issue:

# while true; do /usr/bin/MatrixMultiplication --device gpu \
    --deviceId 0 -x 1000 -y 1000 -z 1000 -q -t -i 50; done
# while true; do clinfo; done

Output:

After a minute or less (sometimes more), MatrixMultiplication and clinfo
hang. In addition, radeontop shows the Graphics pipe going from ~50% to
100% and the shader clock rising from ~35% to ~96%.

clinfo keeps printing:
ioctl(7, DRM_IOCTL_SYNCOBJ_WAIT, 0x7ffe46e5f950) = -1 ETIME (Timer expired)
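
That output comes from strace, by the way; assuming a stock strace,
something like this captures just the failing wait:

# strace -f -e trace=ioctl clinfo 2>&1 | grep SYNCOBJ_WAIT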

And MatrixMultiplication prints the following (under strace) if you try to
kill the process:

sched_yield()                           = 0
futex(0x557e945343b8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0,
NULL, FUTEX_BITSET_MATCH_ANYstrace: Process 651 detached
 <detached ...>

After this, the GPU is not functional at all and a power cycle is needed
to restore the system.

Hardware info:
CPU: AMD Ryzen Embedded V1605B with Radeon Vega Gfx (8) @ 2.000GHz
GPU: AMD ATI Radeon Vega Series / Radeon Vega Mobile Series

03:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
[AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series]
(rev 83)
    DeviceName: Broadcom 5762
    Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge
[Radeon Vega Series / Radeon Vega Mobile Series]
    Kernel driver in use: amdgpu
    Kernel modules: amdgpu

Linux kernel info:

root@qt5222:~# uname -a
Linux qt5222 5.11.0-rc6-qtec-standard #2 SMP Tue Feb 2 09:41:46 UTC
2021 x86_64 x86_64 x86_64 GNU/Linux

By enabling the kernel lock statistics I could see that MatrixMultiplication
is hung in the amdgpu_mn_invalidate_gfx function:

[  738.359202] 1 lock held by MatrixMultiplic/653:
[  738.359206]  #0: ffff88810e364fe0
(&adev->notifier_lock){+.+.}-{3:3}, at:
amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
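
For reference, these are roughly the lock-debugging options I have enabled
here; I'm listing the config names from memory, so treat the exact set as
approximate:

CONFIG_PROVE_LOCKING=y     # lockdep, tracks the locks each task holds
CONFIG_LOCK_STAT=y         # lock contention statistics
CONFIG_DETECT_HUNG_TASK=y  # khungtaskd, reports tasks stuck in D state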

In the amdgpu_mn_invalidate_gfx function I can see that
dma_resv_wait_timeout_rcu is called with wait_all (wait on all fences) and
MAX_SCHEDULE_TIMEOUT, so I guess the code gets stuck there waiting forever.
According to the documentation: "When somebody tries to invalidate the page
tables we block the update until all operations on the pages in question
are completed, then those pages are marked as accessed and also dirty if it
wasn't a read only access." It looks like the fences are deadlocked and the
wait never returns. Could that be the case? Any hint as to where I can look
to fix this?
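
For context, this is roughly the path I'm looking at, paraphrased from
drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c in 5.11 (simplified and quoted from
memory, so not the exact source):

static bool amdgpu_mn_invalidate_gfx(struct mmu_interval_notifier *mni,
				     const struct mmu_notifier_range *range,
				     unsigned long cur_seq)
{
	struct amdgpu_bo *bo = container_of(mni, struct amdgpu_bo, notifier);
	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
	long r;

	if (!mmu_notifier_range_blockable(range))
		return false;

	/* The lock lockdep reports as held in the dump above. */
	mutex_lock(&adev->notifier_lock);
	mmu_interval_set_seq(mni, cur_seq);

	/* Unbounded wait for every fence on the BO's reservation object;
	 * if one of those fences never signals, this never returns. */
	r = dma_resv_wait_timeout_rcu(bo->tbo.base.resv, true, false,
				      MAX_SCHEDULE_TIMEOUT);
	mutex_unlock(&adev->notifier_lock);
	if (r <= 0)
		DRM_ERROR("(%ld) failed to wait for user bo\n", r);

	return true;
}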

Thank you in advance.

Here the full dmesg output:

[  738.337726] INFO: task MatrixMultiplic:653 blocked for more than 122 seconds.
[  738.344937]       Not tainted 5.11.0-rc6-qtec-standard #2
[  738.350384] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  738.358240] task:MatrixMultiplic state:D stack:    0 pid:  653
ppid:     1 flags:0x00004000
[  738.358254] Call Trace:
[  738.358261]  ? dma_fence_default_wait+0x1eb/0x230
[  738.358276]  __schedule+0x370/0x960
[  738.358291]  ? dma_fence_default_wait+0x117/0x230
[  738.358297]  ? dma_fence_default_wait+0x1eb/0x230
[  738.358305]  schedule+0x51/0xc0
[  738.358312]  schedule_timeout+0x275/0x380
[  738.358324]  ? dma_fence_default_wait+0x1eb/0x230
[  738.358332]  ? mark_held_locks+0x4f/0x70
[  738.358341]  ? dma_fence_default_wait+0x117/0x230
[  738.358347]  ? lockdep_hardirqs_on_prepare+0xd4/0x180
[  738.358353]  ? _raw_spin_unlock_irqrestore+0x39/0x40
[  738.358362]  ? dma_fence_default_wait+0x117/0x230
[  738.358370]  ? dma_fence_default_wait+0x1eb/0x230
[  738.358375]  dma_fence_default_wait+0x214/0x230
[  738.358384]  ? dma_fence_release+0x1a0/0x1a0
[  738.358396]  dma_fence_wait_timeout+0x105/0x200
[  738.358405]  dma_resv_wait_timeout_rcu+0x1aa/0x5e0
[  738.358421]  amdgpu_mn_invalidate_gfx+0x55/0xa0 [amdgpu]
[  738.358688]  __mmu_notifier_release+0x1bb/0x210
[  738.358710]  exit_mmap+0x2f/0x1e0
[  738.358723]  ? find_held_lock+0x34/0xa0
[  738.358746]  mmput+0x39/0xe0
[  738.358756]  do_exit+0x5c3/0xc00
[  738.358763]  ? find_held_lock+0x34/0xa0
[  738.358780]  do_group_exit+0x47/0xb0
[  738.358791]  get_signal+0x15b/0xc50
[  738.358807]  arch_do_signal_or_restart+0xaf/0x710
[  738.358816]  ? lockdep_hardirqs_on_prepare+0xd4/0x180
[  738.358822]  ? _raw_spin_unlock_irqrestore+0x39/0x40
[  738.358831]  ? ktime_get_mono_fast_ns+0x50/0xa0
[  738.358844]  ? amdgpu_drm_ioctl+0x6b/0x80 [amdgpu]
[  738.359044]  exit_to_user_mode_prepare+0xf2/0x1b0
[  738.359054]  syscall_exit_to_user_mode+0x19/0x60
[  738.359062]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  738.359069] RIP: 0033:0x7f6b89a51887
[  738.359076] RSP: 002b:00007f6b82b54b18 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[  738.359086] RAX: fffffffffffffe00 RBX: 00007f6b82b54b50 RCX: 00007f6b89a51887
[  738.359091] RDX: 00007f6b82b54b50 RSI: 00000000c02064c3 RDI: 0000000000000007
[  738.359096] RBP: 00000000c02064c3 R08: 0000000000000003 R09: 00007f6b82b54bbc
[  738.359101] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000165a0bc00
[  738.359106] R13: 0000000000000007 R14: 0000000000000001 R15: 0000000000000000
[  738.359129]
               Showing all locks held in the system:
[  738.359141] 1 lock held by khungtaskd/54:
[  738.359148]  #0: ffffffff829f6840 (rcu_read_lock){....}-{1:2}, at:
debug_show_all_locks+0x15/0x183
[  738.359187] 1 lock held by systemd-journal/174:
[  738.359202] 1 lock held by MatrixMultiplic/653:
[  738.359206]  #0: ffff88810e364fe0
(&adev->notifier_lock){+.+.}-{3:3}, at:
amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]

Daniel


* Re: [amdgpu] deadlock
  2021-02-03  8:33 [amdgpu] deadlock Daniel Gomez
@ 2021-02-03  8:36 ` Christian König
  2021-02-03  8:48   ` Daniel Vetter
  2021-02-03 14:37 ` Christian König
  1 sibling, 1 reply; 20+ messages in thread
From: Christian König @ 2021-02-03  8:36 UTC (permalink / raw)
  To: Daniel Gomez, amd-gfx, dri-devel; +Cc: linux-kernel, alexander.deucher

Hi Daniel,

this is not a deadlock, but rather a hardware lockup.

Which OpenCl stack are you using?

Regards,
Christian.

Am 03.02.21 um 09:33 schrieb Daniel Gomez:
> Hi all,
>
> I'm hitting a deadlock with the mainline amdgpu driver when running two
> OpenCL applications in parallel. So far we've been able to reproduce it
> easily by running clinfo and MatrixMultiplication (from the AMD
> opencl-samples) at the same time. The opencl-samples are quite old, so if
> you have any other suggestion for testing I'd be happy to try it as well.
>
> How to replicate the issue:
>
> # while true; do /usr/bin/MatrixMultiplication --device gpu \
>      --deviceId 0 -x 1000 -y 1000 -z 1000 -q -t -i 50; done
> # while true; do clinfo; done
>
> Output:
>
> After a minute or less (sometimes more), MatrixMultiplication and clinfo
> hang. In addition, radeontop shows the Graphics pipe going from ~50% to
> 100% and the shader clock rising from ~35% to ~96%.
>
> clinfo keeps printing:
> ioctl(7, DRM_IOCTL_SYNCOBJ_WAIT, 0x7ffe46e5f950) = -1 ETIME (Timer expired)
>
> And MatrixMultiplication prints the following (under strace) if you try to
> kill the process:
>
> sched_yield()                           = 0
> futex(0x557e945343b8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0,
> NULL, FUTEX_BITSET_MATCH_ANYstrace: Process 651 detached
>   <detached ...>
>
> After this, the GPU is not functional at all and a power cycle is needed
> to restore the system.
>
> Hardware info:
> CPU: AMD Ryzen Embedded V1605B with Radeon Vega Gfx (8) @ 2.000GHz
> GPU: AMD ATI Radeon Vega Series / Radeon Vega Mobile Series
>
> 03:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
> [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series]
> (rev 83)
>      DeviceName: Broadcom 5762
>      Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge
> [Radeon Vega Series / Radeon Vega Mobile Series]
>      Kernel driver in use: amdgpu
>      Kernel modules: amdgpu
>
> Linux kernel info:
>
> root@qt5222:~# uname -a
> Linux qt5222 5.11.0-rc6-qtec-standard #2 SMP Tue Feb 2 09:41:46 UTC
> 2021 x86_64 x86_64 x86_64 GNU/Linux
>
> By enabling the kernel lock statistics I could see that MatrixMultiplication
> is hung in the amdgpu_mn_invalidate_gfx function:
>
> [  738.359202] 1 lock held by MatrixMultiplic/653:
> [  738.359206]  #0: ffff88810e364fe0
> (&adev->notifier_lock){+.+.}-{3:3}, at:
> amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
>
> In the amdgpu_mn_invalidate_gfx function I can see that
> dma_resv_wait_timeout_rcu is called with wait_all (wait on all fences) and
> MAX_SCHEDULE_TIMEOUT, so I guess the code gets stuck there waiting forever.
> According to the documentation: "When somebody tries to invalidate the page
> tables we block the update until all operations on the pages in question
> are completed, then those pages are marked as accessed and also dirty if it
> wasn't a read only access." It looks like the fences are deadlocked and the
> wait never returns. Could that be the case? Any hint as to where I can look
> to fix this?
>
> Thank you  in advance.
>
> Here the full dmesg output:
>
> [  738.337726] INFO: task MatrixMultiplic:653 blocked for more than 122 seconds.
> [  738.344937]       Not tainted 5.11.0-rc6-qtec-standard #2
> [  738.350384] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  738.358240] task:MatrixMultiplic state:D stack:    0 pid:  653
> ppid:     1 flags:0x00004000
> [  738.358254] Call Trace:
> [  738.358261]  ? dma_fence_default_wait+0x1eb/0x230
> [  738.358276]  __schedule+0x370/0x960
> [  738.358291]  ? dma_fence_default_wait+0x117/0x230
> [  738.358297]  ? dma_fence_default_wait+0x1eb/0x230
> [  738.358305]  schedule+0x51/0xc0
> [  738.358312]  schedule_timeout+0x275/0x380
> [  738.358324]  ? dma_fence_default_wait+0x1eb/0x230
> [  738.358332]  ? mark_held_locks+0x4f/0x70
> [  738.358341]  ? dma_fence_default_wait+0x117/0x230
> [  738.358347]  ? lockdep_hardirqs_on_prepare+0xd4/0x180
> [  738.358353]  ? _raw_spin_unlock_irqrestore+0x39/0x40
> [  738.358362]  ? dma_fence_default_wait+0x117/0x230
> [  738.358370]  ? dma_fence_default_wait+0x1eb/0x230
> [  738.358375]  dma_fence_default_wait+0x214/0x230
> [  738.358384]  ? dma_fence_release+0x1a0/0x1a0
> [  738.358396]  dma_fence_wait_timeout+0x105/0x200
> [  738.358405]  dma_resv_wait_timeout_rcu+0x1aa/0x5e0
> [  738.358421]  amdgpu_mn_invalidate_gfx+0x55/0xa0 [amdgpu]
> [  738.358688]  __mmu_notifier_release+0x1bb/0x210
> [  738.358710]  exit_mmap+0x2f/0x1e0
> [  738.358723]  ? find_held_lock+0x34/0xa0
> [  738.358746]  mmput+0x39/0xe0
> [  738.358756]  do_exit+0x5c3/0xc00
> [  738.358763]  ? find_held_lock+0x34/0xa0
> [  738.358780]  do_group_exit+0x47/0xb0
> [  738.358791]  get_signal+0x15b/0xc50
> [  738.358807]  arch_do_signal_or_restart+0xaf/0x710
> [  738.358816]  ? lockdep_hardirqs_on_prepare+0xd4/0x180
> [  738.358822]  ? _raw_spin_unlock_irqrestore+0x39/0x40
> [  738.358831]  ? ktime_get_mono_fast_ns+0x50/0xa0
> [  738.358844]  ? amdgpu_drm_ioctl+0x6b/0x80 [amdgpu]
> [  738.359044]  exit_to_user_mode_prepare+0xf2/0x1b0
> [  738.359054]  syscall_exit_to_user_mode+0x19/0x60
> [  738.359062]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  738.359069] RIP: 0033:0x7f6b89a51887
> [  738.359076] RSP: 002b:00007f6b82b54b18 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010
> [  738.359086] RAX: fffffffffffffe00 RBX: 00007f6b82b54b50 RCX: 00007f6b89a51887
> [  738.359091] RDX: 00007f6b82b54b50 RSI: 00000000c02064c3 RDI: 0000000000000007
> [  738.359096] RBP: 00000000c02064c3 R08: 0000000000000003 R09: 00007f6b82b54bbc
> [  738.359101] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000165a0bc00
> [  738.359106] R13: 0000000000000007 R14: 0000000000000001 R15: 0000000000000000
> [  738.359129]
>                 Showing all locks held in the system:
> [  738.359141] 1 lock held by khungtaskd/54:
> [  738.359148]  #0: ffffffff829f6840 (rcu_read_lock){....}-{1:2}, at:
> debug_show_all_locks+0x15/0x183
> [  738.359187] 1 lock held by systemd-journal/174:
> [  738.359202] 1 lock held by MatrixMultiplic/653:
> [  738.359206]  #0: ffff88810e364fe0
> (&adev->notifier_lock){+.+.}-{3:3}, at:
> amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
>
> Daniel



* Re: [amdgpu] deadlock
  2021-02-03  8:36 ` Christian König
@ 2021-02-03  8:48   ` Daniel Vetter
  2021-02-03  8:51     ` Christian König
  0 siblings, 1 reply; 20+ messages in thread
From: Daniel Vetter @ 2021-02-03  8:48 UTC (permalink / raw)
  To: Christian König
  Cc: Daniel Gomez, amd-gfx list, dri-devel, Alex Deucher,
	Linux Kernel Mailing List

On Wed, Feb 3, 2021 at 9:36 AM Christian König <christian.koenig@amd.com> wrote:
>
> Hi Daniel,
>
> this is not a deadlock, but rather a hardware lockup.

Are you sure? In my experience, getting stuck in dma_fence_wait has a good
chance of being a dma_fence deadlock. A GPU hang should never result in a
forever-stuck dma_fence.

Daniel, can you please re-hang your machine, dump the backtraces of all
tasks into dmesg with sysrq-t, and attach that? Without all the backtraces
it's tricky to construct the full dependency chain of what's going on.
Also, is this plain -rc6, or are there more patches on top?
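
Something like this (as root) should capture it, assuming sysrq is enabled
in your config:

# echo 1 > /proc/sys/kernel/sysrq    # enable all sysrq functions
# echo t > /proc/sysrq-trigger       # dump the state of every task to dmesg
# dmesg > backtraces.txt
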
-Daniel

> Which OpenCl stack are you using?
>
> Regards,
> Christian.
>
> Am 03.02.21 um 09:33 schrieb Daniel Gomez:
> > Hi all,
> >
> > I'm hitting a deadlock with the mainline amdgpu driver when running two
> > OpenCL applications in parallel. So far we've been able to reproduce it
> > easily by running clinfo and MatrixMultiplication (from the AMD
> > opencl-samples) at the same time. The opencl-samples are quite old, so if
> > you have any other suggestion for testing I'd be happy to try it as well.
> >
> > How to replicate the issue:
> >
> > # while true; do /usr/bin/MatrixMultiplication --device gpu \
> >      --deviceId 0 -x 1000 -y 1000 -z 1000 -q -t -i 50; done
> > # while true; do clinfo; done
> >
> > Output:
> >
> > After a minute or less (sometimes more), MatrixMultiplication and clinfo
> > hang. In addition, radeontop shows the Graphics pipe going from ~50% to
> > 100% and the shader clock rising from ~35% to ~96%.
> >
> > clinfo keeps printing:
> > ioctl(7, DRM_IOCTL_SYNCOBJ_WAIT, 0x7ffe46e5f950) = -1 ETIME (Timer expired)
> >
> > And MatrixMultiplication prints the following (under strace) if you try to
> > kill the process:
> >
> > sched_yield()                           = 0
> > futex(0x557e945343b8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0,
> > NULL, FUTEX_BITSET_MATCH_ANYstrace: Process 651 detached
> >   <detached ...>
> >
> > After this, the GPU is not functional at all and a power cycle is needed
> > to restore the system.
> >
> > Hardware info:
> > CPU: AMD Ryzen Embedded V1605B with Radeon Vega Gfx (8) @ 2.000GHz
> > GPU: AMD ATI Radeon Vega Series / Radeon Vega Mobile Series
> >
> > 03:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
> > [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series]
> > (rev 83)
> >      DeviceName: Broadcom 5762
> >      Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge
> > [Radeon Vega Series / Radeon Vega Mobile Series]
> >      Kernel driver in use: amdgpu
> >      Kernel modules: amdgpu
> >
> > Linux kernel info:
> >
> > root@qt5222:~# uname -a
> > Linux qt5222 5.11.0-rc6-qtec-standard #2 SMP Tue Feb 2 09:41:46 UTC
> > 2021 x86_64 x86_64 x86_64 GNU/Linux
> >
> > By enabling the kernel lock statistics I could see that MatrixMultiplication
> > is hung in the amdgpu_mn_invalidate_gfx function:
> >
> > [  738.359202] 1 lock held by MatrixMultiplic/653:
> > [  738.359206]  #0: ffff88810e364fe0
> > (&adev->notifier_lock){+.+.}-{3:3}, at:
> > amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
> >
> > In the amdgpu_mn_invalidate_gfx function I can see that
> > dma_resv_wait_timeout_rcu is called with wait_all (wait on all fences) and
> > MAX_SCHEDULE_TIMEOUT, so I guess the code gets stuck there waiting forever.
> > According to the documentation: "When somebody tries to invalidate the page
> > tables we block the update until all operations on the pages in question
> > are completed, then those pages are marked as accessed and also dirty if it
> > wasn't a read only access." It looks like the fences are deadlocked and the
> > wait never returns. Could that be the case? Any hint as to where I can look
> > to fix this?
> >
> > Thank you  in advance.
> >
> > Here the full dmesg output:
> >
> > [  738.337726] INFO: task MatrixMultiplic:653 blocked for more than 122 seconds.
> > [  738.344937]       Not tainted 5.11.0-rc6-qtec-standard #2
> > [  738.350384] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [  738.358240] task:MatrixMultiplic state:D stack:    0 pid:  653
> > ppid:     1 flags:0x00004000
> > [  738.358254] Call Trace:
> > [  738.358261]  ? dma_fence_default_wait+0x1eb/0x230
> > [  738.358276]  __schedule+0x370/0x960
> > [  738.358291]  ? dma_fence_default_wait+0x117/0x230
> > [  738.358297]  ? dma_fence_default_wait+0x1eb/0x230
> > [  738.358305]  schedule+0x51/0xc0
> > [  738.358312]  schedule_timeout+0x275/0x380
> > [  738.358324]  ? dma_fence_default_wait+0x1eb/0x230
> > [  738.358332]  ? mark_held_locks+0x4f/0x70
> > [  738.358341]  ? dma_fence_default_wait+0x117/0x230
> > [  738.358347]  ? lockdep_hardirqs_on_prepare+0xd4/0x180
> > [  738.358353]  ? _raw_spin_unlock_irqrestore+0x39/0x40
> > [  738.358362]  ? dma_fence_default_wait+0x117/0x230
> > [  738.358370]  ? dma_fence_default_wait+0x1eb/0x230
> > [  738.358375]  dma_fence_default_wait+0x214/0x230
> > [  738.358384]  ? dma_fence_release+0x1a0/0x1a0
> > [  738.358396]  dma_fence_wait_timeout+0x105/0x200
> > [  738.358405]  dma_resv_wait_timeout_rcu+0x1aa/0x5e0
> > [  738.358421]  amdgpu_mn_invalidate_gfx+0x55/0xa0 [amdgpu]
> > [  738.358688]  __mmu_notifier_release+0x1bb/0x210
> > [  738.358710]  exit_mmap+0x2f/0x1e0
> > [  738.358723]  ? find_held_lock+0x34/0xa0
> > [  738.358746]  mmput+0x39/0xe0
> > [  738.358756]  do_exit+0x5c3/0xc00
> > [  738.358763]  ? find_held_lock+0x34/0xa0
> > [  738.358780]  do_group_exit+0x47/0xb0
> > [  738.358791]  get_signal+0x15b/0xc50
> > [  738.358807]  arch_do_signal_or_restart+0xaf/0x710
> > [  738.358816]  ? lockdep_hardirqs_on_prepare+0xd4/0x180
> > [  738.358822]  ? _raw_spin_unlock_irqrestore+0x39/0x40
> > [  738.358831]  ? ktime_get_mono_fast_ns+0x50/0xa0
> > [  738.358844]  ? amdgpu_drm_ioctl+0x6b/0x80 [amdgpu]
> > [  738.359044]  exit_to_user_mode_prepare+0xf2/0x1b0
> > [  738.359054]  syscall_exit_to_user_mode+0x19/0x60
> > [  738.359062]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [  738.359069] RIP: 0033:0x7f6b89a51887
> > [  738.359076] RSP: 002b:00007f6b82b54b18 EFLAGS: 00000246 ORIG_RAX:
> > 0000000000000010
> > [  738.359086] RAX: fffffffffffffe00 RBX: 00007f6b82b54b50 RCX: 00007f6b89a51887
> > [  738.359091] RDX: 00007f6b82b54b50 RSI: 00000000c02064c3 RDI: 0000000000000007
> > [  738.359096] RBP: 00000000c02064c3 R08: 0000000000000003 R09: 00007f6b82b54bbc
> > [  738.359101] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000165a0bc00
> > [  738.359106] R13: 0000000000000007 R14: 0000000000000001 R15: 0000000000000000
> > [  738.359129]
> >                 Showing all locks held in the system:
> > [  738.359141] 1 lock held by khungtaskd/54:
> > [  738.359148]  #0: ffffffff829f6840 (rcu_read_lock){....}-{1:2}, at:
> > debug_show_all_locks+0x15/0x183
> > [  738.359187] 1 lock held by systemd-journal/174:
> > [  738.359202] 1 lock held by MatrixMultiplic/653:
> > [  738.359206]  #0: ffff88810e364fe0
> > (&adev->notifier_lock){+.+.}-{3:3}, at:
> > amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
> >
> > Daniel
>
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [amdgpu] deadlock
  2021-02-03  8:48   ` Daniel Vetter
@ 2021-02-03  8:51     ` Christian König
  2021-02-03  8:56       ` Daniel Gomez
  2021-02-03  9:17       ` Daniel Vetter
  0 siblings, 2 replies; 20+ messages in thread
From: Christian König @ 2021-02-03  8:51 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Daniel Gomez, amd-gfx list, dri-devel, Alex Deucher,
	Linux Kernel Mailing List

Am 03.02.21 um 09:48 schrieb Daniel Vetter:
> On Wed, Feb 3, 2021 at 9:36 AM Christian König <christian.koenig@amd.com> wrote:
>> Hi Daniel,
>>
>> this is not a deadlock, but rather a hardware lockup.
> Are you sure? In my experience, getting stuck in dma_fence_wait has a good
> chance of being a dma_fence deadlock. A GPU hang should never result in a
> forever-stuck dma_fence.

Yes, I'm pretty sure. Otherwise the hardware clocks wouldn't go up like 
this.

Question is rather why we end up in the userptr handling for GFX? Our 
ROCm OpenCL stack shouldn't use this.
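
For background: "userptr" is the path userspace hits when it wraps plain
malloc'ed memory as a GEM object instead of allocating a real BO. With
libdrm_amdgpu that's roughly the call below (sketch only; error handling
omitted, and wrap_user_mem is just an illustrative name):

#include <amdgpu.h>	/* libdrm_amdgpu */

static int wrap_user_mem(amdgpu_device_handle dev, void *cpu, uint64_t size,
			 amdgpu_bo_handle *bo)
{
	/* cpu must be page aligned; the kernel attaches an MMU notifier to
	 * this range, which is exactly the amdgpu_mn path in the trace. */
	return amdgpu_create_bo_from_user_mem(dev, cpu, size, bo);
}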

> Daniel, can you please re-hang your machine, dump the backtraces of all
> tasks into dmesg with sysrq-t, and attach that? Without all the backtraces
> it's tricky to construct the full dependency chain of what's going on.
> Also, is this plain -rc6, or are there more patches on top?

Yeah, that's still a good idea to have.

Christian.

> -Daniel
>
>> Which OpenCl stack are you using?
>>
>> Regards,
>> Christian.
>>
>> Am 03.02.21 um 09:33 schrieb Daniel Gomez:
>>> Hi all,
>>>
> >>> I'm hitting a deadlock with the mainline amdgpu driver when running two
> >>> OpenCL applications in parallel. So far we've been able to reproduce it
> >>> easily by running clinfo and MatrixMultiplication (from the AMD
> >>> opencl-samples) at the same time. The opencl-samples are quite old, so if
> >>> you have any other suggestion for testing I'd be happy to try it as well.
>>>
>>> How to replicate the issue:
>>>
>>> # while true; do /usr/bin/MatrixMultiplication --device gpu \
>>>       --deviceId 0 -x 1000 -y 1000 -z 1000 -q -t -i 50; done
>>> # while true; do clinfo; done
>>>
>>> Output:
>>>
> >>> After a minute or less (sometimes more), MatrixMultiplication and clinfo
> >>> hang. In addition, radeontop shows the Graphics pipe going from ~50% to
> >>> 100% and the shader clock rising from ~35% to ~96%.
>>>
>>> clinfo keeps printing:
>>> ioctl(7, DRM_IOCTL_SYNCOBJ_WAIT, 0x7ffe46e5f950) = -1 ETIME (Timer expired)
>>>
> >>> And MatrixMultiplication prints the following (under strace) if you try to
> >>> kill the process:
>>>
>>> sched_yield()                           = 0
>>> futex(0x557e945343b8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0,
>>> NULL, FUTEX_BITSET_MATCH_ANYstrace: Process 651 detached
>>>    <detached ...>
>>>
> >>> After this, the GPU is not functional at all and a power cycle is needed
> >>> to restore the system.
>>>
>>> Hardware info:
>>> CPU: AMD Ryzen Embedded V1605B with Radeon Vega Gfx (8) @ 2.000GHz
>>> GPU: AMD ATI Radeon Vega Series / Radeon Vega Mobile Series
>>>
>>> 03:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
>>> [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series]
>>> (rev 83)
>>>       DeviceName: Broadcom 5762
>>>       Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge
>>> [Radeon Vega Series / Radeon Vega Mobile Series]
>>>       Kernel driver in use: amdgpu
>>>       Kernel modules: amdgpu
>>>
>>> Linux kernel info:
>>>
>>> root@qt5222:~# uname -a
>>> Linux qt5222 5.11.0-rc6-qtec-standard #2 SMP Tue Feb 2 09:41:46 UTC
>>> 2021 x86_64 x86_64 x86_64 GNU/Linux
>>>
> >>> By enabling the kernel lock statistics I could see that MatrixMultiplication
> >>> is hung in the amdgpu_mn_invalidate_gfx function:
>>>
>>> [  738.359202] 1 lock held by MatrixMultiplic/653:
>>> [  738.359206]  #0: ffff88810e364fe0
>>> (&adev->notifier_lock){+.+.}-{3:3}, at:
>>> amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
>>>
> >>> In the amdgpu_mn_invalidate_gfx function I can see that
> >>> dma_resv_wait_timeout_rcu is called with wait_all (wait on all fences) and
> >>> MAX_SCHEDULE_TIMEOUT, so I guess the code gets stuck there waiting forever.
> >>> According to the documentation: "When somebody tries to invalidate the page
> >>> tables we block the update until all operations on the pages in question
> >>> are completed, then those pages are marked as accessed and also dirty if it
> >>> wasn't a read only access." It looks like the fences are deadlocked and the
> >>> wait never returns. Could that be the case? Any hint as to where I can look
> >>> to fix this?
>>>
>>> Thank you  in advance.
>>>
>>> Here the full dmesg output:
>>>
>>> [  738.337726] INFO: task MatrixMultiplic:653 blocked for more than 122 seconds.
>>> [  738.344937]       Not tainted 5.11.0-rc6-qtec-standard #2
>>> [  738.350384] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>>> disables this message.
>>> [  738.358240] task:MatrixMultiplic state:D stack:    0 pid:  653
>>> ppid:     1 flags:0x00004000
>>> [  738.358254] Call Trace:
>>> [  738.358261]  ? dma_fence_default_wait+0x1eb/0x230
>>> [  738.358276]  __schedule+0x370/0x960
>>> [  738.358291]  ? dma_fence_default_wait+0x117/0x230
>>> [  738.358297]  ? dma_fence_default_wait+0x1eb/0x230
>>> [  738.358305]  schedule+0x51/0xc0
>>> [  738.358312]  schedule_timeout+0x275/0x380
>>> [  738.358324]  ? dma_fence_default_wait+0x1eb/0x230
>>> [  738.358332]  ? mark_held_locks+0x4f/0x70
>>> [  738.358341]  ? dma_fence_default_wait+0x117/0x230
>>> [  738.358347]  ? lockdep_hardirqs_on_prepare+0xd4/0x180
>>> [  738.358353]  ? _raw_spin_unlock_irqrestore+0x39/0x40
>>> [  738.358362]  ? dma_fence_default_wait+0x117/0x230
>>> [  738.358370]  ? dma_fence_default_wait+0x1eb/0x230
>>> [  738.358375]  dma_fence_default_wait+0x214/0x230
>>> [  738.358384]  ? dma_fence_release+0x1a0/0x1a0
>>> [  738.358396]  dma_fence_wait_timeout+0x105/0x200
>>> [  738.358405]  dma_resv_wait_timeout_rcu+0x1aa/0x5e0
>>> [  738.358421]  amdgpu_mn_invalidate_gfx+0x55/0xa0 [amdgpu]
>>> [  738.358688]  __mmu_notifier_release+0x1bb/0x210
>>> [  738.358710]  exit_mmap+0x2f/0x1e0
>>> [  738.358723]  ? find_held_lock+0x34/0xa0
>>> [  738.358746]  mmput+0x39/0xe0
>>> [  738.358756]  do_exit+0x5c3/0xc00
>>> [  738.358763]  ? find_held_lock+0x34/0xa0
>>> [  738.358780]  do_group_exit+0x47/0xb0
>>> [  738.358791]  get_signal+0x15b/0xc50
>>> [  738.358807]  arch_do_signal_or_restart+0xaf/0x710
>>> [  738.358816]  ? lockdep_hardirqs_on_prepare+0xd4/0x180
>>> [  738.358822]  ? _raw_spin_unlock_irqrestore+0x39/0x40
>>> [  738.358831]  ? ktime_get_mono_fast_ns+0x50/0xa0
>>> [  738.358844]  ? amdgpu_drm_ioctl+0x6b/0x80 [amdgpu]
>>> [  738.359044]  exit_to_user_mode_prepare+0xf2/0x1b0
>>> [  738.359054]  syscall_exit_to_user_mode+0x19/0x60
>>> [  738.359062]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [  738.359069] RIP: 0033:0x7f6b89a51887
>>> [  738.359076] RSP: 002b:00007f6b82b54b18 EFLAGS: 00000246 ORIG_RAX:
>>> 0000000000000010
>>> [  738.359086] RAX: fffffffffffffe00 RBX: 00007f6b82b54b50 RCX: 00007f6b89a51887
>>> [  738.359091] RDX: 00007f6b82b54b50 RSI: 00000000c02064c3 RDI: 0000000000000007
>>> [  738.359096] RBP: 00000000c02064c3 R08: 0000000000000003 R09: 00007f6b82b54bbc
>>> [  738.359101] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000165a0bc00
>>> [  738.359106] R13: 0000000000000007 R14: 0000000000000001 R15: 0000000000000000
>>> [  738.359129]
>>>                  Showing all locks held in the system:
>>> [  738.359141] 1 lock held by khungtaskd/54:
>>> [  738.359148]  #0: ffffffff829f6840 (rcu_read_lock){....}-{1:2}, at:
>>> debug_show_all_locks+0x15/0x183
>>> [  738.359187] 1 lock held by systemd-journal/174:
>>> [  738.359202] 1 lock held by MatrixMultiplic/653:
>>> [  738.359206]  #0: ffff88810e364fe0
>>> (&adev->notifier_lock){+.+.}-{3:3}, at:
>>> amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
>>>
>>> Daniel
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel@lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>
>



* Re: [amdgpu] deadlock
  2021-02-03  8:51     ` Christian König
@ 2021-02-03  8:56       ` Daniel Gomez
  2021-02-03  9:17       ` Daniel Vetter
  1 sibling, 0 replies; 20+ messages in thread
From: Daniel Gomez @ 2021-02-03  8:56 UTC (permalink / raw)
  To: Christian König
  Cc: Daniel Vetter, amd-gfx list, dri-devel, Alex Deucher,
	Linux Kernel Mailing List

On Wed, 3 Feb 2021 at 09:51, Christian König <christian.koenig@amd.com> wrote:
>
> Am 03.02.21 um 09:48 schrieb Daniel Vetter:
> > On Wed, Feb 3, 2021 at 9:36 AM Christian König <christian.koenig@amd.com> wrote:
> >> Hi Daniel,
> >>
> >> this is not a deadlock, but rather a hardware lockup.
> > Are you sure? In my experience, getting stuck in dma_fence_wait has a good
> > chance of being a dma_fence deadlock. A GPU hang should never result in a
> > forever-stuck dma_fence.
>
> Yes, I'm pretty sure. Otherwise the hardware clocks wouldn't go up like
> this.
>
> Question is rather why we end up in the userptr handling for GFX? Our
> ROCm OpenCL stack shouldn't use this.
>
> > Daniel, can you please re-hang your machine, dump the backtraces of all
> > tasks into dmesg with sysrq-t, and attach that? Without all
Yes, I'll try it again.
> > the backtraces it's tricky to construct the full dependency chain of
> > what's going on. Also, is this plain -rc6, or are there more patches on
> > top?
It's plain -rc6 as far as amdgpu and dma fences are concerned. I have some
other patches on top, but they are v4l2 related.
>
> Yeah, that's still a good idea to have.
>
> Christian.
>
> > -Daniel
> >
> >> Which OpenCl stack are you using?
Radeon Software for Linux 20.20.
https://www.amd.com/en/support/kb/release-notes/rn-amdgpu-unified-linux-20-20

I've also noticed that the 'latest' linux-firmware adds support for the 20.45
version, but I haven't tested it yet since I couldn't bring the stack up
properly the way I did for the previous versions. Somehow libamdocl64.so got
reduced from 80 MB to 1.2 MB and I couldn't figure out what changed.

Should I use ROCm instead of Radeon for Linux?

> >>
> >> Regards,
> >> Christian.
> >>
> >> Am 03.02.21 um 09:33 schrieb Daniel Gomez:
> >>> Hi all,
> >>>
> >>> I'm hitting a deadlock with the mainline amdgpu driver when running two
> >>> OpenCL applications in parallel. So far we've been able to reproduce it
> >>> easily by running clinfo and MatrixMultiplication (from the AMD
> >>> opencl-samples) at the same time. The opencl-samples are quite old, so if
> >>> you have any other suggestion for testing I'd be happy to try it as well.
> >>>
> >>> How to replicate the issue:
> >>>
> >>> # while true; do /usr/bin/MatrixMultiplication --device gpu \
> >>>       --deviceId 0 -x 1000 -y 1000 -z 1000 -q -t -i 50; done
> >>> # while true; do clinfo; done
> >>>
> >>> Output:
> >>>
> >>> After a minute or less (sometimes more), MatrixMultiplication and clinfo
> >>> hang. In addition, radeontop shows the Graphics pipe going from ~50% to
> >>> 100% and the shader clock rising from ~35% to ~96%.
> >>>
> >>> clinfo keeps printing:
> >>> ioctl(7, DRM_IOCTL_SYNCOBJ_WAIT, 0x7ffe46e5f950) = -1 ETIME (Timer expired)
> >>>
> >>> And MatrixMultiplication prints the following (under strace) if you try to
> >>> kill the process:
> >>>
> >>> sched_yield()                           = 0
> >>> futex(0x557e945343b8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0,
> >>> NULL, FUTEX_BITSET_MATCH_ANYstrace: Process 651 detached
> >>>    <detached ...>
> >>>
> >>> After this, the GPU is not functional at all and a power cycle is needed
> >>> to restore the system.
> >>>
> >>> Hardware info:
> >>> CPU: AMD Ryzen Embedded V1605B with Radeon Vega Gfx (8) @ 2.000GHz
> >>> GPU: AMD ATI Radeon Vega Series / Radeon Vega Mobile Series
> >>>
> >>> 03:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
> >>> [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series]
> >>> (rev 83)
> >>>       DeviceName: Broadcom 5762
> >>>       Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge
> >>> [Radeon Vega Series / Radeon Vega Mobile Series]
> >>>       Kernel driver in use: amdgpu
> >>>       Kernel modules: amdgpu
> >>>
> >>> Linux kernel info:
> >>>
> >>> root@qt5222:~# uname -a
> >>> Linux qt5222 5.11.0-rc6-qtec-standard #2 SMP Tue Feb 2 09:41:46 UTC
> >>> 2021 x86_64 x86_64 x86_64 GNU/Linux
> >>>
> >>> By enabling the kernel lock statistics I could see that MatrixMultiplication
> >>> is hung in the amdgpu_mn_invalidate_gfx function:
> >>>
> >>> [  738.359202] 1 lock held by MatrixMultiplic/653:
> >>> [  738.359206]  #0: ffff88810e364fe0
> >>> (&adev->notifier_lock){+.+.}-{3:3}, at:
> >>> amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
> >>>
> >>> In the amdgpu_mn_invalidate_gfx function I can see that
> >>> dma_resv_wait_timeout_rcu is called with wait_all (wait on all fences) and
> >>> MAX_SCHEDULE_TIMEOUT, so I guess the code gets stuck there waiting forever.
> >>> According to the documentation: "When somebody tries to invalidate the page
> >>> tables we block the update until all operations on the pages in question
> >>> are completed, then those pages are marked as accessed and also dirty if it
> >>> wasn't a read only access." It looks like the fences are deadlocked and the
> >>> wait never returns. Could that be the case? Any hint as to where I can look
> >>> to fix this?
> >>>
> >>> Thank you  in advance.
> >>>
> >>> Here the full dmesg output:
> >>>
> >>> [  738.337726] INFO: task MatrixMultiplic:653 blocked for more than 122 seconds.
> >>> [  738.344937]       Not tainted 5.11.0-rc6-qtec-standard #2
> >>> [  738.350384] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> >>> disables this message.
> >>> [  738.358240] task:MatrixMultiplic state:D stack:    0 pid:  653
> >>> ppid:     1 flags:0x00004000
> >>> [  738.358254] Call Trace:
> >>> [  738.358261]  ? dma_fence_default_wait+0x1eb/0x230
> >>> [  738.358276]  __schedule+0x370/0x960
> >>> [  738.358291]  ? dma_fence_default_wait+0x117/0x230
> >>> [  738.358297]  ? dma_fence_default_wait+0x1eb/0x230
> >>> [  738.358305]  schedule+0x51/0xc0
> >>> [  738.358312]  schedule_timeout+0x275/0x380
> >>> [  738.358324]  ? dma_fence_default_wait+0x1eb/0x230
> >>> [  738.358332]  ? mark_held_locks+0x4f/0x70
> >>> [  738.358341]  ? dma_fence_default_wait+0x117/0x230
> >>> [  738.358347]  ? lockdep_hardirqs_on_prepare+0xd4/0x180
> >>> [  738.358353]  ? _raw_spin_unlock_irqrestore+0x39/0x40
> >>> [  738.358362]  ? dma_fence_default_wait+0x117/0x230
> >>> [  738.358370]  ? dma_fence_default_wait+0x1eb/0x230
> >>> [  738.358375]  dma_fence_default_wait+0x214/0x230
> >>> [  738.358384]  ? dma_fence_release+0x1a0/0x1a0
> >>> [  738.358396]  dma_fence_wait_timeout+0x105/0x200
> >>> [  738.358405]  dma_resv_wait_timeout_rcu+0x1aa/0x5e0
> >>> [  738.358421]  amdgpu_mn_invalidate_gfx+0x55/0xa0 [amdgpu]
> >>> [  738.358688]  __mmu_notifier_release+0x1bb/0x210
> >>> [  738.358710]  exit_mmap+0x2f/0x1e0
> >>> [  738.358723]  ? find_held_lock+0x34/0xa0
> >>> [  738.358746]  mmput+0x39/0xe0
> >>> [  738.358756]  do_exit+0x5c3/0xc00
> >>> [  738.358763]  ? find_held_lock+0x34/0xa0
> >>> [  738.358780]  do_group_exit+0x47/0xb0
> >>> [  738.358791]  get_signal+0x15b/0xc50
> >>> [  738.358807]  arch_do_signal_or_restart+0xaf/0x710
> >>> [  738.358816]  ? lockdep_hardirqs_on_prepare+0xd4/0x180
> >>> [  738.358822]  ? _raw_spin_unlock_irqrestore+0x39/0x40
> >>> [  738.358831]  ? ktime_get_mono_fast_ns+0x50/0xa0
> >>> [  738.358844]  ? amdgpu_drm_ioctl+0x6b/0x80 [amdgpu]
> >>> [  738.359044]  exit_to_user_mode_prepare+0xf2/0x1b0
> >>> [  738.359054]  syscall_exit_to_user_mode+0x19/0x60
> >>> [  738.359062]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >>> [  738.359069] RIP: 0033:0x7f6b89a51887
> >>> [  738.359076] RSP: 002b:00007f6b82b54b18 EFLAGS: 00000246 ORIG_RAX:
> >>> 0000000000000010
> >>> [  738.359086] RAX: fffffffffffffe00 RBX: 00007f6b82b54b50 RCX: 00007f6b89a51887
> >>> [  738.359091] RDX: 00007f6b82b54b50 RSI: 00000000c02064c3 RDI: 0000000000000007
> >>> [  738.359096] RBP: 00000000c02064c3 R08: 0000000000000003 R09: 00007f6b82b54bbc
> >>> [  738.359101] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000165a0bc00
> >>> [  738.359106] R13: 0000000000000007 R14: 0000000000000001 R15: 0000000000000000
> >>> [  738.359129]
> >>>                  Showing all locks held in the system:
> >>> [  738.359141] 1 lock held by khungtaskd/54:
> >>> [  738.359148]  #0: ffffffff829f6840 (rcu_read_lock){....}-{1:2}, at:
> >>> debug_show_all_locks+0x15/0x183
> >>> [  738.359187] 1 lock held by systemd-journal/174:
> >>> [  738.359202] 1 lock held by MatrixMultiplic/653:
> >>> [  738.359206]  #0: ffff88810e364fe0
> >>> (&adev->notifier_lock){+.+.}-{3:3}, at:
> >>> amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
> >>>
> >>> Daniel
> >> _______________________________________________
> >> dri-devel mailing list
> >> dri-devel@lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> >
> >
>


* Re: [amdgpu] deadlock
  2021-02-03  8:51     ` Christian König
  2021-02-03  8:56       ` Daniel Gomez
@ 2021-02-03  9:17       ` Daniel Vetter
  2021-02-03  9:47         ` Daniel Gomez
  1 sibling, 1 reply; 20+ messages in thread
From: Daniel Vetter @ 2021-02-03  9:17 UTC (permalink / raw)
  To: Christian König
  Cc: Daniel Gomez, amd-gfx list, dri-devel, Alex Deucher,
	Linux Kernel Mailing List

On Wed, Feb 3, 2021 at 9:51 AM Christian König <christian.koenig@amd.com> wrote:
>
> Am 03.02.21 um 09:48 schrieb Daniel Vetter:
> > On Wed, Feb 3, 2021 at 9:36 AM Christian König <christian.koenig@amd.com> wrote:
> >> Hi Daniel,
> >>
> >> this is not a deadlock, but rather a hardware lockup.
> > > Are you sure? In my experience, getting stuck in dma_fence_wait has a good
> > > chance of being a dma_fence deadlock. A GPU hang should never result in a
> > > forever-stuck dma_fence.
>
> Yes, I'm pretty sure. Otherwise the hardware clocks wouldn't go up like
> this.

Maybe to clarify: it could be both. TDR should notice and get us out of
this, but if there's a dma_fence deadlock and we can't re-emit or
force-complete the pending things, then we're stuck for good.
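
As a rough sketch of the mechanism I mean, paraphrasing the common drm_sched
design rather than the exact amdgpu code (the my_* names are placeholders and
the signatures are simplified):

static void my_timedout_job(struct drm_sched_job *job)
{
	/* TDR: reset the GPU, then either re-emit the pending jobs or
	 * force-signal their fences. If neither happens, every
	 * dma_fence_wait() on those fences -- like the one in
	 * amdgpu_mn_invalidate_gfx -- is stuck for good. */
	my_gpu_reset_and_recover(job);
}

/* Each ring registers scheduler callbacks; TDR is the timedout_job hook. */
static const struct drm_sched_backend_ops my_sched_ops = {
	.run_job      = my_run_job,      /* placeholder: push job to the HW ring */
	.timedout_job = my_timedout_job, /* fires when a job blows its timeout */
	.free_job     = my_free_job,     /* placeholder: clean up the job */
};
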
-Daniel

> Question is rather why we end up in the userptr handling for GFX? Our
> ROCm OpenCL stack shouldn't use this.
>
> > Daniel, can you please re-hang your machine, dump the backtraces of all
> > tasks into dmesg with sysrq-t, and attach that? Without all the backtraces
> > it's tricky to construct the full dependency chain of what's going on.
> > Also, is this plain -rc6, or are there more patches on top?
>
> Yeah, that's still a good idea to have.
>
> Christian.
>
> > -Daniel
> >
> >> Which OpenCl stack are you using?
> >>
> >> Regards,
> >> Christian.
> >>
> >> Am 03.02.21 um 09:33 schrieb Daniel Gomez:
> >>> Hi all,
> >>>
> >>> I'm hitting a deadlock with the mainline amdgpu driver when running two
> >>> OpenCL applications in parallel. So far we've been able to reproduce it
> >>> easily by running clinfo and MatrixMultiplication (from the AMD
> >>> opencl-samples) at the same time. The opencl-samples are quite old, so if
> >>> you have any other suggestion for testing I'd be happy to try it as well.
> >>>
> >>> How to replicate the issue:
> >>>
> >>> # while true; do /usr/bin/MatrixMultiplication --device gpu \
> >>>       --deviceId 0 -x 1000 -y 1000 -z 1000 -q -t -i 50; done
> >>> # while true; do clinfo; done
> >>>
> >>> Output:
> >>>
> >>> After a minute or less (sometimes more), MatrixMultiplication and clinfo
> >>> hang. In addition, radeontop shows the Graphics pipe going from ~50% to
> >>> 100% and the shader clock rising from ~35% to ~96%.
> >>>
> >>> clinfo keeps printing:
> >>> ioctl(7, DRM_IOCTL_SYNCOBJ_WAIT, 0x7ffe46e5f950) = -1 ETIME (Timer expired)
> >>>
> >>> And MatrixMultiplication prints the following (under strace) if you try to
> >>> kill the process:
> >>>
> >>> sched_yield()                           = 0
> >>> futex(0x557e945343b8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0,
> >>> NULL, FUTEX_BITSET_MATCH_ANYstrace: Process 651 detached
> >>>    <detached ...>
> >>>
> >>> After this, the GPU is not functional at all and a power cycle is needed
> >>> to restore the system.
> >>>
> >>> Hardware info:
> >>> CPU: AMD Ryzen Embedded V1605B with Radeon Vega Gfx (8) @ 2.000GHz
> >>> GPU: AMD ATI Radeon Vega Series / Radeon Vega Mobile Series
> >>>
> >>> 03:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
> >>> [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series]
> >>> (rev 83)
> >>>       DeviceName: Broadcom 5762
> >>>       Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge
> >>> [Radeon Vega Series / Radeon Vega Mobile Series]
> >>>       Kernel driver in use: amdgpu
> >>>       Kernel modules: amdgpu
> >>>
> >>> Linux kernel info:
> >>>
> >>> root@qt5222:~# uname -a
> >>> Linux qt5222 5.11.0-rc6-qtec-standard #2 SMP Tue Feb 2 09:41:46 UTC
> >>> 2021 x86_64 x86_64 x86_64 GNU/Linux
> >>>
> >>> By enabling the kernel lock statistics I could see that MatrixMultiplication
> >>> is hung in the amdgpu_mn_invalidate_gfx function:
> >>>
> >>> [  738.359202] 1 lock held by MatrixMultiplic/653:
> >>> [  738.359206]  #0: ffff88810e364fe0
> >>> (&adev->notifier_lock){+.+.}-{3:3}, at:
> >>> amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
> >>>
> >>> In the amdgpu_mn_invalidate_gfx function I can see that
> >>> dma_resv_wait_timeout_rcu is called with wait_all (wait on all fences) and
> >>> MAX_SCHEDULE_TIMEOUT, so I guess the code gets stuck there waiting forever.
> >>> According to the documentation: "When somebody tries to invalidate the page
> >>> tables we block the update until all operations on the pages in question
> >>> are completed, then those pages are marked as accessed and also dirty if it
> >>> wasn't a read only access." It looks like the fences are deadlocked and the
> >>> wait never returns. Could that be the case? Any hint as to where I can look
> >>> to fix this?
> >>>
> >>> Thank you  in advance.
> >>>
> >>> Here the full dmesg output:
> >>>
> >>> [  738.337726] INFO: task MatrixMultiplic:653 blocked for more than 122 seconds.
> >>> [  738.344937]       Not tainted 5.11.0-rc6-qtec-standard #2
> >>> [  738.350384] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> >>> disables this message.
> >>> [  738.358240] task:MatrixMultiplic state:D stack:    0 pid:  653
> >>> ppid:     1 flags:0x00004000
> >>> [  738.358254] Call Trace:
> >>> [  738.358261]  ? dma_fence_default_wait+0x1eb/0x230
> >>> [  738.358276]  __schedule+0x370/0x960
> >>> [  738.358291]  ? dma_fence_default_wait+0x117/0x230
> >>> [  738.358297]  ? dma_fence_default_wait+0x1eb/0x230
> >>> [  738.358305]  schedule+0x51/0xc0
> >>> [  738.358312]  schedule_timeout+0x275/0x380
> >>> [  738.358324]  ? dma_fence_default_wait+0x1eb/0x230
> >>> [  738.358332]  ? mark_held_locks+0x4f/0x70
> >>> [  738.358341]  ? dma_fence_default_wait+0x117/0x230
> >>> [  738.358347]  ? lockdep_hardirqs_on_prepare+0xd4/0x180
> >>> [  738.358353]  ? _raw_spin_unlock_irqrestore+0x39/0x40
> >>> [  738.358362]  ? dma_fence_default_wait+0x117/0x230
> >>> [  738.358370]  ? dma_fence_default_wait+0x1eb/0x230
> >>> [  738.358375]  dma_fence_default_wait+0x214/0x230
> >>> [  738.358384]  ? dma_fence_release+0x1a0/0x1a0
> >>> [  738.358396]  dma_fence_wait_timeout+0x105/0x200
> >>> [  738.358405]  dma_resv_wait_timeout_rcu+0x1aa/0x5e0
> >>> [  738.358421]  amdgpu_mn_invalidate_gfx+0x55/0xa0 [amdgpu]
> >>> [  738.358688]  __mmu_notifier_release+0x1bb/0x210
> >>> [  738.358710]  exit_mmap+0x2f/0x1e0
> >>> [  738.358723]  ? find_held_lock+0x34/0xa0
> >>> [  738.358746]  mmput+0x39/0xe0
> >>> [  738.358756]  do_exit+0x5c3/0xc00
> >>> [  738.358763]  ? find_held_lock+0x34/0xa0
> >>> [  738.358780]  do_group_exit+0x47/0xb0
> >>> [  738.358791]  get_signal+0x15b/0xc50
> >>> [  738.358807]  arch_do_signal_or_restart+0xaf/0x710
> >>> [  738.358816]  ? lockdep_hardirqs_on_prepare+0xd4/0x180
> >>> [  738.358822]  ? _raw_spin_unlock_irqrestore+0x39/0x40
> >>> [  738.358831]  ? ktime_get_mono_fast_ns+0x50/0xa0
> >>> [  738.358844]  ? amdgpu_drm_ioctl+0x6b/0x80 [amdgpu]
> >>> [  738.359044]  exit_to_user_mode_prepare+0xf2/0x1b0
> >>> [  738.359054]  syscall_exit_to_user_mode+0x19/0x60
> >>> [  738.359062]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >>> [  738.359069] RIP: 0033:0x7f6b89a51887
> >>> [  738.359076] RSP: 002b:00007f6b82b54b18 EFLAGS: 00000246 ORIG_RAX:
> >>> 0000000000000010
> >>> [  738.359086] RAX: fffffffffffffe00 RBX: 00007f6b82b54b50 RCX: 00007f6b89a51887
> >>> [  738.359091] RDX: 00007f6b82b54b50 RSI: 00000000c02064c3 RDI: 0000000000000007
> >>> [  738.359096] RBP: 00000000c02064c3 R08: 0000000000000003 R09: 00007f6b82b54bbc
> >>> [  738.359101] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000165a0bc00
> >>> [  738.359106] R13: 0000000000000007 R14: 0000000000000001 R15: 0000000000000000
> >>> [  738.359129]
> >>>                  Showing all locks held in the system:
> >>> [  738.359141] 1 lock held by khungtaskd/54:
> >>> [  738.359148]  #0: ffffffff829f6840 (rcu_read_lock){....}-{1:2}, at:
> >>> debug_show_all_locks+0x15/0x183
> >>> [  738.359187] 1 lock held by systemd-journal/174:
> >>> [  738.359202] 1 lock held by MatrixMultiplic/653:
> >>> [  738.359206]  #0: ffff88810e364fe0
> >>> (&adev->notifier_lock){+.+.}-{3:3}, at:
> >>> amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
> >>>
> >>> Daniel
> >> _______________________________________________
> >> dri-devel mailing list
> >> dri-devel@lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> >
> >
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [amdgpu] deadlock
  2021-02-03  9:17       ` Daniel Vetter
@ 2021-02-03  9:47         ` Daniel Gomez
  2021-02-03 11:45           ` Daniel Gomez
  0 siblings, 1 reply; 20+ messages in thread
From: Daniel Gomez @ 2021-02-03  9:47 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christian König, amd-gfx list, dri-devel, Alex Deucher,
	Linux Kernel Mailing List

On Wed, 3 Feb 2021 at 10:17, Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Wed, Feb 3, 2021 at 9:51 AM Christian König <christian.koenig@amd.com> wrote:
> >
> > Am 03.02.21 um 09:48 schrieb Daniel Vetter:
> > > On Wed, Feb 3, 2021 at 9:36 AM Christian König <christian.koenig@amd.com> wrote:
> > >> Hi Daniel,
> > >>
> > >> this is not a deadlock, but rather a hardware lockup.
> > > Are you sure? In my experience, getting stuck in dma_fence_wait has a good
> > > chance of being a dma_fence deadlock. A GPU hang should never result in a
> > > forever-stuck dma_fence.
> >
> > Yes, I'm pretty sure. Otherwise the hardware clocks wouldn't go up like
> > this.
>
> Maybe to clarify: it could be both. TDR should notice and get us out of
> this, but if there's a dma_fence deadlock and we can't re-emit or
> force-complete the pending things, then we're stuck for good.
> -Daniel
>
> > Question is rather why we end up in the userptr handling for GFX? Our
> > ROCm OpenCL stack shouldn't use this.
> >
> > > Daniel, can you please re-hang your machine, dump the backtraces of all
> > > tasks into dmesg with sysrq-t, and attach that? Without all the backtraces
> > > it's tricky to construct the full dependency chain of what's going on.
> > > Also, is this plain -rc6, or are there more patches on top?
> >
> > Yeah, that's still a good idea to have.

Here are the full backtrace dmesg logs after the hang:
https://pastebin.com/raw/kzivm2L3

This is another dmesg log with the backtraces after SIGKILLing the matrix
process (I didn't have sysrq enabled at the time):
https://pastebin.com/raw/pRBwGcj1

> >
> > Christian.
> >
> > > -Daniel
> > >
> > >> Which OpenCl stack are you using?
> > >>
> > >> Regards,
> > >> Christian.
> > >>
> > >> Am 03.02.21 um 09:33 schrieb Daniel Gomez:
> > >>> Hi all,
> > >>>
> > >>> I'm hitting a deadlock with the mainline amdgpu driver when running two
> > >>> OpenCL applications in parallel. So far we've been able to reproduce it
> > >>> easily by running clinfo and MatrixMultiplication (from the AMD
> > >>> opencl-samples) at the same time. The opencl-samples are quite old, so if
> > >>> you have any other suggestion for testing I'd be happy to try it as well.
> > >>>
> > >>> How to replicate the issue:
> > >>>
> > >>> # while true; do /usr/bin/MatrixMultiplication --device gpu \
> > >>>       --deviceId 0 -x 1000 -y 1000 -z 1000 -q -t -i 50; done
> > >>> # while true; do clinfo; done
> > >>>
> > >>> Output:
> > >>>
> > >>> After a minute or less (sometimes more), MatrixMultiplication and clinfo
> > >>> hang. In addition, radeontop shows the Graphics pipe going from ~50% to
> > >>> 100% and the shader clock rising from ~35% to ~96%.
> > >>>
> > >>> clinfo keeps printing:
> > >>> ioctl(7, DRM_IOCTL_SYNCOBJ_WAIT, 0x7ffe46e5f950) = -1 ETIME (Timer expired)
> > >>>
> > >>> And MatrixMultiplication prints the following (under strace) if you try to
> > >>> kill the process:
> > >>>
> > >>> sched_yield()                           = 0
> > >>> futex(0x557e945343b8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0,
> > >>> NULL, FUTEX_BITSET_MATCH_ANYstrace: Process 651 detached
> > >>>    <detached ...>
> > >>>
> > >>> After this, the GPU is not functional at all and a power cycle is needed
> > >>> to restore the system.
> > >>>
> > >>> Hardware info:
> > >>> CPU: AMD Ryzen Embedded V1605B with Radeon Vega Gfx (8) @ 2.000GHz
> > >>> GPU: AMD ATI Radeon Vega Series / Radeon Vega Mobile Series
> > >>>
> > >>> 03:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
> > >>> [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series]
> > >>> (rev 83)
> > >>>       DeviceName: Broadcom 5762
> > >>>       Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge
> > >>> [Radeon Vega Series / Radeon Vega Mobile Series]
> > >>>       Kernel driver in use: amdgpu
> > >>>       Kernel modules: amdgpu
> > >>>
> > >>> Linux kernel info:
> > >>>
> > >>> root@qt5222:~# uname -a
> > >>> Linux qt5222 5.11.0-rc6-qtec-standard #2 SMP Tue Feb 2 09:41:46 UTC
> > >>> 2021 x86_64 x86_64 x86_64 GNU/Linux
> > >>>
> > >>> By enabling the kernel lock statistics I could see that MatrixMultiplication
> > >>> is hung in the amdgpu_mn_invalidate_gfx function:
> > >>>
> > >>> [  738.359202] 1 lock held by MatrixMultiplic/653:
> > >>> [  738.359206]  #0: ffff88810e364fe0
> > >>> (&adev->notifier_lock){+.+.}-{3:3}, at:
> > >>> amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
> > >>>
> > >>> In the amdgpu_mn_invalidate_gfx function I can see that
> > >>> dma_resv_wait_timeout_rcu is called with wait_all (wait on all fences) and
> > >>> MAX_SCHEDULE_TIMEOUT, so I guess the code gets stuck there waiting forever.
> > >>> According to the documentation: "When somebody tries to invalidate the page
> > >>> tables we block the update until all operations on the pages in question
> > >>> are completed, then those pages are marked as accessed and also dirty if it
> > >>> wasn't a read only access." It looks like the fences are deadlocked and the
> > >>> wait never returns. Could that be the case? Any hint as to where I can look
> > >>> to fix this?
> > >>>
> > >>> Thank you  in advance.
> > >>>
> > >>> Here the full dmesg output:
> > >>>
> > >>> [  738.337726] INFO: task MatrixMultiplic:653 blocked for more than 122 seconds.
> > >>> [  738.344937]       Not tainted 5.11.0-rc6-qtec-standard #2
> > >>> [  738.350384] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > >>> disables this message.
> > >>> [  738.358240] task:MatrixMultiplic state:D stack:    0 pid:  653
> > >>> ppid:     1 flags:0x00004000
> > >>> [  738.358254] Call Trace:
> > >>> [  738.358261]  ? dma_fence_default_wait+0x1eb/0x230
> > >>> [  738.358276]  __schedule+0x370/0x960
> > >>> [  738.358291]  ? dma_fence_default_wait+0x117/0x230
> > >>> [  738.358297]  ? dma_fence_default_wait+0x1eb/0x230
> > >>> [  738.358305]  schedule+0x51/0xc0
> > >>> [  738.358312]  schedule_timeout+0x275/0x380
> > >>> [  738.358324]  ? dma_fence_default_wait+0x1eb/0x230
> > >>> [  738.358332]  ? mark_held_locks+0x4f/0x70
> > >>> [  738.358341]  ? dma_fence_default_wait+0x117/0x230
> > >>> [  738.358347]  ? lockdep_hardirqs_on_prepare+0xd4/0x180
> > >>> [  738.358353]  ? _raw_spin_unlock_irqrestore+0x39/0x40
> > >>> [  738.358362]  ? dma_fence_default_wait+0x117/0x230
> > >>> [  738.358370]  ? dma_fence_default_wait+0x1eb/0x230
> > >>> [  738.358375]  dma_fence_default_wait+0x214/0x230
> > >>> [  738.358384]  ? dma_fence_release+0x1a0/0x1a0
> > >>> [  738.358396]  dma_fence_wait_timeout+0x105/0x200
> > >>> [  738.358405]  dma_resv_wait_timeout_rcu+0x1aa/0x5e0
> > >>> [  738.358421]  amdgpu_mn_invalidate_gfx+0x55/0xa0 [amdgpu]
> > >>> [  738.358688]  __mmu_notifier_release+0x1bb/0x210
> > >>> [  738.358710]  exit_mmap+0x2f/0x1e0
> > >>> [  738.358723]  ? find_held_lock+0x34/0xa0
> > >>> [  738.358746]  mmput+0x39/0xe0
> > >>> [  738.358756]  do_exit+0x5c3/0xc00
> > >>> [  738.358763]  ? find_held_lock+0x34/0xa0
> > >>> [  738.358780]  do_group_exit+0x47/0xb0
> > >>> [  738.358791]  get_signal+0x15b/0xc50
> > >>> [  738.358807]  arch_do_signal_or_restart+0xaf/0x710
> > >>> [  738.358816]  ? lockdep_hardirqs_on_prepare+0xd4/0x180
> > >>> [  738.358822]  ? _raw_spin_unlock_irqrestore+0x39/0x40
> > >>> [  738.358831]  ? ktime_get_mono_fast_ns+0x50/0xa0
> > >>> [  738.358844]  ? amdgpu_drm_ioctl+0x6b/0x80 [amdgpu]
> > >>> [  738.359044]  exit_to_user_mode_prepare+0xf2/0x1b0
> > >>> [  738.359054]  syscall_exit_to_user_mode+0x19/0x60
> > >>> [  738.359062]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > >>> [  738.359069] RIP: 0033:0x7f6b89a51887
> > >>> [  738.359076] RSP: 002b:00007f6b82b54b18 EFLAGS: 00000246 ORIG_RAX:
> > >>> 0000000000000010
> > >>> [  738.359086] RAX: fffffffffffffe00 RBX: 00007f6b82b54b50 RCX: 00007f6b89a51887
> > >>> [  738.359091] RDX: 00007f6b82b54b50 RSI: 00000000c02064c3 RDI: 0000000000000007
> > >>> [  738.359096] RBP: 00000000c02064c3 R08: 0000000000000003 R09: 00007f6b82b54bbc
> > >>> [  738.359101] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000165a0bc00
> > >>> [  738.359106] R13: 0000000000000007 R14: 0000000000000001 R15: 0000000000000000
> > >>> [  738.359129]
> > >>>                  Showing all locks held in the system:
> > >>> [  738.359141] 1 lock held by khungtaskd/54:
> > >>> [  738.359148]  #0: ffffffff829f6840 (rcu_read_lock){....}-{1:2}, at:
> > >>> debug_show_all_locks+0x15/0x183
> > >>> [  738.359187] 1 lock held by systemd-journal/174:
> > >>> [  738.359202] 1 lock held by MatrixMultiplic/653:
> > >>> [  738.359206]  #0: ffff88810e364fe0
> > >>> (&adev->notifier_lock){+.+.}-{3:3}, at:
> > >>> amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
> > >>>
> > >>> Daniel
> > >> _______________________________________________
> > >> dri-devel mailing list
> > >> dri-devel@lists.freedesktop.org
> > >> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > >
> > >
> >
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [amdgpu] deadlock
  2021-02-03  9:47         ` Daniel Gomez
@ 2021-02-03 11:45           ` Daniel Gomez
  2021-02-03 12:21             ` Christian König
  0 siblings, 1 reply; 20+ messages in thread
From: Daniel Gomez @ 2021-02-03 11:45 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christian König, amd-gfx list, dri-devel, Alex Deucher,
	Linux Kernel Mailing List

On Wed, 3 Feb 2021 at 10:47, Daniel Gomez <daniel@qtec.com> wrote:
>
> On Wed, 3 Feb 2021 at 10:17, Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Wed, Feb 3, 2021 at 9:51 AM Christian König <christian.koenig@amd.com> wrote:
> > >
> > > Am 03.02.21 um 09:48 schrieb Daniel Vetter:
> > > > On Wed, Feb 3, 2021 at 9:36 AM Christian König <christian.koenig@amd.com> wrote:
> > > >> Hi Daniel,
> > > >>
> > > >> this is not a deadlock, but rather a hardware lockup.
> > > > Are you sure? Ime getting stuck in dma_fence_wait has generally good
> > > > chance of being a dma_fence deadlock. GPU hang should never result in
> > > > a forever stuck dma_fence.
> > >
> > > Yes, I'm pretty sure. Otherwise the hardware clocks wouldn't go up like
> > > this.
> >
> > Maybe clarifying, could be both. TDR should notice and get us out of
> > this, but if there's a dma_fence deadlock and we can't re-emit or
> > force complete the pending things, then we're stuck for good.
> > -Daniel
> >
> > > Question is rather why we end up in the userptr handling for GFX? Our
> > > ROCm OpenCL stack shouldn't use this.
> > >
> > > > Daniel, can you pls re-hang your machine and then dump backtraces of
> > > > all tasks into dmesg with sysrq-t, and then attach that? Without all
> > > > the backtraces it's tricky to construct the full dependency chain of
> > > > what's going on. Also is this plain -rc6, not some more patches on
> > > > top?
> > >
> > > Yeah, that's still a good idea to have.
>
> Here the full backtrace dmesg logs after the hang:
> https://pastebin.com/raw/kzivm2L3
>
> This is another dmesg log with the backtraces after SIGKILL the matrix process:
> (I didn't have the sysrq enable at the time):
> https://pastebin.com/raw/pRBwGcj1

I've now removed all our v4l2 patches and did the same test with the 'plain'
mainline version (-rc6).

Reference: 3aaf0a27ffc29b19a62314edd684b9bc6346f9a8

Same error, same behaviour. Full dmesg log attached:
https://pastebin.com/raw/KgaEf7Y1
Note:
  dmesg with sysrq-t before running the test starts at: [  122.016502] sysrq: Show State
  dmesg with sysrq-t after the test starts at: [  495.587671] sysrq: Show State
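
For reference, the sysrq task dump was triggered the standard way via procfs:

# echo 1 > /proc/sys/kernel/sysrq    # enable all sysrq functions
# echo t > /proc/sysrq-trigger       # dump all task backtraces to dmesg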


>
> > >
> > > Christian.
> > >
> > > > -Daniel
> > > >
> > > >> Which OpenCl stack are you using?
> > > >>
> > > >> Regards,
> > > >> Christian.
> > > >>
> > > >> Am 03.02.21 um 09:33 schrieb Daniel Gomez:
> > > >>> Hi all,
> > > >>>
> > > >>> I have a deadlock with the amdgpu mainline driver when running in parallel two
> > > >>> OpenCL applications. So far, we've been able to replicate it easily by executing
> > > >>> clinfo and MatrixMultiplication (from AMD opencl-samples). It's quite old the
> > > >>> opencl-samples so, if you have any other suggestion for testing I'd be very
> > > >>> happy to test it as well.
> > > >>>
> > > >>> How to replicate the issue:
> > > >>>
> > > >>> # while true; do /usr/bin/MatrixMultiplication --device gpu \
> > > >>>       --deviceId 0 -x 1000 -y 1000 -z 1000 -q -t -i 50; done
> > > >>> # while true; do clinfo; done
> > > >>>
> > > >>> Output:
> > > >>>
> > > >>> After a minute or less (sometimes could be more) I can see that
> > > >>> MatrixMultiplication and clinfo hang. In addition, with radeontop you can see
> > > >>> how the Graphics pipe goes from ~50% to 100%. Also the shader clocks
> > > >>> goes up from ~35% to ~96%.
> > > >>>
> > > >>> clinfo keeps printing:
> > > >>> ioctl(7, DRM_IOCTL_SYNCOBJ_WAIT, 0x7ffe46e5f950) = -1 ETIME (Timer expired)
> > > >>>
> > > >>> And MatrixMultiplication prints the following (strace) if you try to
> > > >>> kill the process:
> > > >>>
> > > >>> sched_yield()                           = 0
> > > >>> futex(0x557e945343b8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0,
> > > >>> NULL, FUTEX_BITSET_MATCH_ANYstrace: Process 651 detached
> > > >>>    <detached ...>
> > > >>>
> > > >>> After this, the gpu is not functional at all and you'd need a power cycle reset
> > > >>> to restore the system.
> > > >>>
> > > >>> Hardware info:
> > > >>> CPU: AMD Ryzen Embedded V1605B with Radeon Vega Gfx (8) @ 2.000GHz
> > > >>> GPU: AMD ATI Radeon Vega Series / Radeon Vega Mobile Series
> > > >>>
> > > >>> 03:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
> > > >>> [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series]
> > > >>> (rev 83)
> > > >>>       DeviceName: Broadcom 5762
> > > >>>       Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge
> > > >>> [Radeon Vega Series / Radeon Vega Mobile Series]
> > > >>>       Kernel driver in use: amdgpu
> > > >>>       Kernel modules: amdgpu
> > > >>>
> > > >>> Linux kernel info:
> > > >>>
> > > >>> root@qt5222:~# uname -a
> > > >>> Linux qt5222 5.11.0-rc6-qtec-standard #2 SMP Tue Feb 2 09:41:46 UTC
> > > >>> 2021 x86_64 x86_64 x86_64 GNU/Linux
> > > >>>
> > > >>> By enabling the kernel locks stats I could see the MatrixMultiplication is
> > > >>> hanged in the amdgpu_mn_invalidate_gfx function:
> > > >>>
> > > >>> [  738.359202] 1 lock held by MatrixMultiplic/653:
> > > >>> [  738.359206]  #0: ffff88810e364fe0
> > > >>> (&adev->notifier_lock){+.+.}-{3:3}, at:
> > > >>> amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
> > > >>>
> > > >>> I can see in the the amdgpu_mn_invalidate_gfx function: the
> > > >>> dma_resv_wait_timeout_rcu uses wait_all (fences) and MAX_SCHEDULE_TIMEOUT so, I
> > > >>> guess the code gets stuck there waiting forever. According to the
> > > >>> documentation: "When somebody tries to invalidate the page tables we block the
> > > >>> update until all operations on the pages in question are completed, then those
> > > >>> pages are marked  as accessed and also dirty if it wasn’t a read only access."
> > > >>> Looks like the fences are deadlocked and therefore, it never returns. Could it
> > > >>> be possible? any hint to where can I look to fix this?
> > > >>>
> > > >>> Thank you  in advance.
> > > >>>
> > > >>> Here the full dmesg output:
> > > >>>
> > > >>> [  738.337726] INFO: task MatrixMultiplic:653 blocked for more than 122 seconds.
> > > >>> [  738.344937]       Not tainted 5.11.0-rc6-qtec-standard #2
> > > >>> [  738.350384] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > >>> disables this message.
> > > >>> [  738.358240] task:MatrixMultiplic state:D stack:    0 pid:  653
> > > >>> ppid:     1 flags:0x00004000
> > > >>> [  738.358254] Call Trace:
> > > >>> [  738.358261]  ? dma_fence_default_wait+0x1eb/0x230
> > > >>> [  738.358276]  __schedule+0x370/0x960
> > > >>> [  738.358291]  ? dma_fence_default_wait+0x117/0x230
> > > >>> [  738.358297]  ? dma_fence_default_wait+0x1eb/0x230
> > > >>> [  738.358305]  schedule+0x51/0xc0
> > > >>> [  738.358312]  schedule_timeout+0x275/0x380
> > > >>> [  738.358324]  ? dma_fence_default_wait+0x1eb/0x230
> > > >>> [  738.358332]  ? mark_held_locks+0x4f/0x70
> > > >>> [  738.358341]  ? dma_fence_default_wait+0x117/0x230
> > > >>> [  738.358347]  ? lockdep_hardirqs_on_prepare+0xd4/0x180
> > > >>> [  738.358353]  ? _raw_spin_unlock_irqrestore+0x39/0x40
> > > >>> [  738.358362]  ? dma_fence_default_wait+0x117/0x230
> > > >>> [  738.358370]  ? dma_fence_default_wait+0x1eb/0x230
> > > >>> [  738.358375]  dma_fence_default_wait+0x214/0x230
> > > >>> [  738.358384]  ? dma_fence_release+0x1a0/0x1a0
> > > >>> [  738.358396]  dma_fence_wait_timeout+0x105/0x200
> > > >>> [  738.358405]  dma_resv_wait_timeout_rcu+0x1aa/0x5e0
> > > >>> [  738.358421]  amdgpu_mn_invalidate_gfx+0x55/0xa0 [amdgpu]
> > > >>> [  738.358688]  __mmu_notifier_release+0x1bb/0x210
> > > >>> [  738.358710]  exit_mmap+0x2f/0x1e0
> > > >>> [  738.358723]  ? find_held_lock+0x34/0xa0
> > > >>> [  738.358746]  mmput+0x39/0xe0
> > > >>> [  738.358756]  do_exit+0x5c3/0xc00
> > > >>> [  738.358763]  ? find_held_lock+0x34/0xa0
> > > >>> [  738.358780]  do_group_exit+0x47/0xb0
> > > >>> [  738.358791]  get_signal+0x15b/0xc50
> > > >>> [  738.358807]  arch_do_signal_or_restart+0xaf/0x710
> > > >>> [  738.358816]  ? lockdep_hardirqs_on_prepare+0xd4/0x180
> > > >>> [  738.358822]  ? _raw_spin_unlock_irqrestore+0x39/0x40
> > > >>> [  738.358831]  ? ktime_get_mono_fast_ns+0x50/0xa0
> > > >>> [  738.358844]  ? amdgpu_drm_ioctl+0x6b/0x80 [amdgpu]
> > > >>> [  738.359044]  exit_to_user_mode_prepare+0xf2/0x1b0
> > > >>> [  738.359054]  syscall_exit_to_user_mode+0x19/0x60
> > > >>> [  738.359062]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > >>> [  738.359069] RIP: 0033:0x7f6b89a51887
> > > >>> [  738.359076] RSP: 002b:00007f6b82b54b18 EFLAGS: 00000246 ORIG_RAX:
> > > >>> 0000000000000010
> > > >>> [  738.359086] RAX: fffffffffffffe00 RBX: 00007f6b82b54b50 RCX: 00007f6b89a51887
> > > >>> [  738.359091] RDX: 00007f6b82b54b50 RSI: 00000000c02064c3 RDI: 0000000000000007
> > > >>> [  738.359096] RBP: 00000000c02064c3 R08: 0000000000000003 R09: 00007f6b82b54bbc
> > > >>> [  738.359101] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000165a0bc00
> > > >>> [  738.359106] R13: 0000000000000007 R14: 0000000000000001 R15: 0000000000000000
> > > >>> [  738.359129]
> > > >>>                  Showing all locks held in the system:
> > > >>> [  738.359141] 1 lock held by khungtaskd/54:
> > > >>> [  738.359148]  #0: ffffffff829f6840 (rcu_read_lock){....}-{1:2}, at:
> > > >>> debug_show_all_locks+0x15/0x183
> > > >>> [  738.359187] 1 lock held by systemd-journal/174:
> > > >>> [  738.359202] 1 lock held by MatrixMultiplic/653:
> > > >>> [  738.359206]  #0: ffff88810e364fe0
> > > >>> (&adev->notifier_lock){+.+.}-{3:3}, at:
> > > >>> amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
> > > >>>
> > > >>> Daniel
> > > >> _______________________________________________
> > > >> dri-devel mailing list
> > > >> dri-devel@lists.freedesktop.org
>>>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > > >
> > > >
> > >
> >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [amdgpu] deadlock
  2021-02-03 11:45           ` Daniel Gomez
@ 2021-02-03 12:21             ` Christian König
  2021-02-03 12:24               ` Daniel Vetter
  0 siblings, 1 reply; 20+ messages in thread
From: Christian König @ 2021-02-03 12:21 UTC (permalink / raw)
  To: Daniel Gomez, Daniel Vetter
  Cc: Alex Deucher, dri-devel, Christian König, amd-gfx list,
	Linux Kernel Mailing List

Am 03.02.21 um 12:45 schrieb Daniel Gomez:
> On Wed, 3 Feb 2021 at 10:47, Daniel Gomez <daniel@qtec.com> wrote:
>> On Wed, 3 Feb 2021 at 10:17, Daniel Vetter <daniel@ffwll.ch> wrote:
>>> On Wed, Feb 3, 2021 at 9:51 AM Christian König <christian.koenig@amd.com> wrote:
>>>> Am 03.02.21 um 09:48 schrieb Daniel Vetter:
>>>>> On Wed, Feb 3, 2021 at 9:36 AM Christian König <christian.koenig@amd.com> wrote:
>>>>>> Hi Daniel,
>>>>>>
>>>>>> this is not a deadlock, but rather a hardware lockup.
>>>>> Are you sure? Ime getting stuck in dma_fence_wait has generally good
>>>>> chance of being a dma_fence deadlock. GPU hang should never result in
>>>>> a forever stuck dma_fence.
>>>> Yes, I'm pretty sure. Otherwise the hardware clocks wouldn't go up like
>>>> this.
>>> Maybe clarifying, could be both. TDR should notice and get us out of
>>> this, but if there's a dma_fence deadlock and we can't re-emit or
>>> force complete the pending things, then we're stuck for good.
>>> -Daniel
>>>
>>>> Question is rather why we end up in the userptr handling for GFX? Our
>>>> ROCm OpenCL stack shouldn't use this.
>>>>
>>>>> Daniel, can you pls re-hang your machine and then dump backtraces of
>>>>> all tasks into dmesg with sysrq-t, and then attach that? Without all
>>>>> the backtraces it's tricky to construct the full dependency chain of
>>>>> what's going on. Also is this plain -rc6, not some more patches on
>>>>> top?
>>>> Yeah, that's still a good idea to have.
>> Here the full backtrace dmesg logs after the hang:
>> https://pastebin.com/raw/kzivm2L3
>>
>> This is another dmesg log with the backtraces after SIGKILL the matrix process:
>> (I didn't have the sysrq enable at the time):
>> https://pastebin.com/raw/pRBwGcj1
> I've now removed all our v4l2 patches and did the same test with the 'plain'
> mainline version (-rc6).
>
> Reference: 3aaf0a27ffc29b19a62314edd684b9bc6346f9a8
>
> Same error, same behaviour. Full dmesg log attached:
> https://pastebin.com/raw/KgaEf7Y1
> Note:
>    dmesg with sysrq-t before running the test starts in [  122.016502]
> sysrq: Show State
>    dmesg with sysrq-t after the test starts in: [  495.587671] sysrq: Show State

There is nothing amdgpu-related in there except for waiting for the
hardware.

This is a pretty standard hardware lockup, but I'm still waiting for an
explanation of why we end up in this call path in the first place.
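
For context, the path in question looks roughly like this in 5.11
(paraphrased from amdgpu_mn.c and annotated, so don't take it as a
verbatim copy of the source):

static bool amdgpu_mn_invalidate_gfx(struct mmu_interval_notifier *mni,
                                     const struct mmu_notifier_range *range,
                                     unsigned long cur_seq)
{
        struct amdgpu_bo *bo = container_of(mni, struct amdgpu_bo, notifier);
        struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
        long r;

        if (!mmu_notifier_range_blockable(range))
                return false;

        /* The notifier_lock shown as held in the lockdep report above. */
        mutex_lock(&adev->notifier_lock);

        mmu_interval_set_seq(mni, cur_seq);

        /* Wait for *all* fences on the BO with no upper bound: if one of
         * them never signals, we sit here forever with the lock held. */
        r = dma_resv_wait_timeout_rcu(bo->tbo.base.resv, true, false,
                                      MAX_SCHEDULE_TIMEOUT);
        mutex_unlock(&adev->notifier_lock);
        if (r <= 0)
                DRM_ERROR("(%ld) failed to wait for user bo\n", r);

        return true;
}

So this only ever completes if every fence on the BO signals, which is why
a single stuck fence is enough to wedge the whole exit path.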

Christian.

>
>
>>>> Christian.
>>>>
>>>>> -Daniel
>>>>>
>>>>>> Which OpenCl stack are you using?
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>> Am 03.02.21 um 09:33 schrieb Daniel Gomez:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I have a deadlock with the amdgpu mainline driver when running in parallel two
>>>>>>> OpenCL applications. So far, we've been able to replicate it easily by executing
>>>>>>> clinfo and MatrixMultiplication (from AMD opencl-samples). It's quite old the
>>>>>>> opencl-samples so, if you have any other suggestion for testing I'd be very
>>>>>>> happy to test it as well.
>>>>>>>
>>>>>>> How to replicate the issue:
>>>>>>>
>>>>>>> # while true; do /usr/bin/MatrixMultiplication --device gpu \
>>>>>>>        --deviceId 0 -x 1000 -y 1000 -z 1000 -q -t -i 50; done
>>>>>>> # while true; do clinfo; done
>>>>>>>
>>>>>>> Output:
>>>>>>>
>>>>>>> After a minute or less (sometimes could be more) I can see that
>>>>>>> MatrixMultiplication and clinfo hang. In addition, with radeontop you can see
>>>>>>> how the Graphics pipe goes from ~50% to 100%. Also the shader clocks
>>>>>>> goes up from ~35% to ~96%.
>>>>>>>
>>>>>>> clinfo keeps printing:
>>>>>>> ioctl(7, DRM_IOCTL_SYNCOBJ_WAIT, 0x7ffe46e5f950) = -1 ETIME (Timer expired)
>>>>>>>
>>>>>>> And MatrixMultiplication prints the following (strace) if you try to
>>>>>>> kill the process:
>>>>>>>
>>>>>>> sched_yield()                           = 0
>>>>>>> futex(0x557e945343b8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0,
>>>>>>> NULL, FUTEX_BITSET_MATCH_ANYstrace: Process 651 detached
>>>>>>>     <detached ...>
>>>>>>>
>>>>>>> After this, the gpu is not functional at all and you'd need a power cycle reset
>>>>>>> to restore the system.
>>>>>>>
>>>>>>> Hardware info:
>>>>>>> CPU: AMD Ryzen Embedded V1605B with Radeon Vega Gfx (8) @ 2.000GHz
>>>>>>> GPU: AMD ATI Radeon Vega Series / Radeon Vega Mobile Series
>>>>>>>
>>>>>>> 03:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
>>>>>>> [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series]
>>>>>>> (rev 83)
>>>>>>>        DeviceName: Broadcom 5762
>>>>>>>        Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge
>>>>>>> [Radeon Vega Series / Radeon Vega Mobile Series]
>>>>>>>        Kernel driver in use: amdgpu
>>>>>>>        Kernel modules: amdgpu
>>>>>>>
>>>>>>> Linux kernel info:
>>>>>>>
>>>>>>> root@qt5222:~# uname -a
>>>>>>> Linux qt5222 5.11.0-rc6-qtec-standard #2 SMP Tue Feb 2 09:41:46 UTC
>>>>>>> 2021 x86_64 x86_64 x86_64 GNU/Linux
>>>>>>>
>>>>>>> By enabling the kernel locks stats I could see the MatrixMultiplication is
>>>>>>> hanged in the amdgpu_mn_invalidate_gfx function:
>>>>>>>
>>>>>>> [  738.359202] 1 lock held by MatrixMultiplic/653:
>>>>>>> [  738.359206]  #0: ffff88810e364fe0
>>>>>>> (&adev->notifier_lock){+.+.}-{3:3}, at:
>>>>>>> amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
>>>>>>>
>>>>>>> I can see in the the amdgpu_mn_invalidate_gfx function: the
>>>>>>> dma_resv_wait_timeout_rcu uses wait_all (fences) and MAX_SCHEDULE_TIMEOUT so, I
>>>>>>> guess the code gets stuck there waiting forever. According to the
>>>>>>> documentation: "When somebody tries to invalidate the page tables we block the
>>>>>>> update until all operations on the pages in question are completed, then those
>>>>>>> pages are marked  as accessed and also dirty if it wasn’t a read only access."
>>>>>>> Looks like the fences are deadlocked and therefore, it never returns. Could it
>>>>>>> be possible? any hint to where can I look to fix this?
>>>>>>>
>>>>>>> Thank you  in advance.
>>>>>>>
>>>>>>> Here the full dmesg output:
>>>>>>>
>>>>>>> [  738.337726] INFO: task MatrixMultiplic:653 blocked for more than 122 seconds.
>>>>>>> [  738.344937]       Not tainted 5.11.0-rc6-qtec-standard #2
>>>>>>> [  738.350384] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>>>>>>> disables this message.
>>>>>>> [  738.358240] task:MatrixMultiplic state:D stack:    0 pid:  653
>>>>>>> ppid:     1 flags:0x00004000
>>>>>>> [  738.358254] Call Trace:
>>>>>>> [  738.358261]  ? dma_fence_default_wait+0x1eb/0x230
>>>>>>> [  738.358276]  __schedule+0x370/0x960
>>>>>>> [  738.358291]  ? dma_fence_default_wait+0x117/0x230
>>>>>>> [  738.358297]  ? dma_fence_default_wait+0x1eb/0x230
>>>>>>> [  738.358305]  schedule+0x51/0xc0
>>>>>>> [  738.358312]  schedule_timeout+0x275/0x380
>>>>>>> [  738.358324]  ? dma_fence_default_wait+0x1eb/0x230
>>>>>>> [  738.358332]  ? mark_held_locks+0x4f/0x70
>>>>>>> [  738.358341]  ? dma_fence_default_wait+0x117/0x230
>>>>>>> [  738.358347]  ? lockdep_hardirqs_on_prepare+0xd4/0x180
>>>>>>> [  738.358353]  ? _raw_spin_unlock_irqrestore+0x39/0x40
>>>>>>> [  738.358362]  ? dma_fence_default_wait+0x117/0x230
>>>>>>> [  738.358370]  ? dma_fence_default_wait+0x1eb/0x230
>>>>>>> [  738.358375]  dma_fence_default_wait+0x214/0x230
>>>>>>> [  738.358384]  ? dma_fence_release+0x1a0/0x1a0
>>>>>>> [  738.358396]  dma_fence_wait_timeout+0x105/0x200
>>>>>>> [  738.358405]  dma_resv_wait_timeout_rcu+0x1aa/0x5e0
>>>>>>> [  738.358421]  amdgpu_mn_invalidate_gfx+0x55/0xa0 [amdgpu]
>>>>>>> [  738.358688]  __mmu_notifier_release+0x1bb/0x210
>>>>>>> [  738.358710]  exit_mmap+0x2f/0x1e0
>>>>>>> [  738.358723]  ? find_held_lock+0x34/0xa0
>>>>>>> [  738.358746]  mmput+0x39/0xe0
>>>>>>> [  738.358756]  do_exit+0x5c3/0xc00
>>>>>>> [  738.358763]  ? find_held_lock+0x34/0xa0
>>>>>>> [  738.358780]  do_group_exit+0x47/0xb0
>>>>>>> [  738.358791]  get_signal+0x15b/0xc50
>>>>>>> [  738.358807]  arch_do_signal_or_restart+0xaf/0x710
>>>>>>> [  738.358816]  ? lockdep_hardirqs_on_prepare+0xd4/0x180
>>>>>>> [  738.358822]  ? _raw_spin_unlock_irqrestore+0x39/0x40
>>>>>>> [  738.358831]  ? ktime_get_mono_fast_ns+0x50/0xa0
>>>>>>> [  738.358844]  ? amdgpu_drm_ioctl+0x6b/0x80 [amdgpu]
>>>>>>> [  738.359044]  exit_to_user_mode_prepare+0xf2/0x1b0
>>>>>>> [  738.359054]  syscall_exit_to_user_mode+0x19/0x60
>>>>>>> [  738.359062]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>>>>> [  738.359069] RIP: 0033:0x7f6b89a51887
>>>>>>> [  738.359076] RSP: 002b:00007f6b82b54b18 EFLAGS: 00000246 ORIG_RAX:
>>>>>>> 0000000000000010
>>>>>>> [  738.359086] RAX: fffffffffffffe00 RBX: 00007f6b82b54b50 RCX: 00007f6b89a51887
>>>>>>> [  738.359091] RDX: 00007f6b82b54b50 RSI: 00000000c02064c3 RDI: 0000000000000007
>>>>>>> [  738.359096] RBP: 00000000c02064c3 R08: 0000000000000003 R09: 00007f6b82b54bbc
>>>>>>> [  738.359101] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000165a0bc00
>>>>>>> [  738.359106] R13: 0000000000000007 R14: 0000000000000001 R15: 0000000000000000
>>>>>>> [  738.359129]
>>>>>>>                   Showing all locks held in the system:
>>>>>>> [  738.359141] 1 lock held by khungtaskd/54:
>>>>>>> [  738.359148]  #0: ffffffff829f6840 (rcu_read_lock){....}-{1:2}, at:
>>>>>>> debug_show_all_locks+0x15/0x183
>>>>>>> [  738.359187] 1 lock held by systemd-journal/174:
>>>>>>> [  738.359202] 1 lock held by MatrixMultiplic/653:
>>>>>>> [  738.359206]  #0: ffff88810e364fe0
>>>>>>> (&adev->notifier_lock){+.+.}-{3:3}, at:
>>>>>>> amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
>>>>>>>
>>>>>>> Daniel
>>>>>> _______________________________________________
>>>>>> dri-devel mailing list
>>>>>> dri-devel@lists.freedesktop.org
> > > > > > > https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>>>
>>>
>>> --
>>> Daniel Vetter
>>> Software Engineer, Intel Corporation
>>> http://blog.ffwll.ch
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [amdgpu] deadlock
  2021-02-03 12:21             ` Christian König
@ 2021-02-03 12:24               ` Daniel Vetter
  2021-02-03 12:30                 ` Christian König
  0 siblings, 1 reply; 20+ messages in thread
From: Daniel Vetter @ 2021-02-03 12:24 UTC (permalink / raw)
  To: christian.koenig
  Cc: Daniel Gomez, Daniel Vetter, Alex Deucher, dri-devel,
	amd-gfx list, Linux Kernel Mailing List

On Wed, Feb 03, 2021 at 01:21:20PM +0100, Christian König wrote:
> Am 03.02.21 um 12:45 schrieb Daniel Gomez:
> > On Wed, 3 Feb 2021 at 10:47, Daniel Gomez <daniel@qtec.com> wrote:
> > > On Wed, 3 Feb 2021 at 10:17, Daniel Vetter <daniel@ffwll.ch> wrote:
> > > > On Wed, Feb 3, 2021 at 9:51 AM Christian König <christian.koenig@amd.com> wrote:
> > > > > Am 03.02.21 um 09:48 schrieb Daniel Vetter:
> > > > > > On Wed, Feb 3, 2021 at 9:36 AM Christian König <christian.koenig@amd.com> wrote:
> > > > > > > Hi Daniel,
> > > > > > > 
> > > > > > > this is not a deadlock, but rather a hardware lockup.
> > > > > > Are you sure? Ime getting stuck in dma_fence_wait has generally good
> > > > > > chance of being a dma_fence deadlock. GPU hang should never result in
> > > > > > a forever stuck dma_fence.
> > > > > Yes, I'm pretty sure. Otherwise the hardware clocks wouldn't go up like
> > > > > this.
> > > > Maybe clarifying, could be both. TDR should notice and get us out of
> > > > this, but if there's a dma_fence deadlock and we can't re-emit or
> > > > force complete the pending things, then we're stuck for good.
> > > > -Daniel
> > > > 
> > > > > Question is rather why we end up in the userptr handling for GFX? Our
> > > > > ROCm OpenCL stack shouldn't use this.
> > > > > 
> > > > > > Daniel, can you pls re-hang your machine and then dump backtraces of
> > > > > > all tasks into dmesg with sysrq-t, and then attach that? Without all
> > > > > > the backtraces it's tricky to construct the full dependency chain of
> > > > > > what's going on. Also is this plain -rc6, not some more patches on
> > > > > > top?
> > > > > Yeah, that's still a good idea to have.
> > > Here the full backtrace dmesg logs after the hang:
> > > https://pastebin.com/raw/kzivm2L3
> > > 
> > > This is another dmesg log with the backtraces after SIGKILL the matrix process:
> > > (I didn't have the sysrq enable at the time):
> > > https://pastebin.com/raw/pRBwGcj1
> > I've now removed all our v4l2 patches and did the same test with the 'plain'
> > mainline version (-rc6).
> > 
> > Reference: 3aaf0a27ffc29b19a62314edd684b9bc6346f9a8
> > 
> > Same error, same behaviour. Full dmesg log attached:
> > https://pastebin.com/raw/KgaEf7Y1
> > Note:
> >    dmesg with sysrq-t before running the test starts in [  122.016502]
> > sysrq: Show State
> >    dmesg with sysrq-t after the test starts in: [  495.587671] sysrq: Show State
> 
> There is nothing amdgpu related in there except for waiting for the
> hardware.

Yeah, but there's also no other driver that could cause a stuck dma_fence,
so why is reset not cleaning up the mess here? Irrespective of why the gpu
is stuck, the kernel should at least complete all the dma_fences even if
the gpu for some reason is terminally ill ...
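
Concretely, by "complete" I mean something like the sketch below - just the
dma_fence contract, not amdgpu's actual reset code:

/* On a terminal hang the driver is expected to fail and signal every
 * pending fence, so that waiters (dma_fence_wait(),
 * dma_resv_wait_timeout_rcu(), ...) can return instead of blocking.
 */
static void force_complete(struct dma_fence *fence)
{
        dma_fence_set_error(fence, -ETIME); /* must be set before signaling */
        dma_fence_signal(fence);            /* wakes up all waiters */
}
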
-Daniel

> This is a pretty standard hardware lockup, but I'm still waiting for an
> explanation why we end up in this call path in the first place.
> 
> Christian.
> 
> > 
> > 
> > > > > Christian.
> > > > > 
> > > > > > -Daniel
> > > > > > 
> > > > > > > Which OpenCl stack are you using?
> > > > > > > 
> > > > > > > Regards,
> > > > > > > Christian.
> > > > > > > 
> > > > > > > Am 03.02.21 um 09:33 schrieb Daniel Gomez:
> > > > > > > > Hi all,
> > > > > > > > 
> > > > > > > > I have a deadlock with the amdgpu mainline driver when running in parallel two
> > > > > > > > OpenCL applications. So far, we've been able to replicate it easily by executing
> > > > > > > > clinfo and MatrixMultiplication (from AMD opencl-samples). It's quite old the
> > > > > > > > opencl-samples so, if you have any other suggestion for testing I'd be very
> > > > > > > > happy to test it as well.
> > > > > > > > 
> > > > > > > > How to replicate the issue:
> > > > > > > > 
> > > > > > > > # while true; do /usr/bin/MatrixMultiplication --device gpu \
> > > > > > > >        --deviceId 0 -x 1000 -y 1000 -z 1000 -q -t -i 50; done
> > > > > > > > # while true; do clinfo; done
> > > > > > > > 
> > > > > > > > Output:
> > > > > > > > 
> > > > > > > > After a minute or less (sometimes could be more) I can see that
> > > > > > > > MatrixMultiplication and clinfo hang. In addition, with radeontop you can see
> > > > > > > > how the Graphics pipe goes from ~50% to 100%. Also the shader clocks
> > > > > > > > goes up from ~35% to ~96%.
> > > > > > > > 
> > > > > > > > clinfo keeps printing:
> > > > > > > > ioctl(7, DRM_IOCTL_SYNCOBJ_WAIT, 0x7ffe46e5f950) = -1 ETIME (Timer expired)
> > > > > > > > 
> > > > > > > > And MatrixMultiplication prints the following (strace) if you try to
> > > > > > > > kill the process:
> > > > > > > > 
> > > > > > > > sched_yield()                           = 0
> > > > > > > > futex(0x557e945343b8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0,
> > > > > > > > NULL, FUTEX_BITSET_MATCH_ANYstrace: Process 651 detached
> > > > > > > >     <detached ...>
> > > > > > > > 
> > > > > > > > After this, the gpu is not functional at all and you'd need a power cycle reset
> > > > > > > > to restore the system.
> > > > > > > > 
> > > > > > > > Hardware info:
> > > > > > > > CPU: AMD Ryzen Embedded V1605B with Radeon Vega Gfx (8) @ 2.000GHz
> > > > > > > > GPU: AMD ATI Radeon Vega Series / Radeon Vega Mobile Series
> > > > > > > > 
> > > > > > > > 03:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
> > > > > > > > [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series]
> > > > > > > > (rev 83)
> > > > > > > >        DeviceName: Broadcom 5762
> > > > > > > >        Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge
> > > > > > > > [Radeon Vega Series / Radeon Vega Mobile Series]
> > > > > > > >        Kernel driver in use: amdgpu
> > > > > > > >        Kernel modules: amdgpu
> > > > > > > > 
> > > > > > > > Linux kernel info:
> > > > > > > > 
> > > > > > > > root@qt5222:~# uname -a
> > > > > > > > Linux qt5222 5.11.0-rc6-qtec-standard #2 SMP Tue Feb 2 09:41:46 UTC
> > > > > > > > 2021 x86_64 x86_64 x86_64 GNU/Linux
> > > > > > > > 
> > > > > > > > By enabling the kernel locks stats I could see the MatrixMultiplication is
> > > > > > > > hanged in the amdgpu_mn_invalidate_gfx function:
> > > > > > > > 
> > > > > > > > [  738.359202] 1 lock held by MatrixMultiplic/653:
> > > > > > > > [  738.359206]  #0: ffff88810e364fe0
> > > > > > > > (&adev->notifier_lock){+.+.}-{3:3}, at:
> > > > > > > > amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
> > > > > > > > 
> > > > > > > > I can see in the the amdgpu_mn_invalidate_gfx function: the
> > > > > > > > dma_resv_wait_timeout_rcu uses wait_all (fences) and MAX_SCHEDULE_TIMEOUT so, I
> > > > > > > > guess the code gets stuck there waiting forever. According to the
> > > > > > > > documentation: "When somebody tries to invalidate the page tables we block the
> > > > > > > > update until all operations on the pages in question are completed, then those
> > > > > > > > pages are marked  as accessed and also dirty if it wasn’t a read only access."
> > > > > > > > Looks like the fences are deadlocked and therefore, it never returns. Could it
> > > > > > > > be possible? any hint to where can I look to fix this?
> > > > > > > > 
> > > > > > > > Thank you  in advance.
> > > > > > > > 
> > > > > > > > Here the full dmesg output:
> > > > > > > > 
> > > > > > > > [  738.337726] INFO: task MatrixMultiplic:653 blocked for more than 122 seconds.
> > > > > > > > [  738.344937]       Not tainted 5.11.0-rc6-qtec-standard #2
> > > > > > > > [  738.350384] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > > > > > > disables this message.
> > > > > > > > [  738.358240] task:MatrixMultiplic state:D stack:    0 pid:  653
> > > > > > > > ppid:     1 flags:0x00004000
> > > > > > > > [  738.358254] Call Trace:
> > > > > > > > [  738.358261]  ? dma_fence_default_wait+0x1eb/0x230
> > > > > > > > [  738.358276]  __schedule+0x370/0x960
> > > > > > > > [  738.358291]  ? dma_fence_default_wait+0x117/0x230
> > > > > > > > [  738.358297]  ? dma_fence_default_wait+0x1eb/0x230
> > > > > > > > [  738.358305]  schedule+0x51/0xc0
> > > > > > > > [  738.358312]  schedule_timeout+0x275/0x380
> > > > > > > > [  738.358324]  ? dma_fence_default_wait+0x1eb/0x230
> > > > > > > > [  738.358332]  ? mark_held_locks+0x4f/0x70
> > > > > > > > [  738.358341]  ? dma_fence_default_wait+0x117/0x230
> > > > > > > > [  738.358347]  ? lockdep_hardirqs_on_prepare+0xd4/0x180
> > > > > > > > [  738.358353]  ? _raw_spin_unlock_irqrestore+0x39/0x40
> > > > > > > > [  738.358362]  ? dma_fence_default_wait+0x117/0x230
> > > > > > > > [  738.358370]  ? dma_fence_default_wait+0x1eb/0x230
> > > > > > > > [  738.358375]  dma_fence_default_wait+0x214/0x230
> > > > > > > > [  738.358384]  ? dma_fence_release+0x1a0/0x1a0
> > > > > > > > [  738.358396]  dma_fence_wait_timeout+0x105/0x200
> > > > > > > > [  738.358405]  dma_resv_wait_timeout_rcu+0x1aa/0x5e0
> > > > > > > > [  738.358421]  amdgpu_mn_invalidate_gfx+0x55/0xa0 [amdgpu]
> > > > > > > > [  738.358688]  __mmu_notifier_release+0x1bb/0x210
> > > > > > > > [  738.358710]  exit_mmap+0x2f/0x1e0
> > > > > > > > [  738.358723]  ? find_held_lock+0x34/0xa0
> > > > > > > > [  738.358746]  mmput+0x39/0xe0
> > > > > > > > [  738.358756]  do_exit+0x5c3/0xc00
> > > > > > > > [  738.358763]  ? find_held_lock+0x34/0xa0
> > > > > > > > [  738.358780]  do_group_exit+0x47/0xb0
> > > > > > > > [  738.358791]  get_signal+0x15b/0xc50
> > > > > > > > [  738.358807]  arch_do_signal_or_restart+0xaf/0x710
> > > > > > > > [  738.358816]  ? lockdep_hardirqs_on_prepare+0xd4/0x180
> > > > > > > > [  738.358822]  ? _raw_spin_unlock_irqrestore+0x39/0x40
> > > > > > > > [  738.358831]  ? ktime_get_mono_fast_ns+0x50/0xa0
> > > > > > > > [  738.358844]  ? amdgpu_drm_ioctl+0x6b/0x80 [amdgpu]
> > > > > > > > [  738.359044]  exit_to_user_mode_prepare+0xf2/0x1b0
> > > > > > > > [  738.359054]  syscall_exit_to_user_mode+0x19/0x60
> > > > > > > > [  738.359062]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > > > > > [  738.359069] RIP: 0033:0x7f6b89a51887
> > > > > > > > [  738.359076] RSP: 002b:00007f6b82b54b18 EFLAGS: 00000246 ORIG_RAX:
> > > > > > > > 0000000000000010
> > > > > > > > [  738.359086] RAX: fffffffffffffe00 RBX: 00007f6b82b54b50 RCX: 00007f6b89a51887
> > > > > > > > [  738.359091] RDX: 00007f6b82b54b50 RSI: 00000000c02064c3 RDI: 0000000000000007
> > > > > > > > [  738.359096] RBP: 00000000c02064c3 R08: 0000000000000003 R09: 00007f6b82b54bbc
> > > > > > > > [  738.359101] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000165a0bc00
> > > > > > > > [  738.359106] R13: 0000000000000007 R14: 0000000000000001 R15: 0000000000000000
> > > > > > > > [  738.359129]
> > > > > > > >                   Showing all locks held in the system:
> > > > > > > > [  738.359141] 1 lock held by khungtaskd/54:
> > > > > > > > [  738.359148]  #0: ffffffff829f6840 (rcu_read_lock){....}-{1:2}, at:
> > > > > > > > debug_show_all_locks+0x15/0x183
> > > > > > > > [  738.359187] 1 lock held by systemd-journal/174:
> > > > > > > > [  738.359202] 1 lock held by MatrixMultiplic/653:
> > > > > > > > [  738.359206]  #0: ffff88810e364fe0
> > > > > > > > (&adev->notifier_lock){+.+.}-{3:3}, at:
> > > > > > > > amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
> > > > > > > > 
> > > > > > > > Daniel
> > > > > > > _______________________________________________
> > > > > > > dri-devel mailing list
> > > > > > > dri-devel@lists.freedesktop.org
> > > > > > > https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > > > > > 
> > > > 
> > > > --
> > > > Daniel Vetter
> > > > Software Engineer, Intel Corporation
> > > > http://blog.ffwll.ch
> > _______________________________________________
> > amd-gfx mailing list
> > amd-gfx@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [amdgpu] deadlock
  2021-02-03 12:24               ` Daniel Vetter
@ 2021-02-03 12:30                 ` Christian König
  2021-02-03 13:56                   ` Alex Deucher
  0 siblings, 1 reply; 20+ messages in thread
From: Christian König @ 2021-02-03 12:30 UTC (permalink / raw)
  To: Daniel Gomez, Alex Deucher, dri-devel, amd-gfx list,
	Linux Kernel Mailing List

Am 03.02.21 um 13:24 schrieb Daniel Vetter:
> On Wed, Feb 03, 2021 at 01:21:20PM +0100, Christian König wrote:
>> Am 03.02.21 um 12:45 schrieb Daniel Gomez:
>>> On Wed, 3 Feb 2021 at 10:47, Daniel Gomez <daniel@qtec.com> wrote:
>>>> On Wed, 3 Feb 2021 at 10:17, Daniel Vetter <daniel@ffwll.ch> wrote:
>>>>> On Wed, Feb 3, 2021 at 9:51 AM Christian König <christian.koenig@amd.com> wrote:
>>>>>> Am 03.02.21 um 09:48 schrieb Daniel Vetter:
>>>>>>> On Wed, Feb 3, 2021 at 9:36 AM Christian König <christian.koenig@amd.com> wrote:
>>>>>>>> Hi Daniel,
>>>>>>>>
>>>>>>>> this is not a deadlock, but rather a hardware lockup.
>>>>>>> Are you sure? Ime getting stuck in dma_fence_wait has generally good
>>>>>>> chance of being a dma_fence deadlock. GPU hang should never result in
>>>>>>> a forever stuck dma_fence.
>>>>>> Yes, I'm pretty sure. Otherwise the hardware clocks wouldn't go up like
>>>>>> this.
>>>>> Maybe clarifying, could be both. TDR should notice and get us out of
>>>>> this, but if there's a dma_fence deadlock and we can't re-emit or
>>>>> force complete the pending things, then we're stuck for good.
>>>>> -Daniel
>>>>>
>>>>>> Question is rather why we end up in the userptr handling for GFX? Our
>>>>>> ROCm OpenCL stack shouldn't use this.
>>>>>>
>>>>>>> Daniel, can you pls re-hang your machine and then dump backtraces of
>>>>>>> all tasks into dmesg with sysrq-t, and then attach that? Without all
>>>>>>> the backtraces it's tricky to construct the full dependency chain of
>>>>>>> what's going on. Also is this plain -rc6, not some more patches on
>>>>>>> top?
>>>>>> Yeah, that's still a good idea to have.
>>>> Here the full backtrace dmesg logs after the hang:
>>>> https://pastebin.com/raw/kzivm2L3
>>>>
>>>> This is another dmesg log with the backtraces after SIGKILL the matrix process:
>>>> (I didn't have the sysrq enable at the time):
>>>> https://pastebin.com/raw/pRBwGcj1
>>> I've now removed all our v4l2 patches and did the same test with the 'plain'
>>> mainline version (-rc6).
>>>
>>> Reference: 3aaf0a27ffc29b19a62314edd684b9bc6346f9a8
>>>
>>> Same error, same behaviour. Full dmesg log attached:
>>> https://pastebin.com/raw/KgaEf7Y1
>>> Note:
>>>     dmesg with sysrq-t before running the test starts in [  122.016502]
>>> sysrq: Show State
>>>     dmesg with sysrq-t after the test starts in: [  495.587671] sysrq: Show State
>> There is nothing amdgpu related in there except for waiting for the
>> hardware.
> Yeah, but there's also no other driver that could cause a stuck dma_fence,
> so why is reset not cleaning up the mess here? Irrespective of why the gpu
> is stuck, the kernel should at least complete all the dma_fences even if
> the gpu for some reason is terminally ill ...

That's a good question as well. I'm digging into this.

My best theory is that the amdgpu packages disabled GPU reset for some 
reason.
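
(That should be easy to check at runtime, e.g. with something like:

# cat /sys/module/amdgpu/parameters/gpu_recovery

where, if I remember the encoding correctly, -1 means auto, 0 disabled and
1 enabled.)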

But the much more interesting question is why we end up in this call 
path. I've pinged internally, but east coast is not awake yet :)

Christian.

> -Daniel
>
>> This is a pretty standard hardware lockup, but I'm still waiting for an
>> explanation why we end up in this call path in the first place.
>>
>> Christian.
>>
>>>
>>>>>> Christian.
>>>>>>
>>>>>>> -Daniel
>>>>>>>
>>>>>>>> Which OpenCl stack are you using?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> Am 03.02.21 um 09:33 schrieb Daniel Gomez:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I have a deadlock with the amdgpu mainline driver when running in parallel two
>>>>>>>>> OpenCL applications. So far, we've been able to replicate it easily by executing
>>>>>>>>> clinfo and MatrixMultiplication (from AMD opencl-samples). It's quite old the
>>>>>>>>> opencl-samples so, if you have any other suggestion for testing I'd be very
>>>>>>>>> happy to test it as well.
>>>>>>>>>
>>>>>>>>> How to replicate the issue:
>>>>>>>>>
>>>>>>>>> # while true; do /usr/bin/MatrixMultiplication --device gpu \
>>>>>>>>>         --deviceId 0 -x 1000 -y 1000 -z 1000 -q -t -i 50; done
>>>>>>>>> # while true; do clinfo; done
>>>>>>>>>
>>>>>>>>> Output:
>>>>>>>>>
>>>>>>>>> After a minute or less (sometimes could be more) I can see that
>>>>>>>>> MatrixMultiplication and clinfo hang. In addition, with radeontop you can see
>>>>>>>>> how the Graphics pipe goes from ~50% to 100%. Also the shader clocks
>>>>>>>>> goes up from ~35% to ~96%.
>>>>>>>>>
>>>>>>>>> clinfo keeps printing:
>>>>>>>>> ioctl(7, DRM_IOCTL_SYNCOBJ_WAIT, 0x7ffe46e5f950) = -1 ETIME (Timer expired)
>>>>>>>>>
>>>>>>>>> And MatrixMultiplication prints the following (strace) if you try to
>>>>>>>>> kill the process:
>>>>>>>>>
>>>>>>>>> sched_yield()                           = 0
>>>>>>>>> futex(0x557e945343b8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0,
>>>>>>>>> NULL, FUTEX_BITSET_MATCH_ANYstrace: Process 651 detached
>>>>>>>>>      <detached ...>
>>>>>>>>>
>>>>>>>>> After this, the gpu is not functional at all and you'd need a power cycle reset
>>>>>>>>> to restore the system.
>>>>>>>>>
>>>>>>>>> Hardware info:
>>>>>>>>> CPU: AMD Ryzen Embedded V1605B with Radeon Vega Gfx (8) @ 2.000GHz
>>>>>>>>> GPU: AMD ATI Radeon Vega Series / Radeon Vega Mobile Series
>>>>>>>>>
>>>>>>>>> 03:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
>>>>>>>>> [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series]
>>>>>>>>> (rev 83)
>>>>>>>>>         DeviceName: Broadcom 5762
>>>>>>>>>         Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge
>>>>>>>>> [Radeon Vega Series / Radeon Vega Mobile Series]
>>>>>>>>>         Kernel driver in use: amdgpu
>>>>>>>>>         Kernel modules: amdgpu
>>>>>>>>>
>>>>>>>>> Linux kernel info:
>>>>>>>>>
>>>>>>>>> root@qt5222:~# uname -a
>>>>>>>>> Linux qt5222 5.11.0-rc6-qtec-standard #2 SMP Tue Feb 2 09:41:46 UTC
>>>>>>>>> 2021 x86_64 x86_64 x86_64 GNU/Linux
>>>>>>>>>
>>>>>>>>> By enabling the kernel locks stats I could see the MatrixMultiplication is
>>>>>>>>> hanged in the amdgpu_mn_invalidate_gfx function:
>>>>>>>>>
>>>>>>>>> [  738.359202] 1 lock held by MatrixMultiplic/653:
>>>>>>>>> [  738.359206]  #0: ffff88810e364fe0
>>>>>>>>> (&adev->notifier_lock){+.+.}-{3:3}, at:
>>>>>>>>> amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
>>>>>>>>>
>>>>>>>>> I can see in the the amdgpu_mn_invalidate_gfx function: the
>>>>>>>>> dma_resv_wait_timeout_rcu uses wait_all (fences) and MAX_SCHEDULE_TIMEOUT so, I
>>>>>>>>> guess the code gets stuck there waiting forever. According to the
>>>>>>>>> documentation: "When somebody tries to invalidate the page tables we block the
>>>>>>>>> update until all operations on the pages in question are completed, then those
>>>>>>>>> pages are marked  as accessed and also dirty if it wasn’t a read only access."
>>>>>>>>> Looks like the fences are deadlocked and therefore, it never returns. Could it
>>>>>>>>> be possible? any hint to where can I look to fix this?
>>>>>>>>>
>>>>>>>>> Thank you  in advance.
>>>>>>>>>
>>>>>>>>> Here the full dmesg output:
>>>>>>>>>
>>>>>>>>> [  738.337726] INFO: task MatrixMultiplic:653 blocked for more than 122 seconds.
>>>>>>>>> [  738.344937]       Not tainted 5.11.0-rc6-qtec-standard #2
>>>>>>>>> [  738.350384] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>>>>>>>>> disables this message.
>>>>>>>>> [  738.358240] task:MatrixMultiplic state:D stack:    0 pid:  653
>>>>>>>>> ppid:     1 flags:0x00004000
>>>>>>>>> [  738.358254] Call Trace:
>>>>>>>>> [  738.358261]  ? dma_fence_default_wait+0x1eb/0x230
>>>>>>>>> [  738.358276]  __schedule+0x370/0x960
>>>>>>>>> [  738.358291]  ? dma_fence_default_wait+0x117/0x230
>>>>>>>>> [  738.358297]  ? dma_fence_default_wait+0x1eb/0x230
>>>>>>>>> [  738.358305]  schedule+0x51/0xc0
>>>>>>>>> [  738.358312]  schedule_timeout+0x275/0x380
>>>>>>>>> [  738.358324]  ? dma_fence_default_wait+0x1eb/0x230
>>>>>>>>> [  738.358332]  ? mark_held_locks+0x4f/0x70
>>>>>>>>> [  738.358341]  ? dma_fence_default_wait+0x117/0x230
>>>>>>>>> [  738.358347]  ? lockdep_hardirqs_on_prepare+0xd4/0x180
>>>>>>>>> [  738.358353]  ? _raw_spin_unlock_irqrestore+0x39/0x40
>>>>>>>>> [  738.358362]  ? dma_fence_default_wait+0x117/0x230
>>>>>>>>> [  738.358370]  ? dma_fence_default_wait+0x1eb/0x230
>>>>>>>>> [  738.358375]  dma_fence_default_wait+0x214/0x230
>>>>>>>>> [  738.358384]  ? dma_fence_release+0x1a0/0x1a0
>>>>>>>>> [  738.358396]  dma_fence_wait_timeout+0x105/0x200
>>>>>>>>> [  738.358405]  dma_resv_wait_timeout_rcu+0x1aa/0x5e0
>>>>>>>>> [  738.358421]  amdgpu_mn_invalidate_gfx+0x55/0xa0 [amdgpu]
>>>>>>>>> [  738.358688]  __mmu_notifier_release+0x1bb/0x210
>>>>>>>>> [  738.358710]  exit_mmap+0x2f/0x1e0
>>>>>>>>> [  738.358723]  ? find_held_lock+0x34/0xa0
>>>>>>>>> [  738.358746]  mmput+0x39/0xe0
>>>>>>>>> [  738.358756]  do_exit+0x5c3/0xc00
>>>>>>>>> [  738.358763]  ? find_held_lock+0x34/0xa0
>>>>>>>>> [  738.358780]  do_group_exit+0x47/0xb0
>>>>>>>>> [  738.358791]  get_signal+0x15b/0xc50
>>>>>>>>> [  738.358807]  arch_do_signal_or_restart+0xaf/0x710
>>>>>>>>> [  738.358816]  ? lockdep_hardirqs_on_prepare+0xd4/0x180
>>>>>>>>> [  738.358822]  ? _raw_spin_unlock_irqrestore+0x39/0x40
>>>>>>>>> [  738.358831]  ? ktime_get_mono_fast_ns+0x50/0xa0
>>>>>>>>> [  738.358844]  ? amdgpu_drm_ioctl+0x6b/0x80 [amdgpu]
>>>>>>>>> [  738.359044]  exit_to_user_mode_prepare+0xf2/0x1b0
>>>>>>>>> [  738.359054]  syscall_exit_to_user_mode+0x19/0x60
>>>>>>>>> [  738.359062]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>>>>>>> [  738.359069] RIP: 0033:0x7f6b89a51887
>>>>>>>>> [  738.359076] RSP: 002b:00007f6b82b54b18 EFLAGS: 00000246 ORIG_RAX:
>>>>>>>>> 0000000000000010
>>>>>>>>> [  738.359086] RAX: fffffffffffffe00 RBX: 00007f6b82b54b50 RCX: 00007f6b89a51887
>>>>>>>>> [  738.359091] RDX: 00007f6b82b54b50 RSI: 00000000c02064c3 RDI: 0000000000000007
>>>>>>>>> [  738.359096] RBP: 00000000c02064c3 R08: 0000000000000003 R09: 00007f6b82b54bbc
>>>>>>>>> [  738.359101] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000165a0bc00
>>>>>>>>> [  738.359106] R13: 0000000000000007 R14: 0000000000000001 R15: 0000000000000000
>>>>>>>>> [  738.359129]
>>>>>>>>>                    Showing all locks held in the system:
>>>>>>>>> [  738.359141] 1 lock held by khungtaskd/54:
>>>>>>>>> [  738.359148]  #0: ffffffff829f6840 (rcu_read_lock){....}-{1:2}, at:
>>>>>>>>> debug_show_all_locks+0x15/0x183
>>>>>>>>> [  738.359187] 1 lock held by systemd-journal/174:
>>>>>>>>> [  738.359202] 1 lock held by MatrixMultiplic/653:
>>>>>>>>> [  738.359206]  #0: ffff88810e364fe0
>>>>>>>>> (&adev->notifier_lock){+.+.}-{3:3}, at:
>>>>>>>>> amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
>>>>>>>>>
>>>>>>>>> Daniel
>>>>>>>> _______________________________________________
>>>>>>>> dri-devel mailing list
>>>>>>>> dri-devel@lists.freedesktop.org
>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>>> --
>>>>> Daniel Vetter
>>>>> Software Engineer, Intel Corporation
>>>>> http://blog.ffwll.ch
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [amdgpu] deadlock
  2021-02-03 12:30                 ` Christian König
@ 2021-02-03 13:56                   ` Alex Deucher
  2021-02-03 14:27                     ` Daniel Vetter
  0 siblings, 1 reply; 20+ messages in thread
From: Alex Deucher @ 2021-02-03 13:56 UTC (permalink / raw)
  To: Christian König
  Cc: Daniel Gomez, Alex Deucher, dri-devel, amd-gfx list,
	Linux Kernel Mailing List

On Wed, Feb 3, 2021 at 7:30 AM Christian König <christian.koenig@amd.com> wrote:
>
> Am 03.02.21 um 13:24 schrieb Daniel Vetter:
> > On Wed, Feb 03, 2021 at 01:21:20PM +0100, Christian König wrote:
> >> Am 03.02.21 um 12:45 schrieb Daniel Gomez:
> >>> On Wed, 3 Feb 2021 at 10:47, Daniel Gomez <daniel@qtec.com> wrote:
> >>>> On Wed, 3 Feb 2021 at 10:17, Daniel Vetter <daniel@ffwll.ch> wrote:
> >>>>> On Wed, Feb 3, 2021 at 9:51 AM Christian König <christian.koenig@amd.com> wrote:
> >>>>>> Am 03.02.21 um 09:48 schrieb Daniel Vetter:
> >>>>>>> On Wed, Feb 3, 2021 at 9:36 AM Christian König <christian.koenig@amd.com> wrote:
> >>>>>>>> Hi Daniel,
> >>>>>>>>
> >>>>>>>> this is not a deadlock, but rather a hardware lockup.
> >>>>>>> Are you sure? Ime getting stuck in dma_fence_wait has generally good
> >>>>>>> chance of being a dma_fence deadlock. GPU hang should never result in
> >>>>>>> a forever stuck dma_fence.
> >>>>>> Yes, I'm pretty sure. Otherwise the hardware clocks wouldn't go up like
> >>>>>> this.
> >>>>> Maybe clarifying, could be both. TDR should notice and get us out of
> >>>>> this, but if there's a dma_fence deadlock and we can't re-emit or
> >>>>> force complete the pending things, then we're stuck for good.
> >>>>> -Daniel
> >>>>>
> >>>>>> Question is rather why we end up in the userptr handling for GFX? Our
> >>>>>> ROCm OpenCL stack shouldn't use this.
> >>>>>>
> >>>>>>> Daniel, can you pls re-hang your machine and then dump backtraces of
> >>>>>>> all tasks into dmesg with sysrq-t, and then attach that? Without all
> >>>>>>> the backtraces it's tricky to construct the full dependency chain of
> >>>>>>> what's going on. Also is this plain -rc6, not some more patches on
> >>>>>>> top?
> >>>>>> Yeah, that's still a good idea to have.
> >>>> Here the full backtrace dmesg logs after the hang:
> >>>> https://pastebin.com/raw/kzivm2L3
> >>>>
> >>>> This is another dmesg log with the backtraces after SIGKILL the matrix process:
> >>>> (I didn't have the sysrq enable at the time):
> >>>> https://pastebin.com/raw/pRBwGcj1
> >>> I've now removed all our v4l2 patches and did the same test with the 'plain'
> >>> mainline version (-rc6).
> >>>
> >>> Reference: 3aaf0a27ffc29b19a62314edd684b9bc6346f9a8
> >>>
> >>> Same error, same behaviour. Full dmesg log attached:
> >>> https://pastebin.com/raw/KgaEf7Y1
> >>> Note:
> >>>     dmesg with sysrq-t before running the test starts in [  122.016502]
> >>> sysrq: Show State
> >>>     dmesg with sysrq-t after the test starts in: [  495.587671] sysrq: Show State
> >> There is nothing amdgpu related in there except for waiting for the
> >> hardware.
> > Yeah, but there's also no other driver that could cause a stuck dma_fence,
> > so why is reset not cleaning up the mess here? Irrespective of why the gpu
> > is stuck, the kernel should at least complete all the dma_fences even if
> > the gpu for some reason is terminally ill ...
>
> That's a good question as well. I'm digging into this.
>
> My best theory is that the amdgpu packages disabled GPU reset for some
> reason.

The timeout for compute queues is infinite because of long-running
compute kernels.  You can override it with the amdgpu.lockup_timeout
parameter.
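
For example, on the kernel command line (the values here are only an
illustration; check the lockup_timeout description in amdgpu_drv.c for the
exact format on your kernel, it takes either a single value or per-engine
GFX,Compute,SDMA,Video values):

amdgpu.lockup_timeout=10000,10000,10000,10000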

Alex

>
> But the much more interesting question is why we end up in this call
> path. I've pinged internally, but east coast is not awake yet :)
>
> Christian.
>
> > -Daniel
> >
> >> This is a pretty standard hardware lockup, but I'm still waiting for an
> >> explanation why we end up in this call path in the first place.
> >>
> >> Christian.
> >>
> >>>
> >>>>>> Christian.
> >>>>>>
> >>>>>>> -Daniel
> >>>>>>>
> >>>>>>>> Which OpenCl stack are you using?
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Christian.
> >>>>>>>>
> >>>>>>>> Am 03.02.21 um 09:33 schrieb Daniel Gomez:
> >>>>>>>>> [...]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [amdgpu] deadlock
  2021-02-03 13:56                   ` Alex Deucher
@ 2021-02-03 14:27                     ` Daniel Vetter
  2021-02-03 14:33                       ` Bridgman, John
  0 siblings, 1 reply; 20+ messages in thread
From: Daniel Vetter @ 2021-02-03 14:27 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Christian König, Alex Deucher, Daniel Gomez, amd-gfx list,
	dri-devel, Linux Kernel Mailing List

On Wed, Feb 03, 2021 at 08:56:17AM -0500, Alex Deucher wrote:
> On Wed, Feb 3, 2021 at 7:30 AM Christian König <christian.koenig@amd.com> wrote:
> >
> > On 03.02.21 at 13:24, Daniel Vetter wrote:
> > > On Wed, Feb 03, 2021 at 01:21:20PM +0100, Christian König wrote:
> > >> On 03.02.21 at 12:45, Daniel Gomez wrote:
> > >>> On Wed, 3 Feb 2021 at 10:47, Daniel Gomez <daniel@qtec.com> wrote:
> > >>>> On Wed, 3 Feb 2021 at 10:17, Daniel Vetter <daniel@ffwll.ch> wrote:
> > >>>>> On Wed, Feb 3, 2021 at 9:51 AM Christian König <christian.koenig@amd.com> wrote:
> > >>>>>> On 03.02.21 at 09:48, Daniel Vetter wrote:
> > >>>>>>> On Wed, Feb 3, 2021 at 9:36 AM Christian König <christian.koenig@amd.com> wrote:
> > >>>>>>>> Hi Daniel,
> > >>>>>>>>
> > >>>>>>>> this is not a deadlock, but rather a hardware lockup.
> > >>>>>>> Are you sure? IME, getting stuck in dma_fence_wait generally has a good
> > >>>>>>> chance of being a dma_fence deadlock. GPU hang should never result in
> > >>>>>>> a forever stuck dma_fence.
> > >>>>>> Yes, I'm pretty sure. Otherwise the hardware clocks wouldn't go up like
> > >>>>>> this.
> > >>>>> Maybe to clarify: it could be both. TDR should notice and get us out of
> > >>>>> this, but if there's a dma_fence deadlock and we can't re-emit or
> > >>>>> force complete the pending things, then we're stuck for good.
> > >>>>> -Daniel
> > >>>>>
> > >>>>>> Question is rather why we end up in the userptr handling for GFX. Our
> > >>>>>> ROCm OpenCL stack shouldn't use this.
> > >>>>>>
> > >>>>>>> Daniel, can you pls re-hang your machine and then dump backtraces of
> > >>>>>>> all tasks into dmesg with sysrq-t, and then attach that? Without all
> > >>>>>>> the backtraces it's tricky to construct the full dependency chain of
> > >>>>>>> what's going on. Also is this plain -rc6, not some more patches on
> > >>>>>>> top?
> > >>>>>> Yeah, that's still a good idea to have.
> > >>>> Here are the full backtrace dmesg logs after the hang:
> > >>>> https://pastebin.com/raw/kzivm2L3
> > >>>>
> > >>>> This is another dmesg log with the backtraces after SIGKILLing the matrix
> > >>>> process (I didn't have sysrq enabled at the time):
> > >>>> https://pastebin.com/raw/pRBwGcj1
> > >>> I've now removed all our v4l2 patches and did the same test with the 'plain'
> > >>> mainline version (-rc6).
> > >>>
> > >>> Reference: 3aaf0a27ffc29b19a62314edd684b9bc6346f9a8
> > >>>
> > >>> Same error, same behaviour. Full dmesg log attached:
> > >>> https://pastebin.com/raw/KgaEf7Y1
> > >>> Note:
> > >>>     the sysrq-t dump taken before the test starts at [  122.016502] sysrq: Show State;
> > >>>     the sysrq-t dump taken after the test starts at [  495.587671] sysrq: Show State
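
(For reference, the sysrq-t dump requested above needs only two writes; a
minimal sketch using the standard procfs knobs:)

    # enable all sysrq functions (bit 0x8 alone is enough for task dumps)
    echo 1 > /proc/sys/kernel/sysrq
    # dump the state and stack trace of every task into dmesg
    echo t > /proc/sysrq-trigger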
> > >> There is nothing amdgpu related in there except for waiting for the
> > >> hardware.
> > > Yeah, but there's also no other driver that could cause a stuck dma_fence,
> > > so why is reset not cleaning up the mess here? Irrespective of why the gpu
> > > is stuck, the kernel should at least complete all the dma_fences even if
> > > the gpu for some reason is terminally ill ...
> >
> > That's a good question as well. I'm digging into this.
> >
> > My best theory is that the amdgpu packages disabled GPU reset for some
> > reason.
> 
> The timeout for compute queues is infinite because of long-running
> compute kernels.  You can override it with the amdgpu.lockup_timeout
> parameter.

Uh, that doesn't work. If you want infinite compute queues you need the
amdkfd model with preempt-ctx dma_fence. If you allow a normal CS ioctl to
run forever, you just hang the kernel whenever userspace feels like it. Not
just the GPU: anything in the kernel that allocates memory, irrespective of
process, can hang. That's no good.
-Daniel
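
(For concreteness, a minimal sketch of the override Alex mentions above;
amdgpu.lockup_timeout is a documented module parameter, the 10000 ms value
here is purely illustrative, and newer kernels also accept per-queue values:)

    # one-off, on the kernel command line:
    #   amdgpu.lockup_timeout=10000
    # or persistently, via modprobe configuration:
    echo "options amdgpu lockup_timeout=10000" > /etc/modprobe.d/amdgpu.conf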

> 
> Alex
> 
> >
> > But the much more interesting question is why we end up in this call
> > path. I've pinged internally, but east coast is not awake yet :)
> >
> > Christian.
> >
> > > -Daniel
> > >
> > >> This is a pretty standard hardware lockup, but I'm still waiting for an
> > >> explanation of why we end up in this call path in the first place.
> > >>
> > >> Christian.
> > >>
> > >>>
> > >>>>>> Christian.
> > >>>>>>
> > >>>>>>> -Daniel
> > >>>>>>>
> > >>>>>>>> Which OpenCL stack are you using?
> > >>>>>>>>
> > >>>>>>>> Regards,
> > >>>>>>>> Christian.
> > >>>>>>>>
> > >>>>>>>> On 03.02.21 at 09:33, Daniel Gomez wrote:
> > >>>>>>>>> [...]

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [amdgpu] deadlock
  2021-02-03 14:27                     ` Daniel Vetter
@ 2021-02-03 14:33                       ` Bridgman, John
  2021-02-03 14:41                         ` Daniel Vetter
  0 siblings, 1 reply; 20+ messages in thread
From: Bridgman, John @ 2021-02-03 14:33 UTC (permalink / raw)
  To: Daniel Vetter, Alex Deucher
  Cc: Linux Kernel Mailing List, dri-devel, amd-gfx list, Deucher,
	Alexander, Daniel Gomez, Koenig, Christian

>> Uh, that doesn't work. If you want infinite compute queues you need the
>> amdkfd model with preempt-ctx dma_fence. If you allow a normal CS ioctl to
>> run forever, you just hang the kernel whenever userspace feels like it. Not
>> just the GPU: anything in the kernel that allocates memory, irrespective of
>> process, can hang. That's no good.

We have moved from using gfx paths to using kfd paths as of the 20.45 release a couple of months ago. Not sure if that applies to APUs yet, but if not I would expect it to be just a matter of time.

Thanks,
John
  Original Message
From: Daniel Vetter
Sent: Wednesday, February 3, 2021 9:27 AM
To: Alex Deucher
Cc: Linux Kernel Mailing List; dri-devel; amd-gfx list; Deucher, Alexander; Daniel Gomez; Koenig, Christian
Subject: Re: [amdgpu] deadlock


[...]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [amdgpu] deadlock
  2021-02-03  8:33 [amdgpu] deadlock Daniel Gomez
  2021-02-03  8:36 ` Christian König
@ 2021-02-03 14:37 ` Christian König
  2021-02-03 14:54   ` Daniel Gomez
  1 sibling, 1 reply; 20+ messages in thread
From: Christian König @ 2021-02-03 14:37 UTC (permalink / raw)
  To: Daniel Gomez, amd-gfx, dri-devel; +Cc: linux-kernel, alexander.deucher

Hi Daniel,

I've talked a bit with our internal team.

The problem is that the 20.20 release still uses the older OpenCL stack,
which obviously has a bug here and causes a hang.

The best approach I can give you is to switch to the ROCm stack instead.

Regards,
Christian.
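
(A quick way to verify which OpenCL stack is actually answering is to ask
clinfo; a minimal sketch, noting that the exact platform and driver strings
differ between the legacy and ROCm stacks and between releases:)

    # show which OpenCL platform/driver is installed and responding
    clinfo | grep -iE 'platform name|driver version'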

On 03.02.21 at 09:33, Daniel Gomez wrote:
> [...]


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [amdgpu] deadlock
  2021-02-03 14:33                       ` Bridgman, John
@ 2021-02-03 14:41                         ` Daniel Vetter
  2021-02-03 16:42                           ` Alex Deucher
  0 siblings, 1 reply; 20+ messages in thread
From: Daniel Vetter @ 2021-02-03 14:41 UTC (permalink / raw)
  To: Bridgman, John
  Cc: Alex Deucher, Linux Kernel Mailing List, dri-devel, amd-gfx list,
	Deucher, Alexander, Daniel Gomez, Koenig, Christian

On Wed, Feb 3, 2021 at 3:33 PM Bridgman, John <John.Bridgman@amd.com> wrote:
>
> >> Uh, that doesn't work. If you want infinite compute queues you need the
> >> amdkfd model with preempt-ctx dma_fence. If you allow a normal CS ioctl to
> >> run forever, you just hang the kernel whenever userspace feels like it. Not
> >> just the GPU: anything in the kernel that allocates memory, irrespective of
> >> process, can hang. That's no good.
>
> We have moved from using gfx paths to using kfd paths as of the 20.45 release a couple of months ago. Not sure if that applies to APUs yet, but if not I would expect it to be just a matter of time.

Yeah, but that still leaves a DoS attack open. I think we have to
change the reset timeout for compute kernels to something reasonable
to close that (and eat some of the angry bug reports and politely
tell people to pls upgrade). Hanging GPUs is kinda fine (but shouldn't
really affect other processes, if at all possible); hanging kernels at
large, not so much.
-Daniel
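
(For readers following the fence discussion: the wait at the core of this
hang is dma_resv_wait_timeout_rcu() with MAX_SCHEDULE_TIMEOUT, so a single
fence that never signals blocks the mmu notifier, and with it exit_mmap(),
forever. A minimal sketch of the bounded alternative, not the actual
driver code:)

    #include <linux/dma-resv.h>
    #include <linux/jiffies.h>

    static long wait_bounded(struct dma_resv *resv)
    {
            /* wait_all=true: wait on shared and exclusive fences;
             * intr=false: the wait is not interruptible by signals. */
            long r = dma_resv_wait_timeout_rcu(resv, true, false,
                                               msecs_to_jiffies(10000));

            /* r > 0: all fences signaled, r == 0: timed out, r < 0: error.
             * With MAX_SCHEDULE_TIMEOUT instead, a job stuck on a queue
             * with an infinite lockup timeout makes this wait unkillable. */
            return r;
    }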

> Thanks,
> John
> [...]
> > > >>>>>>>>> [  738.358844]  ? amdgpu_drm_ioctl+0x6b/0x80 [amdgpu]
> > > >>>>>>>>> [  738.359044]  exit_to_user_mode_prepare+0xf2/0x1b0
> > > >>>>>>>>> [  738.359054]  syscall_exit_to_user_mode+0x19/0x60
> > > >>>>>>>>> [  738.359062]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > >>>>>>>>> [  738.359069] RIP: 0033:0x7f6b89a51887
> > > >>>>>>>>> [  738.359076] RSP: 002b:00007f6b82b54b18 EFLAGS: 00000246 ORIG_RAX:
> > > >>>>>>>>> 0000000000000010
> > > >>>>>>>>> [  738.359086] RAX: fffffffffffffe00 RBX: 00007f6b82b54b50 RCX: 00007f6b89a51887
> > > >>>>>>>>> [  738.359091] RDX: 00007f6b82b54b50 RSI: 00000000c02064c3 RDI: 0000000000000007
> > > >>>>>>>>> [  738.359096] RBP: 00000000c02064c3 R08: 0000000000000003 R09: 00007f6b82b54bbc
> > > >>>>>>>>> [  738.359101] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000165a0bc00
> > > >>>>>>>>> [  738.359106] R13: 0000000000000007 R14: 0000000000000001 R15: 0000000000000000
> > > >>>>>>>>> [  738.359129]
> > > >>>>>>>>>                    Showing all locks held in the system:
> > > >>>>>>>>> [  738.359141] 1 lock held by khungtaskd/54:
> > > >>>>>>>>> [  738.359148]  #0: ffffffff829f6840 (rcu_read_lock){....}-{1:2}, at:
> > > >>>>>>>>> debug_show_all_locks+0x15/0x183
> > > >>>>>>>>> [  738.359187] 1 lock held by systemd-journal/174:
> > > >>>>>>>>> [  738.359202] 1 lock held by MatrixMultiplic/653:
> > > >>>>>>>>> [  738.359206]  #0: ffff88810e364fe0
> > > >>>>>>>>> (&adev->notifier_lock){+.+.}-{3:3}, at:
> > > >>>>>>>>> amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
> > > >>>>>>>>>
> > > >>>>>>>>> Daniel
> > > >>>>> --
> > > >>>>> Daniel Vetter
> > > >>>>> Software Engineer, Intel Corporation
> > > >>>>> http://blog.ffwll.ch
> > >
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [amdgpu] deadlock
  2021-02-03 14:37 ` Christian König
@ 2021-02-03 14:54   ` Daniel Gomez
  0 siblings, 0 replies; 20+ messages in thread
From: Daniel Gomez @ 2021-02-03 14:54 UTC (permalink / raw)
  To: Christian König
  Cc: amd-gfx list, dri-devel, Linux Kernel Mailing List, Alex Deucher

On Wed, 3 Feb 2021 at 15:37, Christian König <christian.koenig@amd.com> wrote:
>
> Hi Daniel,
>
> I've talked a bit with our internal team.
>
> The problem is that the 20.20 release still uses the older OpenCL stack
> which obviously has a bug here and causes a hang.
>
> The best approach I can give you is to switch to the ROCm stack instead.
Thanks Christian. I'll try with the ROCm stack then. As far as I understood,
it should work because the part of the code where it now hangs is not
actually used by the ROCm stack, is that correct? The hang/bug will still be
there, though, even if that stack never exercises it.
Anyway, I'll keep you posted on this change.
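
For reference, the path in question looks roughly like this in 5.11
(paraphrased from drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c, so take it as a
sketch rather than the verbatim source):

static bool amdgpu_mn_invalidate_gfx(struct mmu_interval_notifier *mni,
                                     const struct mmu_notifier_range *range,
                                     unsigned long cur_seq)
{
        struct amdgpu_bo *bo = container_of(mni, struct amdgpu_bo, notifier);
        struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
        long r;

        if (!mmu_notifier_range_blockable(range))
                return false;

        /* the lock lockdep reports held at amdgpu_mn_invalidate_gfx+0x34 */
        mutex_lock(&adev->notifier_lock);

        mmu_interval_set_seq(mni, cur_seq);

        /* the wait at +0x55: blocks until every fence on the BO has
         * signalled, with no timeout (forever, if one never signals) */
        r = dma_resv_wait_timeout_rcu(bo->tbo.base.resv, true, false,
                                      MAX_SCHEDULE_TIMEOUT);
        mutex_unlock(&adev->notifier_lock);
        if (r <= 0)
                DRM_ERROR("(%ld) failed to wait for user bo\n", r);

        return true;
}

So when the GPU is wedged and those fences never complete, the exiting
process sleeps there holding adev->notifier_lock, which is exactly what the
hung task report shows.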

>
> Regards,
> Christian.
>
> Am 03.02.21 um 09:33 schrieb Daniel Gomez:
> > [snip]
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [amdgpu] deadlock
  2021-02-03 14:41                         ` Daniel Vetter
@ 2021-02-03 16:42                           ` Alex Deucher
  2021-02-03 17:14                             ` Daniel Vetter
  0 siblings, 1 reply; 20+ messages in thread
From: Alex Deucher @ 2021-02-03 16:42 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Bridgman, John, Linux Kernel Mailing List, dri-devel,
	amd-gfx list, Deucher, Alexander, Daniel Gomez, Koenig,
	Christian

On Wed, Feb 3, 2021 at 9:42 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Wed, Feb 3, 2021 at 3:33 PM Bridgman, John <John.Bridgman@amd.com> wrote:
> >
> > >>Uh, that doesn't work. If you want infinite compute queues you need the
> > amdkfd model with preempt-ctx dma_fence. If you allow normal cs ioctl to
> > run forever, you just hang the kernel whenever userspace feels like. Not
> > just the gpu, the kernel (anything that allocates memory, irrespective of
> > process can hang). That's no good.
> >
> > We have moved from using gfx paths to using kfd paths as of the 20.45 release a couple of months ago. Not sure if that applies to APUs yet, but if not I would expect it to just be a matter of time.
>
> Yeah, but that still leaves a DoS attack open. I think we have to
> change the reset timeout for compute kernels to something reasonable
> to close that (and eat some of the angry bug reporters and politely
> tell them to pls upgrade). Hanging GPUs is kinda fine (but shouldn't
> affect other processes, really, if at all possible), hanging kernels at
> large not so much.

This can also potentially affect long-running Vulkan or OpenGL compute
kernels.  Not sure we have a good solution for them.  People are
starting to build ML stuff on Vulkan.
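
(If you want to experiment, the override I mentioned is a module
parameter; on bare metal something like

    amdgpu.lockup_timeout=10000,10000,10000,10000

on the kernel command line should put a roughly 10 second bound on all
queues, compute included.  This is from memory, so check modinfo amdgpu
for the exact value format on your kernel version.)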

Alex


> -Daniel
>
> > Thanks,
> > John
> >   Original Message
> > From: Daniel Vetter
> > Sent: Wednesday, February 3, 2021 9:27 AM
> > To: Alex Deucher
> > Cc: Linux Kernel Mailing List; dri-devel; amd-gfx list; Deucher, Alexander; Daniel Gomez; Koenig, Christian
> > Subject: Re: [amdgpu] deadlock
> >
> >
> > On Wed, Feb 03, 2021 at 08:56:17AM -0500, Alex Deucher wrote:
> > > On Wed, Feb 3, 2021 at 7:30 AM Christian König <christian.koenig@amd.com> wrote:
> > > >
> > > > Am 03.02.21 um 13:24 schrieb Daniel Vetter:
> > > > > On Wed, Feb 03, 2021 at 01:21:20PM +0100, Christian König wrote:
> > > > >> Am 03.02.21 um 12:45 schrieb Daniel Gomez:
> > > > >>> On Wed, 3 Feb 2021 at 10:47, Daniel Gomez <daniel@qtec.com> wrote:
> > > > >>>> On Wed, 3 Feb 2021 at 10:17, Daniel Vetter <daniel@ffwll.ch> wrote:
> > > > >>>>> On Wed, Feb 3, 2021 at 9:51 AM Christian König <christian.koenig@amd.com> wrote:
> > > > >>>>>> Am 03.02.21 um 09:48 schrieb Daniel Vetter:
> > > > >>>>>>> On Wed, Feb 3, 2021 at 9:36 AM Christian König <christian.koenig@amd.com> wrote:
> > > > >>>>>>>> Hi Daniel,
> > > > >>>>>>>>
> > > > >>>>>>>> this is not a deadlock, but rather a hardware lockup.
> > > > >>>>>>> Are you sure? Ime getting stuck in dma_fence_wait has generally good
> > > > >>>>>>> chance of being a dma_fence deadlock. GPU hang should never result in
> > > > >>>>>>> a forever stuck dma_fence.
> > > > >>>>>> Yes, I'm pretty sure. Otherwise the hardware clocks wouldn't go up like
> > > > >>>>>> this.
> > > > >>>>> Maybe clarifying, could be both. TDR should notice and get us out of
> > > > >>>>> this, but if there's a dma_fence deadlock and we can't re-emit or
> > > > >>>>> force complete the pending things, then we're stuck for good.
> > > > >>>>> -Daniel
> > > > >>>>>
> > > > >>>>>> Question is rather why we end up in the userptr handling for GFX? Our
> > > > >>>>>> ROCm OpenCL stack shouldn't use this.
> > > > >>>>>>
> > > > >>>>>>> Daniel, can you pls re-hang your machine and then dump backtraces of
> > > > >>>>>>> all tasks into dmesg with sysrq-t, and then attach that? Without all
> > > > >>>>>>> the backtraces it's tricky to construct the full dependency chain of
> > > > >>>>>>> what's going on. Also is this plain -rc6, not some more patches on
> > > > >>>>>>> top?
> > > > >>>>>> Yeah, that's still a good idea to have.
> > > > >>>> Here the full backtrace dmesg logs after the hang:
> > > > >>>> https://pastebin.com/raw/kzivm2L3
> > > > >>>>
> > > > >>>> This is another dmesg log with the backtraces after SIGKILL the matrix process:
> > > > >>>> (I didn't have the sysrq enable at the time):
> > > > >>>> https://pastebin.com/raw/pRBwGcj1
> > > > >>> I've now removed all our v4l2 patches and did the same test with the 'plain'
> > > > >>> mainline version (-rc6).
> > > > >>>
> > > > >>> Reference: 3aaf0a27ffc29b19a62314edd684b9bc6346f9a8
> > > > >>>
> > > > >>> Same error, same behaviour. Full dmesg log attached:
> > > > >>> https://pastebin.com/raw/KgaEf7Y1
> > > > >>> Note:
> > > > >>>     dmesg with sysrq-t before running the test starts in [  122.016502]
> > > > >>> sysrq: Show State
> > > > >>>     dmesg with sysrq-t after the test starts in: [  495.587671] sysrq: Show State
> > > > >> There is nothing amdgpu related in there except for waiting for the
> > > > >> hardware.
> > > > > Yeah, but there's also no other driver that could cause a stuck dma_fence,
> > > > > so why is reset not cleaning up the mess here? Irrespective of why the gpu
> > > > > is stuck, the kernel should at least complete all the dma_fences even if
> > > > > the gpu for some reason is terminally ill ...
> > > >
> > > > That's a good question as well. I'm digging into this.
> > > >
> > > > My best theory is that the amdgpu packages disabled GPU reset for some
> > > > reason.
> > >
> > > The timeout for compute queues is infinite because of long running
> > > compute kernels.  You can override with the amdgpu.lockup_timeout
> > > parameter.
> >
> > Uh, that doesn't work. If you want infinite compute queues you need the
> > amdkfd model with preempt-ctx dma_fence. If you allow normal cs ioctl to
> > run forever, you just hang the kernel whenever userspace feels like. Not
> > just the gpu, the kernel (anything that allocates memory, irrespective of
> > process can hang). That's no good.
> > -Daniel
> >
> > >
> > > Alex
> > >
> > > >
> > > > But the much more interesting question is why we end up in this call
> > > > path. I've pinged internally, but east coast is not awake yet :)
> > > >
> > > > Christian.
> > > >
> > > > > -Daniel
> > > > >
> > > > >> This is a pretty standard hardware lockup, but I'm still waiting for an
> > > > >> explanation why we end up in this call path in the first place.
> > > > >>
> > > > >> Christian.
> > > > >>
> > > > >>>
> > > > >>>>>> Christian.
> > > > >>>>>>
> > > > >>>>>>> -Daniel
> > > > >>>>>>>
> > > > >>>>>>>> Which OpenCl stack are you using?
> > > > >>>>>>>>
> > > > >>>>>>>> Regards,
> > > > >>>>>>>> Christian.
> > > > >>>>>>>>
> > > > >>>>>>>> Am 03.02.21 um 09:33 schrieb Daniel Gomez:
> > > > >>>>>>>>> [snip]
> > > > >>>>> --
> > > > >>>>> Daniel Vetter
> > > > >>>>> Software Engineer, Intel Corporation
> > > > >>>>> http://blog.ffwll.ch
> > > >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [amdgpu] deadlock
  2021-02-03 16:42                           ` Alex Deucher
@ 2021-02-03 17:14                             ` Daniel Vetter
  2021-02-03 17:33                               ` Daniel Vetter
  0 siblings, 1 reply; 20+ messages in thread
From: Daniel Vetter @ 2021-02-03 17:14 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Bridgman, John, Linux Kernel Mailing List, dri-devel,
	amd-gfx list, Deucher, Alexander, Daniel Gomez, Koenig,
	Christian

On Wed, Feb 3, 2021 at 5:42 PM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Wed, Feb 3, 2021 at 9:42 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Wed, Feb 3, 2021 at 3:33 PM Bridgman, John <John.Bridgman@amd.com> wrote:
> > >
> > > >>Uh, that doesn't work. If you want infinite compute queues you need the
> > > amdkfd model with preempt-ctx dma_fence. If you allow normal cs ioctl to
> > > run forever, you just hang the kernel whenever userspace feels like. Not
> > > just the gpu, the kernel (anything that allocates memory, irrespective of
> > > process can hang). That's no good.
> > >
> > > We have moved from using gfx paths to using kfd paths as of the 20.45 release a couple of months ago. Not sure if that applies to APUs yet, but if not I would expect it to just be a matter of time.
> >
> > Yeah, but that still leaves a DoS attack open. I think we have to
> > change the reset timeout for compute kernels to something reasonable
> > to close that (and eat some of the angry bug reporters and politely
> > tell them to pls upgrade). Hanging GPUs is kinda fine (but shouldn't
> > affect other processes, really, if at all possible), hanging kernels at
> > large not so much.
>
> This can also potentially affect long-running Vulkan or OpenGL compute
> kernels.  Not sure we have a good solution for them.  People are
> starting to build ML stuff on Vulkan.

Yeah, they need the compute mode with userspace fences and long-running
batches for that. It won't work on legacy implicitly synced CS, at least
not without major surgery. That's why I asked whether the entirely
separate amdkfd world for compute is really such a great idea, since I
expect it'll come and bite us pretty badly.

Fundamentally you can't have indefinite fences with implicitly synced
CS, and long-running jobs are just one of these indefinite fences.
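
To make the failure mode concrete, here is a deliberately simplified
sketch (hypothetical code, not from any tree; only dma_resv_wait_timeout_rcu
and MAX_SCHEDULE_TIMEOUT are the real interfaces from the trace above):

/* Path A: a core kernel path (reclaim, mmu notifier, process exit)
 * has to wait for all fences on a buffer before it can make progress. */
static void invalidate_or_reclaim(struct dma_resv *resv)
{
        /* unbounded wait: returns only when every fence signals */
        dma_resv_wait_timeout_rcu(resv, true, false, MAX_SCHEDULE_TIMEOUT);
}

/* Path B: the fence only signals when the submitted job completes, and
 * a long-running compute job has no bound on that. If the job in turn
 * needs userspace (or memory allocation) to make progress, A waits on B
 * and B transitively waits on A. Without a timeout/reset that breaks
 * the fence, the whole kernel, not just the GPU, stays stuck. */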
-Daniel

>
> Alex
>
>
> > -Daniel
> >
> > > Thanks,
> > > John
> > >   Original Message
> > > From: Daniel Vetter
> > > Sent: Wednesday, February 3, 2021 9:27 AM
> > > To: Alex Deucher
> > > Cc: Linux Kernel Mailing List; dri-devel; amd-gfx list; Deucher, Alexander; Daniel Gomez; Koenig, Christian
> > > Subject: Re: [amdgpu] deadlock
> > >
> > >
> > > On Wed, Feb 03, 2021 at 08:56:17AM -0500, Alex Deucher wrote:
> > > > On Wed, Feb 3, 2021 at 7:30 AM Christian König <christian.koenig@amd.com> wrote:
> > > > >
> > > > > Am 03.02.21 um 13:24 schrieb Daniel Vetter:
> > > > > > On Wed, Feb 03, 2021 at 01:21:20PM +0100, Christian König wrote:
> > > > > >> Am 03.02.21 um 12:45 schrieb Daniel Gomez:
> > > > > >>> On Wed, 3 Feb 2021 at 10:47, Daniel Gomez <daniel@qtec.com> wrote:
> > > > > >>>> On Wed, 3 Feb 2021 at 10:17, Daniel Vetter <daniel@ffwll.ch> wrote:
> > > > > >>>>> On Wed, Feb 3, 2021 at 9:51 AM Christian König <christian.koenig@amd.com> wrote:
> > > > > >>>>>> Am 03.02.21 um 09:48 schrieb Daniel Vetter:
> > > > > >>>>>>> On Wed, Feb 3, 2021 at 9:36 AM Christian König <christian.koenig@amd.com> wrote:
> > > > > >>>>>>>> Hi Daniel,
> > > > > >>>>>>>>
> > > > > >>>>>>>> this is not a deadlock, but rather a hardware lockup.
> > > > > >>>>>>> Are you sure? Ime getting stuck in dma_fence_wait has generally good
> > > > > >>>>>>> chance of being a dma_fence deadlock. GPU hang should never result in
> > > > > >>>>>>> a forever stuck dma_fence.
> > > > > >>>>>> Yes, I'm pretty sure. Otherwise the hardware clocks wouldn't go up like
> > > > > >>>>>> this.
> > > > > >>>>> Maybe clarifying, could be both. TDR should notice and get us out of
> > > > > >>>>> this, but if there's a dma_fence deadlock and we can't re-emit or
> > > > > >>>>> force complete the pending things, then we're stuck for good.
> > > > > >>>>> -Daniel
> > > > > >>>>>
> > > > > >>>>>> Question is rather why we end up in the userptr handling for GFX? Our
> > > > > >>>>>> ROCm OpenCL stack shouldn't use this.
> > > > > >>>>>>
> > > > > >>>>>>> Daniel, can you pls re-hang your machine and then dump backtraces of
> > > > > >>>>>>> all tasks into dmesg with sysrq-t, and then attach that? Without all
> > > > > >>>>>>> the backtraces it's tricky to construct the full dependency chain of
> > > > > >>>>>>> what's going on. Also is this plain -rc6, not some more patches on
> > > > > >>>>>>> top?
> > > > > >>>>>> Yeah, that's still a good idea to have.
> > > > > >>>> Here the full backtrace dmesg logs after the hang:
> > > > > >>>> https://pastebin.com/raw/kzivm2L3
> > > > > >>>>
> > > > > >>>> This is another dmesg log with the backtraces after SIGKILL the matrix process:
> > > > > >>>> (I didn't have the sysrq enable at the time):
> > > > > >>>> https://pastebin.com/raw/pRBwGcj1
> > > > > >>> I've now removed all our v4l2 patches and did the same test with the 'plain'
> > > > > >>> mainline version (-rc6).
> > > > > >>>
> > > > > >>> Reference: 3aaf0a27ffc29b19a62314edd684b9bc6346f9a8
> > > > > >>>
> > > > > >>> Same error, same behaviour. Full dmesg log attached:
> > > > > >>> https://pastebin.com/raw/KgaEf7Y1
> > > > > >>> Note:
> > > > > >>>     dmesg with sysrq-t before running the test starts in [  122.016502]
> > > > > >>> sysrq: Show State
> > > > > >>>     dmesg with sysrq-t after the test starts in: [  495.587671] sysrq: Show State
> > > > > >> There is nothing amdgpu related in there except for waiting for the
> > > > > >> hardware.
> > > > > > Yeah, but there's also no other driver that could cause a stuck dma_fence,
> > > > > > so why is reset not cleaning up the mess here? Irrespective of why the gpu
> > > > > > is stuck, the kernel should at least complete all the dma_fences even if
> > > > > > the gpu for some reason is terminally ill ...
> > > > >
> > > > > That's a good question as well. I'm digging into this.
> > > > >
> > > > > My best theory is that the amdgpu packages disabled GPU reset for some
> > > > > reason.
> > > >
> > > > The timeout for compute queues is infinite because of long running
> > > > compute kernels.  You can override with the amdgpu.lockup_timeout
> > > > parameter.
> > >
> > > Uh, that doesn't work. If you want infinite compute queues you need the
> > > amdkfd model with preempt-ctx dma_fence. If you allow normal cs ioctl to
> > > run forever, you just hang the kernel whenever userspace feels like. Not
> > > just the gpu, the kernel (anything that allocates memory, irrespective of
> > > process can hang). That's no good.
> > > -Daniel
> > >
> > > >
> > > > Alex
> > > >
> > > > >
> > > > > But the much more interesting question is why we end up in this call
> > > > > path. I've pinged internally, but east coast is not awake yet :)
> > > > >
> > > > > Christian.
> > > > >
> > > > > > -Daniel
> > > > > >
> > > > > >> This is a pretty standard hardware lockup, but I'm still waiting for an
> > > > > >> explanation why we end up in this call path in the first place.
> > > > > >>
> > > > > >> Christian.
> > > > > >>
> > > > > >>>
> > > > > >>>>>> Christian.
> > > > > >>>>>>
> > > > > >>>>>>> -Daniel
> > > > > >>>>>>>
> > > > > >>>>>>>> Which OpenCl stack are you using?
> > > > > >>>>>>>>
> > > > > >>>>>>>> Regards,
> > > > > >>>>>>>> Christian.
> > > > > >>>>>>>>
> > > > > >>>>>>>> Am 03.02.21 um 09:33 schrieb Daniel Gomez:
> > > > > >>>>>>>>> Hi all,
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I have a deadlock with the amdgpu mainline driver when running in parallel two
> > > > > >>>>>>>>> OpenCL applications. So far, we've been able to replicate it easily by executing
> > > > > >>>>>>>>> clinfo and MatrixMultiplication (from AMD opencl-samples). It's quite old the
> > > > > >>>>>>>>> opencl-samples so, if you have any other suggestion for testing I'd be very
> > > > > >>>>>>>>> happy to test it as well.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> How to replicate the issue:
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> # while true; do /usr/bin/MatrixMultiplication --device gpu \
> > > > > >>>>>>>>>         --deviceId 0 -x 1000 -y 1000 -z 1000 -q -t -i 50; done
> > > > > >>>>>>>>> # while true; do clinfo; done
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Output:
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> After a minute or less (sometimes could be more) I can see that
> > > > > >>>>>>>>> MatrixMultiplication and clinfo hang. In addition, with radeontop you can see
> > > > > >>>>>>>>> how the Graphics pipe goes from ~50% to 100%. Also the shader clocks
> > > > > >>>>>>>>> goes up from ~35% to ~96%.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> clinfo keeps printing:
> > > > > >>>>>>>>> ioctl(7, DRM_IOCTL_SYNCOBJ_WAIT, 0x7ffe46e5f950) = -1 ETIME (Timer expired)
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> And MatrixMultiplication prints the following (strace) if you try to
> > > > > >>>>>>>>> kill the process:
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> sched_yield()                           = 0
> > > > > >>>>>>>>> futex(0x557e945343b8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0,
> > > > > >>>>>>>>> NULL, FUTEX_BITSET_MATCH_ANYstrace: Process 651 detached
> > > > > >>>>>>>>>      <detached ...>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> After this, the gpu is not functional at all and you'd need a power cycle reset
> > > > > >>>>>>>>> to restore the system.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Hardware info:
> > > > > >>>>>>>>> CPU: AMD Ryzen Embedded V1605B with Radeon Vega Gfx (8) @ 2.000GHz
> > > > > >>>>>>>>> GPU: AMD ATI Radeon Vega Series / Radeon Vega Mobile Series
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> 03:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
> > > > > >>>>>>>>> [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series]
> > > > > >>>>>>>>> (rev 83)
> > > > > >>>>>>>>>         DeviceName: Broadcom 5762
> > > > > >>>>>>>>>         Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge
> > > > > >>>>>>>>> [Radeon Vega Series / Radeon Vega Mobile Series]
> > > > > >>>>>>>>>         Kernel driver in use: amdgpu
> > > > > >>>>>>>>>         Kernel modules: amdgpu
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Linux kernel info:
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> root@qt5222:~# uname -a
> > > > > >>>>>>>>> Linux qt5222 5.11.0-rc6-qtec-standard #2 SMP Tue Feb 2 09:41:46 UTC
> > > > > >>>>>>>>> 2021 x86_64 x86_64 x86_64 GNU/Linux
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> By enabling the kernel locks stats I could see the MatrixMultiplication is
> > > > > >>>>>>>>> hanged in the amdgpu_mn_invalidate_gfx function:
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> [  738.359202] 1 lock held by MatrixMultiplic/653:
> > > > > >>>>>>>>> [  738.359206]  #0: ffff88810e364fe0
> > > > > >>>>>>>>> (&adev->notifier_lock){+.+.}-{3:3}, at:
> > > > > >>>>>>>>> amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I can see in the the amdgpu_mn_invalidate_gfx function: the
> > > > > >>>>>>>>> dma_resv_wait_timeout_rcu uses wait_all (fences) and MAX_SCHEDULE_TIMEOUT so, I
> > > > > >>>>>>>>> guess the code gets stuck there waiting forever. According to the
> > > > > >>>>>>>>> documentation: "When somebody tries to invalidate the page tables we block the
> > > > > >>>>>>>>> update until all operations on the pages in question are completed, then those
> > > > > >>>>>>>>> pages are marked  as accessed and also dirty if it wasn’t a read only access."
> > > > > >>>>>>>>> Looks like the fences are deadlocked and therefore, it never returns. Could it
> > > > > >>>>>>>>> be possible? any hint to where can I look to fix this?
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Thank you  in advance.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Here the full dmesg output:
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> [  738.337726] INFO: task MatrixMultiplic:653 blocked for more than 122 seconds.
> > > > > >>>>>>>>> [  738.344937]       Not tainted 5.11.0-rc6-qtec-standard #2
> > > > > >>>>>>>>> [  738.350384] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > > > >>>>>>>>> disables this message.
> > > > > >>>>>>>>> [  738.358240] task:MatrixMultiplic state:D stack:    0 pid:  653
> > > > > >>>>>>>>> ppid:     1 flags:0x00004000
> > > > > >>>>>>>>> [  738.358254] Call Trace:
> > > > > >>>>>>>>> [  738.358261]  ? dma_fence_default_wait+0x1eb/0x230
> > > > > >>>>>>>>> [  738.358276]  __schedule+0x370/0x960
> > > > > >>>>>>>>> [  738.358291]  ? dma_fence_default_wait+0x117/0x230
> > > > > >>>>>>>>> [  738.358297]  ? dma_fence_default_wait+0x1eb/0x230
> > > > > >>>>>>>>> [  738.358305]  schedule+0x51/0xc0
> > > > > >>>>>>>>> [  738.358312]  schedule_timeout+0x275/0x380
> > > > > >>>>>>>>> [  738.358324]  ? dma_fence_default_wait+0x1eb/0x230
> > > > > >>>>>>>>> [  738.358332]  ? mark_held_locks+0x4f/0x70
> > > > > >>>>>>>>> [  738.358341]  ? dma_fence_default_wait+0x117/0x230
> > > > > >>>>>>>>> [  738.358347]  ? lockdep_hardirqs_on_prepare+0xd4/0x180
> > > > > >>>>>>>>> [  738.358353]  ? _raw_spin_unlock_irqrestore+0x39/0x40
> > > > > >>>>>>>>> [  738.358362]  ? dma_fence_default_wait+0x117/0x230
> > > > > >>>>>>>>> [  738.358370]  ? dma_fence_default_wait+0x1eb/0x230
> > > > > >>>>>>>>> [  738.358375]  dma_fence_default_wait+0x214/0x230
> > > > > >>>>>>>>> [  738.358384]  ? dma_fence_release+0x1a0/0x1a0
> > > > > >>>>>>>>> [  738.358396]  dma_fence_wait_timeout+0x105/0x200
> > > > > >>>>>>>>> [  738.358405]  dma_resv_wait_timeout_rcu+0x1aa/0x5e0
> > > > > >>>>>>>>> [  738.358421]  amdgpu_mn_invalidate_gfx+0x55/0xa0 [amdgpu]
> > > > > >>>>>>>>> [  738.358688]  __mmu_notifier_release+0x1bb/0x210
> > > > > >>>>>>>>> [  738.358710]  exit_mmap+0x2f/0x1e0
> > > > > >>>>>>>>> [  738.358723]  ? find_held_lock+0x34/0xa0
> > > > > >>>>>>>>> [  738.358746]  mmput+0x39/0xe0
> > > > > >>>>>>>>> [  738.358756]  do_exit+0x5c3/0xc00
> > > > > >>>>>>>>> [  738.358763]  ? find_held_lock+0x34/0xa0
> > > > > >>>>>>>>> [  738.358780]  do_group_exit+0x47/0xb0
> > > > > >>>>>>>>> [  738.358791]  get_signal+0x15b/0xc50
> > > > > >>>>>>>>> [  738.358807]  arch_do_signal_or_restart+0xaf/0x710
> > > > > >>>>>>>>> [  738.358816]  ? lockdep_hardirqs_on_prepare+0xd4/0x180
> > > > > >>>>>>>>> [  738.358822]  ? _raw_spin_unlock_irqrestore+0x39/0x40
> > > > > >>>>>>>>> [  738.358831]  ? ktime_get_mono_fast_ns+0x50/0xa0
> > > > > >>>>>>>>> [  738.358844]  ? amdgpu_drm_ioctl+0x6b/0x80 [amdgpu]
> > > > > >>>>>>>>> [  738.359044]  exit_to_user_mode_prepare+0xf2/0x1b0
> > > > > >>>>>>>>> [  738.359054]  syscall_exit_to_user_mode+0x19/0x60
> > > > > >>>>>>>>> [  738.359062]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > > >>>>>>>>> [  738.359069] RIP: 0033:0x7f6b89a51887
> > > > > >>>>>>>>> [  738.359076] RSP: 002b:00007f6b82b54b18 EFLAGS: 00000246 ORIG_RAX:
> > > > > >>>>>>>>> 0000000000000010
> > > > > >>>>>>>>> [  738.359086] RAX: fffffffffffffe00 RBX: 00007f6b82b54b50 RCX: 00007f6b89a51887
> > > > > >>>>>>>>> [  738.359091] RDX: 00007f6b82b54b50 RSI: 00000000c02064c3 RDI: 0000000000000007
> > > > > >>>>>>>>> [  738.359096] RBP: 00000000c02064c3 R08: 0000000000000003 R09: 00007f6b82b54bbc
> > > > > >>>>>>>>> [  738.359101] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000165a0bc00
> > > > > >>>>>>>>> [  738.359106] R13: 0000000000000007 R14: 0000000000000001 R15: 0000000000000000
> > > > > >>>>>>>>> [  738.359129]
> > > > > >>>>>>>>>                    Showing all locks held in the system:
> > > > > >>>>>>>>> [  738.359141] 1 lock held by khungtaskd/54:
> > > > > >>>>>>>>> [  738.359148]  #0: ffffffff829f6840 (rcu_read_lock){....}-{1:2}, at:
> > > > > >>>>>>>>> debug_show_all_locks+0x15/0x183
> > > > > >>>>>>>>> [  738.359187] 1 lock held by systemd-journal/174:
> > > > > >>>>>>>>> [  738.359202] 1 lock held by MatrixMultiplic/653:
> > > > > >>>>>>>>> [  738.359206]  #0: ffff88810e364fe0
> > > > > >>>>>>>>> (&adev->notifier_lock){+.+.}-{3:3}, at:
> > > > > >>>>>>>>> amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Daniel
> > > > > >>>>> --
> > > > > >>>>> Daniel Vetter
> > > > > >>>>> Software Engineer, Intel Corporation
> > > > > >>>>> http://blog.ffwll.ch
> > > > >
> > >
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> > > _______________________________________________
> > > amd-gfx mailing list
> > > amd-gfx@lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> >
> >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [amdgpu] deadlock
  2021-02-03 17:14                             ` Daniel Vetter
@ 2021-02-03 17:33                               ` Daniel Vetter
  0 siblings, 0 replies; 20+ messages in thread
From: Daniel Vetter @ 2021-02-03 17:33 UTC (permalink / raw)
  To: Alex Deucher, Jason Ekstrand
  Cc: Bridgman, John, Linux Kernel Mailing List, dri-devel,
	amd-gfx list, Deucher, Alexander, Daniel Gomez, Koenig,
	Christian

On Wed, Feb 3, 2021 at 6:14 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Wed, Feb 3, 2021 at 5:42 PM Alex Deucher <alexdeucher@gmail.com> wrote:
> >
> > On Wed, Feb 3, 2021 at 9:42 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > >
> > > On Wed, Feb 3, 2021 at 3:33 PM Bridgman, John <John.Bridgman@amd.com> wrote:
> > > >
> > > > >>Uh, that doesn't work. If you want infinite compute queues you need the
> > > > amdkfd model with preempt-ctx dma_fence. If you allow normal cs ioctl to
> > > > run forever, you just hang the kernel whenever userspace feels like. Not
> > > > just the gpu, the kernel (anything that allocates memory, irrespective of
> > > > process, can hang). That's no good.
> > > >
> > > > We have moved from using gfx paths to using kfd paths as of the 20.45 release a couple of months ago. Not sure if that applies to APUs yet, but if not I would expect it to be just a matter of time.
> > >
> > > Yeah, but that still leaves a DoS attack open. I think we have to
> > > change the reset timeout for compute kernels to something reasonable
> > > to close that (and eat some of the angry bug reports and politely
> > > tell those users to please upgrade). Hanging GPUs is kinda fine (but
> > > shouldn't really affect other processes, if at all possible); hanging
> > > kernels at large, not so much.
> >
> > This can also potentially affect long-running Vulkan or OpenGL compute
> > kernels.  Not sure we have a good solution for them.  People are
> > starting to build ML stuff on Vulkan.
>
> Yeah, they need the compute mode with userspace fences and long-running
> batches for that. It won't work on legacy implicit-synced CS, at least
> not without major surgery. That's why I asked whether the entirely
> separate amdkfd world for compute is really such a great idea, since I
> expect it'll come back to bite us pretty badly.
>
> Fundamentally you can't have indefinite fences with implicit-synced
> CS, and long-running jobs are just one kind of indefinite fence.
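
For reference, the wait everything ends up stuck in looks roughly like
this on 5.11 (a simplified sketch of amdgpu_mn_invalidate_gfx from
memory, not a verbatim copy of amdgpu_mn.c, so treat the details as
approximate):

  static bool amdgpu_mn_invalidate_gfx(struct mmu_interval_notifier *mni,
                                       const struct mmu_notifier_range *range,
                                       unsigned long cur_seq)
  {
          struct amdgpu_bo *bo = container_of(mni, struct amdgpu_bo, notifier);
          struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
          long r;

          if (!mmu_notifier_range_blockable(range))
                  return false;

          mutex_lock(&adev->notifier_lock);
          mmu_interval_set_seq(mni, cur_seq);

          /* Block the invalidation until *all* fences on the BO have
           * signalled; with an effectively infinite job timeout, a hung
           * compute job means this never returns. */
          r = dma_resv_wait_timeout_rcu(bo->tbo.base.resv, true, false,
                                        MAX_SCHEDULE_TIMEOUT);
          mutex_unlock(&adev->notifier_lock);
          if (r <= 0)
                  DRM_ERROR("(%ld) failed to wait for user bo\n", r);
          return true;
  }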

Adding Jason, he's been involved in a lot of the indefinite fence
discussion, especially around vk.
-Daniel


>
> >
> > Alex
> >
> >
> > > -Daniel
> > >
> > > > Thanks,
> > > > John
> > > >   Original Message
> > > > From: Daniel Vetter
> > > > Sent: Wednesday, February 3, 2021 9:27 AM
> > > > To: Alex Deucher
> > > > Cc: Linux Kernel Mailing List; dri-devel; amd-gfx list; Deucher, Alexander; Daniel Gomez; Koenig, Christian
> > > > Subject: Re: [amdgpu] deadlock
> > > >
> > > >
> > > > On Wed, Feb 03, 2021 at 08:56:17AM -0500, Alex Deucher wrote:
> > > > > On Wed, Feb 3, 2021 at 7:30 AM Christian König <christian.koenig@amd.com> wrote:
> > > > > >
> > > > > > Am 03.02.21 um 13:24 schrieb Daniel Vetter:
> > > > > > > On Wed, Feb 03, 2021 at 01:21:20PM +0100, Christian König wrote:
> > > > > > >> Am 03.02.21 um 12:45 schrieb Daniel Gomez:
> > > > > > >>> On Wed, 3 Feb 2021 at 10:47, Daniel Gomez <daniel@qtec.com> wrote:
> > > > > > >>>> On Wed, 3 Feb 2021 at 10:17, Daniel Vetter <daniel@ffwll.ch> wrote:
> > > > > > >>>>> On Wed, Feb 3, 2021 at 9:51 AM Christian König <christian.koenig@amd.com> wrote:
> > > > > > >>>>>> Am 03.02.21 um 09:48 schrieb Daniel Vetter:
> > > > > > >>>>>>> On Wed, Feb 3, 2021 at 9:36 AM Christian König <christian.koenig@amd.com> wrote:
> > > > > > >>>>>>>> Hi Daniel,
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> this is not a deadlock, but rather a hardware lockup.
> > > > > > >>>>>>> Are you sure? IME getting stuck in dma_fence_wait has a generally good
> > > > > > >>>>>>> chance of being a dma_fence deadlock. A GPU hang should never result in
> > > > > > >>>>>>> a forever-stuck dma_fence.
> > > > > > >>>>>> Yes, I'm pretty sure. Otherwise the hardware clocks wouldn't go up like
> > > > > > >>>>>> this.
> > > > > > >>>>> Maybe clarifying, could be both. TDR should notice and get us out of
> > > > > > >>>>> this, but if there's a dma_fence deadlock and we can't re-emit or
> > > > > > >>>>> force complete the pending things, then we're stuck for good.
> > > > > > >>>>> -Daniel
> > > > > > >>>>>
> > > > > > >>>>>> The question is rather why we end up in the userptr handling for GFX.
> > > > > > >>>>>> Our ROCm OpenCL stack shouldn't use this.
> > > > > > >>>>>>
> > > > > > >>>>>>> Daniel, can you please re-hang your machine and then dump backtraces of
> > > > > > >>>>>>> all tasks into dmesg with sysrq-t, and then attach that? Without all
> > > > > > >>>>>>> the backtraces it's tricky to construct the full dependency chain of
> > > > > > >>>>>>> what's going on. Also, is this plain -rc6, not some more patches on
> > > > > > >>>>>>> top?
> > > > > > >>>>>> Yeah, that's still a good idea to have.
> > > > > > >>>> Here the full backtrace dmesg logs after the hang:
> > > > > > >>>> https://pastebin.com/raw/kzivm2L3
> > > > > > >>>>
> > > > > > >>>> This is another dmesg log with the backtraces after SIGKILLing the matrix process
> > > > > > >>>> (I didn't have sysrq enabled at the time):
> > > > > > >>>> https://pastebin.com/raw/pRBwGcj1
> > > > > > >>> I've now removed all our v4l2 patches and did the same test with the 'plain'
> > > > > > >>> mainline version (-rc6).
> > > > > > >>>
> > > > > > >>> Reference: 3aaf0a27ffc29b19a62314edd684b9bc6346f9a8
> > > > > > >>>
> > > > > > >>> Same error, same behaviour. Full dmesg log attached:
> > > > > > >>> https://pastebin.com/raw/KgaEf7Y1
> > > > > > >>> Note:
> > > > > > >>>     dmesg with sysrq-t before running the test starts at [  122.016502]
> > > > > > >>> sysrq: Show State
> > > > > > >>>     dmesg with sysrq-t after the test starts at [  495.587671] sysrq: Show State
> > > > > > >> There is nothing amdgpu-related in there except for waiting for the
> > > > > > >> hardware.
> > > > > > > Yeah, but there's also no other driver that could cause a stuck dma_fence,
> > > > > > > so why is reset not cleaning up the mess here? Irrespective of why the gpu
> > > > > > > is stuck, the kernel should at least complete all the dma_fences even if
> > > > > > > the gpu for some reason is terminally ill ...
> > > > > >
> > > > > > That's a good question as well. I'm digging into this.
> > > > > >
> > > > > > My best theory is that the amdgpu packages disabled GPU reset for some
> > > > > > reason.
> > > > >
> > > > > The timeout for compute queues is infinite because of long-running
> > > > > compute kernels.  You can override it with the amdgpu.lockup_timeout
> > > > > module parameter.
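
As a rough illustration, assuming the parameter format documented in
amdgpu_drv.c for 5.11 (GFX,Compute,SDMA,Video timeouts in milliseconds),
a bounded compute timeout would look something like this on the kernel
command line; the values here are made up for the example:

  amdgpu.lockup_timeout=10000,10000,10000,10000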
> > > >
> > > > Uh, that doesn't work. If you want infinite compute queues you need the
> > > > amdkfd model with preempt-ctx dma_fence. If you allow normal cs ioctl to
> > > > run forever, you just hang the kernel whenever userspace feels like. Not
> > > > just the gpu, the kernel (anything that allocates memory, irrespective of
> > > > process, can hang). That's no good.
> > > > -Daniel
> > > >
> > > > >
> > > > > Alex
> > > > >
> > > > > >
> > > > > > But the much more interesting question is why we end up in this call
> > > > > > path. I've pinged internally, but east coast is not awake yet :)
> > > > > >
> > > > > > Christian.
> > > > > >
> > > > > > > -Daniel
> > > > > > >
> > > > > > >> This is a pretty standard hardware lockup, but I'm still waiting for an
> > > > > > >> explanation of why we end up in this call path in the first place.
> > > > > > >>
> > > > > > >> Christian.
> > > > > > >>
> > > > > > >>>
> > > > > > >>>>>> Christian.
> > > > > > >>>>>>
> > > > > > >>>>>>> -Daniel
> > > > > > >>>>>>>
> > > > > > >>>>>>>> Which OpenCl stack are you using?
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Regards,
> > > > > > >>>>>>>> Christian.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Am 03.02.21 um 09:33 schrieb Daniel Gomez:
> > > > > > >>>>>>>>> Hi all,
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> I have a deadlock with the amdgpu mainline driver when running in parallel two
> > > > > > >>>>>>>>> OpenCL applications. So far, we've been able to replicate it easily by executing
> > > > > > >>>>>>>>> clinfo and MatrixMultiplication (from the AMD opencl-samples). The
> > > > > > >>>>>>>>> opencl-samples are quite old, so if you have any other suggestion for
> > > > > > >>>>>>>>> testing I'd be very happy to test it as well.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> How to replicate the issue:
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> # while true; do /usr/bin/MatrixMultiplication --device gpu \
> > > > > > >>>>>>>>>         --deviceId 0 -x 1000 -y 1000 -z 1000 -q -t -i 50; done
> > > > > > >>>>>>>>> # while true; do clinfo; done
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> Output:
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> After a minute or less (sometimes more) I can see that
> > > > > > >>>>>>>>> MatrixMultiplication and clinfo hang. In addition, with radeontop you can see
> > > > > > >>>>>>>>> the Graphics pipe go from ~50% to 100%. The shader clocks also go
> > > > > > >>>>>>>>> up from ~35% to ~96%.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> clinfo keeps printing:
> > > > > > >>>>>>>>> ioctl(7, DRM_IOCTL_SYNCOBJ_WAIT, 0x7ffe46e5f950) = -1 ETIME (Timer expired)
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> And MatrixMultiplication prints the following (strace) if you try to
> > > > > > >>>>>>>>> kill the process:
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> sched_yield()                           = 0
> > > > > > >>>>>>>>> futex(0x557e945343b8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0,
> > > > > > >>>>>>>>> NULL, FUTEX_BITSET_MATCH_ANYstrace: Process 651 detached
> > > > > > >>>>>>>>>      <detached ...>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> After this, the gpu is not functional at all and you'd need a power cycle reset
> > > > > > >>>>>>>>> to restore the system.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> Hardware info:
> > > > > > >>>>>>>>> CPU: AMD Ryzen Embedded V1605B with Radeon Vega Gfx (8) @ 2.000GHz
> > > > > > >>>>>>>>> GPU: AMD ATI Radeon Vega Series / Radeon Vega Mobile Series
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> 03:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
> > > > > > >>>>>>>>> [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series]
> > > > > > >>>>>>>>> (rev 83)
> > > > > > >>>>>>>>>         DeviceName: Broadcom 5762
> > > > > > >>>>>>>>>         Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge
> > > > > > >>>>>>>>> [Radeon Vega Series / Radeon Vega Mobile Series]
> > > > > > >>>>>>>>>         Kernel driver in use: amdgpu
> > > > > > >>>>>>>>>         Kernel modules: amdgpu
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> Linux kernel info:
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> root@qt5222:~# uname -a
> > > > > > >>>>>>>>> Linux qt5222 5.11.0-rc6-qtec-standard #2 SMP Tue Feb 2 09:41:46 UTC
> > > > > > >>>>>>>>> 2021 x86_64 x86_64 x86_64 GNU/Linux
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> By enabling the kernel lock stats I could see that MatrixMultiplication is
> > > > > > >>>>>>>>> hung in the amdgpu_mn_invalidate_gfx function:
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> [  738.359202] 1 lock held by MatrixMultiplic/653:
> > > > > > >>>>>>>>> [  738.359206]  #0: ffff88810e364fe0
> > > > > > >>>>>>>>> (&adev->notifier_lock){+.+.}-{3:3}, at:
> > > > > > >>>>>>>>> amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> I can see that in the amdgpu_mn_invalidate_gfx function,
> > > > > > >>>>>>>>> dma_resv_wait_timeout_rcu uses wait_all (all fences) and MAX_SCHEDULE_TIMEOUT,
> > > > > > >>>>>>>>> so I guess the code gets stuck there waiting forever. According to the
> > > > > > >>>>>>>>> documentation: "When somebody tries to invalidate the page tables we block the
> > > > > > >>>>>>>>> update until all operations on the pages in question are completed, then those
> > > > > > >>>>>>>>> pages are marked as accessed and also dirty if it wasn’t a read only access."
> > > > > > >>>>>>>>> It looks like the fences are deadlocked and therefore the wait never returns.
> > > > > > >>>>>>>>> Could that be the case? Any hint as to where I can look to fix this?
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> Thank you in advance.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> Here the full dmesg output:
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> [  738.337726] INFO: task MatrixMultiplic:653 blocked for more than 122 seconds.
> > > > > > >>>>>>>>> [  738.344937]       Not tainted 5.11.0-rc6-qtec-standard #2
> > > > > > >>>>>>>>> [  738.350384] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > > > > >>>>>>>>> disables this message.
> > > > > > >>>>>>>>> [  738.358240] task:MatrixMultiplic state:D stack:    0 pid:  653
> > > > > > >>>>>>>>> ppid:     1 flags:0x00004000
> > > > > > >>>>>>>>> [  738.358254] Call Trace:
> > > > > > >>>>>>>>> [  738.358261]  ? dma_fence_default_wait+0x1eb/0x230
> > > > > > >>>>>>>>> [  738.358276]  __schedule+0x370/0x960
> > > > > > >>>>>>>>> [  738.358291]  ? dma_fence_default_wait+0x117/0x230
> > > > > > >>>>>>>>> [  738.358297]  ? dma_fence_default_wait+0x1eb/0x230
> > > > > > >>>>>>>>> [  738.358305]  schedule+0x51/0xc0
> > > > > > >>>>>>>>> [  738.358312]  schedule_timeout+0x275/0x380
> > > > > > >>>>>>>>> [  738.358324]  ? dma_fence_default_wait+0x1eb/0x230
> > > > > > >>>>>>>>> [  738.358332]  ? mark_held_locks+0x4f/0x70
> > > > > > >>>>>>>>> [  738.358341]  ? dma_fence_default_wait+0x117/0x230
> > > > > > >>>>>>>>> [  738.358347]  ? lockdep_hardirqs_on_prepare+0xd4/0x180
> > > > > > >>>>>>>>> [  738.358353]  ? _raw_spin_unlock_irqrestore+0x39/0x40
> > > > > > >>>>>>>>> [  738.358362]  ? dma_fence_default_wait+0x117/0x230
> > > > > > >>>>>>>>> [  738.358370]  ? dma_fence_default_wait+0x1eb/0x230
> > > > > > >>>>>>>>> [  738.358375]  dma_fence_default_wait+0x214/0x230
> > > > > > >>>>>>>>> [  738.358384]  ? dma_fence_release+0x1a0/0x1a0
> > > > > > >>>>>>>>> [  738.358396]  dma_fence_wait_timeout+0x105/0x200
> > > > > > >>>>>>>>> [  738.358405]  dma_resv_wait_timeout_rcu+0x1aa/0x5e0
> > > > > > >>>>>>>>> [  738.358421]  amdgpu_mn_invalidate_gfx+0x55/0xa0 [amdgpu]
> > > > > > >>>>>>>>> [  738.358688]  __mmu_notifier_release+0x1bb/0x210
> > > > > > >>>>>>>>> [  738.358710]  exit_mmap+0x2f/0x1e0
> > > > > > >>>>>>>>> [  738.358723]  ? find_held_lock+0x34/0xa0
> > > > > > >>>>>>>>> [  738.358746]  mmput+0x39/0xe0
> > > > > > >>>>>>>>> [  738.358756]  do_exit+0x5c3/0xc00
> > > > > > >>>>>>>>> [  738.358763]  ? find_held_lock+0x34/0xa0
> > > > > > >>>>>>>>> [  738.358780]  do_group_exit+0x47/0xb0
> > > > > > >>>>>>>>> [  738.358791]  get_signal+0x15b/0xc50
> > > > > > >>>>>>>>> [  738.358807]  arch_do_signal_or_restart+0xaf/0x710
> > > > > > >>>>>>>>> [  738.358816]  ? lockdep_hardirqs_on_prepare+0xd4/0x180
> > > > > > >>>>>>>>> [  738.358822]  ? _raw_spin_unlock_irqrestore+0x39/0x40
> > > > > > >>>>>>>>> [  738.358831]  ? ktime_get_mono_fast_ns+0x50/0xa0
> > > > > > >>>>>>>>> [  738.358844]  ? amdgpu_drm_ioctl+0x6b/0x80 [amdgpu]
> > > > > > >>>>>>>>> [  738.359044]  exit_to_user_mode_prepare+0xf2/0x1b0
> > > > > > >>>>>>>>> [  738.359054]  syscall_exit_to_user_mode+0x19/0x60
> > > > > > >>>>>>>>> [  738.359062]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > > > >>>>>>>>> [  738.359069] RIP: 0033:0x7f6b89a51887
> > > > > > >>>>>>>>> [  738.359076] RSP: 002b:00007f6b82b54b18 EFLAGS: 00000246 ORIG_RAX:
> > > > > > >>>>>>>>> 0000000000000010
> > > > > > >>>>>>>>> [  738.359086] RAX: fffffffffffffe00 RBX: 00007f6b82b54b50 RCX: 00007f6b89a51887
> > > > > > >>>>>>>>> [  738.359091] RDX: 00007f6b82b54b50 RSI: 00000000c02064c3 RDI: 0000000000000007
> > > > > > >>>>>>>>> [  738.359096] RBP: 00000000c02064c3 R08: 0000000000000003 R09: 00007f6b82b54bbc
> > > > > > >>>>>>>>> [  738.359101] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000165a0bc00
> > > > > > >>>>>>>>> [  738.359106] R13: 0000000000000007 R14: 0000000000000001 R15: 0000000000000000
> > > > > > >>>>>>>>> [  738.359129]
> > > > > > >>>>>>>>>                    Showing all locks held in the system:
> > > > > > >>>>>>>>> [  738.359141] 1 lock held by khungtaskd/54:
> > > > > > >>>>>>>>> [  738.359148]  #0: ffffffff829f6840 (rcu_read_lock){....}-{1:2}, at:
> > > > > > >>>>>>>>> debug_show_all_locks+0x15/0x183
> > > > > > >>>>>>>>> [  738.359187] 1 lock held by systemd-journal/174:
> > > > > > >>>>>>>>> [  738.359202] 1 lock held by MatrixMultiplic/653:
> > > > > > >>>>>>>>> [  738.359206]  #0: ffff88810e364fe0
> > > > > > >>>>>>>>> (&adev->notifier_lock){+.+.}-{3:3}, at:
> > > > > > >>>>>>>>> amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> Daniel
> > > > > > >>>>>>>> _______________________________________________
> > > > > > >>>>>>>> dri-devel mailing list
> > > > > > >>>>>>>> dri-devel@lists.freedesktop.org
> > > > > > >>>>>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > > > > > >>>>> --
> > > > > > >>>>> Daniel Vetter
> > > > > > >>>>> Software Engineer, Intel Corporation
> > > > > > >>>>> http://blog.ffwll.ch
> > > > > > >>> _______________________________________________
> > > > > > >>> amd-gfx mailing list
> > > > > > >>> amd-gfx@lists.freedesktop.org
> > > > > > >>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> > > > > >
> > > > > > _______________________________________________
> > > > > > dri-devel mailing list
> > > > > > dri-devel@lists.freedesktop.org
> > > > > > https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > > > > _______________________________________________
> > > > > dri-devel mailing list
> > > > > dri-devel@lists.freedesktop.org
> > > > > https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > > >
> > > > --
> > > > Daniel Vetter
> > > > Software Engineer, Intel Corporation
> > > > http://blog.ffwll.ch
> > > > _______________________________________________
> > > > amd-gfx mailing list
> > > > amd-gfx@lists.freedesktop.org
> > > > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> > >
> > >
> > >
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2021-02-03 17:34 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-03  8:33 [amdgpu] deadlock Daniel Gomez
2021-02-03  8:36 ` Christian König
2021-02-03  8:48   ` Daniel Vetter
2021-02-03  8:51     ` Christian König
2021-02-03  8:56       ` Daniel Gomez
2021-02-03  9:17       ` Daniel Vetter
2021-02-03  9:47         ` Daniel Gomez
2021-02-03 11:45           ` Daniel Gomez
2021-02-03 12:21             ` Christian König
2021-02-03 12:24               ` Daniel Vetter
2021-02-03 12:30                 ` Christian König
2021-02-03 13:56                   ` Alex Deucher
2021-02-03 14:27                     ` Daniel Vetter
2021-02-03 14:33                       ` Bridgman, John
2021-02-03 14:41                         ` Daniel Vetter
2021-02-03 16:42                           ` Alex Deucher
2021-02-03 17:14                             ` Daniel Vetter
2021-02-03 17:33                               ` Daniel Vetter
2021-02-03 14:37 ` Christian König
2021-02-03 14:54   ` Daniel Gomez

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).