From: "Christian König" <ckoenig.leichtzumerken@gmail.com>
To: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>,
	Alex Deucher <alexdeucher@gmail.com>
Cc: michel@daenzer.net, Borislav Petkov <bp@alien8.de>,
	amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: grab extra fence reference for drm_sched_job_add_dependency
Date: Thu, 5 Jan 2023 11:03:14 +0100
Message-ID: <e6b6a599-8fdd-a4fc-a2bb-d0750e6d477d@gmail.com>
In-Reply-To: <CABXGCsOmtfo=7YWUv0QmGGrCat1Md59oz7UWw9-7MPn7f6AAdA@mail.gmail.com>

On 05.01.23 at 02:44, Mikhail Gavrilov wrote:
> On Tue, Jan 3, 2023 at 7:26 PM Alex Deucher <alexdeucher@gmail.com> wrote:
>> On Tue, Jan 3, 2023 at 3:34 AM Christian König
>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>> I assume that this was already upstreamed while I was on sick leave?
>> Yes.
>>
>> Alex
>>
> What about commit 2fdb8a8f07c2f1353770a324fd19b8114e4329ac ?

That one should be fixed by:

commit 9f1ecfc5dcb47a7ca37be47b0eaca0f37f1ae93d
Author: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Date:   Wed Nov 23 03:13:03 2022 +0300

     drm/scheduler: Fix lockup in drm_sched_entity_kill()

     The drm_sched_entity_kill() is invoked twice by drm_sched_entity_destroy()
     while userspace process is exiting or being killed. First time it's invoked
     when sched entity is flushed and second time when entity is released. This
     causes a lockup within wait_for_completion(entity_idle) due to how completion
     API works.

     Calling wait_for_completion() more times than complete() was invoked is a
     error condition that causes lockup because completion internally uses
     counter for complete/wait calls. The complete_all() must be used instead
     in such cases.

     This patch fixes lockup of Panfrost driver that is reproducible by killing
     any application in a middle of 3d drawing operation.

     Fixes: 2fdb8a8f07c2 ("drm/scheduler: rework entity flush, kill and fini")
     Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
     Reviewed-by: Christian König <christian.koenig@amd.com>
     Link: https://patchwork.freedesktop.org/patch/msgid/20221123001303.533968-1-dmitry.osipenko@collabora.com
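
To illustrate the counter behaviour the commit message describes, here is a
minimal userspace sketch. It is only a model: the model_* names and
second_wait_succeeds() are hypothetical and are not the kernel API or the
actual patch. It mirrors how complete() lets one waiter through,
complete_all() lets every waiter through, and a non-blocking wait reports
where wait_for_completion() would sleep.

```c
/* Userspace model of the counter inside the kernel's struct completion.
 * All model_* names are illustrative; this is not the kernel code. */
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

struct model_completion {
	unsigned int done;	/* number of waiters still allowed through */
};

static void model_init(struct model_completion *c)
{
	c->done = 0;
}

/* Modeled on complete(): allow exactly one more waiter through. */
static void model_complete(struct model_completion *c)
{
	if (c->done != UINT_MAX)
		c->done++;
}

/* Modeled on complete_all(): saturate the counter so every waiter passes. */
static void model_complete_all(struct model_completion *c)
{
	c->done = UINT_MAX;
}

/* Modeled on try_wait_for_completion(): returns false in the situation
 * where a blocking wait_for_completion() would go to sleep. */
static bool model_try_wait(struct model_completion *c)
{
	if (c->done == 0)
		return false;
	if (c->done != UINT_MAX)
		c->done--;	/* consume one signal; never consume "all" */
	return true;
}

/* Signal once, then wait twice (as flush + release would).
 * Returns whether the second wait gets through without blocking. */
static bool second_wait_succeeds(bool use_complete_all)
{
	struct model_completion idle;
	bool first;

	model_init(&idle);
	if (use_complete_all)
		model_complete_all(&idle);
	else
		model_complete(&idle);

	first = model_try_wait(&idle);
	assert(first);		/* the first wait always passes after a signal */
	return model_try_wait(&idle);
}
```

In this model, second_wait_succeeds(false) returns false, which corresponds
to the second waiter hanging in wait_for_completion() as in the trace quoted
below, while second_wait_succeeds(true) returns true.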

Regards,
Christian.

> I checked twice and I'm sure that this commit is the reason why I
> can't terminate some games (and others processes).
> Demonstration: https://youtu.be/O0AfjiMdFGw
> I also attached a full kernel log.
>
> INFO: task ZAT.exe:4745 blocked for more than 122 seconds.
>        Tainted: G        W    L
> 6.1.0-rc1-13-2fdb8a8f07c2f1353770a324fd19b8114e4329ac+ #18
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:ZAT.exe         state:D stack:12608 pid:4745  ppid:1      flags:0x20004006
> Call Trace:
>   <TASK>
>   __schedule+0x4c5/0x1740
>   schedule+0x5d/0xe0
>   schedule_timeout+0xf0/0x130
>   __wait_for_common+0xa9/0x1f0
>   ? usleep_range_state+0x90/0x90
>   drm_sched_entity_kill.part.0+0x4d/0x210 [gpu_sched]
>   drm_sched_entity_flush+0xa0/0x260 [gpu_sched]
>   amdgpu_ctx_mgr_entity_flush+0x83/0xd0 [amdgpu]
>   amdgpu_flush+0x25/0x40 [amdgpu]
>   filp_close+0x31/0x70
>   put_files_struct+0x78/0xf0
>   do_exit+0x364/0xc30
>   ? sched_clock_cpu+0xb/0xc0
>   do_group_exit+0x33/0xa0
>   get_signal+0xb41/0xb50
>   arch_do_signal_or_restart+0x44/0x7a0
>   exit_to_user_mode_prepare+0x17b/0x250
>   syscall_exit_to_user_mode+0x16/0x50
>   __do_fast_syscall_32+0x94/0xf0
> 2132]: Reached target exit.target - Exit the Session.
> 1]: user@1000.service: Killing process 4402 (reaper) with signal SIGKILL.
> 1]: user@1000.service: Killing process 4745 (ZAT.exe) with signal SIGKILL.
> 1]: Started plymouth-reboot.service - Show Plymouth Reboot Screen.
> : SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295
> subj=system_u:system_r:init_t:s0 msg='unit=plymouth-reboot
> comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=?
> terminal=? res=succe>
> 1]: plymouth-switch-root-initramfs.service - Tell Plymouth To Jump To
> initramfs was skipped because of an unmet condition check
> (ConditionPathExists=/run/initramfs/bin/sh).
> INFO: task ZAT.exe:4745 blocked for more than 122 seconds.
>        Tainted: G        W    L
> 6.1.0-rc1-13-2fdb8a8f07c2f1353770a324fd19b8114e4329ac+ #18
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:ZAT.exe         state:D stack:12608 pid:4745  ppid:1      flags:0x20004006
> Call Trace:
>   <TASK>
>   __schedule+0x4c5/0x1740
>   schedule+0x5d/0xe0
>   schedule_timeout+0xf0/0x130
>   __wait_for_common+0xa9/0x1f0
>   ? usleep_range_state+0x90/0x90
>   drm_sched_entity_kill.part.0+0x4d/0x210 [gpu_sched]
>   drm_sched_entity_flush+0xa0/0x260 [gpu_sched]
>   amdgpu_ctx_mgr_entity_flush+0x83/0xd0 [amdgpu]
>   amdgpu_flush+0x25/0x40 [amdgpu]
>   filp_close+0x31/0x70
>   put_files_struct+0x78/0xf0
>   do_exit+0x364/0xc30
>   ? sched_clock_cpu+0xb/0xc0
>   do_group_exit+0x33/0xa0
>   get_signal+0xb41/0xb50
>   arch_do_signal_or_restart+0x44/0x7a0
>   exit_to_user_mode_prepare+0x17b/0x250
>   syscall_exit_to_user_mode+0x16/0x50
>   __do_fast_syscall_32+0x94/0xf0
>   ? __do_fast_syscall_32+0x94/0xf0
>   ? lockdep_hardirqs_on+0x7d/0x100
>   ? __do_fast_syscall_32+0x94/0xf0
>   ? __do_fast_syscall_32+0x94/0xf0
>   do_fast_syscall_32+0x2f/0x70
>   entry_SYSCALL_compat_after_hwframe+0x62/0x6a
> RIP: 0023:0xf7f6b579
> RSP: 002b:00000000e8dffd40 EFLAGS: 00200282 ORIG_RAX: 00000000000000f0
> RAX: fffffffffffffe00 RBX: 00000000f0b54dcc RCX: 0000000000000189
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> RBP: 00000000ffffffff R08: 00000000e8dffd40 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000200282 R12: 0000000000000000
> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>   </TASK>
>
> Showing all locks held in the system:
> 1 lock held by rcu_tasks_kthre/11:
>   #0: ffffffffae368a20 (rcu_tasks.tasks_gp_mutex){+.+.}-{3:3}, at:
> rcu_tasks_one_gp+0x2b/0x3e0
> 1 lock held by rcu_tasks_rude_/12:
>   #0: ffffffffae368760 (rcu_tasks_rude.tasks_gp_mutex){+.+.}-{3:3}, at:
> rcu_tasks_one_gp+0x2b/0x3e0
> 1 lock held by rcu_tasks_trace/13:
>   #0: ffffffffae368460 (rcu_tasks_trace.tasks_gp_mutex){+.+.}-{3:3},
> at: rcu_tasks_one_gp+0x2b/0x3e0
> 1 lock held by khungtaskd/182:
>   #0: ffffffffae369520 (rcu_read_lock){....}-{1:2}, at:
> debug_show_all_locks+0x15/0x16b
> 2 locks held by kworker/25:1/215:
> 1 lock held by systemd-journal/852:
> 1 lock held by ZAT.exe/4745:
>   #0: ffff9b087c337cf8 (&mgr->lock#3){+.+.}-{3:3}, at:
> amdgpu_ctx_mgr_entity_flush+0x3a/0xd0 [amdgpu]
>
> =============================================
> 1]: user@1000.service: Processes still around after final SIGKILL.
> Entering failed mode.
> 1]: user@1000.service: Failed with result 'timeout'.
> 1]: Stopped user@1000.service - User Manager for UID 1000.
>
>


Thread overview: 18+ messages
2022-12-19 10:47 [PATCH] drm/amdgpu: grab extra fence reference for drm_sched_job_add_dependency Christian König
2022-12-19 14:00 ` Borislav Petkov
2022-12-21 21:10   ` Alex Deucher
2023-01-03  8:34     ` Christian König
2023-01-03 14:26       ` Alex Deucher
2023-01-03 14:28         ` Michel Dänzer
2023-01-05  1:44         ` Mikhail Gavrilov
2023-01-05 10:03           ` Christian König [this message]
2023-01-06 12:59             ` Mikhail Gavrilov
2023-01-06 14:24               ` Alex Deucher
2023-01-06 15:27                 ` Christian König
2023-01-09 13:13                   ` Mikhail Gavrilov
2023-01-09 13:40                     ` Christian König
2023-01-10 18:21                       ` Mikhail Gavrilov
2023-01-12 12:05                         ` Christian König
2022-12-19 15:08 ` Luben Tuikov
2022-12-23 10:00 ` Michal Kubecek
2022-12-23 22:55 ` Mikhail Gavrilov
