amd-gfx.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
From: "Christian König" <christian.koenig@amd.com>
To: "Li, Dennis" <Dennis.Li@amd.com>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	"Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"Kuehling, Felix" <Felix.Kuehling@amd.com>,
	"Zhang, Hawking" <Hawking.Zhang@amd.com>
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Date: Thu, 18 Mar 2021 10:51:33 +0100	[thread overview]
Message-ID: <378fdffb-99b5-2a14-736d-a06f310b040c@amd.com> (raw)
In-Reply-To: <DM5PR12MB253379E8C89D8A20C8A0245AED699@DM5PR12MB2533.namprd12.prod.outlook.com>


[-- Attachment #1.1: Type: text/plain, Size: 8737 bytes --]

Am 18.03.21 um 10:30 schrieb Li, Dennis:
>
> >>> The GPU reset doesn't complete the fences we wait for. It only 
> completes the hardware fences as part of the reset.
>
> >>> So waiting for a fence while holding the reset lock is illegal and 
> needs to be avoided.
>
> I understood your concern. It is more complex for DRM GFX, therefore I 
> abandon adding lock protection for DRM ioctls now. Maybe we can try to 
> add all kernel  dma_fence waiting in a list, and signal all in 
> recovery threads. Do you have same concern for compute cases?
>

Yes, compute (KFD) is even harder to handle.

See you can't signal the dma_fence waiting. Waiting for a dma_fence also 
means you wait for the GPU reset to finish.

When we would signal the dma_fence during the GPU reset then we would 
run into memory corruption because the hardware jobs running after the 
GPU reset would access memory which is already freed.

> >>> Lockdep also complains about this when it is used correctly. The 
> only reason it doesn't complain here is because you use an 
> atomic+wait_event instead of a locking primitive.
>
> Agree. This approach will escape the monitor of lockdep.  Its goal is 
> to block other threads when GPU recovery thread start. But I couldn’t 
> find a better method to solve this problem. Do you have some suggestion?
>

Well, completely abandon those change here.

What we need to do is to identify where hardware access happens and then 
insert taking the read side of the GPU reset lock so that we don't wait 
for a dma_fence or allocate memory, but still protect the hardware from 
concurrent access and reset.

Regards,
Christian.

> Best Regards
>
> Dennis Li
>
> *From:* Koenig, Christian <Christian.Koenig@amd.com>
> *Sent:* Thursday, March 18, 2021 4:59 PM
> *To:* Li, Dennis <Dennis.Li@amd.com>; amd-gfx@lists.freedesktop.org; 
> Deucher, Alexander <Alexander.Deucher@amd.com>; Kuehling, Felix 
> <Felix.Kuehling@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>
> *Subject:* AW: [PATCH 0/4] Refine GPU recovery sequence to enhance its 
> stability
>
> Exactly that's what you don't seem to understand.
>
> The GPU reset doesn't complete the fences we wait for. It only 
> completes the hardware fences as part of the reset.
>
> So waiting for a fence while holding the reset lock is illegal and 
> needs to be avoided.
>
> Lockdep also complains about this when it is used correctly. The only 
> reason it doesn't complain here is because you use an 
> atomic+wait_event instead of a locking primitive.
>
> Regards,
>
> Christian.
>
> ------------------------------------------------------------------------
>
> *Von:*Li, Dennis <Dennis.Li@amd.com <mailto:Dennis.Li@amd.com>>
> *Gesendet:* Donnerstag, 18. März 2021 09:28
> *An:* Koenig, Christian <Christian.Koenig@amd.com 
> <mailto:Christian.Koenig@amd.com>>; amd-gfx@lists.freedesktop.org 
> <mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org 
> <mailto:amd-gfx@lists.freedesktop.org>>; Deucher, Alexander 
> <Alexander.Deucher@amd.com <mailto:Alexander.Deucher@amd.com>>; 
> Kuehling, Felix <Felix.Kuehling@amd.com 
> <mailto:Felix.Kuehling@amd.com>>; Zhang, Hawking 
> <Hawking.Zhang@amd.com <mailto:Hawking.Zhang@amd.com>>
> *Betreff:* RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its 
> stability
>
> >>> Those two steps need to be exchanged or otherwise it is possible 
> that new delayed work items etc are started before the lock is taken.
> What about adding check for adev->in_gpu_reset in work item? If 
> exchange the two steps, it maybe introduce the deadlock.  For example, 
> the user thread hold the read lock and waiting for the fence, if 
> recovery thread try to hold write lock and then complete fences, in 
> this case, recovery thread will always be blocked.
>
>
> Best Regards
> Dennis Li
> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig@amd.com 
> <mailto:Christian.Koenig@amd.com>>
> Sent: Thursday, March 18, 2021 3:54 PM
> To: Li, Dennis <Dennis.Li@amd.com <mailto:Dennis.Li@amd.com>>; 
> amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>; 
> Deucher, Alexander <Alexander.Deucher@amd.com 
> <mailto:Alexander.Deucher@amd.com>>; Kuehling, Felix 
> <Felix.Kuehling@amd.com <mailto:Felix.Kuehling@amd.com>>; Zhang, 
> Hawking <Hawking.Zhang@amd.com <mailto:Hawking.Zhang@amd.com>>
> Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its 
> stability
>
> Am 18.03.21 um 08:23 schrieb Dennis Li:
> > We have defined two variables in_gpu_reset and reset_sem in adev 
> object. The atomic type variable in_gpu_reset is used to avoid 
> recovery thread reenter and make lower functions return more earlier 
> when recovery start, but couldn't block recovery thread when it access 
> hardware. The r/w semaphore reset_sem is used to solve these 
> synchronization issues between recovery thread and other threads.
> >
> > The original solution locked registers' access in lower functions, 
> which will introduce following issues:
> >
> > 1) many lower functions are used in both recovery thread and others. 
> Firstly we must harvest these functions, it is easy to miss someones. 
> Secondly these functions need select which lock (read lock or write 
> lock) will be used, according to the thread it is running in. If the 
> thread context isn't considered, the added lock will easily introduce 
> deadlock. Besides that, in most time, developer easily forget to add 
> locks for new functions.
> >
> > 2) performance drop. More lower functions are more frequently called.
> >
> > 3) easily introduce false positive lockdep complaint, because write 
> lock has big range in recovery thread, but low level functions will 
> hold read lock may be protected by other locks in other threads.
> >
> > Therefore the new solution will try to add lock protection for 
> ioctls of kfd. Its goal is that there are no threads except for 
> recovery thread or its children (for xgmi) to access hardware when 
> doing GPU reset and resume. So refine recovery thread as the following:
> >
> > Step 0: atomic_cmpxchg(&adev->in_gpu_reset, 0, 1)
> >     1). if failed, it means system had a recovery thread running, 
> current thread exit directly;
> >     2). if success, enter recovery thread;
> >
> > Step 1: cancel all delay works, stop drm schedule, complete all 
> unreceived fences and so on. It try to stop or pause other threads.
> >
> > Step 2: call down_write(&adev->reset_sem) to hold write lock, which 
> will block recovery thread until other threads release read locks.
>
> Those two steps need to be exchanged or otherwise it is possible that 
> new delayed work items etc are started before the lock is taken.
>
> Just to make it clear until this is fixed the whole patch set is a NAK.
>
> Regards,
> Christian.
>
> >
> > Step 3: normally, there is only recovery threads running to access 
> hardware, it is safe to do gpu reset now.
> >
> > Step 4: do post gpu reset, such as call all ips' resume functions;
> >
> > Step 5: atomic set adev->in_gpu_reset as 0, wake up other threads 
> and release write lock. Recovery thread exit normally.
> >
> > Other threads call the amdgpu_read_lock to synchronize with recovery 
> thread. If it finds that in_gpu_reset is 1, it should release read 
> lock if it has holden one, and then blocks itself to wait for recovery 
> finished event. If thread successfully hold read lock and in_gpu_reset 
> is 0, it continues. It will exit normally or be stopped by recovery 
> thread in step 1.
> >
> > Dennis Li (4):
> >    drm/amdgpu: remove reset lock from low level functions
> >    drm/amdgpu: refine the GPU recovery sequence
> >    drm/amdgpu: instead of using down/up_read directly
> >    drm/amdkfd: add reset lock protection for kfd entry functions
> >
> >   drivers/gpu/drm/amd/amdgpu/amdgpu.h           |   6 +
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c   |  14 +-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 173 +++++++++++++-----
> >   .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c    |   8 -
> >   drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c        |   4 +-
> >   drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c         |   9 +-
> >   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c         |   5 +-
> >   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c         |   5 +-
> >   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      | 172 ++++++++++++++++-
> >   drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |   3 +-
> >   drivers/gpu/drm/amd/amdkfd/kfd_process.c      |   4 +
> >   .../amd/amdkfd/kfd_process_queue_manager.c    |  17 ++
> >   12 files changed, 345 insertions(+), 75 deletions(-)
> >
>


[-- Attachment #1.2: Type: text/html, Size: 18202 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

  reply	other threads:[~2021-03-18  9:51 UTC|newest]

Thread overview: 56+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-18  7:23 [PATCH 0/4] Refine GPU recovery sequence to enhance its stability Dennis Li
2021-03-18  7:23 ` [PATCH 1/4] drm/amdgpu: remove reset lock from low level functions Dennis Li
2021-03-18  7:23 ` [PATCH 2/4] drm/amdgpu: refine the GPU recovery sequence Dennis Li
2021-03-18  7:56   ` Christian König
2021-03-18  7:23 ` [PATCH 3/4] drm/amdgpu: instead of using down/up_read directly Dennis Li
2021-03-18  7:23 ` [PATCH 4/4] drm/amdkfd: add reset lock protection for kfd entry functions Dennis Li
2021-03-18  7:53 ` [PATCH 0/4] Refine GPU recovery sequence to enhance its stability Christian König
2021-03-18  8:28   ` Li, Dennis
2021-03-18  8:58     ` AW: " Koenig, Christian
2021-03-18  9:30       ` Li, Dennis
2021-03-18  9:51         ` Christian König [this message]
2021-04-05 17:58           ` Andrey Grodzovsky
2021-04-06 10:34             ` Christian König
2021-04-06 11:21               ` Christian König
2021-04-06 21:22               ` Andrey Grodzovsky
2021-04-07 10:28                 ` Christian König
2021-04-07 19:44                   ` Andrey Grodzovsky
2021-04-08  8:22                     ` Christian König
2021-04-08  8:32                       ` Christian König
2021-04-08 16:08                         ` Andrey Grodzovsky
2021-04-08 18:58                           ` Christian König
2021-04-08 20:39                             ` Andrey Grodzovsky
2021-04-09  6:53                               ` Christian König
2021-04-09  7:01                                 ` Christian König
2021-04-09 15:42                                   ` Andrey Grodzovsky
2021-04-09 16:39                                     ` Christian König
2021-04-09 18:18                                       ` Andrey Grodzovsky
2021-04-10 17:34                                         ` Christian König
2021-04-12 17:27                                           ` Andrey Grodzovsky
2021-04-12 17:44                                             ` Christian König
2021-04-12 18:01                                               ` Andrey Grodzovsky
2021-04-12 18:05                                                 ` Christian König
2021-04-12 18:18                                                   ` Andrey Grodzovsky
2021-04-12 18:23                                                     ` Christian König
2021-04-12 19:12                                                       ` Andrey Grodzovsky
2021-04-12 19:18                                                         ` Christian König
2021-04-12 20:01                                                           ` Andrey Grodzovsky
2021-04-13  7:10                                                             ` Christian König
2021-04-13  9:13                                                               ` Li, Dennis
2021-04-13  9:14                                                                 ` Christian König
2021-04-13 20:08                                                                 ` Daniel Vetter
2021-04-13 15:12                                                               ` Andrey Grodzovsky
2021-04-13 18:03                                                                 ` Christian König
2021-04-13 18:18                                                                   ` Andrey Grodzovsky
2021-04-13 18:25                                                                     ` Christian König
2021-04-13 18:30                                                                       ` Andrey Grodzovsky
2021-04-14  7:01                                                                         ` Christian König
2021-04-14 14:36                                                                           ` Andrey Grodzovsky
2021-04-14 14:58                                                                             ` Christian König
2021-04-15  6:27                                                                               ` Andrey Grodzovsky
2021-04-15  7:02                                                                                 ` Christian König
2021-04-15 14:11                                                                                   ` Andrey Grodzovsky
2021-04-15 15:09                                                                                     ` Christian König
2021-04-13 20:07                                                               ` Daniel Vetter
2021-04-13  5:36                                                       ` Andrey Grodzovsky
2021-04-13  7:07                                                         ` Christian König

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=378fdffb-99b5-2a14-736d-a06f310b040c@amd.com \
    --to=christian.koenig@amd.com \
    --cc=Alexander.Deucher@amd.com \
    --cc=Dennis.Li@amd.com \
    --cc=Felix.Kuehling@amd.com \
    --cc=Hawking.Zhang@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).