From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
To: "Christian König" <christian.koenig@amd.com>,
	"Li, Dennis" <Dennis.Li@amd.com>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	"Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"Kuehling, Felix" <Felix.Kuehling@amd.com>,
	"Zhang, Hawking" <Hawking.Zhang@amd.com>
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Date: Mon, 5 Apr 2021 13:58:09 -0400
Message-ID: <1e37bb4d-f54d-1b7e-4632-94cfcf749528@amd.com>
In-Reply-To: <378fdffb-99b5-2a14-736d-a06f310b040c@amd.com>


Dennis, Christian, is there any update on the plan for how to move
forward with this? As you know, I need very similar code for my
upstreaming of device hot-unplug. My latest solution
(https://lists.freedesktop.org/archives/amd-gfx/2021-January/058606.html)
was not acceptable because the low-level guards at the register-accessor
level hurt performance. Basically I need a way to prevent any MMIO write
accesses from the kernel driver after the device is removed (UMD
accesses are taken care of by page-faulting in a dummy page). We are now
using the hot-unplug code for the Freemont program, so upstreaming has
become more of a priority than before. This MMIO access issue is
currently my main blocker for upstreaming. Is there any way I can assist
in pushing this forward?
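
For reference, the direction I have in mind is to guard whole hardware
access sections with drm_dev_enter()/drm_dev_exit() instead of every
register accessor; a rough sketch only (the function and register names
here are made up):

#include <drm/drm_drv.h>

static void amdgpu_do_hw_work(struct amdgpu_device *adev)
{
        int idx;

        /* drm_dev_enter() fails once the device has been unplugged,
         * so the whole section is skipped instead of guarding every
         * single register accessor. */
        if (!drm_dev_enter(adev_to_drm(adev), &idx))
                return;

        WREG32(mmSOME_REG, 0xdeadbeef); /* example MMIO write */

        drm_dev_exit(idx);
}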

Andrey

On 2021-03-18 5:51 a.m., Christian König wrote:
> On 2021-03-18 10:30, Li, Dennis wrote:
>>
>> >>> The GPU reset doesn't complete the fences we wait for. It only 
>> completes the hardware fences as part of the reset.
>>
>> >>> So waiting for a fence while holding the reset lock is illegal 
>> and needs to be avoided.
>>
>> I understand your concern. It is more complex for DRM GFX, so I am 
>> abandoning the lock protection for the DRM ioctls for now. Maybe we 
>> can try adding all kernel dma_fence waits to a list and signaling 
>> them all from the recovery thread. Do you have the same concern for 
>> the compute cases?
>>
>
> Yes, compute (KFD) is even harder to handle.
>
> See, you can't signal the dma_fences being waited on. Waiting for a 
> dma_fence also means waiting for the GPU reset to finish.
>
> If we signaled the dma_fences during the GPU reset, we would run into 
> memory corruption, because the hardware jobs running after the reset 
> would access memory which has already been freed.
>
>> >>> Lockdep also complains about this when it is used correctly. The 
>> only reason it doesn't complain here is because you use an 
>> atomic+wait_event instead of a locking primitive.
>>
>> Agreed. This approach escapes lockdep's monitoring. Its goal is to 
>> block other threads when the GPU recovery thread starts, but I 
>> couldn't find a better method to solve this problem. Do you have a 
>> suggestion?
>>
>
> Well, completely abandon those changes here.
>
> What we need to do is identify where hardware access happens and then 
> take the read side of the GPU reset lock there, so that we don't wait 
> for a dma_fence or allocate memory while holding it, but still protect 
> the hardware from concurrent access and reset.
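>
> A minimal sketch of that pattern (the helper name is illustrative):
>
> static int amdgpu_do_hw_job(struct amdgpu_device *adev)
> {
>         int r;
>
>         /* Only the actual hardware access inside the read-side
>          * critical section: no dma_fence waits, no allocations. */
>         down_read(&adev->reset_sem);
>         r = amdgpu_program_hw(adev);    /* placeholder HW access */
>         up_read(&adev->reset_sem);
>
>         return r;
> }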
>
> Regards,
> Christian.
>
>> Best Regards
>>
>> Dennis Li
>>
>> *From:* Koenig, Christian <Christian.Koenig@amd.com>
>> *Sent:* Thursday, March 18, 2021 4:59 PM
>> *To:* Li, Dennis <Dennis.Li@amd.com>; amd-gfx@lists.freedesktop.org; 
>> Deucher, Alexander <Alexander.Deucher@amd.com>; Kuehling, Felix 
>> <Felix.Kuehling@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>
>> *Subject:* AW: [PATCH 0/4] Refine GPU recovery sequence to enhance 
>> its stability
>>
>> That's exactly what you don't seem to understand.
>>
>> The GPU reset doesn't complete the fences we wait for. It only 
>> completes the hardware fences as part of the reset.
>>
>> So waiting for a fence while holding the reset lock is illegal and 
>> needs to be avoided.
>>
>> Lockdep also complains about this when it is used correctly. The only 
>> reason it doesn't complain here is because you use an 
>> atomic+wait_event instead of a locking primitive.
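>>
>> To illustrate (a sketch; the wait queue adev->reset_wq is assumed):
>> lockdep tracks rwsem acquisitions, but it cannot see any dependency
>> through a raw atomic plus wait_event:
>>
>> /* Invisible to lockdep - no lock dependency is recorded: */
>> wait_event(adev->reset_wq, !atomic_read(&adev->in_gpu_reset));
>>
>> /* Visible to lockdep - down_read() enters dependency analysis: */
>> down_read(&adev->reset_sem);
>> /* ... hardware access ... */
>> up_read(&adev->reset_sem);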
>>
>> Regards,
>>
>> Christian.
>>
>> ------------------------------------------------------------------------
>>
>> *From:* Li, Dennis <Dennis.Li@amd.com>
>> *Sent:* Thursday, March 18, 2021 09:28
>> *To:* Koenig, Christian <Christian.Koenig@amd.com>; 
>> amd-gfx@lists.freedesktop.org; Deucher, Alexander 
>> <Alexander.Deucher@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>; 
>> Zhang, Hawking <Hawking.Zhang@amd.com>
>> *Subject:* RE: [PATCH 0/4] Refine GPU recovery sequence to enhance 
>> its stability
>>
>> >>> Those two steps need to be exchanged; otherwise it is possible 
>> that new delayed work items etc. are started before the lock is taken.
>>
>> What about adding a check for adev->in_gpu_reset in the work item? If 
>> we exchange the two steps, it may introduce a deadlock. For example, 
>> if a user thread holds the read lock and is waiting for a fence while 
>> the recovery thread tries to take the write lock before completing 
>> the fences, the recovery thread will be blocked forever.
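>>
>> Sketched as an interleaving (illustrative names):
>>
>> /* user thread                     recovery thread
>>  *
>>  * down_read(&adev->reset_sem)
>>  * dma_fence_wait(fence)           down_write(&adev->reset_sem)
>>  *   -> blocks: the fence would      -> blocks: a read lock is
>>  *      only be completed after         still held
>>  *      the write lock is taken
>>  * => neither thread can make progress
>>  */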
>>
>>
>> Best Regards
>> Dennis Li
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig@amd.com>
>> Sent: Thursday, March 18, 2021 3:54 PM
>> To: Li, Dennis <Dennis.Li@amd.com>; amd-gfx@lists.freedesktop.org; 
>> Deucher, Alexander <Alexander.Deucher@amd.com>; Kuehling, Felix 
>> <Felix.Kuehling@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>
>> Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its 
>> stability
>>
>> On 2021-03-18 08:23, Dennis Li wrote:
>> > We have defined two variables, in_gpu_reset and reset_sem, in the 
>> > adev object. The atomic variable in_gpu_reset is used to prevent the 
>> > recovery thread from re-entering and to make lower-level functions 
>> > return earlier when recovery starts, but it cannot block other 
>> > threads while the recovery thread accesses the hardware. The r/w 
>> > semaphore reset_sem is used to solve these synchronization issues 
>> > between the recovery thread and other threads.
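>> >
>> > For illustration, the two members in struct amdgpu_device
>> > (simplified, other members elided):
>> >
>> > struct amdgpu_device {
>> >         /* ... */
>> >         atomic_t in_gpu_reset;          /* 1 while recovery runs */
>> >         struct rw_semaphore reset_sem;  /* readers: HW access paths,
>> >                                          * writer: recovery thread */
>> > };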
>> >
>> > The original solution locked register access in the lower-level 
>> > functions, which introduces the following issues:
>> >
>> > 1) Many lower-level functions are used by both the recovery thread 
>> > and other threads. First, we have to find all of these functions, 
>> > and it is easy to miss some. Second, these functions need to select 
>> > which lock (read or write) to take according to the thread they run 
>> > in; if the thread context isn't considered, the added lock easily 
>> > introduces deadlocks. Besides that, developers can easily forget to 
>> > add locks to new functions.
>> >
>> > 2) Performance drop. The lower-level a function is, the more 
>> > frequently it is called.
>> >
>> > 3) It easily introduces false-positive lockdep complaints, because 
>> > the write lock covers a large range in the recovery thread, while 
>> > the low-level functions holding the read lock may also be protected 
>> > by other locks in other threads.
>> >
>> > Therefore the new solution tries to add lock protection at the KFD 
>> > ioctls instead. Its goal is that no threads except the recovery 
>> > thread or its children (for XGMI) access the hardware while doing 
>> > GPU reset and resume. The recovery thread is refined as follows (a 
>> > condensed code sketch follows the step description):
>> >
>> > Step 0: atomic_cmpxchg(&adev->in_gpu_reset, 0, 1)
>> >     1) If it fails, a recovery thread is already running and the 
>> >        current thread exits directly.
>> >     2) If it succeeds, enter the recovery thread.
>> >
>> > Step 1: Cancel all delayed work, stop the DRM scheduler, complete 
>> > all outstanding fences, and so on. This tries to stop or pause 
>> > other threads.
>> >
>> > Step 2: Call down_write(&adev->reset_sem) to take the write lock, 
>> > which blocks the recovery thread until other threads release their 
>> > read locks.
>>
>> Those two steps need to be exchanged; otherwise it is possible that 
>> new delayed work items etc. are started before the lock is taken.
>>
>> Just to make it clear: until this is fixed, the whole patch set is a NAK.
>>
>> Regards,
>> Christian.
>>
>> >
>> > Step 3: Normally only the recovery thread is left accessing the 
>> > hardware, so it is safe to do the GPU reset now.
>> >
>> > Step 4: Do the post-GPU-reset work, such as calling all IPs' resume 
>> > functions.
>> >
>> > Step 5: Atomically set adev->in_gpu_reset back to 0, wake up other 
>> > threads, and release the write lock. The recovery thread exits 
>> > normally.
>> >
>> > Other threads call amdgpu_read_lock to synchronize with the 
>> > recovery thread. If a thread finds that in_gpu_reset is 1, it 
>> > releases the read lock if it holds one and then blocks, waiting for 
>> > the recovery-finished event. If it successfully takes the read lock 
>> > and in_gpu_reset is 0, it continues. It will exit normally or be 
>> > stopped by the recovery thread in step 1.
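>> >
>> > A condensed sketch of the sequence above (helper names such as 
>> > amdgpu_stop_other_threads() and the wait queue adev->reset_wq are 
>> > placeholders; this shows the order as posted in this series):
>> >
>> > static int amdgpu_device_gpu_recover(struct amdgpu_device *adev)
>> > {
>> >         /* Step 0: only one recovery thread may enter. */
>> >         if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
>> >                 return 0;       /* recovery already running */
>> >
>> >         /* Step 1: cancel delayed work, stop the DRM scheduler,
>> >          * complete outstanding fences. */
>> >         amdgpu_stop_other_threads(adev);
>> >
>> >         /* Step 2: blocks until all readers drop reset_sem. */
>> >         down_write(&adev->reset_sem);
>> >
>> >         /* Step 3: only this thread accesses the HW now. */
>> >         amdgpu_do_asic_reset(adev);
>> >
>> >         /* Step 4: post-reset work, e.g. resume all IP blocks. */
>> >         amdgpu_resume_ips(adev);
>> >
>> >         /* Step 5: reopen the gate and wake up waiting threads. */
>> >         atomic_set(&adev->in_gpu_reset, 0);
>> >         up_write(&adev->reset_sem);
>> >         wake_up_all(&adev->reset_wq);
>> >         return 0;
>> > }
>> >
>> > /* What other threads call instead of plain down_read(): */
>> > static void amdgpu_read_lock(struct amdgpu_device *adev)
>> > {
>> >         for (;;) {
>> >                 down_read(&adev->reset_sem);
>> >                 if (!atomic_read(&adev->in_gpu_reset))
>> >                         return;         /* no reset in progress */
>> >                 /* A reset started: drop the read lock and wait
>> >                  * for the recovery-finished event. */
>> >                 up_read(&adev->reset_sem);
>> >                 wait_event(adev->reset_wq,
>> >                            !atomic_read(&adev->in_gpu_reset));
>> >         }
>> > }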
>> >
>> > Dennis Li (4):
>> >    drm/amdgpu: remove reset lock from low level functions
>> >    drm/amdgpu: refine the GPU recovery sequence
>> >    drm/amdgpu: instead of using down/up_read directly
>> >    drm/amdkfd: add reset lock protection for kfd entry functions
>> >
>> >   drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 6 +
>> >   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c   | 14 +-
>> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 173 +++++++++++++-----
>> >   .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c    | 8 -
>> >   drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c        | 4 +-
>> >   drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c         | 9 +-
>> >   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c         | 5 +-
>> >   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c         | 5 +-
>> >   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      | 172 ++++++++++++++++-
>> >   drivers/gpu/drm/amd/amdkfd/kfd_priv.h         | 3 +-
>> >   drivers/gpu/drm/amd/amdkfd/kfd_process.c      | 4 +
>> >   .../amd/amdkfd/kfd_process_queue_manager.c    | 17 ++
>> >   12 files changed, 345 insertions(+), 75 deletions(-)
>> >
>>
