From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
To: "Christian König" <christian.koenig@amd.com>,
	"Christian König" <ckoenig.leichtzumerken@gmail.com>,
	"Li, Dennis" <Dennis.Li@amd.com>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	"Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"Kuehling, Felix" <Felix.Kuehling@amd.com>,
	"Zhang, Hawking" <Hawking.Zhang@amd.com>
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Date: Fri, 9 Apr 2021 14:18:40 -0400
Message-ID: <62a329d4-ffd1-3ac1-03eb-dd0089b75541@amd.com>
In-Reply-To: <70a534b7-2e55-cdd7-2f82-3b8799967aa0@amd.com>


On 2021-04-09 12:39 p.m., Christian König wrote:
> Am 09.04.21 um 17:42 schrieb Andrey Grodzovsky:
>>
>> On 2021-04-09 3:01 a.m., Christian König wrote:
>>> Am 09.04.21 um 08:53 schrieb Christian König:
>>>> Am 08.04.21 um 22:39 schrieb Andrey Grodzovsky:
>>>>> [SNIP]
>>>>> But inserting drm_dev_enter/exit at the highest level, in 
>>>>> drm_ioctl, is much less effort and leaves less room for error 
>>>>> than going through each IOCTL and trying to identify at what 
>>>>> point (possibly multiple points) it is about to access HW; some 
>>>>> of this is hidden deep in HAL layers such as the DC layer in the 
>>>>> display driver or the multiple layers of the powerplay/SMU 
>>>>> libraries. Also, we can't limit ourselves to the back-end, if by 
>>>>> this you mean the ASIC-specific functions which access registers. 
>>>>> We also need to take care of any MMIO kernel BOs (VRAM BOs) where 
>>>>> we may access MMIO space directly by pointer from the front end 
>>>>> of the driver (HW agnostic) and the TTM/DRM layers.
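
A minimal sketch of what that blanket guard could look like inside
drm_ioctl (simplified; drm_ioctl_dispatch() is a hypothetical stand-in
for the existing dispatch body):

long drm_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
        struct drm_file *file_priv = filp->private_data;
        struct drm_device *dev = file_priv->minor->dev;
        long ret;
        int idx;

        /* Fails once drm_dev_unplug() has run; otherwise enters an
         * SRCU read section that drm_dev_unplug() waits on. */
        if (!drm_dev_enter(dev, &idx))
                return -ENODEV;

        ret = drm_ioctl_dispatch(filp, cmd, arg);

        drm_dev_exit(idx);
        return ret;
}
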
>>>>
>>>> Exactly, yes. The key point is that we need to identify such 
>>>> places anyway for GPU reset to work properly, so we could just 
>>>> piggyback hotplug on top of that work and be done.
>>
>>
>> I see most of this was done by Dennis in this patch 
>> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=df9c8d1aa278c435c30a69b8f2418b4a52fcb929, 
>> but indeed it doesn't cover the direct by-pointer accesses to MMIO, 
>> and it will introduce many more of those; as people write new code, 
>> new places to cover will pop up, leading to regressions and extra 
>> work to fix them. It would be much better if we could blanket-cover 
>> it at the very top, such as the root of all IOCTLs or, for any 
>> queued work/timer, at the top-level function, and handle it once 
>> and for all.
>
> And that's exactly what is not possible. At least for the reset case 
> you need to look into each hardware access and handle it bit by bit, 
> and I think that for the hotplug case we should go down that route 
> as well.
>
>>>>
>>>>>
>>>>> Our problem here is how to signal all the existing fences on one 
>>>>> hand and, on the other, prevent any new dma_fence waits after we 
>>>>> have finished signaling the existing ones. Once we solve this, 
>>>>> there is no problem using drm_dev_unplug in conjunction with 
>>>>> drm_dev_enter/exit at the highest level of drm_ioctl to flush any 
>>>>> IOCTLs in flight and block any new ones.
>>>>>
>>>>> IMHO when we speak about signaling all fences we don't mean ALL 
>>>>> the currently existing dma_fence structs (they are spread all 
>>>>> over the place) but rather all the HW fences, because the HW is 
>>>>> what's gone and we can't expect those fences to ever be signaled. 
>>>>> All the rest, such as scheduler fences, user fences, drm_gem 
>>>>> reservation objects etc., either depend on those HW fences, and 
>>>>> hence signaling the HW fences will in turn signal them, or are 
>>>>> not impacted by the HW being gone and hence can still be waited 
>>>>> on and will complete. If this assumption is correct then I think 
>>>>> we should use some flag to prevent any new submission to HW which 
>>>>> creates HW fences (somewhere around amdgpu_fence_emit), then 
>>>>> traverse all existing HW fences (currently they are spread across 
>>>>> a few places, so maybe we need to track them in a list) and 
>>>>> signal them. After that it's safe to call drm_dev_unplug and be 
>>>>> sure synchronize_srcu won't stall because of a dma_fence_wait. 
>>>>> After that we can proceed to canceling work items, stopping 
>>>>> schedulers, etc.
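
Sketching the teardown ordering proposed above (the helpers marked
hypothetical don't exist yet; drm_to_adev() and drm_dev_unplug() do):

static void amdgpu_pci_remove(struct pci_dev *pdev)
{
        struct drm_device *ddev = pci_get_drvdata(pdev);
        struct amdgpu_device *adev = drm_to_adev(ddev);

        /* 1. Block creation of new HW fences (hypothetical flag,
         *    checked in amdgpu_fence_emit()). */
        amdgpu_fence_disallow_emit(adev);               /* hypothetical */

        /* 2. Force-signal every HW fence already emitted so nothing
         *    blocks in dma_fence_wait() afterwards. */
        amdgpu_signal_all_hw_fences(adev);              /* hypothetical */

        /* 3. Now synchronize_srcu() inside drm_dev_unplug() cannot
         *    stall on a fence wait in an in-flight IOCTL. */
        drm_dev_unplug(ddev);

        /* 4. Only then cancel work items, stop the schedulers, etc. */
}
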
>>>>
>>>> That is problematic as well, since you need to make sure that the 
>>>> scheduler is not creating a new hardware fence at the moment you 
>>>> try to signal all of them. It would require another SRCU or lock 
>>>> for this.
>>
>>
>> What if we use a list and a flag called 'emit_allowed' under a 
>> lock, such that in amdgpu_fence_emit we take the lock, check the 
>> flag and, if true, add the new HW fence to the list and proceed 
>> with HW emission as normal, otherwise return -ENODEV. In 
>> amdgpu_pci_remove we take the lock, set the flag to false, and then 
>> iterate the list and force-signal the fences. Will this not prevent 
>> any new HW fence creation from any place trying to do so from now 
>> on?
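
Roughly like this, with hypothetical fields on struct amdgpu_device
and a hypothetical list node on struct amdgpu_fence:

/* hypothetical additions to struct amdgpu_device */
spinlock_t              hw_fence_lock;
struct list_head        hw_fence_list;
bool                    emit_allowed;

/* in amdgpu_fence_emit() */
spin_lock(&adev->hw_fence_lock);
if (!adev->emit_allowed) {
        spin_unlock(&adev->hw_fence_lock);
        return -ENODEV;
}
list_add_tail(&fence->hw_node, &adev->hw_fence_list);
spin_unlock(&adev->hw_fence_lock);
/* ... proceed with the normal HW emission ... */

/* in amdgpu_pci_remove() */
spin_lock(&adev->hw_fence_lock);
adev->emit_allowed = false;
list_for_each_entry(fence, &adev->hw_fence_list, hw_node)
        dma_fence_signal(&fence->base);
spin_unlock(&adev->hw_fence_lock);
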
>
> Way too much overhead. The fence processing is intentionally lock 
> free to avoid cache line bouncing, because the IRQ can move from CPU 
> to CPU.
>
> We need something which, at the very least, doesn't affect the 
> processing of fences in the interrupt handler at all.


As far as I can see in the code, amdgpu_fence_emit is only called from 
task context. Also, we can skip the list I proposed and just use 
amdgpu_fence_driver_force_completion on each ring to signal all created 
HW fences.
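
I.e. something along these lines in the removal path:

int i;

for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
        struct amdgpu_ring *ring = adev->rings[i];

        if (!ring || !ring->fence_drv.initialized)
                continue;

        /* Writes the last emitted seqno back as completed, which
         * signals every outstanding fence on this ring. */
        amdgpu_fence_driver_force_completion(ring);
}
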


>
>>>
>>> Alternatively grabbing the reset write side and stopping and then 
>>> restarting the scheduler could work as well.
>>>
>>> Christian.
>>
>>
>> I didn't get the above, and I don't see why I need to reuse the GPU 
>> reset rw_lock; I rely on the SRCU unplug flag for unplug. It's also 
>> not clear to me why we are focusing on the scheduler threads: any 
>> code path that generates HW fences should be covered, so any code 
>> leading to amdgpu_fence_emit needs to be taken into account, such 
>> as direct IB submissions, VM flushes, etc.
>
> You need to work together with the reset lock anyway, because a 
> hotplug could run at the same time as a reset.


Going my way, I indeed now see that I have to take the reset write-side 
lock during HW fence signaling in order to protect against scheduler/HW 
fence detachment and reattachment during scheduler stop/restart. But if 
we go with your approach, then calling drm_dev_unplug and scoping 
amdgpu_job_timedout with drm_dev_enter/exit should be enough to prevent 
any concurrent GPU resets during unplug. In fact I already do it 
anyway - 
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=ef0ea4dd29ef44d2649c5eda16c8f4869acc36b1
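
A rough sketch of that scoping (return values assume the then-recent
drm_gpu_sched_stat rework in the scheduler):

static enum drm_gpu_sched_stat
amdgpu_job_timedout(struct drm_sched_job *s_job)
{
        struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
        int idx;

        if (!drm_dev_enter(adev_to_drm(ring->adev), &idx)) {
                /* Device already unplugged - don't start a GPU reset. */
                return DRM_GPU_SCHED_STAT_NOMINAL;
        }

        /* ... existing timeout handling / amdgpu_device_gpu_recover() ... */

        drm_dev_exit(idx);
        return DRM_GPU_SCHED_STAT_NOMINAL;
}
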

Andrey


>
>
> Christian.

Thread overview: 56+ messages
2021-03-18  7:23 [PATCH 0/4] Refine GPU recovery sequence to enhance its stability Dennis Li
2021-03-18  7:23 ` [PATCH 1/4] drm/amdgpu: remove reset lock from low level functions Dennis Li
2021-03-18  7:23 ` [PATCH 2/4] drm/amdgpu: refine the GPU recovery sequence Dennis Li
2021-03-18  7:56   ` Christian König
2021-03-18  7:23 ` [PATCH 3/4] drm/amdgpu: instead of using down/up_read directly Dennis Li
2021-03-18  7:23 ` [PATCH 4/4] drm/amdkfd: add reset lock protection for kfd entry functions Dennis Li
2021-03-18  7:53 ` [PATCH 0/4] Refine GPU recovery sequence to enhance its stability Christian König
2021-03-18  8:28   ` Li, Dennis
2021-03-18  8:58     ` AW: " Koenig, Christian
2021-03-18  9:30       ` Li, Dennis
2021-03-18  9:51         ` Christian König
2021-04-05 17:58           ` Andrey Grodzovsky
2021-04-06 10:34             ` Christian König
2021-04-06 11:21               ` Christian König
2021-04-06 21:22               ` Andrey Grodzovsky
2021-04-07 10:28                 ` Christian König
2021-04-07 19:44                   ` Andrey Grodzovsky
2021-04-08  8:22                     ` Christian König
2021-04-08  8:32                       ` Christian König
2021-04-08 16:08                         ` Andrey Grodzovsky
2021-04-08 18:58                           ` Christian König
2021-04-08 20:39                             ` Andrey Grodzovsky
2021-04-09  6:53                               ` Christian König
2021-04-09  7:01                                 ` Christian König
2021-04-09 15:42                                   ` Andrey Grodzovsky
2021-04-09 16:39                                     ` Christian König
2021-04-09 18:18                                       ` Andrey Grodzovsky [this message]
2021-04-10 17:34                                         ` Christian König
2021-04-12 17:27                                           ` Andrey Grodzovsky
2021-04-12 17:44                                             ` Christian König
2021-04-12 18:01                                               ` Andrey Grodzovsky
2021-04-12 18:05                                                 ` Christian König
2021-04-12 18:18                                                   ` Andrey Grodzovsky
2021-04-12 18:23                                                     ` Christian König
2021-04-12 19:12                                                       ` Andrey Grodzovsky
2021-04-12 19:18                                                         ` Christian König
2021-04-12 20:01                                                           ` Andrey Grodzovsky
2021-04-13  7:10                                                             ` Christian König
2021-04-13  9:13                                                               ` Li, Dennis
2021-04-13  9:14                                                                 ` Christian König
2021-04-13 20:08                                                                 ` Daniel Vetter
2021-04-13 15:12                                                               ` Andrey Grodzovsky
2021-04-13 18:03                                                                 ` Christian König
2021-04-13 18:18                                                                   ` Andrey Grodzovsky
2021-04-13 18:25                                                                     ` Christian König
2021-04-13 18:30                                                                       ` Andrey Grodzovsky
2021-04-14  7:01                                                                         ` Christian König
2021-04-14 14:36                                                                           ` Andrey Grodzovsky
2021-04-14 14:58                                                                             ` Christian König
2021-04-15  6:27                                                                               ` Andrey Grodzovsky
2021-04-15  7:02                                                                                 ` Christian König
2021-04-15 14:11                                                                                   ` Andrey Grodzovsky
2021-04-15 15:09                                                                                     ` Christian König
2021-04-13 20:07                                                               ` Daniel Vetter
2021-04-13  5:36                                                       ` Andrey Grodzovsky
2021-04-13  7:07                                                         ` Christian König
