From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
To: "Christian König" <christian.koenig@amd.com>,
"Christian König" <ckoenig.leichtzumerken@gmail.com>,
"Li, Dennis" <Dennis.Li@amd.com>,
"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
"Deucher, Alexander" <Alexander.Deucher@amd.com>,
"Kuehling, Felix" <Felix.Kuehling@amd.com>,
"Zhang, Hawking" <Hawking.Zhang@amd.com>
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Date: Fri, 9 Apr 2021 14:18:40 -0400 [thread overview]
Message-ID: <62a329d4-ffd1-3ac1-03eb-dd0089b75541@amd.com> (raw)
In-Reply-To: <70a534b7-2e55-cdd7-2f82-3b8799967aa0@amd.com>
On 2021-04-09 12:39 p.m., Christian König wrote:
> Am 09.04.21 um 17:42 schrieb Andrey Grodzovsky:
>>
>> On 2021-04-09 3:01 a.m., Christian König wrote:
>>> Am 09.04.21 um 08:53 schrieb Christian König:
>>>> Am 08.04.21 um 22:39 schrieb Andrey Grodzovsky:
>>>>> [SNIP]
>>>>> But inserting drm_dev_enter/exit at the highest level in drm_ioctl
>>>>> is much less effort and leaves less room for error than going
>>>>> through each IOCTL and trying to identify at what point (possibly
>>>>> multiple points) they are about to access HW; some of this is
>>>>> hidden deep in HAL layers such as the DC layer in the display
>>>>> driver or the multiple layers of the powerplay/SMU libraries.
>>>>> Also, we can't limit ourselves to the back-end only, if by this
>>>>> you mean ASIC-specific functions which access registers. We also
>>>>> need to take care of any MMIO kernel BO (VRAM BOs) where we may
>>>>> directly access MMIO space by pointer from the front end of the
>>>>> driver (HW agnostic) and the TTM/DRM layers.
>>>>
>>>> Exactly, yes. The key point is that we need to identify such places
>>>> anyway for GPU reset to work properly. So we could just piggyback
>>>> hotplug on top of that work and be done.
>>
>>
>> I see most of this was done by Dennis in this patch
>> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=df9c8d1aa278c435c30a69b8f2418b4a52fcb929,
>> but indeed it doesn't cover the direct by-pointer accesses of MMIO,
>> and it will introduce many more of those; as people write new code,
>> new places to cover will pop up, leading to regressions and extra
>> work to fix. It would really be much better if we could blanket-cover
>> it at the very top, such as at the root of all IOCTLs or, for any
>> queued work/timer, at the very top function, to handle it once and
>> for all.
>
> And that's exactly what is not possible. At least for the reset case
> you need to look into each hardware access and handle it bit by bit,
> and I think that for the hotplug case we should go down that route as
> well.
>
>>>>
>>>>>
>>>>> Our problem here is how, on one hand, to signal all the existing
>>>>> fences and, on the other, to prevent any new dma_fence waits after
>>>>> we have finished signaling the existing ones. Once we have solved
>>>>> this, there is no problem using drm_dev_unplug in conjunction with
>>>>> drm_dev_enter/exit at the highest level of drm_ioctl to flush any
>>>>> IOCTLs in flight and block any new ones.
>>>>>
>>>>> IMHO when we speak about signaling all fences we don't mean ALL
>>>>> the currently existing dma_fence structs (they are spread all over
>>>>> the place) but rather all the HW fences, because the HW is what's
>>>>> gone and we can't expect those fences to ever be signaled. All the
>>>>> rest, such as scheduler fences, user fences, drm_gem reservation
>>>>> objects etc., either depend on those HW fences (and hence
>>>>> signaling the HW fences will in turn signal them) or are not
>>>>> impacted by the HW being gone, and so can still be waited on and
>>>>> will complete. If this assumption is correct then I think we
>>>>> should use some flag to prevent any new submission to HW which
>>>>> creates HW fences (somewhere around amdgpu_fence_emit), then
>>>>> traverse all existing HW fences (currently they are spread over a
>>>>> few places, so maybe we need to track them in a list) and signal
>>>>> them. After that it's safe to call drm_dev_unplug and be sure
>>>>> synchronize_srcu won't stall because of dma_fence_wait. After that
>>>>> we can proceed to canceling work items, stopping schedulers etc.
>>>>
>>>> That is problematic as well, since you need to make sure that the
>>>> scheduler is not creating a new hardware fence at the moment you
>>>> try to signal all of them. It would require another SRCU or lock
>>>> for this.
>>
>>
>> What if we use a list and a flag called 'emit_allowed' under a lock,
>> such that in amdgpu_fence_emit we take the lock, check the flag and,
>> if true, add the new HW fence to the list and proceed to HW emission
>> as normal, otherwise return -ENODEV. In amdgpu_pci_remove we take
>> the lock, set the flag to false, and then iterate the list and force
>> signal the fences. Will this not prevent any new HW fence creation
>> from then on, from any place trying to do so?
>
> Way too much overhead. The fence processing is intentionally lock-free
> to avoid cache line bouncing, because the IRQ can move from CPU to CPU.
>
> We need something which at least doesn't affect the processing of
> fences in the interrupt handler at all.
As far as I can see in the code, amdgpu_fence_emit is only called from
task context. Also, we can skip the list I proposed and just use
amdgpu_fence_driver_force_completion for each ring to signal all created
HW fences.
>
>>>
>>> Alternatively grabbing the reset write side and stopping and then
>>> restarting the scheduler could work as well.
>>>
>>> Christian.
>>
>>
>> I didn't get the above, and I don't see why I need to reuse the GPU
>> reset rw_lock. I rely on the SRCU unplug flag for unplug. Also, it's
>> not clear to me why we are focusing on the scheduler threads; any
>> code path that generates HW fences should be covered, so any code
>> leading to amdgpu_fence_emit needs to be taken into account, such as
>> direct IB submissions, VM flushes etc.
>
> You need to work together with the reset lock anyway, because a
> hotplug could run at the same time as a reset.
Going my way, I indeed now see that I have to take the reset write
side lock during HW fence signaling in order to protect against
scheduler/HW fence detachment and reattachment during the schedulers'
stop/restart. But if we go with your approach, then calling
drm_dev_unplug and scoping amdgpu_job_timeout with drm_dev_enter/exit
should be enough to prevent any concurrent GPU resets during unplug. In
fact I already do it anyway -
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=ef0ea4dd29ef44d2649c5eda16c8f4869acc36b1
Andrey
>
>
> Christian.
>
>>
>> Andrey
>>
>>
>>>
>>>>
>>>> Christian.
>>>>
>>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>
>