From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
To: "Christian König" <christian.koenig@amd.com>,
	"Christian König" <ckoenig.leichtzumerken@gmail.com>,
	"Li, Dennis" <Dennis.Li@amd.com>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	"Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"Kuehling, Felix" <Felix.Kuehling@amd.com>,
	"Zhang, Hawking" <Hawking.Zhang@amd.com>
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Date: Mon, 12 Apr 2021 14:01:23 -0400	[thread overview]
Message-ID: <aaa2b266-f091-dd9c-e49d-5e528decfbd7@amd.com> (raw)
In-Reply-To: <cd94e02c-11c8-0198-ab70-0ceee54d437b@amd.com>


On 2021-04-12 1:44 p.m., Christian König wrote:
>
> Am 12.04.21 um 19:27 schrieb Andrey Grodzovsky:
>> On 2021-04-10 1:34 p.m., Christian König wrote:
>>> Hi Andrey,
>>>
>>> Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky:
>>>> [SNIP]
>>>>>>
>>>>>> What if we use a list and a flag called 'emit_allowed' under a
>>>>>> lock, such that in amdgpu_fence_emit we lock the list, check the
>>>>>> flag and, if it is true, add the new HW fence to the list and
>>>>>> proceed to HW emission as normal, otherwise return -ENODEV. In
>>>>>> amdgpu_pci_remove we take the lock, set the flag to false, and
>>>>>> then iterate the list and force-signal everything on it. Will
>>>>>> this not prevent any new HW fence creation from then on, from
>>>>>> any place trying to do so?
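
(For reference, a minimal sketch of that list-and-flag idea; all the
names here - emit_allowed, fence_list, fence_list_lock, the hw_fence
wrapper - are hypothetical, not the actual amdgpu structures:)

    /* Sketch: gate HW fence creation behind a flag taken under a
     * spinlock, so the remove path can atomically forbid new fences
     * and then force-signal everything already on the list.
     */
    struct hw_fence {
            struct dma_fence base;
            struct list_head node;
    };

    static int example_fence_emit(struct example_dev *edev,
                                  struct hw_fence *f)
    {
            spin_lock(&edev->fence_list_lock);
            if (!edev->emit_allowed) {
                    spin_unlock(&edev->fence_list_lock);
                    return -ENODEV; /* removal already started */
            }
            list_add_tail(&f->node, &edev->fence_list);
            spin_unlock(&edev->fence_list_lock);

            /* ... proceed with normal HW emission ... */
            return 0;
    }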
>>>>>
>>>>> Way too much overhead. The fence processing is intentionally lock
>>>>> free to avoid cache line bouncing, because the IRQ can move from
>>>>> CPU to CPU.
>>>>>
>>>>> We need something which doesn't affect the processing of fences
>>>>> in the interrupt handler at all.
>>>>
>>>>
>>>> As far as I can see in the code, amdgpu_fence_emit is only called
>>>> from task context. Also, we can skip the list I proposed and just
>>>> use amdgpu_fence_driver_force_completion for each ring to signal
>>>> all created HW fences.
>>>
>>> Ah, wait a second, this gave me another idea.
>>>
>>> See amdgpu_fence_driver_force_completion():
>>>
>>> amdgpu_fence_write(ring, ring->fence_drv.sync_seq);
>>>
>>> If we change that to something like:
>>>
>>> amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFFFFFF);
>>>
>>> Not only the currently submitted, but also the next 0x3FFFFFFF 
>>> fences will be considered signaled.
>>>
>>> This basically solves our problem of making sure that new fences
>>> are also signaled, without any additional overhead whatsoever.
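
(A rough sketch of what that modified helper would look like, assuming
the existing amdgpu_fence_write()/amdgpu_fence_process() helpers; the
exact offset semantics depend on the 32-bit seqno wraparound handling:)

    /* Sketch: write a seqno far ahead of sync_seq so both the
     * currently submitted fences and the next 0x3FFFFFFF future ones
     * compare as signaled, then walk the fence array once to signal
     * whatever is still stored there.
     */
    void example_force_completion(struct amdgpu_ring *ring)
    {
            amdgpu_fence_write(ring,
                               ring->fence_drv.sync_seq + 0x3FFFFFFF);
            amdgpu_fence_process(ring);
    }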
>>
>>
>> The problem with this is that setting sync_seq to some MAX value
>> alone is not enough: you actually have to call amdgpu_fence_process
>> to iterate and signal the fences currently stored in the
>> ring->fence_drv.fences array, and to guarantee that once you are
>> done signalling, no more HW fences will be added to that array.
>> I was thinking of doing something like below:
>>
>
> Well we could implement the is_signaled callback once more, but I'm 
> not sure if that is a good idea.


This indeed could save the explicit signaling I am doing below, but I
also set an error code there, which might be helpful to propagate to users.
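
(For illustration, such a .signaled callback could look roughly like
the sketch below; the unplug flag and the seqno helper are
hypothetical stand-ins, and note that no error code reaches the fence
this way:)

    /* Sketch: once the device is flagged as unplugged, report every
     * fence as signaled so waiters make progress without us having to
     * walk and force-signal the fence array explicitly.
     */
    static bool example_fence_signaled(struct dma_fence *f)
    {
            struct example_dev *edev = example_fence_to_dev(f);

            if (READ_ONCE(edev->unplug))
                    return true;    /* device gone, unblock waiters */

            return example_seqno_signaled(f); /* normal HW seqno check */
    }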


>
>> amdgpu_fence_emit()
>> {
>>     dma_fence_init(fence);
>>
>>     srcu_read_lock(amdgpu_unplug_srcu);
>>
>>     if (!adev->unplug) {
>>
>>         seq = ++ring->fence_drv.sync_seq;
>>         emit_fence(fence);
>>
>>         /* We can't wait forever as the HW might be gone at any point */
>>         dma_fence_wait_timeout(old_fence, 5s);
>>
>
> You can pretty much ignore this wait here. It is only there as a last
> resort so that we never overwrite the ring buffers.


If the device is present, how can I ignore this?


>
> But it should not have a timeout as far as I can see.


Without a timeout on the wait the whole approach falls apart: I can't
call synchronize_srcu over this scope, because once the device is
physically gone the wait here will block forever.
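
(To illustrate, the bounded wait could look roughly like this; the
5-second budget is arbitrary, and dma_fence_wait_timeout() returns the
remaining jiffies, 0 on timeout, or a negative error:)

    signed long r;

    /* Bound the wait so a physically removed device cannot stall us
     * forever inside the SRCU read-side critical section.
     */
    r = dma_fence_wait_timeout(old_fence, false,
                               msecs_to_jiffies(5000));
    if (r <= 0)
            DRM_WARN("fence wait timed out, device may be gone\n");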


>
>>         ring->fence_drv.fences[seq & ring->fence_drv.num_fences_mask] = fence;
>>
>>     } else {
>>
>>         dma_fence_set_error(fence, -ENODEV);
>>         dma_fence_signal(fence);
>>
>>     }
>>
>>     srcu_read_unlock(amdgpu_unplug_srcu);
>>     return fence;
>>
>> }
>>
>> amdgpu_pci_remove()
>> {
>>     adev->unplug = true;
>>     synchronize_srcu(amdgpu_unplug_srcu);
>>
>
> Well that is just duplicating what drm_dev_unplug() should be doing on 
> a different level.


drm_dev_unplug has a much wider scope: it covers everything in the
device, including 'flushing' in-flight IOCTLs. This deals specifically
with the issue of force-signalling HW fences.
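
To spell out the SRCU pattern I have in mind (a minimal sketch;
amdgpu_unplug_srcu and the unplug flag are the hypothetical names from
the pseudocode above, not existing amdgpu fields):

    DEFINE_STATIC_SRCU(amdgpu_unplug_srcu);

    /* Read side: every path that can emit a HW fence. */
    static int example_emit(struct example_dev *edev)
    {
            int idx, r;

            idx = srcu_read_lock(&amdgpu_unplug_srcu);
            if (READ_ONCE(edev->unplug))
                    r = -ENODEV;    /* set error and signal instead */
            else
                    r = example_emit_to_ring(edev); /* hypothetical */
            srcu_read_unlock(&amdgpu_unplug_srcu, idx);
            return r;
    }

    /* Write side, in amdgpu_pci_remove(): once synchronize_srcu()
     * returns, every reader that saw unplug == false has finished,
     * so no new fence can reach the HW ring and force-signalling the
     * remaining ones is safe.
     */
    static void example_unplug(struct example_dev *edev)
    {
            WRITE_ONCE(edev->unplug, true);
            synchronize_srcu(&amdgpu_unplug_srcu);
    }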

Andrey


>
> Christian.
>
>>     /* Past this point no more fences are submitted to the HW ring,
>>      * and hence we can safely force-signal all that are currently
>>      * there. Any subsequently created HW fences will be returned
>>      * signaled with an error code right away.
>>      */
>>
>>     for_each_ring(adev)
>>         amdgpu_fence_process(ring);
>>
>>     drm_dev_unplug(dev);
>>     /* stop schedulers */
>>     cancel_sync(all timers and queued works);
>>     hw_fini();
>>     unmap_mmio();
>>
>> }
>>
>>
>> Andrey
>>
>>
>>>
>>>
>>>>
>>>>>>>
>>>>>>> Alternatively grabbing the reset write side and stopping and 
>>>>>>> then restarting the scheduler could work as well.
>>>>>>>
>>>>>>> Christian.
>>>>>>
>>>>>>
>>>>>> I didn't get the above, and I don't see why I need to reuse the
>>>>>> GPU reset rw_lock. I rely on the SRCU unplug flag for unplug.
>>>>>> Also, it's not clear to me why we are focusing on the scheduler
>>>>>> threads: any code path that can generate HW fences should be
>>>>>> covered, so any code leading to amdgpu_fence_emit needs to be
>>>>>> taken into account, such as direct IB submissions, VM flushes, etc.
>>>>>
>>>>> You need to work together with the reset lock anyway, because a
>>>>> hotplug could run at the same time as a reset.
>>>>
>>>>
>>>> Going my way, I indeed now see that I have to take the reset
>>>> write-side lock during HW fence signalling, in order to protect
>>>> against scheduler/HW fence detachment and reattachment during
>>>> scheduler stop/restart. But if we go with your approach, then
>>>> calling drm_dev_unplug and scoping amdgpu_job_timeout with
>>>> drm_dev_enter/exit should be enough to prevent any concurrent GPU
>>>> resets during unplug. In fact I already do it anyway -
>>>> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=ef0ea4dd29ef44d2649c5eda16c8f4869acc36b1
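
For illustration, the drm_dev_enter/exit scoping looks roughly like
the sketch below (simplified: the real handler derives the drm_device
from the job's ring, and the bail-out status value is illustrative):

    /* Sketch: make the job-timeout handler a no-op once
     * drm_dev_unplug() has run, so no GPU reset can start
     * concurrently with the unplug sequence.
     */
    static enum drm_gpu_sched_stat
    example_job_timedout(struct drm_sched_job *s_job,
                         struct drm_device *dev)
    {
            int idx;

            if (!drm_dev_enter(dev, &idx))
                    return DRM_GPU_SCHED_STAT_ENODEV; /* unplugged */

            /* ... normal timeout handling / GPU reset ... */

            drm_dev_exit(idx);
            return DRM_GPU_SCHED_STAT_NOMINAL;
    }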
>>>
>>> Yes, good point as well.
>>>
>>> Christian.
>>>
>>>>
>>>> Andrey
>>>>
>>>>
>>>>>
>>>>>
>>>>> Christian.
>>>>>
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Andrey
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Andrey
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>
>


Thread overview: 56+ messages
2021-03-18  7:23 [PATCH 0/4] Refine GPU recovery sequence to enhance its stability Dennis Li
2021-03-18  7:23 ` [PATCH 1/4] drm/amdgpu: remove reset lock from low level functions Dennis Li
2021-03-18  7:23 ` [PATCH 2/4] drm/amdgpu: refine the GPU recovery sequence Dennis Li
2021-03-18  7:56   ` Christian König
2021-03-18  7:23 ` [PATCH 3/4] drm/amdgpu: instead of using down/up_read directly Dennis Li
2021-03-18  7:23 ` [PATCH 4/4] drm/amdkfd: add reset lock protection for kfd entry functions Dennis Li
2021-03-18  7:53 ` [PATCH 0/4] Refine GPU recovery sequence to enhance its stability Christian König
2021-03-18  8:28   ` Li, Dennis
2021-03-18  8:58     ` AW: " Koenig, Christian
2021-03-18  9:30       ` Li, Dennis
2021-03-18  9:51         ` Christian König
2021-04-05 17:58           ` Andrey Grodzovsky
2021-04-06 10:34             ` Christian König
2021-04-06 11:21               ` Christian König
2021-04-06 21:22               ` Andrey Grodzovsky
2021-04-07 10:28                 ` Christian König
2021-04-07 19:44                   ` Andrey Grodzovsky
2021-04-08  8:22                     ` Christian König
2021-04-08  8:32                       ` Christian König
2021-04-08 16:08                         ` Andrey Grodzovsky
2021-04-08 18:58                           ` Christian König
2021-04-08 20:39                             ` Andrey Grodzovsky
2021-04-09  6:53                               ` Christian König
2021-04-09  7:01                                 ` Christian König
2021-04-09 15:42                                   ` Andrey Grodzovsky
2021-04-09 16:39                                     ` Christian König
2021-04-09 18:18                                       ` Andrey Grodzovsky
2021-04-10 17:34                                         ` Christian König
2021-04-12 17:27                                           ` Andrey Grodzovsky
2021-04-12 17:44                                             ` Christian König
2021-04-12 18:01                                               ` Andrey Grodzovsky [this message]
2021-04-12 18:05                                                 ` Christian König
2021-04-12 18:18                                                   ` Andrey Grodzovsky
2021-04-12 18:23                                                     ` Christian König
2021-04-12 19:12                                                       ` Andrey Grodzovsky
2021-04-12 19:18                                                         ` Christian König
2021-04-12 20:01                                                           ` Andrey Grodzovsky
2021-04-13  7:10                                                             ` Christian König
2021-04-13  9:13                                                               ` Li, Dennis
2021-04-13  9:14                                                                 ` Christian König
2021-04-13 20:08                                                                 ` Daniel Vetter
2021-04-13 15:12                                                               ` Andrey Grodzovsky
2021-04-13 18:03                                                                 ` Christian König
2021-04-13 18:18                                                                   ` Andrey Grodzovsky
2021-04-13 18:25                                                                     ` Christian König
2021-04-13 18:30                                                                       ` Andrey Grodzovsky
2021-04-14  7:01                                                                         ` Christian König
2021-04-14 14:36                                                                           ` Andrey Grodzovsky
2021-04-14 14:58                                                                             ` Christian König
2021-04-15  6:27                                                                               ` Andrey Grodzovsky
2021-04-15  7:02                                                                                 ` Christian König
2021-04-15 14:11                                                                                   ` Andrey Grodzovsky
2021-04-15 15:09                                                                                     ` Christian König
2021-04-13 20:07                                                               ` Daniel Vetter
2021-04-13  5:36                                                       ` Andrey Grodzovsky
2021-04-13  7:07                                                         ` Christian König
