amd-gfx.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
To: "Christian König" <christian.koenig@amd.com>,
	"Christian König" <ckoenig.leichtzumerken@gmail.com>,
	"Li, Dennis" <Dennis.Li@amd.com>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	"Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"Kuehling, Felix" <Felix.Kuehling@amd.com>,
	"Zhang, Hawking" <Hawking.Zhang@amd.com>
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Date: Mon, 12 Apr 2021 16:01:07 -0400	[thread overview]
Message-ID: <9ea8f0c1-8d58-f4ab-e2a9-1ef8fd00718c@amd.com> (raw)
In-Reply-To: <a92e8f96-029f-6d21-4430-7e3116c21dea@amd.com>


On 2021-04-12 3:18 p.m., Christian König wrote:
> Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:
>> [SNIP]
>>>>
>>>> So what's the right approach ? How we guarantee that when running 
>>>> amdgpu_fence_driver_force_completion we will signal all the HW 
>>>> fences and not racing against some more fences insertion into that 
>>>> array ?
>>>>
>>>
>>> Well I would still say the best approach would be to insert this 
>>> between the front end and the backend and not rely on signaling 
>>> fences while holding the device srcu.
>>
>>
>> My question is, even now, when we run 
>> amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or 
>> amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, 
>> what there prevents a race with another fence being at the same time 
>> emitted and inserted into the fence array ? Looks like nothing.
>>
>
> Each ring can only be used by one thread at the same time, this 
> includes emitting fences as well as other stuff.
>
> During GPU reset we make sure nobody writes to the rings by stopping 
> the scheduler and taking the GPU reset lock (so that nobody else can 
> start the scheduler again).


What about direct submissions not through scheduler - 
amdgpu_job_submit_direct, I don't see how this is protected.


>
>>>
>>> BTW: Could it be that the device SRCU protects more than one device 
>>> and we deadlock because of this?
>>
>>
>> I haven't actually experienced any deadlock until now but, yes, 
>> drm_unplug_srcu is defined as static in drm_drv.c and so in the 
>> presence  of multiple devices from same or different drivers we in 
>> fact are dependent on all their critical sections i guess.
>>
>
> Shit, yeah the devil is a squirrel. So for A+I laptops we actually 
> need to sync that up with Daniel and the rest of the i915 guys.
>
> IIRC we could actually have an amdgpu device in a docking station 
> which needs hotplug and the driver might depend on waiting for the 
> i915 driver as well.


Can't we propose a patch to make drm_unplug_srcu per drm_device ? I 
don't see why it has to be global and not per device thing.

Andrey


>
> Christian.
>
>> Andrey
>>
>>
>>>
>>> Christian.
>>>
>>>> Andrey
>>>>
>>>>
>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>>>     /* Past this point no more fence are submitted to HW ring 
>>>>>>>> and hence we can safely call force signal on all that are 
>>>>>>>> currently there.
>>>>>>>>      * Any subsequently created  HW fences will be returned 
>>>>>>>> signaled with an error code right away
>>>>>>>>      */
>>>>>>>>
>>>>>>>>     for_each_ring(adev)
>>>>>>>>         amdgpu_fence_process(ring)
>>>>>>>>
>>>>>>>>     drm_dev_unplug(dev);
>>>>>>>>     Stop schedulers
>>>>>>>>     cancel_sync(all timers and queued works);
>>>>>>>>     hw_fini
>>>>>>>>     unmap_mmio
>>>>>>>>
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Alternatively grabbing the reset write side and stopping 
>>>>>>>>>>>>> and then restarting the scheduler could work as well.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I didn't get the above and I don't see why I need to reuse 
>>>>>>>>>>>> the GPU reset rw_lock. I rely on the SRCU unplug flag for 
>>>>>>>>>>>> unplug. Also, not clear to me why are we focusing on the 
>>>>>>>>>>>> scheduler threads, any code patch to generate HW fences 
>>>>>>>>>>>> should be covered, so any code leading to amdgpu_fence_emit 
>>>>>>>>>>>> needs to be taken into account such as, direct IB 
>>>>>>>>>>>> submissions, VM flushes e.t.c
>>>>>>>>>>>
>>>>>>>>>>> You need to work together with the reset lock anyway, cause 
>>>>>>>>>>> a hotplug could run at the same time as a reset.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> For going my way indeed now I see now that I have to take 
>>>>>>>>>> reset write side lock during HW fences signalling in order to 
>>>>>>>>>> protect against scheduler/HW fences detachment and 
>>>>>>>>>> reattachment during schedulers stop/restart. But if we go 
>>>>>>>>>> with your approach  then calling drm_dev_unplug and scoping 
>>>>>>>>>> amdgpu_job_timeout with drm_dev_enter/exit should be enough 
>>>>>>>>>> to prevent any concurrent GPU resets during unplug. In fact I 
>>>>>>>>>> already do it anyway - 
>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https:%2F%2Fcgit.freedesktop.org%2F~agrodzov%2Flinux%2Fcommit%2F%3Fh%3Ddrm-misc-next%26id%3Def0ea4dd29ef44d2649c5eda16c8f4869acc36b1&amp;data=04%7C01%7Candrey.grodzovsky%40amd.com%7Ceefa9c90ed8c405ec3b708d8fc46daaa%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637536728550884740%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=UiNaJE%2BH45iYmbwSDnMSKZS5z0iak0fNlbbfYqKS2Jo%3D&amp;reserved=0
>>>>>>>>>
>>>>>>>>> Yes, good point as well.
>>>>>>>>>
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Andrey
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Andrey
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

  reply	other threads:[~2021-04-12 20:01 UTC|newest]

Thread overview: 56+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-18  7:23 [PATCH 0/4] Refine GPU recovery sequence to enhance its stability Dennis Li
2021-03-18  7:23 ` [PATCH 1/4] drm/amdgpu: remove reset lock from low level functions Dennis Li
2021-03-18  7:23 ` [PATCH 2/4] drm/amdgpu: refine the GPU recovery sequence Dennis Li
2021-03-18  7:56   ` Christian König
2021-03-18  7:23 ` [PATCH 3/4] drm/amdgpu: instead of using down/up_read directly Dennis Li
2021-03-18  7:23 ` [PATCH 4/4] drm/amdkfd: add reset lock protection for kfd entry functions Dennis Li
2021-03-18  7:53 ` [PATCH 0/4] Refine GPU recovery sequence to enhance its stability Christian König
2021-03-18  8:28   ` Li, Dennis
2021-03-18  8:58     ` AW: " Koenig, Christian
2021-03-18  9:30       ` Li, Dennis
2021-03-18  9:51         ` Christian König
2021-04-05 17:58           ` Andrey Grodzovsky
2021-04-06 10:34             ` Christian König
2021-04-06 11:21               ` Christian König
2021-04-06 21:22               ` Andrey Grodzovsky
2021-04-07 10:28                 ` Christian König
2021-04-07 19:44                   ` Andrey Grodzovsky
2021-04-08  8:22                     ` Christian König
2021-04-08  8:32                       ` Christian König
2021-04-08 16:08                         ` Andrey Grodzovsky
2021-04-08 18:58                           ` Christian König
2021-04-08 20:39                             ` Andrey Grodzovsky
2021-04-09  6:53                               ` Christian König
2021-04-09  7:01                                 ` Christian König
2021-04-09 15:42                                   ` Andrey Grodzovsky
2021-04-09 16:39                                     ` Christian König
2021-04-09 18:18                                       ` Andrey Grodzovsky
2021-04-10 17:34                                         ` Christian König
2021-04-12 17:27                                           ` Andrey Grodzovsky
2021-04-12 17:44                                             ` Christian König
2021-04-12 18:01                                               ` Andrey Grodzovsky
2021-04-12 18:05                                                 ` Christian König
2021-04-12 18:18                                                   ` Andrey Grodzovsky
2021-04-12 18:23                                                     ` Christian König
2021-04-12 19:12                                                       ` Andrey Grodzovsky
2021-04-12 19:18                                                         ` Christian König
2021-04-12 20:01                                                           ` Andrey Grodzovsky [this message]
2021-04-13  7:10                                                             ` Christian König
2021-04-13  9:13                                                               ` Li, Dennis
2021-04-13  9:14                                                                 ` Christian König
2021-04-13 20:08                                                                 ` Daniel Vetter
2021-04-13 15:12                                                               ` Andrey Grodzovsky
2021-04-13 18:03                                                                 ` Christian König
2021-04-13 18:18                                                                   ` Andrey Grodzovsky
2021-04-13 18:25                                                                     ` Christian König
2021-04-13 18:30                                                                       ` Andrey Grodzovsky
2021-04-14  7:01                                                                         ` Christian König
2021-04-14 14:36                                                                           ` Andrey Grodzovsky
2021-04-14 14:58                                                                             ` Christian König
2021-04-15  6:27                                                                               ` Andrey Grodzovsky
2021-04-15  7:02                                                                                 ` Christian König
2021-04-15 14:11                                                                                   ` Andrey Grodzovsky
2021-04-15 15:09                                                                                     ` Christian König
2021-04-13 20:07                                                               ` Daniel Vetter
2021-04-13  5:36                                                       ` Andrey Grodzovsky
2021-04-13  7:07                                                         ` Christian König

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9ea8f0c1-8d58-f4ab-e2a9-1ef8fd00718c@amd.com \
    --to=andrey.grodzovsky@amd.com \
    --cc=Alexander.Deucher@amd.com \
    --cc=Dennis.Li@amd.com \
    --cc=Felix.Kuehling@amd.com \
    --cc=Hawking.Zhang@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=christian.koenig@amd.com \
    --cc=ckoenig.leichtzumerken@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).