amd-gfx.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
To: "Christian König" <ckoenig.leichtzumerken@gmail.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Li, Dennis" <Dennis.Li@amd.com>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	"Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"Kuehling, Felix" <Felix.Kuehling@amd.com>,
	"Zhang, Hawking" <Hawking.Zhang@amd.com>,
	"Daniel Vetter" <daniel@ffwll.ch>
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Date: Thu, 15 Apr 2021 10:11:55 -0400	[thread overview]
Message-ID: <1b35efc3-87c8-7ec3-1ad3-bb7c9ab515ba@amd.com> (raw)
In-Reply-To: <cd66c76d-5678-f495-75a8-b8c4f6458353@gmail.com>


On 2021-04-15 3:02 a.m., Christian König wrote:
> Am 15.04.21 um 08:27 schrieb Andrey Grodzovsky:
>>
>> On 2021-04-14 10:58 a.m., Christian König wrote:
>>> Am 14.04.21 um 16:36 schrieb Andrey Grodzovsky:
>>>>  [SNIP]
>>>>>>>
>>>>>>> We are racing here once more and need to handle that.
>>>>>>
>>>>>>
>>>>>> But why, I wrote above that we first stop the all schedulers, 
>>>>>> then only call drm_sched_entity_kill_jobs.
>>>>>
>>>>> The schedulers consuming jobs is not the problem, we already 
>>>>> handle that correct.
>>>>>
>>>>> The problem is that the entities might continue feeding stuff into 
>>>>> the scheduler.
>>>>
>>>>
>>>> Missed that.  Ok, can I just use non sleeping RCU with a flag 
>>>> around drm_sched_entity_push_job at the amdgpu level (only 2 
>>>> functions call it - amdgpu_cs_submit and amdgpu_job_submit) as a 
>>>> preliminary step to flush and block in flight and future 
>>>> submissions to entity queue ?
>>>
>>> Double checking the code I think we can use the notifier_lock for this.
>>>
>>> E.g. in amdgpu_cs.c see where we have the goto error_abort.
>>>
>>> That is the place where such a check could be added without any 
>>> additional overhead.
>>
>>
>> Sure, I will just have to add this lock to amdgpu_job_submit too.
>
> Not ideal, but I think that's fine with me. You might want to rename 
> the lock for this thought.
>
>>
>>> [SNIP]
>>>>>>
>>>>>> Maybe just empirically - let's try it and see under different 
>>>>>> test scenarios what actually happens  ?
>>>>>
>>>>> Not a good idea in general, we have that approach way to often at 
>>>>> AMD and are then surprised that everything works in QA but fails 
>>>>> in production.
>>>>>
>>>>> But Daniel already noted in his reply that waiting for a fence 
>>>>> while holding the SRCU is expected to work.
>>>>>
>>>>> So let's stick with the approach of high level locking for hotplug.
>>>>
>>>>
>>>> To my understanding this is true for other devises, not the one 
>>>> being extracted, for him you still need to do all the HW fence 
>>>> signalling dance because the HW is gone and we block any TDRs 
>>>> (which won't help anyway).
>>>>
>>>> Andrey
>>
>>
>> Do you agree to the above ?
>
> Yeah, I think that is correct.
>
> But on the other hand what Daniel reminded me of is that the handling 
> needs to be consistent over different devices. And since some device 
> already go with the approach of canceling everything we simply have to 
> go down that route as well.
>
> Christian.


What does it mean in our context ? What needs to be done which we are 
not doing now ?

Andrey


>
>>
>> Andrey
>>
>>
>>>>
>>>>
>>>>>
>>>>> Christian.
>>>>>
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Andrey
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> BTW: Could it be that the device SRCU protects more than 
>>>>>>>>>>>>>>> one device and we deadlock because of this?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I haven't actually experienced any deadlock until now 
>>>>>>>>>>>>>> but, yes, drm_unplug_srcu is defined as static in 
>>>>>>>>>>>>>> drm_drv.c and so in the presence of multiple devices from 
>>>>>>>>>>>>>> same or different drivers we in fact are dependent on all 
>>>>>>>>>>>>>> their critical sections i guess.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Shit, yeah the devil is a squirrel. So for A+I laptops we 
>>>>>>>>>>>>> actually need to sync that up with Daniel and the rest of 
>>>>>>>>>>>>> the i915 guys.
>>>>>>>>>>>>>
>>>>>>>>>>>>> IIRC we could actually have an amdgpu device in a docking 
>>>>>>>>>>>>> station which needs hotplug and the driver might depend on 
>>>>>>>>>>>>> waiting for the i915 driver as well.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Can't we propose a patch to make drm_unplug_srcu per 
>>>>>>>>>>>> drm_device ? I don't see why it has to be global and not 
>>>>>>>>>>>> per device thing.
>>>>>>>>>>>
>>>>>>>>>>> I'm really wondering the same thing for quite a while now.
>>>>>>>>>>>
>>>>>>>>>>> Adding Daniel as well, maybe he knows why the 
>>>>>>>>>>> drm_unplug_srcu is global.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Andrey
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>     /* Past this point no more fence are submitted 
>>>>>>>>>>>>>>>>>>>> to HW ring and hence we can safely call force 
>>>>>>>>>>>>>>>>>>>> signal on all that are currently there.
>>>>>>>>>>>>>>>>>>>>      * Any subsequently created HW fences will be 
>>>>>>>>>>>>>>>>>>>> returned signaled with an error code right away
>>>>>>>>>>>>>>>>>>>>      */
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>     for_each_ring(adev)
>>>>>>>>>>>>>>>>>>>> amdgpu_fence_process(ring)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>     drm_dev_unplug(dev);
>>>>>>>>>>>>>>>>>>>>     Stop schedulers
>>>>>>>>>>>>>>>>>>>>     cancel_sync(all timers and queued works);
>>>>>>>>>>>>>>>>>>>>     hw_fini
>>>>>>>>>>>>>>>>>>>>     unmap_mmio
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Alternatively grabbing the reset write side 
>>>>>>>>>>>>>>>>>>>>>>>>> and stopping and then restarting the scheduler 
>>>>>>>>>>>>>>>>>>>>>>>>> could work as well.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I didn't get the above and I don't see why I 
>>>>>>>>>>>>>>>>>>>>>>>> need to reuse the GPU reset rw_lock. I rely on 
>>>>>>>>>>>>>>>>>>>>>>>> the SRCU unplug flag for unplug. Also, not 
>>>>>>>>>>>>>>>>>>>>>>>> clear to me why are we focusing on the 
>>>>>>>>>>>>>>>>>>>>>>>> scheduler threads, any code patch to generate 
>>>>>>>>>>>>>>>>>>>>>>>> HW fences should be covered, so any code 
>>>>>>>>>>>>>>>>>>>>>>>> leading to amdgpu_fence_emit needs to be taken 
>>>>>>>>>>>>>>>>>>>>>>>> into account such as, direct IB submissions, VM 
>>>>>>>>>>>>>>>>>>>>>>>> flushes e.t.c
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> You need to work together with the reset lock 
>>>>>>>>>>>>>>>>>>>>>>> anyway, cause a hotplug could run at the same 
>>>>>>>>>>>>>>>>>>>>>>> time as a reset.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> For going my way indeed now I see now that I have 
>>>>>>>>>>>>>>>>>>>>>> to take reset write side lock during HW fences 
>>>>>>>>>>>>>>>>>>>>>> signalling in order to protect against 
>>>>>>>>>>>>>>>>>>>>>> scheduler/HW fences detachment and reattachment 
>>>>>>>>>>>>>>>>>>>>>> during schedulers stop/restart. But if we go with 
>>>>>>>>>>>>>>>>>>>>>> your approach then calling drm_dev_unplug and 
>>>>>>>>>>>>>>>>>>>>>> scoping amdgpu_job_timeout with 
>>>>>>>>>>>>>>>>>>>>>> drm_dev_enter/exit should be enough to prevent 
>>>>>>>>>>>>>>>>>>>>>> any concurrent GPU resets during unplug. In fact 
>>>>>>>>>>>>>>>>>>>>>> I already do it anyway - 
>>>>>>>>>>>>>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https:%2F%2Fcgit.freedesktop.org%2F~agrodzov%2Flinux%2Fcommit%2F%3Fh%3Ddrm-misc-next%26id%3Def0ea4dd29ef44d2649c5eda16c8f4869acc36b1&amp;data=04%7C01%7Candrey.grodzovsky%40amd.com%7Ca64b1f5e0df0403a656408d8ffdc7bdb%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637540669732692484%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=pLcplnlDIESV998tLO7iydxEo5lh71BjQCbAOxKif2Q%3D&amp;reserved=0 
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Yes, good point as well.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

  reply	other threads:[~2021-04-15 14:12 UTC|newest]

Thread overview: 56+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-18  7:23 [PATCH 0/4] Refine GPU recovery sequence to enhance its stability Dennis Li
2021-03-18  7:23 ` [PATCH 1/4] drm/amdgpu: remove reset lock from low level functions Dennis Li
2021-03-18  7:23 ` [PATCH 2/4] drm/amdgpu: refine the GPU recovery sequence Dennis Li
2021-03-18  7:56   ` Christian König
2021-03-18  7:23 ` [PATCH 3/4] drm/amdgpu: instead of using down/up_read directly Dennis Li
2021-03-18  7:23 ` [PATCH 4/4] drm/amdkfd: add reset lock protection for kfd entry functions Dennis Li
2021-03-18  7:53 ` [PATCH 0/4] Refine GPU recovery sequence to enhance its stability Christian König
2021-03-18  8:28   ` Li, Dennis
2021-03-18  8:58     ` AW: " Koenig, Christian
2021-03-18  9:30       ` Li, Dennis
2021-03-18  9:51         ` Christian König
2021-04-05 17:58           ` Andrey Grodzovsky
2021-04-06 10:34             ` Christian König
2021-04-06 11:21               ` Christian König
2021-04-06 21:22               ` Andrey Grodzovsky
2021-04-07 10:28                 ` Christian König
2021-04-07 19:44                   ` Andrey Grodzovsky
2021-04-08  8:22                     ` Christian König
2021-04-08  8:32                       ` Christian König
2021-04-08 16:08                         ` Andrey Grodzovsky
2021-04-08 18:58                           ` Christian König
2021-04-08 20:39                             ` Andrey Grodzovsky
2021-04-09  6:53                               ` Christian König
2021-04-09  7:01                                 ` Christian König
2021-04-09 15:42                                   ` Andrey Grodzovsky
2021-04-09 16:39                                     ` Christian König
2021-04-09 18:18                                       ` Andrey Grodzovsky
2021-04-10 17:34                                         ` Christian König
2021-04-12 17:27                                           ` Andrey Grodzovsky
2021-04-12 17:44                                             ` Christian König
2021-04-12 18:01                                               ` Andrey Grodzovsky
2021-04-12 18:05                                                 ` Christian König
2021-04-12 18:18                                                   ` Andrey Grodzovsky
2021-04-12 18:23                                                     ` Christian König
2021-04-12 19:12                                                       ` Andrey Grodzovsky
2021-04-12 19:18                                                         ` Christian König
2021-04-12 20:01                                                           ` Andrey Grodzovsky
2021-04-13  7:10                                                             ` Christian König
2021-04-13  9:13                                                               ` Li, Dennis
2021-04-13  9:14                                                                 ` Christian König
2021-04-13 20:08                                                                 ` Daniel Vetter
2021-04-13 15:12                                                               ` Andrey Grodzovsky
2021-04-13 18:03                                                                 ` Christian König
2021-04-13 18:18                                                                   ` Andrey Grodzovsky
2021-04-13 18:25                                                                     ` Christian König
2021-04-13 18:30                                                                       ` Andrey Grodzovsky
2021-04-14  7:01                                                                         ` Christian König
2021-04-14 14:36                                                                           ` Andrey Grodzovsky
2021-04-14 14:58                                                                             ` Christian König
2021-04-15  6:27                                                                               ` Andrey Grodzovsky
2021-04-15  7:02                                                                                 ` Christian König
2021-04-15 14:11                                                                                   ` Andrey Grodzovsky [this message]
2021-04-15 15:09                                                                                     ` Christian König
2021-04-13 20:07                                                               ` Daniel Vetter
2021-04-13  5:36                                                       ` Andrey Grodzovsky
2021-04-13  7:07                                                         ` Christian König

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1b35efc3-87c8-7ec3-1ad3-bb7c9ab515ba@amd.com \
    --to=andrey.grodzovsky@amd.com \
    --cc=Alexander.Deucher@amd.com \
    --cc=Dennis.Li@amd.com \
    --cc=Felix.Kuehling@amd.com \
    --cc=Hawking.Zhang@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=christian.koenig@amd.com \
    --cc=ckoenig.leichtzumerken@gmail.com \
    --cc=daniel@ffwll.ch \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).