From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
To: "Christian König" <christian.koenig@amd.com>,
"Christian König" <ckoenig.leichtzumerken@gmail.com>,
"Li, Dennis" <Dennis.Li@amd.com>,
"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
"Deucher, Alexander" <Alexander.Deucher@amd.com>,
"Kuehling, Felix" <Felix.Kuehling@amd.com>,
"Zhang, Hawking" <Hawking.Zhang@amd.com>,
"Daniel Vetter" <daniel@ffwll.ch>
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Date: Thu, 15 Apr 2021 02:27:35 -0400
Message-ID: <bb13794d-1067-6b91-c2dc-138118c3ef0d@amd.com>
In-Reply-To: <3dab0504-b834-add7-1f2f-0237b0f0c20d@amd.com>
On 2021-04-14 10:58 a.m., Christian König wrote:
> Am 14.04.21 um 16:36 schrieb Andrey Grodzovsky:
>> [SNIP]
>>>>>
>>>>> We are racing here once more and need to handle that.
>>>>
>>>>
>>>> But why? I wrote above that we first stop all the schedulers and
>>>> only then call drm_sched_entity_kill_jobs.
>>>
>>> The schedulers consuming jobs is not the problem; we already handle
>>> that correctly.
>>>
>>> The problem is that the entities might continue feeding stuff into
>>> the scheduler.
>>
>>
>> Missed that. Ok, can I just use a non-sleeping RCU with a flag around
>> drm_sched_entity_push_job at the amdgpu level (only two functions call
>> it - amdgpu_cs_submit and amdgpu_job_submit) as a preliminary step to
>> flush in-flight submissions and block future ones to the entity queue?
>
> Double-checking the code, I think we can use the notifier_lock for this.
>
> E.g. in amdgpu_cs.c see where we have the goto error_abort.
>
> That is the place where such a check could be added without any
> additional overhead.
Sure, I will just have to add this lock to amdgpu_job_submit too.
>
> Christian.
>
>>
>>
>>>
>>>>>
>>>>>>>
>>>>>>> For waiting for other device I have no idea if that couldn't
>>>>>>> deadlock somehow.
>>>>>>
>>>>>>
>>>>>> Yeah, not sure about imported fences and dma_bufs; I would assume
>>>>>> the other devices should not be impacted by our device's removal,
>>>>>> but who knows...
>>>>>>
>>>>>> So I guess we are NOT going with finalizing HW fences before
>>>>>> drm_dev_unplug and instead will just call drm_dev_enter/exit at
>>>>>> the back-ends all over the place where there are MMIO accesses?
>>>>>
>>>>> Good question. As you said, that is really the hard path.
>>>>>
>>>>> Handling it all at once at IOCTL level certainly has some appeal
>>>>> as well, but I have no idea if we can guarantee that this is lock
>>>>> free.
>>>>
>>>>
>>>> Maybe just empirically - let's try it and see what actually happens
>>>> under different test scenarios?
>>>
>>> Not a good idea in general, we have that approach way too often at
>>> AMD and are then surprised that everything works in QA but fails in
>>> production.
>>>
>>> But Daniel already noted in his reply that waiting for a fence while
>>> holding the SRCU is expected to work.
>>>
>>> So let's stick with the approach of high level locking for hotplug.
>>
>>
>> To my understanding this is true for the other devices, not the one
>> being extracted; for that one you still need to do all the HW fence
>> signalling dance, because the HW is gone and we block any TDRs (which
>> won't help anyway).
>>
>> Andrey
Do you agree with the above?
Andrey
>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> BTW: Could it be that the device SRCU protects more than
>>>>>>>>>>>>> one device and we deadlock because of this?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I haven't actually experienced any deadlock until now but,
>>>>>>>>>>>> yes, drm_unplug_srcu is defined as static in drm_drv.c and
>>>>>>>>>>>> so, in the presence of multiple devices from the same or
>>>>>>>>>>>> different drivers, we are in fact dependent on all their
>>>>>>>>>>>> critical sections, I guess.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Shit, yeah the devil is a squirrel. So for A+I laptops we
>>>>>>>>>>> actually need to sync that up with Daniel and the rest of
>>>>>>>>>>> the i915 guys.
>>>>>>>>>>>
>>>>>>>>>>> IIRC we could actually have an amdgpu device in a docking
>>>>>>>>>>> station which needs hotplug and the driver might depend on
>>>>>>>>>>> waiting for the i915 driver as well.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Can't we propose a patch to make drm_unplug_srcu per
>>>>>>>>>> drm_device? I don't see why it has to be global rather than a
>>>>>>>>>> per-device thing.
>>>>>>>>>
>>>>>>>>> I've been wondering the same thing for quite a while now.
>>>>>>>>>
>>>>>>>>> Adding Daniel as well, maybe he knows why the drm_unplug_srcu
>>>>>>>>> is global.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>> /* Past this point no more fences are submitted to
>>>>>>>>>>>>>>>>>>  * the HW ring and hence we can safely force-signal
>>>>>>>>>>>>>>>>>>  * all that are currently there.
>>>>>>>>>>>>>>>>>>  * Any subsequently created HW fences will be
>>>>>>>>>>>>>>>>>>  * returned signaled with an error code right away.
>>>>>>>>>>>>>>>>>>  */
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> for_each_ring(adev)
>>>>>>>>>>>>>>>>>>     amdgpu_fence_process(ring);
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> drm_dev_unplug(dev);
>>>>>>>>>>>>>>>>>> /* Stop schedulers */
>>>>>>>>>>>>>>>>>> cancel_sync(all timers and queued works);
>>>>>>>>>>>>>>>>>> hw_fini();
>>>>>>>>>>>>>>>>>> unmap_mmio();
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Alternatively, grabbing the write side of the
>>>>>>>>>>>>>>>>>>>>>>> reset lock and stopping and then restarting the
>>>>>>>>>>>>>>>>>>>>>>> scheduler could work as well.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I didn't get the above, and I don't see why I
>>>>>>>>>>>>>>>>>>>>>> need to reuse the GPU reset rw_lock; I rely on
>>>>>>>>>>>>>>>>>>>>>> the SRCU unplug flag for unplug. Also, it's not
>>>>>>>>>>>>>>>>>>>>>> clear to me why we are focusing on the scheduler
>>>>>>>>>>>>>>>>>>>>>> threads; any code path that generates HW fences
>>>>>>>>>>>>>>>>>>>>>> should be covered, so any code leading to
>>>>>>>>>>>>>>>>>>>>>> amdgpu_fence_emit needs to be taken into account,
>>>>>>>>>>>>>>>>>>>>>> such as direct IB submissions, VM flushes, etc.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> You need to work together with the reset lock
>>>>>>>>>>>>>>>>>>>>> anyway, because a hotplug could run at the same
>>>>>>>>>>>>>>>>>>>>> time as a reset.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Going my way, I indeed now see that I have to
>>>>>>>>>>>>>>>>>>>> take the reset write-side lock during HW fence
>>>>>>>>>>>>>>>>>>>> signalling in order to protect against scheduler/HW
>>>>>>>>>>>>>>>>>>>> fence detachment and reattachment during the
>>>>>>>>>>>>>>>>>>>> schedulers' stop/restart. But if we go with your
>>>>>>>>>>>>>>>>>>>> approach, then calling drm_dev_unplug and scoping
>>>>>>>>>>>>>>>>>>>> amdgpu_job_timeout with drm_dev_enter/exit should
>>>>>>>>>>>>>>>>>>>> be enough to prevent any concurrent GPU resets
>>>>>>>>>>>>>>>>>>>> during unplug. In fact I already do it anyway -
>>>>>>>>>>>>>>>>>>>> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=ef0ea4dd29ef44d2649c5eda16c8f4869acc36b1
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Yes, good point as well.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Thread overview: 56+ messages
2021-03-18 7:23 [PATCH 0/4] Refine GPU recovery sequence to enhance its stability Dennis Li
2021-03-18 7:23 ` [PATCH 1/4] drm/amdgpu: remove reset lock from low level functions Dennis Li
2021-03-18 7:23 ` [PATCH 2/4] drm/amdgpu: refine the GPU recovery sequence Dennis Li
2021-03-18 7:56 ` Christian König
2021-03-18 7:23 ` [PATCH 3/4] drm/amdgpu: instead of using down/up_read directly Dennis Li
2021-03-18 7:23 ` [PATCH 4/4] drm/amdkfd: add reset lock protection for kfd entry functions Dennis Li
2021-03-18 7:53 ` [PATCH 0/4] Refine GPU recovery sequence to enhance its stability Christian König
2021-03-18 8:28 ` Li, Dennis
2021-03-18 8:58 ` AW: " Koenig, Christian
2021-03-18 9:30 ` Li, Dennis
2021-03-18 9:51 ` Christian König
2021-04-05 17:58 ` Andrey Grodzovsky
2021-04-06 10:34 ` Christian König
2021-04-06 11:21 ` Christian König
2021-04-06 21:22 ` Andrey Grodzovsky
2021-04-07 10:28 ` Christian König
2021-04-07 19:44 ` Andrey Grodzovsky
2021-04-08 8:22 ` Christian König
2021-04-08 8:32 ` Christian König
2021-04-08 16:08 ` Andrey Grodzovsky
2021-04-08 18:58 ` Christian König
2021-04-08 20:39 ` Andrey Grodzovsky
2021-04-09 6:53 ` Christian König
2021-04-09 7:01 ` Christian König
2021-04-09 15:42 ` Andrey Grodzovsky
2021-04-09 16:39 ` Christian König
2021-04-09 18:18 ` Andrey Grodzovsky
2021-04-10 17:34 ` Christian König
2021-04-12 17:27 ` Andrey Grodzovsky
2021-04-12 17:44 ` Christian König
2021-04-12 18:01 ` Andrey Grodzovsky
2021-04-12 18:05 ` Christian König
2021-04-12 18:18 ` Andrey Grodzovsky
2021-04-12 18:23 ` Christian König
2021-04-12 19:12 ` Andrey Grodzovsky
2021-04-12 19:18 ` Christian König
2021-04-12 20:01 ` Andrey Grodzovsky
2021-04-13 7:10 ` Christian König
2021-04-13 9:13 ` Li, Dennis
2021-04-13 9:14 ` Christian König
2021-04-13 20:08 ` Daniel Vetter
2021-04-13 15:12 ` Andrey Grodzovsky
2021-04-13 18:03 ` Christian König
2021-04-13 18:18 ` Andrey Grodzovsky
2021-04-13 18:25 ` Christian König
2021-04-13 18:30 ` Andrey Grodzovsky
2021-04-14 7:01 ` Christian König
2021-04-14 14:36 ` Andrey Grodzovsky
2021-04-14 14:58 ` Christian König
2021-04-15 6:27 ` Andrey Grodzovsky [this message]
2021-04-15 7:02 ` Christian König
2021-04-15 14:11 ` Andrey Grodzovsky
2021-04-15 15:09 ` Christian König
2021-04-13 20:07 ` Daniel Vetter
2021-04-13 5:36 ` Andrey Grodzovsky
2021-04-13 7:07 ` Christian König