From: "Christian König" <christian.koenig@amd.com>
To: "Andrey Grodzovsky" <andrey.grodzovsky@amd.com>,
"Christian König" <ckoenig.leichtzumerken@gmail.com>,
"Li, Dennis" <Dennis.Li@amd.com>,
"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
"Deucher, Alexander" <Alexander.Deucher@amd.com>,
"Kuehling, Felix" <Felix.Kuehling@amd.com>,
"Zhang, Hawking" <Hawking.Zhang@amd.com>,
"Daniel Vetter" <daniel@ffwll.ch>
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Date: Thu, 15 Apr 2021 17:09:01 +0200 [thread overview]
Message-ID: <c96838ae-9fcd-ce32-2bf7-a0324016ebb8@amd.com> (raw)
In-Reply-To: <1b35efc3-87c8-7ec3-1ad3-bb7c9ab515ba@amd.com>
>>>
>>>> [SNIP]
>>>>>>>
>>>>>>> Maybe just empirically - let's try it and see under different
>>>>>>> test scenarios what actually happens ?
>>>>>>
>>>>>> Not a good idea in general, we have that approach way to often at
>>>>>> AMD and are then surprised that everything works in QA but fails
>>>>>> in production.
>>>>>>
>>>>>> But Daniel already noted in his reply that waiting for a fence
>>>>>> while holding the SRCU is expected to work.
>>>>>>
>>>>>> So let's stick with the approach of high level locking for hotplug.
>>>>>
>>>>>
>>>>> To my understanding this is true for other devises, not the one
>>>>> being extracted, for him you still need to do all the HW fence
>>>>> signalling dance because the HW is gone and we block any TDRs
>>>>> (which won't help anyway).
>>>>>
>>>>> Andrey
>>>
>>>
>>> Do you agree to the above ?
>>
>> Yeah, I think that is correct.
>>
>> But on the other hand what Daniel reminded me of is that the handling
>> needs to be consistent over different devices. And since some device
>> already go with the approach of canceling everything we simply have
>> to go down that route as well.
>>
>> Christian.
>
>
> What does it mean in our context ? What needs to be done which we are
> not doing now ?
I think we are fine, we just need to continue with the approach of
forcefully signaling all fences on hotplug.
Christian.
>
> Andrey
>
>
>>
>>>
>>> Andrey
>>>
>>>
>>>>>
>>>>>
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Andrey
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Andrey
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> BTW: Could it be that the device SRCU protects more
>>>>>>>>>>>>>>>> than one device and we deadlock because of this?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I haven't actually experienced any deadlock until now
>>>>>>>>>>>>>>> but, yes, drm_unplug_srcu is defined as static in
>>>>>>>>>>>>>>> drm_drv.c and so in the presence of multiple devices
>>>>>>>>>>>>>>> from same or different drivers we in fact are dependent
>>>>>>>>>>>>>>> on all their critical sections i guess.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Shit, yeah the devil is a squirrel. So for A+I laptops we
>>>>>>>>>>>>>> actually need to sync that up with Daniel and the rest of
>>>>>>>>>>>>>> the i915 guys.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> IIRC we could actually have an amdgpu device in a docking
>>>>>>>>>>>>>> station which needs hotplug and the driver might depend
>>>>>>>>>>>>>> on waiting for the i915 driver as well.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can't we propose a patch to make drm_unplug_srcu per
>>>>>>>>>>>>> drm_device ? I don't see why it has to be global and not
>>>>>>>>>>>>> per device thing.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm really wondering the same thing for quite a while now.
>>>>>>>>>>>>
>>>>>>>>>>>> Adding Daniel as well, maybe he knows why the
>>>>>>>>>>>> drm_unplug_srcu is global.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Christian.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> /* Past this point no more fence are submitted to
>>>>>>>>>>>>>>>>>>>>> HW ring and hence we can safely call force signal
>>>>>>>>>>>>>>>>>>>>> on all that are currently there.
>>>>>>>>>>>>>>>>>>>>> * Any subsequently created HW fences will be
>>>>>>>>>>>>>>>>>>>>> returned signaled with an error code right away
>>>>>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> for_each_ring(adev)
>>>>>>>>>>>>>>>>>>>>> amdgpu_fence_process(ring)
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> drm_dev_unplug(dev);
>>>>>>>>>>>>>>>>>>>>> Stop schedulers
>>>>>>>>>>>>>>>>>>>>> cancel_sync(all timers and queued works);
>>>>>>>>>>>>>>>>>>>>> hw_fini
>>>>>>>>>>>>>>>>>>>>> unmap_mmio
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Alternatively grabbing the reset write side
>>>>>>>>>>>>>>>>>>>>>>>>>> and stopping and then restarting the
>>>>>>>>>>>>>>>>>>>>>>>>>> scheduler could work as well.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I didn't get the above and I don't see why I
>>>>>>>>>>>>>>>>>>>>>>>>> need to reuse the GPU reset rw_lock. I rely on
>>>>>>>>>>>>>>>>>>>>>>>>> the SRCU unplug flag for unplug. Also, not
>>>>>>>>>>>>>>>>>>>>>>>>> clear to me why are we focusing on the
>>>>>>>>>>>>>>>>>>>>>>>>> scheduler threads, any code patch to generate
>>>>>>>>>>>>>>>>>>>>>>>>> HW fences should be covered, so any code
>>>>>>>>>>>>>>>>>>>>>>>>> leading to amdgpu_fence_emit needs to be taken
>>>>>>>>>>>>>>>>>>>>>>>>> into account such as, direct IB submissions,
>>>>>>>>>>>>>>>>>>>>>>>>> VM flushes e.t.c
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> You need to work together with the reset lock
>>>>>>>>>>>>>>>>>>>>>>>> anyway, cause a hotplug could run at the same
>>>>>>>>>>>>>>>>>>>>>>>> time as a reset.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> For going my way indeed now I see now that I
>>>>>>>>>>>>>>>>>>>>>>> have to take reset write side lock during HW
>>>>>>>>>>>>>>>>>>>>>>> fences signalling in order to protect against
>>>>>>>>>>>>>>>>>>>>>>> scheduler/HW fences detachment and reattachment
>>>>>>>>>>>>>>>>>>>>>>> during schedulers stop/restart. But if we go
>>>>>>>>>>>>>>>>>>>>>>> with your approach then calling drm_dev_unplug
>>>>>>>>>>>>>>>>>>>>>>> and scoping amdgpu_job_timeout with
>>>>>>>>>>>>>>>>>>>>>>> drm_dev_enter/exit should be enough to prevent
>>>>>>>>>>>>>>>>>>>>>>> any concurrent GPU resets during unplug. In fact
>>>>>>>>>>>>>>>>>>>>>>> I already do it anyway -
>>>>>>>>>>>>>>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https:%2F%2Fcgit.freedesktop.org%2F~agrodzov%2Flinux%2Fcommit%2F%3Fh%3Ddrm-misc-next%26id%3Def0ea4dd29ef44d2649c5eda16c8f4869acc36b1&data=04%7C01%7Candrey.grodzovsky%40amd.com%7Ca64b1f5e0df0403a656408d8ffdc7bdb%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637540669732692484%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=pLcplnlDIESV998tLO7iydxEo5lh71BjQCbAOxKif2Q%3D&reserved=0
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Yes, good point as well.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
next prev parent reply other threads:[~2021-04-15 15:09 UTC|newest]
Thread overview: 56+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-03-18 7:23 [PATCH 0/4] Refine GPU recovery sequence to enhance its stability Dennis Li
2021-03-18 7:23 ` [PATCH 1/4] drm/amdgpu: remove reset lock from low level functions Dennis Li
2021-03-18 7:23 ` [PATCH 2/4] drm/amdgpu: refine the GPU recovery sequence Dennis Li
2021-03-18 7:56 ` Christian König
2021-03-18 7:23 ` [PATCH 3/4] drm/amdgpu: instead of using down/up_read directly Dennis Li
2021-03-18 7:23 ` [PATCH 4/4] drm/amdkfd: add reset lock protection for kfd entry functions Dennis Li
2021-03-18 7:53 ` [PATCH 0/4] Refine GPU recovery sequence to enhance its stability Christian König
2021-03-18 8:28 ` Li, Dennis
2021-03-18 8:58 ` AW: " Koenig, Christian
2021-03-18 9:30 ` Li, Dennis
2021-03-18 9:51 ` Christian König
2021-04-05 17:58 ` Andrey Grodzovsky
2021-04-06 10:34 ` Christian König
2021-04-06 11:21 ` Christian König
2021-04-06 21:22 ` Andrey Grodzovsky
2021-04-07 10:28 ` Christian König
2021-04-07 19:44 ` Andrey Grodzovsky
2021-04-08 8:22 ` Christian König
2021-04-08 8:32 ` Christian König
2021-04-08 16:08 ` Andrey Grodzovsky
2021-04-08 18:58 ` Christian König
2021-04-08 20:39 ` Andrey Grodzovsky
2021-04-09 6:53 ` Christian König
2021-04-09 7:01 ` Christian König
2021-04-09 15:42 ` Andrey Grodzovsky
2021-04-09 16:39 ` Christian König
2021-04-09 18:18 ` Andrey Grodzovsky
2021-04-10 17:34 ` Christian König
2021-04-12 17:27 ` Andrey Grodzovsky
2021-04-12 17:44 ` Christian König
2021-04-12 18:01 ` Andrey Grodzovsky
2021-04-12 18:05 ` Christian König
2021-04-12 18:18 ` Andrey Grodzovsky
2021-04-12 18:23 ` Christian König
2021-04-12 19:12 ` Andrey Grodzovsky
2021-04-12 19:18 ` Christian König
2021-04-12 20:01 ` Andrey Grodzovsky
2021-04-13 7:10 ` Christian König
2021-04-13 9:13 ` Li, Dennis
2021-04-13 9:14 ` Christian König
2021-04-13 20:08 ` Daniel Vetter
2021-04-13 15:12 ` Andrey Grodzovsky
2021-04-13 18:03 ` Christian König
2021-04-13 18:18 ` Andrey Grodzovsky
2021-04-13 18:25 ` Christian König
2021-04-13 18:30 ` Andrey Grodzovsky
2021-04-14 7:01 ` Christian König
2021-04-14 14:36 ` Andrey Grodzovsky
2021-04-14 14:58 ` Christian König
2021-04-15 6:27 ` Andrey Grodzovsky
2021-04-15 7:02 ` Christian König
2021-04-15 14:11 ` Andrey Grodzovsky
2021-04-15 15:09 ` Christian König [this message]
2021-04-13 20:07 ` Daniel Vetter
2021-04-13 5:36 ` Andrey Grodzovsky
2021-04-13 7:07 ` Christian König
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=c96838ae-9fcd-ce32-2bf7-a0324016ebb8@amd.com \
--to=christian.koenig@amd.com \
--cc=Alexander.Deucher@amd.com \
--cc=Dennis.Li@amd.com \
--cc=Felix.Kuehling@amd.com \
--cc=Hawking.Zhang@amd.com \
--cc=amd-gfx@lists.freedesktop.org \
--cc=andrey.grodzovsky@amd.com \
--cc=ckoenig.leichtzumerken@gmail.com \
--cc=daniel@ffwll.ch \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).