From: "Li, Dennis" <Dennis.Li@amd.com>
To: "Christian König" <ckoenig.leichtzumerken@gmail.com>,
"Grodzovsky, Andrey" <Andrey.Grodzovsky@amd.com>,
"Koenig, Christian" <Christian.Koenig@amd.com>,
"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
"Deucher, Alexander" <Alexander.Deucher@amd.com>,
"Kuehling, Felix" <Felix.Kuehling@amd.com>,
"Zhang, Hawking" <Hawking.Zhang@amd.com>,
"Daniel Vetter" <daniel@ffwll.ch>
Subject: RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Date: Tue, 13 Apr 2021 09:13:45 +0000 [thread overview]
Message-ID: <DM5PR12MB25334B94CB060BD4C3EA4223ED4F9@DM5PR12MB2533.namprd12.prod.outlook.com> (raw)
In-Reply-To: <9b9e84ca-b259-b683-ef9d-1f9be95e1734@gmail.com>
Hi, Christian and Andrey,
We could perhaps implement the "wait" callback of dma_fence_ops so that, when a GPU reset or unplug happens, it returns -ENODEV to notify the caller that the device is lost. The relevant kernel doc comment for the wait callback reads:
    /*
     * Must return -ERESTARTSYS if the wait is intr = true and the wait was
     * interrupted, and remaining jiffies if fence has signaled, or 0 if wait
     * timed out. Can also return other error values on custom implementations,
     * which should be treated as if the fence is signaled. For example a hardware
     * lockup could be reported like that.
     *
     * This callback is optional.
     */
    signed long (*wait)(struct dma_fence *fence,
                        bool intr, signed long timeout);
Best Regards
Dennis Li
-----Original Message-----
From: Christian König <ckoenig.leichtzumerken@gmail.com>
Sent: Tuesday, April 13, 2021 3:10 PM
To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Li, Dennis <Dennis.Li@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Daniel Vetter <daniel@ffwll.ch>
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:
>
> On 2021-04-12 3:18 p.m., Christian König wrote:
>> Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:
>>> [SNIP]
>>>>>
>>>>> So what's the right approach? How do we guarantee that, when
>>>>> running amdgpu_fence_driver_force_completion, we signal all the HW
>>>>> fences without racing against more fences being inserted into that
>>>>> array?
>>>>>
>>>>
>>>> Well I would still say the best approach would be to insert this
>>>> between the front end and the backend and not rely on signaling
>>>> fences while holding the device srcu.
>>>
>>>
>>> My question is: even now, when we run
>>> amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or
>>> amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion,
>>> what prevents a race with another fence being emitted and inserted
>>> into the fence array at the same time? Looks like nothing.
>>>
>>
>> Each ring can only be used by one thread at a time; this includes
>> emitting fences as well as other operations.
>>
>> During GPU reset we make sure nobody writes to the rings by stopping
>> the scheduler and taking the GPU reset lock (so that nobody else can
>> start the scheduler again).
>
>
> What about direct submissions that bypass the scheduler, such as
> amdgpu_job_submit_direct? I don't see how those are protected.
Those only happen during startup and GPU reset.
>>
>>>>
>>>> BTW: Could it be that the device SRCU protects more than one device
>>>> and we deadlock because of this?
>>>
>>>
>>> I haven't actually experienced any deadlock until now, but yes,
>>> drm_unplug_srcu is defined as static in drm_drv.c, so in the
>>> presence of multiple devices from the same or different drivers we
>>> are in fact dependent on all their critical sections, I guess.
>>>
>>
>> Shit, yeah the devil is a squirrel. So for A+I laptops we actually
>> need to sync that up with Daniel and the rest of the i915 guys.
>>
>> IIRC we could actually have an amdgpu device in a docking station
>> which needs hotplug and the driver might depend on waiting for the
>> i915 driver as well.
>
>
> Can't we propose a patch to make drm_unplug_srcu per drm_device? I
> don't see why it has to be global rather than a per-device thing.
I've really been wondering the same thing for quite a while now.
Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global.
Regards,
Christian.
>
> Andrey
>
>
>>
>> Christian.
>>
>>> Andrey
>>>
>>>
>>>>
>>>> Christian.
>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>> /* Past this point no more fences are submitted to the HW
>>>>>>>>> rings and hence we can safely force-signal all that are
>>>>>>>>> currently there.
>>>>>>>>> * Any subsequently created HW fences will be returned
>>>>>>>>> signaled with an error code right away
>>>>>>>>> */
>>>>>>>>>
>>>>>>>>> for_each_ring(adev)
>>>>>>>>> amdgpu_fence_process(ring)
>>>>>>>>>
>>>>>>>>> drm_dev_unplug(dev);
>>>>>>>>> Stop schedulers
>>>>>>>>> cancel_sync(all timers and queued works);
>>>>>>>>> hw_fini
>>>>>>>>> unmap_mmio
>>>>>>>>>
>>>>>>>>> }
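The teardown ordering quoted above could be sketched in C-flavored pseudocode roughly as follows (illustrative only, not a tested patch; amdgpu_pci_remove_sketch is a hypothetical name, while amdgpu_fence_process and drm_dev_unplug are real functions):

```
/* Sketch of the proposed removal sequence. Past the force-signal loop,
 * no more fences reach the HW rings, so signaling everything currently
 * there is safe; error handling and locking are omitted. */
static void amdgpu_pci_remove_sketch(struct drm_device *ddev)
{
        struct amdgpu_device *adev = drm_to_adev(ddev);
        int i;

        /* Force-signal all HW fences currently on the rings. */
        for (i = 0; i < AMDGPU_MAX_RINGS; ++i)
                if (adev->rings[i])
                        amdgpu_fence_process(adev->rings[i]);

        drm_dev_unplug(ddev);   /* new entry points now bail with -ENODEV */
        /* stop schedulers */
        /* cancel_sync(all timers and queued works) */
        /* hw_fini */
        /* unmap_mmio */
}
```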
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Andrey
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Alternatively grabbing the reset write side and stopping
>>>>>>>>>>>>>> and then restarting the scheduler could work as well.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I didn't get the above, and I don't see why I need to
>>>>>>>>>>>>> reuse the GPU reset rw_lock. I rely on the SRCU unplug
>>>>>>>>>>>>> flag for unplug. Also, it's not clear to me why we are
>>>>>>>>>>>>> focusing on the scheduler threads; any code path that
>>>>>>>>>>>>> generates HW fences should be covered, so any code leading
>>>>>>>>>>>>> to amdgpu_fence_emit needs to be taken into account, such
>>>>>>>>>>>>> as direct IB submissions, VM flushes, etc.
>>>>>>>>>>>>
>>>>>>>>>>>> You need to work together with the reset lock anyway,
>>>>>>>>>>>> because a hotplug could run at the same time as a reset.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Going my way, I indeed now see that I have to take the
>>>>>>>>>>> reset write side lock during HW fence signalling in order
>>>>>>>>>>> to protect against scheduler/HW fence detachment and
>>>>>>>>>>> reattachment during scheduler stop/restart. But if we go
>>>>>>>>>>> with your approach, then calling drm_dev_unplug and scoping
>>>>>>>>>>> amdgpu_job_timeout with drm_dev_enter/exit should be enough
>>>>>>>>>>> to prevent any concurrent GPU resets during unplug. In fact
>>>>>>>>>>> I already do it anyway -
>>>>>>>>>>> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=ef0ea4dd29ef44d2649c5eda16c8f4869acc36b1
>>>>>>>>>>
>>>>>>>>>> Yes, good point as well.
>>>>>>>>>>
>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Andrey
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Christian.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Thread overview: 56+ messages
2021-03-18 7:23 [PATCH 0/4] Refine GPU recovery sequence to enhance its stability Dennis Li
2021-03-18 7:23 ` [PATCH 1/4] drm/amdgpu: remove reset lock from low level functions Dennis Li
2021-03-18 7:23 ` [PATCH 2/4] drm/amdgpu: refine the GPU recovery sequence Dennis Li
2021-03-18 7:56 ` Christian König
2021-03-18 7:23 ` [PATCH 3/4] drm/amdgpu: instead of using down/up_read directly Dennis Li
2021-03-18 7:23 ` [PATCH 4/4] drm/amdkfd: add reset lock protection for kfd entry functions Dennis Li
2021-03-18 7:53 ` [PATCH 0/4] Refine GPU recovery sequence to enhance its stability Christian König
2021-03-18 8:28 ` Li, Dennis
2021-03-18 8:58 ` AW: " Koenig, Christian
2021-03-18 9:30 ` Li, Dennis
2021-03-18 9:51 ` Christian König
2021-04-05 17:58 ` Andrey Grodzovsky
2021-04-06 10:34 ` Christian König
2021-04-06 11:21 ` Christian König
2021-04-06 21:22 ` Andrey Grodzovsky
2021-04-07 10:28 ` Christian König
2021-04-07 19:44 ` Andrey Grodzovsky
2021-04-08 8:22 ` Christian König
2021-04-08 8:32 ` Christian König
2021-04-08 16:08 ` Andrey Grodzovsky
2021-04-08 18:58 ` Christian König
2021-04-08 20:39 ` Andrey Grodzovsky
2021-04-09 6:53 ` Christian König
2021-04-09 7:01 ` Christian König
2021-04-09 15:42 ` Andrey Grodzovsky
2021-04-09 16:39 ` Christian König
2021-04-09 18:18 ` Andrey Grodzovsky
2021-04-10 17:34 ` Christian König
2021-04-12 17:27 ` Andrey Grodzovsky
2021-04-12 17:44 ` Christian König
2021-04-12 18:01 ` Andrey Grodzovsky
2021-04-12 18:05 ` Christian König
2021-04-12 18:18 ` Andrey Grodzovsky
2021-04-12 18:23 ` Christian König
2021-04-12 19:12 ` Andrey Grodzovsky
2021-04-12 19:18 ` Christian König
2021-04-12 20:01 ` Andrey Grodzovsky
2021-04-13 7:10 ` Christian König
2021-04-13 9:13 ` Li, Dennis [this message]
2021-04-13 9:14 ` Christian König
2021-04-13 20:08 ` Daniel Vetter
2021-04-13 15:12 ` Andrey Grodzovsky
2021-04-13 18:03 ` Christian König
2021-04-13 18:18 ` Andrey Grodzovsky
2021-04-13 18:25 ` Christian König
2021-04-13 18:30 ` Andrey Grodzovsky
2021-04-14 7:01 ` Christian König
2021-04-14 14:36 ` Andrey Grodzovsky
2021-04-14 14:58 ` Christian König
2021-04-15 6:27 ` Andrey Grodzovsky
2021-04-15 7:02 ` Christian König
2021-04-15 14:11 ` Andrey Grodzovsky
2021-04-15 15:09 ` Christian König
2021-04-13 20:07 ` Daniel Vetter
2021-04-13 5:36 ` Andrey Grodzovsky
2021-04-13 7:07 ` Christian König