amd-gfx.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
From: "Christian König" <christian.koenig@amd.com>
To: "Andrey Grodzovsky" <andrey.grodzovsky@amd.com>,
	"Christian König" <ckoenig.leichtzumerken@gmail.com>,
	"Li, Dennis" <Dennis.Li@amd.com>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	"Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"Kuehling, Felix" <Felix.Kuehling@amd.com>,
	"Zhang, Hawking" <Hawking.Zhang@amd.com>
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Date: Thu, 8 Apr 2021 20:58:33 +0200	[thread overview]
Message-ID: <4a1e3961-8708-46b7-bfd8-86a4ee4e2726@amd.com> (raw)
In-Reply-To: <50e0907c-52d6-1fdb-aa5e-39b1c326180c@amd.com>

Am 08.04.21 um 18:08 schrieb Andrey Grodzovsky:
> On 2021-04-08 4:32 a.m., Christian König wrote:
>> Am 08.04.21 um 10:22 schrieb Christian König:
>>> [SNIP]
>>>>>
>>>>>
>>>>>> Beyond blocking all delayed works and scheduler threads we also 
>>>>>> need to guarantee no  IOCTL can access MMIO post device unplug OR 
>>>>>> in flight IOCTLs are done before we finish pci_remove 
>>>>>> (amdgpu_pci_remove for us).
>>>>>> For this I suggest we do something like what we worked on with 
>>>>>> Takashi Iwai the ALSA maintainer recently when he helped 
>>>>>> implementing PCI BARs move support for snd_hda_intel. Take a look at
>>>>>> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1&id=cbaa324799718e2b828a8c7b5b001dd896748497 
>>>>>> and
>>>>>> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1&id=e36365d9ab5bbc30bdc221ab4b3437de34492440
>>>>>> We also had same issue there, how to prevent MMIO accesses while 
>>>>>> the BARs are migrating. What was done there is a refcount was 
>>>>>> added to count all IOCTLs in flight, for any in flight  IOCTL the 
>>>>>> BAR migration handler would
>>>>>> block for the refcount to drop to 0 before it would proceed, for 
>>>>>> any later IOCTL it stops and wait if device is in migration 
>>>>>> state. We even don't need the wait part, nothing to wait for, we 
>>>>>> just return with -ENODEV for this case.
>>>>>>
>>>>>
>>>>> This is essentially what the DRM SRCU is doing as well.
>>>>>
>>>>> For the hotplug case we could do this in the toplevel since we can 
>>>>> signal the fence and don't need to block memory management.
>>>>
>>>>
>>>> To make SRCU 'wait for' all IOCTLs in flight we would need to wrap 
>>>> every IOCTL ( practically - just drm_ioctl function) with 
>>>> drm_dev_enter/drm_dev_exit - can we do it ?
>>>>
>>
>> Sorry totally missed this question.
>>
>> Yes, exactly that. As discussed for the hotplug case we can do this.
>
>
> Thinking more about it - assuming we are  treating synchronize_srcu as 
> a 'wait for completion' of any in flight {drm_dev_enter, drm_dev_exit} 
> scope, some of those scopes might do dma_fence_wait inside. Since we 
> haven't force signaled the fences yet we will end up a deadlock. We 
> have to signal all the various fences before doing the 'wait for'. But 
> we can't signal the fences before setting 'dev->unplugged = true' to 
> reject further CS and other stuff which might create more fences we 
> were supposed-to force signal and now missed them. Effectively setting 
> 'dev->unplugged = true' and doing synchronize_srcu in one call like 
> drm_dev_unplug does without signalling all the fences in the device in 
> between these two steps looks luck a possible deadlock to me - what do 
> you think ?
>

Indeed, that is a really good argument to handle it the same way as the 
reset lock.

E.g. not taking it at the high level IOCTL, but rather when the frontend 
of the driver has acquired all the necessary locks (BO resv, VM lock 
etc...) before calling into the backend to actually do things with the 
hardware.

Christian.

> Andrey
>
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

  reply	other threads:[~2021-04-08 18:58 UTC|newest]

Thread overview: 56+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-18  7:23 [PATCH 0/4] Refine GPU recovery sequence to enhance its stability Dennis Li
2021-03-18  7:23 ` [PATCH 1/4] drm/amdgpu: remove reset lock from low level functions Dennis Li
2021-03-18  7:23 ` [PATCH 2/4] drm/amdgpu: refine the GPU recovery sequence Dennis Li
2021-03-18  7:56   ` Christian König
2021-03-18  7:23 ` [PATCH 3/4] drm/amdgpu: instead of using down/up_read directly Dennis Li
2021-03-18  7:23 ` [PATCH 4/4] drm/amdkfd: add reset lock protection for kfd entry functions Dennis Li
2021-03-18  7:53 ` [PATCH 0/4] Refine GPU recovery sequence to enhance its stability Christian König
2021-03-18  8:28   ` Li, Dennis
2021-03-18  8:58     ` AW: " Koenig, Christian
2021-03-18  9:30       ` Li, Dennis
2021-03-18  9:51         ` Christian König
2021-04-05 17:58           ` Andrey Grodzovsky
2021-04-06 10:34             ` Christian König
2021-04-06 11:21               ` Christian König
2021-04-06 21:22               ` Andrey Grodzovsky
2021-04-07 10:28                 ` Christian König
2021-04-07 19:44                   ` Andrey Grodzovsky
2021-04-08  8:22                     ` Christian König
2021-04-08  8:32                       ` Christian König
2021-04-08 16:08                         ` Andrey Grodzovsky
2021-04-08 18:58                           ` Christian König [this message]
2021-04-08 20:39                             ` Andrey Grodzovsky
2021-04-09  6:53                               ` Christian König
2021-04-09  7:01                                 ` Christian König
2021-04-09 15:42                                   ` Andrey Grodzovsky
2021-04-09 16:39                                     ` Christian König
2021-04-09 18:18                                       ` Andrey Grodzovsky
2021-04-10 17:34                                         ` Christian König
2021-04-12 17:27                                           ` Andrey Grodzovsky
2021-04-12 17:44                                             ` Christian König
2021-04-12 18:01                                               ` Andrey Grodzovsky
2021-04-12 18:05                                                 ` Christian König
2021-04-12 18:18                                                   ` Andrey Grodzovsky
2021-04-12 18:23                                                     ` Christian König
2021-04-12 19:12                                                       ` Andrey Grodzovsky
2021-04-12 19:18                                                         ` Christian König
2021-04-12 20:01                                                           ` Andrey Grodzovsky
2021-04-13  7:10                                                             ` Christian König
2021-04-13  9:13                                                               ` Li, Dennis
2021-04-13  9:14                                                                 ` Christian König
2021-04-13 20:08                                                                 ` Daniel Vetter
2021-04-13 15:12                                                               ` Andrey Grodzovsky
2021-04-13 18:03                                                                 ` Christian König
2021-04-13 18:18                                                                   ` Andrey Grodzovsky
2021-04-13 18:25                                                                     ` Christian König
2021-04-13 18:30                                                                       ` Andrey Grodzovsky
2021-04-14  7:01                                                                         ` Christian König
2021-04-14 14:36                                                                           ` Andrey Grodzovsky
2021-04-14 14:58                                                                             ` Christian König
2021-04-15  6:27                                                                               ` Andrey Grodzovsky
2021-04-15  7:02                                                                                 ` Christian König
2021-04-15 14:11                                                                                   ` Andrey Grodzovsky
2021-04-15 15:09                                                                                     ` Christian König
2021-04-13 20:07                                                               ` Daniel Vetter
2021-04-13  5:36                                                       ` Andrey Grodzovsky
2021-04-13  7:07                                                         ` Christian König

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4a1e3961-8708-46b7-bfd8-86a4ee4e2726@amd.com \
    --to=christian.koenig@amd.com \
    --cc=Alexander.Deucher@amd.com \
    --cc=Dennis.Li@amd.com \
    --cc=Felix.Kuehling@amd.com \
    --cc=Hawking.Zhang@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=andrey.grodzovsky@amd.com \
    --cc=ckoenig.leichtzumerken@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).