dri-devel.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
To: Daniel Vetter <daniel@ffwll.ch>
Cc: ckoenig.leichtzumerken@gmail.com, gregkh@linuxfoundation.org,
	daniel.vetter@ffwll.ch, Felix.Kuehling@amd.com,
	amd-gfx@lists.freedesktop.org, helgaas@kernel.org,
	dri-devel@lists.freedesktop.org, linux-pci@vger.kernel.org,
	Alexander.Deucher@amd.com
Subject: Re: [PATCH v5 20/27] drm: Scope all DRM IOCTLs with drm_dev_enter/exit
Date: Wed, 5 May 2021 09:57:28 -0400	[thread overview]
Message-ID: <de7ecf08-b2e5-48de-710a-217b4bfde6ca@amd.com> (raw)
In-Reply-To: <b6d0c32c-cf90-6118-5c60-238b6f4a0aaa@amd.com>

Ping

Andrey

On 2021-04-30 1:27 p.m., Andrey Grodzovsky wrote:
> 
> 
> On 2021-04-30 6:25 a.m., Daniel Vetter wrote:
>> On Thu, Apr 29, 2021 at 04:34:55PM -0400, Andrey Grodzovsky wrote:
>>>
>>>
>>> On 2021-04-29 3:05 p.m., Daniel Vetter wrote:
>>>> On Thu, Apr 29, 2021 at 12:04:33PM -0400, Andrey Grodzovsky wrote:
>>>>>
>>>>>
>>>>> On 2021-04-29 7:32 a.m., Daniel Vetter wrote:
>>>>>> On Thu, Apr 29, 2021 at 01:23:19PM +0200, Daniel Vetter wrote:
>>>>>>> On Wed, Apr 28, 2021 at 11:12:00AM -0400, Andrey Grodzovsky wrote:
>>>>>>>> With this calling drm_dev_unplug will flush and block
>>>>>>>> all in flight IOCTLs
>>>>>>>>
>>>>>>>> Also, add feature such that if device supports graceful unplug
>>>>>>>> we enclose entire IOCTL in SRCU critical section.
>>>>>>>>
>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>>>>>>
>>>>>>> Nope.
>>>>>>>
>>>>>>> The idea of drm_dev_enter/exit is to mark up hw access. Not 
>>>>>>> entire ioctl.
>>>>>
>>>>> Then I am confused why we have 
>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Felixir.bootlin.com%2Flinux%2Fv5.12%2Fsource%2Fdrivers%2Fgpu%2Fdrm%2Fdrm_ioctl.c%23L826&amp;data=04%7C01%7Candrey.grodzovsky%40amd.com%7Cf4c0568093cc462f625808d90bc23a3c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637553751106596888%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=PPKrQYBrgRMjpwlL0r8n5zenIhQMFWc6gniHgUTxTAY%3D&amp;reserved=0 
>>>>>
>>>>> currently in code ?
>>>>
>>>> I forgot about this one, again. Thanks for reminding.
>>>>
>>>>>>> Especially not with an opt-in flag so that it could be shrugged 
>>>>>>> of as a
>>>>>>> driver hack. Most of these ioctls should have absolutely no problem
>>>>>>> working after hotunplug.
>>>>>>>
>>>>>>> Also, doing this defeats the point since it pretty much guarantees
>>>>>>> userspace will die in assert()s and stuff. E.g. on i915 the rough 
>>>>>>> contract
>>>>>>> is that only execbuf (and even that only when userspace has 
>>>>>>> indicated
>>>>>>> support for non-recoverable hw ctx) is allowed to fail. Anything 
>>>>>>> else
>>>>>>> might crash userspace.
>>>>>
>>>>> Given that as I pointed above we already fail any IOCTls with -ENODEV
>>>>> when device is unplugged, it seems those crashes don't happen that
>>>>> often ? Also, in all my testing I don't think I saw a user space crash
>>>>> I could attribute to this.
>>>>
>>>> I guess it should be ok.
>>>
>>> What should be ok ?
>>
>> Your approach, but not your patch. If we go with this let's just lift it
>> to drm_ioctl() as the default behavior. No driver opt-in flag, because
>> that's definitely worse than any other approach because we really need to
>> get rid of driver specific behaviour for generic ioctls, especially
>> anything a compositor will use directly.
>>
>>>> My reasons for making this work is both less trouble for userspace (did
>>>> you test with various wayland compositors out there, not just amdgpu 
>>>> x86
>>>
>>> I didn't - will give it a try.
> 
> Weston worked without crashes, run the egl tester cube there.
> 
>>>
>>>> driver?), but also testing.
>>>>
>>>> We still need a bunch of these checks in various places or you'll 
>>>> wait a
>>>> very long time for a pending modeset or similar to complete. Being 
>>>> able to
>>>> run that code easily after hotunplug has completed should help a lot 
>>>> with
>>>> testing.
>>>>
>>>> Plus various drivers already acquired drm_dev_enter/exit and now I 
>>>> wonder
>>>> whether that was properly tested or not ...
>>>>
>>>> I guess maybe we need a drm module option to disable this check, so 
>>>> that
>>>> we can exercise the code as if the ioctl has raced with hotunplug at 
>>>> the
>>>> worst possible moment.
>>>>
>>>> Also atomic is really tricky here: I assume your testing has just done
>>>> normal synchronous commits, but anything that goes through atomic 
>>>> can be
>>>> done nonblocking in a separate thread. Which the ioctl catch-all 
>>>> here wont
>>>> capture.
>>>
>>> Yes, async commit was on my mind and thanks for reminding me. Indeed
>>> I forgot this but i planned to scope the entire amdgpu_dm_atomic_tail in
>>> drm_dev_enter/exit. Note that i have a bunch of patches, all name's
>>> starting with 'Scope....' that just methodically put all the background
>>> work items and timers the drivers schedules in drm_dev_enter/exit scope.
>>> This was supposed to be part of the 'Scope Display code' patch.
>>
>> That's too much. You still have to arrange that the flip completion event
>> gets sent out. So it's a bit tricky.
>>
>> In other places the same problem applies, e.g. probe functions need to
>> make sure they report "disconnected".
> 
> I see, well, this is all part of KMS support which I defer for now
> anyway. Will tackle it then.
> 
>>
>>>>>>> You probably need similar (and very precisely defined) rules for 
>>>>>>> amdgpu.
>>>>>>> And those must definitely exclude any shard ioctls from randomly 
>>>>>>> failing
>>>>>>> with EIO, because that just kills the box and defeats the point 
>>>>>>> of trying
>>>>>>> to gracefully handling hotunplug and making sure userspace has a 
>>>>>>> chance of
>>>>>>> survival. E.g. for atomic everything should continue, including flip
>>>>>>> completion, but we set all outputs to "disconnected" and send out 
>>>>>>> the
>>>>>>> uevent. Maybe crtc enabling can fail too, but that can also be 
>>>>>>> handled
>>>>>>> through the async status we're using to signal DP link failures to
>>>>>>> userspace.
>>>>>
>>>>> As I pointed before, because of the complexity of the topic I 
>>>>> prefer to
>>>>> take it step by step and solve first for secondary device use case, 
>>>>> not
>>>>> for primary, display attached device.
>>>>
>>>> Yeah makes sense. But then I think the right patch is to roll this 
>>>> out for
>>>> all drivers, properly justified with existing code. Not behind a driver
>>>> flag, because with all these different compositors the last thing we 
>>>> want
>>>> is a proliferation of driver-specific behaviour. That's imo the worst
>>>> option of all of them and needs to be avoided.
>>>
>>> So this kind of patch would be acceptable to you if I unconditionally
>>> scope the drm_ioctl with drm_dev_enter/exit without the driver flag ?
>>> I am worried to break other drivers with this, see patch 
>>> https://nam11.safelinks.protection.outlook.com/?url=https:%2F%2Fcgit.freedesktop.org%2F~agrodzov%2Flinux%2Fcommit%2F%3Fh%3Ddrm-misc-next%26id%3Df0c593f35b22ca5bf60ed9e7ce2bf2b80e6c68c6&amp;data=04%7C01%7Candrey.grodzovsky%40amd.com%7Cf4c0568093cc462f625808d90bc23a3c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637553751106596888%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=%2F3Jq6SvTm%2BZX7AVpaxEepfOj0C3O7%2Bo2Wm3y0gxrmKI%3D&amp;reserved=0 
>>>
>>> Before setting drm_dev_unplug I go through a whole process of signalling
>>> all possible fences in the system which some one some where might be
>>> waiting on. My concern is that in the absence of HW those fences won't
>>> signal and so unless I signal them myself srcu_synchrionize in
>>> drm_dev_unplug will hang waiting for any such code scoped by
>>> drm_dev_enter/exit.
>>
>> Uh right. I forgot about this.
>>
>> Which would kinda mean the top level scope is maybe not the best idea, 
>> and
>> perhaps we should indeed drill it down. But then the testing issue
>> definitely gets a lot worse.
>>
>> So what if we'd push that drm_dev_is_unplugged check down into ioctls?
>> Then we can make a case-by case decision whether it should be 
>> converted to
>> drm_dev_enter/exit, needs to be pushed down further into drivers (due to
>> fence wait issues) or other concerns?
>>
>> Also I guess we need to have a subsystem wide rule on whether you need to
>> force complete all fences before you call drm_dev_unplug, or afterwards.
> 
> I don't see how you can handle it afterwards. If a thread is stuck in
> dma_fence_wait in non interruptible wait (any kernel thread) and with no
> timeout there is nothing you can do to stop the wait. Any such code
> scopped with drm_dev_enter/exit will cause a hang in drm_dev_unplug.
> The only way then is to preemptively force signal all such fences before
> calling drm_dev_unplug - as I do in the above mentioned patch.
> 
>> If we have mixed behaviour on this there will be disappointment. And 
>> since
>> hotunplug and dma_fence completion are both userspace visible that
>> inconsistency might have bigger impact.
>>
>> This is all very tricky indeed :-/
>>
>> btw for the "gradual pushing drm_dev_enter into ioctl" approach, if we go
>> with that: We could do the same trick we've done for DRM_UNLOCKED:
>> - drm_dev_enter/exit is called for any ioctl that has not set the
>>    DRM_HOTUNPLUG_SAFE flag
>> - for drm core ioctls we push them into all ioctls and decide how to
>>    handle/where (with the aim to have the least amount of code flow
>>    different during hotunplug vs after hotunplug has finished, to reduce
>>    testing scope)
>> - then we make DRM_HOTUNPLUG_SAFE the implied default
>>
>> This would have us left with render ioctls, and I think the defensive
>> assumption there is that they're all hotunplug safe. We might hang on a
>> fence wait, but that's fixable, and it's better than blowing up on a
>> use-after-free security bug.
>>
>> Thoughts?
> 
> I don't fully see a difference between the approach described above and
> the full drill down to each driver and even within the driver, to the HW
> back-ends - what criteria I would use to decide if for a given IOCTL i
> scope with drm_dev_enter/exit at the highest level while for another
> i go all the way down ? If we would agree that signaling the fences
> preemptively before engaging drm_dev_unplug is generically the right
> approach maybe we can then scope drm_ioctl unconditionally with
> drm_dev_enter/exit and then for each driver go through the same process
> I do for amdgpu - writing driver specific function which takes care of
> all the fences. We could then just create a drm callback which would
> be called from drm_ioctl before drm_dev_unplug is called.
> 
> Andrey
> 
>>
>> It is unfortunately even more work until we've reached the goal, but I
>> think it's safest and most flexible approach overall.
>>
>> Cheers, Daniel
>>
>>>
>>> Andrey
>>>
>>>>
>>>> Cheers, Daniel
>>>>
>>>>
>>>>>
>>>>>>>
>>>>>>> I guess we should clarify this in the hotunplug doc?
>>>>>
>>>>> Agree
>>>>>
>>>>>>
>>>>>> To clarify: I'm not against throwing an ENODEV at userspace for 
>>>>>> ioctl that
>>>>>> really make no sense, and where we're rather confident that all 
>>>>>> properly
>>>>>> implemented userspace will gracefully handle failures. Like a 
>>>>>> modeset, or
>>>>>> opening a device, or trying to import a dma-buf or stuff like that 
>>>>>> which
>>>>>> can already fail in normal operation for any kind of reason.
>>>>>>
>>>>>> But stuff that never fails, like GETRESOURCES ioctl, really 
>>>>>> shouldn't fail
>>>>>> after hotunplug.
>>>>>
>>>>> As I pointed above, this a bit confuses me given that we already do
>>>>> blanker rejection of IOCTLs if device is unplugged.
>>>>
>>>> Well I'm confused about this too :-/
>>>>
>>>>>> And then there's the middle ground, like doing a pageflip or 
>>>>>> buffer flush,
>>>>>> which I guess some userspace might handle, but risky to inflict those
>>>>>> consequences on them. atomic modeset is especially fun since 
>>>>>> depending
>>>>>> what you're doing it can be both "failures expected" and "failures 
>>>>>> not
>>>>>> really expected in normal operation".
>>>>>>
>>>>>> Also, this really should be consistent across drivers, not solved 
>>>>>> with a
>>>>>> driver flag for every possible combination.
>>>>>>
>>>>>> If you look at the current hotunplug kms drivers, they have
>>>>>> drm_dev_enter/exit sprinkled in specific hw callback functions 
>>>>>> because of
>>>>>> the above problems. But maybe it makes sense to change things in a 
>>>>>> few
>>>>>> cases. But then we should do it across the board.
>>>>>
>>>>> So as I understand your preferred approach is that I scope any 
>>>>> back_end, HW
>>>>> specific function with drm_dev_enter/exit because that where MMIO
>>>>> access takes place. But besides explicit MMIO access thorough
>>>>> register accessors in the HW back-end there is also indirect MMIO 
>>>>> access
>>>>> taking place throughout the code in the driver because of various VRAM
>>>>> BOs which provide CPU access to VRAM through the VRAM BAR. This 
>>>>> kind of
>>>>> access is spread all over in the driver and even in mid-layers such as
>>>>> TTM and not limited to HW back-end functions. It means it's much 
>>>>> harder
>>>>> to spot such places to surgically scope them with 
>>>>> drm_dev_enter/exit and
>>>>> also that any new such code introduced will immediately break hot 
>>>>> unplug
>>>>> because the developers can't be expected to remember making their code
>>>>> robust to this specific use case. That why when we discussed 
>>>>> internally
>>>>> what approach to take to protecting code with drm_dev_enter/exit we
>>>>> opted for using the widest available scope.
>>>>
>>>> The thing is, you kinda have to anyway. There's enormous amounts of
>>>> asynchronous processing going on. E.g. nonblocking atomic commits 
>>>> also do
>>>> ttm unpinning and fun stuff like that, which if you sync things 
>>>> wrong can
>>>> happen way late. So the door for bad fallout is wide open :-(
>>>>
>>>> I'm not sure where the right tradeoff is to make sure we catch them 
>>>> all,
>>>> and can make sure with testing that we've indeed caught them all.
>>>> -Daniel
>>>>
>>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

  reply	other threads:[~2021-05-05 13:57 UTC|newest]

Thread overview: 82+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-28 15:11 [PATCH v5 00/27] RFC Support hot device unplug in amdgpu Andrey Grodzovsky
2021-04-28 15:11 ` [PATCH v5 01/27] drm/ttm: Remap all page faults to per process dummy page Andrey Grodzovsky
2021-04-28 15:11 ` [PATCH v5 02/27] drm/ttm: Expose ttm_tt_unpopulate for driver use Andrey Grodzovsky
2021-04-28 15:11 ` [PATCH v5 03/27] drm/amdgpu: Split amdgpu_device_fini into early and late Andrey Grodzovsky
2021-04-29  7:04   ` Christian König
2021-04-30  3:10     ` Alex Deucher
2021-04-30  5:19   ` Lazar, Lijo
2021-04-30  5:39     ` Andrey Grodzovsky
2021-04-30  5:49       ` Lazar, Lijo
2021-04-28 15:11 ` [PATCH v5 04/27] drm/amdkfd: Split kfd suspend from devie exit Andrey Grodzovsky
2021-04-28 15:11 ` [PATCH v5 05/27] drm/amdgpu: Add early fini callback Andrey Grodzovsky
2021-04-28 15:11 ` [PATCH v5 06/27] drm/amdgpu: Handle IOMMU enabled case Andrey Grodzovsky
2021-04-29  7:08   ` Christian König
2021-05-03 20:43     ` Andrey Grodzovsky
2021-05-04  7:03       ` Christian König
2021-05-04 15:43         ` Andrey Grodzovsky
2021-05-05 14:51           ` Andrey Grodzovsky
2021-04-30  3:13   ` Alex Deucher
2021-05-03 18:00     ` Andrey Grodzovsky
2021-05-04 17:05   ` Felix Kuehling
2021-05-05 14:05     ` Andrey Grodzovsky
2021-04-28 15:11 ` [PATCH v5 07/27] drm/amdgpu: Remap all page faults to per process dummy page Andrey Grodzovsky
2021-04-29  7:09   ` Christian König
2021-04-28 15:11 ` [PATCH v5 08/27] PCI: add support for dev_groups to struct pci_device_driver Andrey Grodzovsky
2021-04-28 16:53   ` Bjorn Helgaas
2021-04-29 16:53     ` Andrey Grodzovsky
2021-04-29 19:23       ` Bjorn Helgaas
2021-04-29 20:36         ` Andrey Grodzovsky
2021-04-28 15:11 ` [PATCH v5 09/27] dmr/amdgpu: Move some sysfs attrs creation to default_attr Andrey Grodzovsky
2021-04-28 17:23   ` Bjorn Helgaas
2021-04-29  7:11   ` Christian König
2021-04-28 15:11 ` [PATCH v5 10/27] drm/amdgpu: Guard against write accesses after device removal Andrey Grodzovsky
2021-04-29  7:14   ` Christian König
2021-04-28 15:11 ` [PATCH v5 11/27] drm/sched: Make timeout timer rearm conditional Andrey Grodzovsky
2021-04-28 15:11 ` [PATCH v5 12/27] drm/amdgpu: Prevent any job recoveries after device is unplugged Andrey Grodzovsky
2021-04-28 15:11 ` [PATCH v5 13/27] drm/amdgpu: When filizing the fence driver. stop scheduler first Andrey Grodzovsky
2021-04-29  7:15   ` Christian König
2021-04-29 17:12     ` Andrey Grodzovsky
2021-04-28 15:11 ` [PATCH v5 14/27] drm/amdgpu: Fix hang on device removal Andrey Grodzovsky
2021-04-28 15:11 ` [PATCH v5 15/27] drm/scheduler: Fix hang when sched_entity released Andrey Grodzovsky
2021-04-29  7:18   ` Christian König
2021-04-29 17:06     ` Andrey Grodzovsky
2021-04-30  6:47       ` Christian König
2021-04-30 16:10         ` Andrey Grodzovsky
2021-05-05 13:57           ` Andrey Grodzovsky
2021-05-07 16:29           ` Daniel Vetter
2021-05-07 16:32             ` Andrey Grodzovsky
2021-04-28 15:11 ` [PATCH v5 16/27] drm/amdgpu: Unmap all MMIO mappings Andrey Grodzovsky
2021-04-29  7:19   ` Christian König
2021-04-29 16:55     ` Andrey Grodzovsky
2021-04-28 15:11 ` [PATCH v5 17/27] drm/amdgpu: Add rw_sem to pushing job into sched queue Andrey Grodzovsky
2021-04-28 15:11 ` [PATCH v5 18/27] drm/sched: Expose drm_sched_entity_kill_jobs Andrey Grodzovsky
2021-04-28 15:11 ` [PATCH v5 19/27] drm/amdgpu: Finilise device fences on device remove Andrey Grodzovsky
2021-04-28 15:20   ` Andrey Grodzovsky
2021-04-28 15:12 ` [PATCH v5 20/27] drm: Scope all DRM IOCTLs with drm_dev_enter/exit Andrey Grodzovsky
2021-04-29 11:23   ` Daniel Vetter
2021-04-29 11:32     ` Daniel Vetter
2021-04-29 16:04       ` Andrey Grodzovsky
2021-04-29 16:15         ` Felix Kuehling
2021-04-29 16:21           ` Andrey Grodzovsky
2021-04-29 16:29             ` Felix Kuehling
2021-04-29 16:33               ` Andrey Grodzovsky
2021-04-29 19:05         ` Daniel Vetter
2021-04-29 20:34           ` Andrey Grodzovsky
2021-04-30 10:25             ` Daniel Vetter
2021-04-30 17:27               ` Andrey Grodzovsky
2021-05-05 13:57                 ` Andrey Grodzovsky [this message]
2021-05-06  9:40                 ` Daniel Vetter
2021-05-06 16:25                   ` Andrey Grodzovsky
2021-05-07  9:11                     ` Daniel Vetter
2021-05-07 15:39                       ` Andrey Grodzovsky
2021-05-07 16:24                         ` Daniel Vetter
2021-05-07 18:00                           ` Andrey Grodzovsky
2021-05-10 15:46                             ` Daniel Vetter
2021-04-28 15:12 ` [PATCH v5 21/27] drm/amdgpu: Add support for hot-unplug feature at DRM level Andrey Grodzovsky
2021-04-28 15:12 ` [PATCH v5 22/27] drm/amd/display: Scope all DM queued work with drm_dev_enter/exit Andrey Grodzovsky
2021-04-28 15:12 ` [PATCH v5 23/27] drm/amd/powerplay: Scope all PM " Andrey Grodzovsky
2021-04-28 15:12 ` [PATCH v5 24/27] drm/amdkfd: Scope all KFD " Andrey Grodzovsky
2021-04-28 15:12 ` [PATCH v5 25/27] drm/amdgpu: Scope all amdgpu " Andrey Grodzovsky
2021-04-28 15:12 ` [PATCH v5 26/27] drm/amd/display: Remove superflous drm_mode_config_cleanup Andrey Grodzovsky
2021-04-28 15:12 ` [PATCH v5 27/27] drm/amdgpu: Verify DMA opearations from device are done Andrey Grodzovsky
2021-04-28 17:07 ` [PATCH v5 00/27] RFC Support hot device unplug in amdgpu Bjorn Helgaas

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=de7ecf08-b2e5-48de-710a-217b4bfde6ca@amd.com \
    --to=andrey.grodzovsky@amd.com \
    --cc=Alexander.Deucher@amd.com \
    --cc=Felix.Kuehling@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=ckoenig.leichtzumerken@gmail.com \
    --cc=daniel.vetter@ffwll.ch \
    --cc=daniel@ffwll.ch \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=gregkh@linuxfoundation.org \
    --cc=helgaas@kernel.org \
    --cc=linux-pci@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).