All of lore.kernel.org
 help / color / mirror / Atom feed
From: Daniel Vetter <daniel@ffwll.ch>
To: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
Cc: Dave Airlie <airlied@redhat.com>, Sean Paul <sean@poorly.run>,
	"Koenig, Christian" <Christian.Koenig@amd.com>,
	dri-devel <dri-devel@lists.freedesktop.org>
Subject: Re: [PATCH] drm/doc: device hot-unplug for userspace
Date: Wed, 27 May 2020 16:39:33 +0200	[thread overview]
Message-ID: <CAKMK7uGHrPfAeN-PVZ5ixf7hSGj-RatTebt-nR5pTsykWOMkAw@mail.gmail.com> (raw)
In-Reply-To: <63d2e957-ae79-8c70-29c9-fd53a7de92cf@amd.com>

On Wed, May 27, 2020 at 3:51 PM Andrey Grodzovsky
<Andrey.Grodzovsky@amd.com> wrote:
>
>
> On 5/27/20 2:44 AM, Pekka Paalanen wrote:
> > On Tue, 26 May 2020 10:30:20 -0400
> > Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> wrote:
> >
> >> On 5/19/20 6:06 AM, Pekka Paalanen wrote:
> >>> From: Pekka Paalanen <pekka.paalanen@collabora.com>
> >>>
> >>> Set up the expectations on how hot-unplugging a DRM device should look like to
> >>> userspace.
> >>>
> >>> Written by Daniel Vetter's request and largely based on his comments in IRC and
> >>> from https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Farchives%2Fdri-devel%2F2020-May%2F265484.html&amp;data=02%7C01%7Candrey.grodzovsky%40amd.com%7Ce8e13dc4c85648e5fcd408d7fbdc5f2b%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637254796242596783&amp;sdata=bA%2FAy3VGvzNqmV1kGigNROSZQEws2E98JibDxvEICNs%3D&amp;reserved=0 .
> >>>
> >>> Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.com>
> >>> Cc: Daniel Vetter <daniel@ffwll.ch>
> >>> Cc: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> >>> Cc: Dave Airlie <airlied@redhat.com>
> >>> Cc: Sean Paul <sean@poorly.run>
> >>>
> >>> ---
> >>>
> >>> Disclaimer: I am a userspace developer writing for other userspace developers.
> >>> I took some liberties in defining what should happen without knowing what is
> >>> actually possible or what existing drivers already implement.
> >>> ---
> >>>    Documentation/gpu/drm-uapi.rst | 75 ++++++++++++++++++++++++++++++++++
> >>>    1 file changed, 75 insertions(+)
> >>>
> >>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> >>> index 56fec6ed1ad8..80db4abd2cbd 100644
> >>> --- a/Documentation/gpu/drm-uapi.rst
> >>> +++ b/Documentation/gpu/drm-uapi.rst
> >>> @@ -1,3 +1,5 @@
> >>> +.. Copyright 2020 DisplayLink (UK) Ltd.
> >>> +
> >>>    ===================
> >>>    Userland interfaces
> >>>    ===================
> >>> @@ -162,6 +164,79 @@ other hand, a driver requires shared state between clients which is
> >>>    visible to user-space and accessible beyond open-file boundaries, they
> >>>    cannot support render nodes.
> >>>
> >>> +Device Hot-Unplug
> >>> +=================
> >>> +
> >>> +.. note::
> >>> +   The following is the plan. Implementation is not there yet
> >>> +   (2020 May 13).
> >>> +
> >>> +Graphics devices (display and/or render) may be connected via USB (e.g.
> >>> +display adapters or docking stations) or Thunderbolt (e.g. eGPU). An end
> >>> +user is able to hot-unplug this kind of devices while they are being
> >>> +used, and expects that the very least the machine does not crash. Any
> >>> +damage from hot-unplugging a DRM device needs to be limited as much as
> >>> +possible and userspace must be given the chance to handle it if it wants
> >>> +to. Ideally, unplugging a DRM device still lets a desktop to continue
> >>> +running, but that is going to need explicit support throughout the whole
> >>> +graphics stack: from kernel and userspace drivers, through display
> >>> +servers, via window system protocols, and in applications and libraries.
> >>> +
> >>> +Other scenarios that should lead to the same are: unrecoverable GPU
> >>> +crash, PCI device disappearing off the bus, or forced unbind of a driver
> >>> +from the physical device.
> >>> +
> >>> +In other words, from userspace perspective everything needs to keep on
> >>> +working more or less, until userspace stops using the disappeared DRM
> >>> +device and closes it completely. Userspace will learn of the device
> >>> +disappearance from the device removed uevent or in some cases specific
> >>> +ioctls returning EIO.
> >>> +
> >>> +This goal raises at least the following requirements for the kernel and
> >>> +drivers:
> >>> +
> >>> +- The kernel must not hang, crash or oops, no matter what userspace was
> >>> +  in the middle of doing when the device disappeared.
> >>> +
> >>> +- All GPU jobs that can no longer run must have their fences
> >>> +  force-signalled to avoid inflicting hangs to userspace.
> >>> +
> >>> +- KMS connectors must change their status to disconnected.
> >>> +
> >>> +- Legacy modesets and pageflips fake success.
> >>> +
> >>> +- Atomic commits, both real and TEST_ONLY, fake success.
> >>> +
> >>> +- Pending non-blocking KMS operations deliver the DRM events userspace
> >>> +  is expecting.
> >>> +
> >>> +- If underlying memory disappears, the mmaps are replaced with harmless
> >>> +  zero pages where access does not raise SIGBUS. Reads return zeros,
> >>> +  writes are ignored.
> >>
> >> Regarding this paragraph - what about exiting mappings ? In the first
> >> patchset we would actively invalidate all the existing CPU mappings to
> >> device memory and i think we still should do it otherwise we will see
> >> random crashes in applications as was before. I guess it's because TLBs
> >> and page tables are not updated to reflect the fact the device is gone.
> > Hi,
> >
> > I was talking about existing mappings. What I forgot to specify is how
> > new mmap() calls after the device disappearance should work (the end
> > result should be the same still, not failure).
> >
> > I'll clarify this in the next revision.
> >
> >
> > Thanks,
> > pq
>
>
> I see, that ok.
>
> Next related question is more for Daniel/Christian - about the
> implementation of this paragraph, I was thinking about something like
> checking for device disconnect in ttm_bo_vm_fault_reserved and if so
> remap the entire VA range for the VMA where the fault address belongs to
> the global zero page (i.e. (remap_pfn_range(vma, vma->vm_start,
> page_to_pfn(ZERO_PAGE(vma->vm_start), vma->vm_end - vma->vm_start,
> vma->vm_page_prot)). Question is, when the doc says 'writes are ignored'
> does it mean i should use copy on write for the vma->vm_page_prot and if
> so how i actually do it as i was not able to find what flags to set into
> vm_page_prot to force copy on write behavior.

Already discussed this with Pekka on irc, I think simply a private
page (per gpu ctx to avoid leaks) is good enough. Otherwise we need to
catch write faults and throw the writes away, and that's a) a bit
tricky to implement and b) slow, which we kinda don't want to. If the
desktop is stuck for a few seconds because we're trapping every write
of a 4k buffer that's getting uploaded, the user is going to have a
bad time :-/
-Daniel

>
> Andrey
>
>
>
>
> >
> >
> >>> +
> >>> +- dmabuf which point to memory that has disappeared are rewritten to
> >>> +  point to harmless zero pages, similar to mmaps. Imports still succeed
> >>> +  both ways: an existing device importing a dmabuf pointing to
> >>> +  disappeared memory, and a disappeared device importing any dmabuf.
> >>> +
> >>> +- Render ioctls return EIO which is then handled in userspace drivers,
> >>> +  e.g. Mesa, to have the device disappearance handled in the way
> >>> +  specified for each API (OpenGL, GL ES: GL_KHR_robustness;
> >>> +  Vulkan: VK_ERROR_DEVICE_LOST; etc.)
> >>> +
> >>> +Raising SIGBUS is not an option, because userspace cannot realistically
> >>> +handle it.  Signal handlers are global, which makes them extremely
> >>> +difficult to use correctly from libraries like Mesa produces. Signal
> >>> +handlers are not composable, you can't have different handlers for GPU1
> >>> +and GPU2 from different vendors, and a third handler for mmapped regular
> >>> +files.  Threads cause additional pain with signal handling as well.
> >>> +
> >>> +Only after userspace has closed all relevant DRM device and dmabuf file
> >>> +descriptors and removed all mmaps, the DRM driver can tear down its
> >>> +instance for the device that no longer exists. If the same physical
> >>> +device somehow comes back in the mean time, it shall be a new DRM
> >>> +device.
> >>> +
> >>>    .. _drm_driver_ioctl:
> >>>
> >>>    IOCTL Support on Device Nodes



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

  reply	other threads:[~2020-05-27 14:39 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-19 10:06 [PATCH] drm/doc: device hot-unplug for userspace Pekka Paalanen
2020-05-19 14:37 ` Andrey Grodzovsky
2020-05-20 11:19   ` Pekka Paalanen
2020-05-20 12:46     ` Daniel Vetter
2020-05-20 14:50       ` Andrey Grodzovsky
2020-05-25 12:13         ` Andrey Grodzovsky
2020-05-25 14:28         ` Daniel Vetter
2020-05-25 14:55           ` Pekka Paalanen
2020-05-25 15:09             ` Daniel Vetter
2020-05-28 12:27               ` Pekka Paalanen
2020-05-28 14:45                 ` Daniel Vetter
2020-06-01 12:04                   ` Pekka Paalanen
2020-05-20 12:55 ` Daniel Vetter
2020-05-20 13:20   ` Simon Ser
2020-05-20 14:19     ` Daniel Vetter
2020-05-22  9:54       ` Pekka Paalanen
2020-05-25 14:29         ` Daniel Vetter
2020-05-25 12:46 ` [PATCH v2] " Pekka Paalanen
2020-05-25 13:51   ` Andrey Grodzovsky
2020-05-25 14:30     ` Daniel Vetter
2020-05-25 15:02       ` Pekka Paalanen
2020-05-25 14:41     ` Pekka Paalanen
2020-05-26 14:30 ` [PATCH] " Andrey Grodzovsky
2020-05-27  6:44   ` Pekka Paalanen
2020-05-27 13:51     ` Andrey Grodzovsky
2020-05-27 14:39       ` Daniel Vetter [this message]
2020-05-27 15:23         ` Andrey Grodzovsky
2020-05-27 19:44           ` Christian König
2020-05-27 20:25             ` Daniel Vetter
2020-05-28 12:13               ` Pekka Paalanen
2020-05-28 21:28               ` Andrey Grodzovsky
2020-06-01 14:32 ` [PATCH v3] " Pekka Paalanen
2020-06-02 14:00   ` Andrey Grodzovsky
2020-06-03  8:19     ` Daniel Vetter
2020-06-08 16:05   ` Pekka Paalanen
2020-06-10 15:40   ` Daniel Vetter
2020-06-22 14:05 ` [PATCH v4] " Pekka Paalanen
2020-06-22 16:56   ` Alex Deucher
2020-06-27 10:02   ` Noralf Trønnes
2020-07-07 11:38 ` [PATCH v5] " Pekka Paalanen
2020-07-07 11:49   ` Karol Herbst
2020-07-07 12:13     ` Daniel Vetter
2020-07-07 12:59   ` Simon Ser
2020-07-30 13:35   ` drm/doc: missing SPDX license identifier (Was: Re: [PATCH v5] drm/doc: device hot-unplug for userspace) Emil Velikov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAKMK7uGHrPfAeN-PVZ5ixf7hSGj-RatTebt-nR5pTsykWOMkAw@mail.gmail.com \
    --to=daniel@ffwll.ch \
    --cc=Andrey.Grodzovsky@amd.com \
    --cc=Christian.Koenig@amd.com \
    --cc=airlied@redhat.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=sean@poorly.run \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.