From: Daniel Vetter <daniel@ffwll.ch>
To: "Marek Olšák" <maraeo@gmail.com>
Cc: ML Mesa-dev <mesa-dev@lists.freedesktop.org>,
	dri-devel <dri-devel@lists.freedesktop.org>
Subject: Re: [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
Date: Tue, 20 Apr 2021 14:01:29 +0200	[thread overview]
Message-ID: <YH7CmXyKFFh3lagO@phenom.ffwll.local> (raw)
In-Reply-To: <CAAxE2A4mpapnCE7uw8GNWkaRR4jXeoz9qa9j=9XknjR3yeq3YQ@mail.gmail.com>

On Mon, Apr 19, 2021 at 06:47:48AM -0400, Marek Olšák wrote:
> Hi,
> 
> This is our initial proposal for explicit fences everywhere and new memory
> management that doesn't use BO fences. It's a redesign of how Linux
> graphics drivers work, and it can coexist with what we have now.
> 
> 
> *1. Introduction*
> (skip this if you are already sold on explicit fences)
> 
> The current Linux graphics architecture was initially designed for GPUs
> with only one graphics queue where everything was executed in the
> submission order and per-BO fences were used for memory management and
> CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
> queues were added on top, which required the introduction of implicit
> GPU-GPU synchronization between queues of different processes using per-BO
> fences. Recently, even parallel execution within one queue was enabled
> where a command buffer starts draws and compute shaders, but doesn't wait
> for them, enabling parallelism between back-to-back command buffers.
> Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler
> was created to enable all those use cases, and it's the only reason why the
> scheduler exists.
> 
> The GPU scheduler, implicit synchronization, BO-fence-based memory
> management, and the tracking of per-BO fences increase CPU overhead and
> latency, and reduce parallelism. There is a desire to replace all of them
> with something much simpler. Below is how we could do it.

I get the feeling you're mixing up a lot of things here that have more
nuance, so first some lingo.

- There's kernel based synchronization, based on dma_fence. These come in
  two major variants: Implicit synchronization, where the kernel attaches
  the dma_fences to a dma-buf, and explicit synchronization, where the
  dma_fence gets passed around as a stand-alone object, either a sync_file
  or a drm_syncobj

- Then there's userspace fence synchronization, where userspace signals and
  waits on fences directly and the kernel doesn't even know what's going
  on. This is the only model that allows you to ditch the kernel overhead,
  and it's also the model that vk uses. (Rough sketch of both models below,
  after this list.)

  I concur with Jason that this one is the future, it's the model hw
  wants, compute wants and vk wants. Building an explicit fence world
  which doesn't aim at this is imo wasted effort.
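
To make the lingo concrete, here's a rough userspace-side sketch of the two
models. The drm_syncobj/sync_file calls are the real libdrm ones; the
userspace-fence half is purely conceptual (where the seqno lives and who
writes it is up to the driver/hw, none of those names are a real interface):

/* Kernel-based explicit sync: the fence is a kernel object (a drm_syncobj
 * here), created and waited on through libdrm.  It can be exported as a
 * sync_file fd and passed to another process. */
#include <stdbool.h>
#include <stdint.h>
#include <xf86drm.h>

int export_submit_fence(int drm_fd, uint32_t *syncobj, int *sync_file_fd)
{
    int ret = drmSyncobjCreate(drm_fd, 0, syncobj);
    if (ret)
        return ret;
    /* ... submit work that signals *syncobj via the driver's CS ioctl ... */
    return drmSyncobjExportSyncFile(drm_fd, *syncobj, sync_file_fd);
}

/* Userspace fence sync (conceptual only): the "fence" is just a memory
 * location the GPU writes; the kernel is not involved in signalling. */
struct userspace_fence {
    volatile uint64_t *seqno;   /* GPU-visible memory */
    uint64_t wait_value;
};

static inline bool userspace_fence_signaled(const struct userspace_fence *f)
{
    return *f->seqno >= f->wait_value;  /* plus the appropriate barriers */
}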

Now you smash them into one thing by also changing the memory model, but I
think that doesn't work:

- Relying on gpu page faults across the board won't happen. I think right
  now only amd's GFX10 or so has enough pagefault support to allow this,
  and even there I'm not really sure. Nothing else will anytime soon, at
  least not as far as I know. So we need to support slightly more hw in
  upstream than just that. Any plan that's realistic needs to cope with
  dma_fence for a really long time.

- Pown^WPin All The Things! is probably not a general enough memory
  management approach. We've kinda tried for years to move away from it.
  Sure we can support it as an optimization in specific workloads, and it
  will make stuff faster, but it's not going to be the default I think.

- We live in a post xf86-video-$vendor world, and all these other
  compositors rely on implicit sync. You're not going to be able to get
  rid of them anytime soon. What's worse, all the various EGL/vk buffer
  sharing things also rely on implicit sync, so you get to fix up tons of
  applications on top. Any plan that's realistic needs to cope with
  implicit and explicit sync at the same time; anything that doesn't won't
  work.

- Absolutely infuriating, but you can't use page-faulting together with any
  dma_fence synchronization primitives, whether implicit or explicit. This
  means until the entire ecosystem has moved forward (good luck with that)
  we have to support dma_fence. The only sync model that works together
  with page faults is userspace fence based sync.

Then there's the somewhat aside topic of how amdgpu/radeonsi does implicit
sync, at least last I checked. Currently this oversynchronizes badly
because it's left to the kernel to guess what should be synchronized, and
that gets things wrong. What you need there is explicit implicit
synchronization:

- on the cs side, userspace must state explicitly for which buffers the
  kernel should engage in implicit synchronization. That's how it works on
  all other drivers that support more explicit userspace like vk or gl
  drivers that are internally all explicit. So essentially you only set
  the implicit fence slot when you really want to, and only userspace
  knows this. Implementing this without breaking the current logic
  probably needs some flags. (Rough sketch of this and of Jason's ioctl
  below, after this list.)

- the other side isn't there yet upstream, but Jason has patches.
  Essentially you also need to sample your implicit sync points at the
  right spot, to avoid oversync on later rendering by the producer.
  Jason's patch solves this by adding an ioctl to dma-buf to get the
  current set.

- without any of this, for pure explicit fencing userspace the kernel will
  simply maintain a list of all current users of a buffer. For memory
  management, which means eviction handling roughly works like you
  describe below, we wait for everything before a buffer can be moved.
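
To illustrate the two halves with a sketch: the CS-side flag below is
purely hypothetical (no driver uses these names, it only shows the idea of
userspace opting individual buffers into implicit sync), while the second
part is the dma-buf ioctl from Jason's series; the struct and ioctl names
below follow the version that eventually landed in linux/dma-buf.h and may
differ slightly from the exact patches under discussion here:

/* Hypothetical CS-side interface: userspace opts individual buffers into
 * implicit sync.  Names made up for illustration only. */
#define HYPO_BO_IMPLICIT_READ   (1u << 0)  /* wait for implicit fences first */
#define HYPO_BO_IMPLICIT_WRITE  (1u << 1)  /* attach this job's fence to the BO */

struct hypo_cs_buffer {
    uint32_t handle;
    uint32_t implicit_flags;    /* 0 = fully explicit, no implicit sync */
};

/* The other half: sampling the current implicit fences of a dma-buf as a
 * sync_file, so the producer's fences can be handed over explicitly
 * instead of the kernel guessing later. */
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/dma-buf.h>

int sample_implicit_fences(int dmabuf_fd, int *sync_file_fd)
{
    struct dma_buf_export_sync_file arg = {
        .flags = DMA_BUF_SYNC_READ, /* the fences a reader must wait for */
    };

    if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &arg))
        return -errno;
    *sync_file_fd = arg.fd;
    return 0;
}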

This should get rid of the oversync issues, and since implicit sync is
baked in everywhere right now, you'll have to deal with implicit sync for
a very long time.

Next up is reducing the memory manager overhead of all this, without
changing the ecosystem.

- hw option would be page faults, but until we have full explicit
  userspace sync we can't use those. Which currently means compute only.
  Note that for vulkan or maybe also gl this is quite nasty for userspace,
  since as soon as you need to switch to dma_fence sync or implicit sync
  (winsys buffer, or buffer sharing with any of the current set of
  extensions) you have to flip your internal driver state around all sync
  points over from userspace fencing to dma_fence kernel fencing. Can
  still be all explicit using drm_syncobj ofc.

- next up if your hw has preemption, you could use that, except preemption
  takes a while longer, so from a memory pov it really should be done with
  dma_fence. Plus it has all the same problems in that it requires
  userspace fences.

- now for making dma_fence O(1) in the fastpath you need the shared
  dma_resv trick and the lru bulk move. radv/amdvlk use that, but I think
  radeonsi not yet. But maybe I missed that. Either way we need to do some
  better kernel work so it can also be fast for shared buffers, if those
  become a problem. On the GL side doing this will use a lot of the tricks
  for residency/working set management you describe below, except the
  kernel can still throw out an entire gpu job. This is essentially what
  you describe with 3.1. Vulkan/compute already work like this.
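
For reference, the shared dma_resv trick is already exposed on amdgpu as a
BO creation flag; a rough libdrm_amdgpu sketch (the flag name is from
memory, so double-check it):

#include <amdgpu.h>
#include <amdgpu_drm.h>

/* BOs allocated with VM_ALWAYS_VALID share the per-VM reservation object
 * and can never be exported, which is why a single fence on that shared
 * resv covers all of them: O(1) fencing instead of per-BO bookkeeping. */
int alloc_local_bo(amdgpu_device_handle dev, uint64_t size,
                   amdgpu_bo_handle *bo)
{
    struct amdgpu_bo_alloc_request req = {
        .alloc_size = size,
        .phys_alignment = 4096,
        .preferred_heap = AMDGPU_GEM_DOMAIN_VRAM,
        .flags = AMDGPU_GEM_CREATE_VM_ALWAYS_VALID,
    };

    return amdgpu_bo_alloc(dev, &req, bo);
}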

Now this gets the performance up, but it doesn't give us any road towards
using page faults (outside of compute) and so retiring dma_fence for good.
For that we need a few pieces:

- Full new set of userspace winsys protocols and egl/vk extensions. Pray
  it actually gets adopted, because neither AMD nor Intel have the
  engineers on their payrolls to push these kinds of ecosystem/middleware
  issues forward. Good pick is probably using drm_syncobj as the kernel
  primitive for this. Still uses dma_fence underneath.

- Some clever kernel tricks so that we can substitute dma_fence for
  userspace fences within a drm_syncobj. drm_syncobj already has the
  notion of waiting for a dma_fence to materialize. We can abuse that to
  create an upgrade path from dma_fence based sync to userspace fence
  syncing. Ofc none of this will be on the table if userspace hasn't
  adopted explicit sync.
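
The "materialize" bit already has a uapi: a rough sketch of what the compat
glue would do, using the existing wait-for-submit semantics (note the
timeout is an absolute CLOCK_MONOTONIC value):

#include <stdint.h>
#include <xf86drm.h>

/* Block until a fence shows up in the syncobj *and* signals.  Without
 * WAIT_FOR_SUBMIT the wait fails immediately when no fence has been
 * attached yet, which is exactly the hole a userspace-fence producer
 * would fall into. */
int wait_for_syncobj(int drm_fd, uint32_t syncobj, int64_t abs_timeout_ns)
{
    return drmSyncobjWait(drm_fd, &syncobj, 1, abs_timeout_ns,
                          DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT |
                          DRM_SYNCOBJ_WAIT_FLAGS_WAIT_ALL,
                          NULL);
}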

With these two things I think we can have a reasonable upgrade path. None
of this will be break-the-world type changes though.

Bunch of comments below.

> *2. Explicit synchronization for window systems and modesetting*
> 
> The producer is an application and the consumer is a compositor or a
> modesetting driver.
> 
> *2.1. The Present request*
> 
> As part of the Present request, the producer will pass 2 fences (sync
> objects) to the consumer alongside the presented DMABUF BO:
> - The submit fence: Initially unsignalled, it will be signalled when the
> producer has finished drawing into the presented buffer.
> - The return fence: Initially unsignalled, it will be signalled when the
> consumer has finished using the presented buffer.

Build this with syncobj timelines and it makes a lot more sense I think.
We'll need that for having a proper upgrade path, both on the hw/driver
side (being able to support stuff like preempt or gpu page faults) and the
ecosystem side (so that we don't have to rev protocols twice, once going
to explicit dma_fence sync and once more for userspace sync).
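
A rough sketch of the present handshake with timeline syncobjs (the libdrm
calls are real, but the protocol around them - who creates which timeline
and how points are numbered - is made up, and the CPU-side signal at the
end only keeps the example short; in practice the return point would be
signalled by the compositor's rendering or the KMS driver):

#include <stdint.h>
#include <xf86drm.h>

/* Producer: point N on the submit timeline is attached by the CS ioctl
 * and signals when drawing into the buffer for frame N is done.
 * Consumer: wait for the submit point, use the buffer, then signal the
 * return point so the producer can reuse it. */
int consume_frame(int drm_fd, uint32_t submit_timeline,
                  uint32_t return_timeline, uint64_t point,
                  int64_t abs_timeout_ns)
{
    int ret = drmSyncobjTimelineWait(drm_fd, &submit_timeline, &point, 1,
                                     abs_timeout_ns,
                                     DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT,
                                     NULL);
    if (ret)
        return ret; /* timed out: keep showing the previous frame */

    /* ... composite or scan out the buffer ... */

    return drmSyncobjTimelineSignal(drm_fd, &return_timeline, &point, 1);
}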

> Deadlock mitigation to recover from segfaults:
> - The kernel knows which process is obliged to signal which fence. This
> information is part of the Present request and supplied by userspace.
> - If the producer crashes, the kernel signals the submit fence, so that the
> consumer can make forward progress.
> - If the consumer crashes, the kernel signals the return fence, so that the
> producer can reclaim the buffer.

So for kernel based sync imo simplest is to just reuse dma_fence, same
rules apply.

For userspace fencing the kernel simply doesn't care how stupid userspace
is. Security checks at boundaries (e.g. client vs compositor) are also
userspace's problem and can be handled by e.g. timeouts + conditional
rendering on the compositor side. The timeout might be in the compat glue,
e.g. when we stall for a dma_fence to materialize from a drm_syncobj. I
think in vulkan this is de facto already up to applications to deal with
entirely if they deal with untrusted fences.
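
A minimal sketch of that compositor-side timeout, assuming the client fence
comes in as a drm_syncobj (the 100ms budget and the fallback policy are
made-up example values; drmSyncobjWait takes an absolute CLOCK_MONOTONIC
timeout, hence the deadline computation):

#include <stdbool.h>
#include <stdint.h>
#include <time.h>
#include <xf86drm.h>

/* Wait a bounded amount of time for the client's fence; if it doesn't
 * signal, the compositor keeps presenting the last good buffer (or uses
 * conditional rendering to the same effect). */
bool client_buffer_ready(int drm_fd, uint32_t client_syncobj)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    int64_t deadline = (int64_t)now.tv_sec * 1000000000ll + now.tv_nsec
                       + 100 * 1000000ll;   /* now + 100ms */

    return drmSyncobjWait(drm_fd, &client_syncobj, 1, deadline,
                          DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT,
                          NULL) == 0;
}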

> - A GPU hang signals all fences. Other deadlocks will be handled like GPU
> hangs.

Nope, we can't just shrug off all deadlocks with "gpu reset rolls in". For
one, with userspace fencing the kernel isn't aware of any deadlocks, you
fundamentally can't tell "has deadlocked" from "is still doing useful
computations" because that amounts to solving the halting problem.

Any programming model we come up with where both kernel and userspace are
involved needs to come up with rules where at least non-evil userspace
never deadlocks. And if you just allow both then it's pretty easy to come
up with scenarios where both userspace and kernel alone are deadlock free,
but interactions result in hangs. That's why we've recently documented all
the corner cases around indefinite dma_fences, and also why you currently
can't use gpu page faults with anything that uses dma_fence for sync.

That's why I think with userspace fencing the kernel simply should not be
involved at all, aside from providing optimized/blocking cpu wait
functionality.
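
To show the shape of that "cpu wait and nothing else" kernel involvement,
here's a purely hypothetical wait ioctl - none of these names exist, it's
only meant to illustrate how little uapi is needed:

#include <stdint.h>
#include <sys/ioctl.h>

/* Hypothetical: the kernel gets a GPU-visible address plus a value, arms
 * an interrupt/wait-queue, and blocks the caller until *gpu_addr >= value
 * or the timeout expires.  Signalling stays a plain memory write by the
 * GPU; the kernel never tracks or signals these fences itself. */
struct hypo_userspace_fence_wait {
    uint64_t gpu_addr;   /* where the GPU writes the seqno */
    uint64_t value;      /* wait until *gpu_addr >= value */
    uint64_t timeout_ns; /* bounded; the policy is userspace's problem */
};
#define HYPO_IOCTL_UFENCE_WAIT \
    _IOW('d', 0x42, struct hypo_userspace_fence_wait)

/* Userspace would spin briefly on the seqno first and only fall back to
 * the ioctl for long waits, so the fast path stays entirely in userspace. */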

> Other window system requests can follow the same idea.
> 
> Merged fences where one fence object contains multiple fences will be
> supported. A merged fence is signalled only when its fences are signalled.
> The consumer will have the option to redefine the unsignalled return fence
> to a merged fence.
> 
> *2.2. Modesetting*
> 
> Since a modesetting driver can also be the consumer, the present ioctl will
> contain a submit fence and a return fence too. One small problem with this
> is that userspace can hang the modesetting driver, but in theory, any later
> present ioctl can override the previous one, so the unsignalled
> presentation is never used.
> 
> 
> *3. New memory management*
> 
> The per-BO fences will be removed and the kernel will not know which
> buffers are busy. This will reduce CPU overhead and latency. The kernel
> will not need per-BO fences with explicit synchronization, so we just need
> to remove their last user: buffer evictions. It also resolves the current
> OOM deadlock.

What's "the current OOM deadlock"?

> 
> *3.1. Evictions*
> 
> If the kernel wants to move a buffer, it will have to wait for everything
> to go idle, halt all userspace command submissions, move the buffer, and
> resume everything. This is not expected to happen when memory is not
> exhausted. Other more efficient ways of synchronization are also possible
> (e.g. sync only one process), but are not discussed here.
> 
> *3.2. Per-process VRAM usage quota*
> 
> Each process can optionally and periodically query its VRAM usage quota and
> change domains of its buffers to obey that quota. For example, a process
> allocated 2 GB of buffers in VRAM, but the kernel decreased the quota to 1
> GB. The process can change the domains of the least important buffers to
> GTT to get the best outcome for itself. If the process doesn't do it, the
> kernel will choose which buffers to evict at random. (thanks to Christian
> Koenig for this idea)
> 
> *3.3. Buffer destruction without per-BO fences*
> 
> When the buffer destroy ioctl is called, an optional fence list can be
> passed to the kernel to indicate when it's safe to deallocate the buffer.
> If the fence list is empty, the buffer will be deallocated immediately.
> Shared buffers will be handled by merging fence lists from all processes
> that destroy them. Mitigation of malicious behavior:
> - If userspace destroys a busy buffer, it will get a GPU page fault.
> - If userspace sends fences that never signal, the kernel will have a
> timeout period and then will proceed to deallocate the buffer anyway.
> 
> *3.4. Other notes on MM*
> 
> Overcommitment of GPU-accessible memory will cause an allocation failure or
> invoke the OOM killer. Evictions to GPU-inaccessible memory might not be
> supported.
> 
> Kernel drivers could move to this new memory management today. Only buffer
> residency and evictions would stop using per-BO fences.
> 
> 
> 
> *4. Deprecating implicit synchronization*
> 
> It can be phased out by introducing a new generation of hardware where the
> driver doesn't add support for it (like a driver fork would do), assuming
> userspace has all the changes for explicit synchronization. This could
> potentially create an isolated part of the kernel DRM where all drivers
> only support explicit synchronization.

10-20 years I'd say before that's even an option.
-Daniel

> 
> Marek


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch