Re: [Linaro-mm-sig] [PATCH] RFC: dma-fence: Document recoverable page fault implications

From: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
To: Felix Kuehling <felix.kuehling@amd.com>
Cc: "Daniel Vetter" <daniel.vetter@ffwll.ch>,
	"DRI Development" <dri-devel@lists.freedesktop.org>,
	linaro-mm-sig@lists.linaro.org,
	"Jerome Glisse" <jglisse@redhat.com>,
	"Thomas Hellström" <thomas.hellstrom@intel.com>,
	"Daniel Vetter" <daniel.vetter@intel.com>,
	"Koenig, Christian" <christian.koenig@amd.com>,
	linux-media@vger.kernel.org
Subject: Re: [Linaro-mm-sig] [PATCH] RFC: dma-fence: Document recoverable page fault implications
Date: Tue, 9 Feb 2021 04:13:44 +0100
Message-ID: <CAP+8YyG1Yupdqvrp4uUunYkVeZvhWB4rojfAtuSwMmqXRj44oQ@mail.gmail.com>
In-Reply-To: <65b7a61c-b4b9-a210-5a37-0f69d01f667c@amd.com>

On Thu, Jan 28, 2021 at 4:40 PM Felix Kuehling <felix.kuehling@amd.com> wrote:
>
> Am 2021-01-28 um 2:39 a.m. schrieb Christian König:
> > Am 27.01.21 um 23:00 schrieb Felix Kuehling:
> >> Am 2021-01-27 um 7:16 a.m. schrieb Christian König:
> >>> Am 27.01.21 um 13:11 schrieb Maarten Lankhorst:
> >>>> Op 27-01-2021 om 01:22 schreef Felix Kuehling:
> >>>>> Am 2021-01-21 um 2:40 p.m. schrieb Daniel Vetter:
> >>>>>> Recently there was a fairly long thread about recoverable hardware
> >>>>>> page
> >>>>>> faults, how they can deadlock, and what to do about that.
> >>>>>>
> >>>>>> While the discussion is still fresh I figured it is a good time to try and
> >>>>>> document the conclusions a bit.
> >>>>>>
> >>>>>> References:
> >>>>>> https://lore.kernel.org/dri-devel/20210107030127.20393-1-Felix.Kuehling@amd.com/
> >>>>>>
> >>>>>> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> >>>>>> Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> >>>>>> Cc: "Christian König" <christian.koenig@amd.com>
> >>>>>> Cc: Jerome Glisse <jglisse@redhat.com>
> >>>>>> Cc: Felix Kuehling <felix.kuehling@amd.com>
> >>>>>> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> >>>>>> Cc: Sumit Semwal <sumit.semwal@linaro.org>
> >>>>>> Cc: linux-media@vger.kernel.org
> >>>>>> Cc: linaro-mm-sig@lists.linaro.org
> >>>>>> --
> >>>>>> I'll be away next week, but figured I'll type this up quickly for
> >>>>>> some
> >>>>>> comments and to check whether I got this all roughly right.
> >>>>>>
> >>>>>> Critique very much wanted on this, so that we can make sure hw which
> >>>>>> can't preempt (with pagefaults pending) like gfx10 has a clear
> >>>>>> path to
> >>>>>> support page faults in upstream. So anything I missed, got wrong or
> >>>>>> like that would be good.
> >>>>>> -Daniel
> >>>>>> ---
> >>>>>>    Documentation/driver-api/dma-buf.rst | 66
> >>>>>> ++++++++++++++++++++++++++++
> >>>>>>    1 file changed, 66 insertions(+)
> >>>>>>
> >>>>>> diff --git a/Documentation/driver-api/dma-buf.rst
> >>>>>> b/Documentation/driver-api/dma-buf.rst
> >>>>>> index a2133d69872c..e924c1e4f7a3 100644
> >>>>>> --- a/Documentation/driver-api/dma-buf.rst
> >>>>>> +++ b/Documentation/driver-api/dma-buf.rst
> >>>>>> @@ -257,3 +257,69 @@ fences in the kernel. This means:
> >>>>>>      userspace is allowed to use userspace fencing or long running
> >>>>>> compute
> >>>>>>      workloads. This also means no implicit fencing for shared
> >>>>>> buffers in these
> >>>>>>      cases.
> >>>>>> +
> >>>>>> +Recoverable Hardware Page Faults Implications
> >>>>>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>>>>> +
> >>>>>> +Modern hardware supports recoverable page faults, which has a
> >>>>>> lot of
> >>>>>> +implications for DMA fences.
> >>>>>> +
> >>>>>> +First, a pending page fault obviously holds up the work that's
> >>>>>> running on the
> >>>>>> +accelerator and a memory allocation is usually required to resolve
> >>>>>> the fault.
> >>>>>> +But memory allocations are not allowed to gate completion of DMA
> >>>>>> fences, which
> >>>>>> +means any workload using recoverable page faults cannot use DMA
> >>>>>> fences for
> >>>>>> +synchronization. Synchronization fences controlled by userspace
> >>>>>> must be used
> >>>>>> +instead.
> >>>>>> +
> >>>>>> +On GPUs this poses a problem, because current desktop compositor
> >>>>>> protocols on
> >>>>>> +Linux rely on DMA fences, which means without an entirely new
> >>>>>> userspace stack
> >>>>>> +built on top of userspace fences, they cannot benefit from
> >>>>>> recoverable page
> >>>>>> +faults. The exception is when page faults are only used as
> >>>>>> migration hints and
> >>>>>> +never to on-demand fill a memory request. For now this means
> >>>>>> recoverable page
> >>>>>> +faults on GPUs are limited to pure compute workloads.
> >>>>>> +
> >>>>>> +Furthermore GPUs usually have shared resources between the 3D
> >>>>>> rendering and
> >>>>>> +compute side, like compute units or command submission engines. If
> >>>>>> both a 3D
> >>>>>> +job with a DMA fence and a compute workload using recoverable page
> >>>>>> faults are
> >>>>>> +pending they could deadlock:
> >>>>>> +
> >>>>>> +- The 3D workload might need to wait for the compute job to finish
> >>>>>> and release
> >>>>>> +  hardware resources first.
> >>>>>> +
> >>>>>> +- The compute workload might be stuck in a page fault, because the
> >>>>>> memory
> >>>>>> +  allocation is waiting for the DMA fence of the 3D workload to
> >>>>>> complete.
> >>>>>> +
> >>>>>> +There are a few ways to prevent this problem:
> >>>>>> +
> >>>>>> +- Compute workloads can always be preempted, even when a page
> >>>>>> fault is pending
> >>>>>> +  and not yet repaired. Not all hardware supports this.
> >>>>>> +
> >>>>>> +- DMA fence workloads and workloads which need page fault handling
> >>>>>> have
> >>>>>> +  independent hardware resources to guarantee forward progress.
> >>>>>> This could be
> >>>>>> +  achieved e.g. through dedicated engines and minimal
> >>>>>> compute unit
> >>>>>> +  reservations for DMA fence workloads.
> >>>>>> +
> >>>>>> +- The reservation approach could be further refined by only
> >>>>>> reserving the
> >>>>>> +  hardware resources for DMA fence workloads when they are
> >>>>>> in-flight. This must
> >>>>>> +  cover the time from when the DMA fence is visible to other
> >>>>>> threads up to
> >>>>>> +  the moment when the fence is completed through dma_fence_signal().
> >>>>>> +
> >>>>>> +- As a last resort, if the hardware provides no useful reservation
> >>>>>> mechanics,
> >>>>>> +  all workloads must be flushed from the GPU when switching
> >>>>>> between jobs
> >>>>>> +  requiring DMA fences or jobs requiring page fault handling: This
> >>>>>> means all DMA
> >>>>>> +  fences must complete before a compute job with page fault
> >>>>>> handling can be
> >>>>>> +  inserted into the scheduler queue. And vice versa, before a DMA
> >>>>>> fence can be
> >>>>>> +  made visible anywhere in the system, all compute workloads must
> >>>>>> be preempted
> >>>>>> +  to guarantee all pending GPU page faults are flushed.
> >>>>> I thought of another possible workaround:
> >>>>>
> >>>>>     * Partition the memory. Servicing of page faults will use a
> >>>>> separate
> >>>>>       memory pool that can always be allocated from without
> >>>>> waiting for
> >>>>>       fences. This includes memory for page tables and memory for
> >>>>>       migrating data to. You may steal memory from other processes
> >>>>> that
> >>>>>       can page fault, so no fence waiting is necessary. Being able to
> >>>>>       steal memory at any time also means there are basically no
> >>>>>       out-of-memory situations you need to worry about. Even page
> >>>>> tables
> >>>>>       (except the root page directory of each process) can be
> >>>>> stolen in
> >>>>>       the worst case.
> >>>> I think 'overcommit' would be a nice way to describe this. But I'm not
> >>>> sure how easy this is to implement in practice. You would basically
> >>>> need
> >>>> to create your own memory manager for this.
> >>> Well you would need a completely separate pool for both device as well
> >>> as system memory.
> >>>
> >>> E.g. on boot we say we steal X GB system memory only for HMM.
> >> Why? The GPU driver doesn't need to allocate system memory for HMM.
> >> Migrations to system memory are handled by the kernel's handle_mm_fault
> >> and page allocator and swap logic.
> >
> > And that one depends on dma_fence completion because you can easily
> > need to wait for an MMU notifier callback.
>
> I see, the GFX MMU notifier for userpointers in amdgpu currently waits
> for fences. For the KFD MMU notifier I am planning to fix this by
> causing GPU page faults instead of preempting the queues. Can we limit
> userptrs in amdgpu to engines that can page fault? Basically make it
> illegal to attach userptr BOs to graphics CS BO lists, so they can only
> be used in user mode command submissions, which can page fault. Then the
> GFX MMU notifier could invalidate PTEs and would not have to wait for
> fences.

Sadly, graphics + userptr is already exposed via Mesa.

>
>
> >
> > As Maarten wrote when you want to go down this route you need a
> > completely separate memory management parallel to that of the kernel.
>
> Not really. I'm trying to make the GPU memory management more similar to
> what the kernel does for system memory.
>
> I understood Maarten's comment as "I'm creating a new memory manager and
> not using TTM any more". This is true. The idea is that this portion of
> VRAM would be managed more like system memory.
>
> Regards,
>   Felix
>
>
> >
> > Regards,
> > Christian.
> >
> >>   It doesn't depend on any fences, so
> >> it cannot deadlock with any GPU driver-managed memory. The GPU driver
> >> gets involved in the MMU notifier to invalidate device page tables. But
> >> that also doesn't need to wait for any fences.
> >>
> >> And if the kernel runs out of pageable memory, you're in trouble anyway.
> >> The OOM killer will step in, nothing new there.
> >>
> >> Regards,
> >>    Felix
> >>
> >>
> >>>> But from a design point of view, definitely a valid solution.
> >>> I think the restriction above makes it pretty much unusable.
> >>>
> >>>> But this looks good, those solutions are definitely the valid
> >>>> options we
> >>>> can choose from.
> >>> It's certainly worth noting, yes. And just to make sure that nobody
> >>> has the idea to reserve only device memory.
> >>>
> >>> Christian.
> >>>
> >>>> ~Maarten
> >>>>
> >> _______________________________________________
> >> Linaro-mm-sig mailing list
> >> Linaro-mm-sig@lists.linaro.org
> >> https://lists.linaro.org/mailman/listinfo/linaro-mm-sig
> >>
> >
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel