Re: [Linaro-mm-sig] [PATCH] RFC: dma-fence: Document recoverable page fault implications

From: "Christian König" <ckoenig.leichtzumerken@gmail.com>
To: "Felix Kuehling" <felix.kuehling@amd.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Maarten Lankhorst" <maarten.lankhorst@linux.intel.com>,
	"Daniel Vetter" <daniel.vetter@ffwll.ch>,
	"DRI Development" <dri-devel@lists.freedesktop.org>
Cc: linaro-mm-sig@lists.linaro.org,
	"Daniel Vetter" <daniel.vetter@intel.com>,
	"Jerome Glisse" <jglisse@redhat.com>,
	"Thomas Hellström" <thomas.hellstrom@intel.com>,
	linux-media@vger.kernel.org
Subject: Re: [Linaro-mm-sig] [PATCH] RFC: dma-fence: Document recoverable page fault implications
Date: Thu, 28 Jan 2021 08:39:10 +0100	[thread overview]
Message-ID: <c9c8d386-87a1-6678-b5c6-854de210d8d3@gmail.com> (raw)
In-Reply-To: <18e7efbd-3d10-5ad1-49c9-7e26f0a27ef2@amd.com>

Am 27.01.21 um 23:00 schrieb Felix Kuehling:
> Am 2021-01-27 um 7:16 a.m. schrieb Christian König:
>> Am 27.01.21 um 13:11 schrieb Maarten Lankhorst:
>>> Op 27-01-2021 om 01:22 schreef Felix Kuehling:
>>>> Am 2021-01-21 um 2:40 p.m. schrieb Daniel Vetter:
>>>>> Recently there was a fairly long thread about recoreable hardware page
>>>>> faults, how they can deadlock, and what to do about that.
>>>>>
>>>>> While the discussion is still fresh I figured good time to try and
>>>>> document the conclusions a bit.
>>>>>
>>>>> References:
>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fdri-devel%2F20210107030127.20393-1-Felix.Kuehling%40amd.com%2F&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7Cbee0aeff80f440bcc52108d8c2bcc11f%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637473463245588199%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=ncr%2Fqv5lw0ONrYxFvfdcFAXAZ%2BXcJJa6UY%2BxGfcKGVM%3D&amp;reserved=0
>>>>> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>>>>> Cc: Thomas Hellström <thomas.hellstrom@intel.com>
>>>>> Cc: "Christian König" <christian.koenig@amd.com>
>>>>> Cc: Jerome Glisse <jglisse@redhat.com>
>>>>> Cc: Felix Kuehling <felix.kuehling@amd.com>
>>>>> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
>>>>> Cc: Sumit Semwal <sumit.semwal@linaro.org>
>>>>> Cc: linux-media@vger.kernel.org
>>>>> Cc: linaro-mm-sig@lists.linaro.org
>>>>> -- 
>>>>> I'll be away next week, but figured I'll type this up quickly for some
>>>>> comments and to check whether I got this all roughly right.
>>>>>
>>>>> Critique very much wanted on this, so that we can make sure hw which
>>>>> can't preempt (with pagefaults pending) like gfx10 has a clear path to
>>>>> support page faults in upstream. So anything I missed, got wrong or
>>>>> like that would be good.
>>>>> -Daniel
>>>>> ---
>>>>>    Documentation/driver-api/dma-buf.rst | 66
>>>>> ++++++++++++++++++++++++++++
>>>>>    1 file changed, 66 insertions(+)
>>>>>
>>>>> diff --git a/Documentation/driver-api/dma-buf.rst
>>>>> b/Documentation/driver-api/dma-buf.rst
>>>>> index a2133d69872c..e924c1e4f7a3 100644
>>>>> --- a/Documentation/driver-api/dma-buf.rst
>>>>> +++ b/Documentation/driver-api/dma-buf.rst
>>>>> @@ -257,3 +257,69 @@ fences in the kernel. This means:
>>>>>      userspace is allowed to use userspace fencing or long running
>>>>> compute
>>>>>      workloads. This also means no implicit fencing for shared
>>>>> buffers in these
>>>>>      cases.
>>>>> +
>>>>> +Recoverable Hardware Page Faults Implications
>>>>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>> +
>>>>> +Modern hardware supports recoverable page faults, which has a lot of
>>>>> +implications for DMA fences.
>>>>> +
>>>>> +First, a pending page fault obviously holds up the work that's
>>>>> running on the
>>>>> +accelerator and a memory allocation is usually required to resolve
>>>>> the fault.
>>>>> +But memory allocations are not allowed to gate completion of DMA
>>>>> fences, which
>>>>> +means any workload using recoverable page faults cannot use DMA
>>>>> fences for
>>>>> +synchronization. Synchronization fences controlled by userspace
>>>>> must be used
>>>>> +instead.
>>>>> +
>>>>> +On GPUs this poses a problem, because current desktop compositor
>>>>> protocols on
>>>>> +Linus rely on DMA fences, which means without an entirely new
>>>>> userspace stack
>>>>> +built on top of userspace fences, they cannot benefit from
>>>>> recoverable page
>>>>> +faults. The exception is when page faults are only used as
>>>>> migration hints and
>>>>> +never to on-demand fill a memory request. For now this means
>>>>> recoverable page
>>>>> +faults on GPUs are limited to pure compute workloads.
>>>>> +
>>>>> +Furthermore GPUs usually have shared resources between the 3D
>>>>> rendering and
>>>>> +compute side, like compute units or command submission engines. If
>>>>> both a 3D
>>>>> +job with a DMA fence and a compute workload using recoverable page
>>>>> faults are
>>>>> +pending they could deadlock:
>>>>> +
>>>>> +- The 3D workload might need to wait for the compute job to finish
>>>>> and release
>>>>> +  hardware resources first.
>>>>> +
>>>>> +- The compute workload might be stuck in a page fault, because the
>>>>> memory
>>>>> +  allocation is waiting for the DMA fence of the 3D workload to
>>>>> complete.
>>>>> +
>>>>> +There are a few ways to prevent this problem:
>>>>> +
>>>>> +- Compute workloads can always be preempted, even when a page
>>>>> fault is pending
>>>>> +  and not yet repaired. Not all hardware supports this.
>>>>> +
>>>>> +- DMA fence workloads and workloads which need page fault handling
>>>>> have
>>>>> +  independent hardware resources to guarantee forward progress.
>>>>> This could be
>>>>> +  achieved through e.g. through dedicated engines and minimal
>>>>> compute unit
>>>>> +  reservations for DMA fence workloads.
>>>>> +
>>>>> +- The reservation approach could be further refined by only
>>>>> reserving the
>>>>> +  hardware resources for DMA fence workloads when they are
>>>>> in-flight. This must
>>>>> +  cover the time from when the DMA fence is visible to other
>>>>> threads up to
>>>>> +  moment when fence is completed through dma_fence_signal().
>>>>> +
>>>>> +- As a last resort, if the hardware provides no useful reservation
>>>>> mechanics,
>>>>> +  all workloads must be flushed from the GPU when switching
>>>>> between jobs
>>>>> +  requiring DMA fences or jobs requiring page fault handling: This
>>>>> means all DMA
>>>>> +  fences must complete before a compute job with page fault
>>>>> handling can be
>>>>> +  inserted into the scheduler queue. And vice versa, before a DMA
>>>>> fence can be
>>>>> +  made visible anywhere in the system, all compute workloads must
>>>>> be preempted
>>>>> +  to guarantee all pending GPU page faults are flushed.
>>>> I thought of another possible workaround:
>>>>
>>>>     * Partition the memory. Servicing of page faults will use a separate
>>>>       memory pool that can always be allocated from without waiting for
>>>>       fences. This includes memory for page tables and memory for
>>>>       migrating data to. You may steal memory from other processes that
>>>>       can page fault, so no fence waiting is necessary. Being able to
>>>>       steal memory at any time also means there are basically no
>>>>       out-of-memory situations you need to worry about. Even page tables
>>>>       (except the root page directory of each process) can be stolen in
>>>>       the worst case.
>>> I think 'overcommit' would be a nice way to describe this. But I'm not
>>> sure how easy this is to implement in practice. You would basically need
>>> to create your own memory manager for this.
>> Well you would need a completely separate pool for both device as well
>> as system memory.
>>
>> E.g. on boot we say we steal X GB system memory only for HMM.
> Why? The GPU driver doesn't need to allocate system memory for HMM.
> Migrations to system memory are handled by the kernel's handle_mm_fault
> and page allocator and swap logic.

And that one depends on dma_fence completion because you can easily need 
to wait for an MMU notifier callback.

As Maarten wrote when you want to go down this route you need a complete 
separate memory management parallel to the one of the kernel.

Regards,
Christian.

>   It doesn't depend on any fences, so
> it cannot deadlock with any GPU driver-managed memory. The GPU driver
> gets involved in the MMU notifier to invalidate device page tables. But
> that also doesn't need to wait for any fences.
>
> And if the kernel runs out of pageable memory, you're in trouble anyway.
> The OOM killer will step in, nothing new there.
>
> Regards,
>    Felix
>
>
>>> But from a design point of view, definitely a valid solution.
>> I think the restriction above makes it pretty much unusable.
>>
>>> But this looks good, those solutions are definitely the valid options we
>>> can choose from.
>> It's certainly worth noting, yes. And just to make sure that nobody
>> has the idea to reserve only device memory.
>>
>> Christian.
>>
>>> ~Maarten
>>>
> _______________________________________________
> Linaro-mm-sig mailing list
> Linaro-mm-sig@lists.linaro.org
> https://lists.linaro.org/mailman/listinfo/linaro-mm-sig

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel