From: "Christian König" <ckoenig.leichtzumerken@gmail.com>
To: "Felix Kuehling" <felix.kuehling@amd.com>,
"Christian König" <christian.koenig@amd.com>,
"Maarten Lankhorst" <maarten.lankhorst@linux.intel.com>,
"Daniel Vetter" <daniel.vetter@ffwll.ch>,
"DRI Development" <dri-devel@lists.freedesktop.org>
Cc: linaro-mm-sig@lists.linaro.org,
"Daniel Vetter" <daniel.vetter@intel.com>,
"Jerome Glisse" <jglisse@redhat.com>,
"Thomas Hellström" <thomas.hellstrom@intel.com>,
linux-media@vger.kernel.org
Subject: Re: [Linaro-mm-sig] [PATCH] RFC: dma-fence: Document recoverable page fault implications
Date: Thu, 28 Jan 2021 08:39:10 +0100
Message-ID: <c9c8d386-87a1-6678-b5c6-854de210d8d3@gmail.com>
In-Reply-To: <18e7efbd-3d10-5ad1-49c9-7e26f0a27ef2@amd.com>
On 27.01.21 at 23:00, Felix Kuehling wrote:
> On 2021-01-27 at 7:16 a.m., Christian König wrote:
>> On 27.01.21 at 13:11, Maarten Lankhorst wrote:
>>> On 27-01-2021 at 01:22, Felix Kuehling wrote:
>>>> On 2021-01-21 at 2:40 p.m., Daniel Vetter wrote:
>>>>> Recently there was a fairly long thread about recoverable hardware page
>>>>> faults, how they can deadlock, and what to do about that.
>>>>>
>>>>> While the discussion is still fresh, I figured it's a good time to try
>>>>> to document the conclusions a bit.
>>>>>
>>>>> References:
>>>>> https://lore.kernel.org/dri-devel/20210107030127.20393-1-Felix.Kuehling@amd.com/
>>>>> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>>>>> Cc: Thomas Hellström <thomas.hellstrom@intel.com>
>>>>> Cc: "Christian König" <christian.koenig@amd.com>
>>>>> Cc: Jerome Glisse <jglisse@redhat.com>
>>>>> Cc: Felix Kuehling <felix.kuehling@amd.com>
>>>>> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
>>>>> Cc: Sumit Semwal <sumit.semwal@linaro.org>
>>>>> Cc: linux-media@vger.kernel.org
>>>>> Cc: linaro-mm-sig@lists.linaro.org
>>>>> --
>>>>> I'll be away next week, but figured I'll type this up quickly for some
>>>>> comments and to check whether I got this all roughly right.
>>>>>
>>>>> Critique very much wanted on this, so that we can make sure hw which
>>>>> can't preempt (with pagefaults pending) like gfx10 has a clear path to
>>>>> support page faults in upstream. So pointing out anything I missed or
>>>>> got wrong would be good.
>>>>> -Daniel
>>>>> ---
>>>>> Documentation/driver-api/dma-buf.rst | 66
>>>>> ++++++++++++++++++++++++++++
>>>>> 1 file changed, 66 insertions(+)
>>>>>
>>>>> diff --git a/Documentation/driver-api/dma-buf.rst
>>>>> b/Documentation/driver-api/dma-buf.rst
>>>>> index a2133d69872c..e924c1e4f7a3 100644
>>>>> --- a/Documentation/driver-api/dma-buf.rst
>>>>> +++ b/Documentation/driver-api/dma-buf.rst
>>>>> @@ -257,3 +257,69 @@ fences in the kernel. This means:
>>>>> userspace is allowed to use userspace fencing or long running
>>>>> compute
>>>>> workloads. This also means no implicit fencing for shared
>>>>> buffers in these
>>>>> cases.
>>>>> +
>>>>> +Recoverable Hardware Page Faults Implications
>>>>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>> +
>>>>> +Modern hardware supports recoverable page faults, which has a lot of
>>>>> +implications for DMA fences.
>>>>> +
>>>>> +First, a pending page fault obviously holds up the work that's
>>>>> running on the
>>>>> +accelerator and a memory allocation is usually required to resolve
>>>>> the fault.
>>>>> +But memory allocations are not allowed to gate completion of DMA
>>>>> fences, which
>>>>> +means any workload using recoverable page faults cannot use DMA
>>>>> fences for
>>>>> +synchronization. Synchronization fences controlled by userspace
>>>>> must be used
>>>>> +instead.
>>>>> +
>>>>> +On GPUs this poses a problem, because current desktop compositor
>>>>> protocols on
>>>>> +Linux rely on DMA fences, which means without an entirely new
>>>>> userspace stack
>>>>> +built on top of userspace fences, they cannot benefit from
>>>>> recoverable page
>>>>> +faults. The exception is when page faults are only used as
>>>>> migration hints and
>>>>> +never to fill a memory request on demand. For now this means
>>>>> recoverable page
>>>>> +faults on GPUs are limited to pure compute workloads.
>>>>> +
>>>>> +Furthermore GPUs usually have shared resources between the 3D
>>>>> rendering and
>>>>> +compute side, like compute units or command submission engines. If
>>>>> both a 3D
>>>>> +job with a DMA fence and a compute workload using recoverable page
>>>>> faults are
>>>>> +pending they could deadlock:
>>>>> +
>>>>> +- The 3D workload might need to wait for the compute job to finish
>>>>> and release
>>>>> + hardware resources first.
>>>>> +
>>>>> +- The compute workload might be stuck in a page fault, because the
>>>>> memory
>>>>> + allocation is waiting for the DMA fence of the 3D workload to
>>>>> complete.
>>>>> +
>>>>> +There are a few ways to prevent this problem:
>>>>> +
>>>>> +- Compute workloads can always be preempted, even when a page
>>>>> fault is pending
>>>>> + and not yet repaired. Not all hardware supports this.
>>>>> +
>>>>> +- DMA fence workloads and workloads which need page fault handling
>>>>> have
>>>>> + independent hardware resources to guarantee forward progress.
>>>>> This could be
>>>>> + achieved e.g. through dedicated engines and minimal
>>>>> compute unit
>>>>> + reservations for DMA fence workloads.
>>>>> +
>>>>> +- The reservation approach could be further refined by only
>>>>> reserving the
>>>>> + hardware resources for DMA fence workloads when they are
>>>>> in-flight. This must
>>>>> + cover the time from when the DMA fence is visible to other
>>>>> threads up to
>>>>> + the moment when the fence is completed through dma_fence_signal().
>>>>> +
>>>>> +- As a last resort, if the hardware provides no useful reservation
>>>>> mechanics,
>>>>> + all workloads must be flushed from the GPU when switching
>>>>> between jobs
>>>>> + requiring DMA fences or jobs requiring page fault handling: This
>>>>> means all DMA
>>>>> + fences must complete before a compute job with page fault
>>>>> handling can be
>>>>> + inserted into the scheduler queue. And vice versa, before a DMA
>>>>> fence can be
>>>>> + made visible anywhere in the system, all compute workloads must
>>>>> be preempted
>>>>> + to guarantee all pending GPU page faults are flushed.
>>>> I thought of another possible workaround:
>>>>
>>>> * Partition the memory. Servicing of page faults will use a separate
>>>> memory pool that can always be allocated from without waiting for
>>>> fences. This includes memory for page tables and memory for
>>>> migrating data to. You may steal memory from other processes that
>>>> can page fault, so no fence waiting is necessary. Being able to
>>>> steal memory at any time also means there are basically no
>>>> out-of-memory situations you need to worry about. Even page tables
>>>> (except the root page directory of each process) can be stolen in
>>>> the worst case.
>>> I think 'overcommit' would be a nice way to describe this. But I'm not
>>> sure how easy this is to implement in practice. You would basically need
>>> to create your own memory manager for this.
>> Well you would need a completely separate pool for both device as well
>> as system memory.
>>
>> E.g. on boot we say we steal X GB system memory only for HMM.
> Why? The GPU driver doesn't need to allocate system memory for HMM.
> Migrations to system memory are handled by the kernel's handle_mm_fault
> and page allocator and swap logic.

And that one depends on dma_fence completion, because you can easily end
up waiting for an MMU notifier callback.

As Maarten wrote, if you want to go down this route you need a completely
separate memory manager, parallel to the kernel's.

Regards,
Christian.

> It doesn't depend on any fences, so
> it cannot deadlock with any GPU driver-managed memory. The GPU driver
> gets involved in the MMU notifier to invalidate device page tables. But
> that also doesn't need to wait for any fences.
>
> And if the kernel runs out of pageable memory, you're in trouble anyway.
> The OOM killer will step in, nothing new there.
>
> Regards,
> Felix
>
>
>>> But from a design point of view, definitely a valid solution.
>> I think the restriction above makes it pretty much unusable.
>>
>>> But this looks good, those solutions are definitely the valid options we
>>> can choose from.
>> It's certainly worth noting, yes. If only to make sure that nobody
>> gets the idea of reserving only device memory.
>>
>> Christian.
>>
>>> ~Maarten
>>>