From: Daniel Vetter <email@example.com>
To: "Christian König" <firstname.lastname@example.org>
Cc: "Jason Gunthorpe" <email@example.com>,
"Thomas Hellström (Intel)" <firstname.lastname@example.org>,
"David Airlie" <email@example.com>,
"Linux MM" <firstname.lastname@example.org>,
"Andrew Morton" <email@example.com>,
"Linux Kernel Mailing List" <firstname.lastname@example.org>,
Subject: Re: [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages
Date: Thu, 25 Mar 2021 10:41:43 +0100
Message-ID: <CAKMK7uHPwT2zuTywb7O2gVbPcb0wsh=VCWdQmgQd_NaJwTTpFA@mail.gmail.com>
On Thu, Mar 25, 2021 at 8:50 AM Christian König wrote:
> Am 25.03.21 um 00:14 schrieb Jason Gunthorpe:
> > On Wed, Mar 24, 2021 at 09:07:53PM +0100, Thomas Hellström (Intel) wrote:
> >> On 3/24/21 7:31 PM, Christian König wrote:
> >>> Am 24.03.21 um 17:38 schrieb Jason Gunthorpe:
> >>>> On Wed, Mar 24, 2021 at 04:50:14PM +0100, Thomas Hellström (Intel)
> >>>> wrote:
> >>>>> On 3/24/21 2:48 PM, Jason Gunthorpe wrote:
> >>>>>> On Wed, Mar 24, 2021 at 02:35:38PM +0100, Thomas Hellström
> >>>>>> (Intel) wrote:
> >>>>>>>> In an ideal world the creation/destruction of page
> >>>>>>>> table levels would be dynamic at this point, like THP.
> >>>>>>> Hmm, but I'm not sure what problem we're trying to solve
> >>>>>>> by changing the
> >>>>>>> interface in this way?
> >>>>>> We are trying to make a sensible driver API to deal with huge pages.
> >>>>>>> Currently if the core vm requests a huge pud, we give it
> >>>>>>> one, and if we
> >>>>>>> can't or don't want to (because of dirty-tracking, for
> >>>>>>> example, which is
> >>>>>>> always done on 4K page-level) we just return
> >>>>>>> VM_FAULT_FALLBACK, and the
> >>>>>>> fault is retried at a lower level.
> >>>>>> Well, my thought would be to move the pte related stuff into
> >>>>>> vmf_insert_range instead of recursing back via VM_FAULT_FALLBACK.
> >>>>>> I don't know if the locking works out, but it feels cleaner that the
> >>>>>> driver tells the vmf how big a page it can stuff in, not the vm
> >>>>>> telling the driver to stuff in a certain size page which it might not
> >>>>>> want to do.
> >>>>>> Some devices want to work on an in-between page size like 64k so they
> >>>>>> can't form 2M pages but they can stuff 64k of 4K pages in a batch on
> >>>>>> every fault.
> >>>>> Hmm, yes, but we would in that case be limited anyway to insert ranges
> >>>>> smaller than or equal to the fault size to avoid extensive and possibly
> >>>>> unnecessary checks for contiguous memory.
> >>>> Why? The insert function is walking the page tables, it just updates
> >>>> things as they are. It learns the arrangement for free while doing the
> >>>> walk.
> >>>> The device has to always provide consistent data, if it overlaps into
> >>>> pages that are already populated that is fine so long as it isn't
> >>>> changing their addresses.
> >>>>> And then if we can't support the full fault size, we'd need to
> >>>>> either presume a size and alignment of the next level or search for
> >>>>> contiguous memory in both directions around the fault address,
> >>>>> perhaps unnecessarily as well.
> >>>> You don't really need to care about levels, the device should be
> >>>> faulting in the largest memory regions it can within its efficiency.
> >>>> If it works on 4M pages then it should be faulting 4M pages. The page
> >>>> size of the underlying CPU doesn't really matter much other than some
> >>>> tuning to impact how the device's allocator works.
> >> Yes, but then we'd be adding a lot of complexity into this function that is
> >> already provided by the current interface for DAX, for little or no gain, at
> >> least in the drm/ttm setting. Please think of the following situation: You
> >> get a fault, you do an extensive time-consuming scan of your VRAM buffer
> >> object into which the fault goes and determine you can fault 1GB. Now you
> >> hand it to vmf_insert_range() and because the user-space address is
> >> misaligned, or already partly populated because of a previous eviction, you
> >> can only fault single pages, and you end up faulting a full GB of single
> >> pages perhaps for a one-time small update.
> > Why would "you can only fault single pages" ever be true? If you have
> > 1GB of pages then the vmf_insert_range should allocate enough page
> > table entries to consume it, regardless of alignment.
> Completely agree with Jason. Filling in the CPU page tables is
> relatively cheap if you fill in a large continuous range.
> In other words filling in 1GiB of a linear range is *much* less overhead
> than filling in 1<<18 4KiB faults.
> I would say that this is always preferable even if the CPU only wants to
> update a single byte.
> > And why shouldn't DAX switch to this kind of interface anyhow? It is
> > basically exactly the same problem. The underlying filesystem block
> > size is *not* necessarily aligned to the CPU page table sizes and DAX
> > would benefit from better handling of this mismatch.
> >> On top of this, unless we want to do the walk trying increasingly smaller
> >> sizes of vmf_insert_xxx(), we'd have to use apply_to_page_range() and teach
> >> it about transhuge page table entries, because pagewalk.c can't be used (It
> >> can't populate page tables). That also means apply_to_page_range() needs to
> >> be complicated with page table locks since transhuge pages aren't stable and
> >> can be zapped and refaulted under us while we do the walk.
> > I didn't say it would be simple :) But we also need to stop hacking
> > around the sides of all this huge page stuff and come up with sensible
> > APIs that drivers can actually implement correctly. Exposing drivers
> > to specific kinds of page levels really feels like the wrong level of
> > abstraction.
> > Once we start doing this we should do it everywhere, the io_remap_pfn
> > stuff should be able to create huge special IO pages as well, for
> > instance.
> Oh, yes please!
> We easily have 16GiB of VRAM which is linear mapped into the kernel
> space for each GPU instance.
> Doing that with 1GiB mapping instead of 4KiB would be quite a win.
io_remap_pfn is for userspace mmaps. Kernel mappings should be as big
as possible already I think for everything.
> >> On top of this, the user-space address allocator needs to know how large gpu
> >> pages are aligned in buffer objects to have a reasonable chance of aligning
> >> with CPU huge page boundaries which is a requirement to be able to insert a
> >> huge CPU page table entry, so the driver would basically need the drm helper
> >> that can do this alignment anyway.
> > Don't you have this problem anyhow?
> > Jason
Daniel Vetter
Software Engineer, Intel Corporation
Thread overview: 63+ messages
2021-03-21 18:45 [RFC PATCH 0/2] mm,drm/ttm: Always block GUP to TTM pages Thomas Hellström (Intel)
2021-03-21 18:45 ` [RFC PATCH 1/2] mm,drm/ttm: Block fast GUP to TTM huge pages Thomas Hellström (Intel)
2021-03-23 11:34 ` Daniel Vetter
2021-03-23 16:34 ` Thomas Hellström (Intel)
2021-03-23 16:37 ` Jason Gunthorpe
2021-03-23 16:59 ` Christoph Hellwig
2021-03-23 17:06 ` Thomas Hellström (Intel)
2021-03-24 9:56 ` Daniel Vetter
2021-03-24 12:24 ` Jason Gunthorpe
2021-03-24 12:35 ` Thomas Hellström (Intel)
2021-03-24 12:41 ` Jason Gunthorpe
2021-03-24 13:35 ` Thomas Hellström (Intel)
2021-03-24 13:48 ` Jason Gunthorpe
2021-03-24 15:50 ` Thomas Hellström (Intel)
2021-03-24 16:38 ` Jason Gunthorpe
2021-03-24 18:31 ` Christian König
2021-03-24 20:07 ` Thomas Hellström (Intel)
2021-03-24 23:14 ` Jason Gunthorpe
2021-03-25 7:48 ` Thomas Hellström (Intel)
2021-03-25 8:27 ` Christian König
2021-03-25 9:51 ` Thomas Hellström (Intel)
2021-03-25 11:30 ` Jason Gunthorpe
2021-03-25 11:53 ` Thomas Hellström (Intel)
2021-03-25 12:01 ` Jason Gunthorpe
2021-03-25 12:09 ` Christian König
2021-03-25 12:36 ` Thomas Hellström (Intel)
2021-03-25 13:02 ` Christian König
2021-03-25 13:31 ` Thomas Hellström (Intel)
2021-03-25 12:42 ` Jason Gunthorpe
2021-03-25 13:05 ` Christian König
2021-03-25 13:17 ` Jason Gunthorpe
2021-03-25 13:26 ` Christian König
2021-03-25 13:33 ` Jason Gunthorpe
2021-03-25 13:54 ` Christian König
2021-03-25 13:56 ` Jason Gunthorpe
2021-03-25 7:49 ` Christian König
2021-03-25 9:41 ` Daniel Vetter [this message]
2021-03-23 13:52 ` Jason Gunthorpe
2021-03-23 15:05 ` Thomas Hellström (Intel)
2021-03-23 19:52 ` Williams, Dan J
2021-03-23 20:42 ` Thomas Hellström (Intel)
2021-03-24 9:58 ` Daniel Vetter
2021-03-24 10:05 ` Thomas Hellström (Intel)
[not found] ` <email@example.com>
2021-03-24 20:22 ` Thomas Hellström (Intel)
2021-03-24 20:25 ` Dave Hansen
2021-03-25 17:51 ` Thomas Hellström (Intel)
2021-03-25 17:55 ` Jason Gunthorpe
2021-03-25 18:13 ` Thomas Hellström (Intel)
2021-03-25 18:24 ` Jason Gunthorpe
2021-03-25 18:42 ` Thomas Hellström (Intel)
2021-03-26 9:08 ` Thomas Hellström (Intel)
2021-03-26 11:46 ` Jason Gunthorpe
2021-03-26 12:33 ` Thomas Hellström (Intel)
2021-03-21 18:45 ` [RFC PATCH 2/2] mm,drm/ttm: Use VM_PFNMAP for TTM vmas Thomas Hellström (Intel)
2021-03-22 7:47 ` Christian König
2021-03-22 8:13 ` Thomas Hellström (Intel)
2021-03-23 11:57 ` Christian König
2021-03-23 11:47 ` Daniel Vetter
2021-03-23 14:04 ` Jason Gunthorpe
2021-03-23 15:51 ` Thomas Hellström (Intel)
2021-03-23 14:00 ` Jason Gunthorpe
2021-03-23 15:46 ` Thomas Hellström (Intel)
2021-03-23 16:06 ` Jason Gunthorpe