From: David Matlack <dmatlack@google.com>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm list <kvm@vger.kernel.org>,
	Sean Christopherson <seanjc@google.com>,
	 James Houghton <jthoughton@google.com>,
	Oliver Upton <oupton@google.com>, Peter Xu <peterx@redhat.com>,
	 Axel Rasmussen <axelrasmussen@google.com>
Subject: RFC: A KVM-specific alternative to UserfaultFD
Date: Mon, 6 Nov 2023 10:25:13 -0800	[thread overview]
Message-ID: <CALzav=d23P5uE=oYqMpjFohvn0CASMJxXB_XEOEi-jtqWcFTDA@mail.gmail.com> (raw)

Hi Paolo,

I'd like your feedback on whether you would merge a KVM-specific
alternative to UserfaultFD.

Within Google we have a feature called "KVM Demand Paging" that we
have been using for post-copy live migration since 2014 and memory
poisoning emulation more recently. The high-level design is:

  (a) A bitmap that tracks which GFNs are present, along with a UAPI
to enable/disable the present bitmap.
  (b) UAPIs for marking GFNs present and non-present.
  (c) KVM_RUN support for returning to userspace on guest page faults
to non-present GFNs.
  (d) A notification mechanism and wait queue to coordinate KVM
accesses to non-present GFNs.
  (e) UAPI or KVM policy for collapsing SPTEs into huge pages as guest
memory becomes present.
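
To make that concrete, here is a very rough sketch of what the UAPI
could look like. All of the names, fields, and ioctl numbers below are
hypothetical and purely for illustration; none of this exists today:

  /* (a) Enable/disable tracking of which GFNs are present. */
  #define KVM_ENABLE_DEMAND_PAGING    _IO(KVMIO, 0xd0)
  #define KVM_DISABLE_DEMAND_PAGING   _IO(KVMIO, 0xd1)

  /* (b) Mark a range of GFNs present or non-present. */
  struct kvm_gfn_present_range {
          __u64 start_gfn;
          __u64 nr_pages;
          __u64 flags;    /* e.g. KVM_GFN_PRESENT vs. KVM_GFN_NON_PRESENT */
  };
  #define KVM_SET_GFN_PRESENCE        _IOW(KVMIO, 0xd2, struct kvm_gfn_present_range)

  /* (d) Create a file descriptor userspace can poll/read to learn
   * about KVM-internal accesses to non-present GFNs. */
  #define KVM_CREATE_DEMAND_PAGING_FD _IO(KVMIO, 0xd3)

(c) and (e) would surface through KVM_RUN exits and the fault path
rather than through new ioctls.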

The actual implementation within Google has a lot of warts that I
won't get into... but I think we could have a pretty clean upstream
solution.

In fact, a lot of the infrastructure needed to support this design is
already in-flight upstream. e.g. (a) and (b) could be built on top of
the new memory attributes (although I have concerns about the
performance of using an xarray vs. a bitmap), and (c) can be built on
top of memory-fault exiting. The most complex piece of new code would be
the notification mechanism for (d). Within Google we've been using a
netlink socket, but I think we should use a custom file descriptor
instead.
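
For (c), assuming the in-flight memory-fault exit lands in roughly its
proposed shape, the vCPU side in the VMM could be as simple as the
sketch below. fetch_page_from_source() and kvm_mark_gfn_present() are
hypothetical VMM helpers wrapping whatever UAPI (b) ends up being:

  /* Sketch of a vCPU thread reacting to a fault on a non-present GFN. */
  static void vcpu_run_loop(int vcpu_fd, struct kvm_run *run)
  {
          for (;;) {
                  if (ioctl(vcpu_fd, KVM_RUN, 0) < 0 && errno != EINTR)
                          break;

                  if (run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
                          __u64 gfn = run->memory_fault.gpa >> 12; /* 4KiB pages */

                          fetch_page_from_source(gfn);  /* copy contents from the source */
                          kvm_mark_gfn_present(gfn);    /* flip the bit, wake any waiters */
                          continue;                     /* re-enter the guest */
                  }
                  /* ... handle all the other exit reasons ... */
          }
  }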

If we do it right, almost no architecture-specific support is needed.
Just a small bit in the page fault path (for (c) and to account for
the present bitmap when determining what (huge)page size to map).
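
For example, the common fault path could clamp the mapping level so a
huge mapping never spans a non-present GFN; roughly (an x86-flavored
sketch, where present_bitmap_all_set() is a made-up helper over (a)'s
bitmap):

  /* Sketch: only map a huge page if every GFN it covers is present. */
  static int cap_mapping_level(struct kvm *kvm, gfn_t gfn, int max_level)
  {
          while (max_level > PG_LEVEL_4K) {
                  unsigned long nr = KVM_PAGES_PER_HPAGE(max_level);
                  gfn_t base = gfn & ~(nr - 1);

                  if (present_bitmap_all_set(kvm, base, nr))
                          break;
                  max_level--;
          }
          return max_level;
  }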

The most painful part of carrying KVM Demand Paging out-of-tree has
been maintaining the hooks for (d). But this has been mostly
self-inflicted. We started out by manually annotating all of the code
where KVM reads/writes guest memory. But there are more core routines
that all guest-memory accesses go through (e.g. __gfn_to_hva_many())
where we could put a single hook, and then KVM just has to make sure
to invalidate any gfn-to-hva/pfn caches and SPTEs when a page becomes
non-present (which is rare and typically only happens before a vCPU
starts running). And hooking KVM accesses to guest memory isn't
exactly new: KVM already manually tracks all writes to keep the dirty
log up to date.
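
The shape of that single hook, and the (d) wait/notify mechanism behind
it, could be something like the sketch below (all of the helper names
and struct kvm fields are hypothetical):

  /* Sketch of the check a core guest-memory routine (e.g.
   * __gfn_to_hva_many()) could make before touching a GFN.  The slow
   * path posts the GFN to the notification fd and sleeps until
   * userspace marks it present. */
  static int kvm_demand_paging_access(struct kvm *kvm, gfn_t gfn)
  {
          if (!kvm->demand_paging_enabled || kvm_gfn_is_present(kvm, gfn))
                  return 0;       /* fast path: a single bit test */

          kvm_demand_paging_notify(kvm, gfn);     /* wake userspace via the fd */
          return wait_event_interruptible(kvm->demand_paging_wq,
                                          kvm_gfn_is_present(kvm, gfn));
  }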

So why merge a KVM-specific alternative to UserfaultFD?

Taking a step back, let's look at what UserfaultFD is actually
providing for KVM VMs:

  1. Coordination of userspace accesses to guest memory.
  2. Coordination of KVM+guest accesses to guest memory.

(1.) technically does not need kernel support. It's possible to solve
this problem in userspace, and likely more efficient to do so, because
userspace has more flexibility and can avoid bouncing through the
kernel page fault handler. And it's not
unreasonable to expect VMMs to support this. VMMs already need to
manually intercept userspace _writes_ to guest memory to implement
dirty tracking efficiently. It's a small step beyond that to intercept
both reads and writes for post-copy. And VMMs are increasingly
multi-process. UserfaultFD provides coordination within a process, but
VMMs already need to deal with coordinating across processes anyway.
i.e. UserfaultFD is only solving part of the problem for (1.).
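
As an illustration of (1.) being solved in userspace, a VMM could route
all of its own guest-memory accesses through an accessor like the one
below (the helper names and the flat guest_memory_base mapping are
hypothetical):

  /* Sketch: a VMM-internal accessor that demand-fetches non-present
   * pages before the VMM itself reads or writes guest memory. */
  static char *guest_memory_base;   /* hypothetical flat mapping of guest RAM */

  static void *vmm_guest_memory_ptr(uint64_t gpa)
  {
          uint64_t gfn = gpa >> 12;   /* assuming 4KiB pages */

          if (!vmm_gfn_present(gfn))
                  vmm_fetch_page(gfn);    /* fetch from the source, mark present */

          return guest_memory_base + gpa;
  }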

The KVM-specific approach is basically to provide kernel support for
(2) and let userspace solve (1) however it likes.

But if UserfaultFD solves (1) and (2), why introduce a KVM feature
that solves only (2)?

Well, UserfaultFD has some very real downsides:

  * Lack of sufficient HugeTLB Support: The most recent and glaring
    problem is upstream's NACK of HugeTLB High Granularity Mapping [1].
    Without HGM, UserfaultFD can only handle HugeTLB faults at huge
    page granularity. i.e. If a VM is backed with 1GiB HugeTLB, then
    UserfaultFD can only handle 1GiB faults. Demand-fetching 1GiB of
    memory from a remote host during the post-copy phase of live
    migration is untenable. Even 2MiB fetches are painful with most
    current NICs. In effect, there is no line-of-sight on an upstream
    solution for post-copy live migration for VMs backed with HugeTLB.

  * Memory Overhead: UserfaultFD requires an extra 8 bytes per page of
    guest memory for the userspace page table entries (with 4KiB pages,
    that's ~2GiB of overhead for a 1TiB guest, vs. ~32MiB for a
    1-bit-per-page bitmap).

  * CPU Overhead: UserfaultFD has to manipulate userspace page tables to
    split mappings down to PAGE_SIZE, handle PAGE_SIZE'd faults, and,
    later, collapse mappings back into huge pages. These manipulations take
    locks like mmap_lock, page locks, and page table locks.

  * Complexity: UserfaultFD-based demand paging depends on functionality
    across multiple subsystems in the kernel, including Core MM, KVM,
    and each of the memory filesystems (tmpfs, HugeTLB, and
    eventually guest_memfd). Debugging problems requires
    knowledge across many domains that many engineers do not have. And
    solving problems requires getting buy-in from multiple subsystem
    maintainers that may not all be aligned (see: HGM).

All of these are addressed with a KVM-specific solution. A
KVM-specific solution can have:

  * Transparent support for any backing memory subsystem (tmpfs,
    HugeTLB, and even guest_memfd).
  * Only 1 bit of overhead per page of guest memory.
  * No need to modify host page tables.
  * All code contained within KVM.
  * Significantly fewer LOC than UserfaultFD.

Ok, that's the pitch. What are your thoughts?

[1] https://lore.kernel.org/linux-mm/20230218002819.1486479-1-jthoughton@google.com/

Thread overview: 34+ messages
2023-11-06 18:25 David Matlack [this message]
2023-11-06 20:23 ` RFC: A KVM-specific alternative to UserfaultFD Peter Xu
2023-11-06 22:24   ` Axel Rasmussen
2023-11-06 23:03     ` Peter Xu
2023-11-06 23:22       ` David Matlack
2023-11-07 14:21         ` Peter Xu
2023-11-07 16:11           ` James Houghton
2023-11-07 17:24             ` Peter Xu
2023-11-07 19:08               ` James Houghton
2023-11-07 16:25   ` Paolo Bonzini
2023-11-07 20:04     ` David Matlack
2023-11-07 21:10       ` Oliver Upton
2023-11-07 21:34         ` David Matlack
2023-11-08  1:27           ` Oliver Upton
2023-11-08 16:56             ` David Matlack
2023-11-08 17:34               ` Peter Xu
2023-11-08 20:10                 ` Sean Christopherson
2023-11-08 20:36                   ` Peter Xu
2023-11-08 20:47                   ` Axel Rasmussen
2023-11-08 21:05                     ` David Matlack
2023-11-08 20:49                 ` David Matlack
2023-11-08 20:33               ` Paolo Bonzini
2023-11-08 20:43                 ` David Matlack
2023-11-07 22:29     ` Peter Xu
2023-11-09 16:41       ` David Matlack
2023-11-09 17:58         ` Sean Christopherson
2023-11-09 18:33           ` David Matlack
2023-11-09 22:44             ` David Matlack
2023-11-09 23:54               ` Sean Christopherson
2023-11-09 19:20           ` Peter Xu
2023-11-11 16:23             ` David Matlack
2023-11-11 17:30               ` Peter Xu
2023-11-13 16:43                 ` David Matlack
2023-11-20 18:32                   ` James Houghton
