All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Axel Rasmussen <axelrasmussen@google.com>
Cc: "Alexander Viro" <viro@zeniv.linux.org.uk>,
	"Alexey Dobriyan" <adobriyan@gmail.com>,
	"Andrea Arcangeli" <aarcange@redhat.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Anshuman Khandual" <anshuman.khandual@arm.com>,
	"Catalin Marinas" <catalin.marinas@arm.com>,
	"Chinwen Chang" <chinwen.chang@mediatek.com>,
	"Huang Ying" <ying.huang@intel.com>,
	"Ingo Molnar" <mingo@redhat.com>, "Jann Horn" <jannh@google.com>,
	"Jerome Glisse" <jglisse@redhat.com>,
	"Lokesh Gidra" <lokeshgidra@google.com>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	"Michael Ellerman" <mpe@ellerman.id.au>,
	"Michal Koutný" <mkoutny@suse.com>,
	"Michel Lespinasse" <walken@google.com>,
	"Mike Kravetz" <mike.kravetz@oracle.com>,
	"Mike Rapoport" <rppt@linux.vnet.ibm.com>,
	"Nicholas Piggin" <npiggin@gmail.com>,
	"Peter Xu" <peterx@redhat.com>, "Shaohua Li" <shli@fb.com>,
	"Shawn Anastasio" <shawn@anastas.io>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	"Steven Price" <steven.price@arm.com>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, "Adam Ruprecht" <ruprecht@google.com>,
	"Cannon Matthews" <cannonmatthews@google.com>,
	"David Rientjes" <rientjes@google.com>,
	"Oliver Upton" <oupton@google.com>
Subject: Re: [RFC PATCH 0/2] userfaultfd: handle minor faults, add UFFDIO_CONTINUE
Date: Mon, 11 Jan 2021 11:43:40 +0000	[thread overview]
Message-ID: <20210111114340.GF2965@work-vm> (raw)
In-Reply-To: <20210107190453.3051110-1-axelrasmussen@google.com>

* Axel Rasmussen (axelrasmussen@google.com) wrote:
> Overview
> ========
> 
> This series adds a new userfaultfd registration mode,
> UFFDIO_REGISTER_MODE_MINOR. This allows userspace to intercept "minor" faults.
> By "minor" fault, I mean the following situation:
> 
> Let there exist two mappings (i.e., VMAs) to the same page(s) (shared memory).
> One of the mappings is registered with userfaultfd (in minor mode), and the
> other is not. Via the non-UFFD mapping, the underlying pages have already been
> allocated & filled with some contents. The UFFD mapping has not yet been
> faulted in; when it is touched for the first time, this results in what I'm
> calling a "minor" fault. As a concrete example, when working with hugetlbfs, we
> have huge_pte_none(), but find_lock_page() finds an existing page.
>
> We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea is,
> userspace resolves the fault by either a) doing nothing if the contents are
> already correct, or b) updating the underlying contents using the second,
> non-UFFD mapping (via memcpy/memset or similar, or something fancier like RDMA,
> or etc...). In either case, userspace issues UFFDIO_CONTINUE to tell the kernel
> "I have ensured the page contents are correct, carry on setting up the mapping".
> 
> Use Case
> ========
> 
> Consider the use case of VM live migration (e.g. under QEMU/KVM):
> 
> 1. While a VM is still running, we copy the contents of its memory to a
>    target machine. The pages are populated on the target by writing to the
>    non-UFFD mapping, using the setup described above. The VM is still running
>    (and therefore its memory is likely changing), so this may be repeated
>    several times, until we decide the target is "up to date enough".
> 
> 2. We pause the VM on the source, and start executing on the target machine.
>    During this gap, the VM's user(s) will *see* a pause, so it is desirable to
>    minimize this window.
> 
> 3. Between the last time any page was copied from the source to the target, and
>    when the VM was paused, the contents of that page may have changed - and
>    therefore the copy we have on the target machine is out of date. Although we
>    can keep track of which pages are out of date, for VMs with large amounts of
>    memory, it is "slow" to transfer this information to the target machine. We
>    want to resume execution before such a transfer would complete.
> 
> 4. So, the guest begins executing on the target machine. The first time it
>    touches its memory (via the UFFD-registered mapping), userspace wants to
>    intercept this fault. Userspace checks whether or not the page is up to date,
>    and if not, copies the updated page from the source machine, via the non-UFFD
>    mapping. Finally, whether a copy was performed or not, userspace issues a
>    UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
>    are correct, carry on setting up the mapping".
> 
> We don't have to do all of the final updates on-demand. The userfaultfd manager
> can, in the background, also copy over updated pages once it receives the map of
> which pages are up-to-date or not.

Yes, this would make the handover during postcopy of large VMs a heck of
a lot faster; and probably simpler; the cleanup code that tidies up the
re-dirty pages is pretty messy.

Dave

> Interaction with Existing APIs
> ==============================
> 
> Because it's possible to combine registration modes (e.g. a single VMA can be
> userfaultfd-registered MINOR | MISSING), and because it's up to userspace how to
> resolve faults once they are received, I spent some time thinking through how
> the existing API interacts with the new feature.
> 
> UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
> allocate a new page. If UFFDIO_CONTINUE is used on a non-minor fault:
> 
> - For non-shared memory or shmem, -EINVAL is returned.
> - For hugetlb, -EFAULT is returned.
> 
> UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults. Without
> modifications, the existing codepath assumes a new page needs to be allocated.
> This is okay, since userspace must have a second non-UFFD-registered mapping
> anyway, thus there isn't much reason to want to use these in any case (just
> memcpy or memset or similar).
> 
> - If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
> - If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
>   in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
> - UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
>   -ENOENT in that case (regardless of the kind of fault).
> 
> Remaining Work
> ==============
> 
> This patchset doesn't include updates to userfaultfd's documentation or
> selftests. This will be added before I send a non-RFC version of this series
> (I want to find out if there are strong objections to the API surface before
> spending the time to document it.)
> 
> Currently the patchset only supports hugetlbfs. There is no reason it can't work
> with shmem, but I expect hugetlbfs to be much more commonly used since we're
> talking about backing guest memory for VMs. I plan to implement shmem support in
> a follow-up patch series.
> 
> Axel Rasmussen (2):
>   userfaultfd: add minor fault registration mode
>   userfaultfd: add UFFDIO_CONTINUE ioctl
> 
>  fs/proc/task_mmu.c               |   1 +
>  fs/userfaultfd.c                 | 143 ++++++++++++++++++++++++-------
>  include/linux/mm.h               |   1 +
>  include/linux/userfaultfd_k.h    |  14 ++-
>  include/trace/events/mmflags.h   |   1 +
>  include/uapi/linux/userfaultfd.h |  36 +++++++-
>  mm/hugetlb.c                     |  42 +++++++--
>  mm/userfaultfd.c                 |  86 ++++++++++++++-----
>  8 files changed, 261 insertions(+), 63 deletions(-)
> 
> --
> 2.29.2.729.g45daf8777d-goog
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


  parent reply	other threads:[~2021-01-11 11:45 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-07 19:04 [RFC PATCH 0/2] userfaultfd: handle minor faults, add UFFDIO_CONTINUE Axel Rasmussen
2021-01-07 19:04 ` Axel Rasmussen
2021-01-07 19:04 ` [RFC PATCH 1/2] userfaultfd: add minor fault registration mode Axel Rasmussen
2021-01-07 19:04   ` Axel Rasmussen
2021-01-11 11:58   ` Dr. David Alan Gilbert
2021-01-11 17:37     ` Axel Rasmussen
2021-01-11 18:09       ` Dr. David Alan Gilbert
2021-01-07 19:04 ` [RFC PATCH 2/2] userfaultfd: add UFFDIO_CONTINUE ioctl Axel Rasmussen
2021-01-07 19:04   ` Axel Rasmussen
2021-01-11 11:43 ` Dr. David Alan Gilbert [this message]
2021-01-11 22:42 ` [RFC PATCH 0/2] userfaultfd: handle minor faults, add UFFDIO_CONTINUE Mike Kravetz
2021-01-11 23:08   ` Peter Xu
2021-01-12  0:13     ` Mike Kravetz
2021-01-12  1:49       ` Peter Xu
2021-01-12 17:37         ` Axel Rasmussen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210111114340.GF2965@work-vm \
    --to=dgilbert@redhat.com \
    --cc=aarcange@redhat.com \
    --cc=adobriyan@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=anshuman.khandual@arm.com \
    --cc=axelrasmussen@google.com \
    --cc=cannonmatthews@google.com \
    --cc=catalin.marinas@arm.com \
    --cc=chinwen.chang@mediatek.com \
    --cc=jannh@google.com \
    --cc=jglisse@redhat.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lokeshgidra@google.com \
    --cc=mike.kravetz@oracle.com \
    --cc=mingo@redhat.com \
    --cc=mkoutny@suse.com \
    --cc=mpe@ellerman.id.au \
    --cc=npiggin@gmail.com \
    --cc=oupton@google.com \
    --cc=peterx@redhat.com \
    --cc=rientjes@google.com \
    --cc=rostedt@goodmis.org \
    --cc=rppt@linux.vnet.ibm.com \
    --cc=ruprecht@google.com \
    --cc=shawn@anastas.io \
    --cc=shli@fb.com \
    --cc=steven.price@arm.com \
    --cc=vbabka@suse.cz \
    --cc=viro@zeniv.linux.org.uk \
    --cc=walken@google.com \
    --cc=willy@infradead.org \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.