From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 4 Feb 2021 10:34:23 -0800
Message-Id: <20210204183433.1431202-1-axelrasmussen@google.com>
Mime-Version: 1.0
X-Mailer: git-send-email 2.30.0.365.g02bc693789-goog
Subject: [PATCH v4 00/10] userfaultfd: add minor fault handling
From: Axel Rasmussen
To: Alexander Viro, Alexey Dobriyan, Andrea Arcangeli, Andrew Morton,
 Anshuman Khandual, Catalin Marinas, Chinwen Chang, Huang Ying, Ingo Molnar,
 Jann Horn, Jerome Glisse, Lokesh Gidra, "Matthew Wilcox (Oracle)",
 Michael Ellerman, Michal Koutný, Michel Lespinasse, Mike Kravetz,
 Mike Rapoport, Nicholas Piggin, Peter Xu, Shaohua Li, Shawn Anastasio,
 Steven Rostedt, Steven Price, Vlastimil Babka
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org, Adam Ruprecht, Axel Rasmussen, Cannon Matthews,
 "Dr . David Alan Gilbert", David Rientjes, Mina Almasry, Oliver Upton
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org

Changelog
=========

v3->v4:
- Reordered if() branches in hugetlb_mcopy_atomic_pte, so the conditions are
  simpler and easier to read.
- Reverted most of the mfill_atomic_pte change (the anon / shmem path). Just
  return -EINVAL for CONTINUE, and set zeropage = (mode ==
  MCOPY_ATOMIC_ZEROPAGE), so we can keep the delta small.
- Split out adding #ifdef CONFIG_USERFAULTFD to a separate patch (instead of
  lumping it together with adding UFFDIO_CONTINUE). Also, extended it to make
  the same change for shmem as well, as suggested by Hugh Dickins.
- Fixed signature of hugetlb_mcopy_atomic_pte for !CONFIG_HUGETLB_PAGE
  (signature must be the same in either case).
- Rebased onto a newer version of Peter's patches to disable huge PMD sharing.

v2->v3:
- Added #ifdef CONFIG_USERFAULTFD around hugetlb helper functions, to fix build
  errors when building without CONFIG_USERFAULTFD set.

v1->v2:
- Fixed a bug in the hugetlb_mcopy_atomic_pte retry case. We now plumb in the
  enum mcopy_atomic_mode, so we can differentiate between the three cases this
  function needs to handle:
  1) We're doing a COPY op, and need to allocate a page, add to cache, etc.
  2) We're doing a COPY op, but allocation in this function failed previously;
     we're in the retry path. The page was allocated, but not e.g. added to
     page cache, so that still needs to be done.
  3) We're doing a CONTINUE op, we need to look up an existing page instead of
     allocating a new one.
- Rebased onto a newer version of Peter's patches to disable huge PMD sharing,
  which fixes syzbot complaints on some non-x86 architectures.
- Moved __VM_UFFD_FLAGS into userfaultfd_k.h, so inline helpers can use it.
- Renamed UFFD_FEATURE_MINOR_FAULT_HUGETLBFS to UFFD_FEATURE_MINOR_HUGETLBFS,
  for consistency with other existing feature flags.
- Moved the userfaultfd_minor hook in hugetlb.c into the else block, so we
  don't have to explicitly check for !new_page.

RFC->v1:
- Rebased onto Peter Xu's patches for disabling huge PMD sharing for certain
  userfaultfd-registered areas.
- Added commits which update documentation, and add a self test which
  exercises the new feature.
- Fixed reporting CONTINUE as a supported ioctl even for non-MINOR ranges.

Overview
========

This series adds a new userfaultfd registration mode,
UFFDIO_REGISTER_MODE_MINOR. This allows userspace to intercept "minor" faults.
By "minor" fault, I mean the following situation:

Let there exist two mappings (i.e., VMAs) to the same page(s) (shared memory).
One of the mappings is registered with userfaultfd (in minor mode), and the
other is not. Via the non-UFFD mapping, the underlying pages have already been
allocated & filled with some contents. The UFFD mapping has not yet been
faulted in; when it is touched for the first time, this results in what I'm
calling a "minor" fault. As a concrete example, when working with hugetlbfs, we
have huge_pte_none(), but find_lock_page() finds an existing page.

We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea is,
userspace resolves the fault by either a) doing nothing if the contents are
already correct, or b) updating the underlying contents using the second,
non-UFFD mapping (via memcpy/memset or similar, or something fancier like
RDMA, etc.). In either case, userspace issues UFFDIO_CONTINUE to tell the
kernel "I have ensured the page contents are correct, carry on setting up the
mapping".

Use Case
========

Consider the use case of VM live migration (e.g. under QEMU/KVM):

1. While a VM is still running, we copy the contents of its memory to a
   target machine. The pages are populated on the target by writing to the
   non-UFFD mapping, using the setup described above.
   The VM is still running (and therefore its memory is likely changing), so
   this may be repeated several times, until we decide the target is
   "up to date enough".

2. We pause the VM on the source, and start executing on the target machine.
   During this gap, the VM's user(s) will *see* a pause, so it is desirable to
   minimize this window.

3. Between the last time any page was copied from the source to the target,
   and when the VM was paused, the contents of that page may have changed -
   and therefore the copy we have on the target machine is out of date.
   Although we can keep track of which pages are out of date, for VMs with
   large amounts of memory, it is "slow" to transfer this information to the
   target machine. We want to resume execution before such a transfer would
   complete.

4. So, the guest begins executing on the target machine. The first time it
   touches its memory (via the UFFD-registered mapping), userspace wants to
   intercept this fault. Userspace checks whether or not the page is up to
   date, and if not, copies the updated page from the source machine, via the
   non-UFFD mapping. Finally, whether a copy was performed or not, userspace
   issues a UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page
   contents are correct, carry on setting up the mapping".

We don't have to do all of the final updates on-demand. The userfaultfd manager
can, in the background, also copy over updated pages once it receives the map
of which pages are up-to-date or not.

Interaction with Existing APIs
==============================

Because it's possible to combine registration modes (e.g. a single VMA can be
userfaultfd-registered MINOR | MISSING), and because it's up to userspace how
to resolve faults once they are received, I spent some time thinking through
how the existing API interacts with the new feature.

UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
allocate a new page.
If UFFDIO_CONTINUE is used on a non-minor fault:

- For non-shared memory or shmem, -EINVAL is returned.
- For hugetlb, -EFAULT is returned.

UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults. Without
modifications, the existing codepath assumes a new page needs to be allocated.
This is okay, since userspace must have a second non-UFFD-registered mapping
anyway, thus there isn't much reason to want to use these in any case (just
memcpy or memset or similar).

- If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
- If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
  in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
- UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
  -ENOENT in that case (regardless of the kind of fault).

Dependencies
============

I've included 4 commits from Peter Xu's larger series
(https://lore.kernel.org/patchwork/cover/1366017/) in this series. My changes
depend on his work, to disable huge PMD sharing for MINOR registered
userfaultfd areas. I included the 4 commits directly because a) it lets this
series just be applied and work as-is, and b) they are fairly standalone, and
could potentially be merged even without the rest of the larger series Peter
submitted. Thanks Peter!

Also, although it doesn't affect minor fault handling, I did notice that the
userfaultfd self test sometimes experienced memory corruption
(https://lore.kernel.org/patchwork/cover/1356755/). For anyone testing this
series, it may be useful to apply that series first to fix the selftest
flakiness. That series doesn't have to be merged into mainline / maintainer
branches before mine, though.

Future Work
===========

Currently the patchset only supports hugetlbfs. There is no reason it can't
work with shmem, but I expect hugetlbfs to be much more commonly used since
we're talking about backing guest memory for VMs. I plan to implement shmem
support in a follow-up patch series.
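
For reference, the error codes listed under "Interaction with Existing APIs"
can be collected into one small lookup. This is only a sketch of that table as
described in this cover letter, not kernel code: the enum names and the
function ex_mismatch_errno() are hypothetical and exist nowhere in the series,
and a return value of 0 only means "this pairing is not listed as an error
above", not that the ioctl is guaranteed to succeed.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical names, for illustration only; not part of the kernel API. */
enum ex_op    { EX_OP_COPY, EX_OP_ZEROPAGE, EX_OP_CONTINUE };
enum ex_fault { EX_FAULT_MISSING, EX_FAULT_MINOR };
enum ex_mem   { EX_MEM_ANON, EX_MEM_SHMEM, EX_MEM_HUGETLB };

/*
 * Return the negative errno the cover letter lists for resolving a fault of
 * kind 'fault' on memory of kind 'mem' with ioctl 'op', or 0 if the letter
 * does not list that pairing as an error.
 */
int ex_mismatch_errno(enum ex_op op, enum ex_fault fault, enum ex_mem mem)
{
	/* CONTINUE never allocates a page, so it cannot fix a non-minor fault. */
	if (op == EX_OP_CONTINUE && fault != EX_FAULT_MINOR)
		return mem == EX_MEM_HUGETLB ? -EFAULT : -EINVAL;

	/* ZEROPAGE is unsupported for hugetlb regardless of the fault kind. */
	if (op == EX_OP_ZEROPAGE && mem == EX_MEM_HUGETLB)
		return -EINVAL;

	/* COPY and ZEROPAGE assume a new page is needed; one already exists. */
	if ((op == EX_OP_COPY || op == EX_OP_ZEROPAGE) && fault == EX_FAULT_MINOR)
		return -EEXIST;

	return 0;
}

/* Usage example: spot-check the sketch against the bullets above. */
void ex_check_table(void)
{
	assert(ex_mismatch_errno(EX_OP_CONTINUE, EX_FAULT_MISSING, EX_MEM_ANON) == -EINVAL);
	assert(ex_mismatch_errno(EX_OP_CONTINUE, EX_FAULT_MISSING, EX_MEM_HUGETLB) == -EFAULT);
	assert(ex_mismatch_errno(EX_OP_COPY, EX_FAULT_MINOR, EX_MEM_HUGETLB) == -EEXIST);
	assert(ex_mismatch_errno(EX_OP_ZEROPAGE, EX_FAULT_MINOR, EX_MEM_HUGETLB) == -EINVAL);
}
```

UFFDIO_WRITEPROTECT is left out of the sketch on purpose, since its -ENOENT
result depends on the memory being shared rather than on the kind of fault.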

Axel Rasmussen (6):
  userfaultfd: add minor fault registration mode
  userfaultfd: disable huge PMD sharing for MINOR registered VMAs
  userfaultfd: hugetlbfs: only compile UFFD helpers if config enabled
  userfaultfd: add UFFDIO_CONTINUE ioctl
  userfaultfd: update documentation to describe minor fault handling
  userfaultfd/selftests: add test exercising minor fault handling

Peter Xu (4):
  hugetlb: Pass vma into huge_pte_alloc() and huge_pmd_share()
  hugetlb/userfaultfd: Forbid huge pmd sharing when uffd enabled
  mm/hugetlb: Move flush_hugetlb_tlb_range() into hugetlb.h
  hugetlb/userfaultfd: Unshare all pmds for hugetlbfs when register wp

 Documentation/admin-guide/mm/userfaultfd.rst | 107 ++++++----
 arch/arm64/mm/hugetlbpage.c                  |   7 +-
 arch/ia64/mm/hugetlbpage.c                   |   3 +-
 arch/mips/mm/hugetlbpage.c                   |   4 +-
 arch/parisc/mm/hugetlbpage.c                 |   2 +-
 arch/powerpc/mm/hugetlbpage.c                |   3 +-
 arch/s390/mm/hugetlbpage.c                   |   2 +-
 arch/sh/mm/hugetlbpage.c                     |   2 +-
 arch/sparc/mm/hugetlbpage.c                  |   6 +-
 fs/proc/task_mmu.c                           |   1 +
 fs/userfaultfd.c                             | 196 ++++++++++++++++---
 include/linux/hugetlb.h                      |  22 ++-
 include/linux/mm.h                           |   1 +
 include/linux/mmu_notifier.h                 |   1 +
 include/linux/userfaultfd_k.h                |  49 ++++-
 include/trace/events/mmflags.h               |   1 +
 include/uapi/linux/userfaultfd.h             |  36 +++-
 mm/hugetlb.c                                 | 113 +++++++----
 mm/userfaultfd.c                             |  51 +++--
 tools/testing/selftests/vm/userfaultfd.c     | 147 +++++++++++++-
 20 files changed, 601 insertions(+), 153 deletions(-)

--
2.30.0.365.g02bc693789-goog