linux-kernel.vger.kernel.org archive mirror
* [PATCH v6 00/23] userfaultfd-wp: Support shmem and hugetlbfs
@ 2021-11-15  7:54 Peter Xu
  2021-11-15  7:55 ` [PATCH v6 01/23] mm: Introduce PTE_MARKER swap entry Peter Xu
                   ` (22 more replies)
  0 siblings, 23 replies; 42+ messages in thread
From: Peter Xu @ 2021-11-15  7:54 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Axel Rasmussen, Nadav Amit, Mike Rapoport, Hugh Dickins,
	Mike Kravetz, Kirill A . Shutemov, Alistair Popple,
	Jerome Glisse, Matthew Wilcox, Andrew Morton, peterx,
	David Hildenbrand, Andrea Arcangeli

This is v6 of the series to add shmem+hugetlbfs support for userfaultfd write
protection.  It is based on v5.16-rc1 (fa55b7dcdc43), with the two patches of
the series below applied first:

  Subject: [PATCH RFC 0/2] mm: Rework zap ptes on swap entries
  https://lore.kernel.org/lkml/20211110082952.19266-1-peterx@redhat.com/

The whole tree can be found here for testing:

  https://github.com/xzpeter/linux/tree/uffd-wp-shmem-hugetlbfs

Previous versions:

  RFC: https://lore.kernel.org/lkml/20210115170907.24498-1-peterx@redhat.com/
  v1:  https://lore.kernel.org/lkml/20210323004912.35132-1-peterx@redhat.com/
  v2:  https://lore.kernel.org/lkml/20210427161317.50682-1-peterx@redhat.com/
  v3:  https://lore.kernel.org/lkml/20210527201927.29586-1-peterx@redhat.com/
  v4:  https://lore.kernel.org/lkml/20210714222117.47648-1-peterx@redhat.com/
  v5:  https://lore.kernel.org/lkml/20210715201422.211004-1-peterx@redhat.com/

Overview
==================

This is the first version of this work that rebases the uffd-wp logic upon PTE
markers.  The major logic is the same as in v5, but since there are quite a few
minor changes here and there, I decided not to provide a change log at all as
it would no longer be helpful.  However, I should have addressed all the
comments that were raised by reviewers; please shout if I missed something.  I
still kept many of Mike's Reviewed-by tags where there is essentially no change
to the patch content (I touched up quite a few commit messages), but it would
be nice if Mike could still go over the patches even where R-bs are standing.

A PTE marker is a new type of swap entry that is only applicable to file-backed
memory like shmem and hugetlbfs.  It's used to persist some pte-level
information even if the original present ptes in the pgtable are zapped.  This
information could be one of:

  (1) Userfaultfd wr-protect information
  (2) PTE soft-dirty information
  (3) Or others

This series only uses the marker to store uffd-wp information across temporary
zappings of shmem/hugetlbfs pgtables, for example, when a shmem thp is split.
So even if ptes are temporarily zapped, the wr-protect information can still be
kept within the pgtables.  Then when the page fault triggers again, we'll know
this pte is wr-protected so we can treat the pte the same as a normal uffd
wr-protected pte.

The extra information is encoded into the swap entry, or swp_offset to be
explicit, with the swp_type being PTE_MARKER.  So far uffd-wp only uses one bit
out of the swap entry; the remaining bits of swp_offset are still reserved for
other purposes.
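
As an illustration, below is a minimal sketch of how such a marker pte would be
built and decoded using the helpers introduced in patches 1 and 4.  It is a
fragment only; the surrounding context (where the pte actually gets installed)
is left out:

  /* Build a marker pte that carries only the uffd-wp bit */
  pte_t marker_pte = make_pte_marker(PTE_MARKER_UFFD_WP);

  /* Decode it again later, e.g. in the page fault path */
  swp_entry_t entry = pte_to_swp_entry(marker_pte);

  if (is_pte_marker_entry(entry) &&
      (pte_marker_get(entry) & PTE_MARKER_UFFD_WP)) {
          /* this pte was uffd-wp wr-protected before being zapped */
  }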

There're two configs to enable/disable PTE markers:

  CONFIG_PTE_MARKER
  CONFIG_PTE_MARKER_UFFD_WP

We can set !PTE_MARKER to completely disable all the PTE markers, along with
uffd-wp support.  I made two configs so we can also enable PTE markers while
disabling the file-backed uffd-wp user, in case the markers get used for other
purposes.  At the end of the current series I'll enable CONFIG_PTE_MARKER by
default, but that patch is standalone, and if anyone worries about having it on
by default, we can consider turning it off by dropping that oneliner patch.  So
far I don't see a huge risk in doing so, so I kept that patch.

In most cases, PTE markers should be treated as none ptes.  This is because,
unlike most of the other swap entry types, there is no PFN or block offset
information encoded into a PTE marker, only some extra well-defined bits
describing the status of the pte.  These bits should only be used as extra data
when servicing an upcoming page fault, and that should be it.

I did spend a lot of time going over all the pte_none() users this time.  It is
indeed a challenge because there are a lot of them, and I hope I didn't miss
any place where pte markers need to be taken care of.  Luckily, they don't need
to be considered in many cases, for example: boot code, arch code (especially
non-x86), kernel-only page handling (e.g. CPA), or device driver code dealing
with pure PFN mappings.

I introduced pte_none_mostly() in this series for the places where we need to
handle pte markers the same as none ptes; the "mostly" is another way of
writing "either a none pte or a pte marker".
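
A minimal before/after sketch of such a conversion is below; the call site is
made up for illustration (pte_none_mostly() itself is added in patch 1):

  /* Before: a pte marker would look like "something is mapped here" */
  if (pte_none(*ptep))
          return handle_as_missing(vmf);        /* hypothetical caller */

  /* After: a pte marker is treated the same way as a none pte */
  if (pte_none_mostly(*ptep))                   /* none pte || pte marker */
          return handle_as_missing(vmf);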

I didn't replace pte_none() itself to cover pte markers, for the reasons below:

  - Very few of the pte_none() callers will ever need to handle pte markers.
    E.g., all the kernel pages do not require any knowledge of pte markers.
    So we don't pollute the major use cases.

  - Unconditionally changing pte_none() semantics could confuse people,
    because pte_none() has existed for such a long time.

  - Unconditionally changing pte_none() semantics could make pte_none()
    slower, even though in many cases pte markers cannot exist at all.

  - There are cases where we'd like to handle pte markers differently from
    pte_none(), so a full replacement is also impossible.  E.g. khugepaged
    should still treat pte markers as normal swap ptes rather than none ptes,
    because pte markers will always need a fault-in to merge the marker with
    a valid pte.  Similarly, the smaps code will need to parse PTE markers
    rather than treat them as none ptes (see the sketch below).
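
For those callers, a case-by-case dispatch could look roughly like the sketch
below; it is purely illustrative and none of the branches are taken verbatim
from any patch in this series:

  pte_t pte = *ptep;

  if (pte_none(pte)) {
          /* really nothing here */
  } else if (is_pte_marker(pte)) {
          /* no page behind it, but pte-level state must be preserved */
          if (pte_marker_get(pte_to_swp_entry(pte)) & PTE_MARKER_UFFD_WP)
                  ;       /* e.g. report the range as uffd-wp protected */
  } else if (!pte_present(pte)) {
          /* a normal swap/migration/... entry */
  } else {
          /* a present pte */
  }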

Patch Layout
============

Introducing PTE marker and uffd-wp bit in PTE marker:

  mm: Introduce PTE_MARKER swap entry
  mm: Teach core mm about pte markers
  mm: Check against orig_pte for finish_fault()
  mm/uffd: PTE_MARKER_UFFD_WP

Adding support for shmem uffd-wp:

  mm/shmem: Take care of UFFDIO_COPY_MODE_WP
  mm/shmem: Handle uffd-wp special pte in page fault handler
  mm/shmem: Persist uffd-wp bit across zapping for file-backed
  mm/shmem: Allow uffd wr-protect none pte for file-backed mem
  mm/shmem: Allows file-back mem to be uffd wr-protected on thps
  mm/shmem: Handle uffd-wp during fork()

Adding support for hugetlbfs uffd-wp:

  mm/hugetlb: Introduce huge pte version of uffd-wp helpers
  mm/hugetlb: Hook page faults for uffd write protection
  mm/hugetlb: Take care of UFFDIO_COPY_MODE_WP
  mm/hugetlb: Handle UFFDIO_WRITEPROTECT
  mm/hugetlb: Handle pte markers in page faults
  mm/hugetlb: Allow uffd wr-protect none ptes
  mm/hugetlb: Only drop uffd-wp special pte if required
  mm/hugetlb: Handle uffd-wp during fork()

Misc handling on the rest of mm for file-backed uffd-wp:

  mm/khugepaged: Don't recycle vma pgtable if uffd-wp registered
  mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs

Enabling of uffd-wp on file-backed memory:

  mm/uffd: Enable write protection for shmem & hugetlbfs
  mm: Enable PTE markers by default
  selftests/uffd: Enable uffd-wp for shmem/hugetlbfs

Tests
==============

- x86_64
  - Compile tested on:
    - PTE_MARKER && PTE_MARKER_UFFD_WP,
    - PTE_MARKER && !PTE_MARKER_UFFD_WP,
    - !PTE_MARKER
    - !USERFAULTFD
  - Kernel userfaultfd selftests for shmem/hugetlb/hugetlb_shared
  - Umapsort [1,2] test for shmem/hugetlb, with swap on/off
- aarch64
  - Compile and smoke tested with !PTE_MARKER

[1] https://github.com/xzpeter/umap-apps/tree/peter
[2] https://github.com/xzpeter/umap/tree/peter-shmem-hugetlbfs

Peter Xu (23):
  mm: Introduce PTE_MARKER swap entry
  mm: Teach core mm about pte markers
  mm: Check against orig_pte for finish_fault()
  mm/uffd: PTE_MARKER_UFFD_WP
  mm/shmem: Take care of UFFDIO_COPY_MODE_WP
  mm/shmem: Handle uffd-wp special pte in page fault handler
  mm/shmem: Persist uffd-wp bit across zapping for file-backed
  mm/shmem: Allow uffd wr-protect none pte for file-backed mem
  mm/shmem: Allows file-back mem to be uffd wr-protected on thps
  mm/shmem: Handle uffd-wp during fork()
  mm/hugetlb: Introduce huge pte version of uffd-wp helpers
  mm/hugetlb: Hook page faults for uffd write protection
  mm/hugetlb: Take care of UFFDIO_COPY_MODE_WP
  mm/hugetlb: Handle UFFDIO_WRITEPROTECT
  mm/hugetlb: Handle pte markers in page faults
  mm/hugetlb: Allow uffd wr-protect none ptes
  mm/hugetlb: Only drop uffd-wp special pte if required
  mm/hugetlb: Handle uffd-wp during fork()
  mm/khugepaged: Don't recycle vma pgtable if uffd-wp registered
  mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs
  mm/uffd: Enable write protection for shmem & hugetlbfs
  mm: Enable PTE markers by default
  selftests/uffd: Enable uffd-wp for shmem/hugetlbfs

 arch/s390/include/asm/hugetlb.h          |  15 ++
 fs/hugetlbfs/inode.c                     |  15 +-
 fs/proc/task_mmu.c                       |  11 ++
 fs/userfaultfd.c                         |  31 +---
 include/asm-generic/hugetlb.h            |  24 +++
 include/linux/hugetlb.h                  |  27 ++--
 include/linux/mm.h                       |  20 +++
 include/linux/mm_inline.h                |  45 ++++++
 include/linux/shmem_fs.h                 |   4 +-
 include/linux/swap.h                     |  15 +-
 include/linux/swapops.h                  |  79 ++++++++++
 include/linux/userfaultfd_k.h            |  67 +++++++++
 include/uapi/linux/userfaultfd.h         |  10 +-
 mm/Kconfig                               |  16 ++
 mm/filemap.c                             |   5 +
 mm/hmm.c                                 |   2 +-
 mm/hugetlb.c                             | 181 +++++++++++++++++-----
 mm/khugepaged.c                          |  14 +-
 mm/memcontrol.c                          |   8 +-
 mm/memory.c                              | 184 ++++++++++++++++++++---
 mm/mincore.c                             |   3 +-
 mm/mprotect.c                            |  76 +++++++++-
 mm/rmap.c                                |   8 +
 mm/shmem.c                               |   4 +-
 mm/userfaultfd.c                         |  61 +++++---
 tools/testing/selftests/vm/userfaultfd.c |   4 +-
 26 files changed, 798 insertions(+), 131 deletions(-)

-- 
2.32.0



* [PATCH v6 01/23] mm: Introduce PTE_MARKER swap entry
  2021-11-15  7:54 [PATCH v6 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
@ 2021-11-15  7:55 ` Peter Xu
  2021-12-03  3:30   ` Alistair Popple
  2021-11-15  7:55 ` [PATCH v6 02/23] mm: Teach core mm about pte markers Peter Xu
                   ` (21 subsequent siblings)
  22 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2021-11-15  7:55 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Axel Rasmussen, Nadav Amit, Mike Rapoport, Hugh Dickins,
	Mike Kravetz, Kirill A . Shutemov, Alistair Popple,
	Jerome Glisse, Matthew Wilcox, Andrew Morton, peterx,
	David Hildenbrand, Andrea Arcangeli

This patch introduces a new swap entry type called PTE_MARKER.  It can be
installed for any pte that maps file-backed memory when the pte is temporarily
zapped, so as to maintain per-pte information.

The information kept in the pte is called a "marker".  Here we define the
marker as "unsigned long" just to match pgoff_t; however, it will only work if
it still fits in swp_offset(), which is e.g. currently 58 bits on x86_64.

A new config CONFIG_PTE_MARKER is introduced too; it's off by default.  A bunch
of helpers are defined along with it to service the rest of the pte marker code.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/asm-generic/hugetlb.h |  9 ++++
 include/linux/swap.h          | 15 ++++++-
 include/linux/swapops.h       | 78 +++++++++++++++++++++++++++++++++++
 mm/Kconfig                    |  7 ++++
 4 files changed, 108 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
index 8e1e6244a89d..f39cad20ffc6 100644
--- a/include/asm-generic/hugetlb.h
+++ b/include/asm-generic/hugetlb.h
@@ -2,6 +2,9 @@
 #ifndef _ASM_GENERIC_HUGETLB_H
 #define _ASM_GENERIC_HUGETLB_H
 
+#include <linux/swap.h>
+#include <linux/swapops.h>
+
 static inline pte_t mk_huge_pte(struct page *page, pgprot_t pgprot)
 {
 	return mk_pte(page, pgprot);
@@ -80,6 +83,12 @@ static inline int huge_pte_none(pte_t pte)
 }
 #endif
 
+/* Please refer to comments above pte_none_mostly() for the usage */
+static inline int huge_pte_none_mostly(pte_t pte)
+{
+	return huge_pte_none(pte) || is_pte_marker(pte);
+}
+
 #ifndef __HAVE_ARCH_HUGE_PTE_WRPROTECT
 static inline pte_t huge_pte_wrprotect(pte_t pte)
 {
diff --git a/include/linux/swap.h b/include/linux/swap.h
index d1ea44b31f19..cc9adcfd666f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -55,6 +55,19 @@ static inline int current_is_kswapd(void)
  * actions on faults.
  */
 
+/*
+ * PTE markers are used to persist information onto PTEs that are mapped with
+ * file-backed memories.  As its name "PTE" hints, it should only be applied to
+ * the leaves of pgtables.
+ */
+#ifdef CONFIG_PTE_MARKER
+#define SWP_PTE_MARKER_NUM 1
+#define SWP_PTE_MARKER     (MAX_SWAPFILES + SWP_HWPOISON_NUM + \
+			    SWP_MIGRATION_NUM + SWP_DEVICE_NUM)
+#else
+#define SWP_PTE_MARKER_NUM 0
+#endif
+
 /*
  * Unaddressable device memory support. See include/linux/hmm.h and
  * Documentation/vm/hmm.rst. Short description is we need struct pages for
@@ -100,7 +113,7 @@ static inline int current_is_kswapd(void)
 
 #define MAX_SWAPFILES \
 	((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
-	SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
+	SWP_MIGRATION_NUM - SWP_HWPOISON_NUM - SWP_PTE_MARKER_NUM)
 
 /*
  * Magic header for a swap area. The first part of the union is
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index d356ab4047f7..5103d2a4ae38 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -247,6 +247,84 @@ static inline int is_writable_migration_entry(swp_entry_t entry)
 
 #endif
 
+typedef unsigned long pte_marker;
+
+#define  PTE_MARKER_MASK     (0)
+
+#ifdef CONFIG_PTE_MARKER
+
+static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
+{
+	return swp_entry(SWP_PTE_MARKER, marker);
+}
+
+static inline bool is_pte_marker_entry(swp_entry_t entry)
+{
+	return swp_type(entry) == SWP_PTE_MARKER;
+}
+
+static inline pte_marker pte_marker_get(swp_entry_t entry)
+{
+	return swp_offset(entry) & PTE_MARKER_MASK;
+}
+
+static inline bool is_pte_marker(pte_t pte)
+{
+	return is_swap_pte(pte) && is_pte_marker_entry(pte_to_swp_entry(pte));
+}
+
+#else /* CONFIG_PTE_MARKER */
+
+static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
+{
+	/* This should never be called if !CONFIG_PTE_MARKER */
+	WARN_ON_ONCE(1);
+	return swp_entry(0, 0);
+}
+
+static inline bool is_pte_marker_entry(swp_entry_t entry)
+{
+	return false;
+}
+
+static inline pte_marker pte_marker_get(swp_entry_t entry)
+{
+	return 0;
+}
+
+static inline bool is_pte_marker(pte_t pte)
+{
+	return false;
+}
+
+#endif /* CONFIG_PTE_MARKER */
+
+static inline pte_t make_pte_marker(pte_marker marker)
+{
+	return swp_entry_to_pte(make_pte_marker_entry(marker));
+}
+
+/*
+ * This is a special version to check pte_none() just to cover the case when
+ * the pte is a pte marker.  It existed because in many cases the pte marker
+ * should be seen as a none pte; it's just that we have stored some information
+ * onto the none pte so it becomes not-none any more.
+ *
+ * It should be used when the pte is file-backed, ram-based and backing
+ * userspace pages, like shmem.  It is not needed upon pgtables that do not
+ * support pte markers at all.  For example, it's not needed on anonymous
+ * memory, kernel-only memory (including when the system is during-boot),
+ * non-ram based generic file-system.  It's fine to be used even there, but the
+ * extra pte marker check will be pure overhead.
+ *
+ * For systems configured with !CONFIG_PTE_MARKER this will be automatically
+ * optimized to pte_none().
+ */
+static inline int pte_none_mostly(pte_t pte)
+{
+	return pte_none(pte) || is_pte_marker(pte);
+}
+
 static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
 {
 	struct page *p = pfn_to_page(swp_offset(entry));
diff --git a/mm/Kconfig b/mm/Kconfig
index 068ce591a13a..66f23c6c2032 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -897,6 +897,13 @@ config IO_MAPPING
 config SECRETMEM
 	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
 
+config PTE_MARKER
+	def_bool n
+	bool "Marker PTEs support"
+
+	help
+	  Allows to create marker PTEs for file-backed memory.
+
 source "mm/damon/Kconfig"
 
 endmenu
-- 
2.32.0



* [PATCH v6 02/23] mm: Teach core mm about pte markers
  2021-11-15  7:54 [PATCH v6 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
  2021-11-15  7:55 ` [PATCH v6 01/23] mm: Introduce PTE_MARKER swap entry Peter Xu
@ 2021-11-15  7:55 ` Peter Xu
  2021-11-15  7:55 ` [PATCH v6 03/23] mm: Check against orig_pte for finish_fault() Peter Xu
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2021-11-15  7:55 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Axel Rasmussen, Nadav Amit, Mike Rapoport, Hugh Dickins,
	Mike Kravetz, Kirill A . Shutemov, Alistair Popple,
	Jerome Glisse, Matthew Wilcox, Andrew Morton, peterx,
	David Hildenbrand, Andrea Arcangeli

This patch still does not use pte markers in any way, but it teaches the core
mm about the pte marker idea.

For example, handle_pte_marker() is introduced; it will parse and handle all
the pte marker faults.

Many of the changes are more about commenting things up - so that we know
there's the possibility of a pte marker showing up, and why we don't need
special code for those cases.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/userfaultfd.c | 10 ++++++----
 mm/filemap.c     |  5 +++++
 mm/hmm.c         |  2 +-
 mm/memcontrol.c  |  8 ++++++--
 mm/memory.c      | 23 +++++++++++++++++++++++
 mm/mincore.c     |  3 ++-
 mm/mprotect.c    |  3 +++
 7 files changed, 46 insertions(+), 8 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 22bf14ab2d16..fa24c72a849e 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -245,9 +245,10 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
 
 	/*
 	 * Lockless access: we're in a wait_event so it's ok if it
-	 * changes under us.
+	 * changes under us.  PTE markers should be handled the same as none
+	 * ptes here.
 	 */
-	if (huge_pte_none(pte))
+	if (huge_pte_none_mostly(pte))
 		ret = true;
 	if (!huge_pte_write(pte) && (reason & VM_UFFD_WP))
 		ret = true;
@@ -326,9 +327,10 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 	pte = pte_offset_map(pmd, address);
 	/*
 	 * Lockless access: we're in a wait_event so it's ok if it
-	 * changes under us.
+	 * changes under us.  PTE markers should be handled the same as none
+	 * ptes here.
 	 */
-	if (pte_none(*pte))
+	if (pte_none_mostly(*pte))
 		ret = true;
 	if (!pte_write(*pte) && (reason & VM_UFFD_WP))
 		ret = true;
diff --git a/mm/filemap.c b/mm/filemap.c
index daa0e23a6ee6..9a7228b95b30 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3327,6 +3327,11 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		vmf->pte += xas.xa_index - last_pgoff;
 		last_pgoff = xas.xa_index;
 
+		/*
+		 * NOTE: If there're PTE markers, we'll leave them to be
+		 * handled in the specific fault path, and it'll prohibit the
+		 * fault-around logic.
+		 */
 		if (!pte_none(*vmf->pte))
 			goto unlock;
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 842e26599238..a0f72a540dc3 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -239,7 +239,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 	pte_t pte = *ptep;
 	uint64_t pfn_req_flags = *hmm_pfn;
 
-	if (pte_none(pte)) {
+	if (pte_none_mostly(pte)) {
 		required_fault =
 			hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
 		if (required_fault)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 781605e92015..eaddbc77aa5a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5692,10 +5692,14 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
 
 	if (pte_present(ptent))
 		page = mc_handle_present_pte(vma, addr, ptent);
+	else if (pte_none_mostly(ptent))
+		/*
+		 * PTE markers should be treated as a none pte here, separated
+		 * from other swap handling below.
+		 */
+		page = mc_handle_file_pte(vma, addr, ptent);
 	else if (is_swap_pte(ptent))
 		page = mc_handle_swap_pte(vma, ptent, &ent);
-	else if (pte_none(ptent))
-		page = mc_handle_file_pte(vma, addr, ptent);
 
 	if (!page && !ent.val)
 		return ret;
diff --git a/mm/memory.c b/mm/memory.c
index e5d59a6b6479..04662b010005 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -98,6 +98,8 @@ struct page *mem_map;
 EXPORT_SYMBOL(mem_map);
 #endif
 
+static vm_fault_t do_fault(struct vm_fault *vmf);
+
 /*
  * A number of key systems in x86 including ioremap() rely on the assumption
  * that high_memory defines the upper bound on direct map memory, then end
@@ -1380,6 +1382,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			if (unlikely(zap_skip_check_mapping(details, page)))
 				continue;
 			rss[mm_counter(page)]--;
+		} else if (is_pte_marker_entry(entry)) {
+			/* By default, simply drop all pte markers when zap */
 		} else if (!non_swap_entry(entry)) {
 			rss[MM_SWAPENTS]--;
 			if (unlikely(!free_swap_and_cache(entry)))
@@ -3448,6 +3452,23 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	return 0;
 }
 
+static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
+{
+	swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);
+	unsigned long marker = pte_marker_get(entry);
+
+	/*
+	 * PTE markers should always be with file-backed memories, and the
+	 * marker should never be empty.  If anything weird happened, the best
+	 * thing to do is to kill the process along with its mm.
+	 */
+	if (WARN_ON_ONCE(vma_is_anonymous(vmf->vma) || !marker))
+		return VM_FAULT_SIGBUS;
+
+	/* TODO: handle pte markers */
+	return 0;
+}
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -3484,6 +3505,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
 		} else if (is_hwpoison_entry(entry)) {
 			ret = VM_FAULT_HWPOISON;
+		} else if (is_pte_marker_entry(entry)) {
+			ret = handle_pte_marker(vmf);
 		} else {
 			print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
 			ret = VM_FAULT_SIGBUS;
diff --git a/mm/mincore.c b/mm/mincore.c
index 9122676b54d6..736869f4b409 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -121,7 +121,8 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	for (; addr != end; ptep++, addr += PAGE_SIZE) {
 		pte_t pte = *ptep;
 
-		if (pte_none(pte))
+		/* We need to do cache lookup too for pte markers */
+		if (pte_none_mostly(pte))
 			__mincore_unmapped_range(addr, addr + PAGE_SIZE,
 						 vma, vec);
 		else if (pte_present(pte))
diff --git a/mm/mprotect.c b/mm/mprotect.c
index e552f5e0ccbd..890bc1f9ca24 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -173,6 +173,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 					newpte = pte_swp_mksoft_dirty(newpte);
 				if (pte_swp_uffd_wp(oldpte))
 					newpte = pte_swp_mkuffd_wp(newpte);
+			} else if (is_pte_marker_entry(entry)) {
+				/* Skip it, the same as none pte */
+				continue;
 			} else {
 				newpte = oldpte;
 			}
-- 
2.32.0



* [PATCH v6 03/23] mm: Check against orig_pte for finish_fault()
  2021-11-15  7:54 [PATCH v6 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
  2021-11-15  7:55 ` [PATCH v6 01/23] mm: Introduce PTE_MARKER swap entry Peter Xu
  2021-11-15  7:55 ` [PATCH v6 02/23] mm: Teach core mm about pte markers Peter Xu
@ 2021-11-15  7:55 ` Peter Xu
  2021-12-16  5:01   ` Alistair Popple
  2021-11-15  7:55 ` [PATCH v6 04/23] mm/uffd: PTE_MARKER_UFFD_WP Peter Xu
                   ` (19 subsequent siblings)
  22 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2021-11-15  7:55 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Axel Rasmussen, Nadav Amit, Mike Rapoport, Hugh Dickins,
	Mike Kravetz, Kirill A . Shutemov, Alistair Popple,
	Jerome Glisse, Matthew Wilcox, Andrew Morton, peterx,
	David Hildenbrand, Andrea Arcangeli

We used to check against the none pte in finish_fault(), and in those cases
orig_pte should always be the none pte anyway.

This change prepares us to be able to call do_fault() on !none ptes.  For
example, we should allow that to happen for pte markers so that we can restore
information out of them.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 04662b010005..d5966d9e24c3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4052,7 +4052,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 				      vmf->address, &vmf->ptl);
 	ret = 0;
 	/* Re-check under ptl */
-	if (likely(pte_none(*vmf->pte)))
+	if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
 		do_set_pte(vmf, page, vmf->address);
 	else
 		ret = VM_FAULT_NOPAGE;
-- 
2.32.0



* [PATCH v6 04/23] mm/uffd: PTE_MARKER_UFFD_WP
  2021-11-15  7:54 [PATCH v6 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (2 preceding siblings ...)
  2021-11-15  7:55 ` [PATCH v6 03/23] mm: Check against orig_pte for finish_fault() Peter Xu
@ 2021-11-15  7:55 ` Peter Xu
  2021-12-16  5:18   ` Alistair Popple
  2021-11-15  7:55 ` [PATCH v6 05/23] mm/shmem: Take care of UFFDIO_COPY_MODE_WP Peter Xu
                   ` (18 subsequent siblings)
  22 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2021-11-15  7:55 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Axel Rasmussen, Nadav Amit, Mike Rapoport, Hugh Dickins,
	Mike Kravetz, Kirill A . Shutemov, Alistair Popple,
	Jerome Glisse, Matthew Wilcox, Andrew Morton, peterx,
	David Hildenbrand, Andrea Arcangeli

This patch introduces the 1st user of pte marker: the uffd-wp marker.

When the pte marker is installed with the uffd-wp bit set, it means this pte
was wr-protected by uffd.

We will use this special pte to arm the ptes that got either unmapped or
swapped out for a file-backed region that was previously wr-protected.  This
special pte could trigger a page fault just like swap entries.

This idea is greatly inspired by Hugh and Andrea in the discussion, which is
referenced in the links below.

Some helpers are introduced to detect whether a swap pte is uffd wr-protected.
After the pte marker is introduced, a swap pte can be wr-protected in one of
two forms: either it is a normal swap pte with _PAGE_SWP_UFFD_WP set, or it is
a pte marker with PTE_MARKER_UFFD_WP set.

Link: https://lore.kernel.org/lkml/20201126222359.8120-1-peterx@redhat.com/
Link: https://lore.kernel.org/lkml/20201130230603.46187-1-peterx@redhat.com/
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Suggested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/swapops.h       |  3 ++-
 include/linux/userfaultfd_k.h | 38 +++++++++++++++++++++++++++++++++++
 mm/Kconfig                    |  9 +++++++++
 3 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 5103d2a4ae38..2cec3ef355a7 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -249,7 +249,8 @@ static inline int is_writable_migration_entry(swp_entry_t entry)
 
 typedef unsigned long pte_marker;
 
-#define  PTE_MARKER_MASK     (0)
+#define  PTE_MARKER_UFFD_WP  BIT(0)
+#define  PTE_MARKER_MASK     (PTE_MARKER_UFFD_WP)
 
 #ifdef CONFIG_PTE_MARKER
 
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 33cea484d1ad..7d7ffec53ddb 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -15,6 +15,8 @@
 
 #include <linux/fcntl.h>
 #include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 #include <asm-generic/pgtable_uffd.h>
 
 /* The set of all possible UFFD-related VM flags. */
@@ -236,4 +238,40 @@ static inline void userfaultfd_unmap_complete(struct mm_struct *mm,
 
 #endif /* CONFIG_USERFAULTFD */
 
+static inline bool is_pte_marker_uffd_wp(pte_t pte)
+{
+#ifdef CONFIG_PTE_MARKER_UFFD_WP
+	swp_entry_t entry;
+
+	if (!is_swap_pte(pte))
+		return false;
+
+	entry = pte_to_swp_entry(pte);
+
+	return is_pte_marker_entry(entry) &&
+	    (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
+#else
+	return false;
+#endif
+}
+
+/*
+ * Returns true if this is a swap pte and was uffd-wp wr-protected in either
+ * forms (pte marker or a normal swap pte), false otherwise.
+ */
+static inline bool pte_swp_uffd_wp_any(pte_t pte)
+{
+#ifdef CONFIG_PTE_MARKER_UFFD_WP
+	if (!is_swap_pte(pte))
+		return false;
+
+	if (pte_swp_uffd_wp(pte))
+		return true;
+
+	if (is_pte_marker_uffd_wp(pte))
+		return true;
+#endif
+	return false;
+}
+
 #endif /* _LINUX_USERFAULTFD_K_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 66f23c6c2032..f01c8e0afadf 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -904,6 +904,15 @@ config PTE_MARKER
 	help
 	  Allows to create marker PTEs for file-backed memory.
 
+config PTE_MARKER_UFFD_WP
+	bool "Marker PTEs support for userfaultfd write protection"
+	depends on PTE_MARKER && HAVE_ARCH_USERFAULTFD_WP
+
+	help
+	  Allows to create marker PTEs for userfaultfd write protection
+	  purposes.  It is required to enable userfaultfd write protection on
+	  file-backed memory types like shmem and hugetlbfs.
+
 source "mm/damon/Kconfig"
 
 endmenu
-- 
2.32.0



* [PATCH v6 05/23] mm/shmem: Take care of UFFDIO_COPY_MODE_WP
  2021-11-15  7:54 [PATCH v6 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (3 preceding siblings ...)
  2021-11-15  7:55 ` [PATCH v6 04/23] mm/uffd: PTE_MARKER_UFFD_WP Peter Xu
@ 2021-11-15  7:55 ` Peter Xu
  2021-11-15  7:55 ` [PATCH v6 06/23] mm/shmem: Handle uffd-wp special pte in page fault handler Peter Xu
                   ` (17 subsequent siblings)
  22 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2021-11-15  7:55 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Axel Rasmussen, Nadav Amit, Mike Rapoport, Hugh Dickins,
	Mike Kravetz, Kirill A . Shutemov, Alistair Popple,
	Jerome Glisse, Matthew Wilcox, Andrew Morton, peterx,
	David Hildenbrand, Andrea Arcangeli

Pass wp_copy into shmem_mfill_atomic_pte() through the stack, then apply the
UFFD_WP bit properly when the UFFDIO_COPY on shmem is issued with
UFFDIO_COPY_MODE_WP.  wp_copy finally lands in mfill_atomic_install_pte().

Note: we must do pte_wrprotect() if !writable in mfill_atomic_install_pte(), as
mk_pte() could return a writable pte (e.g., when VM_SHARED on a shmem file).

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/shmem_fs.h |  4 ++--
 mm/shmem.c               |  4 ++--
 mm/userfaultfd.c         | 30 +++++++++++++++++++++---------
 3 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 166158b6e917..0ee0f437b14f 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -145,11 +145,11 @@ extern int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 				  struct vm_area_struct *dst_vma,
 				  unsigned long dst_addr,
 				  unsigned long src_addr,
-				  bool zeropage,
+				  bool zeropage, bool wp_copy,
 				  struct page **pagep);
 #else /* !CONFIG_SHMEM */
 #define shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, \
-			       src_addr, zeropage, pagep)       ({ BUG(); 0; })
+			       src_addr, zeropage, wp_copy, pagep) ({ BUG(); 0; })
 #endif /* CONFIG_SHMEM */
 #endif /* CONFIG_USERFAULTFD */
 
diff --git a/mm/shmem.c b/mm/shmem.c
index dc038ce78700..167a46e6a1ff 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2344,7 +2344,7 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 			   struct vm_area_struct *dst_vma,
 			   unsigned long dst_addr,
 			   unsigned long src_addr,
-			   bool zeropage,
+			   bool zeropage, bool wp_copy,
 			   struct page **pagep)
 {
 	struct inode *inode = file_inode(dst_vma->vm_file);
@@ -2415,7 +2415,7 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 		goto out_release;
 
 	ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
-				       page, true, false);
+				       page, true, wp_copy);
 	if (ret)
 		goto out_delete_from_cache;
 
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index ac6f036298cd..95e5a9ba3196 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -70,14 +70,22 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 
 	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
 	_dst_pte = pte_mkdirty(_dst_pte);
-	if (page_in_cache && !vm_shared)
+	/* Don't write if uffd-wp wr-protected */
+	if (wp_copy) {
+		_dst_pte = pte_mkuffd_wp(_dst_pte);
 		writable = false;
-	if (writable) {
-		if (wp_copy)
-			_dst_pte = pte_mkuffd_wp(_dst_pte);
-		else
-			_dst_pte = pte_mkwrite(_dst_pte);
 	}
+	/* Don't write if page cache privately mapped */
+	if (page_in_cache && !vm_shared)
+		writable = false;
+	if (writable)
+		_dst_pte = pte_mkwrite(_dst_pte);
+	else
+		/*
+		 * We need this to make sure write bit removed; as mk_pte()
+		 * could return a pte with write bit set.
+		 */
+		_dst_pte = pte_wrprotect(_dst_pte);
 
 	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
 
@@ -92,7 +100,12 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	}
 
 	ret = -EEXIST;
-	if (!pte_none(*dst_pte))
+	/*
+	 * We allow to overwrite a pte marker: consider when both MISSING|WP
+	 * registered, we firstly wr-protect a none pte which has no page cache
+	 * page backing it, then access the page.
+	 */
+	if (!pte_none_mostly(*dst_pte))
 		goto out_unlock;
 
 	if (page_in_cache)
@@ -467,11 +480,10 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 			err = mfill_zeropage_pte(dst_mm, dst_pmd,
 						 dst_vma, dst_addr);
 	} else {
-		VM_WARN_ON_ONCE(wp_copy);
 		err = shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma,
 					     dst_addr, src_addr,
 					     mode != MCOPY_ATOMIC_NORMAL,
-					     page);
+					     wp_copy, page);
 	}
 
 	return err;
-- 
2.32.0



* [PATCH v6 06/23] mm/shmem: Handle uffd-wp special pte in page fault handler
  2021-11-15  7:54 [PATCH v6 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (4 preceding siblings ...)
  2021-11-15  7:55 ` [PATCH v6 05/23] mm/shmem: Take care of UFFDIO_COPY_MODE_WP Peter Xu
@ 2021-11-15  7:55 ` Peter Xu
  2021-12-16  5:56   ` Alistair Popple
  2021-11-15  7:55 ` [PATCH v6 07/23] mm/shmem: Persist uffd-wp bit across zapping for file-backed Peter Xu
                   ` (16 subsequent siblings)
  22 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2021-11-15  7:55 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Axel Rasmussen, Nadav Amit, Mike Rapoport, Hugh Dickins,
	Mike Kravetz, Kirill A . Shutemov, Alistair Popple,
	Jerome Glisse, Matthew Wilcox, Andrew Morton, peterx,
	David Hildenbrand, Andrea Arcangeli

File-backed memory is prone to being unmapped or swapped out, so its ptes are
always unstable, because the pages can easily be faulted back in later using
the page cache.  This could lead to uffd-wp getting lost when unmapping or
swapping out such memory.  One example is shmem.  PTE markers are needed to
store that information.

This patch prepares for that by first teaching the page fault handler to
recognize and handle uffd-wp pte markers, before the markers get installed
anywhere else.

The handling of an uffd-wp pte marker is similar to a missing fault; it's just
that we handle this "missing fault" when we see the pte marker, and meanwhile
we need to make sure the marker information is kept while processing the
fault.

This is a slow path of uffd-wp handling, because zapping of wr-protected shmem
ptes should be rare.  So far it should only trigger in two conditions:

  (1) When trying to punch holes in shmem_fallocate(), there is an optimization
      to zap the pgtables before evicting the page.

  (2) When swapping out shmem pages.

Because of this, the page fault handling is simplified too: instead of sending
the wr-protect message on the 1st page fault, the page will be installed
read-only, so the uffd-wp message will be generated on the next fault, which
will trigger the do_wp_page() path of general uffd-wp handling.

Disable fault-around for all uffd-wp registered ranges for extra safety just
like uffd-minor fault, and clean the code up.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/userfaultfd_k.h | 17 +++++++++
 mm/memory.c                   | 71 ++++++++++++++++++++++++++++++-----
 2 files changed, 79 insertions(+), 9 deletions(-)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 7d7ffec53ddb..05cec02140cb 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -96,6 +96,18 @@ static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
 	return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
 }
 
+/*
+ * Don't do fault around for either WP or MINOR registered uffd range.  For
+ * MINOR registered range, fault around will be a total disaster and ptes can
+ * be installed without notifications; for WP it should mostly be fine as long
+ * as the fault around checks for pte_none() before the installation, however
+ * to be super safe we just forbid it.
+ */
+static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
+}
+
 static inline bool userfaultfd_missing(struct vm_area_struct *vma)
 {
 	return vma->vm_flags & VM_UFFD_MISSING;
@@ -236,6 +248,11 @@ static inline void userfaultfd_unmap_complete(struct mm_struct *mm,
 {
 }
 
+static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
+{
+	return false;
+}
+
 #endif /* CONFIG_USERFAULTFD */
 
 static inline bool is_pte_marker_uffd_wp(pte_t pte)
diff --git a/mm/memory.c b/mm/memory.c
index d5966d9e24c3..e8557d43a87d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3452,6 +3452,43 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	return 0;
 }
 
+static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
+{
+	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+				       vmf->address, &vmf->ptl);
+	/*
+	 * Be careful so that we will only recover a special uffd-wp pte into a
+	 * none pte.  Otherwise it means the pte could have changed, so retry.
+	 */
+	if (is_pte_marker(*vmf->pte))
+		pte_clear(vmf->vma->vm_mm, vmf->address, vmf->pte);
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+	return 0;
+}
+
+/*
+ * This is actually a page-missing access, but with uffd-wp special pte
+ * installed.  It means this pte was wr-protected before being unmapped.
+ */
+static vm_fault_t pte_marker_handle_uffd_wp(struct vm_fault *vmf)
+{
+	/* Careful!  vmf->pte unmapped after return */
+	if (!pte_unmap_same(vmf))
+		return 0;
+
+	/*
+	 * Just in case there're leftover special ptes even after the region
+	 * got unregistered - we can simply clear them.  We can also do that
+	 * proactively when e.g. when we do UFFDIO_UNREGISTER upon some uffd-wp
+	 * ranges, but it should be more efficient to be done lazily here.
+	 */
+	if (unlikely(!userfaultfd_wp(vmf->vma) || vma_is_anonymous(vmf->vma)))
+		return pte_marker_clear(vmf);
+
+	/* do_fault() can handle pte markers too like none pte */
+	return do_fault(vmf);
+}
+
 static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
 {
 	swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);
@@ -3465,8 +3502,11 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
 	if (WARN_ON_ONCE(vma_is_anonymous(vmf->vma) || !marker))
 		return VM_FAULT_SIGBUS;
 
-	/* TODO: handle pte markers */
-	return 0;
+	if (marker & PTE_MARKER_UFFD_WP)
+		return pte_marker_handle_uffd_wp(vmf);
+
+	/* This is an unknown pte marker */
+	return VM_FAULT_SIGBUS;
 }
 
 /*
@@ -3968,6 +4008,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
 {
 	struct vm_area_struct *vma = vmf->vma;
+	bool uffd_wp = is_pte_marker_uffd_wp(vmf->orig_pte);
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
 	bool prefault = vmf->address != addr;
 	pte_t entry;
@@ -3982,6 +4023,8 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
 
 	if (write)
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+	if (unlikely(uffd_wp))
+		entry = pte_mkuffd_wp(pte_wrprotect(entry));
 	/* copy-on-write page */
 	if (write && !(vma->vm_flags & VM_SHARED)) {
 		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
@@ -4155,9 +4198,21 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
 	return vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff);
 }
 
+/* Return true if we should do read fault-around, false otherwise */
+static inline bool should_fault_around(struct vm_fault *vmf)
+{
+	/* No ->map_pages?  No way to fault around... */
+	if (!vmf->vma->vm_ops->map_pages)
+		return false;
+
+	if (uffd_disable_fault_around(vmf->vma))
+		return false;
+
+	return fault_around_bytes >> PAGE_SHIFT > 1;
+}
+
 static vm_fault_t do_read_fault(struct vm_fault *vmf)
 {
-	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret = 0;
 
 	/*
@@ -4165,12 +4220,10 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
 	 * if page by the offset is not ready to be mapped (cold cache or
 	 * something).
 	 */
-	if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
-		if (likely(!userfaultfd_minor(vmf->vma))) {
-			ret = do_fault_around(vmf);
-			if (ret)
-				return ret;
-		}
+	if (should_fault_around(vmf)) {
+		ret = do_fault_around(vmf);
+		if (ret)
+			return ret;
 	}
 
 	ret = __do_fault(vmf);
-- 
2.32.0



* [PATCH v6 07/23] mm/shmem: Persist uffd-wp bit across zapping for file-backed
  2021-11-15  7:54 [PATCH v6 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (5 preceding siblings ...)
  2021-11-15  7:55 ` [PATCH v6 06/23] mm/shmem: Handle uffd-wp special pte in page fault handler Peter Xu
@ 2021-11-15  7:55 ` Peter Xu
  2021-11-15  8:00 ` [PATCH v6 08/23] mm/shmem: Allow uffd wr-protect none pte for file-backed mem Peter Xu
                   ` (15 subsequent siblings)
  22 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2021-11-15  7:55 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Axel Rasmussen, Nadav Amit, Mike Rapoport, Hugh Dickins,
	Mike Kravetz, Kirill A . Shutemov, Alistair Popple,
	Jerome Glisse, Matthew Wilcox, Andrew Morton, peterx,
	David Hildenbrand, Andrea Arcangeli

File-backed memory is prone to being unmapped at any time.  It means all
information in the pte will be dropped, including the uffd-wp flag.

To persist the uffd-wp flag, we'll use the pte markers.  This patch teaches the
zap code to understand uffd-wp and know when to keep or drop the uffd-wp bit.

Add a new flag ZAP_FLAG_DROP_MARKER and set it in zap_details when we don't
want to persist such information, for example, when destroying the whole vma,
or punching a hole in a shmem file.  For all other cases we should never drop
the uffd-wp bit, or the wr-protect information will get lost.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h        | 20 +++++++++++++++++
 include/linux/mm_inline.h | 45 +++++++++++++++++++++++++++++++++++++++
 mm/memory.c               | 38 +++++++++++++++++++++++++++++++--
 mm/rmap.c                 |  8 +++++++
 4 files changed, 109 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7e4a9e7d807..015e287063a8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1825,12 +1825,23 @@ static inline bool can_do_mlock(void) { return false; }
 extern int user_shm_lock(size_t, struct ucounts *);
 extern void user_shm_unlock(size_t, struct ucounts *);
 
+typedef unsigned int __bitwise zap_flags_t;
+
+/*
+ * Whether to drop the pte markers, for example, the uffd-wp information for
+ * file-backed memory.  This should only be specified when we will completely
+ * drop the page in the mm, either by truncation or unmapping of the vma.  By
+ * default, the flag is not set.
+ */
+#define  ZAP_FLAG_DROP_MARKER        ((__force zap_flags_t) BIT(0))
+
 /*
  * Parameter block passed down to zap_pte_range in exceptional cases.
  */
 struct zap_details {
 	struct address_space *zap_mapping;	/* Check page->mapping if set */
 	struct page *single_page;		/* Locked page to be unmapped */
+	zap_flags_t zap_flags;			/* Extra flags for zapping */
 };
 
 /*
@@ -1847,6 +1858,15 @@ zap_skip_check_mapping(struct zap_details *details, struct page *page)
 	    (details->zap_mapping != page_rmapping(page));
 }
 
+static inline bool
+zap_drop_file_uffd_wp(struct zap_details *details)
+{
+	if (!details)
+		return false;
+
+	return details->zap_flags & ZAP_FLAG_DROP_MARKER;
+}
+
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 			     pte_t pte);
 struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index e2ec68b0515c..ca861e910938 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -4,6 +4,8 @@
 
 #include <linux/huge_mm.h>
 #include <linux/swap.h>
+#include <linux/userfaultfd_k.h>
+#include <linux/swapops.h>
 
 /**
  * folio_is_file_lru - Should the folio be on a file LRU or anon LRU?
@@ -135,4 +137,47 @@ static __always_inline void del_page_from_lru_list(struct page *page,
 {
 	lruvec_del_folio(lruvec, page_folio(page));
 }
+
+/*
+ * If this pte is wr-protected by uffd-wp in any form, arm the special pte to
+ * replace a none pte.  NOTE!  This should only be called when *pte is already
+ * cleared so we will never accidentally replace something valuable.  Meanwhile
+ * none pte also means we are not demoting the pte so tlb flushed is not needed.
+ * E.g., when pte cleared the caller should have taken care of the tlb flush.
+ *
+ * Must be called with pgtable lock held so that no thread will see the none
+ * pte, and if they see it, they'll fault and serialize at the pgtable lock.
+ *
+ * This function is a no-op if PTE_MARKER_UFFD_WP is not enabled.
+ */
+static inline void
+pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
+			      pte_t *pte, pte_t pteval)
+{
+#ifdef CONFIG_PTE_MARKER_UFFD_WP
+	bool arm_uffd_pte = false;
+
+	/* The current status of the pte should be "cleared" before calling */
+	WARN_ON_ONCE(!pte_none(*pte));
+
+	if (vma_is_anonymous(vma))
+		return;
+
+	/* A uffd-wp wr-protected normal pte */
+	if (unlikely(pte_present(pteval) && pte_uffd_wp(pteval)))
+		arm_uffd_pte = true;
+
+	/*
+	 * A uffd-wp wr-protected swap pte.  Note: this should even cover an
+	 * existing pte marker with uffd-wp bit set.
+	 */
+	if (unlikely(pte_swp_uffd_wp_any(pteval)))
+		arm_uffd_pte = true;
+
+	if (unlikely(arm_uffd_pte))
+		set_pte_at(vma->vm_mm, addr, pte,
+			   make_pte_marker(PTE_MARKER_UFFD_WP));
+#endif
+}
+
 #endif
diff --git a/mm/memory.c b/mm/memory.c
index e8557d43a87d..fef6a91c5dfb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -73,6 +73,7 @@
 #include <linux/perf_event.h>
 #include <linux/ptrace.h>
 #include <linux/vmalloc.h>
+#include <linux/mm_inline.h>
 
 #include <trace/events/kmem.h>
 
@@ -1306,6 +1307,21 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 	return ret;
 }
 
+/*
+ * This function makes sure that we'll replace the none pte with an uffd-wp
+ * swap special pte marker when necessary. Must be with the pgtable lock held.
+ */
+static inline void
+zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
+			      unsigned long addr, pte_t *pte,
+			      struct zap_details *details, pte_t pteval)
+{
+	if (zap_drop_file_uffd_wp(details))
+		return;
+
+	pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
+}
+
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pmd_t *pmd,
 				unsigned long addr, unsigned long end,
@@ -1343,6 +1359,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
 			tlb_remove_tlb_entry(tlb, pte, addr);
+			zap_install_uffd_wp_if_needed(vma, addr, pte, details,
+						      ptent);
 			if (unlikely(!page))
 				continue;
 
@@ -1373,6 +1391,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			page = pfn_swap_entry_to_page(entry);
 			if (unlikely(zap_skip_check_mapping(details, page)))
 				continue;
+			/*
+			 * Both device private/exclusive mappings should only
+			 * work with anonymous page so far, so we don't need to
+			 * consider uffd-wp bit when zap. For more information,
+			 * see zap_install_uffd_wp_if_needed().
+			 */
+			WARN_ON_ONCE(!vma_is_anonymous(vma));
 			rss[mm_counter(page)]--;
 			if (is_device_private_entry(entry))
 				page_remove_rmap(page, false);
@@ -1383,13 +1408,18 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				continue;
 			rss[mm_counter(page)]--;
 		} else if (is_pte_marker_entry(entry)) {
-			/* By default, simply drop all pte markers when zap */
+			/* Currently there's only uffd-wp marker bit */
+			WARN_ON_ONCE(!(pte_marker_get(entry) & PTE_MARKER_UFFD_WP));
+			/* Only drop the uffd-wp marker if explicitly requested */
+			if (!zap_drop_file_uffd_wp(details))
+				continue;
 		} else if (!non_swap_entry(entry)) {
 			rss[MM_SWAPENTS]--;
 			if (unlikely(!free_swap_and_cache(entry)))
 				print_bad_pte(vma, addr, ptent, NULL);
 		}
 		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+		zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 
 	add_mm_rss_vec(mm, rss);
@@ -1600,12 +1630,15 @@ void unmap_vmas(struct mmu_gather *tlb,
 		unsigned long end_addr)
 {
 	struct mmu_notifier_range range;
+	struct zap_details details = {
+		.zap_flags = ZAP_FLAG_DROP_MARKER,
+	};
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,
 				start_addr, end_addr);
 	mmu_notifier_invalidate_range_start(&range);
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
-		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
+		unmap_single_vma(tlb, vma, start_addr, end_addr, &details);
 	mmu_notifier_invalidate_range_end(&range);
 }
 
@@ -3350,6 +3383,7 @@ void unmap_mapping_page(struct page *page)
 
 	details.zap_mapping = mapping;
 	details.single_page = page;
+	details.zap_flags = ZAP_FLAG_DROP_MARKER;
 
 	i_mmap_lock_write(mapping);
 	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
diff --git a/mm/rmap.c b/mm/rmap.c
index 163ac4e6bcee..89068e957486 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -73,6 +73,7 @@
 #include <linux/page_idle.h>
 #include <linux/memremap.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/mm_inline.h>
 
 #include <asm/tlbflush.h>
 
@@ -1517,6 +1518,13 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			pteval = ptep_clear_flush(vma, address, pvmw.pte);
 		}
 
+		/*
+		 * Now the pte is cleared.  If this is uffd-wp armed pte, we
+		 * may want to replace a none pte with a marker pte if it's
+		 * file-backed, so we don't lose the tracking information.
+		 */
+		pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
+
 		/* Move the dirty bit to the page. Now the pte is gone. */
 		if (pte_dirty(pteval))
 			set_page_dirty(page);
-- 
2.32.0



* [PATCH v6 08/23] mm/shmem: Allow uffd wr-protect none pte for file-backed mem
  2021-11-15  7:54 [PATCH v6 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (6 preceding siblings ...)
  2021-11-15  7:55 ` [PATCH v6 07/23] mm/shmem: Persist uffd-wp bit across zapping for file-backed Peter Xu
@ 2021-11-15  8:00 ` Peter Xu
  2021-11-15  8:00 ` [PATCH v6 09/23] mm/shmem: Allows file-back mem to be uffd wr-protected on thps Peter Xu
                   ` (14 subsequent siblings)
  22 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2021-11-15  8:00 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, peterx, Alistair Popple, Andrew Morton, Mike Kravetz,
	Mike Rapoport, Matthew Wilcox, Jerome Glisse, Axel Rasmussen,
	Kirill A . Shutemov, David Hildenbrand, Andrea Arcangeli,
	Hugh Dickins

File-backed memory differs from anonymous memory in that even if the pte is
missing, the data could still reside either in the file or in the page/swap
cache.  So when wr-protecting a pte, we need to consider none ptes too.

We do that by installing uffd-wp pte markers when necessary.  So when there's
a future write to the pte, the fault handler will take the special path to
first fault in the page read-only, then report to the userfaultfd server with
the wr-protect message.

On the other hand, when unprotecting a page, it's also possible that the pte
got unmapped but replaced by the special uffd-wp marker.  Then we'll need to
be able to recover the uffd-wp pte marker into a none pte, so that the next
access to the page will fault in correctly as usual.

Special care needs to be taken throughout the change_protection_range()
process.  Since we now allow the user to wr-protect a none pte, we need to be
able to pre-populate the page table entries when we see (!anonymous &&
MM_CP_UFFD_WP) requests; otherwise change_protection_range() will simply skip
ranges where the pgtable entries do not exist.

For example, the pgtable can be missing for a whole chunk of 2M pmd, but the
page cache can exist for the 2M range.  When we want to wr-protect one 4K page
within the 2M pmd range, we need to pre-populate the pgtable and install the
pte marker showing that we want to get a message and block the thread when the
page cache of that 4K page is written.  Without pre-populating the pmd,
change_protection() will simply skip that whole pmd.

Note that this patch only covers small pages (pte level) and does not yet
cover any of the transparent huge pages.  That will be done later, and this
patch is also a preparation for it.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/mprotect.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 62 insertions(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 890bc1f9ca24..be837c4dbc64 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -29,6 +29,7 @@
 #include <linux/uaccess.h>
 #include <linux/mm_inline.h>
 #include <linux/pgtable.h>
+#include <linux/userfaultfd_k.h>
 #include <asm/cacheflush.h>
 #include <asm/mmu_context.h>
 #include <asm/tlbflush.h>
@@ -174,7 +175,16 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				if (pte_swp_uffd_wp(oldpte))
 					newpte = pte_swp_mkuffd_wp(newpte);
 			} else if (is_pte_marker_entry(entry)) {
-				/* Skip it, the same as none pte */
+				/*
+				 * If this is uffd-wp pte marker and we'd like
+				 * to unprotect it, drop it; the next page
+				 * fault will trigger without uffd trapping.
+				 */
+				if (uffd_wp_resolve &&
+				    (pte_marker_get(entry) & PTE_MARKER_UFFD_WP)) {
+					pte_clear(vma->vm_mm, addr, pte);
+					pages++;
+				}
 				continue;
 			} else {
 				newpte = oldpte;
@@ -189,6 +199,20 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				set_pte_at(vma->vm_mm, addr, pte, newpte);
 				pages++;
 			}
+		} else {
+			/* It must be an none page, or what else?.. */
+			WARN_ON_ONCE(!pte_none(oldpte));
+			if (unlikely(uffd_wp && !vma_is_anonymous(vma))) {
+				/*
+				 * For file-backed mem, we need to be able to
+				 * wr-protect a none pte, because even if the
+				 * pte is none, the page/swap cache could
+				 * exist.  Doing that by install a marker.
+				 */
+				set_pte_at(vma->vm_mm, addr, pte,
+					   make_pte_marker(PTE_MARKER_UFFD_WP));
+				pages++;
+			}
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
@@ -222,6 +246,39 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
 	return 0;
 }
 
+/* Return true if we're uffd wr-protecting file-backed memory, or false */
+static inline bool
+uffd_wp_protect_file(struct vm_area_struct *vma, unsigned long cp_flags)
+{
+	return (cp_flags & MM_CP_UFFD_WP) && !vma_is_anonymous(vma);
+}
+
+/*
+ * If wr-protecting the range for file-backed, populate pgtable for the case
+ * when pgtable is empty but page cache exists.  When {pte|pmd|...}_alloc()
+ * failed it means no memory, we don't have a better option but stop.
+ */
+#define  change_pmd_prepare(vma, pmd, cp_flags)				\
+	do {								\
+		if (unlikely(uffd_wp_protect_file(vma, cp_flags))) {	\
+			if (WARN_ON_ONCE(pte_alloc(vma->vm_mm, pmd)))	\
+				break;					\
+		}							\
+	} while (0)
+/*
+ * This is the general pud/p4d/pgd version of change_pmd_prepare(). We need to
+ * have separate change_pmd_prepare() because pte_alloc() returns 0 on success,
+ * while {pmd|pud|p4d}_alloc() returns the valid pointer on success.
+ */
+#define  change_prepare(vma, high, low, addr, cp_flags)			\
+	do {								\
+		if (unlikely(uffd_wp_protect_file(vma, cp_flags))) {	\
+			low##_t *p = low##_alloc(vma->vm_mm, high, addr); \
+			if (WARN_ON_ONCE(p == NULL))			\
+				break;					\
+		}							\
+	} while (0)
+
 static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pud_t *pud, unsigned long addr, unsigned long end,
 		pgprot_t newprot, unsigned long cp_flags)
@@ -240,6 +297,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 
 		next = pmd_addr_end(addr, end);
 
+		change_pmd_prepare(vma, pmd, cp_flags);
 		/*
 		 * Automatic NUMA balancing walks the tables with mmap_lock
 		 * held for read. It's possible a parallel update to occur
@@ -305,6 +363,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 	pud = pud_offset(p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
+		change_prepare(vma, pud, pmd, addr, cp_flags);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		pages += change_pmd_range(vma, pud, addr, next, newprot,
@@ -325,6 +384,7 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
 	p4d = p4d_offset(pgd, addr);
 	do {
 		next = p4d_addr_end(addr, end);
+		change_prepare(vma, p4d, pud, addr, cp_flags);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		pages += change_pud_range(vma, p4d, addr, next, newprot,
@@ -350,6 +410,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	inc_tlb_flush_pending(mm);
 	do {
 		next = pgd_addr_end(addr, end);
+		change_prepare(vma, pgd, p4d, addr, cp_flags);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		pages += change_p4d_range(vma, pgd, addr, next, newprot,
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v6 09/23] mm/shmem: Allows file-back mem to be uffd wr-protected on thps
  2021-11-15  7:54 [PATCH v6 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (7 preceding siblings ...)
  2021-11-15  8:00 ` [PATCH v6 08/23] mm/shmem: Allow uffd wr-protect none pte for file-backed mem Peter Xu
@ 2021-11-15  8:00 ` Peter Xu
  2021-11-15  8:01 ` [PATCH v6 10/23] mm/shmem: Handle uffd-wp during fork() Peter Xu
                   ` (13 subsequent siblings)
  22 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2021-11-15  8:00 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, peterx, Alistair Popple, Andrew Morton, Mike Kravetz,
	Mike Rapoport, Matthew Wilcox, Jerome Glisse, Axel Rasmussen,
	Kirill A . Shutemov, David Hildenbrand, Andrea Arcangeli,
	Hugh Dickins

We don't have a "huge" version of pte markers; instead, when necessary, we
split the thp.

However, splitting the thp is not enough, because a file-backed thp is handled
totally differently from anonymous thps: rather than doing a real split, the
thp pmd will simply get cleared in __split_huge_pmd_locked().

That is not enough if, e.g., there is a thp covering the range [0, 2M) but we
want to wr-protect the small page residing in [4K, 8K), because after
__split_huge_pmd() returns there will be a none pmd, and change_pmd_range()
will just skip it right after the split.

Here we leverage the previously introduced change_pmd_prepare() macro so that
we'll populate the pmd with a pgtable page after the pmd split (during which
the pmd will be cleared for cases like shmem).  Then change_pte_range() will
do all the rest for us by installing the uffd-wp pte marker at any none pte
that we'd like to wr-protect.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/mprotect.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index be837c4dbc64..0d4bf755cee8 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -319,8 +319,15 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		}
 
 		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
-			if (next - addr != HPAGE_PMD_SIZE) {
+			if ((next - addr != HPAGE_PMD_SIZE) ||
+			    uffd_wp_protect_file(vma, cp_flags)) {
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
+				/*
+				 * For file-backed, the pmd could have been
+				 * cleared; make sure pmd populated if
+				 * necessary, then fall-through to pte level.
+				 */
+				change_pmd_prepare(vma, pmd, cp_flags);
 			} else {
 				int nr_ptes = change_huge_pmd(vma, pmd, addr,
 							      newprot, cp_flags);
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v6 10/23] mm/shmem: Handle uffd-wp during fork()
  2021-11-15  7:54 [PATCH v6 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (8 preceding siblings ...)
  2021-11-15  8:00 ` [PATCH v6 09/23] mm/shmem: Allows file-back mem to be uffd wr-protected on thps Peter Xu
@ 2021-11-15  8:01 ` Peter Xu
  2021-11-15  8:01 ` [PATCH v6 11/23] mm/hugetlb: Introduce huge pte version of uffd-wp helpers Peter Xu
                   ` (12 subsequent siblings)
  22 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2021-11-15  8:01 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, peterx, Alistair Popple, Andrew Morton, Mike Kravetz,
	Mike Rapoport, Matthew Wilcox, Jerome Glisse, Axel Rasmussen,
	Kirill A . Shutemov, David Hildenbrand, Andrea Arcangeli,
	Hugh Dickins

Normally we skip copying the pgtables during fork() for VM_SHARED shmem, but
we can't skip it anymore if uffd-wp is enabled on the dst vma.  This should
only happen when the src uffd has UFFD_FEATURE_EVENT_FORK enabled on a uffd-wp
shmem vma, so that VM_UFFD_WP will be propagated onto the dst vma too; then we
should copy the pgtables with the uffd-wp bit and pte markers, because that
information will be lost otherwise.
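
For illustration only (not part of this series), the monitor opts into this
behaviour at UFFDIO_API time; a minimal sketch of that handshake:

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <linux/userfaultfd.h>
  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  int main(void)
  {
          long uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
          struct uffdio_api api = {
                  .api = UFFD_API,
                  /*
                   * EVENT_FORK keeps the child's vma uffd-wp registered, which
                   * is the only case where fork() must copy shmem pgtables
                   * together with uffd-wp bits and pte markers.
                   */
                  .features = UFFD_FEATURE_PAGEFAULT_FLAG_WP |
                              UFFD_FEATURE_EVENT_FORK,
          };

          if (ioctl(uffd, UFFDIO_API, &api))
                  perror("UFFDIO_API");
          else
                  printf("wp faults + fork events enabled\n");
          return 0;
  }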

Since the condition checks will become even more complicated for deciding
"whether a vma needs to copy the pgtable during fork()", introduce a helper
vma_needs_copy() for it, so everything will be clearer.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/memory.c | 49 +++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 41 insertions(+), 8 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index fef6a91c5dfb..cc625c616645 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -859,6 +859,14 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		if (try_restore_exclusive_pte(src_pte, src_vma, addr))
 			return -EBUSY;
 		return -ENOENT;
+	} else if (is_pte_marker_entry(entry)) {
+		/*
+		 * We're copying the pgtable should only because dst_vma has
+		 * uffd-wp enabled, do sanity check.
+		 */
+		WARN_ON_ONCE(!userfaultfd_wp(dst_vma));
+		set_pte_at(dst_mm, addr, dst_pte, pte);
+		return 0;
 	}
 	if (!userfaultfd_wp(dst_vma))
 		pte = pte_swp_clear_uffd_wp(pte);
@@ -1227,6 +1235,38 @@ copy_p4d_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	return 0;
 }
 
+/*
+ * Return true if the vma needs to copy the pgtable during this fork().  Return
+ * false when we can speed up fork() by allowing lazy page faults later until
+ * when the child accesses the memory range.
+ */
+bool
+vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
+{
+	/*
+	 * Always copy pgtables when dst_vma has uffd-wp enabled even if it's
+	 * file-backed (e.g. shmem). Because when uffd-wp is enabled, pgtable
+	 * contains uffd-wp protection information, that's something we can't
+	 * retrieve from page cache, and skip copying will lose those info.
+	 */
+	if (userfaultfd_wp(dst_vma))
+		return true;
+
+	if (src_vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP))
+		return true;
+
+	if (src_vma->anon_vma)
+		return true;
+
+	/*
+	 * Don't copy ptes where a page fault will fill them correctly.  Fork
+	 * becomes much lighter when there are big shared or private readonly
+	 * mappings. The tradeoff is that copy_page_range is more efficient
+	 * than faulting.
+	 */
+	return false;
+}
+
 int
 copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 {
@@ -1240,14 +1280,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 	bool is_cow;
 	int ret;
 
-	/*
-	 * Don't copy ptes where a page fault will fill them correctly.
-	 * Fork becomes much lighter when there are big shared or private
-	 * readonly mappings. The tradeoff is that copy_page_range is more
-	 * efficient than faulting.
-	 */
-	if (!(src_vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP)) &&
-	    !src_vma->anon_vma)
+	if (!vma_needs_copy(dst_vma, src_vma))
 		return 0;
 
 	if (is_vm_hugetlb_page(src_vma))
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v6 11/23] mm/hugetlb: Introduce huge pte version of uffd-wp helpers
  2021-11-15  7:54 [PATCH v6 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (9 preceding siblings ...)
  2021-11-15  8:01 ` [PATCH v6 10/23] mm/shmem: Handle uffd-wp during fork() Peter Xu
@ 2021-11-15  8:01 ` Peter Xu
  2021-11-15  8:01 ` [PATCH v6 12/23] mm/hugetlb: Hook page faults for uffd write protection Peter Xu
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2021-11-15  8:01 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, peterx, Alistair Popple, Andrew Morton, Mike Kravetz,
	Mike Rapoport, Matthew Wilcox, Jerome Glisse, Axel Rasmussen,
	Kirill A . Shutemov, David Hildenbrand, Andrea Arcangeli,
	Hugh Dickins

They will be used in follow-up patches to check/set/clear the uffd-wp bit of a
huge pte.

So far they simply reuse the small pte helpers.  Archs can override these
versions when necessary (with __HAVE_ARCH_HUGE_PTE_UFFD_WP* macros) in the
future.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/s390/include/asm/hugetlb.h | 15 +++++++++++++++
 include/asm-generic/hugetlb.h   | 15 +++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
index 60f9241e5e4a..19c4b4431d27 100644
--- a/arch/s390/include/asm/hugetlb.h
+++ b/arch/s390/include/asm/hugetlb.h
@@ -115,6 +115,21 @@ static inline pte_t huge_pte_modify(pte_t pte, pgprot_t newprot)
 	return pte_modify(pte, newprot);
 }
 
+static inline pte_t huge_pte_mkuffd_wp(pte_t pte)
+{
+	return pte;
+}
+
+static inline pte_t huge_pte_clear_uffd_wp(pte_t pte)
+{
+	return pte;
+}
+
+static inline int huge_pte_uffd_wp(pte_t pte)
+{
+	return 0;
+}
+
 static inline bool gigantic_page_runtime_supported(void)
 {
 	return true;
diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
index f39cad20ffc6..896f341f614d 100644
--- a/include/asm-generic/hugetlb.h
+++ b/include/asm-generic/hugetlb.h
@@ -35,6 +35,21 @@ static inline pte_t huge_pte_modify(pte_t pte, pgprot_t newprot)
 	return pte_modify(pte, newprot);
 }
 
+static inline pte_t huge_pte_mkuffd_wp(pte_t pte)
+{
+	return pte_mkuffd_wp(pte);
+}
+
+static inline pte_t huge_pte_clear_uffd_wp(pte_t pte)
+{
+	return pte_clear_uffd_wp(pte);
+}
+
+static inline int huge_pte_uffd_wp(pte_t pte)
+{
+	return pte_uffd_wp(pte);
+}
+
 #ifndef __HAVE_ARCH_HUGE_PTE_CLEAR
 static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
 		    pte_t *ptep, unsigned long sz)
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v6 12/23] mm/hugetlb: Hook page faults for uffd write protection
  2021-11-15  7:54 [PATCH v6 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (10 preceding siblings ...)
  2021-11-15  8:01 ` [PATCH v6 11/23] mm/hugetlb: Introduce huge pte version of uffd-wp helpers Peter Xu
@ 2021-11-15  8:01 ` Peter Xu
  2021-11-15  8:01 ` [PATCH v6 13/23] mm/hugetlb: Take care of UFFDIO_COPY_MODE_WP Peter Xu
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2021-11-15  8:01 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, peterx, Alistair Popple, Andrew Morton, Mike Kravetz,
	Mike Rapoport, Matthew Wilcox, Jerome Glisse, Axel Rasmussen,
	Kirill A . Shutemov, David Hildenbrand, Andrea Arcangeli,
	Hugh Dickins

Hook up hugetlb_fault() with the capability to handle userfaultfd-wp faults.

We do this slightly earlier than hugetlb_cow() so that we can avoid taking some
extra locks that we definitely don't need.
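
As a rough sketch of the userspace counterpart (not part of this series;
handle_one_wp_fault() is an invented helper name, and page_size would be the
hugetlb page size for a hugetlbfs vma), the monitor receiving this VM_UFFD_WP
message would typically track the write and then lift the protection:

  #include <linux/userfaultfd.h>
  #include <sys/ioctl.h>
  #include <unistd.h>

  static int handle_one_wp_fault(int uffd, unsigned long page_size)
  {
          struct uffd_msg msg;

          if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
                  return -1;
          if (msg.event != UFFD_EVENT_PAGEFAULT ||
              !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP))
                  return 0;       /* not a write-protect fault */

          /*
           * Track the write however the application likes, then lift the
           * protection; mode 0 means "unprotect and wake the blocked writer".
           */
          struct uffdio_writeprotect wp = {
                  .range = {
                          .start = msg.arg.pagefault.address & ~(page_size - 1),
                          .len = page_size,
                  },
                  .mode = 0,
          };
          return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
  }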

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e09159c957e3..3a10274b2e39 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5658,6 +5658,25 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
 		goto out_ptl;
 
+	/* Handle userfault-wp first, before trying to lock more pages */
+	if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(ptep)) &&
+	    (flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
+		struct vm_fault vmf = {
+			.vma = vma,
+			.address = haddr,
+			.flags = flags,
+		};
+
+		spin_unlock(ptl);
+		if (pagecache_page) {
+			unlock_page(pagecache_page);
+			put_page(pagecache_page);
+		}
+		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+		i_mmap_unlock_read(mapping);
+		return handle_userfault(&vmf, VM_UFFD_WP);
+	}
+
 	/*
 	 * hugetlb_cow() requires page locks of pte_page(entry) and
 	 * pagecache_page, so here we need take the former one
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v6 13/23] mm/hugetlb: Take care of UFFDIO_COPY_MODE_WP
  2021-11-15  7:54 [PATCH v6 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (11 preceding siblings ...)
  2021-11-15  8:01 ` [PATCH v6 12/23] mm/hugetlb: Hook page faults for uffd write protection Peter Xu
@ 2021-11-15  8:01 ` Peter Xu
  2021-11-15  8:02 ` [PATCH v6 14/23] mm/hugetlb: Handle UFFDIO_WRITEPROTECT Peter Xu
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2021-11-15  8:01 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, peterx, Alistair Popple, Andrew Morton, Mike Kravetz,
	Mike Rapoport, Matthew Wilcox, Jerome Glisse, Axel Rasmussen,
	Kirill A . Shutemov, David Hildenbrand, Andrea Arcangeli,
	Hugh Dickins

Pass the wp_copy variable into hugetlb_mcopy_atomic_pte() throughout the
stack.  Apply the UFFD_WP bit if UFFDIO_COPY_MODE_WP is specified with
UFFDIO_COPY.

Hugetlb pages are only managed by hugetlbfs, so we're safe even without
setting the dirty bit in the huge pte if the page is installed as read-only.
However we'd better still keep the dirty bit set for a read-only UFFDIO_COPY
pte (when the UFFDIO_COPY_MODE_WP bit is set), not only to match what we do
with shmem, but also because the page does contain dirty data that the kernel
just copied from userspace.
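
For illustration (not part of this series; copy_page_wrprotected() is an
invented helper name), a monitor handling a MISSING fault on a range
registered with both MISSING and WP modes could install the page already
write-protected, roughly like:

  #include <linux/userfaultfd.h>
  #include <sys/ioctl.h>

  static int copy_page_wrprotected(int uffd, unsigned long dst,
                                   unsigned long src, unsigned long len)
  {
          struct uffdio_copy copy = {
                  .dst = dst,     /* (huge) page aligned address in the vma */
                  .src = src,     /* buffer holding the page contents */
                  .len = len,     /* the huge page size for hugetlbfs */
                  /* The installed pte carries the uffd-wp bit from day one. */
                  .mode = UFFDIO_COPY_MODE_WP,
          };

          return ioctl(uffd, UFFDIO_COPY, &copy);
  }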

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/hugetlb.h |  6 ++++--
 mm/hugetlb.c            | 29 +++++++++++++++++++++++------
 mm/userfaultfd.c        | 14 +++++++++-----
 3 files changed, 36 insertions(+), 13 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 00351ccb49a3..4da0c4b4159a 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -160,7 +160,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
 				unsigned long dst_addr,
 				unsigned long src_addr,
 				enum mcopy_atomic_mode mode,
-				struct page **pagep);
+				struct page **pagep,
+				bool wp_copy);
 #endif /* CONFIG_USERFAULTFD */
 bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
 						struct vm_area_struct *vma,
@@ -355,7 +356,8 @@ static inline int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 						unsigned long dst_addr,
 						unsigned long src_addr,
 						enum mcopy_atomic_mode mode,
-						struct page **pagep)
+						struct page **pagep,
+						bool wp_copy)
 {
 	BUG();
 	return 0;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3a10274b2e39..8146240eefc6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5740,7 +5740,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 			    unsigned long dst_addr,
 			    unsigned long src_addr,
 			    enum mcopy_atomic_mode mode,
-			    struct page **pagep)
+			    struct page **pagep,
+			    bool wp_copy)
 {
 	bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
 	struct hstate *h = hstate_vma(dst_vma);
@@ -5868,7 +5869,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		goto out_release_unlock;
 
 	ret = -EEXIST;
-	if (!huge_pte_none(huge_ptep_get(dst_pte)))
+	/*
+	 * We allow to overwrite a pte marker: consider when both MISSING|WP
+	 * registered, we firstly wr-protect a none pte which has no page cache
+	 * page backing it, then access the page.
+	 */
+	if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
 		goto out_release_unlock;
 
 	if (vm_shared) {
@@ -5878,17 +5884,28 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
 	}
 
-	/* For CONTINUE on a non-shared VMA, don't set VM_WRITE for CoW. */
-	if (is_continue && !vm_shared)
+	/*
+	 * For either: (1) CONTINUE on a non-shared VMA, or (2) UFFDIO_COPY
+	 * with wp flag set, don't set pte write bit.
+	 */
+	if (wp_copy || (is_continue && !vm_shared))
 		writable = 0;
 	else
 		writable = dst_vma->vm_flags & VM_WRITE;
 
 	_dst_pte = make_huge_pte(dst_vma, page, writable);
-	if (writable)
-		_dst_pte = huge_pte_mkdirty(_dst_pte);
+	/*
+	 * Always mark UFFDIO_COPY page dirty; note that this may not be
+	 * extremely important for hugetlbfs for now since swapping is not
+	 * supported, but we should still be clear in that this page cannot be
+	 * thrown away at will, even if write bit not set.
+	 */
+	_dst_pte = huge_pte_mkdirty(_dst_pte);
 	_dst_pte = pte_mkyoung(_dst_pte);
 
+	if (wp_copy)
+		_dst_pte = huge_pte_mkuffd_wp(_dst_pte);
+
 	set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
 
 	(void)huge_ptep_set_access_flags(dst_vma, dst_addr, dst_pte, _dst_pte,
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 95e5a9ba3196..6174a212c72f 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -291,7 +291,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 					      unsigned long dst_start,
 					      unsigned long src_start,
 					      unsigned long len,
-					      enum mcopy_atomic_mode mode)
+					      enum mcopy_atomic_mode mode,
+					      bool wp_copy)
 {
 	int vm_shared = dst_vma->vm_flags & VM_SHARED;
 	ssize_t err;
@@ -379,7 +380,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		}
 
 		if (mode != MCOPY_ATOMIC_CONTINUE &&
-		    !huge_pte_none(huge_ptep_get(dst_pte))) {
+		    !huge_pte_none_mostly(huge_ptep_get(dst_pte))) {
 			err = -EEXIST;
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			i_mmap_unlock_read(mapping);
@@ -387,7 +388,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		}
 
 		err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
-					       dst_addr, src_addr, mode, &page);
+					       dst_addr, src_addr, mode, &page,
+					       wp_copy);
 
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		i_mmap_unlock_read(mapping);
@@ -442,7 +444,8 @@ extern ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 				      unsigned long dst_start,
 				      unsigned long src_start,
 				      unsigned long len,
-				      enum mcopy_atomic_mode mode);
+				      enum mcopy_atomic_mode mode,
+				      bool wp_copy);
 #endif /* CONFIG_HUGETLB_PAGE */
 
 static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
@@ -562,7 +565,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	 */
 	if (is_vm_hugetlb_page(dst_vma))
 		return  __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
-						src_start, len, mcopy_mode);
+					       src_start, len, mcopy_mode,
+					       wp_copy);
 
 	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
 		goto out_unlock;
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v6 14/23] mm/hugetlb: Handle UFFDIO_WRITEPROTECT
  2021-11-15  7:54 [PATCH v6 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (12 preceding siblings ...)
  2021-11-15  8:01 ` [PATCH v6 13/23] mm/hugetlb: Take care of UFFDIO_COPY_MODE_WP Peter Xu
@ 2021-11-15  8:02 ` Peter Xu
  2021-11-15  8:02 ` [PATCH v6 15/23] mm/hugetlb: Handle pte markers in page faults Peter Xu
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2021-11-15  8:02 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, peterx, Alistair Popple, Andrew Morton, Mike Kravetz,
	Mike Rapoport, Matthew Wilcox, Jerome Glisse, Axel Rasmussen,
	Kirill A . Shutemov, David Hildenbrand, Andrea Arcangeli,
	Hugh Dickins

This starts by passing cp_flags into hugetlb_change_protection() so hugetlb
will be able to handle MM_CP_UFFD_WP[_RESOLVE] requests.

huge_pte_clear_uffd_wp() is introduced to handle the case where
UFFDIO_WRITEPROTECT is requested upon migrating huge page entries.
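
Illustratively (not part of this series; wrprotect_hugetlb_range() is an
invented helper name), a caller would align wr-protect requests to the huge
page size, since unaligned requests are rejected with -EINVAL as per the
mm/userfaultfd.c hunk below:

  #include <linux/userfaultfd.h>
  #include <sys/ioctl.h>

  static int wrprotect_hugetlb_range(int uffd, unsigned long start,
                                     unsigned long len, unsigned long hpage_size)
  {
          /* Unaligned start/len is rejected with -EINVAL (see the hunk below). */
          if ((start | len) & (hpage_size - 1))
                  return -1;

          struct uffdio_writeprotect wp = {
                  .range = { .start = start, .len = len },
                  .mode = UFFDIO_WRITEPROTECT_MODE_WP,
          };
          return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
  }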

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/hugetlb.h |  6 ++++--
 mm/hugetlb.c            | 13 ++++++++++++-
 mm/mprotect.c           |  3 ++-
 mm/userfaultfd.c        |  8 ++++++++
 4 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 4da0c4b4159a..a46011510e49 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -210,7 +210,8 @@ struct page *follow_huge_pgd(struct mm_struct *mm, unsigned long address,
 int pmd_huge(pmd_t pmd);
 int pud_huge(pud_t pud);
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
-		unsigned long address, unsigned long end, pgprot_t newprot);
+		unsigned long address, unsigned long end, pgprot_t newprot,
+		unsigned long cp_flags);
 
 bool is_hugetlb_entry_migration(pte_t pte);
 void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
@@ -391,7 +392,8 @@ static inline void move_hugetlb_state(struct page *oldpage,
 
 static inline unsigned long hugetlb_change_protection(
 			struct vm_area_struct *vma, unsigned long address,
-			unsigned long end, pgprot_t newprot)
+			unsigned long end, pgprot_t newprot,
+			unsigned long cp_flags)
 {
 	return 0;
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8146240eefc6..7fc213c0ebf8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6127,7 +6127,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
-		unsigned long address, unsigned long end, pgprot_t newprot)
+		unsigned long address, unsigned long end,
+		pgprot_t newprot, unsigned long cp_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long start = address;
@@ -6137,6 +6138,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	unsigned long pages = 0;
 	bool shared_pmd = false;
 	struct mmu_notifier_range range;
+	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
+	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
 
 	/*
 	 * In the case of shared PMDs, the area to flush could be beyond
@@ -6178,6 +6181,10 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 				entry = make_readable_migration_entry(
 							swp_offset(entry));
 				newpte = swp_entry_to_pte(entry);
+				if (uffd_wp)
+					newpte = pte_swp_mkuffd_wp(newpte);
+				else if (uffd_wp_resolve)
+					newpte = pte_swp_clear_uffd_wp(newpte);
 				set_huge_swap_pte_at(mm, address, ptep,
 						     newpte, huge_page_size(h));
 				pages++;
@@ -6192,6 +6199,10 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 			old_pte = huge_ptep_modify_prot_start(vma, address, ptep);
 			pte = pte_mkhuge(huge_pte_modify(old_pte, newprot));
 			pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
+			if (uffd_wp)
+				pte = huge_pte_mkuffd_wp(huge_pte_wrprotect(pte));
+			else if (uffd_wp_resolve)
+				pte = huge_pte_clear_uffd_wp(pte);
 			huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
 			pages++;
 		}
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 0d4bf755cee8..1cc4a6d1886b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -441,7 +441,8 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 	BUG_ON((cp_flags & MM_CP_UFFD_WP_ALL) == MM_CP_UFFD_WP_ALL);
 
 	if (is_vm_hugetlb_page(vma))
-		pages = hugetlb_change_protection(vma, start, end, newprot);
+		pages = hugetlb_change_protection(vma, start, end, newprot,
+						  cp_flags);
 	else
 		pages = change_protection_range(vma, start, end, newprot,
 						cp_flags);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 6174a212c72f..037f82719e64 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -690,6 +690,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 			atomic_t *mmap_changing)
 {
 	struct vm_area_struct *dst_vma;
+	unsigned long page_mask;
 	pgprot_t newprot;
 	int err;
 
@@ -726,6 +727,13 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 	if (!vma_is_anonymous(dst_vma))
 		goto out_unlock;
 
+	if (is_vm_hugetlb_page(dst_vma)) {
+		err = -EINVAL;
+		page_mask = vma_kernel_pagesize(dst_vma) - 1;
+		if ((start & page_mask) || (len & page_mask))
+			goto out_unlock;
+	}
+
 	if (enable_wp)
 		newprot = vm_get_page_prot(dst_vma->vm_flags & ~(VM_WRITE));
 	else
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v6 15/23] mm/hugetlb: Handle pte markers in page faults
  2021-11-15  7:54 [PATCH v6 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (13 preceding siblings ...)
  2021-11-15  8:02 ` [PATCH v6 14/23] mm/hugetlb: Handle UFFDIO_WRITEPROTECT Peter Xu
@ 2021-11-15  8:02 ` Peter Xu
  2021-11-15  8:02 ` [PATCH v6 16/23] mm/hugetlb: Allow uffd wr-protect none ptes Peter Xu
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2021-11-15  8:02 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, peterx, Alistair Popple, Andrew Morton, Mike Kravetz,
	Mike Rapoport, Matthew Wilcox, Jerome Glisse, Axel Rasmussen,
	Kirill A . Shutemov, David Hildenbrand, Andrea Arcangeli,
	Hugh Dickins

Allow hugetlb code to handle pte markers just like none ptes.  It's mostly
there; we just need to make sure we don't assume hugetlb_no_page() only
handles none ptes, so when detecting a pte change we should use pte_same()
rather than pte_none().  We need to pass in the old_pte to do the comparison.

Check the original pte to see whether it's a pte marker; if it is, we should
recover the uffd-wp bit on the new pte to be installed, so that the next write
will be trapped by uffd.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7fc213c0ebf8..e8d01277af0f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5361,7 +5361,8 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
 static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			struct vm_area_struct *vma,
 			struct address_space *mapping, pgoff_t idx,
-			unsigned long address, pte_t *ptep, unsigned int flags)
+			unsigned long address, pte_t *ptep,
+			pte_t old_pte, unsigned int flags)
 {
 	struct hstate *h = hstate_vma(vma);
 	vm_fault_t ret = VM_FAULT_SIGBUS;
@@ -5487,7 +5488,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 
 	ptl = huge_pte_lock(h, mm, ptep);
 	ret = 0;
-	if (!huge_pte_none(huge_ptep_get(ptep)))
+	/* If pte changed from under us, retry */
+	if (!pte_same(huge_ptep_get(ptep), old_pte))
 		goto backout;
 
 	if (anon_rmap) {
@@ -5497,6 +5499,12 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		page_dup_rmap(page, true);
 	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
 				&& (vma->vm_flags & VM_SHARED)));
+	/*
+	 * If this pte was previously wr-protected, keep it wr-protected even
+	 * if populated.
+	 */
+	if (unlikely(is_pte_marker_uffd_wp(old_pte)))
+		new_pte = huge_pte_wrprotect(huge_pte_mkuffd_wp(new_pte));
 	set_huge_pte_at(mm, haddr, ptep, new_pte);
 
 	hugetlb_count_add(pages_per_huge_page(h), mm);
@@ -5614,8 +5622,10 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 	entry = huge_ptep_get(ptep);
-	if (huge_pte_none(entry)) {
-		ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep, flags);
+	/* PTE markers should be handled the same way as none pte */
+	if (huge_pte_none_mostly(entry)) {
+		ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
+				      entry, flags);
 		goto out_mutex;
 	}
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v6 16/23] mm/hugetlb: Allow uffd wr-protect none ptes
  2021-11-15  7:54 [PATCH v6 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (14 preceding siblings ...)
  2021-11-15  8:02 ` [PATCH v6 15/23] mm/hugetlb: Handle pte markers in page faults Peter Xu
@ 2021-11-15  8:02 ` Peter Xu
  2021-11-15  8:02 ` [PATCH v6 17/23] mm/hugetlb: Only drop uffd-wp special pte if required Peter Xu
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2021-11-15  8:02 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, peterx, Alistair Popple, Andrew Morton, Mike Kravetz,
	Mike Rapoport, Matthew Wilcox, Jerome Glisse, Axel Rasmussen,
	Kirill A . Shutemov, David Hildenbrand, Andrea Arcangeli,
	Hugh Dickins

Teach hugetlbfs code to wr-protect none ptes just in case the page cache
exists for that pte.  Meanwhile we also need to be able to recognize a uffd-wp
marker pte and remove it for uffd_wp_resolve.

While at it, introduce a variable "psize" to replace all references to the
huge page size fetcher.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 28 ++++++++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e8d01277af0f..bba2ede5f6dc 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6145,7 +6145,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	pte_t *ptep;
 	pte_t pte;
 	struct hstate *h = hstate_vma(vma);
-	unsigned long pages = 0;
+	unsigned long pages = 0, psize = huge_page_size(h);
 	bool shared_pmd = false;
 	struct mmu_notifier_range range;
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
@@ -6165,13 +6165,19 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 
 	mmu_notifier_invalidate_range_start(&range);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
-	for (; address < end; address += huge_page_size(h)) {
+	for (; address < end; address += psize) {
 		spinlock_t *ptl;
-		ptep = huge_pte_offset(mm, address, huge_page_size(h));
+		ptep = huge_pte_offset(mm, address, psize);
 		if (!ptep)
 			continue;
 		ptl = huge_pte_lock(h, mm, ptep);
 		if (huge_pmd_unshare(mm, vma, &address, ptep)) {
+			/*
+			 * When uffd-wp is enabled on the vma, unshare
+			 * shouldn't happen at all.  Warn about it if it
+			 * happened due to some reason.
+			 */
+			WARN_ON_ONCE(uffd_wp || uffd_wp_resolve);
 			pages++;
 			spin_unlock(ptl);
 			shared_pmd = true;
@@ -6196,12 +6202,20 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 				else if (uffd_wp_resolve)
 					newpte = pte_swp_clear_uffd_wp(newpte);
 				set_huge_swap_pte_at(mm, address, ptep,
-						     newpte, huge_page_size(h));
+						     newpte, psize);
 				pages++;
 			}
 			spin_unlock(ptl);
 			continue;
 		}
+		if (unlikely(is_pte_marker_uffd_wp(pte))) {
+			/*
+			 * This is changing a non-present pte into a none pte,
+			 * no need for huge_ptep_modify_prot_start/commit().
+			 */
+			if (uffd_wp_resolve)
+				huge_pte_clear(mm, address, ptep, psize);
+		}
 		if (!huge_pte_none(pte)) {
 			pte_t old_pte;
 			unsigned int shift = huge_page_shift(hstate_vma(vma));
@@ -6215,6 +6229,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 				pte = huge_pte_clear_uffd_wp(pte);
 			huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
 			pages++;
+		} else {
+			/* None pte */
+			if (unlikely(uffd_wp))
+				/* Safe to modify directly (none->non-present). */
+				set_huge_pte_at(mm, address, ptep,
+						make_pte_marker(PTE_MARKER_UFFD_WP));
 		}
 		spin_unlock(ptl);
 	}
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v6 17/23] mm/hugetlb: Only drop uffd-wp special pte if required
  2021-11-15  7:54 [PATCH v6 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (15 preceding siblings ...)
  2021-11-15  8:02 ` [PATCH v6 16/23] mm/hugetlb: Allow uffd wr-protect none ptes Peter Xu
@ 2021-11-15  8:02 ` Peter Xu
  2021-11-15  8:02 ` [PATCH v6 18/23] mm/hugetlb: Handle uffd-wp during fork() Peter Xu
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2021-11-15  8:02 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, peterx, Alistair Popple, Andrew Morton, Mike Kravetz,
	Mike Rapoport, Matthew Wilcox, Jerome Glisse, Axel Rasmussen,
	Kirill A . Shutemov, David Hildenbrand, Andrea Arcangeli,
	Hugh Dickins

As with shmem uffd-wp special ptes, only drop the uffd-wp special swap pte if
unmapping an entire vma or if it is synchronized such that faults cannot race
with the unmap operation.  This requires passing zap_flags all the way to the
lowest level hugetlb unmap routine: __unmap_hugepage_range.

In general, unmap calls originating in hugetlbfs code will pass the
ZAP_FLAG_DROP_MARKER flag, as synchronization is in place to prevent faults.
The exception is hole punch, which will first unmap without any
synchronization.  Later, when hole punch actually removes the page from the
file, it will check to see if there was a subsequent fault and, if so, take
the hugetlb fault mutex while unmapping again.  This second unmap will pass in
ZAP_FLAG_DROP_MARKER.

The justification for "whether to apply the ZAP_FLAG_DROP_MARKER flag when
unmapping a hugetlb range" is (IMHO): we should never reach a state where a
page fault could erroneously fault in a wr-protected page-cache page as
writable, even for an extremely short period.  That could happen if e.g. we
passed ZAP_FLAG_DROP_MARKER when hugetlbfs_punch_hole() calls
hugetlb_vmdelete_list(): if a page faults after that call and before
remove_inode_hugepages() is executed, the page cache can be mapped writable
again in that small racy window, which can cause unexpected data to be
overwritten.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/hugetlbfs/inode.c    | 15 +++++++++------
 include/linux/hugetlb.h |  8 +++++---
 mm/hugetlb.c            | 33 +++++++++++++++++++++++++--------
 mm/memory.c             |  5 ++++-
 4 files changed, 43 insertions(+), 18 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 49d2e686be74..92c8d1a47404 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -404,7 +404,8 @@ static void remove_huge_page(struct page *page)
 }
 
 static void
-hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
+hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
+		      unsigned long zap_flags)
 {
 	struct vm_area_struct *vma;
 
@@ -437,7 +438,7 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
 		}
 
 		unmap_hugepage_range(vma, vma->vm_start + v_offset, v_end,
-									NULL);
+				     NULL, zap_flags);
 	}
 }
 
@@ -515,7 +516,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 				mutex_lock(&hugetlb_fault_mutex_table[hash]);
 				hugetlb_vmdelete_list(&mapping->i_mmap,
 					index * pages_per_huge_page(h),
-					(index + 1) * pages_per_huge_page(h));
+					(index + 1) * pages_per_huge_page(h),
+					ZAP_FLAG_DROP_MARKER);
 				i_mmap_unlock_write(mapping);
 			}
 
@@ -581,7 +583,8 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	i_mmap_lock_write(mapping);
 	i_size_write(inode, offset);
 	if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
-		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
+		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0,
+				      ZAP_FLAG_DROP_MARKER);
 	i_mmap_unlock_write(mapping);
 	remove_inode_hugepages(inode, offset, LLONG_MAX);
 }
@@ -614,8 +617,8 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 		i_mmap_lock_write(mapping);
 		if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
 			hugetlb_vmdelete_list(&mapping->i_mmap,
-						hole_start >> PAGE_SHIFT,
-						hole_end  >> PAGE_SHIFT);
+					      hole_start >> PAGE_SHIFT,
+					      hole_end >> PAGE_SHIFT, 0);
 		i_mmap_unlock_write(mapping);
 		remove_inode_hugepages(inode, hole_start, hole_end);
 		inode_unlock(inode);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index a46011510e49..4c3ea7ee8ce8 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -143,11 +143,12 @@ long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
 			 unsigned long *, unsigned long *, long, unsigned int,
 			 int *);
 void unmap_hugepage_range(struct vm_area_struct *,
-			  unsigned long, unsigned long, struct page *);
+			  unsigned long, unsigned long, struct page *,
+			  unsigned long);
 void __unmap_hugepage_range_final(struct mmu_gather *tlb,
 			  struct vm_area_struct *vma,
 			  unsigned long start, unsigned long end,
-			  struct page *ref_page);
+			  struct page *ref_page, unsigned long zap_flags);
 void hugetlb_report_meminfo(struct seq_file *);
 int hugetlb_report_node_meminfo(char *buf, int len, int nid);
 void hugetlb_show_meminfo(void);
@@ -400,7 +401,8 @@ static inline unsigned long hugetlb_change_protection(
 
 static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
 			struct vm_area_struct *vma, unsigned long start,
-			unsigned long end, struct page *ref_page)
+			unsigned long end, struct page *ref_page,
+			unsigned long zap_flags)
 {
 	BUG();
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bba2ede5f6dc..16fb9cd8d9c5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4926,7 +4926,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
 
 static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 				   unsigned long start, unsigned long end,
-				   struct page *ref_page)
+				   struct page *ref_page, unsigned long zap_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
@@ -4983,7 +4983,18 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 		 * unmapped and its refcount is dropped, so just clear pte here.
 		 */
 		if (unlikely(!pte_present(pte))) {
-			huge_pte_clear(mm, address, ptep, sz);
+			/*
+			 * If the pte was wr-protected by uffd-wp in any of the
+			 * swap forms, meanwhile the caller does not want to
+			 * drop the uffd-wp bit in this zap, then replace the
+			 * pte with a marker.
+			 */
+			if (pte_swp_uffd_wp_any(pte) &&
+			    !(zap_flags & ZAP_FLAG_DROP_MARKER))
+				set_huge_pte_at(mm, address, ptep,
+						make_pte_marker(PTE_MARKER_UFFD_WP));
+			else
+				huge_pte_clear(mm, address, ptep, sz);
 			spin_unlock(ptl);
 			continue;
 		}
@@ -5011,7 +5022,11 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 		tlb_remove_huge_tlb_entry(h, tlb, ptep, address);
 		if (huge_pte_dirty(pte))
 			set_page_dirty(page);
-
+		/* Leave a uffd-wp pte marker if needed */
+		if (huge_pte_uffd_wp(pte) &&
+		    !(zap_flags & ZAP_FLAG_DROP_MARKER))
+			set_huge_pte_at(mm, address, ptep,
+					make_pte_marker(PTE_MARKER_UFFD_WP));
 		hugetlb_count_sub(pages_per_huge_page(h), mm);
 		page_remove_rmap(page, true);
 
@@ -5029,9 +5044,10 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 
 void __unmap_hugepage_range_final(struct mmu_gather *tlb,
 			  struct vm_area_struct *vma, unsigned long start,
-			  unsigned long end, struct page *ref_page)
+			  unsigned long end, struct page *ref_page,
+			  unsigned long zap_flags)
 {
-	__unmap_hugepage_range(tlb, vma, start, end, ref_page);
+	__unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);
 
 	/*
 	 * Clear this flag so that x86's huge_pmd_share page_table_shareable
@@ -5047,12 +5063,13 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb,
 }
 
 void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
-			  unsigned long end, struct page *ref_page)
+			  unsigned long end, struct page *ref_page,
+			  unsigned long zap_flags)
 {
 	struct mmu_gather tlb;
 
 	tlb_gather_mmu(&tlb, vma->vm_mm);
-	__unmap_hugepage_range(&tlb, vma, start, end, ref_page);
+	__unmap_hugepage_range(&tlb, vma, start, end, ref_page, zap_flags);
 	tlb_finish_mmu(&tlb);
 }
 
@@ -5107,7 +5124,7 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
 		 */
 		if (!is_vma_resv_set(iter_vma, HPAGE_RESV_OWNER))
 			unmap_hugepage_range(iter_vma, address,
-					     address + huge_page_size(h), page);
+					     address + huge_page_size(h), page, 0);
 	}
 	i_mmap_unlock_write(mapping);
 }
diff --git a/mm/memory.c b/mm/memory.c
index cc625c616645..69a73d47513b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1631,8 +1631,11 @@ static void unmap_single_vma(struct mmu_gather *tlb,
 			 * safe to do nothing in this case.
 			 */
 			if (vma->vm_file) {
+				unsigned long zap_flags = details ?
+				    details->zap_flags : 0;
 				i_mmap_lock_write(vma->vm_file->f_mapping);
-				__unmap_hugepage_range_final(tlb, vma, start, end, NULL);
+				__unmap_hugepage_range_final(tlb, vma, start, end,
+							     NULL, zap_flags);
 				i_mmap_unlock_write(vma->vm_file->f_mapping);
 			}
 		} else
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v6 18/23] mm/hugetlb: Handle uffd-wp during fork()
  2021-11-15  7:54 [PATCH v6 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (16 preceding siblings ...)
  2021-11-15  8:02 ` [PATCH v6 17/23] mm/hugetlb: Only drop uffd-wp special pte if required Peter Xu
@ 2021-11-15  8:02 ` Peter Xu
  2021-11-15  8:03 ` [PATCH v6 19/23] mm/khugepaged: Don't recycle vma pgtable if uffd-wp registered Peter Xu
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2021-11-15  8:02 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, peterx, Alistair Popple, Andrew Morton, Mike Kravetz,
	Mike Rapoport, Matthew Wilcox, Jerome Glisse, Axel Rasmussen,
	Kirill A . Shutemov, David Hildenbrand, Andrea Arcangeli,
	Hugh Dickins

Firstly, we'll need to pass dst_vma into copy_hugetlb_page_range(), because
for uffd-wp it's the dst vma that matters when deciding how we should treat
uffd-wp protected ptes.

We should recognize pte markers during fork and do the pte copy if needed.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/hugetlb.h |  7 +++++--
 mm/hugetlb.c            | 41 +++++++++++++++++++++++++++--------------
 mm/memory.c             |  2 +-
 3 files changed, 33 insertions(+), 17 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 4c3ea7ee8ce8..6935b02f1081 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -137,7 +137,8 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
 			     struct vm_area_struct *new_vma,
 			     unsigned long old_addr, unsigned long new_addr,
 			     unsigned long len);
-int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
+int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *,
+			    struct vm_area_struct *, struct vm_area_struct *);
 long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
 			 struct page **, struct vm_area_struct **,
 			 unsigned long *, unsigned long *, long, unsigned int,
@@ -268,7 +269,9 @@ static inline struct page *follow_huge_addr(struct mm_struct *mm,
 }
 
 static inline int copy_hugetlb_page_range(struct mm_struct *dst,
-			struct mm_struct *src, struct vm_area_struct *vma)
+					  struct mm_struct *src,
+					  struct vm_area_struct *dst_vma,
+					  struct vm_area_struct *src_vma)
 {
 	BUG();
 	return 0;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 16fb9cd8d9c5..cf9a0e8c32ba 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4690,23 +4690,24 @@ hugetlb_install_page(struct vm_area_struct *vma, pte_t *ptep, unsigned long addr
 }
 
 int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
-			    struct vm_area_struct *vma)
+			    struct vm_area_struct *dst_vma,
+			    struct vm_area_struct *src_vma)
 {
 	pte_t *src_pte, *dst_pte, entry, dst_entry;
 	struct page *ptepage;
 	unsigned long addr;
-	bool cow = is_cow_mapping(vma->vm_flags);
-	struct hstate *h = hstate_vma(vma);
+	bool cow = is_cow_mapping(src_vma->vm_flags);
+	struct hstate *h = hstate_vma(src_vma);
 	unsigned long sz = huge_page_size(h);
 	unsigned long npages = pages_per_huge_page(h);
-	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct address_space *mapping = src_vma->vm_file->f_mapping;
 	struct mmu_notifier_range range;
 	int ret = 0;
 
 	if (cow) {
-		mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, src,
-					vma->vm_start,
-					vma->vm_end);
+		mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, src_vma, src,
+					src_vma->vm_start,
+					src_vma->vm_end);
 		mmu_notifier_invalidate_range_start(&range);
 	} else {
 		/*
@@ -4718,12 +4719,12 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		i_mmap_lock_read(mapping);
 	}
 
-	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
+	for (addr = src_vma->vm_start; addr < src_vma->vm_end; addr += sz) {
 		spinlock_t *src_ptl, *dst_ptl;
 		src_pte = huge_pte_offset(src, addr, sz);
 		if (!src_pte)
 			continue;
-		dst_pte = huge_pte_alloc(dst, vma, addr, sz);
+		dst_pte = huge_pte_alloc(dst, dst_vma, addr, sz);
 		if (!dst_pte) {
 			ret = -ENOMEM;
 			break;
@@ -4758,6 +4759,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		} else if (unlikely(is_hugetlb_entry_migration(entry) ||
 				    is_hugetlb_entry_hwpoisoned(entry))) {
 			swp_entry_t swp_entry = pte_to_swp_entry(entry);
+			bool uffd_wp = huge_pte_uffd_wp(entry);
 
 			if (is_writable_migration_entry(swp_entry) && cow) {
 				/*
@@ -4767,10 +4769,21 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				swp_entry = make_readable_migration_entry(
 							swp_offset(swp_entry));
 				entry = swp_entry_to_pte(swp_entry);
+				if (userfaultfd_wp(src_vma) && uffd_wp)
+					entry = huge_pte_mkuffd_wp(entry);
 				set_huge_swap_pte_at(src, addr, src_pte,
 						     entry, sz);
 			}
+			if (!userfaultfd_wp(dst_vma) && uffd_wp)
+				entry = huge_pte_clear_uffd_wp(entry);
 			set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
+		} else if (unlikely(is_pte_marker(entry))) {
+			/*
+			 * We copy the pte marker only if the dst vma has
+			 * uffd-wp enabled.
+			 */
+			if (userfaultfd_wp(dst_vma))
+				set_huge_pte_at(dst, addr, dst_pte, entry);
 		} else {
 			entry = huge_ptep_get(src_pte);
 			ptepage = pte_page(entry);
@@ -4785,20 +4798,20 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			 * need to be without the pgtable locks since we could
 			 * sleep during the process.
 			 */
-			if (unlikely(page_needs_cow_for_dma(vma, ptepage))) {
+			if (unlikely(page_needs_cow_for_dma(src_vma, ptepage))) {
 				pte_t src_pte_old = entry;
 				struct page *new;
 
 				spin_unlock(src_ptl);
 				spin_unlock(dst_ptl);
 				/* Do not use reserve as it's private owned */
-				new = alloc_huge_page(vma, addr, 1);
+				new = alloc_huge_page(dst_vma, addr, 1);
 				if (IS_ERR(new)) {
 					put_page(ptepage);
 					ret = PTR_ERR(new);
 					break;
 				}
-				copy_user_huge_page(new, ptepage, addr, vma,
+				copy_user_huge_page(new, ptepage, addr, dst_vma,
 						    npages);
 				put_page(ptepage);
 
@@ -4808,13 +4821,13 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 				entry = huge_ptep_get(src_pte);
 				if (!pte_same(src_pte_old, entry)) {
-					restore_reserve_on_error(h, vma, addr,
+					restore_reserve_on_error(h, dst_vma, addr,
 								new);
 					put_page(new);
 					/* dst_entry won't change as in child */
 					goto again;
 				}
-				hugetlb_install_page(vma, dst_pte, addr, new);
+				hugetlb_install_page(dst_vma, dst_pte, addr, new);
 				spin_unlock(src_ptl);
 				spin_unlock(dst_ptl);
 				continue;
diff --git a/mm/memory.c b/mm/memory.c
index 69a73d47513b..89715d1ec956 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1284,7 +1284,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 		return 0;
 
 	if (is_vm_hugetlb_page(src_vma))
-		return copy_hugetlb_page_range(dst_mm, src_mm, src_vma);
+		return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma);
 
 	if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
 		/*
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v6 19/23] mm/khugepaged: Don't recycle vma pgtable if uffd-wp registered
  2021-11-15  7:54 [PATCH v6 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (17 preceding siblings ...)
  2021-11-15  8:02 ` [PATCH v6 18/23] mm/hugetlb: Handle uffd-wp during fork() Peter Xu
@ 2021-11-15  8:03 ` Peter Xu
  2021-11-15  8:03 ` [PATCH v6 20/23] mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs Peter Xu
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2021-11-15  8:03 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, peterx, Alistair Popple, Andrew Morton, Mike Kravetz,
	Mike Rapoport, Matthew Wilcox, Jerome Glisse, Axel Rasmussen,
	Kirill A . Shutemov, David Hildenbrand, Andrea Arcangeli,
	Hugh Dickins

When we're trying to collapse a 2M huge shmem page, don't retract the pgtable
pmd page if the vma is registered with uffd-wp, because that pgtable could
have pte markers installed.  Recycling that pgtable means we'll lose the pte
markers, which could cause data loss for a uffd-wp enabled application on
shmem.

Instead of disabling khugepaged on these files, simply skip retracting these
special VMAs; then the page cache can still be merged into a huge thp, and
other mms/vmas can still map the range of the file with a huge thp when
appropriate.

Note that checking VM_UFFD_WP needs to be done with mmap_sem held for write,
which avoids races like:

         khugepaged                             user thread
         ==========                             ===========
     check VM_UFFD_WP, not set
                                       UFFDIO_REGISTER with uffd-wp on shmem
                                       wr-protect some pages (install markers)
     take mmap_sem write lock
     erase pmd and free pmd page
      --> pte markers are dropped unnoticed!

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/khugepaged.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e99101162f1a..9c75153a36de 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1454,6 +1454,10 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 	if (!hugepage_vma_check(vma, vma->vm_flags | VM_HUGEPAGE))
 		return;
 
+	/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
+	if (userfaultfd_wp(vma))
+		return;
+
 	hpage = find_lock_page(vma->vm_file->f_mapping,
 			       linear_page_index(vma, haddr));
 	if (!hpage)
@@ -1594,7 +1598,15 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		 * reverse order. Trylock is a way to avoid deadlock.
 		 */
 		if (mmap_write_trylock(mm)) {
-			if (!khugepaged_test_exit(mm)) {
+			/*
+			 * When a vma is registered with uffd-wp, we can't
+			 * recycle the pmd pgtable because there can be pte
+			 * markers installed.  Skip it only, so the rest mm/vma
+			 * can still have the same file mapped hugely, however
+			 * it'll always mapped in small page size for uffd-wp
+			 * registered ranges.
+			 */
+			if (!khugepaged_test_exit(mm) && !userfaultfd_wp(vma)) {
 				spinlock_t *ptl = pmd_lock(mm, pmd);
 				/* assume page table is clear */
 				_pmd = pmdp_collapse_flush(vma, addr, pmd);
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v6 20/23] mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs
  2021-11-15  7:54 [PATCH v6 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (18 preceding siblings ...)
  2021-11-15  8:03 ` [PATCH v6 19/23] mm/khugepaged: Don't recycle vma pgtable if uffd-wp registered Peter Xu
@ 2021-11-15  8:03 ` Peter Xu
  2021-11-15  8:03 ` [PATCH v6 21/23] mm/uffd: Enable write protection for shmem & hugetlbfs Peter Xu
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2021-11-15  8:03 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, peterx, Alistair Popple, Andrew Morton, Mike Kravetz,
	Mike Rapoport, Matthew Wilcox, Jerome Glisse, Axel Rasmussen,
	Kirill A . Shutemov, David Hildenbrand, Andrea Arcangeli,
	Hugh Dickins

This requires the pagemap code to be able to recognize the newly introduced
swap special pte for uffd-wp, as well as the general hugetlb case that we
recently started to support.  It should make pagemap's uffd-wp support
complete.
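
As an illustration of the user-visible effect (not part of this series), a
process could test the bit for one of its own pages via /proc/self/pagemap;
the bit position below is taken from the pagemap documentation and is an
assumption of this sketch, not something this patch defines:

  #include <fcntl.h>
  #include <stdint.h>
  #include <sys/types.h>
  #include <unistd.h>

  /* Bit 57 is documented as the uffd-wp write-protect bit in pagemap. */
  #define PM_UFFD_WP      (1ULL << 57)

  static int page_is_uffd_wp(void *addr)
  {
          uint64_t entry;
          int fd = open("/proc/self/pagemap", O_RDONLY);
          off_t off = ((uintptr_t)addr / sysconf(_SC_PAGESIZE)) * sizeof(entry);

          if (fd < 0 || pread(fd, &entry, sizeof(entry), off) != sizeof(entry)) {
                  if (fd >= 0)
                          close(fd);
                  return -1;
          }
          close(fd);
          return !!(entry & PM_UFFD_WP);
  }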

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/proc/task_mmu.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ad667dbc96f5..5d2f73b2e63d 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1390,6 +1390,12 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 		flags |= PM_SWAP;
 		if (is_pfn_swap_entry(entry))
 			page = pfn_swap_entry_to_page(entry);
+		if (is_pte_marker_entry(entry)) {
+			pte_marker marker = pte_marker_get(entry);
+
+			if (marker & PTE_MARKER_UFFD_WP)
+				flags |= PM_UFFD_WP;
+		}
 	}
 
 	if (page && !PageAnon(page))
@@ -1523,10 +1529,15 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
 		if (page_mapcount(page) == 1)
 			flags |= PM_MMAP_EXCLUSIVE;
 
+		if (huge_pte_uffd_wp(pte))
+			flags |= PM_UFFD_WP;
+
 		flags |= PM_PRESENT;
 		if (pm->show_pfn)
 			frame = pte_pfn(pte) +
 				((addr & ~hmask) >> PAGE_SHIFT);
+	} else if (pte_swp_uffd_wp_any(pte)) {
+		flags |= PM_UFFD_WP;
 	}
 
 	for (; addr != end; addr += PAGE_SIZE) {
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v6 21/23] mm/uffd: Enable write protection for shmem & hugetlbfs
  2021-11-15  7:54 [PATCH v6 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (19 preceding siblings ...)
  2021-11-15  8:03 ` [PATCH v6 20/23] mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs Peter Xu
@ 2021-11-15  8:03 ` Peter Xu
  2021-11-15  8:03 ` [PATCH v6 22/23] mm: Enable PTE markers by default Peter Xu
  2021-11-15  8:04 ` [PATCH v6 23/23] selftests/uffd: Enable uffd-wp for shmem/hugetlbfs Peter Xu
  22 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2021-11-15  8:03 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, peterx, Alistair Popple, Andrew Morton, Mike Kravetz,
	Mike Rapoport, Matthew Wilcox, Jerome Glisse, Axel Rasmussen,
	Kirill A . Shutemov, David Hildenbrand, Andrea Arcangeli,
	Hugh Dickins

We've had all the necessary changes ready for both shmem and hugetlbfs.  Turn
on all the shmem/hugetlbfs switches for userfaultfd-wp.

We can expand UFFD_API_RANGE_IOCTLS_BASIC with _UFFDIO_WRITEPROTECT too because
all existing types now support write protection mode.

Since vma_can_userfault() will be used elsewhere, move it into userfaultfd_k.h.
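
For illustration, a rough userspace sketch (error handling trimmed; it
assumes the updated uapi header from this series) of probing the new feature
bit and then wr-protecting a MAP_SHARED shmem or hugetlbfs range:

	#include <fcntl.h>
	#include <stddef.h>
	#include <sys/ioctl.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <linux/userfaultfd.h>

	/* Sketch: register a shared range in wp mode and write-protect it */
	static int wp_protect_range(void *addr, size_t len)
	{
		struct uffdio_api api = { .api = UFFD_API };
		struct uffdio_register reg = {
			.range = { .start = (unsigned long)addr, .len = len },
			.mode = UFFDIO_REGISTER_MODE_WP,
		};
		struct uffdio_writeprotect wp = {
			.range = { .start = (unsigned long)addr, .len = len },
			.mode = UFFDIO_WRITEPROTECT_MODE_WP,
		};
		int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

		if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api))
			return -1;
		/* Feature bit advertised by this patch */
		if (!(api.features & UFFD_FEATURE_WP_HUGETLBFS_SHMEM))
			return -1;
		if (ioctl(uffd, UFFDIO_REGISTER, &reg))
			return -1;
		return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
	}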

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/userfaultfd.c                 | 21 ++-------------------
 include/linux/userfaultfd_k.h    | 12 ++++++++++++
 include/uapi/linux/userfaultfd.h | 10 ++++++++--
 mm/userfaultfd.c                 |  9 +++------
 4 files changed, 25 insertions(+), 27 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index fa24c72a849e..b74cad206d0a 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1253,24 +1253,6 @@ static __always_inline int validate_range(struct mm_struct *mm,
 	return 0;
 }
 
-static inline bool vma_can_userfault(struct vm_area_struct *vma,
-				     unsigned long vm_flags)
-{
-	/* FIXME: add WP support to hugetlbfs and shmem */
-	if (vm_flags & VM_UFFD_WP) {
-		if (is_vm_hugetlb_page(vma) || vma_is_shmem(vma))
-			return false;
-	}
-
-	if (vm_flags & VM_UFFD_MINOR) {
-		if (!(is_vm_hugetlb_page(vma) || vma_is_shmem(vma)))
-			return false;
-	}
-
-	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
-	       vma_is_shmem(vma);
-}
-
 static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 				unsigned long arg)
 {
@@ -1949,7 +1931,8 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
 		~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM);
 #endif
 #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-	uffdio_api.features &= ~UFFD_FEATURE_PAGEFAULT_FLAG_WP;
+	uffdio_api.features &=
+	    ~(UFFD_FEATURE_PAGEFAULT_FLAG_WP | UFFD_FEATURE_WP_HUGETLBFS_SHMEM);
 #endif
 	uffdio_api.ioctls = UFFD_API_IOCTLS;
 	ret = -EFAULT;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 05cec02140cb..ef9b70f6447e 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -18,6 +18,7 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 #include <asm-generic/pgtable_uffd.h>
+#include <linux/hugetlb_inline.h>
 
 /* The set of all possible UFFD-related VM flags. */
 #define __VM_UFFD_FLAGS (VM_UFFD_MISSING | VM_UFFD_WP | VM_UFFD_MINOR)
@@ -140,6 +141,17 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 	return vma->vm_flags & __VM_UFFD_FLAGS;
 }
 
+static inline bool vma_can_userfault(struct vm_area_struct *vma,
+				     unsigned long vm_flags)
+{
+	if (vm_flags & VM_UFFD_MINOR)
+		return is_vm_hugetlb_page(vma) || vma_is_shmem(vma);
+
+	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
+	       vma_is_shmem(vma);
+}
+
+
 extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *);
 extern void dup_userfaultfd_complete(struct list_head *);
 
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 05b31d60acf6..a67b5185a7a9 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -32,7 +32,8 @@
 			   UFFD_FEATURE_SIGBUS |		\
 			   UFFD_FEATURE_THREAD_ID |		\
 			   UFFD_FEATURE_MINOR_HUGETLBFS |	\
-			   UFFD_FEATURE_MINOR_SHMEM)
+			   UFFD_FEATURE_MINOR_SHMEM |		\
+			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -46,7 +47,8 @@
 #define UFFD_API_RANGE_IOCTLS_BASIC		\
 	((__u64)1 << _UFFDIO_WAKE |		\
 	 (__u64)1 << _UFFDIO_COPY |		\
-	 (__u64)1 << _UFFDIO_CONTINUE)
+	 (__u64)1 << _UFFDIO_CONTINUE |		\
+	 (__u64)1 << _UFFDIO_WRITEPROTECT)
 
 /*
  * Valid ioctl command number range with this API is from 0x00 to
@@ -189,6 +191,9 @@ struct uffdio_api {
 	 *
 	 * UFFD_FEATURE_MINOR_SHMEM indicates the same support as
 	 * UFFD_FEATURE_MINOR_HUGETLBFS, but for shmem-backed pages instead.
+	 *
+	 * UFFD_FEATURE_WP_HUGETLBFS_SHMEM indicates that userfaultfd
+	 * write-protection mode is supported on both shmem and hugetlbfs.
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
@@ -201,6 +206,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_THREAD_ID			(1<<8)
 #define UFFD_FEATURE_MINOR_HUGETLBFS		(1<<9)
 #define UFFD_FEATURE_MINOR_SHMEM		(1<<10)
+#define UFFD_FEATURE_WP_HUGETLBFS_SHMEM		(1<<11)
 	__u64 features;
 
 	__u64 ioctls;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 037f82719e64..6d8cd9f6b8a1 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -716,15 +716,12 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 
 	err = -ENOENT;
 	dst_vma = find_dst_vma(dst_mm, start, len);
-	/*
-	 * Make sure the vma is not shared, that the dst range is
-	 * both valid and fully within a single existing vma.
-	 */
-	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
+
+	if (!dst_vma)
 		goto out_unlock;
 	if (!userfaultfd_wp(dst_vma))
 		goto out_unlock;
-	if (!vma_is_anonymous(dst_vma))
+	if (!vma_can_userfault(dst_vma, dst_vma->vm_flags))
 		goto out_unlock;
 
 	if (is_vm_hugetlb_page(dst_vma)) {
-- 
2.32.0



* [PATCH v6 22/23] mm: Enable PTE markers by default
  2021-11-15  7:54 [PATCH v6 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (20 preceding siblings ...)
  2021-11-15  8:03 ` [PATCH v6 21/23] mm/uffd: Enable write protection for shmem & hugetlbfs Peter Xu
@ 2021-11-15  8:03 ` Peter Xu
  2021-11-15  8:04 ` [PATCH v6 23/23] selftests/uffd: Enable uffd-wp for shmem/hugetlbfs Peter Xu
  22 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2021-11-15  8:03 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, peterx, Alistair Popple, Andrew Morton, Mike Kravetz,
	Mike Rapoport, Matthew Wilcox, Jerome Glisse, Axel Rasmussen,
	Kirill A . Shutemov, David Hildenbrand, Andrea Arcangeli,
	Hugh Dickins

Enable PTE markers by default.  On x86_64 this means PTE_MARKER_UFFD_WP will
be auto-enabled as well.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index f01c8e0afadf..401e4dff5f42 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -898,7 +898,7 @@ config SECRETMEM
 	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
 
 config PTE_MARKER
-	def_bool n
+	def_bool y
 	bool "Marker PTEs support"
 
 	help
-- 
2.32.0



* [PATCH v6 23/23] selftests/uffd: Enable uffd-wp for shmem/hugetlbfs
  2021-11-15  7:54 [PATCH v6 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (21 preceding siblings ...)
  2021-11-15  8:03 ` [PATCH v6 22/23] mm: Enable PTE markers by default Peter Xu
@ 2021-11-15  8:04 ` Peter Xu
  22 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2021-11-15  8:04 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, peterx, Alistair Popple, Andrew Morton, Mike Kravetz,
	Mike Rapoport, Matthew Wilcox, Jerome Glisse, Axel Rasmussen,
	Kirill A . Shutemov, David Hildenbrand, Andrea Arcangeli,
	Hugh Dickins

Now that we have added support for shmem and hugetlbfs, we can turn the
uffd-wp test on unconditionally.
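
For example, the wp test now also runs for the shmem and hugetlb test types,
e.g. (a sketch of the usual invocation; the <MiB> and <bounces> values below
are arbitrary):

	./userfaultfd anon 20 16
	./userfaultfd shmem 20 16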

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/vm/userfaultfd.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 64845be3971d..232cc6083039 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -81,7 +81,7 @@ static int test_type;
 static volatile bool test_uffdio_copy_eexist = true;
 static volatile bool test_uffdio_zeropage_eexist = true;
 /* Whether to test uffd write-protection */
-static bool test_uffdio_wp = false;
+static bool test_uffdio_wp = true;
 /* Whether to test uffd minor faults */
 static bool test_uffdio_minor = false;
 
@@ -1588,8 +1588,6 @@ static void set_test_type(const char *type)
 	if (!strcmp(type, "anon")) {
 		test_type = TEST_ANON;
 		uffd_test_ops = &anon_uffd_test_ops;
-		/* Only enable write-protect test for anonymous test */
-		test_uffdio_wp = true;
 	} else if (!strcmp(type, "hugetlb")) {
 		test_type = TEST_HUGETLB;
 		uffd_test_ops = &hugetlb_uffd_test_ops;
-- 
2.32.0



* Re: [PATCH v6 01/23] mm: Introduce PTE_MARKER swap entry
  2021-11-15  7:55 ` [PATCH v6 01/23] mm: Introduce PTE_MARKER swap entry Peter Xu
@ 2021-12-03  3:30   ` Alistair Popple
  2021-12-03  4:21     ` Peter Xu
  0 siblings, 1 reply; 42+ messages in thread
From: Alistair Popple @ 2021-12-03  3:30 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Peter Xu
  Cc: Axel Rasmussen, Nadav Amit, Mike Rapoport, Hugh Dickins,
	Mike Kravetz, Kirill A . Shutemov, Jerome Glisse, Matthew Wilcox,
	Andrew Morton, peterx, David Hildenbrand, Andrea Arcangeli

On Monday, 15 November 2021 6:55:00 PM AEDT Peter Xu wrote:

[...]

> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index d356ab4047f7..5103d2a4ae38 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -247,6 +247,84 @@ static inline int is_writable_migration_entry(swp_entry_t entry)
>  
>  #endif
>  
> +typedef unsigned long pte_marker;
> +
> +#define  PTE_MARKER_MASK     (0)
> +
> +#ifdef CONFIG_PTE_MARKER
> +
> +static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
> +{
> +	return swp_entry(SWP_PTE_MARKER, marker);
> +}
> +
> +static inline bool is_pte_marker_entry(swp_entry_t entry)
> +{
> +	return swp_type(entry) == SWP_PTE_MARKER;
> +}
> +
> +static inline pte_marker pte_marker_get(swp_entry_t entry)
> +{
> +	return swp_offset(entry) & PTE_MARKER_MASK;

I'm not sure the PTE_MARKER_MASK adds much, especially as we only have one
user. I don't see a problem with open-coding these kind of checks (ie.
swp_offset(entry) & PTE_MARKER_UFFD_WP) as you kind of end up doing that anyway.
Alternatively if you want helper functions I think it would be better to define
them for each marker. Eg: is_pte_marker_uffd_wp().

> +}
> +
> +static inline bool is_pte_marker(pte_t pte)
> +{
> +	return is_swap_pte(pte) && is_pte_marker_entry(pte_to_swp_entry(pte));
> +}
> +
> +#else /* CONFIG_PTE_MARKER */
> +
> +static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
> +{
> +	/* This should never be called if !CONFIG_PTE_MARKER */

Can we leave this function undefined then? That way we will get an obvious
build error.

Overall I'm liking the swap entry approach a lot more than the special pte
approach, but maybe that's just because I'm more familiar with special swap
entries :-)

> +	WARN_ON_ONCE(1);
> +	return swp_entry(0, 0);
> +}
> +
> +static inline bool is_pte_marker_entry(swp_entry_t entry)
> +{
> +	return false;
> +}
> +
> +static inline pte_marker pte_marker_get(swp_entry_t entry)
> +{
> +	return 0;
> +}
> +
> +static inline bool is_pte_marker(pte_t pte)
> +{
> +	return false;
> +}
> +
> +#endif /* CONFIG_PTE_MARKER */
> +
> +static inline pte_t make_pte_marker(pte_marker marker)
> +{
> +	return swp_entry_to_pte(make_pte_marker_entry(marker));
> +}
> +
> +/*
> + * This is a special version to check pte_none() just to cover the case when
> + * the pte is a pte marker.  It existed because in many cases the pte marker
> + * should be seen as a none pte; it's just that we have stored some information
> + * onto the none pte so it becomes not-none any more.
> + *
> + * It should be used when the pte is file-backed, ram-based and backing
> + * userspace pages, like shmem.  It is not needed upon pgtables that do not
> + * support pte markers at all.  For example, it's not needed on anonymous
> + * memory, kernel-only memory (including when the system is during-boot),
> + * non-ram based generic file-system.  It's fine to be used even there, but the
> + * extra pte marker check will be pure overhead.
> + *
> + * For systems configured with !CONFIG_PTE_MARKER this will be automatically
> + * optimized to pte_none().
> + */
> +static inline int pte_none_mostly(pte_t pte)
> +{
> +	return pte_none(pte) || is_pte_marker(pte);
> +}
> +
>  static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
>  {
>  	struct page *p = pfn_to_page(swp_offset(entry));
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 068ce591a13a..66f23c6c2032 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -897,6 +897,13 @@ config IO_MAPPING
>  config SECRETMEM
>  	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
>  
> +config PTE_MARKER
> +	def_bool n
> +	bool "Marker PTEs support"
> +
> +	help
> +	  Allows to create marker PTEs for file-backed memory.
> +
>  source "mm/damon/Kconfig"
>  
>  endmenu
> 






* Re: [PATCH v6 01/23] mm: Introduce PTE_MARKER swap entry
  2021-12-03  3:30   ` Alistair Popple
@ 2021-12-03  4:21     ` Peter Xu
  2021-12-03  5:35       ` Alistair Popple
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2021-12-03  4:21 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, linux-kernel, Axel Rasmussen, Nadav Amit,
	Mike Rapoport, Hugh Dickins, Mike Kravetz, Kirill A . Shutemov,
	Jerome Glisse, Matthew Wilcox, Andrew Morton, David Hildenbrand,
	Andrea Arcangeli

On Fri, Dec 03, 2021 at 02:30:00PM +1100, Alistair Popple wrote:
> On Monday, 15 November 2021 6:55:00 PM AEDT Peter Xu wrote:
> 
> [...]
> 
> > diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> > index d356ab4047f7..5103d2a4ae38 100644
> > --- a/include/linux/swapops.h
> > +++ b/include/linux/swapops.h
> > @@ -247,6 +247,84 @@ static inline int is_writable_migration_entry(swp_entry_t entry)
> >  
> >  #endif
> >  
> > +typedef unsigned long pte_marker;
> > +
> > +#define  PTE_MARKER_MASK     (0)
> > +
> > +#ifdef CONFIG_PTE_MARKER
> > +
> > +static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
> > +{
> > +	return swp_entry(SWP_PTE_MARKER, marker);
> > +}
> > +
> > +static inline bool is_pte_marker_entry(swp_entry_t entry)
> > +{
> > +	return swp_type(entry) == SWP_PTE_MARKER;
> > +}
> > +
> > +static inline pte_marker pte_marker_get(swp_entry_t entry)
> > +{
> > +	return swp_offset(entry) & PTE_MARKER_MASK;
> 
> I'm not sure the PTE_MARKER_MASK adds much, especially as we only have one
> user. I don't see a problem with open-coding these kind of checks (ie.

It's more or less a safety belt to make sure anything pte_marker_get() returned
will be pte_marker defined bits only.

> swp_offset(entry) & PTE_MARKER_UFFD_WP) as you kind of end up doing that anyway.
> Alternatively if you want helper functions I think it would be better to define
> them for each marker. Eg: is_pte_marker_uffd_wp().

Yes we can have something like is_pte_marker_uffd_wp(), I didn't do that
explicitly because I want us to be clear that pte_marker is a bitmask, so
calling "is_*" will be slightly opaque - strictly speaking it should be
"pte_marker_has_uffd_wp_bit()" if there will be more bits defined, but then the
name of the helper will look a bit odd too.  Hence I just keep the only
interface to fetch the whole marker and use "&" in the call sites to check.

> 
> > +}
> > +
> > +static inline bool is_pte_marker(pte_t pte)
> > +{
> > +	return is_swap_pte(pte) && is_pte_marker_entry(pte_to_swp_entry(pte));
> > +}
> > +
> > +#else /* CONFIG_PTE_MARKER */
> > +
> > +static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
> > +{
> > +	/* This should never be called if !CONFIG_PTE_MARKER */
> 
> Can we leave this function undefined then? That way we will get an obvious
> build error.

We can, but then we need more macros to cover the common code.  E.g. currently
in hugetlb_change_protection() we have:

        /* None pte */
        if (unlikely(uffd_wp))
                /* Safe to modify directly (none->non-present). */
                set_huge_pte_at(mm, address, ptep,
                                make_pte_marker(PTE_MARKER_UFFD_WP));

If we drop this definition, to let it compile with !PTE_MARKER, we'll need:

+#ifdef PTE_MARKER
        /* None pte */
        if (unlikely(uffd_wp))
                /* Safe to modify directly (none->non-present). */
                set_huge_pte_at(mm, address, ptep,
                                make_pte_marker(PTE_MARKER_UFFD_WP));
+#endif

Compared to adding macro checks over a few other places, I figured it's
easier to define them in the header once and have a proper WARN_ON_ONCE()
there if it ever triggers (while it should just never happen).

> 
> Overall I'm liking the swap entry approach a lot more than the special pte
> approach, but maybe that's just because I'm more familiar with special swap
> entries :-)

The swap entry solution is definitely cleaner to me, if we don't consider
that it spends a whole swap type on a single bit.

Operating on the pte directly is actually slightly more challenging, because
we don't have the protection of is_swap_pte() anymore.  That check helps
shield out quite some strange stuff due to the pte->swp level hierarchy.

Thanks,

-- 
Peter Xu



* Re: [PATCH v6 01/23] mm: Introduce PTE_MARKER swap entry
  2021-12-03  4:21     ` Peter Xu
@ 2021-12-03  5:35       ` Alistair Popple
  2021-12-03  6:45         ` Peter Xu
  0 siblings, 1 reply; 42+ messages in thread
From: Alistair Popple @ 2021-12-03  5:35 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Axel Rasmussen, Nadav Amit,
	Mike Rapoport, Hugh Dickins, Mike Kravetz, Kirill A . Shutemov,
	Jerome Glisse, Matthew Wilcox, Andrew Morton, David Hildenbrand,
	Andrea Arcangeli

On Friday, 3 December 2021 3:21:12 PM AEDT Peter Xu wrote:
> On Fri, Dec 03, 2021 at 02:30:00PM +1100, Alistair Popple wrote:
> > On Monday, 15 November 2021 6:55:00 PM AEDT Peter Xu wrote:
> > 
> > [...]
> > 
> > > diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> > > index d356ab4047f7..5103d2a4ae38 100644
> > > --- a/include/linux/swapops.h
> > > +++ b/include/linux/swapops.h
> > > @@ -247,6 +247,84 @@ static inline int is_writable_migration_entry(swp_entry_t entry)
> > >  
> > >  #endif
> > >  
> > > +typedef unsigned long pte_marker;
> > > +
> > > +#define  PTE_MARKER_MASK     (0)
> > > +
> > > +#ifdef CONFIG_PTE_MARKER
> > > +
> > > +static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
> > > +{
> > > +	return swp_entry(SWP_PTE_MARKER, marker);
> > > +}
> > > +
> > > +static inline bool is_pte_marker_entry(swp_entry_t entry)
> > > +{
> > > +	return swp_type(entry) == SWP_PTE_MARKER;
> > > +}
> > > +
> > > +static inline pte_marker pte_marker_get(swp_entry_t entry)
> > > +{
> > > +	return swp_offset(entry) & PTE_MARKER_MASK;
> > 
> > I'm not sure the PTE_MARKER_MASK adds much, especially as we only have one
> > user. I don't see a problem with open-coding these kind of checks (ie.
> 
> It's more or less a safety belt to make sure anything pte_marker_get() returned
> will be pte_marker defined bits only.
> 
> > swp_offset(entry) & PTE_MARKER_UFFD_WP) as you kind of end up doing that anyway.
> > Alternatively if you want helper functions I think it would be better to define
> > them for each marker. Eg: is_pte_marker_uffd_wp().
> 
> Yes we can have something like is_pte_marker_uffd_wp(), I didn't do that
> explicitly because I want us to be clear that pte_marker is a bitmask, so
> calling "is_*" will be slightly opaque - strictly speaking it should be
> "pte_marker_has_uffd_wp_bit()" if there will be more bits defined, but then the
> name of the helper will look a bit odd too.  Hence I just keep the only
> interface to fetch the whole marker and use "&" in the call sites to check.

Why does a caller need to care if it's a bitmask or not though? Isn't that an
implementation detail that could be left to the "is_*" functions? I must admit
I'm still working through the rest of this series though - is it because you
end up storing some kind of value in the upper bits of the PTE marker?

> > 
> > > +}
> > > +
> > > +static inline bool is_pte_marker(pte_t pte)
> > > +{
> > > +	return is_swap_pte(pte) && is_pte_marker_entry(pte_to_swp_entry(pte));
> > > +}
> > > +
> > > +#else /* CONFIG_PTE_MARKER */
> > > +
> > > +static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
> > > +{
> > > +	/* This should never be called if !CONFIG_PTE_MARKER */
> > 
> > Can we leave this function undefined then? That way we will get an obvious
> > build error.
> 
> We can, but then we need more macros to cover the common code.  E.g. currently
> in hugetlb_change_protection() we have:
> 
>         /* None pte */
>         if (unlikely(uffd_wp))
>                 /* Safe to modify directly (none->non-present). */
>                 set_huge_pte_at(mm, address, ptep,
>                                 make_pte_marker(PTE_MARKER_UFFD_WP));
> 
> If we drop this definition, to let it compile with !PTE_MARKER, we'll need:
> 
> +#ifdef PTE_MARKER
>         /* None pte */
>         if (unlikely(uffd_wp))
>                 /* Safe to modify directly (none->non-present). */
>                 set_huge_pte_at(mm, address, ptep,
>                                 make_pte_marker(PTE_MARKER_UFFD_WP));
> +#endif
> 
> Comparing to adding macro checks over a few other places, I figured maybe it's
> easier to define them in the header once then we proper WARN_ON_ONCE() if
> triggered (while they should just never).

Ok, makes sense. Agree that adding macro checks everywhere isn't great.

> > 
> > Overall I'm liking the swap entry approach a lot more than the special pte
> > approach, but maybe that's just because I'm more familiar with special swap
> > entries :-)
> 
> Swap entry solution is definitely cleaner to me if not considering wasting it
> with one bit.
> 
> Operating on pte directly is actually slightly more challenging, because we
> don't have the protection of is_swap_pte() anymore.  It can help shield out
> quite some strange stuff due to the pte->swp level hierachy.

So I guess now that we have the protection of is_swap_pte(), there are probably
a few places where we need to check for marker pte entries when we find swap
entries.
I'm not suggesting you haven't already found all of those cases of course, just
noting that it's something to review.

> Thanks,
> 
> 






* Re: [PATCH v6 01/23] mm: Introduce PTE_MARKER swap entry
  2021-12-03  5:35       ` Alistair Popple
@ 2021-12-03  6:45         ` Peter Xu
  2021-12-07  2:12           ` Alistair Popple
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2021-12-03  6:45 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, linux-kernel, Axel Rasmussen, Nadav Amit,
	Mike Rapoport, Hugh Dickins, Mike Kravetz, Kirill A . Shutemov,
	Jerome Glisse, Matthew Wilcox, Andrew Morton, David Hildenbrand,
	Andrea Arcangeli

On Fri, Dec 03, 2021 at 04:35:38PM +1100, Alistair Popple wrote:
> > > > +static inline pte_marker pte_marker_get(swp_entry_t entry)
> > > > +{
> > > > +	return swp_offset(entry) & PTE_MARKER_MASK;
> > > 
> > > I'm not sure the PTE_MARKER_MASK adds much, especially as we only have one
> > > user. I don't see a problem with open-coding these kind of checks (ie.
> > 
> > It's more or less a safety belt to make sure anything pte_marker_get() returned
> > will be pte_marker defined bits only.
> > 
> > > swp_offset(entry) & PTE_MARKER_UFFD_WP) as you kind of end up doing that anyway.
> > > Alternatively if you want helper functions I think it would be better to define
> > > them for each marker. Eg: is_pte_marker_uffd_wp().
> > 
> > Yes we can have something like is_pte_marker_uffd_wp(), I didn't do that
> > explicitly because I want us to be clear that pte_marker is a bitmask, so
> > calling "is_*" will be slightly opaque - strictly speaking it should be
> > "pte_marker_has_uffd_wp_bit()" if there will be more bits defined, but then the
> > name of the helper will look a bit odd too.  Hence I just keep the only
> > interface to fetch the whole marker and use "&" in the call sites to check.
> 
> Why does a caller need to care if it's a bitmask or not though? Isn't that an
> implementation detail that could be left to the "is_*" functions? I must admit
> I'm still working through the rest of this series though - is it because you
> end up storing some kind of value in the upper bits of the PTE marker?

Nop. I'm just afraid the caller could overlook the fact that it's a bitmask,
then there can be code like:

  if (is_pte_marker_uffd_wp(*ptep) && drop_uffd_wp)
      pte_clear(ptep)

While we should only do:

  if (is_pte_marker_uffd_wp(*ptep) && drop_uffd_wp)
      // remove the uffd-wp bit in the pte_marker, keep the rest of the bitmask

I could be worrying too much, there's no real user of it.  If you prefer the
helper a lot I can add it in the new version.  Thanks,

-- 
Peter Xu



* Re: [PATCH v6 01/23] mm: Introduce PTE_MARKER swap entry
  2021-12-03  6:45         ` Peter Xu
@ 2021-12-07  2:12           ` Alistair Popple
  2021-12-07  2:30             ` Peter Xu
  0 siblings, 1 reply; 42+ messages in thread
From: Alistair Popple @ 2021-12-07  2:12 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Axel Rasmussen, Nadav Amit,
	Mike Rapoport, Hugh Dickins, Mike Kravetz, Kirill A . Shutemov,
	Jerome Glisse, Matthew Wilcox, Andrew Morton, David Hildenbrand,
	Andrea Arcangeli

On Friday, 3 December 2021 5:45:37 PM AEDT Peter Xu wrote:
> On Fri, Dec 03, 2021 at 04:35:38PM +1100, Alistair Popple wrote:
> > > > > +static inline pte_marker pte_marker_get(swp_entry_t entry)
> > > > > +{
> > > > > +	return swp_offset(entry) & PTE_MARKER_MASK;
> > > > 
> > > > I'm not sure the PTE_MARKER_MASK adds much, especially as we only have one
> > > > user. I don't see a problem with open-coding these kind of checks (ie.
> > > 
> > > It's more or less a safety belt to make sure anything pte_marker_get() returned
> > > will be pte_marker defined bits only.
> > > 
> > > > swp_offset(entry) & PTE_MARKER_UFFD_WP) as you kind of end up doing that anyway.
> > > > Alternatively if you want helper functions I think it would be better to define
> > > > them for each marker. Eg: is_pte_marker_uffd_wp().
> > > 
> > > Yes we can have something like is_pte_marker_uffd_wp(), I didn't do that
> > > explicitly because I want us to be clear that pte_marker is a bitmask, so
> > > calling "is_*" will be slightly opaque - strictly speaking it should be
> > > "pte_marker_has_uffd_wp_bit()" if there will be more bits defined, but then the
> > > name of the helper will look a bit odd too.  Hence I just keep the only
> > > interface to fetch the whole marker and use "&" in the call sites to check.
> > 
> > Why does a caller need to care if it's a bitmask or not though? Isn't that an
> > implementation detail that could be left to the "is_*" functions? I must admit
> > I'm still working through the rest of this series though - is it because you
> > end up storing some kind of value in the upper bits of the PTE marker?
> 
> Nop. I'm just afraid the caller could overlook the fact that it's a bitmask,
> then there can be code like:
> 
>   if (is_pte_marker_uffd_wp(*ptep) && drop_uffd_wp)
>       pte_clear(ptep)
> 
> While we should only do:
> 
>   if (is_pte_marker_uffd_wp(*ptep) && drop_uffd_wp)
>       // remove uffd-wp bit in the pte_marker, keep the reset bitmask

I'm not sure how having the helper function prevents or changes this though? In
fact I just noticed this in patch 8:

                             if (uffd_wp_resolve &&
                                    (pte_marker_get(entry) & PTE_MARKER_UFFD_WP)) {
                                        pte_clear(vma->vm_mm, addr, pte);
                                        pages++;
                                }

And if I'm understanding your point correctly isn't that wrong because if there
were other users of pte markers they would inadvertently get cleared? Unless of
course I've missed something - I haven't looked at patch 8 yet for context. To
help with the above situation I think you would need a helper for clearing
ptes.

> I could be worrying too much, there's no real user of it.  If you prefer the
> helper a lot I can add it in the new version.  Thanks,

It's not a massive issue, but I do think either defining a helper or open
coding the bit check is clearer. I think we can worry about other users if/when
they appear.

 - Alistair




* Re: [PATCH v6 01/23] mm: Introduce PTE_MARKER swap entry
  2021-12-07  2:12           ` Alistair Popple
@ 2021-12-07  2:30             ` Peter Xu
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2021-12-07  2:30 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, linux-kernel, Axel Rasmussen, Nadav Amit,
	Mike Rapoport, Hugh Dickins, Mike Kravetz, Kirill A . Shutemov,
	Jerome Glisse, Matthew Wilcox, Andrew Morton, David Hildenbrand,
	Andrea Arcangeli

On Tue, Dec 07, 2021 at 01:12:23PM +1100, Alistair Popple wrote:
> On Friday, 3 December 2021 5:45:37 PM AEDT Peter Xu wrote:
> > On Fri, Dec 03, 2021 at 04:35:38PM +1100, Alistair Popple wrote:
> > > > > > +static inline pte_marker pte_marker_get(swp_entry_t entry)
> > > > > > +{
> > > > > > +	return swp_offset(entry) & PTE_MARKER_MASK;
> > > > > 
> > > > > I'm not sure the PTE_MARKER_MASK adds much, especially as we only have one
> > > > > user. I don't see a problem with open-coding these kind of checks (ie.
> > > > 
> > > > It's more or less a safety belt to make sure anything pte_marker_get() returned
> > > > will be pte_marker defined bits only.
> > > > 
> > > > > swp_offset(entry) & PTE_MARKER_UFFD_WP) as you kind of end up doing that anyway.
> > > > > Alternatively if you want helper functions I think it would be better to define
> > > > > them for each marker. Eg: is_pte_marker_uffd_wp().
> > > > 
> > > > Yes we can have something like is_pte_marker_uffd_wp(), I didn't do that
> > > > explicitly because I want us to be clear that pte_marker is a bitmask, so
> > > > calling "is_*" will be slightly opaque - strictly speaking it should be
> > > > "pte_marker_has_uffd_wp_bit()" if there will be more bits defined, but then the
> > > > name of the helper will look a bit odd too.  Hence I just keep the only
> > > > interface to fetch the whole marker and use "&" in the call sites to check.
> > > 
> > > Why does a caller need to care if it's a bitmask or not though? Isn't that an
> > > implementation detail that could be left to the "is_*" functions? I must admit
> > > I'm still working through the rest of this series though - is it because you
> > > end up storing some kind of value in the upper bits of the PTE marker?
> > 
> > Nop. I'm just afraid the caller could overlook the fact that it's a bitmask,
> > then there can be code like:
> > 
> >   if (is_pte_marker_uffd_wp(*ptep) && drop_uffd_wp)
> >       pte_clear(ptep)
> > 
> > While we should only do:
> > 
> >   if (is_pte_marker_uffd_wp(*ptep) && drop_uffd_wp)
> >       // remove uffd-wp bit in the pte_marker, keep the reset bitmask
> 
> I'm not sure how having the helper function prevents or changes this though? In
> fact I just noticed this in patch 8:
> 
>                              if (uffd_wp_resolve &&
>                                     (pte_marker_get(entry) & PTE_MARKER_UFFD_WP)) {
>                                         pte_clear(vma->vm_mm, addr, pte);
>                                         pages++;
>                                 }
> 
> And if I'm understanding your point correctly isn't that wrong because if there
> were other users of pte markers they would inadvertently get cleared? Unless of
> course I've missed something - I haven't looked at patch 8 yet for context. To
> help with the above situation I think you would need a helper for clearing
> ptes.

What I wanted to say is that pte_marker_get() will make sure the caller is
aware of the fact that the marker is a bitmask.  But it's true that at least
my example might make it even more confusing...

> 
> > I could be worrying too much, there's no real user of it.  If you prefer the
> > helper a lot I can add it in the new version.  Thanks,
> 
> It's not a massive issue, but I do think either defining a helper or open
> coding the bit check is clearer. I think we can worry about other users if/when
> they appear.

OK, I'll add it.  Thanks,
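
For reference, one possible shape for it (a sketch only; the exact naming and
form are to be settled in the new version), built from the interfaces already
introduced in this series:

	static inline bool is_pte_marker_uffd_wp(pte_t pte)
	{
		swp_entry_t entry;

		if (!is_swap_pte(pte))
			return false;

		entry = pte_to_swp_entry(pte);
		return is_pte_marker_entry(entry) &&
		       (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
	}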

-- 
Peter Xu



* Re: [PATCH v6 03/23] mm: Check against orig_pte for finish_fault()
  2021-11-15  7:55 ` [PATCH v6 03/23] mm: Check against orig_pte for finish_fault() Peter Xu
@ 2021-12-16  5:01   ` Alistair Popple
  2021-12-16  5:38     ` Peter Xu
  0 siblings, 1 reply; 42+ messages in thread
From: Alistair Popple @ 2021-12-16  5:01 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Peter Xu
  Cc: Axel Rasmussen, Nadav Amit, Mike Rapoport, Hugh Dickins,
	Mike Kravetz, Kirill A . Shutemov, Jerome Glisse, Matthew Wilcox,
	Andrew Morton, peterx, David Hildenbrand, Andrea Arcangeli

On Monday, 15 November 2021 6:55:02 PM AEDT Peter Xu wrote:
> We used to check against none pte and in those cases orig_pte should always be
> none pte anyway.

Is that always true? From what I can see in handle_pte_fault() orig_pte only
gets initialised in the !pmd_none() case so might not be pte_none.

> This change prepares us to be able to call do_fault() on !none ptes.  For
> example, we should allow that to happen for pte marker so that we can restore
> information out of the pte markers.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/memory.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 04662b010005..d5966d9e24c3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4052,7 +4052,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>  				      vmf->address, &vmf->ptl);
>  	ret = 0;
>  	/* Re-check under ptl */
> -	if (likely(pte_none(*vmf->pte)))
> +	if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
>  		do_set_pte(vmf, page, vmf->address);
>  	else
>  		ret = VM_FAULT_NOPAGE;
> 






* Re: [PATCH v6 04/23] mm/uffd: PTE_MARKER_UFFD_WP
  2021-11-15  7:55 ` [PATCH v6 04/23] mm/uffd: PTE_MARKER_UFFD_WP Peter Xu
@ 2021-12-16  5:18   ` Alistair Popple
  2021-12-16  5:45     ` Peter Xu
  0 siblings, 1 reply; 42+ messages in thread
From: Alistair Popple @ 2021-12-16  5:18 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Peter Xu
  Cc: Axel Rasmussen, Nadav Amit, Mike Rapoport, Hugh Dickins,
	Mike Kravetz, Kirill A . Shutemov, Jerome Glisse, Matthew Wilcox,
	Andrew Morton, peterx, David Hildenbrand, Andrea Arcangeli

On Monday, 15 November 2021 6:55:03 PM AEDT Peter Xu wrote:

[...]

> +/*
> + * Returns true if this is a swap pte and was uffd-wp wr-protected in either
> + * forms (pte marker or a normal swap pte), false otherwise.
> + */
> +static inline bool pte_swp_uffd_wp_any(pte_t pte)
> +{
> +#ifdef CONFIG_PTE_MARKER_UFFD_WP
> +	if (!is_swap_pte(pte))
> +		return false;
> +
> +	if (pte_swp_uffd_wp(pte))
> +		return true;

If I'm not mistaken normal swap uffd-wp ptes can still exist when
CONFIG_PTE_MARKER_UFFD_WP=n so shouldn't this be outside the #ifdef protection?

In fact we could drop the #ifdef entirely here as it is safe to call
is_pte_marker_uffd_wp() without CONFIG_PTE_MARKER_UFFD_WP.

> +
> +	if (is_pte_marker_uffd_wp(pte))
> +		return true;
> +#endif
> +	return false;
> +}
> +
>  #endif /* _LINUX_USERFAULTFD_K_H */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 66f23c6c2032..f01c8e0afadf 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -904,6 +904,15 @@ config PTE_MARKER
>  	help
>  	  Allows to create marker PTEs for file-backed memory.
>  
> +config PTE_MARKER_UFFD_WP
> +	bool "Marker PTEs support for userfaultfd write protection"
> +	depends on PTE_MARKER && HAVE_ARCH_USERFAULTFD_WP
> +
> +	help
> +	  Allows to create marker PTEs for userfaultfd write protection
> +	  purposes.  It is required to enable userfaultfd write protection on
> +	  file-backed memory types like shmem and hugetlbfs.
> +
>  source "mm/damon/Kconfig"
>  
>  endmenu
> 






* Re: [PATCH v6 03/23] mm: Check against orig_pte for finish_fault()
  2021-12-16  5:01   ` Alistair Popple
@ 2021-12-16  5:38     ` Peter Xu
  2021-12-16  5:50       ` Peter Xu
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2021-12-16  5:38 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, linux-kernel, Axel Rasmussen, Nadav Amit,
	Mike Rapoport, Hugh Dickins, Mike Kravetz, Kirill A . Shutemov,
	Jerome Glisse, Matthew Wilcox, Andrew Morton, David Hildenbrand,
	Andrea Arcangeli

On Thu, Dec 16, 2021 at 04:01:47PM +1100, Alistair Popple wrote:
> On Monday, 15 November 2021 6:55:02 PM AEDT Peter Xu wrote:
> > We used to check against none pte and in those cases orig_pte should always be
> > none pte anyway.
> 
> Is that always true? From what I can see in handle_pte_fault() orig_pte only
> gets initialised in the !pmd_none() case so might not be pte_none.

I believe it's true, otherwise I must have overlooked.

IMHO it's not "when we set orig_pte" that matters - note that finish_fault()
(that this patch modifies) is only called for file-backed memories, and it's
only called in do_fault() where the pte is not mapped at all.

DAX seems to call it too, but still DAX comes from do_fault() too, afaict.

The pte will not be mapped in two cases in handle_pte_fault():

  - When pmd_none

  - When !pmd_none, however if we find that pte_none==true, that's:

        if (pte_none(vmf->orig_pte)) {
                pte_unmap(vmf->pte);
                vmf->pte = NULL;
        }

So when we're already in do_fault(), afaict, orig_pte must be pte_none().
Another side note is that, IIUC pte_none() is a looser check than the
pte_val()==0 and it should be arch dependent.

Thanks,

> 
> > This change prepares us to be able to call do_fault() on !none ptes.  For
> > example, we should allow that to happen for pte marker so that we can restore
> > information out of the pte markers.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  mm/memory.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 04662b010005..d5966d9e24c3 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4052,7 +4052,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> >  				      vmf->address, &vmf->ptl);
> >  	ret = 0;
> >  	/* Re-check under ptl */
> > -	if (likely(pte_none(*vmf->pte)))
> > +	if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
> >  		do_set_pte(vmf, page, vmf->address);
> >  	else
> >  		ret = VM_FAULT_NOPAGE;
> > 
> 
> 
> 
> 

-- 
Peter Xu



* Re: [PATCH v6 04/23] mm/uffd: PTE_MARKER_UFFD_WP
  2021-12-16  5:18   ` Alistair Popple
@ 2021-12-16  5:45     ` Peter Xu
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2021-12-16  5:45 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, linux-kernel, Axel Rasmussen, Nadav Amit,
	Mike Rapoport, Hugh Dickins, Mike Kravetz, Kirill A . Shutemov,
	Jerome Glisse, Matthew Wilcox, Andrew Morton, David Hildenbrand,
	Andrea Arcangeli

On Thu, Dec 16, 2021 at 04:18:50PM +1100, Alistair Popple wrote:
> On Monday, 15 November 2021 6:55:03 PM AEDT Peter Xu wrote:
> 
> [...]
> 
> > +/*
> > + * Returns true if this is a swap pte and was uffd-wp wr-protected in either
> > + * forms (pte marker or a normal swap pte), false otherwise.
> > + */
> > +static inline bool pte_swp_uffd_wp_any(pte_t pte)
> > +{
> > +#ifdef CONFIG_PTE_MARKER_UFFD_WP
> > +	if (!is_swap_pte(pte))
> > +		return false;
> > +
> > +	if (pte_swp_uffd_wp(pte))
> > +		return true;
> 
> If I'm not mistaken normal swap uffd-wp ptes can still exist when
> CONFIG_PTE_MARKER_UFFD_WP=n so shouldn't this be outside the #ifdef protection?
> 
> In fact we could drop the #ifdef entirely here as it is safe to call
> is_pte_marker_uffd_wp() without CONFIG_PTE_MARKER_UFFD_WP.

But if CONFIG_PTE_MARKER_UFFD_WP=n then it means we don't support file-backed
uffd-wp.  The thing is pte_swp_uffd_wp_any() is only needed for file-backed
memory; anonymous uffd-wp code never uses it.

In other words, pte_swp_uffd_wp() is indeed still a valid helper; however this
specific function (pte_swp_uffd_wp_any) may not really be necessary anymore.

Keeping the "#ifdef" could still help, imho, to generate clean code for direct
calls to pte_swp_uffd_wp_any() if someone wants to turn pte markers off,
e.g. we could still have the chance to optimize the code below:

        if (pte_swp_uffd_wp_any(pte) &&
                !(zap_flags & ZAP_FLAG_DROP_MARKER))
                set_huge_pte_at(mm, address, ptep,
                                make_pte_marker(PTE_MARKER_UFFD_WP));
        else
                huge_pte_clear(mm, address, ptep, sz);

Into:

        huge_pte_clear(mm, address, ptep, sz);

with any decent compiler.

That's mainly why I still slightly prefer to keep it, and that's also one of
the major reasons to have the config knob, imho.

Thanks,

-- 
Peter Xu



* Re: [PATCH v6 03/23] mm: Check against orig_pte for finish_fault()
  2021-12-16  5:38     ` Peter Xu
@ 2021-12-16  5:50       ` Peter Xu
  2021-12-16  6:23         ` Alistair Popple
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2021-12-16  5:50 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, linux-kernel, Axel Rasmussen, Nadav Amit,
	Mike Rapoport, Hugh Dickins, Mike Kravetz, Kirill A . Shutemov,
	Jerome Glisse, Matthew Wilcox, Andrew Morton, David Hildenbrand,
	Andrea Arcangeli

On Thu, Dec 16, 2021 at 01:38:33PM +0800, Peter Xu wrote:
> On Thu, Dec 16, 2021 at 04:01:47PM +1100, Alistair Popple wrote:
> > On Monday, 15 November 2021 6:55:02 PM AEDT Peter Xu wrote:
> > > We used to check against none pte and in those cases orig_pte should always be
> > > none pte anyway.
> > 
> > Is that always true? From what I can see in handle_pte_fault() orig_pte only
> > gets initialised in the !pmd_none() case so might not be pte_none.
> 
> I believe it's true, otherwise I must have overlooked.
> 
> IMHO it's not "when we set orig_pte" that matters - note that finish_fault()
> (that this patch modifies) is only called for file-backed memories, and it's
> only called in do_fault() where the pte is not mapped at all.
> 
> DAX seems to call it too, but still DAX comes from do_fault() too, afaict.
> 
> The pte will not be mapped in two cases in handle_pte_fault():
> 
>   - When pmd_none
> 
>   - When !pmd_none, however if we find that pte_none==true, that's:
> 
>         if (pte_none(vmf->orig_pte)) {
>                 pte_unmap(vmf->pte);
>                 vmf->pte = NULL;
>         }
> 
> So when we're already in do_fault(), afaict, orig_pte must be pte_none().
> Another side note is that, IIUC pte_none() is a looser check than the
> pte_val()==0 and it should be arch dependent.

So one more thing I forgot to mention... Of course above is based on the fact
that orig_pte will be initialized to zero when creating vmf structure, and
that's done in __handle_mm_fault():

	struct vm_fault vmf = {
		.vma = vma,
		.address = address & PAGE_MASK,
		.flags = flags,
		.pgoff = linear_page_index(vma, address),
		.gfp_mask = __get_fault_gfp_mask(vma),
	};

I'm not sure whether I should explicitly set it to pte_val(0), in most C
programs we'll already assume it's a proper reset of orig_pte value in c99
initialization format, but if anyone thinks we should do that explicitly plus
some comments I can do that too.

> 
> Thanks,
> 
> > 
> > > This change prepares us to be able to call do_fault() on !none ptes.  For
> > > example, we should allow that to happen for pte marker so that we can restore
> > > information out of the pte markers.
> > > 
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > >  mm/memory.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 04662b010005..d5966d9e24c3 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -4052,7 +4052,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> > >  				      vmf->address, &vmf->ptl);
> > >  	ret = 0;
> > >  	/* Re-check under ptl */
> > > -	if (likely(pte_none(*vmf->pte)))
> > > +	if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
> > >  		do_set_pte(vmf, page, vmf->address);
> > >  	else
> > >  		ret = VM_FAULT_NOPAGE;
> > > 
> > 
> > 
> > 
> > 
> 
> -- 
> Peter Xu

-- 
Peter Xu



* Re: [PATCH v6 06/23] mm/shmem: Handle uffd-wp special pte in page fault handler
  2021-11-15  7:55 ` [PATCH v6 06/23] mm/shmem: Handle uffd-wp special pte in page fault handler Peter Xu
@ 2021-12-16  5:56   ` Alistair Popple
  2021-12-16  6:17     ` Peter Xu
  0 siblings, 1 reply; 42+ messages in thread
From: Alistair Popple @ 2021-12-16  5:56 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Peter Xu
  Cc: Axel Rasmussen, Nadav Amit, Mike Rapoport, Hugh Dickins,
	Mike Kravetz, Kirill A . Shutemov, Jerome Glisse, Matthew Wilcox,
	Andrew Morton, peterx, David Hildenbrand, Andrea Arcangeli

On Monday, 15 November 2021 6:55:05 PM AEDT Peter Xu wrote:

[...]

> diff --git a/mm/memory.c b/mm/memory.c
> index d5966d9e24c3..e8557d43a87d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3452,6 +3452,43 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
>  	return 0;
>  }
>  
> +static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
> +{
> +	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
> +				       vmf->address, &vmf->ptl);
> +	/*
> +	 * Be careful so that we will only recover a special uffd-wp pte into a
> +	 * none pte.  Otherwise it means the pte could have changed, so retry.
> +	 */
> +	if (is_pte_marker(*vmf->pte))
> +		pte_clear(vmf->vma->vm_mm, vmf->address, vmf->pte);
> +	pte_unmap_unlock(vmf->pte, vmf->ptl);
> +	return 0;
> +}
> +
> +/*
> + * This is actually a page-missing access, but with uffd-wp special pte
> + * installed.  It means this pte was wr-protected before being unmapped.
> + */
> +static vm_fault_t pte_marker_handle_uffd_wp(struct vm_fault *vmf)
> +{
> +	/* Careful!  vmf->pte unmapped after return */
> +	if (!pte_unmap_same(vmf))

Hasn't vmf->pte already been unmapped by do_swap_page() by the time we get
here?

> +		return 0;
> +
> +	/*
> +	 * Just in case there're leftover special ptes even after the region
> +	 * got unregistered - we can simply clear them.  We can also do that
> +	 * proactively when e.g. when we do UFFDIO_UNREGISTER upon some uffd-wp
> +	 * ranges, but it should be more efficient to be done lazily here.
> +	 */
> +	if (unlikely(!userfaultfd_wp(vmf->vma) || vma_is_anonymous(vmf->vma)))
> +		return pte_marker_clear(vmf);
> +
> +	/* do_fault() can handle pte markers too like none pte */
> +	return do_fault(vmf);
> +}
> +
>  static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
>  {
>  	swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);
> @@ -3465,8 +3502,11 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
>  	if (WARN_ON_ONCE(vma_is_anonymous(vmf->vma) || !marker))
>  		return VM_FAULT_SIGBUS;
>  
> -	/* TODO: handle pte markers */
> -	return 0;
> +	if (marker & PTE_MARKER_UFFD_WP)

Can we make this check `marker == PTE_MARKER_UFFD_WP`? There is currently only
one user of pte markers, and from what I can tell pte_marker_handle_uffd_wp()
wouldn't do the correct thing if other users were added because it could clear
non-uffd-wp markers. I don't think it's worth making it do the right thing now,
but a comment noting that would be helpful.

> +		return pte_marker_handle_uffd_wp(vmf);
> +
> +	/* This is an unknown pte marker */
> +	return VM_FAULT_SIGBUS;
>  }
>  
>  /*
> @@ -3968,6 +4008,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
>  void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
> +	bool uffd_wp = is_pte_marker_uffd_wp(vmf->orig_pte);
>  	bool write = vmf->flags & FAULT_FLAG_WRITE;
>  	bool prefault = vmf->address != addr;
>  	pte_t entry;
> @@ -3982,6 +4023,8 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>  
>  	if (write)
>  		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +	if (unlikely(uffd_wp))
> +		entry = pte_mkuffd_wp(pte_wrprotect(entry));
>  	/* copy-on-write page */
>  	if (write && !(vma->vm_flags & VM_SHARED)) {
>  		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
> @@ -4155,9 +4198,21 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
>  	return vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff);
>  }
>  
> +/* Return true if we should do read fault-around, false otherwise */
> +static inline bool should_fault_around(struct vm_fault *vmf)
> +{
> +	/* No ->map_pages?  No way to fault around... */
> +	if (!vmf->vma->vm_ops->map_pages)
> +		return false;
> +
> +	if (uffd_disable_fault_around(vmf->vma))
> +		return false;
> +
> +	return fault_around_bytes >> PAGE_SHIFT > 1;
> +}
> +
>  static vm_fault_t do_read_fault(struct vm_fault *vmf)
>  {
> -	struct vm_area_struct *vma = vmf->vma;
>  	vm_fault_t ret = 0;
>  
>  	/*
> @@ -4165,12 +4220,10 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
>  	 * if page by the offset is not ready to be mapped (cold cache or
>  	 * something).
>  	 */
> -	if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
> -		if (likely(!userfaultfd_minor(vmf->vma))) {
> -			ret = do_fault_around(vmf);
> -			if (ret)
> -				return ret;
> -		}
> +	if (should_fault_around(vmf)) {
> +		ret = do_fault_around(vmf);
> +		if (ret)
> +			return ret;
>  	}
>  
>  	ret = __do_fault(vmf);
> 






* Re: [PATCH v6 06/23] mm/shmem: Handle uffd-wp special pte in page fault handler
  2021-12-16  5:56   ` Alistair Popple
@ 2021-12-16  6:17     ` Peter Xu
  2021-12-16  6:30       ` Alistair Popple
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2021-12-16  6:17 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, linux-kernel, Axel Rasmussen, Nadav Amit,
	Mike Rapoport, Hugh Dickins, Mike Kravetz, Kirill A . Shutemov,
	Jerome Glisse, Matthew Wilcox, Andrew Morton, David Hildenbrand,
	Andrea Arcangeli

On Thu, Dec 16, 2021 at 04:56:42PM +1100, Alistair Popple wrote:
> On Monday, 15 November 2021 6:55:05 PM AEDT Peter Xu wrote:
> 
> [...]
> 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index d5966d9e24c3..e8557d43a87d 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3452,6 +3452,43 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> >  	return 0;
> >  }
> >  
> > +static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
> > +{
> > +	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
> > +				       vmf->address, &vmf->ptl);
> > +	/*
> > +	 * Be careful so that we will only recover a special uffd-wp pte into a
> > +	 * none pte.  Otherwise it means the pte could have changed, so retry.
> > +	 */
> > +	if (is_pte_marker(*vmf->pte))
> > +		pte_clear(vmf->vma->vm_mm, vmf->address, vmf->pte);
> > +	pte_unmap_unlock(vmf->pte, vmf->ptl);
> > +	return 0;
> > +}
> > +
> > +/*
> > + * This is actually a page-missing access, but with uffd-wp special pte
> > + * installed.  It means this pte was wr-protected before being unmapped.
> > + */
> > +static vm_fault_t pte_marker_handle_uffd_wp(struct vm_fault *vmf)
> > +{
> > +	/* Careful!  vmf->pte unmapped after return */
> > +	if (!pte_unmap_same(vmf))
> 
> Hasn't vmf->pte already been unmapped by do_swap_page() by the time we get
> here?

Great catch, thanks!

It was needed before with the "swap special pte" version because that was
handled outside do_swap_page().  After the rebase I forgot to remove it.

I believe it didn't crash simply because we got commit 2ca99358671a ("mm:
clear vmf->pte after pte_unmap_same() returns", 2021-11-06) very recently, so
it just became a safe no-op and everything still works.

I'll drop it.

> 
> > +		return 0;
> > +
> > +	/*
> > +	 * Just in case there're leftover special ptes even after the region
> > +	 * got unregistered - we can simply clear them.  We can also do that
> > +	 * proactively when e.g. when we do UFFDIO_UNREGISTER upon some uffd-wp
> > +	 * ranges, but it should be more efficient to be done lazily here.
> > +	 */
> > +	if (unlikely(!userfaultfd_wp(vmf->vma) || vma_is_anonymous(vmf->vma)))
> > +		return pte_marker_clear(vmf);
> > +
> > +	/* do_fault() can handle pte markers too like none pte */
> > +	return do_fault(vmf);
> > +}
> > +
> >  static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> >  {
> >  	swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);
> > @@ -3465,8 +3502,11 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> >  	if (WARN_ON_ONCE(vma_is_anonymous(vmf->vma) || !marker))
> >  		return VM_FAULT_SIGBUS;
> >  
> > -	/* TODO: handle pte markers */
> > -	return 0;
> > +	if (marker & PTE_MARKER_UFFD_WP)
> 
> Can we make this check `marker == PTE_MARKER_UFFD_WP`? There is currently only
> one user of pte markers, and from what I can tell pte_marker_handle_uffd_wp()
> wouldn't do the correct thing if other users were added because it could clear
> non-uffd-wp markers. I don't think it's worth making it do the right thing now,
> but a comment noting that would be helpful.

Sure thing, and yeah I agree it's trivial and shouldn't matter in real life.

I'll change it to "marker == PTE_MARKER_UFFD_WP" as you suggested, so if
there's a surprise we'll get a sigbus.

Thanks,

> 
> > +		return pte_marker_handle_uffd_wp(vmf);
> > +
> > +	/* This is an unknown pte marker */
> > +	return VM_FAULT_SIGBUS;
> >  }

-- 
Peter Xu



* Re: [PATCH v6 03/23] mm: Check against orig_pte for finish_fault()
  2021-12-16  5:50       ` Peter Xu
@ 2021-12-16  6:23         ` Alistair Popple
  2021-12-16  7:06           ` Peter Xu
  0 siblings, 1 reply; 42+ messages in thread
From: Alistair Popple @ 2021-12-16  6:23 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Axel Rasmussen, Nadav Amit,
	Mike Rapoport, Hugh Dickins, Mike Kravetz, Kirill A . Shutemov,
	Jerome Glisse, Matthew Wilcox, Andrew Morton, David Hildenbrand,
	Andrea Arcangeli

On Thursday, 16 December 2021 4:50:47 PM AEDT Peter Xu wrote:
> On Thu, Dec 16, 2021 at 01:38:33PM +0800, Peter Xu wrote:
> > On Thu, Dec 16, 2021 at 04:01:47PM +1100, Alistair Popple wrote:
> > > On Monday, 15 November 2021 6:55:02 PM AEDT Peter Xu wrote:
> > > > We used to check against none pte and in those cases orig_pte should always be
> > > > none pte anyway.
> > > 
> > > Is that always true? From what I can see in handle_pte_fault() orig_pte only
> > > gets initialised in the !pmd_none() case so might not be pte_none.
> > 
> > I believe it's true, otherwise I must have overlooked.
> > 
> > IMHO it's not "when we set orig_pte" that matters - note that finish_fault()
> > (that this patch modifies) is only called for file-backed memories, and it's
> > only called in do_fault() where the pte is not mapped at all.
> > 
> > DAX seems to call it too, but still DAX comes from do_fault() too, afaict.
> > 
> > The pte will not be mapped in two cases in handle_pte_fault():
> > 
> >   - When pmd_none
> > 
> >   - When !pmd_none, however if we find that pte_none==true, that's:
> > 
> >         if (pte_none(vmf->orig_pte)) {
> >                 pte_unmap(vmf->pte);
> >                 vmf->pte = NULL;
> >         }
> > 
> > So when we're already in do_fault(), afaict, orig_pte must be pte_none().
> > Another side note is that, IIUC pte_none() is a looser check than the
> > pte_val()==0 and it should be arch dependent.
> 
> So one more thing I forgot to mention... Of course above is based on the fact
> that orig_pte will be initialized to zero when creating vmf structure, and
> that's done in __handle_mm_fault():
> 
> 	struct vm_fault vmf = {
> 		.vma = vma,
> 		.address = address & PAGE_MASK,
> 		.flags = flags,
> 		.pgoff = linear_page_index(vma, address),
> 		.gfp_mask = __get_fault_gfp_mask(vma),
> 	};
> 
> I'm not sure whether I should explicitly set it to pte_val(0), in most C
> programs we'll already assume it's a proper reset of orig_pte value in c99
> initialization format, but if anyone thinks we should do that explicitly plus
> some comments I can do that too.

Ok, that was really my question. Is:

	if (likely(pte_none(*vmf->pte)))

equivalent to:

	if (likely(pte_same(*vmf->pte, __pte(0))))

for every architecture? Looking at Xtensa for example suggests it might not be:

arch/xtensa/include/asm/pgtable.h:
# define pte_none(pte)	 (pte_val(pte) == (_PAGE_CA_INVALID | _PAGE_USER))
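
As a tiny illustration (assuming the generic pte_same(), i.e. a plain raw
value comparison), the two checks can disagree on such an arch:

	/* xtensa's "none" pte is not the all-zeroes pte */
	pte_t none = __pte(_PAGE_CA_INVALID | _PAGE_USER);

	pte_none(none);             /* true */
	pte_same(none, __pte(0));   /* false - the raw values differ */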

> > 
> > Thanks,
> > 
> > > 
> > > > This change prepares us to be able to call do_fault() on !none ptes.  For
> > > > example, we should allow that to happen for pte marker so that we can restore
> > > > information out of the pte markers.
> > > > 
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > ---
> > > >  mm/memory.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index 04662b010005..d5966d9e24c3 100644
> > > > --- a/mm/memory.c
> > > > +++ b/mm/memory.c
> > > > @@ -4052,7 +4052,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> > > >  				      vmf->address, &vmf->ptl);
> > > >  	ret = 0;
> > > >  	/* Re-check under ptl */
> > > > -	if (likely(pte_none(*vmf->pte)))
> > > > +	if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
> > > >  		do_set_pte(vmf, page, vmf->address);
> > > >  	else
> > > >  		ret = VM_FAULT_NOPAGE;
> > > > 
> > > 
> > > 
> > > 
> > > 
> > 
> 
> 






* Re: [PATCH v6 06/23] mm/shmem: Handle uffd-wp special pte in page fault handler
  2021-12-16  6:17     ` Peter Xu
@ 2021-12-16  6:30       ` Alistair Popple
  0 siblings, 0 replies; 42+ messages in thread
From: Alistair Popple @ 2021-12-16  6:30 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Axel Rasmussen, Nadav Amit,
	Mike Rapoport, Hugh Dickins, Mike Kravetz, Kirill A . Shutemov,
	Jerome Glisse, Matthew Wilcox, Andrew Morton, David Hildenbrand,
	Andrea Arcangeli

On Thursday, 16 December 2021 5:17:30 PM AEDT Peter Xu wrote:
> On Thu, Dec 16, 2021 at 04:56:42PM +1100, Alistair Popple wrote:
> > On Monday, 15 November 2021 6:55:05 PM AEDT Peter Xu wrote:
> > 
> > [...]
> > 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index d5966d9e24c3..e8557d43a87d 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -3452,6 +3452,43 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> > >  	return 0;
> > >  }
> > >  
> > > +static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
> > > +{
> > > +	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
> > > +				       vmf->address, &vmf->ptl);
> > > +	/*
> > > +	 * Be careful so that we will only recover a special uffd-wp pte into a
> > > +	 * none pte.  Otherwise it means the pte could have changed, so retry.
> > > +	 */
> > > +	if (is_pte_marker(*vmf->pte))
> > > +		pte_clear(vmf->vma->vm_mm, vmf->address, vmf->pte);
> > > +	pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > +	return 0;
> > > +}
> > > +
> > > +/*
> > > + * This is actually a page-missing access, but with uffd-wp special pte
> > > + * installed.  It means this pte was wr-protected before being unmapped.
> > > + */
> > > +static vm_fault_t pte_marker_handle_uffd_wp(struct vm_fault *vmf)
> > > +{
> > > +	/* Careful!  vmf->pte unmapped after return */
> > > +	if (!pte_unmap_same(vmf))
> > 
> > Hasn't vmf->pte already been unmapped by do_swap_page() by the time we get
> > here?
> 
> Great catch, thanks!
> 
> It was needed before with the "swap special pte" version because that was
> handled outside do_swap_page().  After the rebase I forgot to remove it.

No worries, and for what it's worth, IMHO this version, which handles it inside
do_swap_page() along with all the other "special" cases, is much nicer.

> I believe it didn't crash simply because we've got commit 2ca99358671a ("mm:
> clear vmf->pte after pte_unmap_same() returns", 2021-11-06) very recently so it
> just became a safe no-op, so all things will still work.
> 
> I'll drop it.
> 
> > 
> > > +		return 0;
> > > +
> > > +	/*
> > > +	 * Just in case there're leftover special ptes even after the region
> > > +	 * got unregistered - we can simply clear them.  We can also do that
> > > +	 * proactively when e.g. when we do UFFDIO_UNREGISTER upon some uffd-wp
> > > +	 * ranges, but it should be more efficient to be done lazily here.
> > > +	 */
> > > +	if (unlikely(!userfaultfd_wp(vmf->vma) || vma_is_anonymous(vmf->vma)))
> > > +		return pte_marker_clear(vmf);
> > > +
> > > +	/* do_fault() can handle pte markers too like none pte */
> > > +	return do_fault(vmf);
> > > +}
> > > +
> > >  static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> > >  {
> > >  	swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);
> > > @@ -3465,8 +3502,11 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> > >  	if (WARN_ON_ONCE(vma_is_anonymous(vmf->vma) || !marker))
> > >  		return VM_FAULT_SIGBUS;
> > >  
> > > -	/* TODO: handle pte markers */
> > > -	return 0;
> > > +	if (marker & PTE_MARKER_UFFD_WP)
> > 
> > Can we make this check `marker == PTE_MARKER_UFFD_WP`? There is currently only
> > one user of pte markers, and from what I can tell pte_marker_handle_uffd_wp()
> > wouldn't do the correct thing if other users were added because it could clear
> > non-uffd-wp markers. I don't think it's worth making it do the right thing now,
> > but a comment noting that would be helpful.
> 
> Sure thing, and yeah I agree it's trivial and shouldn't matter in real life.
> 
> I'll change it to "marker == PTE_MARKER_UFFD_WP" as you suggested, so if
> there's a surprise we'll get a sigbus.
> 
> Thanks,
> 
> > 
> > > +		return pte_marker_handle_uffd_wp(vmf);
> > > +
> > > +	/* This is an unknown pte marker */
> > > +	return VM_FAULT_SIGBUS;
> > >  }
> 
> 






* Re: [PATCH v6 03/23] mm: Check against orig_pte for finish_fault()
  2021-12-16  6:23         ` Alistair Popple
@ 2021-12-16  7:06           ` Peter Xu
  2021-12-16  7:45             ` Alistair Popple
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2021-12-16  7:06 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, linux-kernel, Axel Rasmussen, Nadav Amit,
	Mike Rapoport, Hugh Dickins, Mike Kravetz, Kirill A . Shutemov,
	Jerome Glisse, Matthew Wilcox, Andrew Morton, David Hildenbrand,
	Andrea Arcangeli

On Thu, Dec 16, 2021 at 05:23:40PM +1100, Alistair Popple wrote:
> On Thursday, 16 December 2021 4:50:47 PM AEDT Peter Xu wrote:
> > On Thu, Dec 16, 2021 at 01:38:33PM +0800, Peter Xu wrote:
> > > On Thu, Dec 16, 2021 at 04:01:47PM +1100, Alistair Popple wrote:
> > > > On Monday, 15 November 2021 6:55:02 PM AEDT Peter Xu wrote:
> > > > > We used to check against none pte and in those cases orig_pte should always be
> > > > > none pte anyway.
> > > > 
> > > > Is that always true? From what I can see in handle_pte_fault() orig_pte only
> > > > gets initialised in the !pmd_none() case so might not be pte_none.
> > > 
> > > I believe it's true, otherwise I must have overlooked.
> > > 
> > > IMHO it's not "when we set orig_pte" that matters - note that finish_fault()
> > > (that this patch modifies) is only called for file-backed memories, and it's
> > > only called in do_fault() where the pte is not mapped at all.
> > > 
> > > DAX seems to call it too, but still DAX comes from do_fault() too, afaict.
> > > 
> > > The pte will not be mapped in two cases in handle_pte_fault():
> > > 
> > >   - When pmd_none
> > > 
> > >   - When !pmd_none, however if we find that pte_none==true, that's:
> > > 
> > >         if (pte_none(vmf->orig_pte)) {
> > >                 pte_unmap(vmf->pte);
> > >                 vmf->pte = NULL;
> > >         }
> > > 
> > > So when we're already in do_fault(), afaict, orig_pte must be pte_none().
> > > Another side note is that, IIUC pte_none() is a looser check than the
> > > pte_val()==0 and it should be arch dependent.
> > 
> > So one more thing I forgot to mention... Of course above is based on the fact
> > that orig_pte will be initialized to zero when creating vmf structure, and
> > that's done in __handle_mm_fault():
> > 
> > 	struct vm_fault vmf = {
> > 		.vma = vma,
> > 		.address = address & PAGE_MASK,
> > 		.flags = flags,
> > 		.pgoff = linear_page_index(vma, address),
> > 		.gfp_mask = __get_fault_gfp_mask(vma),
> > 	};
> > 
> > I'm not sure whether I should explicitly set it to pte_val(0), in most C
> > programs we'll already assume it's a proper reset of orig_pte value in c99
> > initialization format, but if anyone thinks we should do that explicitly plus
> > some comments I can do that too.
> 
> Ok, that was really my question. Is:
> 
> 	if (likely(pte_none(*vmf->pte)))
> 
> equivalent to:
> 
> 	if (likely(pte_same(*vmf->pte, __pte(0))))
> 
> for every architecture? Looking at Xtensa for example suggests it might not be:
> 
> arch/xtensa/include/asm/pgtable.h:
> # define pte_none(pte)	 (pte_val(pte) == (_PAGE_CA_INVALID | _PAGE_USER))

Yes, another good question...

I never expected an arch where pte_none(__pte(0))==false.. but indeed xtensa
is one of them.  I dug a bit more; s390 seems to be the other one.

I wondered how it could have worked - I thought e.g. pte_alloc_one() would
always return a pgtable page that is all zero-filled, since its allocation
should use __GFP_ZERO anyway.  But then I quickly noticed that pte_alloc_one()
is per-arch too..  That explains it: each arch can re-initialize its default
"none" pte value.

S390 re-initializes its pgtable pages in arch/s390/mm/pgalloc.c:

     unsigned long *page_table_alloc(struct mm_struct *mm)
        memset64((u64 *)table, _PAGE_INVALID, PTRS_PER_PTE);

While similarly xtensa has:

#define pte_clear(mm,addr,ptep)						\
	do { update_pte(ptep, __pte(_PAGE_CA_INVALID | _PAGE_USER)); } while (0)

The solution should be simple - I could re-introduce FAULT_FLAG_UFFD_WP.  That
flag used to exist in older versions; e.g., this is v1 of the current patchset,
where it is defined:

https://lore.kernel.org/lkml/20210323004912.35132-6-peterx@redhat.com/

I thought this patch could greatly simplify things, but I overlooked the
pte_none() check you mentioned.  So it seems I have no better choice than to
add that flag back.

Another alternative is to do pte_clear() on vmf->orig_pte as the new way to
initialize it.  I believe that should work for s390 and xtensa too.
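
A rough sketch of what that could look like (hypothetical, just to illustrate
the idea):

	struct vm_fault vmf = {
		.vma = vma,
		.address = address & PAGE_MASK,
		/* ... the other fields, as before ... */
	};

	/*
	 * Hypothetical: ask the arch for its canonical "none" pte rather
	 * than relying on the zero-initializer, so that the later
	 * pte_same(*vmf->pte, vmf->orig_pte) check in finish_fault() still
	 * matches an empty pte on s390/xtensa.
	 */
	pte_clear(vma->vm_mm, vmf.address, &vmf.orig_pte);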

Any preference?

Thanks,

> 
> > > 
> > > Thanks,
> > > 
> > > > 
> > > > > This change prepares us to be able to call do_fault() on !none ptes.  For
> > > > > example, we should allow that to happen for pte marker so that we can restore
> > > > > information out of the pte markers.
> > > > > 
> > > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > > ---
> > > > >  mm/memory.c | 2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > index 04662b010005..d5966d9e24c3 100644
> > > > > --- a/mm/memory.c
> > > > > +++ b/mm/memory.c
> > > > > @@ -4052,7 +4052,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> > > > >  				      vmf->address, &vmf->ptl);
> > > > >  	ret = 0;
> > > > >  	/* Re-check under ptl */
> > > > > -	if (likely(pte_none(*vmf->pte)))
> > > > > +	if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
> > > > >  		do_set_pte(vmf, page, vmf->address);
> > > > >  	else
> > > > >  		ret = VM_FAULT_NOPAGE;
> > > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > 
> > 
> > 
> 
> 
> 
> 

-- 
Peter Xu



* Re: [PATCH v6 03/23] mm: Check against orig_pte for finish_fault()
  2021-12-16  7:06           ` Peter Xu
@ 2021-12-16  7:45             ` Alistair Popple
  2021-12-16  8:04               ` Peter Xu
  0 siblings, 1 reply; 42+ messages in thread
From: Alistair Popple @ 2021-12-16  7:45 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Axel Rasmussen, Nadav Amit,
	Mike Rapoport, Hugh Dickins, Mike Kravetz, Kirill A . Shutemov,
	Jerome Glisse, Matthew Wilcox, Andrew Morton, David Hildenbrand,
	Andrea Arcangeli

On Thursday, 16 December 2021 6:06:54 PM AEDT Peter Xu wrote:

[...]

> I wondered how it could have worked - I thought e.g. pte_alloc_one() would
> always return a pgtable page that is all zero-filled, since its allocation
> should use __GFP_ZERO anyway.  But then I quickly noticed that pte_alloc_one()
> is per-arch too..  That explains it: each arch can re-initialize its default
> "none" pte value.

Yes, I have wondered the same things before as well. Some of this stuff is a
little bit of fun.

> I thought this patch could greatly simplify things, but I overlooked the
> pte_none() check you mentioned.  So it seems I have no better choice than to
> add that flag back.
> 
> Another alternative is to do pte_clear() on vmf->orig_pte as the new way to
> initialize it.  I believe that should work for s390 and xtensa too.
> 
> Any preference?

I prefer the latter approach (initialising via pte_clear()) as it seems cleaner,
and a cleared pte is pte_none() on every architecture afaik.

> Thanks,
> 
> > 
> > > > 
> > > > Thanks,
> > > > 
> > > > > 
> > > > > > This change prepares us to be able to call do_fault() on !none ptes.  For
> > > > > > example, we should allow that to happen for pte marker so that we can restore
> > > > > > information out of the pte markers.
> > > > > > 
> > > > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > > > ---
> > > > > >  mm/memory.c | 2 +-
> > > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > > index 04662b010005..d5966d9e24c3 100644
> > > > > > --- a/mm/memory.c
> > > > > > +++ b/mm/memory.c
> > > > > > @@ -4052,7 +4052,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> > > > > >  				      vmf->address, &vmf->ptl);
> > > > > >  	ret = 0;
> > > > > >  	/* Re-check under ptl */
> > > > > > -	if (likely(pte_none(*vmf->pte)))
> > > > > > +	if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
> > > > > >  		do_set_pte(vmf, page, vmf->address);
> > > > > >  	else
> > > > > >  		ret = VM_FAULT_NOPAGE;
> > > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > 
> > > 
> > > 
> > 
> > 
> > 
> > 
> 
> 






* Re: [PATCH v6 03/23] mm: Check against orig_pte for finish_fault()
  2021-12-16  7:45             ` Alistair Popple
@ 2021-12-16  8:04               ` Peter Xu
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2021-12-16  8:04 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, linux-kernel, Axel Rasmussen, Nadav Amit,
	Mike Rapoport, Hugh Dickins, Mike Kravetz, Kirill A . Shutemov,
	Jerome Glisse, Matthew Wilcox, Andrew Morton, David Hildenbrand,
	Andrea Arcangeli

On Thu, Dec 16, 2021 at 06:45:07PM +1100, Alistair Popple wrote:
> On Thursday, 16 December 2021 6:06:54 PM AEDT Peter Xu wrote:
> 
> [...]
> 
> > I wondered how it could have worked - I thought e.g. pte_alloc_one() would
> > always return a pgtable page that is all zero-filled, since its allocation
> > should use __GFP_ZERO anyway.  But then I quickly noticed that pte_alloc_one()
> > is per-arch too..  That explains it: each arch can re-initialize its default
> > "none" pte value.
> 
> Yes, I have wondered the same things before as well. Some of this stuff is a
> little bit of fun.
> 
> > I thought this patch could greatly simplify things, but I overlooked the
> > pte_none() check you mentioned.  So it seems I have no better choice than to
> > add that flag back.
> > 
> > Another alternative is to do pte_clear() on vmf->orig_pte as the new way to
> > initialize it.  I believe that should work for s390 and xtensa too.
> > 
> > Any preference?
> 
> I prefer the latter approach (initialising via pte_clear()) as it seems cleaner,
> and a cleared pte is pte_none() on every architecture afaik.

Will do.

-- 
Peter Xu



end of thread

Thread overview: 42+ messages
2021-11-15  7:54 [PATCH v6 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
2021-11-15  7:55 ` [PATCH v6 01/23] mm: Introduce PTE_MARKER swap entry Peter Xu
2021-12-03  3:30   ` Alistair Popple
2021-12-03  4:21     ` Peter Xu
2021-12-03  5:35       ` Alistair Popple
2021-12-03  6:45         ` Peter Xu
2021-12-07  2:12           ` Alistair Popple
2021-12-07  2:30             ` Peter Xu
2021-11-15  7:55 ` [PATCH v6 02/23] mm: Teach core mm about pte markers Peter Xu
2021-11-15  7:55 ` [PATCH v6 03/23] mm: Check against orig_pte for finish_fault() Peter Xu
2021-12-16  5:01   ` Alistair Popple
2021-12-16  5:38     ` Peter Xu
2021-12-16  5:50       ` Peter Xu
2021-12-16  6:23         ` Alistair Popple
2021-12-16  7:06           ` Peter Xu
2021-12-16  7:45             ` Alistair Popple
2021-12-16  8:04               ` Peter Xu
2021-11-15  7:55 ` [PATCH v6 04/23] mm/uffd: PTE_MARKER_UFFD_WP Peter Xu
2021-12-16  5:18   ` Alistair Popple
2021-12-16  5:45     ` Peter Xu
2021-11-15  7:55 ` [PATCH v6 05/23] mm/shmem: Take care of UFFDIO_COPY_MODE_WP Peter Xu
2021-11-15  7:55 ` [PATCH v6 06/23] mm/shmem: Handle uffd-wp special pte in page fault handler Peter Xu
2021-12-16  5:56   ` Alistair Popple
2021-12-16  6:17     ` Peter Xu
2021-12-16  6:30       ` Alistair Popple
2021-11-15  7:55 ` [PATCH v6 07/23] mm/shmem: Persist uffd-wp bit across zapping for file-backed Peter Xu
2021-11-15  8:00 ` [PATCH v6 08/23] mm/shmem: Allow uffd wr-protect none pte for file-backed mem Peter Xu
2021-11-15  8:00 ` [PATCH v6 09/23] mm/shmem: Allows file-back mem to be uffd wr-protected on thps Peter Xu
2021-11-15  8:01 ` [PATCH v6 10/23] mm/shmem: Handle uffd-wp during fork() Peter Xu
2021-11-15  8:01 ` [PATCH v6 11/23] mm/hugetlb: Introduce huge pte version of uffd-wp helpers Peter Xu
2021-11-15  8:01 ` [PATCH v6 12/23] mm/hugetlb: Hook page faults for uffd write protection Peter Xu
2021-11-15  8:01 ` [PATCH v6 13/23] mm/hugetlb: Take care of UFFDIO_COPY_MODE_WP Peter Xu
2021-11-15  8:02 ` [PATCH v6 14/23] mm/hugetlb: Handle UFFDIO_WRITEPROTECT Peter Xu
2021-11-15  8:02 ` [PATCH v6 15/23] mm/hugetlb: Handle pte markers in page faults Peter Xu
2021-11-15  8:02 ` [PATCH v6 16/23] mm/hugetlb: Allow uffd wr-protect none ptes Peter Xu
2021-11-15  8:02 ` [PATCH v6 17/23] mm/hugetlb: Only drop uffd-wp special pte if required Peter Xu
2021-11-15  8:02 ` [PATCH v6 18/23] mm/hugetlb: Handle uffd-wp during fork() Peter Xu
2021-11-15  8:03 ` [PATCH v6 19/23] mm/khugepaged: Don't recycle vma pgtable if uffd-wp registered Peter Xu
2021-11-15  8:03 ` [PATCH v6 20/23] mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs Peter Xu
2021-11-15  8:03 ` [PATCH v6 21/23] mm/uffd: Enable write protection for shmem & hugetlbfs Peter Xu
2021-11-15  8:03 ` [PATCH v6 22/23] mm: Enable PTE markers by default Peter Xu
2021-11-15  8:04 ` [PATCH v6 23/23] selftests/uffd: Enable uffd-wp for shmem/hugetlbfs Peter Xu
