* [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs
From: Peter Xu @ 2022-04-05  1:46 UTC
  To: linux-kernel, linux-mm
  Cc: Mike Kravetz, Andrew Morton, David Hildenbrand, Matthew Wilcox,
	peterx, Alistair Popple, Nadav Amit, Axel Rasmussen,
	Andrea Arcangeli, Kirill A . Shutemov, Hugh Dickins,
	Jerome Glisse, Mike Rapoport

This is v8 of the series to add shmem+hugetlbfs support for userfaultfd
write protection.  It is based on v5.17-mmots-2022-03-31-20-40.

I touched up two small details after the rebase, namely:

  - Let UFFDIO_REGISTER fail gracefully if CONFIG_PTE_MARKER_UFFD_WP is not
    set, in "mm/uffd: Enable write protection for shmem & hugetlbfs".

  - Tweaked the patch "mm: Enable PTE markers by default" to make sure
    it'll really be auto-enabled on x86_64 by kconfig where appropriate.

During testing of recent versions, I grew another unit test dedicated to
uffd-wp (uffd-test [0]; the name is not important).  The current
vm/userfaultfd selftest doesn't verify which message we expect to receive,
so this simple new test can catch errors where e.g. one page was
wr-protected but then wrongly written to without the fault-resolving
thread noticing, hence corrupting data.

Previously I could only find such issues with umapsort, and MISSING mode
doesn't suffer from this kind of data loss.  Now many of them can also be
found with uffd-test [0].  I plan to port it to the linux repo after this
series lands.
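
For reference, the heart of such a check is simply reading the fault
messages off the uffd and validating them.  Below is a rough sketch using
the uapi from linux/userfaultfd.h; the wait_for_wp_fault() helper is made
up for illustration and is not the actual uffd-test code:

  #include <linux/userfaultfd.h>
  #include <stdio.h>
  #include <unistd.h>

  /* Illustration only: verify the next fault message is the wr-protect
   * fault we expect on "expected_addr" (assumes 4K pages). */
  static void wait_for_wp_fault(int uffd, unsigned long expected_addr)
  {
          struct uffd_msg msg;

          /* Blocks until a tracked fault is reported on the uffd. */
          if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
                  return;
          if (msg.event != UFFD_EVENT_PAGEFAULT)
                  return;
          /* A write to a wr-protected page must carry the WP flag... */
          if (!(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP))
                  fprintf(stderr, "missed wr-protect message!\n");
          /* ...and must point into the page we protected. */
          if ((msg.arg.pagefault.address & ~0xfffUL) != expected_addr)
                  fprintf(stderr, "unexpected fault address\n");
  }

The real test does more than this, but this is the shape of the check.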

The whole tree can be found here for testing:

  https://github.com/xzpeter/linux/tree/uffd-wp-shmem-hugetlbfs

Previous versions:

  RFC: https://lore.kernel.org/lkml/20210115170907.24498-1-peterx@redhat.com
  v1:  https://lore.kernel.org/lkml/20210323004912.35132-1-peterx@redhat.com
  v2:  https://lore.kernel.org/lkml/20210427161317.50682-1-peterx@redhat.com
  v3:  https://lore.kernel.org/lkml/20210527201927.29586-1-peterx@redhat.com
  v4:  https://lore.kernel.org/lkml/20210714222117.47648-1-peterx@redhat.com
  v5:  https://lore.kernel.org/lkml/20210715201422.211004-1-peterx@redhat.com
  v6:  https://lore.kernel.org/lkml/20211115075522.73795-1-peterx@redhat.com
  v7:  https://lore.kernel.org/lkml/20220304051708.86193-1-peterx@redhat.com

Overview
========

Userfaultfd-wp anonymous support was merged two years ago.  Quite a few
applications have started to leverage this capability, either to take
snapshots of user-app memory or to do fully user-controlled swapping.

This series completes the uffd-wp feature so that it covers all the
RAM-based memory types.  Among the uffd modes, uffd-wp is the only one
still missing this coverage (uffd-missing and uffd-minor already have it).

One major reason to do so is that anonymous pages sometimes cannot satisfy
application needs, and there are growing users of both shmem and hugetlbfs,
either for sharing purposes (e.g., sharing guest memory between the
hypervisor process and a device emulation process, or shmem-based local
live migration for upgrades) or for the performance of TLB hits.

All this means that if a uffd-wp app wants to switch to any of these
memory types, it will stop working.  I think it's worthwhile for the
kernel to cover all these aspects.
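
For context, the basic userspace flow that this series extends to shmem
and hugetlbfs is sketched below (a minimal sketch with error handling
collapsed).  Nothing here is new uapi; the point of the series is that the
same sequence now also works when "start" points into a shmem or hugetlbfs
mapping rather than anonymous memory:

  #include <linux/userfaultfd.h>
  #include <fcntl.h>
  #include <sys/ioctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* Register [start, start+len) for wr-protect tracking, then arm it. */
  static int wp_protect_range(unsigned long start, unsigned long len)
  {
          int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
          struct uffdio_api api = { .api = UFFD_API };
          struct uffdio_register reg = {
                  .range = { .start = start, .len = len },
                  .mode  = UFFDIO_REGISTER_MODE_WP,
          };
          struct uffdio_writeprotect wp = {
                  .range = { .start = start, .len = len },
                  .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
          };

          if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) ||
              ioctl(uffd, UFFDIO_REGISTER, &reg) ||
              ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
                  return -1;
          return uffd;    /* fault messages are now readable from uffd */
  }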

This series chose to protect pages at the pte level, not the page level.

One major reason is safety.  I have no idea how we could make it safe if
any uffd-privileged app could wr-protect a page that any other application
can use: it would mean such an app can block any process, potentially for
as long as it wants.

The other reason is that it aligns very well not only with the anonymous
uffd-wp solution, but with uffd as a whole.  For example, userfaultfd is
implemented fundamentally on top of VMAs: we set flags on VMAs to show the
status of uffd tracking.  A per-page protection solution would cross that
VMA-based foundation line, and it could simply end up too far away from
what's called userfaultfd.

PTE markers
===========

The patchset is based on an idea called PTE markers.  It was discussed in
one of the mm alignment sessions, proposed starting from v6, and this is
the 2nd version built upon the PTE marker idea.

A PTE marker is a new type of swap entry that is only applicable to
file-backed memory like shmem and hugetlbfs.  It is used to persist some
pte-level information even after the original present ptes in the pgtable
have been zapped.

Logically pte markers can store more than uffd-wp information, but so far
only one bit is used, for the uffd-wp purpose.  When a pte marker is
installed with the uffd-wp bit set, it means this pte is wr-protected by
uffd.

It solves the problem of, e.g., file-backed memory ptes getting zapped for
whatever reason (thp split, being swapped out, ...): we can still keep the
wr-protect information in the ptes.  Then when the page fault triggers
again, we'll know this pte was wr-protected and can treat the pte the same
as a normal uffd wr-protected pte.

The extra information is encoded into the swap entry, or the swp_offset to
be explicit, with the swp_type being SWP_PTE_MARKER.  So far uffd-wp only
uses one bit out of the swap entry; the rest of the swp_offset bits are
still reserved for other purposes.
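
Condensed from the helpers in patches 1 and 4, the decoding side boils
down to the following (the actual series splits this across
is_pte_marker_entry(), pte_marker_get() and pte_marker_uffd_wp(); the
function name below is made up for illustration):

  /* Sketch: does this pte hold a pte marker with the uffd-wp bit set? */
  static bool pte_is_uffd_wp_marker(pte_t pte)
  {
          swp_entry_t entry;

          if (!is_swap_pte(pte))
                  return false;
          entry = pte_to_swp_entry(pte);
          /* Marker bits live in swp_offset; only bit 0 is used so far. */
          return swp_type(entry) == SWP_PTE_MARKER &&
                 (swp_offset(entry) & PTE_MARKER_UFFD_WP);
  }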

There are two configs to enable/disable PTE markers:

  CONFIG_PTE_MARKER
  CONFIG_PTE_MARKER_UFFD_WP

Setting !PTE_MARKER completely disables all the PTE markers, along with
uffd-wp support.  I made two configs so that one can also enable PTE
markers while disabling file-backed uffd-wp, keeping the markers available
for other purposes.  At the end of the series I enable CONFIG_PTE_MARKER
by default, but that patch is standalone: if anyone worries about having
it on by default, we can turn it off again by simply dropping that
one-liner patch.  So far I don't see a huge risk in doing so, so I kept
that patch.

In most cases, PTE markers should be treated as none ptes.  That is
because, unlike most of the other swap entry types, a PTE marker encodes
no PFN or block offset information, only some extra well-defined bits
showing the status of the pte.  These bits should only be used as extra
data when servicing an upcoming page fault; after that, we behave as if it
were a none pte.

I did spend a lot of time auditing all the pte_none() users this time.  It
is indeed a challenge because there are a lot of them, and I hope I didn't
miss a single one where pte markers should be taken care of.  Luckily, pte
markers don't need to be considered in many cases, for example: boot code,
arch code (especially non-x86), kernel-only page handling (e.g. CPA), or
device driver code dealing with pure PFN mappings.

I introduced pte_none_mostly() in this series for the places that need to
handle pte markers the same as none ptes; the "mostly" is another way to
write "either a none pte or a pte marker".
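
Concretely, as defined in patch 1:

  /* Treat a pte marker like a none pte at most call sites. */
  static inline int pte_none_mostly(pte_t pte)
  {
          return pte_none(pte) || is_pte_marker(pte);
  }

Call sites that must not treat a marker like a none pte (e.g. khugepaged,
or the smaps code mentioned in the last bullet below) simply keep using
pte_none().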

I didn't replace pte_none() itself to cover pte markers, for the reasons below:

  - Only very rare pte_none() callers will need to handle pte markers.
    E.g., kernel pages require no knowledge of pte markers at all.  So we
    don't pollute the major use cases.

  - Unconditionally changing pte_none() semantics could confuse people,
    because pte_none() has existed for such a long time.

  - Unconditionally changing pte_none() semantics could also make
    pte_none() slower, even in the many cases where pte markers cannot
    exist.

  - There are cases where we'd like to handle pte markers differently from
    none ptes, so a full replacement is impossible anyway.  E.g.,
    khugepaged should still treat pte markers as normal swap ptes rather
    than none ptes, because a pte marker will always need a fault-in to
    merge the marker with a valid pte.  And the smaps code needs to parse
    PTE markers, not none ptes.

Patch Layout
============

Introducing PTE marker and uffd-wp bit in PTE marker:

  mm: Introduce PTE_MARKER swap entry
  mm: Teach core mm about pte markers
  mm: Check against orig_pte for finish_fault()
  mm/uffd: PTE_MARKER_UFFD_WP

Adding support for shmem uffd-wp:

  mm/shmem: Take care of UFFDIO_COPY_MODE_WP
  mm/shmem: Handle uffd-wp special pte in page fault handler
  mm/shmem: Persist uffd-wp bit across zapping for file-backed
  mm/shmem: Allow uffd wr-protect none pte for file-backed mem
  mm/shmem: Allows file-back mem to be uffd wr-protected on thps
  mm/shmem: Handle uffd-wp during fork()

Adding support for hugetlbfs uffd-wp:

  mm/hugetlb: Introduce huge pte version of uffd-wp helpers
  mm/hugetlb: Hook page faults for uffd write protection
  mm/hugetlb: Take care of UFFDIO_COPY_MODE_WP
  mm/hugetlb: Handle UFFDIO_WRITEPROTECT
  mm/hugetlb: Handle pte markers in page faults
  mm/hugetlb: Allow uffd wr-protect none ptes
  mm/hugetlb: Only drop uffd-wp special pte if required
  mm/hugetlb: Handle uffd-wp during fork()

Misc handling on the rest mm for uffd-wp file-backed:

  mm/khugepaged: Don't recycle vma pgtable if uffd-wp registered
  mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs

Enabling of uffd-wp on file-backed memory:

  mm/uffd: Enable write protection for shmem & hugetlbfs
  mm: Enable PTE markers by default
  selftests/uffd: Enable uffd-wp for shmem/hugetlbfs

Tests
=====

- Compile test on x86_64 and aarch64 on different configs
- Kernel selftests
- uffd-test [0]
- Umapsort [1,2] test for shmem/hugetlb, with swap on/off

Please review, thanks.

[0] https://github.com/xzpeter/clibs/tree/master/uffd-test
[1] https://github.com/xzpeter/umap-apps/tree/peter
[2] https://github.com/xzpeter/umap/tree/peter-shmem-hugetlbfs

Peter Xu (23):
  mm: Introduce PTE_MARKER swap entry
  mm: Teach core mm about pte markers
  mm: Check against orig_pte for finish_fault()
  mm/uffd: PTE_MARKER_UFFD_WP
  mm/shmem: Take care of UFFDIO_COPY_MODE_WP
  mm/shmem: Handle uffd-wp special pte in page fault handler
  mm/shmem: Persist uffd-wp bit across zapping for file-backed
  mm/shmem: Allow uffd wr-protect none pte for file-backed mem
  mm/shmem: Allows file-back mem to be uffd wr-protected on thps
  mm/shmem: Handle uffd-wp during fork()
  mm/hugetlb: Introduce huge pte version of uffd-wp helpers
  mm/hugetlb: Hook page faults for uffd write protection
  mm/hugetlb: Take care of UFFDIO_COPY_MODE_WP
  mm/hugetlb: Handle UFFDIO_WRITEPROTECT
  mm/hugetlb: Handle pte markers in page faults
  mm/hugetlb: Allow uffd wr-protect none ptes
  mm/hugetlb: Only drop uffd-wp special pte if required
  mm/hugetlb: Handle uffd-wp during fork()
  mm/khugepaged: Don't recycle vma pgtable if uffd-wp registered
  mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs
  mm/uffd: Enable write protection for shmem & hugetlbfs
  mm: Enable PTE markers by default
  selftests/uffd: Enable uffd-wp for shmem/hugetlbfs

 arch/s390/include/asm/hugetlb.h          |  15 ++
 fs/hugetlbfs/inode.c                     |  15 +-
 fs/proc/task_mmu.c                       |   7 +
 fs/userfaultfd.c                         |  31 ++--
 include/asm-generic/hugetlb.h            |  24 +++
 include/linux/hugetlb.h                  |  27 ++--
 include/linux/mm.h                       |  10 ++
 include/linux/mm_inline.h                |  43 +++++
 include/linux/shmem_fs.h                 |   4 +-
 include/linux/swap.h                     |  15 +-
 include/linux/swapops.h                  |  79 +++++++++
 include/linux/userfaultfd_k.h            |  80 +++++++++
 include/uapi/linux/userfaultfd.h         |  10 +-
 mm/Kconfig                               |  17 ++
 mm/filemap.c                             |   5 +
 mm/hmm.c                                 |   2 +-
 mm/hugetlb.c                             | 183 ++++++++++++++++-----
 mm/khugepaged.c                          |  14 +-
 mm/memcontrol.c                          |   8 +-
 mm/memory.c                              | 196 ++++++++++++++++++++---
 mm/mincore.c                             |   3 +-
 mm/mprotect.c                            |  75 ++++++++-
 mm/rmap.c                                |   8 +
 mm/shmem.c                               |   4 +-
 mm/userfaultfd.c                         |  54 +++++--
 tools/testing/selftests/vm/userfaultfd.c |   4 +-
 26 files changed, 807 insertions(+), 126 deletions(-)

-- 
2.32.0




* [PATCH v8 01/23] mm: Introduce PTE_MARKER swap entry
From: Peter Xu @ 2022-04-05  1:46 UTC
  To: linux-kernel, linux-mm
  Cc: Mike Kravetz, Andrew Morton, David Hildenbrand, Matthew Wilcox,
	peterx, Alistair Popple, Nadav Amit, Axel Rasmussen,
	Andrea Arcangeli, Kirill A . Shutemov, Hugh Dickins,
	Jerome Glisse, Mike Rapoport

This patch introduces a new swap entry type called PTE_MARKER.  It can be
installed for any pte that maps file-backed memory when the pte is
temporarily zapped, so as to maintain per-pte information.

The information kept in the pte is called a "marker".  Here we define the
marker as "unsigned long" just to match pgoff_t, but it will only work as
long as it fits in swp_offset(), which is e.g. currently 58 bits on x86_64.

A new config CONFIG_PTE_MARKER is introduced too; it is off by default.  A
bunch of helpers are defined altogether to service the rest of the pte
marker code.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/asm-generic/hugetlb.h |  9 ++++
 include/linux/swap.h          | 15 ++++++-
 include/linux/swapops.h       | 78 +++++++++++++++++++++++++++++++++++
 mm/Kconfig                    |  6 +++
 4 files changed, 107 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
index 8e1e6244a89d..f39cad20ffc6 100644
--- a/include/asm-generic/hugetlb.h
+++ b/include/asm-generic/hugetlb.h
@@ -2,6 +2,9 @@
 #ifndef _ASM_GENERIC_HUGETLB_H
 #define _ASM_GENERIC_HUGETLB_H
 
+#include <linux/swap.h>
+#include <linux/swapops.h>
+
 static inline pte_t mk_huge_pte(struct page *page, pgprot_t pgprot)
 {
 	return mk_pte(page, pgprot);
@@ -80,6 +83,12 @@ static inline int huge_pte_none(pte_t pte)
 }
 #endif
 
+/* Please refer to comments above pte_none_mostly() for the usage */
+static inline int huge_pte_none_mostly(pte_t pte)
+{
+	return huge_pte_none(pte) || is_pte_marker(pte);
+}
+
 #ifndef __HAVE_ARCH_HUGE_PTE_WRPROTECT
 static inline pte_t huge_pte_wrprotect(pte_t pte)
 {
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7daae5a4b3e1..5553189d0215 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -55,6 +55,19 @@ static inline int current_is_kswapd(void)
  * actions on faults.
  */
 
+/*
+ * PTE markers are used to persist information onto PTEs that are mapped with
+ * file-backed memories.  As its name "PTE" hints, it should only be applied to
+ * the leaves of pgtables.
+ */
+#ifdef CONFIG_PTE_MARKER
+#define SWP_PTE_MARKER_NUM 1
+#define SWP_PTE_MARKER     (MAX_SWAPFILES + SWP_HWPOISON_NUM + \
+			    SWP_MIGRATION_NUM + SWP_DEVICE_NUM)
+#else
+#define SWP_PTE_MARKER_NUM 0
+#endif
+
 /*
  * Unaddressable device memory support. See include/linux/hmm.h and
  * Documentation/vm/hmm.rst. Short description is we need struct pages for
@@ -107,7 +120,7 @@ static inline int current_is_kswapd(void)
 
 #define MAX_SWAPFILES \
 	((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
-	SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
+	SWP_MIGRATION_NUM - SWP_HWPOISON_NUM - SWP_PTE_MARKER_NUM)
 
 /*
  * Magic header for a swap area. The first part of the union is
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 32d517a28969..7a00627845f0 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -274,6 +274,84 @@ static inline int is_readable_migration_entry(swp_entry_t entry)
 
 #endif
 
+typedef unsigned long pte_marker;
+
+#define  PTE_MARKER_MASK     (0)
+
+#ifdef CONFIG_PTE_MARKER
+
+static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
+{
+	return swp_entry(SWP_PTE_MARKER, marker);
+}
+
+static inline bool is_pte_marker_entry(swp_entry_t entry)
+{
+	return swp_type(entry) == SWP_PTE_MARKER;
+}
+
+static inline pte_marker pte_marker_get(swp_entry_t entry)
+{
+	return swp_offset(entry) & PTE_MARKER_MASK;
+}
+
+static inline bool is_pte_marker(pte_t pte)
+{
+	return is_swap_pte(pte) && is_pte_marker_entry(pte_to_swp_entry(pte));
+}
+
+#else /* CONFIG_PTE_MARKER */
+
+static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
+{
+	/* This should never be called if !CONFIG_PTE_MARKER */
+	WARN_ON_ONCE(1);
+	return swp_entry(0, 0);
+}
+
+static inline bool is_pte_marker_entry(swp_entry_t entry)
+{
+	return false;
+}
+
+static inline pte_marker pte_marker_get(swp_entry_t entry)
+{
+	return 0;
+}
+
+static inline bool is_pte_marker(pte_t pte)
+{
+	return false;
+}
+
+#endif /* CONFIG_PTE_MARKER */
+
+static inline pte_t make_pte_marker(pte_marker marker)
+{
+	return swp_entry_to_pte(make_pte_marker_entry(marker));
+}
+
+/*
+ * This is a special version to check pte_none() just to cover the case when
+ * the pte is a pte marker.  It existed because in many cases the pte marker
+ * should be seen as a none pte; it's just that we have stored some information
+ * onto the none pte so it becomes not-none any more.
+ *
+ * It should be used when the pte is file-backed, ram-based and backing
+ * userspace pages, like shmem.  It is not needed upon pgtables that do not
+ * support pte markers at all.  For example, it's not needed on anonymous
+ * memory, kernel-only memory (including when the system is during-boot),
+ * non-ram based generic file-system.  It's fine to be used even there, but the
+ * extra pte marker check will be pure overhead.
+ *
+ * For systems configured with !CONFIG_PTE_MARKER this will be automatically
+ * optimized to pte_none().
+ */
+static inline int pte_none_mostly(pte_t pte)
+{
+	return pte_none(pte) || is_pte_marker(pte);
+}
+
 static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
 {
 	struct page *p = pfn_to_page(swp_offset(entry));
diff --git a/mm/Kconfig b/mm/Kconfig
index 034d87953600..a1688b9314b2 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -909,6 +909,12 @@ config ANON_VMA_NAME
 	  area from being merged with adjacent virtual memory areas due to the
 	  difference in their name.
 
+config PTE_MARKER
+	bool "Marker PTEs support"
+
+	help
+	  Allows to create marker PTEs for file-backed memory.
+
 source "mm/damon/Kconfig"
 
 endmenu
-- 
2.32.0




* [PATCH v8 02/23] mm: Teach core mm about pte markers
From: Peter Xu @ 2022-04-05  1:48 UTC
  To: linux-kernel, linux-mm
  Cc: Mike Kravetz, Nadav Amit, Matthew Wilcox, Mike Rapoport,
	David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple, peterx

This patch does not yet use pte markers in any way, but it teaches the
core mm about the pte marker idea.

For example, handle_pte_marker() is introduced; it will parse and handle
all the pte marker faults.

Many of the changes are just comments - so that we know where a pte marker
can possibly show up, and why we don't need special code for those cases.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/userfaultfd.c | 10 ++++++----
 mm/filemap.c     |  5 +++++
 mm/hmm.c         |  2 +-
 mm/memcontrol.c  |  8 ++++++--
 mm/memory.c      | 23 +++++++++++++++++++++++
 mm/mincore.c     |  3 ++-
 mm/mprotect.c    |  3 +++
 7 files changed, 46 insertions(+), 8 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index aa0c47cb0d16..8b4a94f5a238 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -249,9 +249,10 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
 
 	/*
 	 * Lockless access: we're in a wait_event so it's ok if it
-	 * changes under us.
+	 * changes under us.  PTE markers should be handled the same as none
+	 * ptes here.
 	 */
-	if (huge_pte_none(pte))
+	if (huge_pte_none_mostly(pte))
 		ret = true;
 	if (!huge_pte_write(pte) && (reason & VM_UFFD_WP))
 		ret = true;
@@ -330,9 +331,10 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 	pte = pte_offset_map(pmd, address);
 	/*
 	 * Lockless access: we're in a wait_event so it's ok if it
-	 * changes under us.
+	 * changes under us.  PTE markers should be handled the same as none
+	 * ptes here.
 	 */
-	if (pte_none(*pte))
+	if (pte_none_mostly(*pte))
 		ret = true;
 	if (!pte_write(*pte) && (reason & VM_UFFD_WP))
 		ret = true;
diff --git a/mm/filemap.c b/mm/filemap.c
index 3a5ffb5587cd..ef77dae8c28d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3382,6 +3382,11 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		vmf->pte += xas.xa_index - last_pgoff;
 		last_pgoff = xas.xa_index;
 
+		/*
+		 * NOTE: If there're PTE markers, we'll leave them to be
+		 * handled in the specific fault path, and it'll prohibit the
+		 * fault-around logic.
+		 */
 		if (!pte_none(*vmf->pte))
 			goto unlock;
 
diff --git a/mm/hmm.c b/mm/hmm.c
index af71aac3140e..3fd3242c5e50 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -239,7 +239,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 	pte_t pte = *ptep;
 	uint64_t pfn_req_flags = *hmm_pfn;
 
-	if (pte_none(pte)) {
+	if (pte_none_mostly(pte)) {
 		required_fault =
 			hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
 		if (required_fault)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7a08737bac4b..08af97c73f0f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5644,10 +5644,14 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
 
 	if (pte_present(ptent))
 		page = mc_handle_present_pte(vma, addr, ptent);
+	else if (pte_none_mostly(ptent))
+		/*
+		 * PTE markers should be treated as a none pte here, separated
+		 * from other swap handling below.
+		 */
+		page = mc_handle_file_pte(vma, addr, ptent);
 	else if (is_swap_pte(ptent))
 		page = mc_handle_swap_pte(vma, ptent, &ent);
-	else if (pte_none(ptent))
-		page = mc_handle_file_pte(vma, addr, ptent);
 
 	if (!page && !ent.val)
 		return ret;
diff --git a/mm/memory.c b/mm/memory.c
index 2c5d1bb4694f..3f396241a7db 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -100,6 +100,8 @@ struct page *mem_map;
 EXPORT_SYMBOL(mem_map);
 #endif
 
+static vm_fault_t do_fault(struct vm_fault *vmf);
+
 /*
  * A number of key systems in x86 including ioremap() rely on the assumption
  * that high_memory defines the upper bound on direct map memory, then end
@@ -1415,6 +1417,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			if (!should_zap_page(details, page))
 				continue;
 			rss[mm_counter(page)]--;
+		} else if (is_pte_marker_entry(entry)) {
+			/* By default, simply drop all pte markers when zap */
 		} else if (is_hwpoison_entry(entry)) {
 			if (!should_zap_cows(details))
 				continue;
@@ -3555,6 +3559,23 @@ static inline bool should_try_to_free_swap(struct page *page,
 		page_count(page) == 2;
 }
 
+static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
+{
+	swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);
+	unsigned long marker = pte_marker_get(entry);
+
+	/*
+	 * PTE markers should always be with file-backed memories, and the
+	 * marker should never be empty.  If anything weird happened, the best
+	 * thing to do is to kill the process along with its mm.
+	 */
+	if (WARN_ON_ONCE(vma_is_anonymous(vmf->vma) || !marker))
+		return VM_FAULT_SIGBUS;
+
+	/* TODO: handle pte markers */
+	return 0;
+}
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -3592,6 +3613,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
 		} else if (is_hwpoison_entry(entry)) {
 			ret = VM_FAULT_HWPOISON;
+		} else if (is_pte_marker_entry(entry)) {
+			ret = handle_pte_marker(vmf);
 		} else {
 			print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
 			ret = VM_FAULT_SIGBUS;
diff --git a/mm/mincore.c b/mm/mincore.c
index f4f627325e12..fa200c14185f 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -122,7 +122,8 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	for (; addr != end; ptep++, addr += PAGE_SIZE) {
 		pte_t pte = *ptep;
 
-		if (pte_none(pte))
+		/* We need to do cache lookup too for pte markers */
+		if (pte_none_mostly(pte))
 			__mincore_unmapped_range(addr, addr + PAGE_SIZE,
 						 vma, vec);
 		else if (pte_present(pte))
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 56060acdabd3..709a6f73b764 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -188,6 +188,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 					newpte = pte_swp_mksoft_dirty(newpte);
 				if (pte_swp_uffd_wp(oldpte))
 					newpte = pte_swp_mkuffd_wp(newpte);
+			} else if (is_pte_marker_entry(entry)) {
+				/* Skip it, the same as none pte */
+				continue;
 			} else {
 				newpte = oldpte;
 			}
-- 
2.32.0




* [PATCH v8 03/23] mm: Check against orig_pte for finish_fault()
From: Peter Xu @ 2022-04-05  1:48 UTC
  To: linux-kernel, linux-mm
  Cc: Mike Kravetz, Nadav Amit, Matthew Wilcox, Mike Rapoport,
	David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple, peterx

We used to check against the none pte in finish_fault(), with the
assumption that the orig_pte is always the none pte.

This change prepares us to be able to call do_fault() on !none ptes.  For
example, we should allow that to happen for pte markers, so that we can
restore information out of them.

Let's change the "pte_none" check into detecting changes since we fetched
orig_pte.  One trivial thing to take care of here: when pmd==NULL we may
not initialize orig_pte at all in handle_pte_fault().

By default orig_pte will be all zeros, but the problem is that not all
architectures use all-zeros for a none pte.  pte_clear() is the right
thing to use here, so that we'll always have a valid orig_pte value for
the whole handle_pte_fault() call.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/memory.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 3f396241a7db..b1af996b09ca 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4241,7 +4241,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 				      vmf->address, &vmf->ptl);
 	ret = 0;
 	/* Re-check under ptl */
-	if (likely(pte_none(*vmf->pte)))
+	if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
 		do_set_pte(vmf, page, vmf->address);
 	else
 		ret = VM_FAULT_NOPAGE;
@@ -4709,6 +4709,13 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		 * concurrent faults and from rmap lookups.
 		 */
 		vmf->pte = NULL;
+		/*
+		 * Always initialize orig_pte.  This matches with below
+		 * code to have orig_pte to be the none pte if pte==NULL.
+		 * This makes the rest code to be always safe to reference
+		 * it, e.g. in finish_fault() we'll detect pte changes.
+		 */
+		pte_clear(vmf->vma->vm_mm, vmf->address, &vmf->orig_pte);
 	} else {
 		/*
 		 * If a huge pmd materialized under us just retry later.  Use
-- 
2.32.0




* [PATCH v8 04/23] mm/uffd: PTE_MARKER_UFFD_WP
From: Peter Xu @ 2022-04-05  1:48 UTC
  To: linux-kernel, linux-mm
  Cc: Mike Kravetz, Nadav Amit, Matthew Wilcox, Mike Rapoport,
	David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple, peterx

This patch introduces the 1st user of pte markers: the uffd-wp marker.

When the pte marker is installed with the uffd-wp bit set, it means this pte
was wr-protected by uffd.

We will use this special pte to arm the ptes that got either unmapped or
swapped out in a file-backed region that was previously wr-protected.
This special pte can trigger a page fault just like a swap entry.

This idea was greatly inspired by Hugh and Andrea in the discussions
referenced in the links below.

Some helpers are introduced to detect whether a swap pte is uffd
wr-protected.  With pte markers introduced, a swap pte can be wr-protected
in two forms: either it is a normal swap pte with _PAGE_SWP_UFFD_WP set,
or it's a pte marker with PTE_MARKER_UFFD_WP set.

Link: https://lore.kernel.org/lkml/20201126222359.8120-1-peterx@redhat.com/
Link: https://lore.kernel.org/lkml/20201130230603.46187-1-peterx@redhat.com/
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Suggested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/swapops.h       |  3 ++-
 include/linux/userfaultfd_k.h | 43 +++++++++++++++++++++++++++++++++++
 mm/Kconfig                    |  9 ++++++++
 3 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 7a00627845f0..fffbba0036f6 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -276,7 +276,8 @@ static inline int is_readable_migration_entry(swp_entry_t entry)
 
 typedef unsigned long pte_marker;
 
-#define  PTE_MARKER_MASK     (0)
+#define  PTE_MARKER_UFFD_WP  BIT(0)
+#define  PTE_MARKER_MASK     (PTE_MARKER_UFFD_WP)
 
 #ifdef CONFIG_PTE_MARKER
 
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 33cea484d1ad..bd09c3c89b59 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -15,6 +15,8 @@
 
 #include <linux/fcntl.h>
 #include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 #include <asm-generic/pgtable_uffd.h>
 
 /* The set of all possible UFFD-related VM flags. */
@@ -236,4 +238,45 @@ static inline void userfaultfd_unmap_complete(struct mm_struct *mm,
 
 #endif /* CONFIG_USERFAULTFD */
 
+static inline bool pte_marker_entry_uffd_wp(swp_entry_t entry)
+{
+	return is_pte_marker_entry(entry) &&
+	    (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
+}
+
+static inline bool pte_marker_uffd_wp(pte_t pte)
+{
+#ifdef CONFIG_PTE_MARKER_UFFD_WP
+	swp_entry_t entry;
+
+	if (!is_swap_pte(pte))
+		return false;
+
+	entry = pte_to_swp_entry(pte);
+
+	return pte_marker_entry_uffd_wp(entry);
+#else
+	return false;
+#endif
+}
+
+/*
+ * Returns true if this is a swap pte and was uffd-wp wr-protected in either
+ * forms (pte marker or a normal swap pte), false otherwise.
+ */
+static inline bool pte_swp_uffd_wp_any(pte_t pte)
+{
+#ifdef CONFIG_PTE_MARKER_UFFD_WP
+	if (!is_swap_pte(pte))
+		return false;
+
+	if (pte_swp_uffd_wp(pte))
+		return true;
+
+	if (pte_marker_uffd_wp(pte))
+		return true;
+#endif
+	return false;
+}
+
 #endif /* _LINUX_USERFAULTFD_K_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index a1688b9314b2..6e7c2d59fa96 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -915,6 +915,15 @@ config PTE_MARKER
 	help
 	  Allows to create marker PTEs for file-backed memory.
 
+config PTE_MARKER_UFFD_WP
+	bool "Marker PTEs support for userfaultfd write protection"
+	depends on PTE_MARKER && HAVE_ARCH_USERFAULTFD_WP
+
+	help
+	  Allows to create marker PTEs for userfaultfd write protection
+	  purposes.  It is required to enable userfaultfd write protection on
+	  file-backed memory types like shmem and hugetlbfs.
+
 source "mm/damon/Kconfig"
 
 endmenu
-- 
2.32.0




* [PATCH v8 05/23] mm/shmem: Take care of UFFDIO_COPY_MODE_WP
From: Peter Xu @ 2022-04-05  1:48 UTC
  To: linux-kernel, linux-mm
  Cc: Mike Kravetz, Nadav Amit, Matthew Wilcox, Mike Rapoport,
	David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple, peterx

Pass wp_copy into shmem_mfill_atomic_pte() through the stack, then apply
the UFFD_WP bit properly when a UFFDIO_COPY on shmem is requested with
UFFDIO_COPY_MODE_WP.  wp_copy finally lands in mfill_atomic_install_pte().

Note: we must do pte_wrprotect() if !writable in mfill_atomic_install_pte(), as
mk_pte() could return a writable pte (e.g., when VM_SHARED on a shmem file).

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/shmem_fs.h |  4 ++--
 mm/shmem.c               |  4 ++--
 mm/userfaultfd.c         | 23 ++++++++++++++++++-----
 3 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 3e915cc550bc..a68f982f22d1 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -145,11 +145,11 @@ extern int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 				  struct vm_area_struct *dst_vma,
 				  unsigned long dst_addr,
 				  unsigned long src_addr,
-				  bool zeropage,
+				  bool zeropage, bool wp_copy,
 				  struct page **pagep);
 #else /* !CONFIG_SHMEM */
 #define shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, \
-			       src_addr, zeropage, pagep)       ({ BUG(); 0; })
+			       src_addr, zeropage, wp_copy, pagep) ({ BUG(); 0; })
 #endif /* CONFIG_SHMEM */
 #endif /* CONFIG_USERFAULTFD */
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 7004c7f55716..9efb8a96d75e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2319,7 +2319,7 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 			   struct vm_area_struct *dst_vma,
 			   unsigned long dst_addr,
 			   unsigned long src_addr,
-			   bool zeropage,
+			   bool zeropage, bool wp_copy,
 			   struct page **pagep)
 {
 	struct inode *inode = file_inode(dst_vma->vm_file);
@@ -2392,7 +2392,7 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 		goto out_release;
 
 	ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
-				       page, true, false);
+				       page, true, wp_copy);
 	if (ret)
 		goto out_delete_from_cache;
 
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index dae25d985d15..b1c875b77fbb 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -77,10 +77,19 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	 * Always mark a PTE as write-protected when needed, regardless of
 	 * VM_WRITE, which the user might change.
 	 */
-	if (wp_copy)
+	if (wp_copy) {
 		_dst_pte = pte_mkuffd_wp(_dst_pte);
-	else if (writable)
+		writable = false;
+	}
+
+	if (writable)
 		_dst_pte = pte_mkwrite(_dst_pte);
+	else
+		/*
+		 * We need this to make sure write bit removed; as mk_pte()
+		 * could return a pte with write bit set.
+		 */
+		_dst_pte = pte_wrprotect(_dst_pte);
 
 	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
 
@@ -95,7 +104,12 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	}
 
 	ret = -EEXIST;
-	if (!pte_none(*dst_pte))
+	/*
+	 * We allow to overwrite a pte marker: consider when both MISSING|WP
+	 * registered, we firstly wr-protect a none pte which has no page cache
+	 * page backing it, then access the page.
+	 */
+	if (!pte_none_mostly(*dst_pte))
 		goto out_unlock;
 
 	if (page_in_cache) {
@@ -479,11 +493,10 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 			err = mfill_zeropage_pte(dst_mm, dst_pmd,
 						 dst_vma, dst_addr);
 	} else {
-		VM_WARN_ON_ONCE(wp_copy);
 		err = shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma,
 					     dst_addr, src_addr,
 					     mode != MCOPY_ATOMIC_NORMAL,
-					     page);
+					     wp_copy, page);
 	}
 
 	return err;
-- 
2.32.0




* [PATCH v8 06/23] mm/shmem: Handle uffd-wp special pte in page fault handler
From: Peter Xu @ 2022-04-05  1:48 UTC
  To: linux-kernel, linux-mm
  Cc: Mike Kravetz, Nadav Amit, Matthew Wilcox, Mike Rapoport,
	David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple, peterx

File-backed memory is prone to unmap/swap, so its ptes are always
unstable: they can easily be zapped and faulted back in later via the page
cache.  This can lead to the uffd-wp bit getting lost when unmapping or
swapping out such memory.  One example is shmem.  PTE markers are needed
to store that information.

This patch prepares for that by handling uffd-wp pte markers before any
code installs them, so that the page fault handler can recognize uffd-wp
pte markers once they start to appear.

The handling of uffd-wp pte markers is similar to missing faults; it's
just that we handle this "missing fault" when we see the pte marker, while
making sure the marker information is kept throughout processing the
fault.

This is a slow path of uffd-wp handling, because zapping of wr-protected shmem
ptes should be rare.  So far it should only trigger in two conditions:

  (1) When trying to punch holes in shmem_fallocate(), there is an optimization
      to zap the pgtables before evicting the page.

  (2) When swapping out shmem pages.

Because of this, the page fault handling is simplified too: we do not send
the wr-protect message on the 1st page fault; instead the page is
installed read-only, so the uffd-wp message will be generated on the next
fault, which triggers the do_wp_page() path of the general uffd-wp
handling.

Disable fault-around for all uffd-wp registered ranges for extra safety,
just like for uffd-minor faults, and clean the code up.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/userfaultfd_k.h | 17 +++++++++
 mm/memory.c                   | 67 ++++++++++++++++++++++++++++++-----
 2 files changed, 75 insertions(+), 9 deletions(-)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index bd09c3c89b59..827e38b7be65 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -96,6 +96,18 @@ static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
 	return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
 }
 
+/*
+ * Don't do fault around for either WP or MINOR registered uffd range.  For
+ * MINOR registered range, fault around will be a total disaster and ptes can
+ * be installed without notifications; for WP it should mostly be fine as long
+ * as the fault around checks for pte_none() before the installation, however
+ * to be super safe we just forbid it.
+ */
+static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
+}
+
 static inline bool userfaultfd_missing(struct vm_area_struct *vma)
 {
 	return vma->vm_flags & VM_UFFD_MISSING;
@@ -236,6 +248,11 @@ static inline void userfaultfd_unmap_complete(struct mm_struct *mm,
 {
 }
 
+static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
+{
+	return false;
+}
+
 #endif /* CONFIG_USERFAULTFD */
 
 static inline bool pte_marker_entry_uffd_wp(swp_entry_t entry)
diff --git a/mm/memory.c b/mm/memory.c
index b1af996b09ca..21abb8a30553 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3559,6 +3559,39 @@ static inline bool should_try_to_free_swap(struct page *page,
 		page_count(page) == 2;
 }
 
+static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
+{
+	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+				       vmf->address, &vmf->ptl);
+	/*
+	 * Be careful so that we will only recover a special uffd-wp pte into a
+	 * none pte.  Otherwise it means the pte could have changed, so retry.
+	 */
+	if (is_pte_marker(*vmf->pte))
+		pte_clear(vmf->vma->vm_mm, vmf->address, vmf->pte);
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+	return 0;
+}
+
+/*
+ * This is actually a page-missing access, but with uffd-wp special pte
+ * installed.  It means this pte was wr-protected before being unmapped.
+ */
+static vm_fault_t pte_marker_handle_uffd_wp(struct vm_fault *vmf)
+{
+	/*
+	 * Just in case there're leftover special ptes even after the region
+	 * got unregistered - we can simply clear them.  We can also do that
+	 * proactively when e.g. when we do UFFDIO_UNREGISTER upon some uffd-wp
+	 * ranges, but it should be more efficient to be done lazily here.
+	 */
+	if (unlikely(!userfaultfd_wp(vmf->vma) || vma_is_anonymous(vmf->vma)))
+		return pte_marker_clear(vmf);
+
+	/* do_fault() can handle pte markers too like none pte */
+	return do_fault(vmf);
+}
+
 static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
 {
 	swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);
@@ -3572,8 +3605,11 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
 	if (WARN_ON_ONCE(vma_is_anonymous(vmf->vma) || !marker))
 		return VM_FAULT_SIGBUS;
 
-	/* TODO: handle pte markers */
-	return 0;
+	if (pte_marker_entry_uffd_wp(entry))
+		return pte_marker_handle_uffd_wp(vmf);
+
+	/* This is an unknown pte marker */
+	return VM_FAULT_SIGBUS;
 }
 
 /*
@@ -4157,6 +4193,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
 {
 	struct vm_area_struct *vma = vmf->vma;
+	bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
 	bool prefault = vmf->address != addr;
 	pte_t entry;
@@ -4171,6 +4208,8 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
 
 	if (write)
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+	if (unlikely(uffd_wp))
+		entry = pte_mkuffd_wp(pte_wrprotect(entry));
 	/* copy-on-write page */
 	if (write && !(vma->vm_flags & VM_SHARED)) {
 		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
@@ -4344,9 +4383,21 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
 	return vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff);
 }
 
+/* Return true if we should do read fault-around, false otherwise */
+static inline bool should_fault_around(struct vm_fault *vmf)
+{
+	/* No ->map_pages?  No way to fault around... */
+	if (!vmf->vma->vm_ops->map_pages)
+		return false;
+
+	if (uffd_disable_fault_around(vmf->vma))
+		return false;
+
+	return fault_around_bytes >> PAGE_SHIFT > 1;
+}
+
 static vm_fault_t do_read_fault(struct vm_fault *vmf)
 {
-	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret = 0;
 
 	/*
@@ -4354,12 +4405,10 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
 	 * if page by the offset is not ready to be mapped (cold cache or
 	 * something).
 	 */
-	if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
-		if (likely(!userfaultfd_minor(vmf->vma))) {
-			ret = do_fault_around(vmf);
-			if (ret)
-				return ret;
-		}
+	if (should_fault_around(vmf)) {
+		ret = do_fault_around(vmf);
+		if (ret)
+			return ret;
 	}
 
 	ret = __do_fault(vmf);
-- 
2.32.0




* [PATCH v8 07/23] mm/shmem: Persist uffd-wp bit across zapping for file-backed
From: Peter Xu @ 2022-04-05  1:48 UTC
  To: linux-kernel, linux-mm
  Cc: Mike Kravetz, Nadav Amit, Matthew Wilcox, Mike Rapoport,
	David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple, peterx

File-backed memory is prone to being unmapped at any time.  That means
all information in the pte will be dropped, including the uffd-wp flag.

To persist the uffd-wp flag, we'll use pte markers.  This patch teaches
the zap code to understand uffd-wp and to know when to keep or drop the
uffd-wp bit.

Add a new flag ZAP_FLAG_DROP_MARKER and set it in zap_details when we
don't want to persist such information, for example when destroying the
whole vma, or when punching a hole in a shmem file.  In all other cases we
should never drop the uffd-wp bit, or the wr-protect information will be
lost.

The new ZAP_FLAG_DROP_MARKER needs to be put into mm.h rather than
memory.c because it'll be further referenced in the hugetlb files later.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h        | 10 ++++++++
 include/linux/mm_inline.h | 43 ++++++++++++++++++++++++++++++++++
 mm/memory.c               | 49 ++++++++++++++++++++++++++++++++++++---
 mm/rmap.c                 |  8 +++++++
 4 files changed, 107 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 26428ff262fc..857bc8f7af45 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3422,4 +3422,14 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
 }
 #endif
 
+typedef unsigned int __bitwise zap_flags_t;
+
+/*
+ * Whether to drop the pte markers, for example, the uffd-wp information for
+ * file-backed memory.  This should only be specified when we will completely
+ * drop the page in the mm, either by truncation or unmapping of the vma.  By
+ * default, the flag is not set.
+ */
+#define  ZAP_FLAG_DROP_MARKER        ((__force zap_flags_t) BIT(0))
+
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index ac32125745ab..7b25b53c474a 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -6,6 +6,8 @@
 #include <linux/huge_mm.h>
 #include <linux/swap.h>
 #include <linux/string.h>
+#include <linux/userfaultfd_k.h>
+#include <linux/swapops.h>
 
 /**
  * folio_is_file_lru - Should the folio be on a file LRU or anon LRU?
@@ -316,5 +318,46 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
 	return atomic_read(&mm->tlb_flush_pending) > 1;
 }
 
+/*
+ * If this pte is wr-protected by uffd-wp in any form, arm the special pte to
+ * replace a none pte.  NOTE!  This should only be called when *pte is already
+ * cleared so we will never accidentally replace something valuable.  Meanwhile
+ * none pte also means we are not demoting the pte so tlb flushed is not needed.
+ * E.g., when pte cleared the caller should have taken care of the tlb flush.
+ *
+ * Must be called with pgtable lock held so that no thread will see the none
+ * pte, and if they see it, they'll fault and serialize at the pgtable lock.
+ *
+ * This function is a no-op if PTE_MARKER_UFFD_WP is not enabled.
+ */
+static inline void
+pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
+			      pte_t *pte, pte_t pteval)
+{
+#ifdef CONFIG_PTE_MARKER_UFFD_WP
+	bool arm_uffd_pte = false;
+
+	/* The current status of the pte should be "cleared" before calling */
+	WARN_ON_ONCE(!pte_none(*pte));
+
+	if (vma_is_anonymous(vma) || !userfaultfd_wp(vma))
+		return;
+
+	/* A uffd-wp wr-protected normal pte */
+	if (unlikely(pte_present(pteval) && pte_uffd_wp(pteval)))
+		arm_uffd_pte = true;
+
+	/*
+	 * A uffd-wp wr-protected swap pte.  Note: this should even cover an
+	 * existing pte marker with uffd-wp bit set.
+	 */
+	if (unlikely(pte_swp_uffd_wp_any(pteval)))
+		arm_uffd_pte = true;
+
+	if (unlikely(arm_uffd_pte))
+		set_pte_at(vma->vm_mm, addr, pte,
+			   make_pte_marker(PTE_MARKER_UFFD_WP));
+#endif
+}
 
 #endif
diff --git a/mm/memory.c b/mm/memory.c
index 21abb8a30553..1144845ff734 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -74,6 +74,7 @@
 #include <linux/perf_event.h>
 #include <linux/ptrace.h>
 #include <linux/vmalloc.h>
+#include <linux/mm_inline.h>
 
 #include <trace/events/kmem.h>
 
@@ -1306,6 +1307,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 struct zap_details {
 	struct folio *single_folio;	/* Locked folio to be unmapped */
 	bool even_cows;			/* Zap COWed private pages too? */
+	zap_flags_t zap_flags;		/* Extra flags for zapping */
 };
 
 /* Whether we should zap all COWed (private) pages too */
@@ -1334,6 +1336,29 @@ static inline bool should_zap_page(struct zap_details *details, struct page *pag
 	return !PageAnon(page);
 }
 
+static inline bool zap_drop_file_uffd_wp(struct zap_details *details)
+{
+	if (!details)
+		return false;
+
+	return details->zap_flags & ZAP_FLAG_DROP_MARKER;
+}
+
+/*
+ * This function makes sure that we'll replace the none pte with an uffd-wp
+ * swap special pte marker when necessary. Must be with the pgtable lock held.
+ */
+static inline void
+zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
+			      unsigned long addr, pte_t *pte,
+			      struct zap_details *details, pte_t pteval)
+{
+	if (zap_drop_file_uffd_wp(details))
+		return;
+
+	pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
+}
+
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pmd_t *pmd,
 				unsigned long addr, unsigned long end,
@@ -1371,6 +1396,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
 			tlb_remove_tlb_entry(tlb, pte, addr);
+			zap_install_uffd_wp_if_needed(vma, addr, pte, details,
+						      ptent);
 			if (unlikely(!page))
 				continue;
 
@@ -1401,6 +1428,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			page = pfn_swap_entry_to_page(entry);
 			if (unlikely(!should_zap_page(details, page)))
 				continue;
+			/*
+			 * Both device private/exclusive mappings should only
+			 * work with anonymous page so far, so we don't need to
+			 * consider uffd-wp bit when zap. For more information,
+			 * see zap_install_uffd_wp_if_needed().
+			 */
+			WARN_ON_ONCE(!vma_is_anonymous(vma));
 			rss[mm_counter(page)]--;
 			if (is_device_private_entry(entry))
 				page_remove_rmap(page, vma, false);
@@ -1417,8 +1451,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			if (!should_zap_page(details, page))
 				continue;
 			rss[mm_counter(page)]--;
-		} else if (is_pte_marker_entry(entry)) {
-			/* By default, simply drop all pte markers when zap */
+		} else if (pte_marker_entry_uffd_wp(entry)) {
+			/* Only drop the uffd-wp marker if explicitly requested */
+			if (!zap_drop_file_uffd_wp(details))
+				continue;
 		} else if (is_hwpoison_entry(entry)) {
 			if (!should_zap_cows(details))
 				continue;
@@ -1427,6 +1463,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			WARN_ON_ONCE(1);
 		}
 		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+		zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 
 	add_mm_rss_vec(mm, rss);
@@ -1637,12 +1674,17 @@ void unmap_vmas(struct mmu_gather *tlb,
 		unsigned long end_addr)
 {
 	struct mmu_notifier_range range;
+	struct zap_details details = {
+		.zap_flags = ZAP_FLAG_DROP_MARKER,
+		/* Careful - we need to zap private pages too! */
+		.even_cows = true,
+	};
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,
 				start_addr, end_addr);
 	mmu_notifier_invalidate_range_start(&range);
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
-		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
+		unmap_single_vma(tlb, vma, start_addr, end_addr, &details);
 	mmu_notifier_invalidate_range_end(&range);
 }
 
@@ -3438,6 +3480,7 @@ void unmap_mapping_folio(struct folio *folio)
 
 	details.even_cows = false;
 	details.single_folio = folio;
+	details.zap_flags = ZAP_FLAG_DROP_MARKER;
 
 	i_mmap_lock_read(mapping);
 	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
diff --git a/mm/rmap.c b/mm/rmap.c
index 208b2c683cec..69416072b1a6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -73,6 +73,7 @@
 #include <linux/page_idle.h>
 #include <linux/memremap.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/mm_inline.h>
 
 #include <asm/tlbflush.h>
 
@@ -1538,6 +1539,13 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			pteval = ptep_clear_flush(vma, address, pvmw.pte);
 		}
 
+		/*
+		 * Now the pte is cleared. If this pte was uffd-wp armed,
+		 * we may want to replace a none pte with a marker pte if
+		 * it's file-backed, so we don't lose the tracking info.
+		 */
+		pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
+
 		/* Set the dirty flag on the folio now the pte is gone. */
 		if (pte_dirty(pteval))
 			folio_mark_dirty(folio);
-- 
2.32.0




* [PATCH v8 08/23] mm/shmem: Allow uffd wr-protect none pte for file-backed mem
From: Peter Xu @ 2022-04-05  1:48 UTC
  To: linux-kernel, linux-mm
  Cc: Mike Kravetz, Nadav Amit, Matthew Wilcox, Mike Rapoport,
	David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple, peterx

File-backed memory differs from anonymous memory in that even if the pte
is missing, the data may still reside in the file or in the page/swap
cache.  So when wr-protecting a pte, we need to consider none ptes too.

We do that by installing the uffd-wp pte markers when necessary.  Then,
when there's a future write to the pte, the fault handler will take the
special path to first fault in the page read-only, and then report to the
userfaultfd server with the wr-protect message.

On the other hand, when unprotecting a page, it's also possible that the
pte got unmapped and replaced by the special uffd-wp marker.  Then we'll
need to be able to recover the uffd-wp pte marker into a none pte, so that
the next access to the page will fault in correctly as usual.

Special care needs to be taken throughout the change_protection_range()
process.  Since we now allow userspace to wr-protect a none pte, we need
to be able to pre-populate the page table entries when we see a
(!anonymous && MM_CP_UFFD_WP) request; otherwise change_protection_range()
will simply skip wherever the pgtable entry does not exist.

For example, the pgtable can be missing for a whole 2M pmd chunk while the
page cache exists for the full 2M range.  When we want to wr-protect one 4K
page within that range, we need to pre-populate the pgtable and install the
pte marker showing that we want to get a message and block the thread when
the page cache of that 4K page is written to.  Without pre-populating the
pmd, change_protection() will simply skip the whole pmd.

Note that this patch only covers small pages (pte level) and does not cover
transparent huge pages yet.  That will be done later; this patch is also a
preparation for it.
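
As an illustration (not part of the patch), a minimal userspace sketch of
what this enables: wr-protecting a shmem page whose pte is still none.
Error handling is omitted, and memfd_create() plus a 4K page size are
assumed:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/userfaultfd.h>

    int main(void)
    {
        size_t len = 2UL * 1024 * 1024;
        int memfd = memfd_create("uffd-wp-demo", 0);
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
        char *mem;

        ftruncate(memfd, len);
        mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, memfd, 0);

        struct uffdio_api api = { .api = UFFD_API };
        ioctl(uffd, UFFDIO_API, &api);

        struct uffdio_register reg = {
            .range = { .start = (unsigned long)mem, .len = len },
            .mode  = UFFDIO_REGISTER_MODE_WP,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        /*
         * Nothing has touched this page yet, so its pte is none; with
         * this patch the request installs a pte marker instead of being
         * silently skipped by change_protection_range().
         */
        struct uffdio_writeprotect wp = {
            .range = { .start = (unsigned long)mem, .len = 4096 },
            .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
        };
        ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
        return 0;
    }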

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/mprotect.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 62 insertions(+), 2 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 709a6f73b764..bd62d5938c6c 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -30,6 +30,7 @@
 #include <linux/mm_inline.h>
 #include <linux/pgtable.h>
 #include <linux/sched/sysctl.h>
+#include <linux/userfaultfd_k.h>
 #include <asm/cacheflush.h>
 #include <asm/mmu_context.h>
 #include <asm/tlbflush.h>
@@ -188,8 +189,16 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 					newpte = pte_swp_mksoft_dirty(newpte);
 				if (pte_swp_uffd_wp(oldpte))
 					newpte = pte_swp_mkuffd_wp(newpte);
-			} else if (is_pte_marker_entry(entry)) {
-				/* Skip it, the same as none pte */
+			} else if (pte_marker_entry_uffd_wp(entry)) {
+				/*
+				 * If this is uffd-wp pte marker and we'd like
+				 * to unprotect it, drop it; the next page
+				 * fault will trigger without uffd trapping.
+				 */
+				if (uffd_wp_resolve) {
+					pte_clear(vma->vm_mm, addr, pte);
+					pages++;
+				}
 				continue;
 			} else {
 				newpte = oldpte;
@@ -204,6 +213,20 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				set_pte_at(vma->vm_mm, addr, pte, newpte);
 				pages++;
 			}
+		} else {
+			/* It must be a none pte, or what else?.. */
+			WARN_ON_ONCE(!pte_none(oldpte));
+			if (unlikely(uffd_wp && !vma_is_anonymous(vma))) {
+				/*
+				 * For file-backed mem, we need to be able to
+				 * wr-protect a none pte, because even if the
+				 * pte is none, the page/swap cache could
+				 * exist.  Do that by installing a marker.
+				 */
+				set_pte_at(vma->vm_mm, addr, pte,
+					   make_pte_marker(PTE_MARKER_UFFD_WP));
+				pages++;
+			}
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
@@ -237,6 +260,39 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
 	return 0;
 }
 
+/* Return true if we're uffd wr-protecting file-backed memory, or false */
+static inline bool
+uffd_wp_protect_file(struct vm_area_struct *vma, unsigned long cp_flags)
+{
+	return (cp_flags & MM_CP_UFFD_WP) && !vma_is_anonymous(vma);
+}
+
+/*
+ * If wr-protecting the range for file-backed, populate pgtable for the case
+ * when pgtable is empty but page cache exists.  When {pte|pmd|...}_alloc()
+ * failed it means no memory, we don't have a better option but stop.
+ */
+#define  change_pmd_prepare(vma, pmd, cp_flags)				\
+	do {								\
+		if (unlikely(uffd_wp_protect_file(vma, cp_flags))) {	\
+			if (WARN_ON_ONCE(pte_alloc(vma->vm_mm, pmd)))	\
+				break;					\
+		}							\
+	} while (0)
+/*
+ * This is the general pud/p4d/pgd version of change_pmd_prepare(). We need to
+ * have separate change_pmd_prepare() because pte_alloc() returns 0 on success,
+ * while {pmd|pud|p4d}_alloc() returns the valid pointer on success.
+ */
+#define  change_prepare(vma, high, low, addr, cp_flags)			\
+	do {								\
+		if (unlikely(uffd_wp_protect_file(vma, cp_flags))) {	\
+			low##_t *p = low##_alloc(vma->vm_mm, high, addr); \
+			if (WARN_ON_ONCE(p == NULL))			\
+				break;					\
+		}							\
+	} while (0)
+
 static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pud_t *pud, unsigned long addr, unsigned long end,
 		pgprot_t newprot, unsigned long cp_flags)
@@ -255,6 +311,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 
 		next = pmd_addr_end(addr, end);
 
+		change_pmd_prepare(vma, pmd, cp_flags);
 		/*
 		 * Automatic NUMA balancing walks the tables with mmap_lock
 		 * held for read. It's possible a parallel update to occur
@@ -320,6 +377,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 	pud = pud_offset(p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
+		change_prepare(vma, pud, pmd, addr, cp_flags);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		pages += change_pmd_range(vma, pud, addr, next, newprot,
@@ -340,6 +398,7 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
 	p4d = p4d_offset(pgd, addr);
 	do {
 		next = p4d_addr_end(addr, end);
+		change_prepare(vma, p4d, pud, addr, cp_flags);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		pages += change_pud_range(vma, p4d, addr, next, newprot,
@@ -365,6 +424,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	inc_tlb_flush_pending(mm);
 	do {
 		next = pgd_addr_end(addr, end);
+		change_prepare(vma, pgd, p4d, addr, cp_flags);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		pages += change_p4d_range(vma, pgd, addr, next, newprot,
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v8 09/23] mm/shmem: Allows file-back mem to be uffd wr-protected on thps
  2022-04-05  1:46 [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (7 preceding siblings ...)
  2022-04-05  1:48 ` [PATCH v8 08/23] mm/shmem: Allow uffd wr-protect none pte for file-backed mem Peter Xu
@ 2022-04-05  1:48 ` Peter Xu
  2022-04-05  1:48 ` [PATCH v8 10/23] mm/shmem: Handle uffd-wp during fork() Peter Xu
                   ` (15 subsequent siblings)
  24 siblings, 0 replies; 63+ messages in thread
From: Peter Xu @ 2022-04-05  1:48 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Mike Kravetz, Nadav Amit, Matthew Wilcox, Mike Rapoport,
	David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple, peterx

We don't have a "huge" version of pte markers; instead, when necessary, we
split the thp.

However, splitting the thp is not enough, because a file-backed thp is
handled totally differently from an anonymous thp: rather than doing a real
split, the thp pmd will simply get cleared in __split_huge_pmd_locked().

That is a problem if, e.g., a thp covers the range [0, 2M) but we want to
wr-protect a small page residing in the [4K, 8K) range, because after
__split_huge_pmd() returns there will be a none pmd, and change_pmd_range()
will just skip it right after the split.

Here we leverage the previously introduced change_pmd_prepare() macro so
that we'll populate the pmd with a pgtable page after the pmd split (during
which the pmd will be cleared for cases like shmem).  Then change_pte_range()
will do all the rest for us by installing the uffd-wp pte marker at any none
pte that we'd like to wr-protect.
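
To make the 2M-pmd scenario concrete, a hedged sketch reusing the "uffd",
"mem" and "len" setup from the sketch in the previous patch; it assumes
shmem THP is available (e.g. shmem_enabled=advise):

    madvise(mem, len, MADV_HUGEPAGE);
    memset(mem, 0, len);                /* may fault in one 2M thp */

    struct uffdio_writeprotect wp = {
        /* a 4K page inside the 2M thp: forces the pmd split, after
         * which change_pmd_prepare() re-populates the cleared pmd */
        .range = { .start = (unsigned long)mem + 4096, .len = 4096 },
        .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
    };
    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);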

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/mprotect.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index bd62d5938c6c..e0a567b66d07 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -333,8 +333,15 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		}
 
 		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
-			if (next - addr != HPAGE_PMD_SIZE) {
+			if ((next - addr != HPAGE_PMD_SIZE) ||
+			    uffd_wp_protect_file(vma, cp_flags)) {
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
+				/*
+				 * For file-backed, the pmd could have been
+				 * cleared; make sure pmd populated if
+				 * necessary, then fall-through to pte level.
+				 */
+				change_pmd_prepare(vma, pmd, cp_flags);
 			} else {
 				int nr_ptes = change_huge_pmd(vma, pmd, addr,
 							      newprot, cp_flags);
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v8 10/23] mm/shmem: Handle uffd-wp during fork()
  2022-04-05  1:46 [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (8 preceding siblings ...)
  2022-04-05  1:48 ` [PATCH v8 09/23] mm/shmem: Allows file-back mem to be uffd wr-protected on thps Peter Xu
@ 2022-04-05  1:48 ` Peter Xu
  2022-04-06  6:16   ` kernel test robot
  2022-04-05  1:48 ` [PATCH v8 11/23] mm/hugetlb: Introduce huge pte version of uffd-wp helpers Peter Xu
                   ` (14 subsequent siblings)
  24 siblings, 1 reply; 63+ messages in thread
From: Peter Xu @ 2022-04-05  1:48 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Mike Kravetz, Nadav Amit, Matthew Wilcox, Mike Rapoport,
	David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple, peterx

Normally we skip the page copy during fork() for VM_SHARED shmem, but we
can't skip it anymore if uffd-wp is enabled on the dst vma.  This should only
happen when the src uffd has UFFD_FEATURE_EVENT_FORK enabled on a uffd-wp
shmem vma, so that VM_UFFD_WP will be propagated onto the dst vma too; then
we should copy the pgtables with the uffd-wp bit and pte markers, because
this information will be lost otherwise.

Since the condition checks for deciding "whether a vma needs to copy the
pgtable during fork()" will become even more complicated, introduce a helper
vma_needs_copy() for it, so everything will be clearer.
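
A hedged sketch of the userspace sequence that reaches this path (monitor
setup as in the earlier sketches; "uffd" assumed):

    struct uffdio_api api = {
        .api      = UFFD_API,
        .features = UFFD_FEATURE_EVENT_FORK,
    };
    ioctl(uffd, UFFDIO_API, &api);
    /* ... UFFDIO_REGISTER with UFFDIO_REGISTER_MODE_WP on the shmem
     * vma, then UFFDIO_WRITEPROTECT some of its pages ... */

    pid_t child = fork();
    /*
     * With EVENT_FORK, VM_UFFD_WP propagates onto the child's vma, so
     * vma_needs_copy() returns true and the pgtable (uffd-wp bits and
     * pte markers included) is copied rather than skipped.
     */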

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/memory.c | 49 +++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 41 insertions(+), 8 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 1144845ff734..8ba1bb196095 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -867,6 +867,14 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		if (try_restore_exclusive_pte(src_pte, src_vma, addr))
 			return -EBUSY;
 		return -ENOENT;
+	} else if (is_pte_marker_entry(entry)) {
+		/*
+		 * We should only reach this path because dst_vma has uffd-wp
+		 * enabled; do a sanity check.
+		 */
+		WARN_ON_ONCE(!userfaultfd_wp(dst_vma));
+		set_pte_at(dst_mm, addr, dst_pte, pte);
+		return 0;
 	}
 	if (!userfaultfd_wp(dst_vma))
 		pte = pte_swp_clear_uffd_wp(pte);
@@ -1221,6 +1229,38 @@ copy_p4d_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	return 0;
 }
 
+/*
+ * Return true if the vma needs to copy the pgtable during this fork().  Return
+ * false when we can speed up fork() by allowing lazy page faults later until
+ * when the child accesses the memory range.
+ */
+bool
+vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
+{
+	/*
+	 * Always copy pgtables when dst_vma has uffd-wp enabled even if it's
+	 * file-backed (e.g. shmem). Because when uffd-wp is enabled, pgtable
+	 * contains uffd-wp protection information, that's something we can't
+	 * retrieve from page cache, and skip copying will lose those info.
+	 */
+	if (userfaultfd_wp(dst_vma))
+		return true;
+
+	if (src_vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP))
+		return true;
+
+	if (src_vma->anon_vma)
+		return true;
+
+	/*
+	 * Don't copy ptes where a page fault will fill them correctly.  Fork
+	 * becomes much lighter when there are big shared or private readonly
+	 * mappings. The tradeoff is that copy_page_range is more efficient
+	 * than faulting.
+	 */
+	return false;
+}
+
 int
 copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 {
@@ -1234,14 +1274,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 	bool is_cow;
 	int ret;
 
-	/*
-	 * Don't copy ptes where a page fault will fill them correctly.
-	 * Fork becomes much lighter when there are big shared or private
-	 * readonly mappings. The tradeoff is that copy_page_range is more
-	 * efficient than faulting.
-	 */
-	if (!(src_vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP)) &&
-	    !src_vma->anon_vma)
+	if (!vma_needs_copy(dst_vma, src_vma))
 		return 0;
 
 	if (is_vm_hugetlb_page(src_vma))
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v8 11/23] mm/hugetlb: Introduce huge pte version of uffd-wp helpers
  2022-04-05  1:46 [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (9 preceding siblings ...)
  2022-04-05  1:48 ` [PATCH v8 10/23] mm/shmem: Handle uffd-wp during fork() Peter Xu
@ 2022-04-05  1:48 ` Peter Xu
  2022-04-05  1:49 ` [PATCH v8 12/23] mm/hugetlb: Hook page faults for uffd write protection Peter Xu
                   ` (13 subsequent siblings)
  24 siblings, 0 replies; 63+ messages in thread
From: Peter Xu @ 2022-04-05  1:48 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Mike Kravetz, Nadav Amit, Matthew Wilcox, Mike Rapoport,
	David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple, peterx

They will be used in follow-up patches to check/set/clear the uffd-wp bit
of a huge pte.

So far they reuse the small pte helpers.  Archs can override these versions
when necessary (with __HAVE_ARCH_HUGE_PTE_UFFD_WP* macros) in the future.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/s390/include/asm/hugetlb.h | 15 +++++++++++++++
 include/asm-generic/hugetlb.h   | 15 +++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
index bea47e7cc6a0..be99eda87f4d 100644
--- a/arch/s390/include/asm/hugetlb.h
+++ b/arch/s390/include/asm/hugetlb.h
@@ -115,6 +115,21 @@ static inline pte_t huge_pte_modify(pte_t pte, pgprot_t newprot)
 	return pte_modify(pte, newprot);
 }
 
+static inline pte_t huge_pte_mkuffd_wp(pte_t pte)
+{
+	return pte;
+}
+
+static inline pte_t huge_pte_clear_uffd_wp(pte_t pte)
+{
+	return pte;
+}
+
+static inline int huge_pte_uffd_wp(pte_t pte)
+{
+	return 0;
+}
+
 static inline bool gigantic_page_runtime_supported(void)
 {
 	return true;
diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
index f39cad20ffc6..896f341f614d 100644
--- a/include/asm-generic/hugetlb.h
+++ b/include/asm-generic/hugetlb.h
@@ -35,6 +35,21 @@ static inline pte_t huge_pte_modify(pte_t pte, pgprot_t newprot)
 	return pte_modify(pte, newprot);
 }
 
+static inline pte_t huge_pte_mkuffd_wp(pte_t pte)
+{
+	return pte_mkuffd_wp(pte);
+}
+
+static inline pte_t huge_pte_clear_uffd_wp(pte_t pte)
+{
+	return pte_clear_uffd_wp(pte);
+}
+
+static inline int huge_pte_uffd_wp(pte_t pte)
+{
+	return pte_uffd_wp(pte);
+}
+
 #ifndef __HAVE_ARCH_HUGE_PTE_CLEAR
 static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
 		    pte_t *ptep, unsigned long sz)
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v8 12/23] mm/hugetlb: Hook page faults for uffd write protection
  2022-04-05  1:46 [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (10 preceding siblings ...)
  2022-04-05  1:48 ` [PATCH v8 11/23] mm/hugetlb: Introduce huge pte version of uffd-wp helpers Peter Xu
@ 2022-04-05  1:49 ` Peter Xu
  2022-04-05  1:49 ` [PATCH v8 13/23] mm/hugetlb: Take care of UFFDIO_COPY_MODE_WP Peter Xu
                   ` (12 subsequent siblings)
  24 siblings, 0 replies; 63+ messages in thread
From: Peter Xu @ 2022-04-05  1:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Mike Kravetz, Nadav Amit, Matthew Wilcox, Mike Rapoport,
	David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple, peterx

Hook up hugetlb_fault() with the capability to handle userfaultfd-wp faults.

We do this slightly earlier than hugetlb_cow() so that we can avoid taking some
extra locks that we definitely don't need.
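
For context, a hedged sketch of what the fault-handling thread would see
("hpage_size" is an assumed variable holding the vma's huge page size):

    struct uffd_msg msg;

    read(uffd, &msg, sizeof(msg));
    if (msg.event == UFFD_EVENT_PAGEFAULT &&
        (msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP)) {
        struct uffdio_writeprotect wp = {
            .range = {
                .start = msg.arg.pagefault.address & ~(hpage_size - 1),
                .len   = hpage_size,
            },
            .mode  = 0,    /* resolve: let the faulting thread continue */
        };
        ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
    }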

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dd642cfc538b..82df0fcfedf9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5711,6 +5711,26 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
 		goto out_ptl;
 
+	/* Handle userfault-wp first, before trying to lock more pages */
+	if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(ptep)) &&
+	    (flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
+		struct vm_fault vmf = {
+			.vma = vma,
+			.address = haddr,
+			.real_address = address,
+			.flags = flags,
+		};
+
+		spin_unlock(ptl);
+		if (pagecache_page) {
+			unlock_page(pagecache_page);
+			put_page(pagecache_page);
+		}
+		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+		i_mmap_unlock_read(mapping);
+		return handle_userfault(&vmf, VM_UFFD_WP);
+	}
+
 	/*
 	 * hugetlb_wp() requires page locks of pte_page(entry) and
 	 * pagecache_page, so here we need take the former one
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v8 13/23] mm/hugetlb: Take care of UFFDIO_COPY_MODE_WP
  2022-04-05  1:46 [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (11 preceding siblings ...)
  2022-04-05  1:49 ` [PATCH v8 12/23] mm/hugetlb: Hook page faults for uffd write protection Peter Xu
@ 2022-04-05  1:49 ` Peter Xu
  2022-04-05  1:49 ` [PATCH v8 14/23] mm/hugetlb: Handle UFFDIO_WRITEPROTECT Peter Xu
                   ` (11 subsequent siblings)
  24 siblings, 0 replies; 63+ messages in thread
From: Peter Xu @ 2022-04-05  1:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Mike Kravetz, Nadav Amit, Matthew Wilcox, Mike Rapoport,
	David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple, peterx

Pass the wp_copy variable into hugetlb_mcopy_atomic_pte() throughout the
stack.  Apply the UFFD_WP bit if UFFDIO_COPY_MODE_WP is specified with
UFFDIO_COPY.

Hugetlb pages are only managed by hugetlbfs, so we're safe even without
setting the dirty bit in the huge pte if the page is installed as read-only.
However we'd better still keep the dirty bit set for a read-only UFFDIO_COPY
pte (when the UFFDIO_COPY_MODE_WP bit is set), not only to match what we do
with shmem, but also because the page does contain dirty data that the kernel
just copied from userspace.
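
A hedged usage sketch ("uffd", "src", "dst" and "hpage_size" assumed set
up, with dst being a MODE_WP-registered hugetlb range):

    struct uffdio_copy copy = {
        .dst  = (unsigned long)dst,     /* huge-page aligned */
        .src  = (unsigned long)src,
        .len  = hpage_size,
        .mode = UFFDIO_COPY_MODE_WP,    /* install read-only + uffd-wp */
    };
    ioctl(uffd, UFFDIO_COPY, &copy);
    /* the first write to dst will now be trapped by the uffd */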

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/hugetlb.h |  6 ++++--
 mm/hugetlb.c            | 29 +++++++++++++++++++++++------
 mm/userfaultfd.c        | 14 +++++++++-----
 3 files changed, 36 insertions(+), 13 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 53c1b6082a4c..6347298778b6 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -160,7 +160,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
 				unsigned long dst_addr,
 				unsigned long src_addr,
 				enum mcopy_atomic_mode mode,
-				struct page **pagep);
+				struct page **pagep,
+				bool wp_copy);
 #endif /* CONFIG_USERFAULTFD */
 bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
 						struct vm_area_struct *vma,
@@ -355,7 +356,8 @@ static inline int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 						unsigned long dst_addr,
 						unsigned long src_addr,
 						enum mcopy_atomic_mode mode,
-						struct page **pagep)
+						struct page **pagep,
+						bool wp_copy)
 {
 	BUG();
 	return 0;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 82df0fcfedf9..c94deead22b2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5795,7 +5795,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 			    unsigned long dst_addr,
 			    unsigned long src_addr,
 			    enum mcopy_atomic_mode mode,
-			    struct page **pagep)
+			    struct page **pagep,
+			    bool wp_copy)
 {
 	bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
 	struct hstate *h = hstate_vma(dst_vma);
@@ -5925,7 +5926,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		goto out_release_unlock;
 
 	ret = -EEXIST;
-	if (!huge_pte_none(huge_ptep_get(dst_pte)))
+	/*
+	 * We allow overwriting a pte marker: consider when both MISSING|WP
+	 * are registered, we first wr-protect a none pte which has no page
+	 * cache page backing it, then access the page.
+	 */
+	if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
 		goto out_release_unlock;
 
 	if (vm_shared) {
@@ -5935,17 +5941,28 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
 	}
 
-	/* For CONTINUE on a non-shared VMA, don't set VM_WRITE for CoW. */
-	if (is_continue && !vm_shared)
+	/*
+	 * For either: (1) CONTINUE on a non-shared VMA, or (2) UFFDIO_COPY
+	 * with wp flag set, don't set pte write bit.
+	 */
+	if (wp_copy || (is_continue && !vm_shared))
 		writable = 0;
 	else
 		writable = dst_vma->vm_flags & VM_WRITE;
 
 	_dst_pte = make_huge_pte(dst_vma, page, writable);
-	if (writable)
-		_dst_pte = huge_pte_mkdirty(_dst_pte);
+	/*
+	 * Always mark UFFDIO_COPY page dirty; note that this may not be
+	 * extremely important for hugetlbfs for now since swapping is not
+	 * supported, but we should still be clear that this page cannot be
+	 * thrown away at will, even if the write bit is not set.
+	 */
+	_dst_pte = huge_pte_mkdirty(_dst_pte);
 	_dst_pte = pte_mkyoung(_dst_pte);
 
+	if (wp_copy)
+		_dst_pte = huge_pte_mkuffd_wp(_dst_pte);
+
 	set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
 
 	(void)huge_ptep_set_access_flags(dst_vma, dst_addr, dst_pte, _dst_pte,
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index b1c875b77fbb..da0b3ed2a6b5 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -304,7 +304,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 					      unsigned long dst_start,
 					      unsigned long src_start,
 					      unsigned long len,
-					      enum mcopy_atomic_mode mode)
+					      enum mcopy_atomic_mode mode,
+					      bool wp_copy)
 {
 	int vm_shared = dst_vma->vm_flags & VM_SHARED;
 	ssize_t err;
@@ -392,7 +393,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		}
 
 		if (mode != MCOPY_ATOMIC_CONTINUE &&
-		    !huge_pte_none(huge_ptep_get(dst_pte))) {
+		    !huge_pte_none_mostly(huge_ptep_get(dst_pte))) {
 			err = -EEXIST;
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			i_mmap_unlock_read(mapping);
@@ -400,7 +401,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		}
 
 		err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
-					       dst_addr, src_addr, mode, &page);
+					       dst_addr, src_addr, mode, &page,
+					       wp_copy);
 
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		i_mmap_unlock_read(mapping);
@@ -455,7 +457,8 @@ extern ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 				      unsigned long dst_start,
 				      unsigned long src_start,
 				      unsigned long len,
-				      enum mcopy_atomic_mode mode);
+				      enum mcopy_atomic_mode mode,
+				      bool wp_copy);
 #endif /* CONFIG_HUGETLB_PAGE */
 
 static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
@@ -575,7 +578,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	 */
 	if (is_vm_hugetlb_page(dst_vma))
 		return  __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
-						src_start, len, mcopy_mode);
+					       src_start, len, mcopy_mode,
+					       wp_copy);
 
 	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
 		goto out_unlock;
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v8 14/23] mm/hugetlb: Handle UFFDIO_WRITEPROTECT
  2022-04-05  1:46 [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (12 preceding siblings ...)
  2022-04-05  1:49 ` [PATCH v8 13/23] mm/hugetlb: Take care of UFFDIO_COPY_MODE_WP Peter Xu
@ 2022-04-05  1:49 ` Peter Xu
  2022-04-05  1:49 ` [PATCH v8 15/23] mm/hugetlb: Handle pte markers in page faults Peter Xu
                   ` (10 subsequent siblings)
  24 siblings, 0 replies; 63+ messages in thread
From: Peter Xu @ 2022-04-05  1:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Mike Kravetz, Nadav Amit, Matthew Wilcox, Mike Rapoport,
	David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple, peterx

This starts by passing cp_flags into hugetlb_change_protection() so hugetlb
will be able to handle MM_CP_UFFD_WP[_RESOLVE] requests.

huge_pte_clear_uffd_wp() is introduced to handle the case where
UFFDIO_WRITEPROTECT is requested on migration entries of huge pages.
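
Note the alignment check added in mwriteprotect_range(); a hedged sketch
("huge_mem" assumed to be a MODE_WP-registered 2M hugetlb mapping):

    size_t hpage_size = 2UL * 1024 * 1024;  /* vma_kernel_pagesize() */
    struct uffdio_writeprotect wp = {
        /* start and len must both be huge-page aligned, else -EINVAL */
        .range = { .start = (unsigned long)huge_mem, .len = hpage_size },
        .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
    };
    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);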

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/hugetlb.h |  6 ++++--
 mm/hugetlb.c            | 13 ++++++++++++-
 mm/mprotect.c           |  3 ++-
 mm/userfaultfd.c        |  8 ++++++++
 4 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 6347298778b6..38c5ac28b787 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -210,7 +210,8 @@ struct page *follow_huge_pgd(struct mm_struct *mm, unsigned long address,
 int pmd_huge(pmd_t pmd);
 int pud_huge(pud_t pud);
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
-		unsigned long address, unsigned long end, pgprot_t newprot);
+		unsigned long address, unsigned long end, pgprot_t newprot,
+		unsigned long cp_flags);
 
 bool is_hugetlb_entry_migration(pte_t pte);
 void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
@@ -391,7 +392,8 @@ static inline void move_hugetlb_state(struct page *oldpage,
 
 static inline unsigned long hugetlb_change_protection(
 			struct vm_area_struct *vma, unsigned long address,
-			unsigned long end, pgprot_t newprot)
+			unsigned long end, pgprot_t newprot,
+			unsigned long cp_flags)
 {
 	return 0;
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c94deead22b2..2401dd5997b7 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6207,7 +6207,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
-		unsigned long address, unsigned long end, pgprot_t newprot)
+		unsigned long address, unsigned long end,
+		pgprot_t newprot, unsigned long cp_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long start = address;
@@ -6217,6 +6218,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	unsigned long pages = 0;
 	bool shared_pmd = false;
 	struct mmu_notifier_range range;
+	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
+	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
 
 	/*
 	 * In the case of shared PMDs, the area to flush could be beyond
@@ -6263,6 +6266,10 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 					entry = make_readable_migration_entry(
 								swp_offset(entry));
 				newpte = swp_entry_to_pte(entry);
+				if (uffd_wp)
+					newpte = pte_swp_mkuffd_wp(newpte);
+				else if (uffd_wp_resolve)
+					newpte = pte_swp_clear_uffd_wp(newpte);
 				set_huge_swap_pte_at(mm, address, ptep,
 						     newpte, huge_page_size(h));
 				pages++;
@@ -6277,6 +6284,10 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 			old_pte = huge_ptep_modify_prot_start(vma, address, ptep);
 			pte = huge_pte_modify(old_pte, newprot);
 			pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
+			if (uffd_wp)
+				pte = huge_pte_mkuffd_wp(huge_pte_wrprotect(pte));
+			else if (uffd_wp_resolve)
+				pte = huge_pte_clear_uffd_wp(pte);
 			huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
 			pages++;
 		}
diff --git a/mm/mprotect.c b/mm/mprotect.c
index e0a567b66d07..6b0e8c213508 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -455,7 +455,8 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 	BUG_ON((cp_flags & MM_CP_UFFD_WP_ALL) == MM_CP_UFFD_WP_ALL);
 
 	if (is_vm_hugetlb_page(vma))
-		pages = hugetlb_change_protection(vma, start, end, newprot);
+		pages = hugetlb_change_protection(vma, start, end, newprot,
+						  cp_flags);
 	else
 		pages = change_protection_range(vma, start, end, newprot,
 						cp_flags);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index da0b3ed2a6b5..58d67f2bf980 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -704,6 +704,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 			atomic_t *mmap_changing)
 {
 	struct vm_area_struct *dst_vma;
+	unsigned long page_mask;
 	pgprot_t newprot;
 	int err;
 
@@ -740,6 +741,13 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 	if (!vma_is_anonymous(dst_vma))
 		goto out_unlock;
 
+	if (is_vm_hugetlb_page(dst_vma)) {
+		err = -EINVAL;
+		page_mask = vma_kernel_pagesize(dst_vma) - 1;
+		if ((start & page_mask) || (len & page_mask))
+			goto out_unlock;
+	}
+
 	if (enable_wp)
 		newprot = vm_get_page_prot(dst_vma->vm_flags & ~(VM_WRITE));
 	else
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v8 15/23] mm/hugetlb: Handle pte markers in page faults
  2022-04-05  1:46 [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (13 preceding siblings ...)
  2022-04-05  1:49 ` [PATCH v8 14/23] mm/hugetlb: Handle UFFDIO_WRITEPROTECT Peter Xu
@ 2022-04-05  1:49 ` Peter Xu
  2022-04-06 13:37   ` kernel test robot
  2022-04-05  1:49 ` [PATCH v8 16/23] mm/hugetlb: Allow uffd wr-protect none ptes Peter Xu
                   ` (9 subsequent siblings)
  24 siblings, 1 reply; 63+ messages in thread
From: Peter Xu @ 2022-04-05  1:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Mike Kravetz, Nadav Amit, Matthew Wilcox, Mike Rapoport,
	David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple, peterx

Allow hugetlb code to handle pte markers just like none ptes.  It's mostly
there already; we just need to make sure we don't assume hugetlb_no_page()
only handles none ptes, so when detecting a pte change we should use
pte_same() rather than pte_none().  We need to pass in the old_pte to do
the comparison.

Check the original pte to see whether it's a pte marker; if it is, recover
the uffd-wp bit on the new pte to be installed, so that the next write will
be trapped by uffd.
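
The resulting behavior, sketched from userspace ("mem" assumed to be a
wr-protected, not-yet-faulted hugetlb page):

    char c = mem[0];   /* read fault: hugetlb_no_page() installs the page,
                        * but the marker's uffd-wp bit is carried over to
                        * the new pte, which stays write-protected */
    mem[0] = 1;        /* write fault: trapped by userfaultfd as expected */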

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2401dd5997b7..9317b790161d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5412,7 +5412,8 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
 static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			struct vm_area_struct *vma,
 			struct address_space *mapping, pgoff_t idx,
-			unsigned long address, pte_t *ptep, unsigned int flags)
+			unsigned long address, pte_t *ptep,
+			pte_t old_pte, unsigned int flags)
 {
 	struct hstate *h = hstate_vma(vma);
 	vm_fault_t ret = VM_FAULT_SIGBUS;
@@ -5539,7 +5540,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 
 	ptl = huge_pte_lock(h, mm, ptep);
 	ret = 0;
-	if (!huge_pte_none(huge_ptep_get(ptep)))
+	/* If pte changed from under us, retry */
+	if (!pte_same(huge_ptep_get(ptep), old_pte))
 		goto backout;
 
 	if (anon_rmap) {
@@ -5549,6 +5551,12 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		page_dup_file_rmap(page, true);
 	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
 				&& (vma->vm_flags & VM_SHARED)));
+	/*
+	 * If this pte was previously wr-protected, keep it wr-protected even
+	 * if populated.
+	 */
+	if (unlikely(pte_marker_uffd_wp(old_pte)))
+		new_pte = huge_pte_wrprotect(huge_pte_mkuffd_wp(new_pte));
 	set_huge_pte_at(mm, haddr, ptep, new_pte);
 
 	hugetlb_count_add(pages_per_huge_page(h), mm);
@@ -5666,8 +5674,10 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 	entry = huge_ptep_get(ptep);
-	if (huge_pte_none(entry)) {
-		ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep, flags);
+	/* PTE markers should be handled the same way as none pte */
+	if (huge_pte_none_mostly(entry)) {
+		ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
+				      entry, flags);
 		goto out_mutex;
 	}
 
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v8 16/23] mm/hugetlb: Allow uffd wr-protect none ptes
  2022-04-05  1:46 [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (14 preceding siblings ...)
  2022-04-05  1:49 ` [PATCH v8 15/23] mm/hugetlb: Handle pte markers in page faults Peter Xu
@ 2022-04-05  1:49 ` Peter Xu
  2022-04-05  1:49 ` [PATCH v8 17/23] mm/hugetlb: Only drop uffd-wp special pte if required Peter Xu
                   ` (8 subsequent siblings)
  24 siblings, 0 replies; 63+ messages in thread
From: Peter Xu @ 2022-04-05  1:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Mike Kravetz, Nadav Amit, Matthew Wilcox, Mike Rapoport,
	David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple, peterx

Teach hugetlbfs code to wr-protect none ptes, in case page cache exists for
that pte.  Meanwhile we also need to be able to recognize a uffd-wp marker
pte and remove it for uffd_wp_resolve.

While at it, introduce a variable "psize" to replace all references to the
huge page size fetcher.
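
The resolve side, sketched with the same assumed setup as earlier
("huge_mem", "hpage_size"):

    struct uffdio_writeprotect unwp = {
        .range = { .start = (unsigned long)huge_mem, .len = hpage_size },
        .mode  = 0,    /* resolve: uffd-wp markers revert to none ptes */
    };
    ioctl(uffd, UFFDIO_WRITEPROTECT, &unwp);
    /* the next access faults in normally, with no uffd message */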

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 28 ++++++++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9317b790161d..578c48ef931a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6225,7 +6225,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	pte_t *ptep;
 	pte_t pte;
 	struct hstate *h = hstate_vma(vma);
-	unsigned long pages = 0;
+	unsigned long pages = 0, psize = huge_page_size(h);
 	bool shared_pmd = false;
 	struct mmu_notifier_range range;
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
@@ -6245,13 +6245,19 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 
 	mmu_notifier_invalidate_range_start(&range);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
-	for (; address < end; address += huge_page_size(h)) {
+	for (; address < end; address += psize) {
 		spinlock_t *ptl;
-		ptep = huge_pte_offset(mm, address, huge_page_size(h));
+		ptep = huge_pte_offset(mm, address, psize);
 		if (!ptep)
 			continue;
 		ptl = huge_pte_lock(h, mm, ptep);
 		if (huge_pmd_unshare(mm, vma, &address, ptep)) {
+			/*
+			 * When uffd-wp is enabled on the vma, unshare
+			 * shouldn't happen at all.  Warn if it happens
+			 * for any reason.
+			 */
+			WARN_ON_ONCE(uffd_wp || uffd_wp_resolve);
 			pages++;
 			spin_unlock(ptl);
 			shared_pmd = true;
@@ -6281,12 +6287,20 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 				else if (uffd_wp_resolve)
 					newpte = pte_swp_clear_uffd_wp(newpte);
 				set_huge_swap_pte_at(mm, address, ptep,
-						     newpte, huge_page_size(h));
+						     newpte, psize);
 				pages++;
 			}
 			spin_unlock(ptl);
 			continue;
 		}
+		if (unlikely(pte_marker_uffd_wp(pte))) {
+			/*
+			 * This is changing a non-present pte into a none pte,
+			 * no need for huge_ptep_modify_prot_start/commit().
+			 */
+			if (uffd_wp_resolve)
+				huge_pte_clear(mm, address, ptep, psize);
+		}
 		if (!huge_pte_none(pte)) {
 			pte_t old_pte;
 			unsigned int shift = huge_page_shift(hstate_vma(vma));
@@ -6300,6 +6314,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 				pte = huge_pte_clear_uffd_wp(pte);
 			huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
 			pages++;
+		} else {
+			/* None pte */
+			if (unlikely(uffd_wp))
+				/* Safe to modify directly (none->non-present). */
+				set_huge_pte_at(mm, address, ptep,
+						make_pte_marker(PTE_MARKER_UFFD_WP));
 		}
 		spin_unlock(ptl);
 	}
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v8 17/23] mm/hugetlb: Only drop uffd-wp special pte if required
  2022-04-05  1:46 [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (15 preceding siblings ...)
  2022-04-05  1:49 ` [PATCH v8 16/23] mm/hugetlb: Allow uffd wr-protect none ptes Peter Xu
@ 2022-04-05  1:49 ` Peter Xu
  2022-04-05  1:49 ` [PATCH v8 18/23] mm/hugetlb: Handle uffd-wp during fork() Peter Xu
                   ` (7 subsequent siblings)
  24 siblings, 0 replies; 63+ messages in thread
From: Peter Xu @ 2022-04-05  1:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Mike Kravetz, Nadav Amit, Matthew Wilcox, Mike Rapoport,
	David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple, peterx

As with shmem uffd-wp special ptes, only drop the uffd-wp special swap pte
if unmapping an entire vma, or when synchronized such that faults cannot race
with the unmap operation.  This requires passing zap_flags all the way down
to the lowest level hugetlb unmap routine: __unmap_hugepage_range.

In general, unmap calls originating in hugetlbfs code will pass the
ZAP_FLAG_DROP_MARKER flag, as synchronization is in place to prevent faults.
The exception is hole punch, which will first unmap without any
synchronization.  Later, when hole punch actually removes the page from the
file, it will check to see whether there was a subsequent fault and, if so,
take the hugetlb fault mutex while unmapping again.  This second unmap will
pass in ZAP_FLAG_DROP_MARKER.

The justification for "whether to apply the ZAP_FLAG_DROP_MARKER flag when
unmapping a hugetlb range" is (IMHO): we should never reach a state where a
page fault could erroneously fault in a wr-protected page-cache page as
writable, even for an extremely short period.  That could happen if e.g. we
passed ZAP_FLAG_DROP_MARKER when hugetlbfs_punch_hole() calls
hugetlb_vmdelete_list(): if a page faults after that call and before
remove_inode_hugepages() is executed, the page cache can be mapped writable
again in the small racy window, which can cause unexpected data to be
overwritten.
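
The hole-punch path described above can be exercised with a sketch like
the following (hedged; "fd" is assumed to be the hugetlbfs file backing a
wr-protected mapping):

    /* The first unmap keeps the uffd-wp markers (zap_flags == 0); only
     * the later remove_inode_hugepages() pass drops them under the
     * fault mutex with ZAP_FLAG_DROP_MARKER. */
    fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
              0, hpage_size);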

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/hugetlbfs/inode.c    | 15 +++++++++------
 include/linux/hugetlb.h |  8 +++++---
 mm/hugetlb.c            | 33 +++++++++++++++++++++++++--------
 mm/memory.c             |  5 ++++-
 4 files changed, 43 insertions(+), 18 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 99c7477cee5c..8b5b9df2be7d 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -404,7 +404,8 @@ static void remove_huge_page(struct page *page)
 }
 
 static void
-hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
+hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
+		      unsigned long zap_flags)
 {
 	struct vm_area_struct *vma;
 
@@ -438,7 +439,7 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
 		}
 
 		unmap_hugepage_range(vma, vma->vm_start + v_offset, v_end,
-									NULL);
+				     NULL, zap_flags);
 	}
 }
 
@@ -516,7 +517,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 				mutex_lock(&hugetlb_fault_mutex_table[hash]);
 				hugetlb_vmdelete_list(&mapping->i_mmap,
 					index * pages_per_huge_page(h),
-					(index + 1) * pages_per_huge_page(h));
+					(index + 1) * pages_per_huge_page(h),
+					ZAP_FLAG_DROP_MARKER);
 				i_mmap_unlock_write(mapping);
 			}
 
@@ -582,7 +584,8 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	i_mmap_lock_write(mapping);
 	i_size_write(inode, offset);
 	if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
-		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
+		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0,
+				      ZAP_FLAG_DROP_MARKER);
 	i_mmap_unlock_write(mapping);
 	remove_inode_hugepages(inode, offset, LLONG_MAX);
 }
@@ -615,8 +618,8 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 		i_mmap_lock_write(mapping);
 		if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
 			hugetlb_vmdelete_list(&mapping->i_mmap,
-						hole_start >> PAGE_SHIFT,
-						hole_end  >> PAGE_SHIFT);
+					      hole_start >> PAGE_SHIFT,
+					      hole_end >> PAGE_SHIFT, 0);
 		i_mmap_unlock_write(mapping);
 		remove_inode_hugepages(inode, hole_start, hole_end);
 		inode_unlock(inode);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 38c5ac28b787..ab48b3bbb0e6 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -143,11 +143,12 @@ long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
 			 unsigned long *, unsigned long *, long, unsigned int,
 			 int *);
 void unmap_hugepage_range(struct vm_area_struct *,
-			  unsigned long, unsigned long, struct page *);
+			  unsigned long, unsigned long, struct page *,
+			  unsigned long);
 void __unmap_hugepage_range_final(struct mmu_gather *tlb,
 			  struct vm_area_struct *vma,
 			  unsigned long start, unsigned long end,
-			  struct page *ref_page);
+			  struct page *ref_page, unsigned long zap_flags);
 void hugetlb_report_meminfo(struct seq_file *);
 int hugetlb_report_node_meminfo(char *buf, int len, int nid);
 void hugetlb_show_meminfo(void);
@@ -400,7 +401,8 @@ static inline unsigned long hugetlb_change_protection(
 
 static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
 			struct vm_area_struct *vma, unsigned long start,
-			unsigned long end, struct page *ref_page)
+			unsigned long end, struct page *ref_page,
+			unsigned long zap_flags)
 {
 	BUG();
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 578c48ef931a..e4af8b357b90 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4947,7 +4947,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
 
 static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 				   unsigned long start, unsigned long end,
-				   struct page *ref_page)
+				   struct page *ref_page, unsigned long zap_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
@@ -5003,7 +5003,18 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 		 * unmapped and its refcount is dropped, so just clear pte here.
 		 */
 		if (unlikely(!pte_present(pte))) {
-			huge_pte_clear(mm, address, ptep, sz);
+			/*
+			 * swap forms, and the caller does not want to
+			 * swap forms, meanwhile the caller does not want to
+			 * drop the uffd-wp bit in this zap, then replace the
+			 * pte with a marker.
+			 */
+			if (pte_swp_uffd_wp_any(pte) &&
+			    !(zap_flags & ZAP_FLAG_DROP_MARKER))
+				set_huge_pte_at(mm, address, ptep,
+						make_pte_marker(PTE_MARKER_UFFD_WP));
+			else
+				huge_pte_clear(mm, address, ptep, sz);
 			spin_unlock(ptl);
 			continue;
 		}
@@ -5031,7 +5042,11 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 		tlb_remove_huge_tlb_entry(h, tlb, ptep, address);
 		if (huge_pte_dirty(pte))
 			set_page_dirty(page);
-
+		/* Leave a uffd-wp pte marker if needed */
+		if (huge_pte_uffd_wp(pte) &&
+		    !(zap_flags & ZAP_FLAG_DROP_MARKER))
+			set_huge_pte_at(mm, address, ptep,
+					make_pte_marker(PTE_MARKER_UFFD_WP));
 		hugetlb_count_sub(pages_per_huge_page(h), mm);
 		page_remove_rmap(page, vma, true);
 
@@ -5065,9 +5080,10 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 
 void __unmap_hugepage_range_final(struct mmu_gather *tlb,
 			  struct vm_area_struct *vma, unsigned long start,
-			  unsigned long end, struct page *ref_page)
+			  unsigned long end, struct page *ref_page,
+			  unsigned long zap_flags)
 {
-	__unmap_hugepage_range(tlb, vma, start, end, ref_page);
+	__unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);
 
 	/*
 	 * Clear this flag so that x86's huge_pmd_share page_table_shareable
@@ -5083,12 +5099,13 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb,
 }
 
 void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
-			  unsigned long end, struct page *ref_page)
+			  unsigned long end, struct page *ref_page,
+			  unsigned long zap_flags)
 {
 	struct mmu_gather tlb;
 
 	tlb_gather_mmu(&tlb, vma->vm_mm);
-	__unmap_hugepage_range(&tlb, vma, start, end, ref_page);
+	__unmap_hugepage_range(&tlb, vma, start, end, ref_page, zap_flags);
 	tlb_finish_mmu(&tlb);
 }
 
@@ -5143,7 +5160,7 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
 		 */
 		if (!is_vma_resv_set(iter_vma, HPAGE_RESV_OWNER))
 			unmap_hugepage_range(iter_vma, address,
-					     address + huge_page_size(h), page);
+					     address + huge_page_size(h), page, 0);
 	}
 	i_mmap_unlock_write(mapping);
 }
diff --git a/mm/memory.c b/mm/memory.c
index 8ba1bb196095..9808edfe18d4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1675,8 +1675,11 @@ static void unmap_single_vma(struct mmu_gather *tlb,
 			 * safe to do nothing in this case.
 			 */
 			if (vma->vm_file) {
+				unsigned long zap_flags = details ?
+				    details->zap_flags : 0;
 				i_mmap_lock_write(vma->vm_file->f_mapping);
-				__unmap_hugepage_range_final(tlb, vma, start, end, NULL);
+				__unmap_hugepage_range_final(tlb, vma, start, end,
+							     NULL, zap_flags);
 				i_mmap_unlock_write(vma->vm_file->f_mapping);
 			}
 		} else
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v8 18/23] mm/hugetlb: Handle uffd-wp during fork()
  2022-04-05  1:46 [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (16 preceding siblings ...)
  2022-04-05  1:49 ` [PATCH v8 17/23] mm/hugetlb: Only drop uffd-wp special pte if required Peter Xu
@ 2022-04-05  1:49 ` Peter Xu
  2022-04-05  1:49 ` [PATCH v8 19/23] mm/khugepaged: Don't recycle vma pgtable if uffd-wp registered Peter Xu
                   ` (6 subsequent siblings)
  24 siblings, 0 replies; 63+ messages in thread
From: Peter Xu @ 2022-04-05  1:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Mike Kravetz, Nadav Amit, Matthew Wilcox, Mike Rapoport,
	David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple, peterx

Firstly, we'll need to pass dst_vma into copy_hugetlb_page_range(), because
for uffd-wp it's the dst vma that matters when deciding how we should treat
uffd-wp protected ptes.

We should recognize pte markers during fork and do the pte copy if needed.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/hugetlb.h |  7 +++++--
 mm/hugetlb.c            | 42 +++++++++++++++++++++++++++--------------
 mm/memory.c             |  2 +-
 3 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ab48b3bbb0e6..6df51d23b7ee 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -137,7 +137,8 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
 			     struct vm_area_struct *new_vma,
 			     unsigned long old_addr, unsigned long new_addr,
 			     unsigned long len);
-int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
+int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *,
+			    struct vm_area_struct *, struct vm_area_struct *);
 long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
 			 struct page **, struct vm_area_struct **,
 			 unsigned long *, unsigned long *, long, unsigned int,
@@ -268,7 +269,9 @@ static inline struct page *follow_huge_addr(struct mm_struct *mm,
 }
 
 static inline int copy_hugetlb_page_range(struct mm_struct *dst,
-			struct mm_struct *src, struct vm_area_struct *vma)
+					  struct mm_struct *src,
+					  struct vm_area_struct *dst_vma,
+					  struct vm_area_struct *src_vma)
 {
 	BUG();
 	return 0;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e4af8b357b90..e1571179698a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4706,23 +4706,24 @@ hugetlb_install_page(struct vm_area_struct *vma, pte_t *ptep, unsigned long addr
 }
 
 int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
-			    struct vm_area_struct *vma)
+			    struct vm_area_struct *dst_vma,
+			    struct vm_area_struct *src_vma)
 {
 	pte_t *src_pte, *dst_pte, entry, dst_entry;
 	struct page *ptepage;
 	unsigned long addr;
-	bool cow = is_cow_mapping(vma->vm_flags);
-	struct hstate *h = hstate_vma(vma);
+	bool cow = is_cow_mapping(src_vma->vm_flags);
+	struct hstate *h = hstate_vma(src_vma);
 	unsigned long sz = huge_page_size(h);
 	unsigned long npages = pages_per_huge_page(h);
-	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct address_space *mapping = src_vma->vm_file->f_mapping;
 	struct mmu_notifier_range range;
 	int ret = 0;
 
 	if (cow) {
-		mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, src,
-					vma->vm_start,
-					vma->vm_end);
+		mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, src_vma, src,
+					src_vma->vm_start,
+					src_vma->vm_end);
 		mmu_notifier_invalidate_range_start(&range);
 		mmap_assert_write_locked(src);
 		raw_write_seqcount_begin(&src->write_protect_seq);
@@ -4736,12 +4737,12 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		i_mmap_lock_read(mapping);
 	}
 
-	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
+	for (addr = src_vma->vm_start; addr < src_vma->vm_end; addr += sz) {
 		spinlock_t *src_ptl, *dst_ptl;
 		src_pte = huge_pte_offset(src, addr, sz);
 		if (!src_pte)
 			continue;
-		dst_pte = huge_pte_alloc(dst, vma, addr, sz);
+		dst_pte = huge_pte_alloc(dst, dst_vma, addr, sz);
 		if (!dst_pte) {
 			ret = -ENOMEM;
 			break;
@@ -4776,6 +4777,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		} else if (unlikely(is_hugetlb_entry_migration(entry) ||
 				    is_hugetlb_entry_hwpoisoned(entry))) {
 			swp_entry_t swp_entry = pte_to_swp_entry(entry);
+			bool uffd_wp = huge_pte_uffd_wp(entry);
 
 			if (!is_readable_migration_entry(swp_entry) && cow) {
 				/*
@@ -4785,10 +4787,21 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				swp_entry = make_readable_migration_entry(
 							swp_offset(swp_entry));
 				entry = swp_entry_to_pte(swp_entry);
+				if (userfaultfd_wp(src_vma) && uffd_wp)
+					entry = huge_pte_mkuffd_wp(entry);
 				set_huge_swap_pte_at(src, addr, src_pte,
 						     entry, sz);
 			}
+			if (!userfaultfd_wp(dst_vma) && uffd_wp)
+				entry = huge_pte_clear_uffd_wp(entry);
 			set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
+		} else if (unlikely(is_pte_marker(entry))) {
+			/*
+			 * We copy the pte marker only if the dst vma has
+			 * uffd-wp enabled.
+			 */
+			if (userfaultfd_wp(dst_vma))
+				set_huge_pte_at(dst, addr, dst_pte, entry);
 		} else {
 			entry = huge_ptep_get(src_pte);
 			ptepage = pte_page(entry);
@@ -4806,20 +4819,21 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			 */
 			if (!PageAnon(ptepage)) {
 				page_dup_file_rmap(ptepage, true);
-			} else if (page_try_dup_anon_rmap(ptepage, true, vma)) {
+			} else if (page_try_dup_anon_rmap(ptepage, true,
+							  src_vma)) {
 				pte_t src_pte_old = entry;
 				struct page *new;
 
 				spin_unlock(src_ptl);
 				spin_unlock(dst_ptl);
 				/* Do not use reserve as it's private owned */
-				new = alloc_huge_page(vma, addr, 1);
+				new = alloc_huge_page(dst_vma, addr, 1);
 				if (IS_ERR(new)) {
 					put_page(ptepage);
 					ret = PTR_ERR(new);
 					break;
 				}
-				copy_user_huge_page(new, ptepage, addr, vma,
+				copy_user_huge_page(new, ptepage, addr, dst_vma,
 						    npages);
 				put_page(ptepage);
 
@@ -4829,13 +4843,13 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 				entry = huge_ptep_get(src_pte);
 				if (!pte_same(src_pte_old, entry)) {
-					restore_reserve_on_error(h, vma, addr,
+					restore_reserve_on_error(h, dst_vma, addr,
 								new);
 					put_page(new);
 					/* dst_entry won't change as in child */
 					goto again;
 				}
-				hugetlb_install_page(vma, dst_pte, addr, new);
+				hugetlb_install_page(dst_vma, dst_pte, addr, new);
 				spin_unlock(src_ptl);
 				spin_unlock(dst_ptl);
 				continue;
diff --git a/mm/memory.c b/mm/memory.c
index 9808edfe18d4..d1e9c2517dfb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1278,7 +1278,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 		return 0;
 
 	if (is_vm_hugetlb_page(src_vma))
-		return copy_hugetlb_page_range(dst_mm, src_mm, src_vma);
+		return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma);
 
 	if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
 		/*
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v8 19/23] mm/khugepaged: Don't recycle vma pgtable if uffd-wp registered
  2022-04-05  1:46 [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (17 preceding siblings ...)
  2022-04-05  1:49 ` [PATCH v8 18/23] mm/hugetlb: Handle uffd-wp during fork() Peter Xu
@ 2022-04-05  1:49 ` Peter Xu
  2022-04-05  1:49 ` [PATCH v8 20/23] mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs Peter Xu
                   ` (5 subsequent siblings)
  24 siblings, 0 replies; 63+ messages in thread
From: Peter Xu @ 2022-04-05  1:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Mike Kravetz, Nadav Amit, Matthew Wilcox, Mike Rapoport,
	David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple, peterx

When we're trying to collapse a 2M huge shmem page, don't retract the
pgtable pmd page if the vma is registered with uffd-wp, because that pgtable
could have pte markers installed.  Recycling that pgtable means we'd lose the
pte markers, which could cause data loss for a uffd-wp enabled application
on shmem.

Instead of disabling khugepaged on these files, simply skip retracting these
special VMAs; the page cache can still be merged into a huge thp, and other
mms/vmas can still map the range of the file with a huge thp where
appropriate.

Note that checking VM_UFFD_WP needs to be done with mmap_sem held for write,
which avoids races like:

         khugepaged                             user thread
         ==========                             ===========
     check VM_UFFD_WP, not set
                                       UFFDIO_REGISTER with uffd-wp on shmem
                                       wr-protect some pages (install markers)
     take mmap_sem write lock
     erase pmd and free pmd page
      --> pte markers are dropped unnoticed!

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/khugepaged.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 04a972259136..d7c5bb9fd1fb 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1464,6 +1464,10 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 	if (!hugepage_vma_check(vma, vma->vm_flags | VM_HUGEPAGE))
 		return;
 
+	/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
+	if (userfaultfd_wp(vma))
+		return;
+
 	hpage = find_lock_page(vma->vm_file->f_mapping,
 			       linear_page_index(vma, haddr));
 	if (!hpage)
@@ -1599,7 +1603,15 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		 * reverse order. Trylock is a way to avoid deadlock.
 		 */
 		if (mmap_write_trylock(mm)) {
-			if (!khugepaged_test_exit(mm))
+			/*
+			 * When a vma is registered with uffd-wp, we can't
+			 * recycle the pmd pgtable because there can be pte
+			 * markers installed.  Only skip it, so the rest of
+			 * the mms/vmas can still have the same file mapped
+			 * hugely; however, it'll always be mapped in small
+			 * page sizes for uffd-wp registered ranges.
+			 */
+			if (!khugepaged_test_exit(mm) && !userfaultfd_wp(vma))
 				collapse_and_free_pmd(mm, vma, addr, pmd);
 			mmap_write_unlock(mm);
 		} else {
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v8 20/23] mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs
  2022-04-05  1:46 [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (18 preceding siblings ...)
  2022-04-05  1:49 ` [PATCH v8 19/23] mm/khugepaged: Don't recycle vma pgtable if uffd-wp registered Peter Xu
@ 2022-04-05  1:49 ` Peter Xu
  2022-04-05  1:49 ` [PATCH v8 21/23] mm/uffd: Enable write protection for shmem & hugetlbfs Peter Xu
                   ` (4 subsequent siblings)
  24 siblings, 0 replies; 63+ messages in thread
From: Peter Xu @ 2022-04-05  1:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Mike Kravetz, Nadav Amit, Matthew Wilcox, Mike Rapoport,
	David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple, peterx

This teaches the pagemap code to recognize the newly introduced swap special
pte for uffd-wp, as well as the general hugetlb case that we recently started
to support.  It should make pagemap's uffd-wp support complete.
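
For reference, a userspace consumer could test the bit with something like
the minimal sketch below.  It is only an illustration (the page_uffd_wp()
helper name is made up, and error handling is trimmed), and it assumes
PM_UFFD_WP lands at bit 57 of each 64-bit pagemap entry as in this series:

    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Return 1 if the pagemap entry covering @addr has uffd-wp set. */
    static int page_uffd_wp(void *addr)
    {
            uint64_t ent = 0;
            off_t off = ((uintptr_t)addr / getpagesize()) * sizeof(ent);
            int fd = open("/proc/self/pagemap", O_RDONLY);

            if (fd < 0)
                    return -1;
            if (pread(fd, &ent, sizeof(ent), off) != sizeof(ent))
                    ent = 0;
            close(fd);
            return !!(ent & (1ULL << 57));  /* PM_UFFD_WP (assumed) */
    }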

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/proc/task_mmu.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f46060eb91b5..194dfd7abf2b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1421,6 +1421,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 		migration = is_migration_entry(entry);
 		if (is_pfn_swap_entry(entry))
 			page = pfn_swap_entry_to_page(entry);
+		if (pte_marker_entry_uffd_wp(entry))
+			flags |= PM_UFFD_WP;
 	}
 
 	if (page && !PageAnon(page))
@@ -1556,10 +1558,15 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
 		if (page_mapcount(page) == 1)
 			flags |= PM_MMAP_EXCLUSIVE;
 
+		if (huge_pte_uffd_wp(pte))
+			flags |= PM_UFFD_WP;
+
 		flags |= PM_PRESENT;
 		if (pm->show_pfn)
 			frame = pte_pfn(pte) +
 				((addr & ~hmask) >> PAGE_SHIFT);
+	} else if (pte_swp_uffd_wp_any(pte)) {
+		flags |= PM_UFFD_WP;
 	}
 
 	for (; addr != end; addr += PAGE_SIZE) {
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v8 21/23] mm/uffd: Enable write protection for shmem & hugetlbfs
  2022-04-05  1:46 [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (19 preceding siblings ...)
  2022-04-05  1:49 ` [PATCH v8 20/23] mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs Peter Xu
@ 2022-04-05  1:49 ` Peter Xu
  2022-04-05  1:49 ` [PATCH v8 22/23] mm: Enable PTE markers by default Peter Xu
                   ` (3 subsequent siblings)
  24 siblings, 0 replies; 63+ messages in thread
From: Peter Xu @ 2022-04-05  1:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Mike Kravetz, Nadav Amit, Matthew Wilcox, Mike Rapoport,
	David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple, peterx

We've had all the necessary changes ready for both shmem and hugetlbfs.  Turn
on all the shmem/hugetlbfs switches for userfaultfd-wp.

We can expand UFFD_API_RANGE_IOCTLS_BASIC with _UFFDIO_WRITEPROTECT too because
all existing types now support write protection mode.

Since vma_can_userfault() will be used elsewhere, move it into userfaultfd_k.h.
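
To illustrate the resulting uAPI flow, below is a minimal userspace sketch
of write-protecting a mapped shmem or hugetlbfs range.  It is only an
illustration (the wp_range() helper is made up, and error handling is mostly
omitted); the check on UFFD_FEATURE_WP_HUGETLBFS_SHMEM in the returned
feature set is how userspace can detect kernels without this support:

    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Write-protect [addr, addr + len) with uffd-wp; returns the uffd. */
    static int wp_range(void *addr, unsigned long len)
    {
            int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
            struct uffdio_api api = { .api = UFFD_API };
            struct uffdio_register reg = {
                    .range = { .start = (unsigned long)addr, .len = len },
                    .mode = UFFDIO_REGISTER_MODE_WP,
            };
            struct uffdio_writeprotect wp = {
                    .range = { .start = (unsigned long)addr, .len = len },
                    .mode = UFFDIO_WRITEPROTECT_MODE_WP,
            };

            if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) < 0)
                    return -1;
            /* Cleared by the kernel if CONFIG_PTE_MARKER_UFFD_WP is unset */
            if (!(api.features & UFFD_FEATURE_WP_HUGETLBFS_SHMEM))
                    return -1;
            if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0 ||
                ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) < 0)
                    return -1;
            return uffd;
    }

A subsequent write to the range then generates a wp fault message on the
uffd, which the monitor resolves with another UFFDIO_WRITEPROTECT whose
mode clears the _WP bit.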

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/userfaultfd.c                 | 21 +++------------------
 include/linux/userfaultfd_k.h    | 20 ++++++++++++++++++++
 include/uapi/linux/userfaultfd.h | 10 ++++++++--
 mm/userfaultfd.c                 |  9 +++------
 4 files changed, 34 insertions(+), 26 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 8b4a94f5a238..fb45522a2b44 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1257,24 +1257,6 @@ static __always_inline int validate_range(struct mm_struct *mm,
 	return 0;
 }
 
-static inline bool vma_can_userfault(struct vm_area_struct *vma,
-				     unsigned long vm_flags)
-{
-	/* FIXME: add WP support to hugetlbfs and shmem */
-	if (vm_flags & VM_UFFD_WP) {
-		if (is_vm_hugetlb_page(vma) || vma_is_shmem(vma))
-			return false;
-	}
-
-	if (vm_flags & VM_UFFD_MINOR) {
-		if (!(is_vm_hugetlb_page(vma) || vma_is_shmem(vma)))
-			return false;
-	}
-
-	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
-	       vma_is_shmem(vma);
-}
-
 static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 				unsigned long arg)
 {
@@ -1955,6 +1937,9 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
 #endif
 #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 	uffdio_api.features &= ~UFFD_FEATURE_PAGEFAULT_FLAG_WP;
+#endif
+#ifndef CONFIG_PTE_MARKER_UFFD_WP
+	uffdio_api.features &= ~UFFD_FEATURE_WP_HUGETLBFS_SHMEM;
 #endif
 	uffdio_api.ioctls = UFFD_API_IOCTLS;
 	ret = -EFAULT;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 827e38b7be65..ea11bed9bb7e 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -18,6 +18,7 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 #include <asm-generic/pgtable_uffd.h>
+#include <linux/hugetlb_inline.h>
 
 /* The set of all possible UFFD-related VM flags. */
 #define __VM_UFFD_FLAGS (VM_UFFD_MISSING | VM_UFFD_WP | VM_UFFD_MINOR)
@@ -140,6 +141,25 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 	return vma->vm_flags & __VM_UFFD_FLAGS;
 }
 
+static inline bool vma_can_userfault(struct vm_area_struct *vma,
+				     unsigned long vm_flags)
+{
+	if (vm_flags & VM_UFFD_MINOR)
+		return is_vm_hugetlb_page(vma) || vma_is_shmem(vma);
+
+#ifndef CONFIG_PTE_MARKER_UFFD_WP
+	/*
+	 * If the user requested uffd-wp but pte markers are not
+	 * enabled for uffd-wp, then only anonymous memory is
+	 * supported, not shmem or hugetlbfs.
+	 */
+	if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
+		return false;
+#endif
+	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
+	    vma_is_shmem(vma);
+}
+
 extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *);
 extern void dup_userfaultfd_complete(struct list_head *);
 
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index ef739054cb1c..7d32b1e797fb 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -33,7 +33,8 @@
 			   UFFD_FEATURE_THREAD_ID |		\
 			   UFFD_FEATURE_MINOR_HUGETLBFS |	\
 			   UFFD_FEATURE_MINOR_SHMEM |		\
-			   UFFD_FEATURE_EXACT_ADDRESS)
+			   UFFD_FEATURE_EXACT_ADDRESS |		\
+			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -47,7 +48,8 @@
 #define UFFD_API_RANGE_IOCTLS_BASIC		\
 	((__u64)1 << _UFFDIO_WAKE |		\
 	 (__u64)1 << _UFFDIO_COPY |		\
-	 (__u64)1 << _UFFDIO_CONTINUE)
+	 (__u64)1 << _UFFDIO_CONTINUE |		\
+	 (__u64)1 << _UFFDIO_WRITEPROTECT)
 
 /*
  * Valid ioctl command number range with this API is from 0x00 to
@@ -194,6 +196,9 @@ struct uffdio_api {
 	 * UFFD_FEATURE_EXACT_ADDRESS indicates that the exact address of page
 	 * faults would be provided and the offset within the page would not be
 	 * masked.
+	 *
+	 * UFFD_FEATURE_WP_HUGETLBFS_SHMEM indicates that userfaultfd
+	 * write-protection mode is supported on both shmem and hugetlbfs.
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
@@ -207,6 +212,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_MINOR_HUGETLBFS		(1<<9)
 #define UFFD_FEATURE_MINOR_SHMEM		(1<<10)
 #define UFFD_FEATURE_EXACT_ADDRESS		(1<<11)
+#define UFFD_FEATURE_WP_HUGETLBFS_SHMEM		(1<<12)
 	__u64 features;
 
 	__u64 ioctls;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 58d67f2bf980..156e9bdf9f23 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -730,15 +730,12 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 
 	err = -ENOENT;
 	dst_vma = find_dst_vma(dst_mm, start, len);
-	/*
-	 * Make sure the vma is not shared, that the dst range is
-	 * both valid and fully within a single existing vma.
-	 */
-	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
+
+	if (!dst_vma)
 		goto out_unlock;
 	if (!userfaultfd_wp(dst_vma))
 		goto out_unlock;
-	if (!vma_is_anonymous(dst_vma))
+	if (!vma_can_userfault(dst_vma, dst_vma->vm_flags))
 		goto out_unlock;
 
 	if (is_vm_hugetlb_page(dst_vma)) {
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v8 22/23] mm: Enable PTE markers by default
  2022-04-05  1:46 [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (20 preceding siblings ...)
  2022-04-05  1:49 ` [PATCH v8 21/23] mm/uffd: Enable write protection for shmem & hugetlbfs Peter Xu
@ 2022-04-05  1:49 ` Peter Xu
  2022-04-19 15:13   ` Johannes Weiner
  2022-04-05  1:49 ` [PATCH v8 23/23] selftests/uffd: Enable uffd-wp for shmem/hugetlbfs Peter Xu
                   ` (2 subsequent siblings)
  24 siblings, 1 reply; 63+ messages in thread
From: Peter Xu @ 2022-04-05  1:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Mike Kravetz, Nadav Amit, Matthew Wilcox, Mike Rapoport,
	David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple, peterx

Enable PTE markers by default.  On x86_64 this means PTE_MARKER_UFFD_WP will
be auto-enabled as well.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/Kconfig | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 6e7c2d59fa96..3eca34c864c5 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -911,12 +911,14 @@ config ANON_VMA_NAME
 
 config PTE_MARKER
 	bool "Marker PTEs support"
+	default y
 
 	help
 	  Allows to create marker PTEs for file-backed memory.
 
 config PTE_MARKER_UFFD_WP
 	bool "Marker PTEs support for userfaultfd write protection"
+	default y
 	depends on PTE_MARKER && HAVE_ARCH_USERFAULTFD_WP
 
 	help
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v8 23/23] selftests/uffd: Enable uffd-wp for shmem/hugetlbfs
  2022-04-05  1:46 [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (21 preceding siblings ...)
  2022-04-05  1:49 ` [PATCH v8 22/23] mm: Enable PTE markers by default Peter Xu
@ 2022-04-05  1:49 ` Peter Xu
  2022-04-05 22:16 ` [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs Andrew Morton
  2022-05-10 19:05 ` Andrew Morton
  24 siblings, 0 replies; 63+ messages in thread
From: Peter Xu @ 2022-04-05  1:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Mike Kravetz, Nadav Amit, Matthew Wilcox, Mike Rapoport,
	David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple, peterx

Now that we have added uffd-wp support for shmem and hugetlbfs, the
write-protect test can always be turned on.
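
(For reference, assuming the usual vm selftest invocation of
"./userfaultfd <test type> <MiB> <bounces>", the write-protect cases should
then run as part of e.g. "./userfaultfd shmem 20 16" with no extra flags.)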

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/vm/userfaultfd.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 92a4516f8f0d..bbc4a6d8cf7b 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -82,7 +82,7 @@ static int test_type;
 static volatile bool test_uffdio_copy_eexist = true;
 static volatile bool test_uffdio_zeropage_eexist = true;
 /* Whether to test uffd write-protection */
-static bool test_uffdio_wp = false;
+static bool test_uffdio_wp = true;
 /* Whether to test uffd minor faults */
 static bool test_uffdio_minor = false;
 
@@ -1594,8 +1594,6 @@ static void set_test_type(const char *type)
 	if (!strcmp(type, "anon")) {
 		test_type = TEST_ANON;
 		uffd_test_ops = &anon_uffd_test_ops;
-		/* Only enable write-protect test for anonymous test */
-		test_uffdio_wp = true;
 	} else if (!strcmp(type, "hugetlb")) {
 		test_type = TEST_HUGETLB;
 		uffd_test_ops = &hugetlb_uffd_test_ops;
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs
  2022-04-05  1:46 [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (22 preceding siblings ...)
  2022-04-05  1:49 ` [PATCH v8 23/23] selftests/uffd: Enable uffd-wp for shmem/hugetlbfs Peter Xu
@ 2022-04-05 22:16 ` Andrew Morton
  2022-04-05 22:42   ` Peter Xu
  2022-05-10 19:05 ` Andrew Morton
  24 siblings, 1 reply; 63+ messages in thread
From: Andrew Morton @ 2022-04-05 22:16 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Mike Kravetz, David Hildenbrand,
	Matthew Wilcox, Alistair Popple, Nadav Amit, Axel Rasmussen,
	Andrea Arcangeli, Kirill A . Shutemov, Hugh Dickins,
	Jerome Glisse, Mike Rapoport

On Mon,  4 Apr 2022 21:46:23 -0400 Peter Xu <peterx@redhat.com> wrote:

> This is v8 of the series to add shmem+hugetlbfs support for userfaultfd
> write protection. 

Various compilation catastrophes with x86_64 allnoconfig.  I poked at
the include ordering for a while but other things quickly became more
attractive ;)


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs
  2022-04-05 22:16 ` [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs Andrew Morton
@ 2022-04-05 22:42   ` Peter Xu
  2022-04-05 22:49     ` Andrew Morton
  0 siblings, 1 reply; 63+ messages in thread
From: Peter Xu @ 2022-04-05 22:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Mike Kravetz, David Hildenbrand,
	Matthew Wilcox, Alistair Popple, Nadav Amit, Axel Rasmussen,
	Andrea Arcangeli, Kirill A . Shutemov, Hugh Dickins,
	Jerome Glisse, Mike Rapoport

On Tue, Apr 05, 2022 at 03:16:16PM -0700, Andrew Morton wrote:
> On Mon,  4 Apr 2022 21:46:23 -0400 Peter Xu <peterx@redhat.com> wrote:
> 
> > This is v8 of the series to add shmem+hugetlbfs support for userfaultfd
> > write protection. 
> 
> Various compilation catastrophes with x86_64 allnoconfig.  I poked at
> the include ordering for a while but other things quickly became more
> attractive ;)

Sorry about that.  I still don't know what the problem is, but I'll give it
a shot soon.

I think I only tried the new configs but not all the remaining ones.
I thought there was a bot looking after that, from which I used to
receive build reports, and IIRC I fixed some build issues in earlier
versions from those reports.  Maybe I was wrong..

Any more hints on the latter?

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs
  2022-04-05 22:42   ` Peter Xu
@ 2022-04-05 22:49     ` Andrew Morton
  2022-04-05 23:02       ` Peter Xu
  0 siblings, 1 reply; 63+ messages in thread
From: Andrew Morton @ 2022-04-05 22:49 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Mike Kravetz, David Hildenbrand,
	Matthew Wilcox, Alistair Popple, Nadav Amit, Axel Rasmussen,
	Andrea Arcangeli, Kirill A . Shutemov, Hugh Dickins,
	Jerome Glisse, Mike Rapoport

On Tue, 5 Apr 2022 18:42:43 -0400 Peter Xu <peterx@redhat.com> wrote:

> On Tue, Apr 05, 2022 at 03:16:16PM -0700, Andrew Morton wrote:
> > On Mon,  4 Apr 2022 21:46:23 -0400 Peter Xu <peterx@redhat.com> wrote:
> > 
> > > This is v8 of the series to add shmem+hugetlbfs support for userfaultfd
> > > write protection. 
> > 
> > Various compilation catastrophes with x86_64 allnoconfig.  I poked at
> > the include ordering for a while but other things quickly became more
> > attractive ;)
> 
> Sorry about that.  I still don't know what the problem is, but I'll give it
> a shot soon.
> 
> I think I only tried the new configs but not all the remaining ones.
> I thought there was a bot looking after that, from which I used to
> receive build reports, and IIRC I fixed some build issues in earlier
> versions from those reports.  Maybe I was wrong..
> 
> Any more hints on the latter?

`make allnoconfig'?


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs
  2022-04-05 22:49     ` Andrew Morton
@ 2022-04-05 23:02       ` Peter Xu
  2022-04-05 23:08         ` Andrew Morton
  0 siblings, 1 reply; 63+ messages in thread
From: Peter Xu @ 2022-04-05 23:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Mike Kravetz, David Hildenbrand,
	Matthew Wilcox, Alistair Popple, Nadav Amit, Axel Rasmussen,
	Andrea Arcangeli, Kirill A . Shutemov, Hugh Dickins,
	Jerome Glisse, Mike Rapoport

[-- Attachment #1: Type: text/plain, Size: 1339 bytes --]

On Tue, Apr 05, 2022 at 03:49:12PM -0700, Andrew Morton wrote:
> On Tue, 5 Apr 2022 18:42:43 -0400 Peter Xu <peterx@redhat.com> wrote:
> 
> > On Tue, Apr 05, 2022 at 03:16:16PM -0700, Andrew Morton wrote:
> > > On Mon,  4 Apr 2022 21:46:23 -0400 Peter Xu <peterx@redhat.com> wrote:
> > > 
> > > > This is v8 of the series to add shmem+hugetlbfs support for userfaultfd
> > > > write protection. 
> > > 
> > > Various compilation catastrophes with x86_64 allnoconfig.  I poked at
> > > the include ordering for a while but other things quickly became more
> > > attractive ;)
> > 
> > Sorry about that.  I still don't know what the problem is, but I'll give it
> > a shot soon.
> > 
> > I think I only tried the new configs but not all the remaining ones.
> > I thought there was a bot looking after that, from which I used to
> > receive build reports, and IIRC I fixed some build issues in earlier
> > versions from those reports.  Maybe I was wrong..
> > 
> > Any more hints on the latter?
> 
> `make allnoconfig'?
> 

Ah!  I thought that by "other things" you meant there were other, more
severe issues... :)

For the allnoconfig, could you try the attached quick fixup (on top of
patch "mm/uffd: PTE_MARKER_UFFD_WP")?

That works for me on x86/arm, but I'm still trying out some other configs.

Thanks,

-- 
Peter Xu

[-- Attachment #2: 0001-fixup-mm-uffd-PTE_MARKER_UFFD_WP.patch --]
[-- Type: text/plain, Size: 891 bytes --]

From 83dbc502eeaba12071413691374474c09e9f24bb Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Tue, 5 Apr 2022 19:00:37 -0400
Subject: [PATCH] fixup! mm/uffd: PTE_MARKER_UFFD_WP

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/userfaultfd_k.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index ea11bed9bb7e..732b522bacb7 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -277,8 +277,12 @@ static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
 
 static inline bool pte_marker_entry_uffd_wp(swp_entry_t entry)
 {
+#ifdef CONFIG_PTE_MARKER_UFFD_WP
 	return is_pte_marker_entry(entry) &&
 	    (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
+#else
+	return false;
+#endif
 }
 
 static inline bool pte_marker_uffd_wp(pte_t pte)
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs
  2022-04-05 23:02       ` Peter Xu
@ 2022-04-05 23:08         ` Andrew Morton
  0 siblings, 0 replies; 63+ messages in thread
From: Andrew Morton @ 2022-04-05 23:08 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Mike Kravetz, David Hildenbrand,
	Matthew Wilcox, Alistair Popple, Nadav Amit, Axel Rasmussen,
	Andrea Arcangeli, Kirill A . Shutemov, Hugh Dickins,
	Jerome Glisse, Mike Rapoport

On Tue, 5 Apr 2022 19:02:32 -0400 Peter Xu <peterx@redhat.com> wrote:

> For the allnoconfig, could you try with the attached quick fixup (upon
> patch "mm/uffd: PTE_MARKER_UFFD_WP")?

Works for me, thanks.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 04/23] mm/uffd: PTE_MARKER_UFFD_WP
  2022-04-05  1:48 ` [PATCH v8 04/23] mm/uffd: PTE_MARKER_UFFD_WP Peter Xu
@ 2022-04-06  1:41   ` kernel test robot
  0 siblings, 0 replies; 63+ messages in thread
From: kernel test robot @ 2022-04-06  1:41 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm
  Cc: kbuild-all, Mike Kravetz, Nadav Amit, Matthew Wilcox,
	Mike Rapoport, David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Linux Memory Management List, Axel Rasmussen, Alistair Popple,
	peterx

Hi Peter,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on hnaz-mm/master]
[also build test ERROR on arnd-asm-generic/master linus/master v5.18-rc1 next-20220405]
[cannot apply to linux/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Peter-Xu/userfaultfd-wp-Support-shmem-and-hugetlbfs/20220405-100136
base:   https://github.com/hnaz/linux-mm master
config: ia64-buildonly-randconfig-r005-20220405 (https://download.01.org/0day-ci/archive/20220406/202204060938.eKyGvcev-lkp@intel.com/config)
compiler: ia64-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/5baea0f03d347e5b13fff03af297858f1247d51a
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Peter-Xu/userfaultfd-wp-Support-shmem-and-hugetlbfs/20220405-100136
        git checkout 5baea0f03d347e5b13fff03af297858f1247d51a
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=ia64 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from arch/ia64/include/asm/pgtable.h:153,
                    from include/linux/pgtable.h:6,
                    from arch/ia64/include/asm/uaccess.h:40,
                    from include/linux/uaccess.h:11,
                    from include/linux/sched/task.h:11,
                    from include/linux/sched/signal.h:9,
                    from include/linux/rcuwait.h:6,
                    from include/linux/percpu-rwsem.h:7,
                    from include/linux/fs.h:33,
                    from arch/ia64/kernel/sys_ia64.c:10:
   arch/ia64/include/asm/mmu_context.h: In function 'reload_context':
   arch/ia64/include/asm/mmu_context.h:127:48: warning: variable 'old_rr4' set but not used [-Wunused-but-set-variable]
     127 |         unsigned long rr0, rr1, rr2, rr3, rr4, old_rr4;
         |                                                ^~~~~~~
   In file included from include/linux/hugetlb.h:14,
                    from arch/ia64/kernel/sys_ia64.c:21:
   include/linux/userfaultfd_k.h: In function 'pte_marker_entry_uffd_wp':
>> include/linux/userfaultfd_k.h:243:16: error: implicit declaration of function 'is_pte_marker_entry' [-Werror=implicit-function-declaration]
     243 |         return is_pte_marker_entry(entry) &&
         |                ^~~~~~~~~~~~~~~~~~~
>> include/linux/userfaultfd_k.h:244:14: error: implicit declaration of function 'pte_marker_get' [-Werror=implicit-function-declaration]
     244 |             (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
         |              ^~~~~~~~~~~~~~
>> include/linux/userfaultfd_k.h:244:38: error: 'PTE_MARKER_UFFD_WP' undeclared (first use in this function)
     244 |             (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
         |                                      ^~~~~~~~~~~~~~~~~~
   include/linux/userfaultfd_k.h:244:38: note: each undeclared identifier is reported only once for each function it appears in
   arch/ia64/kernel/sys_ia64.c: At top level:
   arch/ia64/kernel/sys_ia64.c:71:1: warning: no previous prototype for 'ia64_getpriority' [-Wmissing-prototypes]
      71 | ia64_getpriority (int which, int who)
         | ^~~~~~~~~~~~~~~~
   arch/ia64/kernel/sys_ia64.c:85:1: warning: no previous prototype for 'sys_getpagesize' [-Wmissing-prototypes]
      85 | sys_getpagesize (void)
         | ^~~~~~~~~~~~~~~
   arch/ia64/kernel/sys_ia64.c:91:1: warning: no previous prototype for 'ia64_brk' [-Wmissing-prototypes]
      91 | ia64_brk (unsigned long brk)
         | ^~~~~~~~
   arch/ia64/kernel/sys_ia64.c:161:1: warning: no previous prototype for 'ia64_mremap' [-Wmissing-prototypes]
     161 | ia64_mremap (unsigned long addr, unsigned long old_len, unsigned long new_len, unsigned long flags,
         | ^~~~~~~~~~~
   cc1: some warnings being treated as errors
--
   In file included from arch/ia64/include/asm/pgtable.h:153,
                    from include/linux/pgtable.h:6,
                    from arch/ia64/include/asm/uaccess.h:40,
                    from include/linux/uaccess.h:11,
                    from include/linux/sched/task.h:11,
                    from kernel/fork.c:23:
   arch/ia64/include/asm/mmu_context.h: In function 'reload_context':
   arch/ia64/include/asm/mmu_context.h:127:48: warning: variable 'old_rr4' set but not used [-Wunused-but-set-variable]
     127 |         unsigned long rr0, rr1, rr2, rr3, rr4, old_rr4;
         |                                                ^~~~~~~
   In file included from include/linux/hugetlb.h:14,
                    from kernel/fork.c:52:
   include/linux/userfaultfd_k.h: In function 'pte_marker_entry_uffd_wp':
>> include/linux/userfaultfd_k.h:243:16: error: implicit declaration of function 'is_pte_marker_entry' [-Werror=implicit-function-declaration]
     243 |         return is_pte_marker_entry(entry) &&
         |                ^~~~~~~~~~~~~~~~~~~
>> include/linux/userfaultfd_k.h:244:14: error: implicit declaration of function 'pte_marker_get' [-Werror=implicit-function-declaration]
     244 |             (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
         |              ^~~~~~~~~~~~~~
>> include/linux/userfaultfd_k.h:244:38: error: 'PTE_MARKER_UFFD_WP' undeclared (first use in this function)
     244 |             (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
         |                                      ^~~~~~~~~~~~~~~~~~
   include/linux/userfaultfd_k.h:244:38: note: each undeclared identifier is reported only once for each function it appears in
   kernel/fork.c: At top level:
   kernel/fork.c:163:13: warning: no previous prototype for 'arch_release_task_struct' [-Wmissing-prototypes]
     163 | void __weak arch_release_task_struct(struct task_struct *tsk)
         |             ^~~~~~~~~~~~~~~~~~~~~~~~
   kernel/fork.c:853:20: warning: no previous prototype for 'arch_task_cache_init' [-Wmissing-prototypes]
     853 | void __init __weak arch_task_cache_init(void) { }
         |                    ^~~~~~~~~~~~~~~~~~~~
   kernel/fork.c:948:12: warning: no previous prototype for 'arch_dup_task_struct' [-Wmissing-prototypes]
     948 | int __weak arch_dup_task_struct(struct task_struct *dst,
         |            ^~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors
--
   In file included from arch/ia64/include/asm/pgtable.h:153,
                    from include/linux/pgtable.h:6,
                    from include/linux/mm.h:29,
                    from kernel/sysctl.c:23:
   arch/ia64/include/asm/mmu_context.h: In function 'reload_context':
   arch/ia64/include/asm/mmu_context.h:127:48: warning: variable 'old_rr4' set but not used [-Wunused-but-set-variable]
     127 |         unsigned long rr0, rr1, rr2, rr3, rr4, old_rr4;
         |                                                ^~~~~~~
   In file included from include/linux/hugetlb.h:14,
                    from kernel/sysctl.c:46:
   include/linux/userfaultfd_k.h: In function 'pte_marker_entry_uffd_wp':
>> include/linux/userfaultfd_k.h:243:16: error: implicit declaration of function 'is_pte_marker_entry' [-Werror=implicit-function-declaration]
     243 |         return is_pte_marker_entry(entry) &&
         |                ^~~~~~~~~~~~~~~~~~~
>> include/linux/userfaultfd_k.h:244:14: error: implicit declaration of function 'pte_marker_get' [-Werror=implicit-function-declaration]
     244 |             (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
         |              ^~~~~~~~~~~~~~
>> include/linux/userfaultfd_k.h:244:38: error: 'PTE_MARKER_UFFD_WP' undeclared (first use in this function)
     244 |             (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
         |                                      ^~~~~~~~~~~~~~~~~~
   include/linux/userfaultfd_k.h:244:38: note: each undeclared identifier is reported only once for each function it appears in
   cc1: some warnings being treated as errors
--
   In file included from arch/ia64/include/asm/pgtable.h:153,
                    from include/linux/pgtable.h:6,
                    from include/linux/mm.h:29,
                    from mm/vmscan.c:15:
   arch/ia64/include/asm/mmu_context.h: In function 'reload_context':
   arch/ia64/include/asm/mmu_context.h:127:48: warning: variable 'old_rr4' set but not used [-Wunused-but-set-variable]
     127 |         unsigned long rr0, rr1, rr2, rr3, rr4, old_rr4;
         |                                                ^~~~~~~
   In file included from include/linux/hugetlb.h:14,
                    from include/linux/migrate.h:8,
                    from mm/vmscan.c:44:
   include/linux/userfaultfd_k.h: In function 'pte_marker_entry_uffd_wp':
>> include/linux/userfaultfd_k.h:243:16: error: implicit declaration of function 'is_pte_marker_entry' [-Werror=implicit-function-declaration]
     243 |         return is_pte_marker_entry(entry) &&
         |                ^~~~~~~~~~~~~~~~~~~
>> include/linux/userfaultfd_k.h:244:14: error: implicit declaration of function 'pte_marker_get' [-Werror=implicit-function-declaration]
     244 |             (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
         |              ^~~~~~~~~~~~~~
>> include/linux/userfaultfd_k.h:244:38: error: 'PTE_MARKER_UFFD_WP' undeclared (first use in this function)
     244 |             (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
         |                                      ^~~~~~~~~~~~~~~~~~
   include/linux/userfaultfd_k.h:244:38: note: each undeclared identifier is reported only once for each function it appears in
   In file included from mm/vmscan.c:57:
   include/linux/swapops.h: At top level:
>> include/linux/swapops.h:289:20: error: conflicting types for 'is_pte_marker_entry'; have 'bool(swp_entry_t)' {aka '_Bool(swp_entry_t)'}
     289 | static inline bool is_pte_marker_entry(swp_entry_t entry)
         |                    ^~~~~~~~~~~~~~~~~~~
   In file included from include/linux/hugetlb.h:14,
                    from include/linux/migrate.h:8,
                    from mm/vmscan.c:44:
   include/linux/userfaultfd_k.h:243:16: note: previous implicit declaration of 'is_pte_marker_entry' with type 'int()'
     243 |         return is_pte_marker_entry(entry) &&
         |                ^~~~~~~~~~~~~~~~~~~
   In file included from mm/vmscan.c:57:
>> include/linux/swapops.h:294:26: error: conflicting types for 'pte_marker_get'; have 'pte_marker(swp_entry_t)' {aka 'long unsigned int(swp_entry_t)'}
     294 | static inline pte_marker pte_marker_get(swp_entry_t entry)
         |                          ^~~~~~~~~~~~~~
   In file included from include/linux/hugetlb.h:14,
                    from include/linux/migrate.h:8,
                    from mm/vmscan.c:44:
   include/linux/userfaultfd_k.h:244:14: note: previous implicit declaration of 'pte_marker_get' with type 'int()'
     244 |             (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
         |              ^~~~~~~~~~~~~~
   cc1: some warnings being treated as errors
--
   In file included from arch/ia64/include/asm/pgtable.h:153,
                    from include/linux/pgtable.h:6,
                    from arch/ia64/include/asm/uaccess.h:40,
                    from include/linux/uaccess.h:11,
                    from include/linux/sched/task.h:11,
                    from include/linux/sched/signal.h:9,
                    from include/linux/rcuwait.h:6,
                    from include/linux/percpu-rwsem.h:7,
                    from include/linux/fs.h:33,
                    from fs/proc/meminfo.c:2:
   arch/ia64/include/asm/mmu_context.h: In function 'reload_context':
   arch/ia64/include/asm/mmu_context.h:127:48: warning: variable 'old_rr4' set but not used [-Wunused-but-set-variable]
     127 |         unsigned long rr0, rr1, rr2, rr3, rr4, old_rr4;
         |                                                ^~~~~~~
   In file included from include/linux/hugetlb.h:14,
                    from fs/proc/meminfo.c:6:
   include/linux/userfaultfd_k.h: In function 'pte_marker_entry_uffd_wp':
>> include/linux/userfaultfd_k.h:243:16: error: implicit declaration of function 'is_pte_marker_entry' [-Werror=implicit-function-declaration]
     243 |         return is_pte_marker_entry(entry) &&
         |                ^~~~~~~~~~~~~~~~~~~
>> include/linux/userfaultfd_k.h:244:14: error: implicit declaration of function 'pte_marker_get' [-Werror=implicit-function-declaration]
     244 |             (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
         |              ^~~~~~~~~~~~~~
>> include/linux/userfaultfd_k.h:244:38: error: 'PTE_MARKER_UFFD_WP' undeclared (first use in this function)
     244 |             (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
         |                                      ^~~~~~~~~~~~~~~~~~
   include/linux/userfaultfd_k.h:244:38: note: each undeclared identifier is reported only once for each function it appears in
   fs/proc/meminfo.c: At top level:
   fs/proc/meminfo.c:22:28: warning: no previous prototype for 'arch_report_meminfo' [-Wmissing-prototypes]
      22 | void __attribute__((weak)) arch_report_meminfo(struct seq_file *m)
         |                            ^~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors


vim +/is_pte_marker_entry +243 include/linux/userfaultfd_k.h

   240	
   241	static inline bool pte_marker_entry_uffd_wp(swp_entry_t entry)
   242	{
 > 243		return is_pte_marker_entry(entry) &&
 > 244		    (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
   245	}
   246	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 10/23] mm/shmem: Handle uffd-wp during fork()
  2022-04-05  1:48 ` [PATCH v8 10/23] mm/shmem: Handle uffd-wp during fork() Peter Xu
@ 2022-04-06  6:16   ` kernel test robot
  2022-04-06 12:18     ` Peter Xu
  0 siblings, 1 reply; 63+ messages in thread
From: kernel test robot @ 2022-04-06  6:16 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm
  Cc: kbuild-all, Mike Kravetz, Nadav Amit, Matthew Wilcox,
	Mike Rapoport, David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Linux Memory Management List, Axel Rasmussen, Alistair Popple,
	peterx

Hi Peter,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on hnaz-mm/master]
[cannot apply to arnd-asm-generic/master linus/master linux/master v5.18-rc1 next-20220405]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Peter-Xu/userfaultfd-wp-Support-shmem-and-hugetlbfs/20220405-100136
base:   https://github.com/hnaz/linux-mm master
config: ia64-buildonly-randconfig-r005-20220405 (https://download.01.org/0day-ci/archive/20220406/202204061453.OXOxSh6e-lkp@intel.com/config)
compiler: ia64-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/49e56558a1f453907d2813e1ba94d91f9d102e73
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Peter-Xu/userfaultfd-wp-Support-shmem-and-hugetlbfs/20220405-100136
        git checkout 49e56558a1f453907d2813e1ba94d91f9d102e73
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=ia64 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   In file included from arch/ia64/include/asm/pgtable.h:153,
                    from include/linux/pgtable.h:6,
                    from arch/ia64/include/asm/uaccess.h:40,
                    from include/linux/uaccess.h:11,
                    from arch/ia64/include/asm/sections.h:11,
                    from include/linux/interrupt.h:21,
                    from include/linux/kernel_stat.h:9,
                    from mm/memory.c:42:
   arch/ia64/include/asm/mmu_context.h: In function 'reload_context':
   arch/ia64/include/asm/mmu_context.h:127:48: warning: variable 'old_rr4' set but not used [-Wunused-but-set-variable]
     127 |         unsigned long rr0, rr1, rr2, rr3, rr4, old_rr4;
         |                                                ^~~~~~~
   In file included from include/linux/mm_inline.h:9,
                    from mm/memory.c:44:
   include/linux/userfaultfd_k.h: In function 'pte_marker_entry_uffd_wp':
   include/linux/userfaultfd_k.h:260:16: error: implicit declaration of function 'is_pte_marker_entry' [-Werror=implicit-function-declaration]
     260 |         return is_pte_marker_entry(entry) &&
         |                ^~~~~~~~~~~~~~~~~~~
   include/linux/userfaultfd_k.h:261:14: error: implicit declaration of function 'pte_marker_get' [-Werror=implicit-function-declaration]
     261 |             (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
         |              ^~~~~~~~~~~~~~
   include/linux/userfaultfd_k.h:261:38: error: 'PTE_MARKER_UFFD_WP' undeclared (first use in this function)
     261 |             (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
         |                                      ^~~~~~~~~~~~~~~~~~
   include/linux/userfaultfd_k.h:261:38: note: each undeclared identifier is reported only once for each function it appears in
   In file included from include/linux/mm_inline.h:10,
                    from mm/memory.c:44:
   include/linux/swapops.h: At top level:
   include/linux/swapops.h:289:20: error: conflicting types for 'is_pte_marker_entry'; have 'bool(swp_entry_t)' {aka '_Bool(swp_entry_t)'}
     289 | static inline bool is_pte_marker_entry(swp_entry_t entry)
         |                    ^~~~~~~~~~~~~~~~~~~
   In file included from include/linux/mm_inline.h:9,
                    from mm/memory.c:44:
   include/linux/userfaultfd_k.h:260:16: note: previous implicit declaration of 'is_pte_marker_entry' with type 'int()'
     260 |         return is_pte_marker_entry(entry) &&
         |                ^~~~~~~~~~~~~~~~~~~
   In file included from include/linux/mm_inline.h:10,
                    from mm/memory.c:44:
   include/linux/swapops.h:294:26: error: conflicting types for 'pte_marker_get'; have 'pte_marker(swp_entry_t)' {aka 'long unsigned int(swp_entry_t)'}
     294 | static inline pte_marker pte_marker_get(swp_entry_t entry)
         |                          ^~~~~~~~~~~~~~
   In file included from include/linux/mm_inline.h:9,
                    from mm/memory.c:44:
   include/linux/userfaultfd_k.h:261:14: note: previous implicit declaration of 'pte_marker_get' with type 'int()'
     261 |             (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
         |              ^~~~~~~~~~~~~~
>> mm/memory.c:1238:1: warning: no previous prototype for 'vma_needs_copy' [-Wmissing-prototypes]
    1238 | vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
         | ^~~~~~~~~~~~~~
   In file included from include/linux/mm_inline.h:9,
                    from mm/memory.c:44:
   include/linux/userfaultfd_k.h: In function 'pte_marker_entry_uffd_wp':
   include/linux/userfaultfd_k.h:262:1: error: control reaches end of non-void function [-Werror=return-type]
     262 | }
         | ^
   cc1: some warnings being treated as errors


vim +/vma_needs_copy +1238 mm/memory.c

  1231	
  1232	/*
  1233	 * Return true if the vma needs to copy the pgtable during this fork().  Return
  1234	 * false when we can speed up fork() by allowing lazy page faults later until
  1235	 * when the child accesses the memory range.
  1236	 */
  1237	bool
> 1238	vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
  1239	{
  1240		/*
  1241		 * Always copy pgtables when dst_vma has uffd-wp enabled even if it's
  1242		 * file-backed (e.g. shmem). Because when uffd-wp is enabled, pgtable
  1243		 * contains uffd-wp protection information, that's something we can't
  1244		 * retrieve from page cache, and skip copying will lose those info.
  1245		 */
  1246		if (userfaultfd_wp(dst_vma))
  1247			return true;
  1248	
  1249		if (src_vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP))
  1250			return true;
  1251	
  1252		if (src_vma->anon_vma)
  1253			return true;
  1254	
  1255		/*
  1256		 * Don't copy ptes where a page fault will fill them correctly.  Fork
  1257		 * becomes much lighter when there are big shared or private readonly
  1258		 * mappings. The tradeoff is that copy_page_range is more efficient
  1259		 * than faulting.
  1260		 */
  1261		return false;
  1262	}
  1263	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 10/23] mm/shmem: Handle uffd-wp during fork()
  2022-04-06  6:16   ` kernel test robot
@ 2022-04-06 12:18     ` Peter Xu
  0 siblings, 0 replies; 63+ messages in thread
From: Peter Xu @ 2022-04-06 12:18 UTC (permalink / raw)
  To: kernel test robot
  Cc: linux-kernel, linux-mm, kbuild-all, Mike Kravetz, Nadav Amit,
	Matthew Wilcox, Mike Rapoport, David Hildenbrand, Hugh Dickins,
	Jerome Glisse, Kirill A . Shutemov, Andrea Arcangeli,
	Andrew Morton, Axel Rasmussen, Alistair Popple

I assume the report below is the same issue as the allnoconfig one Andrew
reported; IOW, with the fixup squashed in, this report along with the other
one in patch 4 should go away.  Let me know otherwise..

Thanks,

On Wed, Apr 06, 2022 at 02:16:56PM +0800, kernel test robot wrote:
> All warnings (new ones prefixed by >>):
> 
>    In file included from arch/ia64/include/asm/pgtable.h:153,
>                     from include/linux/pgtable.h:6,
>                     from arch/ia64/include/asm/uaccess.h:40,
>                     from include/linux/uaccess.h:11,
>                     from arch/ia64/include/asm/sections.h:11,
>                     from include/linux/interrupt.h:21,
>                     from include/linux/kernel_stat.h:9,
>                     from mm/memory.c:42:
>    arch/ia64/include/asm/mmu_context.h: In function 'reload_context':
>    arch/ia64/include/asm/mmu_context.h:127:48: warning: variable 'old_rr4' set but not used [-Wunused-but-set-variable]
>      127 |         unsigned long rr0, rr1, rr2, rr3, rr4, old_rr4;
>          |                                                ^~~~~~~
>    In file included from include/linux/mm_inline.h:9,
>                     from mm/memory.c:44:
>    include/linux/userfaultfd_k.h: In function 'pte_marker_entry_uffd_wp':
>    include/linux/userfaultfd_k.h:260:16: error: implicit declaration of function 'is_pte_marker_entry' [-Werror=implicit-function-declaration]
>      260 |         return is_pte_marker_entry(entry) &&
>          |                ^~~~~~~~~~~~~~~~~~~
>    include/linux/userfaultfd_k.h:261:14: error: implicit declaration of function 'pte_marker_get' [-Werror=implicit-function-declaration]
>      261 |             (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
>          |              ^~~~~~~~~~~~~~
>    include/linux/userfaultfd_k.h:261:38: error: 'PTE_MARKER_UFFD_WP' undeclared (first use in this function)
>      261 |             (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
>          |                                      ^~~~~~~~~~~~~~~~~~
>    include/linux/userfaultfd_k.h:261:38: note: each undeclared identifier is reported only once for each function it appears in
>    In file included from include/linux/mm_inline.h:10,
>                     from mm/memory.c:44:
>    include/linux/swapops.h: At top level:
>    include/linux/swapops.h:289:20: error: conflicting types for 'is_pte_marker_entry'; have 'bool(swp_entry_t)' {aka '_Bool(swp_entry_t)'}
>      289 | static inline bool is_pte_marker_entry(swp_entry_t entry)
>          |                    ^~~~~~~~~~~~~~~~~~~
>    In file included from include/linux/mm_inline.h:9,
>                     from mm/memory.c:44:
>    include/linux/userfaultfd_k.h:260:16: note: previous implicit declaration of 'is_pte_marker_entry' with type 'int()'
>      260 |         return is_pte_marker_entry(entry) &&
>          |                ^~~~~~~~~~~~~~~~~~~
>    In file included from include/linux/mm_inline.h:10,
>                     from mm/memory.c:44:
>    include/linux/swapops.h:294:26: error: conflicting types for 'pte_marker_get'; have 'pte_marker(swp_entry_t)' {aka 'long unsigned int(swp_entry_t)'}
>      294 | static inline pte_marker pte_marker_get(swp_entry_t entry)
>          |                          ^~~~~~~~~~~~~~
>    In file included from include/linux/mm_inline.h:9,
>                     from mm/memory.c:44:
>    include/linux/userfaultfd_k.h:261:14: note: previous implicit declaration of 'pte_marker_get' with type 'int()'
>      261 |             (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
>          |              ^~~~~~~~~~~~~~

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 15/23] mm/hugetlb: Handle pte markers in page faults
  2022-04-05  1:49 ` [PATCH v8 15/23] mm/hugetlb: Handle pte markers in page faults Peter Xu
@ 2022-04-06 13:37   ` kernel test robot
  2022-04-06 15:02     ` Peter Xu
  0 siblings, 1 reply; 63+ messages in thread
From: kernel test robot @ 2022-04-06 13:37 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm
  Cc: kbuild-all, Mike Kravetz, Nadav Amit, Matthew Wilcox,
	Mike Rapoport, David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Linux Memory Management List, Axel Rasmussen, Alistair Popple,
	peterx

Hi Peter,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on hnaz-mm/master]
[cannot apply to arnd-asm-generic/master linus/master linux/master v5.18-rc1 next-20220406]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Peter-Xu/userfaultfd-wp-Support-shmem-and-hugetlbfs/20220405-100136
base:   https://github.com/hnaz/linux-mm master
config: s390-randconfig-r044-20220406 (https://download.01.org/0day-ci/archive/20220406/202204062154.2txNJyaf-lkp@intel.com/config)
compiler: s390-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/e7e7aaec811e2817cd169f0cc1d8f81bdf1f05c3
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Peter-Xu/userfaultfd-wp-Support-shmem-and-hugetlbfs/20220405-100136
        git checkout e7e7aaec811e2817cd169f0cc1d8f81bdf1f05c3
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=s390 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   mm/hugetlb.c: In function 'hugetlb_fault':
>> mm/hugetlb.c:5678:13: error: implicit declaration of function 'huge_pte_none_mostly'; did you mean 'pte_none_mostly'? [-Werror=implicit-function-declaration]
    5678 |         if (huge_pte_none_mostly(entry)) {
         |             ^~~~~~~~~~~~~~~~~~~~
         |             pte_none_mostly
   cc1: some warnings being treated as errors


vim +5678 mm/hugetlb.c

  5616	
  5617	vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
  5618				unsigned long address, unsigned int flags)
  5619	{
  5620		pte_t *ptep, entry;
  5621		spinlock_t *ptl;
  5622		vm_fault_t ret;
  5623		u32 hash;
  5624		pgoff_t idx;
  5625		struct page *page = NULL;
  5626		struct page *pagecache_page = NULL;
  5627		struct hstate *h = hstate_vma(vma);
  5628		struct address_space *mapping;
  5629		int need_wait_lock = 0;
  5630		unsigned long haddr = address & huge_page_mask(h);
  5631	
  5632		ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
  5633		if (ptep) {
  5634			/*
  5635			 * Since we hold no locks, ptep could be stale.  That is
  5636			 * OK as we are only making decisions based on content and
  5637			 * not actually modifying content here.
  5638			 */
  5639			entry = huge_ptep_get(ptep);
  5640			if (unlikely(is_hugetlb_entry_migration(entry))) {
  5641				migration_entry_wait_huge(vma, mm, ptep);
  5642				return 0;
  5643			} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
  5644				return VM_FAULT_HWPOISON_LARGE |
  5645					VM_FAULT_SET_HINDEX(hstate_index(h));
  5646		}
  5647	
  5648		/*
  5649		 * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
  5650		 * until finished with ptep.  This serves two purposes:
  5651		 * 1) It prevents huge_pmd_unshare from being called elsewhere
  5652		 *    and making the ptep no longer valid.
  5653		 * 2) It synchronizes us with i_size modifications during truncation.
  5654		 *
  5655		 * ptep could have already be assigned via huge_pte_offset.  That
  5656		 * is OK, as huge_pte_alloc will return the same value unless
  5657		 * something has changed.
  5658		 */
  5659		mapping = vma->vm_file->f_mapping;
  5660		i_mmap_lock_read(mapping);
  5661		ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
  5662		if (!ptep) {
  5663			i_mmap_unlock_read(mapping);
  5664			return VM_FAULT_OOM;
  5665		}
  5666	
  5667		/*
  5668		 * Serialize hugepage allocation and instantiation, so that we don't
  5669		 * get spurious allocation failures if two CPUs race to instantiate
  5670		 * the same page in the page cache.
  5671		 */
  5672		idx = vma_hugecache_offset(h, vma, haddr);
  5673		hash = hugetlb_fault_mutex_hash(mapping, idx);
  5674		mutex_lock(&hugetlb_fault_mutex_table[hash]);
  5675	
  5676		entry = huge_ptep_get(ptep);
  5677		/* PTE markers should be handled the same way as none pte */
> 5678		if (huge_pte_none_mostly(entry)) {
  5679			ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
  5680					      entry, flags);
  5681			goto out_mutex;
  5682		}
  5683	
  5684		ret = 0;
  5685	
  5686		/*
  5687		 * entry could be a migration/hwpoison entry at this point, so this
  5688		 * check prevents the kernel from going below assuming that we have
  5689		 * an active hugepage in pagecache. This goto expects the 2nd page
  5690		 * fault, and is_hugetlb_entry_(migration|hwpoisoned) check will
  5691		 * properly handle it.
  5692		 */
  5693		if (!pte_present(entry))
  5694			goto out_mutex;
  5695	
  5696		/*
  5697		 * If we are going to COW/unshare the mapping later, we examine the
  5698		 * pending reservations for this page now. This will ensure that any
  5699		 * allocations necessary to record that reservation occur outside the
  5700		 * spinlock. For private mappings, we also lookup the pagecache
  5701		 * page now as it is used to determine if a reservation has been
  5702		 * consumed.
  5703		 */
  5704		if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
  5705		    !huge_pte_write(entry)) {
  5706			if (vma_needs_reservation(h, vma, haddr) < 0) {
  5707				ret = VM_FAULT_OOM;
  5708				goto out_mutex;
  5709			}
  5710			/* Just decrements count, does not deallocate */
  5711			vma_end_reservation(h, vma, haddr);
  5712	
  5713			if (!(vma->vm_flags & VM_MAYSHARE))
  5714				pagecache_page = hugetlbfs_pagecache_page(h,
  5715									vma, haddr);
  5716		}
  5717	
  5718		ptl = huge_pte_lock(h, mm, ptep);
  5719	
  5720		/* Check for a racing update before calling hugetlb_wp() */
  5721		if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
  5722			goto out_ptl;
  5723	
  5724		/* Handle userfault-wp first, before trying to lock more pages */
  5725		if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(ptep)) &&
  5726		    (flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
  5727			struct vm_fault vmf = {
  5728				.vma = vma,
  5729				.address = haddr,
  5730				.real_address = address,
  5731				.flags = flags,
  5732			};
  5733	
  5734			spin_unlock(ptl);
  5735			if (pagecache_page) {
  5736				unlock_page(pagecache_page);
  5737				put_page(pagecache_page);
  5738			}
  5739			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
  5740			i_mmap_unlock_read(mapping);
  5741			return handle_userfault(&vmf, VM_UFFD_WP);
  5742		}
  5743	
  5744		/*
  5745		 * hugetlb_wp() requires page locks of pte_page(entry) and
  5746		 * pagecache_page, so here we need take the former one
  5747		 * when page != pagecache_page or !pagecache_page.
  5748		 */
  5749		page = pte_page(entry);
  5750		if (page != pagecache_page)
  5751			if (!trylock_page(page)) {
  5752				need_wait_lock = 1;
  5753				goto out_ptl;
  5754			}
  5755	
  5756		get_page(page);
  5757	
  5758		if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
  5759			if (!huge_pte_write(entry)) {
  5760				ret = hugetlb_wp(mm, vma, address, ptep, flags,
  5761						 pagecache_page, ptl);
  5762				goto out_put_page;
  5763			} else if (likely(flags & FAULT_FLAG_WRITE)) {
  5764				entry = huge_pte_mkdirty(entry);
  5765			}
  5766		}
  5767		entry = pte_mkyoung(entry);
  5768		if (huge_ptep_set_access_flags(vma, haddr, ptep, entry,
  5769							flags & FAULT_FLAG_WRITE))
  5770			update_mmu_cache(vma, haddr, ptep);
  5771	out_put_page:
  5772		if (page != pagecache_page)
  5773			unlock_page(page);
  5774		put_page(page);
  5775	out_ptl:
  5776		spin_unlock(ptl);
  5777	
  5778		if (pagecache_page) {
  5779			unlock_page(pagecache_page);
  5780			put_page(pagecache_page);
  5781		}
  5782	out_mutex:
  5783		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
  5784		i_mmap_unlock_read(mapping);
  5785		/*
  5786		 * Generally it's safe to hold refcount during waiting page lock. But
  5787		 * here we just wait to defer the next page fault to avoid busy loop and
  5788		 * the page is not used after unlocked before returning from the current
  5789		 * page fault. So we are safe from accessing freed page, even if we wait
  5790		 * here without taking refcount.
  5791		 */
  5792		if (need_wait_lock)
  5793			wait_on_page_locked(page);
  5794		return ret;
  5795	}
  5796	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 15/23] mm/hugetlb: Handle pte markers in page faults
  2022-04-06 13:37   ` kernel test robot
@ 2022-04-06 15:02     ` Peter Xu
  0 siblings, 0 replies; 63+ messages in thread
From: Peter Xu @ 2022-04-06 15:02 UTC (permalink / raw)
  To: kernel test robot
  Cc: linux-kernel, linux-mm, kbuild-all, Mike Kravetz, Nadav Amit,
	Matthew Wilcox, Mike Rapoport, David Hildenbrand, Hugh Dickins,
	Jerome Glisse, Kirill A . Shutemov, Andrea Arcangeli,
	Andrew Morton, Axel Rasmussen, Alistair Popple

[-- Attachment #1: Type: text/plain, Size: 2495 bytes --]

On Wed, Apr 06, 2022 at 09:37:00PM +0800, kernel test robot wrote:
> Hi Peter,
> 
> Thank you for the patch! Yet something to improve:
> 
> [auto build test ERROR on hnaz-mm/master]
> [cannot apply to arnd-asm-generic/master linus/master linux/master v5.18-rc1 next-20220406]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Peter-Xu/userfaultfd-wp-Support-shmem-and-hugetlbfs/20220405-100136
> base:   https://github.com/hnaz/linux-mm master
> config: s390-randconfig-r044-20220406 (https://download.01.org/0day-ci/archive/20220406/202204062154.2txNJyaf-lkp@intel.com/config)
> compiler: s390-linux-gcc (GCC) 11.2.0
> reproduce (this is a W=1 build):
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # https://github.com/intel-lab-lkp/linux/commit/e7e7aaec811e2817cd169f0cc1d8f81bdf1f05c3
>         git remote add linux-review https://github.com/intel-lab-lkp/linux
>         git fetch --no-tags linux-review Peter-Xu/userfaultfd-wp-Support-shmem-and-hugetlbfs/20220405-100136
>         git checkout e7e7aaec811e2817cd169f0cc1d8f81bdf1f05c3
>         # save the config file to linux build tree
>         mkdir build_dir
>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=s390 SHELL=/bin/bash
> 
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot <lkp@intel.com>
> 
> All errors (new ones prefixed by >>):
> 
>    mm/hugetlb.c: In function 'hugetlb_fault':
> >> mm/hugetlb.c:5678:13: error: implicit declaration of function 'huge_pte_none_mostly'; did you mean 'pte_none_mostly'? [-Werror=implicit-function-declaration]
>     5678 |         if (huge_pte_none_mostly(entry)) {
>          |             ^~~~~~~~~~~~~~~~~~~~
>          |             pte_none_mostly
>    cc1: some warnings being treated as errors

Ah, the s390 stub was forgotten again, sorry.  I hope someday s390 will
start to include asm-generic/hugetlb.h like all the rest of the archs,
because that really feels like how it should work.. or the dir should be
renamed to asm-generic-without-s390/. :(

The expected fix patch is attached (to be squashed into patch "mm: Introduce
PTE_MARKER swap entry").

Thanks,

-- 
Peter Xu

[-- Attachment #2: 0001-fixup-mm-Introduce-PTE_MARKER-swap-entry.patch --]
[-- Type: text/plain, Size: 788 bytes --]

From 8b84d7f6b86c30623208eff386b4f5169f1cb0db Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Wed, 6 Apr 2022 10:55:55 -0400
Subject: [PATCH] fixup! mm: Introduce PTE_MARKER swap entry

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/s390/include/asm/hugetlb.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
index be99eda87f4d..32c3fd6e878e 100644
--- a/arch/s390/include/asm/hugetlb.h
+++ b/arch/s390/include/asm/hugetlb.h
@@ -85,6 +85,11 @@ static inline int huge_pte_none(pte_t pte)
 	return pte_none(pte);
 }
 
+static inline int huge_pte_none_mostly(pte_t pte)
+{
+	return huge_pte_none(pte);
+}
+
 static inline int huge_pte_write(pte_t pte)
 {
 	return pte_write(pte);
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 01/23] mm: Introduce PTE_MARKER swap entry
  2022-04-05  1:46 ` [PATCH v8 01/23] mm: Introduce PTE_MARKER swap entry Peter Xu
@ 2022-04-12  1:07   ` Alistair Popple
  2022-04-12 19:45     ` Peter Xu
  2022-04-19  8:25   ` Alistair Popple
  1 sibling, 1 reply; 63+ messages in thread
From: Alistair Popple @ 2022-04-12  1:07 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Mike Kravetz, Andrew Morton,
	David Hildenbrand, Matthew Wilcox, Nadav Amit, Axel Rasmussen,
	Andrea Arcangeli, Kirill A . Shutemov, Hugh Dickins,
	Jerome Glisse, Mike Rapoport

[-- Attachment #1: Type: text/plain, Size: 6682 bytes --]

Hi Peter,

I noticed this while reviewing the next patch in the series. I think you need to
add CONFIG_PTE_MARKER to the below as well:

#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION) || \
    defined(CONFIG_DEVICE_PRIVATE)
static inline int non_swap_entry(swp_entry_t entry)
{
	return swp_type(entry) >= MAX_SWAPFILES;
}
#else
static inline int non_swap_entry(swp_entry_t entry)
{
	return 0;
}
#endif

Otherwise marker entries will be treated as swap entries, which is wrong for
example in swapin_walk_pmd_entry() as marker entries are no longer considered
pte_none().
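
For example, roughly (just a sketch of the idea, not a tested patch):

#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION) || \
    defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_PTE_MARKER)
static inline int non_swap_entry(swp_entry_t entry)
{
	/* pte marker entries also live above MAX_SWAPFILES, so this
	 * comparison now covers them too */
	return swp_type(entry) >= MAX_SWAPFILES;
}
#else
static inline int non_swap_entry(swp_entry_t entry)
{
	return 0;
}
#endif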

- Alistair

Peter Xu <peterx@redhat.com> writes:

> This patch introduces a new swap entry type called PTE_MARKER.  It can be
> installed for any pte that maps a file-backed memory when the pte is
> temporarily zapped, so as to maintain per-pte information.
>
> The information that is kept in the pte is called a "marker".  Here we define the
> marker as "unsigned long" just to match pgoff_t, however it will only work if
> it still fits in swp_offset(), which is e.g. currently 58 bits on x86_64.
>
> A new config CONFIG_PTE_MARKER is introduced too; it's by default off.  A bunch
> of helpers are defined altogether to service the rest of the pte marker code.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/asm-generic/hugetlb.h |  9 ++++
>  include/linux/swap.h          | 15 ++++++-
>  include/linux/swapops.h       | 78 +++++++++++++++++++++++++++++++++++
>  mm/Kconfig                    |  6 +++
>  4 files changed, 107 insertions(+), 1 deletion(-)
>
> diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
> index 8e1e6244a89d..f39cad20ffc6 100644
> --- a/include/asm-generic/hugetlb.h
> +++ b/include/asm-generic/hugetlb.h
> @@ -2,6 +2,9 @@
>  #ifndef _ASM_GENERIC_HUGETLB_H
>  #define _ASM_GENERIC_HUGETLB_H
>
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
> +
>  static inline pte_t mk_huge_pte(struct page *page, pgprot_t pgprot)
>  {
>  	return mk_pte(page, pgprot);
> @@ -80,6 +83,12 @@ static inline int huge_pte_none(pte_t pte)
>  }
>  #endif
>
> +/* Please refer to comments above pte_none_mostly() for the usage */
> +static inline int huge_pte_none_mostly(pte_t pte)
> +{
> +	return huge_pte_none(pte) || is_pte_marker(pte);
> +}
> +
>  #ifndef __HAVE_ARCH_HUGE_PTE_WRPROTECT
>  static inline pte_t huge_pte_wrprotect(pte_t pte)
>  {
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 7daae5a4b3e1..5553189d0215 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -55,6 +55,19 @@ static inline int current_is_kswapd(void)
>   * actions on faults.
>   */
>
> +/*
> + * PTE markers are used to persist information onto PTEs that are mapped with
> + * file-backed memories.  As its name "PTE" hints, it should only be applied to
> + * the leaves of pgtables.
> + */
> +#ifdef CONFIG_PTE_MARKER
> +#define SWP_PTE_MARKER_NUM 1
> +#define SWP_PTE_MARKER     (MAX_SWAPFILES + SWP_HWPOISON_NUM + \
> +			    SWP_MIGRATION_NUM + SWP_DEVICE_NUM)
> +#else
> +#define SWP_PTE_MARKER_NUM 0
> +#endif
> +
>  /*
>   * Unaddressable device memory support. See include/linux/hmm.h and
>   * Documentation/vm/hmm.rst. Short description is we need struct pages for
> @@ -107,7 +120,7 @@ static inline int current_is_kswapd(void)
>
>  #define MAX_SWAPFILES \
>  	((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
> -	SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
> +	SWP_MIGRATION_NUM - SWP_HWPOISON_NUM - SWP_PTE_MARKER_NUM)
>
>  /*
>   * Magic header for a swap area. The first part of the union is
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 32d517a28969..7a00627845f0 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -274,6 +274,84 @@ static inline int is_readable_migration_entry(swp_entry_t entry)
>
>  #endif
>
> +typedef unsigned long pte_marker;
> +
> +#define  PTE_MARKER_MASK     (0)
> +
> +#ifdef CONFIG_PTE_MARKER
> +
> +static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
> +{
> +	return swp_entry(SWP_PTE_MARKER, marker);
> +}
> +
> +static inline bool is_pte_marker_entry(swp_entry_t entry)
> +{
> +	return swp_type(entry) == SWP_PTE_MARKER;
> +}
> +
> +static inline pte_marker pte_marker_get(swp_entry_t entry)
> +{
> +	return swp_offset(entry) & PTE_MARKER_MASK;
> +}
> +
> +static inline bool is_pte_marker(pte_t pte)
> +{
> +	return is_swap_pte(pte) && is_pte_marker_entry(pte_to_swp_entry(pte));
> +}
> +
> +#else /* CONFIG_PTE_MARKER */
> +
> +static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
> +{
> +	/* This should never be called if !CONFIG_PTE_MARKER */
> +	WARN_ON_ONCE(1);
> +	return swp_entry(0, 0);
> +}
> +
> +static inline bool is_pte_marker_entry(swp_entry_t entry)
> +{
> +	return false;
> +}
> +
> +static inline pte_marker pte_marker_get(swp_entry_t entry)
> +{
> +	return 0;
> +}
> +
> +static inline bool is_pte_marker(pte_t pte)
> +{
> +	return false;
> +}
> +
> +#endif /* CONFIG_PTE_MARKER */
> +
> +static inline pte_t make_pte_marker(pte_marker marker)
> +{
> +	return swp_entry_to_pte(make_pte_marker_entry(marker));
> +}
> +
> +/*
> + * This is a special version to check pte_none() just to cover the case when
> + * the pte is a pte marker.  It existed because in many cases the pte marker
> + * should be seen as a none pte; it's just that we have stored some information
> + * onto the none pte so it becomes not-none any more.
> + *
> + * It should be used when the pte is file-backed, ram-based and backing
> + * userspace pages, like shmem.  It is not needed upon pgtables that do not
> + * support pte markers at all.  For example, it's not needed on anonymous
> + * memory, kernel-only memory (including when the system is during-boot),
> + * non-ram based generic file-system.  It's fine to be used even there, but the
> + * extra pte marker check will be pure overhead.
> + *
> + * For systems configured with !CONFIG_PTE_MARKER this will be automatically
> + * optimized to pte_none().
> + */
> +static inline int pte_none_mostly(pte_t pte)
> +{
> +	return pte_none(pte) || is_pte_marker(pte);
> +}
> +
>  static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
>  {
>  	struct page *p = pfn_to_page(swp_offset(entry));
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 034d87953600..a1688b9314b2 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -909,6 +909,12 @@ config ANON_VMA_NAME
>  	  area from being merged with adjacent virtual memory areas due to the
>  	  difference in their name.
>
> +config PTE_MARKER
> +	bool "Marker PTEs support"
> +
> +	help
> +	  Allows to create marker PTEs for file-backed memory.
> +
>  source "mm/damon/Kconfig"
>
>  endmenu

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 02/23] mm: Teach core mm about pte markers
  2022-04-05  1:48 ` [PATCH v8 02/23] mm: Teach core mm about pte markers Peter Xu
@ 2022-04-12  1:22   ` Alistair Popple
  2022-04-12 19:53     ` Peter Xu
  0 siblings, 1 reply; 63+ messages in thread
From: Alistair Popple @ 2022-04-12  1:22 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Mike Kravetz, Nadav Amit, Matthew Wilcox,
	Mike Rapoport, David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen

[-- Attachment #1: Type: text/plain, Size: 7863 bytes --]

I've been reviewing existing pte_none() call sites and noticed the following:

From khugepaged_scan_pmd():

pte_t pteval = *_pte;
if (is_swap_pte(pteval)) {
    if (++unmapped <= khugepaged_max_ptes_swap) {
        /*
         * Always be strict with uffd-wp
         * enabled swap entries.  Please see
         * comment below for pte_uffd_wp().
         */
        if (pte_swp_uffd_wp(pteval)) {
            result = SCAN_PTE_UFFD_WP;
            goto out_unmap;
        }
        continue;
    } else {
        result = SCAN_EXCEED_SWAP_PTE;
        count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
        goto out_unmap;
    }
}
if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
    if (!userfaultfd_armed(vma) &&
        ++none_or_zero <= khugepaged_max_ptes_none) {
        continue;
    } else {
        result = SCAN_EXCEED_NONE_PTE;
        count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
        goto out_unmap;
    }
}

I think the above could encounter a marker PTE, right? So the behaviour would be
wrong in that case. As I understand things, the is_swap_pte() path will be taken
rather than pte_none(), but in the absence of any special handling shouldn't
marker PTEs mostly be treated as pte_none() here?

I think you need to s/pte_none/pte_none_mostly/ here and swap the order of
conditionals around.
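
Something like the below, roughly (untested; I've split the zero-pfn check
out so pte_pfn() is never applied to a swap entry, at the cost of
duplicating the "none" path):

if (pte_none_mostly(pteval)) {
    /* none ptes and marker PTEs both take the "none" path now */
    if (!userfaultfd_armed(vma) &&
        ++none_or_zero <= khugepaged_max_ptes_none) {
        continue;
    } else {
        result = SCAN_EXCEED_NONE_PTE;
        count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
        goto out_unmap;
    }
}
if (is_swap_pte(pteval)) {
    /* only real (non-marker) swap entries reach here */
    if (++unmapped <= khugepaged_max_ptes_swap) {
        /* Always be strict with uffd-wp enabled swap entries. */
        if (pte_swp_uffd_wp(pteval)) {
            result = SCAN_PTE_UFFD_WP;
            goto out_unmap;
        }
        continue;
    } else {
        result = SCAN_EXCEED_SWAP_PTE;
        count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
        goto out_unmap;
    }
}
if (is_zero_pfn(pte_pfn(pteval))) {
    ... /* same body as the "none" path above */
}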

- Alistair

Peter Xu <peterx@redhat.com> writes:

> This patch still does not use pte marker in any way, however it teaches the
> core mm about the pte marker idea.
>
> For example, handle_pte_marker() is introduced that will parse and handle all
> the pte marker faults.
>
> Many of the places are more about commenting it up - so that we know there's
> the possibility of pte marker showing up, and why we don't need special code
> for the cases.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  fs/userfaultfd.c | 10 ++++++----
>  mm/filemap.c     |  5 +++++
>  mm/hmm.c         |  2 +-
>  mm/memcontrol.c  |  8 ++++++--
>  mm/memory.c      | 23 +++++++++++++++++++++++
>  mm/mincore.c     |  3 ++-
>  mm/mprotect.c    |  3 +++
>  7 files changed, 46 insertions(+), 8 deletions(-)
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index aa0c47cb0d16..8b4a94f5a238 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -249,9 +249,10 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
>
>  	/*
>  	 * Lockless access: we're in a wait_event so it's ok if it
> -	 * changes under us.
> +	 * changes under us.  PTE markers should be handled the same as none
> +	 * ptes here.
>  	 */
> -	if (huge_pte_none(pte))
> +	if (huge_pte_none_mostly(pte))
>  		ret = true;
>  	if (!huge_pte_write(pte) && (reason & VM_UFFD_WP))
>  		ret = true;
> @@ -330,9 +331,10 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
>  	pte = pte_offset_map(pmd, address);
>  	/*
>  	 * Lockless access: we're in a wait_event so it's ok if it
> -	 * changes under us.
> +	 * changes under us.  PTE markers should be handled the same as none
> +	 * ptes here.
>  	 */
> -	if (pte_none(*pte))
> +	if (pte_none_mostly(*pte))
>  		ret = true;
>  	if (!pte_write(*pte) && (reason & VM_UFFD_WP))
>  		ret = true;
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 3a5ffb5587cd..ef77dae8c28d 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3382,6 +3382,11 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>  		vmf->pte += xas.xa_index - last_pgoff;
>  		last_pgoff = xas.xa_index;
>
> +		/*
> +		 * NOTE: If there're PTE markers, we'll leave them to be
> +		 * handled in the specific fault path, and it'll prohibit the
> +		 * fault-around logic.
> +		 */
>  		if (!pte_none(*vmf->pte))
>  			goto unlock;
>
> diff --git a/mm/hmm.c b/mm/hmm.c
> index af71aac3140e..3fd3242c5e50 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -239,7 +239,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>  	pte_t pte = *ptep;
>  	uint64_t pfn_req_flags = *hmm_pfn;
>
> -	if (pte_none(pte)) {
> +	if (pte_none_mostly(pte)) {
>  		required_fault =
>  			hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
>  		if (required_fault)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7a08737bac4b..08af97c73f0f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5644,10 +5644,14 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
>
>  	if (pte_present(ptent))
>  		page = mc_handle_present_pte(vma, addr, ptent);
> +	else if (pte_none_mostly(ptent))
> +		/*
> +		 * PTE markers should be treated as a none pte here, separated
> +		 * from other swap handling below.
> +		 */
> +		page = mc_handle_file_pte(vma, addr, ptent);
>  	else if (is_swap_pte(ptent))
>  		page = mc_handle_swap_pte(vma, ptent, &ent);
> -	else if (pte_none(ptent))
> -		page = mc_handle_file_pte(vma, addr, ptent);
>
>  	if (!page && !ent.val)
>  		return ret;
> diff --git a/mm/memory.c b/mm/memory.c
> index 2c5d1bb4694f..3f396241a7db 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -100,6 +100,8 @@ struct page *mem_map;
>  EXPORT_SYMBOL(mem_map);
>  #endif
>
> +static vm_fault_t do_fault(struct vm_fault *vmf);
> +
>  /*
>   * A number of key systems in x86 including ioremap() rely on the assumption
>   * that high_memory defines the upper bound on direct map memory, then end
> @@ -1415,6 +1417,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  			if (!should_zap_page(details, page))
>  				continue;
>  			rss[mm_counter(page)]--;
> +		} else if (is_pte_marker_entry(entry)) {
> +			/* By default, simply drop all pte markers when zap */
>  		} else if (is_hwpoison_entry(entry)) {
>  			if (!should_zap_cows(details))
>  				continue;
> @@ -3555,6 +3559,23 @@ static inline bool should_try_to_free_swap(struct page *page,
>  		page_count(page) == 2;
>  }
>
> +static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> +{
> +	swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);
> +	unsigned long marker = pte_marker_get(entry);
> +
> +	/*
> +	 * PTE markers should always be with file-backed memories, and the
> +	 * marker should never be empty.  If anything weird happened, the best
> +	 * thing to do is to kill the process along with its mm.
> +	 */
> +	if (WARN_ON_ONCE(vma_is_anonymous(vmf->vma) || !marker))
> +		return VM_FAULT_SIGBUS;
> +
> +	/* TODO: handle pte markers */
> +	return 0;
> +}
> +
>  /*
>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>   * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -3592,6 +3613,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  			ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
>  		} else if (is_hwpoison_entry(entry)) {
>  			ret = VM_FAULT_HWPOISON;
> +		} else if (is_pte_marker_entry(entry)) {
> +			ret = handle_pte_marker(vmf);
>  		} else {
>  			print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
>  			ret = VM_FAULT_SIGBUS;
> diff --git a/mm/mincore.c b/mm/mincore.c
> index f4f627325e12..fa200c14185f 100644
> --- a/mm/mincore.c
> +++ b/mm/mincore.c
> @@ -122,7 +122,8 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  	for (; addr != end; ptep++, addr += PAGE_SIZE) {
>  		pte_t pte = *ptep;
>
> -		if (pte_none(pte))
> +		/* We need to do cache lookup too for pte markers */
> +		if (pte_none_mostly(pte))
>  			__mincore_unmapped_range(addr, addr + PAGE_SIZE,
>  						 vma, vec);
>  		else if (pte_present(pte))
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 56060acdabd3..709a6f73b764 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -188,6 +188,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  					newpte = pte_swp_mksoft_dirty(newpte);
>  				if (pte_swp_uffd_wp(oldpte))
>  					newpte = pte_swp_mkuffd_wp(newpte);
> +			} else if (is_pte_marker_entry(entry)) {
> +				/* Skip it, the same as none pte */
> +				continue;
>  			} else {
>  				newpte = oldpte;
>  			}

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 03/23] mm: Check against orig_pte for finish_fault()
  2022-04-05  1:48 ` [PATCH v8 03/23] mm: Check against orig_pte for finish_fault() Peter Xu
@ 2022-04-12  2:05   ` Alistair Popple
  2022-04-12 19:54     ` Peter Xu
       [not found]   ` <CGME20220413140330eucas1p167da41e079712b829ef8237dc27b049c@eucas1p1.samsung.com>
  2022-04-15 14:21   ` Guenter Roeck
  2 siblings, 1 reply; 63+ messages in thread
From: Alistair Popple @ 2022-04-12  2:05 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Mike Kravetz, Nadav Amit, Matthew Wilcox,
	Mike Rapoport, David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen

[-- Attachment #1: Type: text/plain, Size: 2109 bytes --]

Looks ok to me now and should work for all architectures.

Reviewed-by: Alistair Popple <apopple@nvidia.com>

Peter Xu <peterx@redhat.com> writes:

> We used to check against none pte in finish_fault(), with the assumption
> that the orig_pte is always none pte.
>
> This change prepares us to be able to call do_fault() on !none ptes.  For
> example, we should allow that to happen for pte marker so that we can restore
> information out of the pte markers.
>
> Let's change the "pte_none" check into detecting changes since we fetched
> orig_pte.  One trivial thing to take care of here is, when pmd==NULL for
> the pgtable we may not initialize orig_pte at all in handle_pte_fault().
>
> By default orig_pte will be all zeros however the problem is not all
> architectures are using all-zeros for a none pte.  pte_clear() will be the
> right thing to use here so that we'll always have a valid orig_pte value
> for the whole handle_pte_fault() call.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/memory.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 3f396241a7db..b1af996b09ca 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4241,7 +4241,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>  				      vmf->address, &vmf->ptl);
>  	ret = 0;
>  	/* Re-check under ptl */
> -	if (likely(pte_none(*vmf->pte)))
> +	if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
>  		do_set_pte(vmf, page, vmf->address);
>  	else
>  		ret = VM_FAULT_NOPAGE;
> @@ -4709,6 +4709,13 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>  		 * concurrent faults and from rmap lookups.
>  		 */
>  		vmf->pte = NULL;
> +		/*
> +		 * Always initialize orig_pte.  This matches with below
> +		 * code to have orig_pte to be the none pte if pte==NULL.
> +		 * This makes the rest code to be always safe to reference
> +		 * it, e.g. in finish_fault() we'll detect pte changes.
> +		 */
> +		pte_clear(vmf->vma->vm_mm, vmf->address, &vmf->orig_pte);
>  	} else {
>  		/*
>  		 * If a huge pmd materialized under us just retry later.  Use

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 01/23] mm: Introduce PTE_MARKER swap entry
  2022-04-12  1:07   ` Alistair Popple
@ 2022-04-12 19:45     ` Peter Xu
  2022-04-13  0:30       ` Alistair Popple
  0 siblings, 1 reply; 63+ messages in thread
From: Peter Xu @ 2022-04-12 19:45 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-kernel, linux-mm, Mike Kravetz, Andrew Morton,
	David Hildenbrand, Matthew Wilcox, Nadav Amit, Axel Rasmussen,
	Andrea Arcangeli, Kirill A . Shutemov, Hugh Dickins,
	Jerome Glisse, Mike Rapoport

On Tue, Apr 12, 2022 at 11:07:56AM +1000, Alistair Popple wrote:
> Hi Peter,

Hi, Alistair,

> 
> I noticed this while reviewing the next patch in the series. I think you need to
> add CONFIG_PTE_MARKER to the below as well:
> 
> #if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION) || \
>     defined(CONFIG_DEVICE_PRIVATE)
> static inline int non_swap_entry(swp_entry_t entry)
> {
> 	return swp_type(entry) >= MAX_SWAPFILES;
> }
> #else
> static inline int non_swap_entry(swp_entry_t entry)
> {
> 	return 0;
> }
> #endif
> 
> Otherwise marker entries will be treated as swap entries, which is wrong for
> example in swapin_walk_pmd_entry() as marker entries are no longer considered
> pte_none().

Thanks for the comment, that makes sense.

Instead of adding PTE_MARKER into this equation, I'm going backward and
wondering why we need to bother with non_swap_entry() at all if
MAX_SWAPFILES is already defined with proper knowledge of all these bits.

#define MAX_SWAPFILES \
	((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
	SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)

So, I agree with your analysis, but instead of adding PTE_MARKER, what do
you think about dropping that complexity as a whole (possibly with a
standalone patch)?

---8<---
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index d356ab4047f7..5af852b68805 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -387,18 +387,10 @@ static inline void num_poisoned_pages_inc(void)
 }
 #endif
 
-#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION) || \
-    defined(CONFIG_DEVICE_PRIVATE)
 static inline int non_swap_entry(swp_entry_t entry)
 {
        return swp_type(entry) >= MAX_SWAPFILES;
 }
-#else
-static inline int non_swap_entry(swp_entry_t entry)
-{
-       return 0;
-}
-#endif
 
 #endif /* CONFIG_MMU */
 #endif /* _LINUX_SWAPOPS_H */
---8<---

Thanks,

-- 
Peter Xu



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 02/23] mm: Teach core mm about pte markers
  2022-04-12  1:22   ` Alistair Popple
@ 2022-04-12 19:53     ` Peter Xu
  0 siblings, 0 replies; 63+ messages in thread
From: Peter Xu @ 2022-04-12 19:53 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-kernel, linux-mm, Mike Kravetz, Nadav Amit, Matthew Wilcox,
	Mike Rapoport, David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen

Hi, Alistair,

On Tue, Apr 12, 2022 at 11:22:01AM +1000, Alistair Popple wrote:
> I've been reviewing existing pte_none() call sites and noticed the following:
> 
> From khugepaged_scan_pmd():
> 
> pte_t pteval = *_pte;
> if (is_swap_pte(pteval)) {
>     if (++unmapped <= khugepaged_max_ptes_swap) {
>         /*
>          * Always be strict with uffd-wp
>          * enabled swap entries.  Please see
>          * comment below for pte_uffd_wp().
>          */
>         if (pte_swp_uffd_wp(pteval)) {
>             result = SCAN_PTE_UFFD_WP;
>             goto out_unmap;
>         }
>         continue;
>     } else {
>         result = SCAN_EXCEED_SWAP_PTE;
>         count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
>         goto out_unmap;
>     }
> }
> if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>     if (!userfaultfd_armed(vma) &&
>         ++none_or_zero <= khugepaged_max_ptes_none) {
>         continue;
>     } else {
>         result = SCAN_EXCEED_NONE_PTE;
>         count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>         goto out_unmap;
>     }
> }
> 
> I think the above could encounter a marker PTE, right? So the behaviour would be
> wrong in that case. As I understand things, the is_swap_pte() path will be taken
> rather than pte_none(), but in the absence of any special handling shouldn't
> marker PTEs mostly be treated as pte_none() here?
> 
> I think you need to s/pte_none/pte_none_mostly/ here and swap the order of
> conditionals around.

Isn't khugepaged_scan_pmd() only for anonymous?

The shmem side is covered by khugepaged_scan_file(), imho.  We definitely
don't want to collapse shmem vma ranges that have uffd-wp registered, and
that's actually handled explicitly in "mm/khugepaged: Don't recycle vma
pgtable if uffd-wp registered".  Please feel free to have a look.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 03/23] mm: Check against orig_pte for finish_fault()
  2022-04-12  2:05   ` Alistair Popple
@ 2022-04-12 19:54     ` Peter Xu
  0 siblings, 0 replies; 63+ messages in thread
From: Peter Xu @ 2022-04-12 19:54 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-kernel, linux-mm, Mike Kravetz, Nadav Amit, Matthew Wilcox,
	Mike Rapoport, David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen

On Tue, Apr 12, 2022 at 12:05:41PM +1000, Alistair Popple wrote:
> Looks ok to me now and should work for all architectures.
> 
> Reviewed-by: Alistair Popple <apopple@nvidia.com>

Thanks again for taking a look.  Currently the series is queued in -mm
already, but I'll pick up these R-bs if there's a new version.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 01/23] mm: Introduce PTE_MARKER swap entry
  2022-04-12 19:45     ` Peter Xu
@ 2022-04-13  0:30       ` Alistair Popple
  2022-04-13 13:44         ` Peter Xu
  0 siblings, 1 reply; 63+ messages in thread
From: Alistair Popple @ 2022-04-13  0:30 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Mike Kravetz, Andrew Morton,
	David Hildenbrand, Matthew Wilcox, Nadav Amit, Axel Rasmussen,
	Andrea Arcangeli, Kirill A . Shutemov, Hugh Dickins,
	Jerome Glisse, Mike Rapoport

[-- Attachment #1: Type: text/plain, Size: 2981 bytes --]

Peter Xu <peterx@redhat.com> writes:

> On Tue, Apr 12, 2022 at 11:07:56AM +1000, Alistair Popple wrote:
>> Hi Peter,
>
> Hi, Alistair,
>
>>
>> I noticed this while reviewing the next patch in the series. I think you need to
>> add CONFIG_PTE_MARKER to the below as well:
>>
>> #if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION) || \
>>     defined(CONFIG_DEVICE_PRIVATE)
>> static inline int non_swap_entry(swp_entry_t entry)
>> {
>> 	return swp_type(entry) >= MAX_SWAPFILES;
>> }
>> #else
>> static inline int non_swap_entry(swp_entry_t entry)
>> {
>> 	return 0;
>> }
>> #endif
>>
>> Otherwise marker entries will be treated as swap entries, which is wrong for
>> example in swapin_walk_pmd_entry() as marker entries are no longer considered
>> pte_none().
>
> Thanks for the comment, that makes sense.
>
> Instead of adding PTE_MARKER into this equation, I'm going backward and
> wondering why we need to bother with non_swap_entry() at all if
> MAX_SWAPFILES is already defined with proper knowledge of all these bits.

I was going to suggest it was to help the compiler optimise the non-swap entry
code away. But I just tested and it makes no difference in .text section size
either way so I think your suggestion is good unless that isn't true for other
architecture/compiler combinations (I only tried gcc-10.2.1 and x86_64).

That's a possibility because the optimisation isn't obvious to me at least.

non_swap_entry() is equivalent to:

(entry.val >> SWP_TYPE_SHIFT) >= MAX_SWAPFILES;
(entry.val >> (BITS_PER_XA_VALUE - MAX_SWAPFILES_SHIFT)) >= (1<<5);
(entry.val >> (BITS_PER_LONG - 1 - 5)) >= (1<<5);
(entry.val >> 58) >= (1<<5);

Where entry.val is a long. So from that alone it's not obvious this could be
optimised away, because nothing there implies entry.val != (1<<63) which would
make the conditional true. But there's a lot of inlining going on in the
creation of swap entries which I didn't trace, so something must end up implying
entry.val < (1<<63).

>
> #define MAX_SWAPFILES \
> 	((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
> 	SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
>
> So, I agree with your analysis, but instead of adding PTE_MARKER, what do
> you think about dropping that complexity as a whole (possibly with a
> standalone patch)?
>
> ---8<---
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index d356ab4047f7..5af852b68805 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -387,18 +387,10 @@ static inline void num_poisoned_pages_inc(void)
>  }
>  #endif
>
> -#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION) || \
> -    defined(CONFIG_DEVICE_PRIVATE)
>  static inline int non_swap_entry(swp_entry_t entry)
>  {
>         return swp_type(entry) >= MAX_SWAPFILES;
>  }
> -#else
> -static inline int non_swap_entry(swp_entry_t entry)
> -{
> -       return 0;
> -}
> -#endif
>
>  #endif /* CONFIG_MMU */
>  #endif /* _LINUX_SWAPOPS_H */
> ---8<---
>
> Thanks,

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 01/23] mm: Introduce PTE_MARKER swap entry
  2022-04-13  0:30       ` Alistair Popple
@ 2022-04-13 13:44         ` Peter Xu
  0 siblings, 0 replies; 63+ messages in thread
From: Peter Xu @ 2022-04-13 13:44 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-kernel, linux-mm, Mike Kravetz, Andrew Morton,
	David Hildenbrand, Matthew Wilcox, Nadav Amit, Axel Rasmussen,
	Andrea Arcangeli, Kirill A . Shutemov, Hugh Dickins,
	Jerome Glisse, Mike Rapoport

On Wed, Apr 13, 2022 at 10:30:33AM +1000, Alistair Popple wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Tue, Apr 12, 2022 at 11:07:56AM +1000, Alistair Popple wrote:
> >> Hi Peter,
> >
> > Hi, Alistair,
> >
> >>
> >> I noticed this while reviewing the next patch in the series. I think you need to
> >> add CONFIG_PTE_MARKER to the below as well:
> >>
> >> #if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION) || \
> >>     defined(CONFIG_DEVICE_PRIVATE)
> >> static inline int non_swap_entry(swp_entry_t entry)
> >> {
> >> 	return swp_type(entry) >= MAX_SWAPFILES;
> >> }
> >> #else
> >> static inline int non_swap_entry(swp_entry_t entry)
> >> {
> >> 	return 0;
> >> }
> >> #endif
> >>
> >> Otherwise marker entries will be treated as swap entries, which is wrong for
> >> example in swapin_walk_pmd_entry() as marker entries are no longer considered
> >> pte_none().
> >
> > Thanks for the comment, that makes sense.
> >
> > Instead of adding PTE_MARKER into this equation, I'm going backward and
> > wondering why we need to bother with non_swap_entry() at all if
> > MAX_SWAPFILES is already defined with proper knowledge of all these bits.
> 
> I was going to suggest it was to help the compiler optimise the non-swap entry
> code away. But I just tested and it makes no difference in .text section size
> either way so I think your suggestion is good unless that isn't true for other
> architecture/compiler combinations (I only tried gcc-10.2.1 and x86_64).
> 
> That's a possibility because the optimisation isn't obvious to me at least.
> 
> non_swap_entry() is equivalent to:
> 
> (entry.val >> SWP_TYPE_SHIFT) >= MAX_SWAPFILES;
> (entry.val >> (BITS_PER_XA_VALUE - MAX_SWAPFILES_SHIFT)) >= (1<<5);
> (entry.val >> (BITS_PER_LONG - 1 - 5)) >= (1<<5);
> (entry.val >> 58) >= (1<<5);
> 
> Where entry.val is a long. So from that alone it's not obvious this could be
> optimised away, because nothing there implies entry.val != (1<<63) which would
> make the conditional true. But there's a lot of inlining going on in the
> creation of swap entries which I didn't trace, so something must end up implying
> entry.val < (1<<63).

I think my point was that we check non_swap_entry() with the pre-assumption
that it's a swap entry, which means we're already on the slow path after
we've parsed the pte entry and found it's not present.

So I doubt how much the optimization (even if it turns out to be applicable)
could help in reality, not to mention that it'll only have an effect when
none of the configs are set, so it's a micro-optimization on a slow path in
a rare config setup.

For any sane modern host I'd expect CONFIG_MIGRATION to at least be set..
which invalidates any potential optimization we're discussing here.
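
(To put numbers on it: assuming for example SWP_MIGRATION_NUM is 2 and the
other special-entry counts are zero, MAX_SWAPFILES = (1 << 5) - 2 = 30 and
migration entries occupy types 30 and 31, so swp_type(entry) >=
MAX_SWAPFILES is genuinely reachable and can't be folded away.)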

Let me post a patch for this and move the discussion there.  Thanks a lot
for looking into it.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 03/23] mm: Check against orig_pte for finish_fault()
       [not found]   ` <CGME20220413140330eucas1p167da41e079712b829ef8237dc27b049c@eucas1p1.samsung.com>
@ 2022-04-13 14:03     ` Marek Szyprowski
  2022-04-13 16:43       ` Peter Xu
  0 siblings, 1 reply; 63+ messages in thread
From: Marek Szyprowski @ 2022-04-13 14:03 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm
  Cc: Mike Kravetz, Nadav Amit, Matthew Wilcox, Mike Rapoport,
	David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple

Hi,

On 05.04.2022 03:48, Peter Xu wrote:
> We used to check against none pte in finish_fault(), with the assumption
> that the orig_pte is always none pte.
>
> This change prepares us to be able to call do_fault() on !none ptes.  For
> example, we should allow that to happen for pte marker so that we can restore
> information out of the pte markers.
>
> Let's change the "pte_none" check into detecting changes since we fetched
> orig_pte.  One trivial thing to take care of here is, when pmd==NULL for
> the pgtable we may not initialize orig_pte at all in handle_pte_fault().
>
> By default orig_pte will be all zeros however the problem is not all
> architectures are using all-zeros for a none pte.  pte_clear() will be the
> right thing to use here so that we'll always have a valid orig_pte value
> for the whole handle_pte_fault() call.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

This patch landed in today's linux next-20220413 as commit fa6009949163 
("mm: check against orig_pte for finish_fault()"). Unfortunately it 
causes serious system instability on some ARM 32bit machines. I've 
observed it on all tested boards (various Samsung Exynos based, 
Raspberry Pi 3b and 4b, even QEMU's virt 32bit machine) when kernel was 
compiled from multi_v7_defconfig.

Here is a crash log from QEMU's ARM 32bit virt machine:

8<--- cut here ---
Unable to handle kernel paging request at virtual address e093263c
[e093263c] *pgd=42083811, *pte=00000000, *ppte=00000000
Internal error: Oops: 807 [#1] SMP ARM
Modules linked in:
CPU: 1 PID: 37 Comm: kworker/u4:0 Not tainted 
5.18.0-rc2-00176-gfa6009949163 #11684
Hardware name: Generic DT based system
PC is at cpu_ca15_set_pte_ext+0x4c/0x58
LR is at handle_mm_fault+0x46c/0xbb0
pc : [<c031bdec>]    lr : [<c0478144>]    psr: 40000013
sp : e0931df8  ip : e0931e54  fp : c26a8000
r10: 00000081  r9 : c2230880  r8 : 00000000
r7 : 00000081  r6 : beffffed  r5 : c267f000  r4 : c2230880
r3 : 00000000  r2 : 00000000  r1 : 00000040  r0 : e0931e3c
Flags: nZcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 10c5387d  Table: 4020406a  DAC: 00000051
Register r0 information: 2-page vmalloc region starting at 0xe0930000 
allocated at kernel_clone+0x8c/0x3a8
Register r1 information: non-paged memory
Register r2 information: NULL pointer
Register r3 information: NULL pointer
Register r4 information: slab task_struct start c2230880 pointer offset 0
Register r5 information: slab vm_area_struct start c267f000 pointer offset 0
Register r6 information: non-paged memory
Register r7 information: non-paged memory
Register r8 information: NULL pointer
Register r9 information: slab task_struct start c2230880 pointer offset 0
Register r10 information: non-paged memory
Register r11 information: slab mm_struct start c26a8000 pointer offset 0 
size 168
Register r12 information: 2-page vmalloc region starting at 0xe0930000 
allocated at kernel_clone+0x8c/0x3a8
Process kworker/u4:0 (pid: 37, stack limit = 0x(ptrval))
Stack: (0xe0931df8 to 0xe0932000)
...
---[ end trace 0000000000000000 ]---
CAN device driver interface
bgmac_bcma: Broadcom 47xx GBit MAC driver loaded
e1000e: Intel(R) PRO/1000 Network Driver
e1000e: Copyright(c) 1999 - 2015 Intel Corporation.
igb: Intel(R) Gigabit Ethernet Network Driver
igb: Copyright (c) 2007-2014 Intel Corporation.
pegasus: Pegasus/Pegasus II USB Ethernet driver
usbcore: registered new interface driver pegasus
usbcore: registered new interface driver asix
usbcore: registered new interface driver ax88179_178a
usbcore: registered new interface driver cdc_ether
usbcore: registered new interface driver smsc75xx
usbcore: registered new interface driver smsc95xx
usbcore: registered new interface driver net1080
usbcore: registered new interface driver cdc_subset
usbcore: registered new interface driver zaurus
usbcore: registered new interface driver cdc_ncm
ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
ehci-pci: EHCI PCI platform driver
ehci-platform: EHCI generic platform driver
ehci-omap: OMAP-EHCI Host Controller driver
ehci-orion: EHCI orion driver
SPEAr-ehci: EHCI SPEAr driver
ehci-st: EHCI STMicroelectronics driver
ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
ohci-pci: OHCI PCI platform driver
ohci-platform: OHCI generic platform driver
SPEAr-ohci: OHCI SPEAr driver
ohci-st: OHCI STMicroelectronics driver
usbcore: registered new interface driver usb-storage
rtc-pl031 9010000.pl031: registered as rtc0
rtc-pl031 9010000.pl031: setting system clock to 2022-04-13T13:49:19 UTC 
(1649857759)
i2c_dev: i2c /dev entries driver
sdhci: Secure Digital Host Controller Interface driver
sdhci: Copyright(c) Pierre Ossman
Synopsys Designware Multimedia Card Interface Driver
sdhci-pltfm: SDHCI platform and OF driver helper
ledtrig-cpu: registered to indicate activity on CPUs
usbcore: registered new interface driver usbhid
usbhid: USB HID core driver
NET: Registered PF_INET6 protocol family
Segment Routing with IPv6
In-situ OAM (IOAM) with IPv6
sit: IPv6, IPv4 and MPLS over IPv4 tunneling driver
NET: Registered PF_PACKET protocol family
can: controller area network core
NET: Registered PF_CAN protocol family
can: raw protocol
can: broadcast manager protocol
can: netlink gateway - max_hops=1
Key type dns_resolver registered
ThumbEE CPU extension supported.
Registering SWP/SWPB emulation handler
Loading compiled-in X.509 certificates
input: gpio-keys as /devices/platform/gpio-keys/input/input0
uart-pl011 9000000.pl011: no DMA platform data
EXT4-fs (vda): mounted filesystem with ordered data mode. Quota mode: 
disabled.
VFS: Mounted root (ext4 filesystem) readonly on device 254:0.
devtmpfs: mounted
Freeing unused kernel image (initmem) memory: 2048K
Run /sbin/init as init process
   with arguments:
     /sbin/init
   with environment:
     HOME=/
     TERM=linux
8<--- cut here ---
Unable to handle kernel paging request at virtual address e082662c
[e082662c] *pgd=42083811, *pte=00000000, *ppte=00000000
Internal error: Oops: 807 [#2] SMP ARM
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Tainted: G      D 
5.18.0-rc2-00176-gfa6009949163 #11684
Hardware name: Generic DT based system
PC is at cpu_ca15_set_pte_ext+0x4c/0x58
LR is at handle_mm_fault+0x46c/0xbb0
pc : [<c031bdec>]    lr : [<c0478144>]    psr: 40000013
sp : e0825de8  ip : e0825e44  fp : c213e000
r10: 00000081  r9 : c20e0000  r8 : 00000000
r7 : 00000081  r6 : befffff1  r5 : c2695000  r4 : c20e0000
r3 : 00000000  r2 : 00000000  r1 : 00000040  r0 : e0825e2c
Flags: nZcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 10c5387d  Table: 4020406a  DAC: 00000051
Register r0 information: 2-page vmalloc region starting at 0xe0824000 
allocated at kernel_clone+0x8c/0x3a8
Register r1 information: non-paged memory
Register r2 information: NULL pointer
Register r3 information: NULL pointer
Register r4 information: slab task_struct start c20e0000 pointer offset 0
Register r5 information: slab vm_area_struct start c2695000 pointer offset 0
Register r6 information: non-paged memory
Register r7 information: non-paged memory
Register r8 information: NULL pointer
Register r9 information: slab task_struct start c20e0000 pointer offset 0
Register r10 information: non-paged memory
Register r11 information: slab mm_struct start c213e000 pointer offset 0 
size 168
Register r12 information: 2-page vmalloc region starting at 0xe0824000 
allocated at kernel_clone+0x8c/0x3a8
Process swapper/0 (pid: 1, stack limit = 0x(ptrval))
Stack: (0xe0825de8 to 0xe0826000)
...
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
CPU1: stopping
CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D 
5.18.0-rc2-00176-gfa6009949163 #11684
Hardware name: Generic DT based system
  unwind_backtrace from show_stack+0x10/0x14
  show_stack from dump_stack_lvl+0x40/0x4c
  dump_stack_lvl from do_handle_IPI+0x2c4/0x2fc
  do_handle_IPI from ipi_handler+0x18/0x20
  ipi_handler from handle_percpu_devid_irq+0x8c/0x1e0
  handle_percpu_devid_irq from generic_handle_domain_irq+0x40/0x84
  generic_handle_domain_irq from gic_handle_irq+0x88/0xa8
  gic_handle_irq from generic_handle_arch_irq+0x34/0x44
  generic_handle_arch_irq from call_with_stack+0x18/0x20
  call_with_stack from __irq_svc+0x98/0xb0
Exception stack(0xe0869f50 to 0xe0869f98)
9f40:                                     00009ddc 00000000 00000001 
c031be20
9f60: c20e5d80 c1b48f20 c1904d10 c1904d6c c183e9e8 c1b47971 00000000 
00000000
9f80: c1904e24 e0869fa0 c0307b74 c0307b78 60000113 ffffffff
  __irq_svc from arch_cpu_idle+0x38/0x3c
  arch_cpu_idle from default_idle_call+0x3c/0xb8
  default_idle_call from do_idle+0x1f8/0x298
  do_idle from cpu_startup_entry+0x18/0x1c
  cpu_startup_entry from 0x40301780
---[ end Kernel panic - not syncing: Attempted to kill init! 
exitcode=0x0000000b ]---


> ---
>   mm/memory.c | 9 ++++++++-
>   1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 3f396241a7db..b1af996b09ca 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4241,7 +4241,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>   				      vmf->address, &vmf->ptl);
>   	ret = 0;
>   	/* Re-check under ptl */
> -	if (likely(pte_none(*vmf->pte)))
> +	if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
>   		do_set_pte(vmf, page, vmf->address);
>   	else
>   		ret = VM_FAULT_NOPAGE;
> @@ -4709,6 +4709,13 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>   		 * concurrent faults and from rmap lookups.
>   		 */
>   		vmf->pte = NULL;
> +		/*
> +		 * Always initialize orig_pte.  This matches with below
> +		 * code to have orig_pte to be the none pte if pte==NULL.
> +		 * This makes the rest code to be always safe to reference
> +		 * it, e.g. in finish_fault() we'll detect pte changes.
> +		 */
> +		pte_clear(vmf->vma->vm_mm, vmf->address, &vmf->orig_pte);
>   	} else {
>   		/*
>   		 * If a huge pmd materialized under us just retry later.  Use

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 03/23] mm: Check against orig_pte for finish_fault()
  2022-04-13 14:03     ` Marek Szyprowski
@ 2022-04-13 16:43       ` Peter Xu
  2022-04-14  7:51         ` Marek Szyprowski
  0 siblings, 1 reply; 63+ messages in thread
From: Peter Xu @ 2022-04-13 16:43 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: linux-kernel, linux-mm, Mike Kravetz, Nadav Amit, Matthew Wilcox,
	Mike Rapoport, David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple

[-- Attachment #1: Type: text/plain, Size: 3534 bytes --]

On Wed, Apr 13, 2022 at 04:03:28PM +0200, Marek Szyprowski wrote:
> Hi,

Hi, Marek,

> 
> On 05.04.2022 03:48, Peter Xu wrote:
> > We used to check against none pte in finish_fault(), with the assumption
> > that the orig_pte is always none pte.
> >
> > This change prepares us to be able to call do_fault() on !none ptes.  For
> > example, we should allow that to happen for pte marker so that we can restore
> > information out of the pte markers.
> >
> > Let's change the "pte_none" check into detecting changes since we fetched
> > orig_pte.  One trivial thing to take care of here is, when pmd==NULL for
> > the pgtable we may not initialize orig_pte at all in handle_pte_fault().
> >
> > By default orig_pte will be all zeros however the problem is not all
> > architectures are using all-zeros for a none pte.  pte_clear() will be the
> > right thing to use here so that we'll always have a valid orig_pte value
> > for the whole handle_pte_fault() call.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> This patch landed in today's linux next-20220413 as commit fa6009949163
> ("mm: check against orig_pte for finish_fault()"). Unfortunately it 
> causes serious system instability on some ARM 32bit machines. I've 
> observed it on all tested boards (various Samsung Exynos based, 
> Raspberry Pi 3b and 4b, even QEMU's virt 32bit machine) when kernel was 
> compiled from multi_v7_defconfig.

Thanks for the report.

> 
> Here is a crash log from QEMU's ARM 32bit virt machine:
> 
> 8<--- cut here ---
> Unable to handle kernel paging request at virtual address e093263c
> [e093263c] *pgd=42083811, *pte=00000000, *ppte=00000000
> Internal error: Oops: 807 [#1] SMP ARM
> Modules linked in:
> CPU: 1 PID: 37 Comm: kworker/u4:0 Not tainted 
> 5.18.0-rc2-00176-gfa6009949163 #11684
> Hardware name: Generic DT based system
> PC is at cpu_ca15_set_pte_ext+0x4c/0x58
> LR is at handle_mm_fault+0x46c/0xbb0

I had a feeling that for some reason the pte_clear() isn't working right
there when it's applied to a kernel stack variable on arm32.  I'm a total
newbie to arm32, so what I'm reading is this:

https://people.kernel.org/linusw/arm32-page-tables

Especially:

https://dflund.se/~triad/images/classic-mmu-page-table.jpg

It matches what I read in arm32's proc-v7-2level.S, in the comment above
cpu_v7_set_pte_ext:

  * - ptep  - pointer to level 2 translation table entry
  *    (hardware version is stored at +2048 bytes)        <----------

So it seems to me that arm32 needs to store some metadata at offset 0x800
from any pte_t* pointer passed over to pte_clear(), so the pointer must be
into a real pgtable or it'll write to random places in the kernel - am I
correct?

Does it mean that all pte_*() operations upon a kernel stack var will be
wrong?  I thought it could happen easily in the rest of mm too, but I
haven't checked much yet.  Apparently the current code just happens to work
well on arm32 and no such violation has occurred yet.
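
To illustrate what I mean (pseudo-C, not real kernel code):

  pte_t orig_pte;                   /* a kernel stack variable */
  pte_clear(mm, addr, &orig_pte);   /* on ARM32 2-level pgtables this also
                                     * stores the hardware copy at
                                     * (char *)&orig_pte + 2048, i.e. into
                                     * unrelated stack memory */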

That does sound a bit tricky, IMHO.  But I don't have an immediate solution
to make it less tricky.. though I have a thought for a workaround: simply
not calling pte_clear() on the stack var.

Would you try the attached patch to replace this problematic one?  That
means reverting commit fa6009949163 and applying the new one.  Please let me
know whether it solves the problem; so far I've only compile tested it, but
I'll run some more tests to make sure the uffd-wp scenarios work right with
the new version.

Thanks,

-- 
Peter Xu

[-- Attachment #2: 0001-mm-Check-against-orig_pte-for-finish_fault-when-prop.patch --]
[-- Type: text/plain, Size: 4090 bytes --]

From c9eb74ef2414b542b20f934060a4b9123ba83d0f Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Wed, 13 Apr 2022 12:31:01 -0400
Subject: [PATCH] mm: Check against orig_pte for finish_fault() when proper

This patch allows do_fault() to trigger on !pte_none() cases too.  This
prepares for the pte markers to be handled by do_fault() just like a none
pte.

To achieve this, instead of unconditionally checking against pte_none() in
finish_fault(), we may hit the case where orig_pte was some pte marker, and
what we want to do is replace the pte marker with a valid pte entry.  Then,
if orig_pte was set, we'd want to check the current *pte (under the pgtable
lock) against orig_pte rather than against a none pte.

Right now there's no way to safely reference orig_pte, because when the pmd
is not allocated handle_pte_fault() will not initialize it.

Another solution proposed before this patch was to do pte_clear() on
vmf->orig_pte for the pmd==NULL case; however it turns out that breaks
arm32 [1], because arm32 may assume that a pte_t* pointer always resides in
a real pgtable, not in any kernel stack variable.

To solve this, we add a new flag FAULT_FLAG_ORIG_PTE_VALID.  It'll be set
along with orig_pte when there is a valid orig_pte, and cleared when
orig_pte was not initialized.

It'll be updated every time we call handle_pte_fault(), so e.g. if a page
fault retry happens it'll be properly updated along with orig_pte.

[1] https://lore.kernel.org/lkml/710c48c9-406d-e4c5-a394-10501b951316@samsung.com/

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm_types.h |  3 +++
 mm/memory.c              | 13 ++++++++++++-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5c567eabfb16..717a47056330 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -817,6 +817,8 @@ typedef struct {
  * @FAULT_FLAG_UNSHARE: The fault is an unsharing request to unshare (and mark
  *                      exclusive) a possibly shared anonymous page that is
  *                      mapped R/O.
+ * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
+ *                        We should only access orig_pte if this flag set.
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -853,6 +855,7 @@ enum fault_flag {
 	FAULT_FLAG_INSTRUCTION =	1 << 8,
 	FAULT_FLAG_INTERRUPTIBLE =	1 << 9,
 	FAULT_FLAG_UNSHARE =		1 << 10,
+	FAULT_FLAG_ORIG_PTE_VALID =	1 << 11,
 };
 
 #endif /* _LINUX_MM_TYPES_H */
diff --git a/mm/memory.c b/mm/memory.c
index f89034a60e9a..8e7f39c651c3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4301,6 +4301,15 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
 	set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
 }
 
+static bool vmf_pte_changed(struct vm_fault *vmf)
+{
+	if (vmf->flags & FAULT_FLAG_ORIG_PTE_VALID) {
+		return !pte_same(*vmf->pte, vmf->orig_pte);
+	}
+
+	return !pte_none(*vmf->pte);
+}
+
 /**
  * finish_fault - finish page fault once we have prepared the page to fault
  *
@@ -4359,7 +4368,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 				      vmf->address, &vmf->ptl);
 	ret = 0;
 	/* Re-check under ptl */
-	if (likely(pte_none(*vmf->pte)))
+	if (likely(!vmf_pte_changed(vmf)))
 		do_set_pte(vmf, page, vmf->address);
 	else
 		ret = VM_FAULT_NOPAGE;
@@ -4837,6 +4846,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		 * concurrent faults and from rmap lookups.
 		 */
 		vmf->pte = NULL;
+		vmf->flags &= ~FAULT_FLAG_ORIG_PTE_VALID;
 	} else {
 		/*
 		 * If a huge pmd materialized under us just retry later.  Use
@@ -4860,6 +4870,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		 */
 		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
 		vmf->orig_pte = *vmf->pte;
+		vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
 
 		/*
 		 * some architectures can have larger ptes than wordsize,
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 03/23] mm: Check against orig_pte for finish_fault()
  2022-04-13 16:43       ` Peter Xu
@ 2022-04-14  7:51         ` Marek Szyprowski
  2022-04-14 16:30           ` Peter Xu
  0 siblings, 1 reply; 63+ messages in thread
From: Marek Szyprowski @ 2022-04-14  7:51 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Mike Kravetz, Nadav Amit, Matthew Wilcox,
	Mike Rapoport, David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple

Hi Peter,

On 13.04.2022 18:43, Peter Xu wrote:
> On Wed, Apr 13, 2022 at 04:03:28PM +0200, Marek Szyprowski wrote:
>> On 05.04.2022 03:48, Peter Xu wrote:
>>> We used to check against none pte in finish_fault(), with the assumption
>>> that the orig_pte is always none pte.
>>>
>>> This change prepares us to be able to call do_fault() on !none ptes.  For
>>> example, we should allow that to happen for pte marker so that we can restore
>>> information out of the pte markers.
>>>
>>> Let's change the "pte_none" check into detecting changes since we fetched
>>> orig_pte.  One trivial thing to take care of here is, when pmd==NULL for
>>> the pgtable we may not initialize orig_pte at all in handle_pte_fault().
>>>
>>> By default orig_pte will be all zeros however the problem is not all
>>> architectures are using all-zeros for a none pte.  pte_clear() will be the
>>> right thing to use here so that we'll always have a valid orig_pte value
>>> for the whole handle_pte_fault() call.
>>>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>> This patch landed in today's linux next-20220413 as commit fa6009949163
>> ("mm: check against orig_pte for finish_fault()"). Unfortunately it
>> causes serious system instability on some ARM 32bit machines. I've
>> observed it on all tested boards (various Samsung Exynos based,
>> Raspberry Pi 3b and 4b, even QEMU's virt 32bit machine) when kernel was
>> compiled from multi_v7_defconfig.
> Thanks for the report.
>
>> Here is a crash log from QEMU's ARM 32bit virt machine:
>>
>> 8<--- cut here ---
>> Unable to handle kernel paging request at virtual address e093263c
>> [e093263c] *pgd=42083811, *pte=00000000, *ppte=00000000
>> Internal error: Oops: 807 [#1] SMP ARM
>> Modules linked in:
>> CPU: 1 PID: 37 Comm: kworker/u4:0 Not tainted
>> 5.18.0-rc2-00176-gfa6009949163 #11684
>> Hardware name: Generic DT based system
>> PC is at cpu_ca15_set_pte_ext+0x4c/0x58
>> LR is at handle_mm_fault+0x46c/0xbb0
> I had a feeling that for some reason the pte_clear() isn't working right
> there when it's applied to a kernel stack variable on arm32.  I'm a total
> newbie to arm32, so what I'm reading is this:
>
> https://people.kernel.org/linusw/arm32-page-tables
>
> Especially:
>
> https://dflund.se/~triad/images/classic-mmu-page-table.jpg
>
> It matches what I read in arm32's proc-v7-2level.S, in the comment above
> cpu_v7_set_pte_ext:
>
>    * - ptep  - pointer to level 2 translation table entry
>    *    (hardware version is stored at +2048 bytes)        <----------
>
> So it seems to me that arm32 stores some metadata at offset 0x800 from
> any pte_t* pointer passed to pte_clear(), so the pointer must point into a
> real pgtable or it'll write to random places in the kernel, am I correct?
>
> Does it mean that all pte_*() operations upon a kernel stack var will be
> wrong?  I thought it could happen easily in the rest of mm too, but I
> haven't checked much yet.  The facts suggest that the current code most
> likely just happens to work well with arm32 and no such violation has
> occurred yet.
>
> That does sound a bit tricky, IMHO.  But I don't have an immediate solution
> to make it less tricky.. though I have an idea for a workaround: simply
> not calling pte_clear() on the stack var.
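>
> [To illustrate the hazard described above: a minimal sketch, assuming the
> usual "mm"/"addr" fault-path arguments; this is not actual kernel code:]
>
>	/*
>	 * On 2-level arm32 the Linux pte and its hardware shadow copy
>	 * live in the same page, 2048 bytes (0x800) apart, and the
>	 * cpu_*_set_pte_ext helpers write both.  So clearing a pte_t
>	 * that lives on the kernel stack also writes 0x800 bytes past
>	 * the variable, corrupting whatever happens to be there:
>	 */
>	pte_t orig_pte;				/* on the kernel stack  */
>	pte_clear(mm, addr, &orig_pte);		/* fine on x86_64, but  */
>						/* on arm32 also stores */
>						/* at &orig_pte + 0x800 */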
>
> Would you try the attached patch as a replacement for this problematic
> patch?  So we need to revert commit fa6009949163 and apply the new one.
> Please let me know whether it solves the problem; so far I've only
> compile-tested it, but I'll run some more tests to make sure the uffd-wp
> scenarios keep working right with the new version.

I've reverted fa6009949163 and applied the attached patch on top of
linux next-20220414. The ARM 32bit issues went away. :)

Feel free to add:

Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>

Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 03/23] mm: Check against orig_pte for finish_fault()
  2022-04-14  7:51         ` Marek Szyprowski
@ 2022-04-14 16:30           ` Peter Xu
  2022-04-14 20:57             ` Andrew Morton
  0 siblings, 1 reply; 63+ messages in thread
From: Peter Xu @ 2022-04-14 16:30 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: linux-kernel, linux-mm, Mike Kravetz, Nadav Amit, Matthew Wilcox,
	Mike Rapoport, David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple

On Thu, Apr 14, 2022 at 09:51:01AM +0200, Marek Szyprowski wrote:
> Hi Peter,
> 
> On 13.04.2022 18:43, Peter Xu wrote:
> > On Wed, Apr 13, 2022 at 04:03:28PM +0200, Marek Szyprowski wrote:
> >> On 05.04.2022 03:48, Peter Xu wrote:
> >>> We used to check against none pte in finish_fault(), with the assumption
> >>> that the orig_pte is always none pte.
> >>>
> >>> This change prepares us to be able to call do_fault() on !none ptes.  For
> >>> example, we should allow that to happen for pte marker so that we can restore
> >>> information out of the pte markers.
> >>>
> >>> Let's change the "pte_none" check into detecting changes since we fetched
> >>> orig_pte.  One trivial thing to take care of here is that, when pmd==NULL
> >>> for the pgtable, we may not initialize orig_pte at all in handle_pte_fault().
> >>>
> >>> By default orig_pte will be all zeros; however, the problem is that not all
> >>> architectures use all-zeros for a none pte.  pte_clear() will be the right
> >>> thing to use here so that we'll always have a valid orig_pte value for the
> >>> whole handle_pte_fault() call.
> >>>
> >>> Signed-off-by: Peter Xu <peterx@redhat.com>
> >> This patch landed in today's linux next-20220413 as commit fa6009949163
> >> ("mm: check against orig_pte for finish_fault()"). Unfortunately it
> >> causes serious system instability on some ARM 32bit machines. I've
> >> observed it on all tested boards (various Samsung Exynos based,
> >> Raspberry Pi 3b and 4b, even QEMU's virt 32bit machine) when the kernel
> >> was compiled from multi_v7_defconfig.
> > Thanks for the report.
> >
> >> Here is a crash log from QEMU's ARM 32bit virt machine:
> >>
> >> 8<--- cut here ---
> >> Unable to handle kernel paging request at virtual address e093263c
> >> [e093263c] *pgd=42083811, *pte=00000000, *ppte=00000000
> >> Internal error: Oops: 807 [#1] SMP ARM
> >> Modules linked in:
> >> CPU: 1 PID: 37 Comm: kworker/u4:0 Not tainted
> >> 5.18.0-rc2-00176-gfa6009949163 #11684
> >> Hardware name: Generic DT based system
> >> PC is at cpu_ca15_set_pte_ext+0x4c/0x58
> >> LR is at handle_mm_fault+0x46c/0xbb0
> > I had a feeling that for some reason the pte_clear() isn't working right
> > there when it's applied to a kernel stack variable on arm32.  I'm a total
> > newbie to arm32, so what I'm reading is this:
> >
> > https://people.kernel.org/linusw/arm32-page-tables
> >
> > Especially:
> >
> > https://dflund.se/~triad/images/classic-mmu-page-table.jpg
> >
> > It matches what I read from arm32's proc-v7-2level.S, where the comment
> > above cpu_v7_set_pte_ext says:
> >
> >    * - ptep  - pointer to level 2 translation table entry
> >    *    (hardware version is stored at +2048 bytes)        <----------
> >
> > So it seems to me that arm32 stores some metadata at offset 0x800 from
> > any pte_t* pointer passed to pte_clear(), so the pointer must point into a
> > real pgtable or it'll write to random places in the kernel, am I correct?
> >
> > Does it mean that all pte_*() operations upon a kernel stack var will be
> > wrong?  I thought it could happen easily in the rest of mm too, but I
> > haven't checked much yet.  The facts suggest that the current code most
> > likely just happens to work well with arm32 and no such violation has
> > occurred yet.
> >
> > That does sound a bit tricky, IMHO.  But I don't have an immediate solution
> > to make it less tricky.. though I have an idea for a workaround: simply
> > not calling pte_clear() on the stack var.
> >
> > Would you try the attached patch as a replacement for this problematic
> > patch?  So we need to revert commit fa6009949163 and apply the new one.
> > Please let me know whether it solves the problem; so far I've only
> > compile-tested it, but I'll run some more tests to make sure the uffd-wp
> > scenarios keep working right with the new version.
> 
> I've reverted fa6009949163 and applied the attached patch on top of
> linux next-20220414. The ARM 32bit issues went away. :)
> 
> Feel free to add:
> 
> Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
> 
> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>

Thanks, Marek, for the fast feedback!

I've also verified it for the uffd-wp case, so the whole series keeps
running as usual and nothing else shows up after the new patch replaced
the old one.

Andrew, any suggestion on how we proceed with the replacement patch?
E.g. do you want me to post it separately to the list?

Please let me know your preference, thanks.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 03/23] mm: Check against orig_pte for finish_fault()
  2022-04-14 16:30           ` Peter Xu
@ 2022-04-14 20:57             ` Andrew Morton
  2022-04-14 21:08               ` Peter Xu
  0 siblings, 1 reply; 63+ messages in thread
From: Andrew Morton @ 2022-04-14 20:57 UTC (permalink / raw)
  To: Peter Xu
  Cc: Marek Szyprowski, linux-kernel, linux-mm, Mike Kravetz,
	Nadav Amit, Matthew Wilcox, Mike Rapoport, David Hildenbrand,
	Hugh Dickins, Jerome Glisse, Kirill A . Shutemov,
	Andrea Arcangeli, Axel Rasmussen, Alistair Popple

On Thu, 14 Apr 2022 12:30:06 -0400 Peter Xu <peterx@redhat.com> wrote:

> > Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
> > 
> > Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
> 
> Thanks, Marek, for the fast feedback!

Certainly.

> I've also verified it for the uffd-wp case, so the whole series keeps
> running as usual and nothing else shows up after the new patch replaced
> the old one.
> 
> Andrew, any suggestion on how we proceed with the replacement patch?
> E.g. do you want me to post it separately to the list?

I turned it into an incremental diff and queued it against [03/23]:

--- a/include/linux/mm_types.h~mm-check-against-orig_pte-for-finish_fault-fix
+++ a/include/linux/mm_types.h
@@ -814,6 +814,8 @@ typedef struct {
  * @FAULT_FLAG_UNSHARE: The fault is an unsharing request to unshare (and mark
  *                      exclusive) a possibly shared anonymous page that is
  *                      mapped R/O.
+ * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
+ *                        We should only access orig_pte if this flag set.
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -850,6 +852,7 @@ enum fault_flag {
 	FAULT_FLAG_INSTRUCTION =	1 << 8,
 	FAULT_FLAG_INTERRUPTIBLE =	1 << 9,
 	FAULT_FLAG_UNSHARE =		1 << 10,
+	FAULT_FLAG_ORIG_PTE_VALID =	1 << 11,
 };
 
 #endif /* _LINUX_MM_TYPES_H */
--- a/mm/memory.c~mm-check-against-orig_pte-for-finish_fault-fix
+++ a/mm/memory.c
@@ -4194,6 +4194,15 @@ void do_set_pte(struct vm_fault *vmf, st
 	set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
 }
 
+static bool vmf_pte_changed(struct vm_fault *vmf)
+{
+	if (vmf->flags & FAULT_FLAG_ORIG_PTE_VALID) {
+		return !pte_same(*vmf->pte, vmf->orig_pte);
+	}
+
+	return !pte_none(*vmf->pte);
+}
+
 /**
  * finish_fault - finish page fault once we have prepared the page to fault
  *
@@ -4252,7 +4261,7 @@ vm_fault_t finish_fault(struct vm_fault
 				      vmf->address, &vmf->ptl);
 	ret = 0;
 	/* Re-check under ptl */
-	if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
+	if (likely(!vmf_pte_changed(vmf)))
 		do_set_pte(vmf, page, vmf->address);
 	else
 		ret = VM_FAULT_NOPAGE;
@@ -4720,13 +4729,7 @@ static vm_fault_t handle_pte_fault(struc
 		 * concurrent faults and from rmap lookups.
 		 */
 		vmf->pte = NULL;
-		/*
-		 * Always initialize orig_pte.  This matches with below
-		 * code to have orig_pte to be the none pte if pte==NULL.
-		 * This makes the rest code to be always safe to reference
-		 * it, e.g. in finish_fault() we'll detect pte changes.
-		 */
-		pte_clear(vmf->vma->vm_mm, vmf->address, &vmf->orig_pte);
+		vmf->flags &= ~FAULT_FLAG_ORIG_PTE_VALID;
 	} else {
 		/*
 		 * If a huge pmd materialized under us just retry later.  Use
@@ -4750,6 +4753,7 @@ static vm_fault_t handle_pte_fault(struc
 		 */
 		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
 		vmf->orig_pte = *vmf->pte;
+		vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
 
 		/*
 		 * some architectures can have larger ptes than wordsize,
_
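
[A condensed restatement of the incremental diff above, as comments only --
illustrative, not additional kernel code:]

	/*
	 * handle_pte_fault():
	 *   - no pte table yet:  vmf->pte = NULL and the new
	 *     FAULT_FLAG_ORIG_PTE_VALID flag is cleared, so orig_pte is
	 *     simply never consulted (no pte_clear() on a stack pte_t,
	 *     which is what broke arm32);
	 *   - pte table present: vmf->orig_pte = *vmf->pte and the flag
	 *     is set.
	 *
	 * finish_fault(), re-checking under the ptl:
	 *   - flag set:   changed = !pte_same(*vmf->pte, vmf->orig_pte);
	 *   - flag clear: changed = !pte_none(*vmf->pte);
	 */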



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 03/23] mm: Check against orig_pte for finish_fault()
  2022-04-14 20:57             ` Andrew Morton
@ 2022-04-14 21:08               ` Peter Xu
  0 siblings, 0 replies; 63+ messages in thread
From: Peter Xu @ 2022-04-14 21:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Marek Szyprowski, linux-kernel, linux-mm, Mike Kravetz,
	Nadav Amit, Matthew Wilcox, Mike Rapoport, David Hildenbrand,
	Hugh Dickins, Jerome Glisse, Kirill A . Shutemov,
	Andrea Arcangeli, Axel Rasmussen, Alistair Popple

On Thu, Apr 14, 2022 at 01:57:40PM -0700, Andrew Morton wrote:
> On Thu, 14 Apr 2022 12:30:06 -0400 Peter Xu <peterx@redhat.com> wrote:
> 
> > > Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
> > > 
> > > Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
> > 
> > Thanks, Marek, for the fast feedback!
> 
> Certainly.
> 
> > I've also verified it for the uffd-wp case, so the whole series keeps
> > running as usual and nothing else shows up after the new patch replaced
> > the old one.
> > 
> > Andrew, any suggestion on how we proceed with the replacement patch?
> > E.g. do you want me to post it separately to the list?
> 
> I turned it into an incremental diff and queued it against [03/23]:
> 
> --- a/include/linux/mm_types.h~mm-check-against-orig_pte-for-finish_fault-fix
> +++ a/include/linux/mm_types.h
> @@ -814,6 +814,8 @@ typedef struct {
>   * @FAULT_FLAG_UNSHARE: The fault is an unsharing request to unshare (and mark
>   *                      exclusive) a possibly shared anonymous page that is
>   *                      mapped R/O.
> + * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
> + *                        We should only access orig_pte if this flag set.
>   *
>   * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
>   * whether we would allow page faults to retry by specifying these two
> @@ -850,6 +852,7 @@ enum fault_flag {
>  	FAULT_FLAG_INSTRUCTION =	1 << 8,
>  	FAULT_FLAG_INTERRUPTIBLE =	1 << 9,
>  	FAULT_FLAG_UNSHARE =		1 << 10,
> +	FAULT_FLAG_ORIG_PTE_VALID =	1 << 11,
>  };
>  
>  #endif /* _LINUX_MM_TYPES_H */
> --- a/mm/memory.c~mm-check-against-orig_pte-for-finish_fault-fix
> +++ a/mm/memory.c
> @@ -4194,6 +4194,15 @@ void do_set_pte(struct vm_fault *vmf, st
>  	set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
>  }
>  
> +static bool vmf_pte_changed(struct vm_fault *vmf)
> +{
> +	if (vmf->flags & FAULT_FLAG_ORIG_PTE_VALID) {
> +		return !pte_same(*vmf->pte, vmf->orig_pte);
> +	}
> +
> +	return !pte_none(*vmf->pte);
> +}
> +
>  /**
>   * finish_fault - finish page fault once we have prepared the page to fault
>   *
> @@ -4252,7 +4261,7 @@ vm_fault_t finish_fault(struct vm_fault
>  				      vmf->address, &vmf->ptl);
>  	ret = 0;
>  	/* Re-check under ptl */
> -	if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
> +	if (likely(!vmf_pte_changed(vmf)))
>  		do_set_pte(vmf, page, vmf->address);
>  	else
>  		ret = VM_FAULT_NOPAGE;
> @@ -4720,13 +4729,7 @@ static vm_fault_t handle_pte_fault(struc
>  		 * concurrent faults and from rmap lookups.
>  		 */
>  		vmf->pte = NULL;
> -		/*
> -		 * Always initialize orig_pte.  This matches with below
> -		 * code to have orig_pte to be the none pte if pte==NULL.
> -		 * This makes the rest code to be always safe to reference
> -		 * it, e.g. in finish_fault() we'll detect pte changes.
> -		 */
> -		pte_clear(vmf->vma->vm_mm, vmf->address, &vmf->orig_pte);
> +		vmf->flags &= ~FAULT_FLAG_ORIG_PTE_VALID;
>  	} else {
>  		/*
>  		 * If a huge pmd materialized under us just retry later.  Use
> @@ -4750,6 +4753,7 @@ static vm_fault_t handle_pte_fault(struc
>  		 */
>  		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
>  		vmf->orig_pte = *vmf->pte;
> +		vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
>  
>  		/*
>  		 * some architectures can have larger ptes than wordsize,
> _
> 

I verified the diff; it matches what I got.  Thanks, Andrew.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 03/23] mm: Check against orig_pte for finish_fault()
  2022-04-05  1:48 ` [PATCH v8 03/23] mm: Check against orig_pte for finish_fault() Peter Xu
  2022-04-12  2:05   ` Alistair Popple
       [not found]   ` <CGME20220413140330eucas1p167da41e079712b829ef8237dc27b049c@eucas1p1.samsung.com>
@ 2022-04-15 14:21   ` Guenter Roeck
  2022-04-15 14:41     ` Peter Xu
  2 siblings, 1 reply; 63+ messages in thread
From: Guenter Roeck @ 2022-04-15 14:21 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Mike Kravetz, Nadav Amit, Matthew Wilcox,
	Mike Rapoport, David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple

Hi,

On Mon, Apr 04, 2022 at 09:48:36PM -0400, Peter Xu wrote:
> We used to check against none pte in finish_fault(), with the assumption
> that the orig_pte is always none pte.
> 
> This change prepares us to be able to call do_fault() on !none ptes.  For
> example, we should allow that to happen for pte marker so that we can restore
> information out of the pte markers.
> 
> Let's change the "pte_none" check into detecting changes since we fetched
> orig_pte.  One trivial thing to take care of here is that, when pmd==NULL
> for the pgtable, we may not initialize orig_pte at all in handle_pte_fault().
>
> By default orig_pte will be all zeros; however, the problem is that not all
> architectures use all-zeros for a none pte.  pte_clear() will be the right
> thing to use here so that we'll always have a valid orig_pte value for the
> whole handle_pte_fault() call.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

This patch crashes pretty much all arm images in linux-next. Reverting it
fixes the problem. Sample crash log and bisect results attached.

Guenter

---
[   11.232343] 8<--- cut here ---
[   11.232564] Unable to handle kernel paging request at virtual address 88016664
[   11.232735] [88016664] *pgd=41cfd811, *pte=00000000, *ppte=00000000
[   11.233128] Internal error: Oops: 807 [#1] ARM
[   11.233385] CPU: 0 PID: 1 Comm: swapper Not tainted 5.18.0-rc2-next-20220414 #1
[   11.233564] Hardware name: Generic DT based system
[   11.233695] PC is at cpu_arm926_set_pte_ext+0x2c/0x40
[   11.233863] LR is at handle_mm_fault+0x4b0/0x11a8
[   11.233963] pc : [<8010e60c>]    lr : [<802944ec>]    psr: 00000113
[   11.234080] sp : 88015e20  ip : 88015e7c  fp : 00000492
[   11.234179] r10: 00000000  r9 : 00000000  r8 : 81167e50
[   11.234280] r7 : 00000000  r6 : 00000081  r5 : 7efffff1  r4 : 83034690
[   11.234402] r3 : 00000043  r2 : 00000000  r1 : 00000000  r0 : 88016664
[   11.234549] Flags: nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
[   11.234691] Control: 00093177  Table: 40004000  DAC: 00000053
[   11.234816] Register r0 information: non-paged memory
[   11.235031] Register r1 information: NULL pointer
[   11.235127] Register r2 information: NULL pointer
[   11.235219] Register r3 information: non-paged memory
[   11.235316] Register r4 information: slab vm_area_struct start 83034688 data offset 8 pointer offset 0 allocated at vm_area_alloc+0x20/0x5c
[   11.235825]     kmem_cache_alloc+0x1fc/0x21c
[   11.235926]     vm_area_alloc+0x20/0x5c
[   11.236007]     alloc_bprm+0xd0/0x298
[   11.236082]     kernel_execve+0x34/0x194
[   11.236159]     kernel_init+0x6c/0x138
[   11.236235]     ret_from_fork+0x14/0x3c
[   11.236330] Register r5 information: non-paged memory
[   11.236432] Register r6 information: non-paged memory
[   11.236529] Register r7 information: NULL pointer
[   11.236620] Register r8 information: non-slab/vmalloc memory
[   11.236741] Register r9 information: NULL pointer
[   11.236833] Register r10 information: NULL pointer
[   11.236926] Register r11 information: non-paged memory
[   11.237023] Register r12 information: 2-page vmalloc region starting at 0x88014000 allocated at kernel_clone+0xa0/0x440
[   11.237253] Process swapper (pid: 1, stack limit = 0x88014000)
[   11.237388] Stack: (0x88015e20 to 0x88016000)
[   11.237518] 5e20: ffffffff fffffffe 81d29be0 00000000 a0000193 00000000 81d2a1e8 00007f7e
[   11.237670] 5e40: 816580a8 83034690 00000cc0 0007efff 7efff000 7efffff1 00000081 83199fb8
[   11.237814] 5e60: 83199fb8 00000000 00000000 00000000 00000000 00000000 00000000 0a363e34
[   11.237957] 5e80: 88015ea4 83034690 7efffff1 00002017 00000081 81f4dd00 00001fb8 00000000
[   11.238100] 5ea0: 00000492 8028d160 00000000 81d29be0 00000001 00002017 80deedcc 81d29be0
[   11.238241] 5ec0: 00000000 81f4dd00 7efffff1 88015f38 81f4dd60 00002017 00000000 8028d64c
[   11.238383] 5ee0: 88015f38 00000000 00000000 7efffff1 81f4dd00 00000000 00000001 00000000
[   11.238524] 5f00: 00000011 82d80800 00000001 7efffff1 81f4dd00 00000011 7efffff1 0000000b
[   11.238666] 5f20: 82d80800 802ca218 88015f38 00000000 00000000 000001d3 80e0b43c 0a363e34
[   11.238808] 5f40: 00000ffc 82d80800 81d73140 81d29be0 0000000b 802cb390 81d7315b 802ca0bc
[   11.238950] 5f60: 8110c940 0000000c 82d80800 81d73140 8110c8b0 8110c93c 00000000 00000000
[   11.239091] 5f80: 00000000 802cbf44 81107820 8110c8b0 00000000 00000000 00000000 80b05400
[   11.239234] 5fa0: 00000000 80b05394 00000000 801000f8 00000000 00000000 00000000 00000000
[   11.239376] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[   11.239518] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
[   11.239770] Code: e31300c0 03822e55 e3130003 13a02000 (e5802000)
[   11.240097] ---[ end trace 0000000000000000 ]---
[   11.240307] Kernel panic - not syncing: Fatal exception

--
# bad: [40354149f4d738dc3492d9998e45b3f02950369a] Add linux-next specific files for 20220414
# good: [ce522ba9ef7e2d9fb22a39eb3371c0c64e2a433e] Linux 5.18-rc2
git bisect start 'HEAD' 'v5.18-rc2'
# good: [0f52e407eccb0f7ed62fdd8907b0042f4195159e] Merge branch 'drm-next' of git://git.freedesktop.org/git/drm/drm.git
git bisect good 0f52e407eccb0f7ed62fdd8907b0042f4195159e
# good: [22b1b3a579c91a6afa945711eac72ab740b8f8e4] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
git bisect good 22b1b3a579c91a6afa945711eac72ab740b8f8e4
# good: [cbb5c08b3182cb498f67fa547392191a1d5622dd] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine.git
git bisect good cbb5c08b3182cb498f67fa547392191a1d5622dd
# good: [2acd94b759428825f0e8835fa24ad22c7b5c0e2c] Merge branch 'for-next/kspp' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git
git bisect good 2acd94b759428825f0e8835fa24ad22c7b5c0e2c
# bad: [d2d293faec99124d95590e88030ae3c8382fac7f] mm/shmem: persist uffd-wp bit across zapping for file-backed
git bisect bad d2d293faec99124d95590e88030ae3c8382fac7f
# good: [8cbcc910aec560e78e879cf82ed17e7e72d8a7d4] doc: update documentation for swap_activate and swap_rw
git bisect good 8cbcc910aec560e78e879cf82ed17e7e72d8a7d4
# good: [8c55a1ed1f9b95520b0307ba0ac6ff7f1aadfe9d] mm/page_alloc: simplify update of pgdat in wake_all_kswapds
git bisect good 8c55a1ed1f9b95520b0307ba0ac6ff7f1aadfe9d
# good: [3e68e467590511e2cf7f47194464a5512583f641] mm: hugetlb_vmemmap: cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP*
git bisect good 3e68e467590511e2cf7f47194464a5512583f641
# good: [3fb21f4e38824f4d8a183ffcccc03b357ad836d4] mm: mmap: register suitable readonly file vmas for khugepaged
git bisect good 3fb21f4e38824f4d8a183ffcccc03b357ad836d4
# bad: [fa600994916318341cf53e18769be547aa5975d2] mm: check against orig_pte for finish_fault()
git bisect bad fa600994916318341cf53e18769be547aa5975d2
# good: [1112411b72b5e9774897538260028a677d616779] fixup! mm: Introduce PTE_MARKER swap entry
git bisect good 1112411b72b5e9774897538260028a677d616779
# good: [1ae034d98f81a6cf8896b37c3dee9e099daeb3e7] mm: teach core mm about pte markers
git bisect good 1ae034d98f81a6cf8896b37c3dee9e099daeb3e7
# first bad commit: [fa600994916318341cf53e18769be547aa5975d2] mm: check against orig_pte for finish_fault()


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 03/23] mm: Check against orig_pte for finish_fault()
  2022-04-15 14:21   ` Guenter Roeck
@ 2022-04-15 14:41     ` Peter Xu
  0 siblings, 0 replies; 63+ messages in thread
From: Peter Xu @ 2022-04-15 14:41 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: linux-kernel, linux-mm, Mike Kravetz, Nadav Amit, Matthew Wilcox,
	Mike Rapoport, David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple

On Fri, Apr 15, 2022 at 07:21:12AM -0700, Guenter Roeck wrote:
> Hi,

Hi, Guenter,

> 
> On Mon, Apr 04, 2022 at 09:48:36PM -0400, Peter Xu wrote:
> > We used to check against none pte in finish_fault(), with the assumption
> > that the orig_pte is always none pte.
> > 
> > This change prepares us to be able to call do_fault() on !none ptes.  For
> > example, we should allow that to happen for pte marker so that we can restore
> > information out of the pte markers.
> > 
> > Let's change the "pte_none" check into detecting changes since we fetched
> > orig_pte.  One trivial thing to take care of here is that, when pmd==NULL
> > for the pgtable, we may not initialize orig_pte at all in handle_pte_fault().
> >
> > By default orig_pte will be all zeros; however, the problem is that not all
> > architectures use all-zeros for a none pte.  pte_clear() will be the right
> > thing to use here so that we'll always have a valid orig_pte value for the
> > whole handle_pte_fault() call.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> This patch crashes pretty much all arm images in linux-next. Reverting it
> fixes the problem. Sample crash log and bisect results attached.

Sorry for the issue, and thanks for reporting and bisecting.

It was already reported by Marek, and this problematic patch will be
replaced by this one (already updated in -mm, but it may land in -next
later, I think):

https://lore.kernel.org/all/Ylb9rXJyPm8%2Fao8f@xz-m1.local/

Thanks,

> 
> Guenter
> 
> ---
> [   11.232343] 8<--- cut here ---
> [   11.232564] Unable to handle kernel paging request at virtual address 88016664
> [   11.232735] [88016664] *pgd=41cfd811, *pte=00000000, *ppte=00000000
> [   11.233128] Internal error: Oops: 807 [#1] ARM
> [   11.233385] CPU: 0 PID: 1 Comm: swapper Not tainted 5.18.0-rc2-next-20220414 #1
> [   11.233564] Hardware name: Generic DT based system
> [   11.233695] PC is at cpu_arm926_set_pte_ext+0x2c/0x40
> [   11.233863] LR is at handle_mm_fault+0x4b0/0x11a8
> [   11.233963] pc : [<8010e60c>]    lr : [<802944ec>]    psr: 00000113
> [   11.234080] sp : 88015e20  ip : 88015e7c  fp : 00000492
> [   11.234179] r10: 00000000  r9 : 00000000  r8 : 81167e50
> [   11.234280] r7 : 00000000  r6 : 00000081  r5 : 7efffff1  r4 : 83034690
> [   11.234402] r3 : 00000043  r2 : 00000000  r1 : 00000000  r0 : 88016664
> [   11.234549] Flags: nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
> [   11.234691] Control: 00093177  Table: 40004000  DAC: 00000053
> [   11.234816] Register r0 information: non-paged memory
> [   11.235031] Register r1 information: NULL pointer
> [   11.235127] Register r2 information: NULL pointer
> [   11.235219] Register r3 information: non-paged memory
> [   11.235316] Register r4 information: slab vm_area_struct start 83034688 data offset 8 pointer offset 0 allocated at vm_area_alloc+0x20/0x5c
> [   11.235825]     kmem_cache_alloc+0x1fc/0x21c
> [   11.235926]     vm_area_alloc+0x20/0x5c
> [   11.236007]     alloc_bprm+0xd0/0x298
> [   11.236082]     kernel_execve+0x34/0x194
> [   11.236159]     kernel_init+0x6c/0x138
> [   11.236235]     ret_from_fork+0x14/0x3c
> [   11.236330] Register r5 information: non-paged memory
> [   11.236432] Register r6 information: non-paged memory
> [   11.236529] Register r7 information: NULL pointer
> [   11.236620] Register r8 information: non-slab/vmalloc memory
> [   11.236741] Register r9 information: NULL pointer
> [   11.236833] Register r10 information: NULL pointer
> [   11.236926] Register r11 information: non-paged memory
> [   11.237023] Register r12 information: 2-page vmalloc region starting at 0x88014000 allocated at kernel_clone+0xa0/0x440
> [   11.237253] Process swapper (pid: 1, stack limit = 0x88014000)
> [   11.237388] Stack: (0x88015e20 to 0x88016000)
> [   11.237518] 5e20: ffffffff fffffffe 81d29be0 00000000 a0000193 00000000 81d2a1e8 00007f7e
> [   11.237670] 5e40: 816580a8 83034690 00000cc0 0007efff 7efff000 7efffff1 00000081 83199fb8
> [   11.237814] 5e60: 83199fb8 00000000 00000000 00000000 00000000 00000000 00000000 0a363e34
> [   11.237957] 5e80: 88015ea4 83034690 7efffff1 00002017 00000081 81f4dd00 00001fb8 00000000
> [   11.238100] 5ea0: 00000492 8028d160 00000000 81d29be0 00000001 00002017 80deedcc 81d29be0
> [   11.238241] 5ec0: 00000000 81f4dd00 7efffff1 88015f38 81f4dd60 00002017 00000000 8028d64c
> [   11.238383] 5ee0: 88015f38 00000000 00000000 7efffff1 81f4dd00 00000000 00000001 00000000
> [   11.238524] 5f00: 00000011 82d80800 00000001 7efffff1 81f4dd00 00000011 7efffff1 0000000b
> [   11.238666] 5f20: 82d80800 802ca218 88015f38 00000000 00000000 000001d3 80e0b43c 0a363e34
> [   11.238808] 5f40: 00000ffc 82d80800 81d73140 81d29be0 0000000b 802cb390 81d7315b 802ca0bc
> [   11.238950] 5f60: 8110c940 0000000c 82d80800 81d73140 8110c8b0 8110c93c 00000000 00000000
> [   11.239091] 5f80: 00000000 802cbf44 81107820 8110c8b0 00000000 00000000 00000000 80b05400
> [   11.239234] 5fa0: 00000000 80b05394 00000000 801000f8 00000000 00000000 00000000 00000000
> [   11.239376] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> [   11.239518] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
> [   11.239770] Code: e31300c0 03822e55 e3130003 13a02000 (e5802000)
> [   11.240097] ---[ end trace 0000000000000000 ]---
> [   11.240307] Kernel panic - not syncing: Fatal exception
> 
> --
> # bad: [40354149f4d738dc3492d9998e45b3f02950369a] Add linux-next specific files for 20220414
> # good: [ce522ba9ef7e2d9fb22a39eb3371c0c64e2a433e] Linux 5.18-rc2
> git bisect start 'HEAD' 'v5.18-rc2'
> # good: [0f52e407eccb0f7ed62fdd8907b0042f4195159e] Merge branch 'drm-next' of git://git.freedesktop.org/git/drm/drm.git
> git bisect good 0f52e407eccb0f7ed62fdd8907b0042f4195159e
> # good: [22b1b3a579c91a6afa945711eac72ab740b8f8e4] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
> git bisect good 22b1b3a579c91a6afa945711eac72ab740b8f8e4
> # good: [cbb5c08b3182cb498f67fa547392191a1d5622dd] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine.git
> git bisect good cbb5c08b3182cb498f67fa547392191a1d5622dd
> # good: [2acd94b759428825f0e8835fa24ad22c7b5c0e2c] Merge branch 'for-next/kspp' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git
> git bisect good 2acd94b759428825f0e8835fa24ad22c7b5c0e2c
> # bad: [d2d293faec99124d95590e88030ae3c8382fac7f] mm/shmem: persist uffd-wp bit across zapping for file-backed
> git bisect bad d2d293faec99124d95590e88030ae3c8382fac7f
> # good: [8cbcc910aec560e78e879cf82ed17e7e72d8a7d4] doc: update documentation for swap_activate and swap_rw
> git bisect good 8cbcc910aec560e78e879cf82ed17e7e72d8a7d4
> # good: [8c55a1ed1f9b95520b0307ba0ac6ff7f1aadfe9d] mm/page_alloc: simplify update of pgdat in wake_all_kswapds
> git bisect good 8c55a1ed1f9b95520b0307ba0ac6ff7f1aadfe9d
> # good: [3e68e467590511e2cf7f47194464a5512583f641] mm: hugetlb_vmemmap: cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP*
> git bisect good 3e68e467590511e2cf7f47194464a5512583f641
> # good: [3fb21f4e38824f4d8a183ffcccc03b357ad836d4] mm: mmap: register suitable readonly file vmas for khugepaged
> git bisect good 3fb21f4e38824f4d8a183ffcccc03b357ad836d4
> # bad: [fa600994916318341cf53e18769be547aa5975d2] mm: check against orig_pte for finish_fault()
> git bisect bad fa600994916318341cf53e18769be547aa5975d2
> # good: [1112411b72b5e9774897538260028a677d616779] fixup! mm: Introduce PTE_MARKER swap entry
> git bisect good 1112411b72b5e9774897538260028a677d616779
> # good: [1ae034d98f81a6cf8896b37c3dee9e099daeb3e7] mm: teach core mm about pte markers
> git bisect good 1ae034d98f81a6cf8896b37c3dee9e099daeb3e7
> # first bad commit: [fa600994916318341cf53e18769be547aa5975d2] mm: check against orig_pte for finish_fault()
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 01/23] mm: Introduce PTE_MARKER swap entry
  2022-04-05  1:46 ` [PATCH v8 01/23] mm: Introduce PTE_MARKER swap entry Peter Xu
  2022-04-12  1:07   ` Alistair Popple
@ 2022-04-19  8:25   ` Alistair Popple
  2022-04-19 19:44     ` Peter Xu
  1 sibling, 1 reply; 63+ messages in thread
From: Alistair Popple @ 2022-04-19  8:25 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Mike Kravetz, Andrew Morton,
	David Hildenbrand, Matthew Wilcox, Nadav Amit, Axel Rasmussen,
	Andrea Arcangeli, Kirill A . Shutemov, Hugh Dickins,
	Jerome Glisse, Mike Rapoport

[-- Attachment #1: Type: text/plain, Size: 6511 bytes --]

Peter Xu <peterx@redhat.com> writes:

> This patch introduces a new swap entry type called PTE_MARKER.  It can be
> installed for any pte that maps a file-backed memory when the pte is
> temporarily zapped, so as to maintain per-pte information.

Hi Peter,

Is there something I have missed that means PTE markers can only be used with
file-backed memory? Obviously that's all you care about for this patch series,
but if we needed to mark some anonymous PTE for special processing, is there
anything that would prevent us from using a PTE marker? Specifically I was thinking
about it in relation to this series:
<https://lore.kernel.org/linux-mm/87pmldjxiq.fsf@nvdebian.thelocal/>

> The information that is kept in the pte is called a "marker".  Here we
> define the marker as "unsigned long" just to match pgoff_t; however, it will
> only work if it still fits in swp_offset(), which is currently e.g. 58 bits
> on x86_64.
>
> A new config CONFIG_PTE_MARKER is introduced too; it's off by default.  A
> bunch of helpers are defined alongside it to service the rest of the pte
> marker code.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/asm-generic/hugetlb.h |  9 ++++
>  include/linux/swap.h          | 15 ++++++-
>  include/linux/swapops.h       | 78 +++++++++++++++++++++++++++++++++++
>  mm/Kconfig                    |  6 +++
>  4 files changed, 107 insertions(+), 1 deletion(-)
>
> diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
> index 8e1e6244a89d..f39cad20ffc6 100644
> --- a/include/asm-generic/hugetlb.h
> +++ b/include/asm-generic/hugetlb.h
> @@ -2,6 +2,9 @@
>  #ifndef _ASM_GENERIC_HUGETLB_H
>  #define _ASM_GENERIC_HUGETLB_H
>
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
> +
>  static inline pte_t mk_huge_pte(struct page *page, pgprot_t pgprot)
>  {
>  	return mk_pte(page, pgprot);
> @@ -80,6 +83,12 @@ static inline int huge_pte_none(pte_t pte)
>  }
>  #endif
>
> +/* Please refer to comments above pte_none_mostly() for the usage */
> +static inline int huge_pte_none_mostly(pte_t pte)
> +{
> +	return huge_pte_none(pte) || is_pte_marker(pte);
> +}
> +
>  #ifndef __HAVE_ARCH_HUGE_PTE_WRPROTECT
>  static inline pte_t huge_pte_wrprotect(pte_t pte)
>  {
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 7daae5a4b3e1..5553189d0215 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -55,6 +55,19 @@ static inline int current_is_kswapd(void)
>   * actions on faults.
>   */
>
> +/*
> + * PTE markers are used to persist information onto PTEs that are mapped with
> + * file-backed memories.  As its name "PTE" hints, it should only be applied to
> + * the leaves of pgtables.
> + */
> +#ifdef CONFIG_PTE_MARKER
> +#define SWP_PTE_MARKER_NUM 1
> +#define SWP_PTE_MARKER     (MAX_SWAPFILES + SWP_HWPOISON_NUM + \
> +			    SWP_MIGRATION_NUM + SWP_DEVICE_NUM)
> +#else
> +#define SWP_PTE_MARKER_NUM 0
> +#endif
> +
>  /*
>   * Unaddressable device memory support. See include/linux/hmm.h and
>   * Documentation/vm/hmm.rst. Short description is we need struct pages for
> @@ -107,7 +120,7 @@ static inline int current_is_kswapd(void)
>
>  #define MAX_SWAPFILES \
>  	((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
> -	SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
> +	SWP_MIGRATION_NUM - SWP_HWPOISON_NUM - SWP_PTE_MARKER_NUM)
>
>  /*
>   * Magic header for a swap area. The first part of the union is
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 32d517a28969..7a00627845f0 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -274,6 +274,84 @@ static inline int is_readable_migration_entry(swp_entry_t entry)
>
>  #endif
>
> +typedef unsigned long pte_marker;
> +
> +#define  PTE_MARKER_MASK     (0)
> +
> +#ifdef CONFIG_PTE_MARKER
> +
> +static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
> +{
> +	return swp_entry(SWP_PTE_MARKER, marker);
> +}
> +
> +static inline bool is_pte_marker_entry(swp_entry_t entry)
> +{
> +	return swp_type(entry) == SWP_PTE_MARKER;
> +}
> +
> +static inline pte_marker pte_marker_get(swp_entry_t entry)
> +{
> +	return swp_offset(entry) & PTE_MARKER_MASK;
> +}
> +
> +static inline bool is_pte_marker(pte_t pte)
> +{
> +	return is_swap_pte(pte) && is_pte_marker_entry(pte_to_swp_entry(pte));
> +}
> +
> +#else /* CONFIG_PTE_MARKER */
> +
> +static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
> +{
> +	/* This should never be called if !CONFIG_PTE_MARKER */
> +	WARN_ON_ONCE(1);
> +	return swp_entry(0, 0);
> +}
> +
> +static inline bool is_pte_marker_entry(swp_entry_t entry)
> +{
> +	return false;
> +}
> +
> +static inline pte_marker pte_marker_get(swp_entry_t entry)
> +{
> +	return 0;
> +}
> +
> +static inline bool is_pte_marker(pte_t pte)
> +{
> +	return false;
> +}
> +
> +#endif /* CONFIG_PTE_MARKER */
> +
> +static inline pte_t make_pte_marker(pte_marker marker)
> +{
> +	return swp_entry_to_pte(make_pte_marker_entry(marker));
> +}
> +
> +/*
> + * This is a special version to check pte_none() just to cover the case when
> + * the pte is a pte marker.  It existed because in many cases the pte marker
> + * should be seen as a none pte; it's just that we have stored some information
> + * onto the none pte so it becomes not-none any more.
> + *
> + * It should be used when the pte is file-backed, ram-based and backing
> + * userspace pages, like shmem.  It is not needed upon pgtables that do not
> + * support pte markers at all.  For example, it's not needed on anonymous
> + * memory, kernel-only memory (including when the system is during-boot),
> + * non-ram based generic file-system.  It's fine to be used even there, but the
> + * extra pte marker check will be pure overhead.
> + *
> + * For systems configured with !CONFIG_PTE_MARKER this will be automatically
> + * optimized to pte_none().
> + */
> +static inline int pte_none_mostly(pte_t pte)
> +{
> +	return pte_none(pte) || is_pte_marker(pte);
> +}
> +
>  static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
>  {
>  	struct page *p = pfn_to_page(swp_offset(entry));
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 034d87953600..a1688b9314b2 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -909,6 +909,12 @@ config ANON_VMA_NAME
>  	  area from being merged with adjacent virtual memory areas due to the
>  	  difference in their name.
>
> +config PTE_MARKER
> +	bool "Marker PTEs support"
> +
> +	help
> +	  Allows to create marker PTEs for file-backed memory.
> +
>  source "mm/damon/Kconfig"
>
>  endmenu
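
[A minimal usage sketch of the helpers introduced above; "EXAMPLE_MARKER" is
a made-up marker bit for illustration (real bits such as the uffd-wp one are
only defined in later patches of the series), and mm/addr/ptep stand for the
usual page-table walk arguments:]

	#define EXAMPLE_MARKER	((pte_marker)1)	/* hypothetical bit */

	/* Install a marker into a zapped (none) pte slot: */
	set_pte_at(mm, addr, ptep, make_pte_marker(EXAMPLE_MARKER));

	/* A marker pte is "mostly none" -- no page is mapped there... */
	if (pte_none_mostly(*ptep))
		/* treat as empty, e.g. when deciding to allocate */;

	/* ...yet the stored information can still be recovered: */
	if (is_pte_marker(*ptep)) {
		pte_marker marker = pte_marker_get(pte_to_swp_entry(*ptep));
		/* act on the bits in "marker" */
	}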

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 22/23] mm: Enable PTE markers by default
  2022-04-05  1:49 ` [PATCH v8 22/23] mm: Enable PTE markers by default Peter Xu
@ 2022-04-19 15:13   ` Johannes Weiner
  2022-04-19 19:59     ` Peter Xu
  0 siblings, 1 reply; 63+ messages in thread
From: Johannes Weiner @ 2022-04-19 15:13 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Mike Kravetz, Nadav Amit, Matthew Wilcox,
	Mike Rapoport, David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple

Hi Peter,

On Mon, Apr 04, 2022 at 09:49:29PM -0400, Peter Xu wrote:
> Enable PTE markers by default.  On x86_64 it means it'll auto-enable
> PTE_MARKER_UFFD_WP as well.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/Kconfig | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 6e7c2d59fa96..3eca34c864c5 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -911,12 +911,14 @@ config ANON_VMA_NAME
>  
>  config PTE_MARKER
>  	bool "Marker PTEs support"
> +	default y
>  
>  	help
>  	  Allows to create marker PTEs for file-backed memory.

make oldconfig just prompted me on these:

---
Marker PTEs support (PTE_MARKER) [Y/n/?] (NEW) ?

CONFIG_PTE_MARKER:

Allows to create marker PTEs for file-backed memory.

Symbol: PTE_MARKER [=y]
Type  : bool
Defined at mm/Kconfig:1046
  Prompt: Marker PTEs support
  Location:
    Main menu
      -> Memory Management options
---

>  config PTE_MARKER_UFFD_WP
>  	bool "Marker PTEs support for userfaultfd write protection"
> +	default y
>  	depends on PTE_MARKER && HAVE_ARCH_USERFAULTFD_WP

It's not possible to answer them without looking at the code.

But after looking at the code, I'm still not sure why it asks
me. Isn't this infrastructure code?

Wouldn't it make more sense to remove the prompt string and have
userfaultfd simply select those?

If this is too experimental to enable by default, a more reasonable
question for the user would be a "userfaultfd file support" option or
something, and have *that* select the marker code.

Thanks!


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 01/23] mm: Introduce PTE_MARKER swap entry
  2022-04-19  8:25   ` Alistair Popple
@ 2022-04-19 19:44     ` Peter Xu
  0 siblings, 0 replies; 63+ messages in thread
From: Peter Xu @ 2022-04-19 19:44 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-kernel, linux-mm, Mike Kravetz, Andrew Morton,
	David Hildenbrand, Matthew Wilcox, Nadav Amit, Axel Rasmussen,
	Andrea Arcangeli, Kirill A . Shutemov, Hugh Dickins,
	Jerome Glisse, Mike Rapoport

On Tue, Apr 19, 2022 at 06:25:31PM +1000, Alistair Popple wrote:
> Hi Peter,

Hi, Alistair,

> 
> Is there something I have missed that means PTE markers can only be used with
> file-backed memory? Obviously that's all you care about for this patch series,
> but if we needed to mark some anonymous PTE for special processing is there
> anything that would prevent us using a PTE marker? Specifically I was thinking
> about it in relation to this series:
> <https://lore.kernel.org/linux-mm/87pmldjxiq.fsf@nvdebian.thelocal/>

It's not necessarily restricted to file-backed memory.  All the file-backed
checks in this series are just for safety purposes and nothing else.

I think the swap-read-error case is a very good example where a pte marker
does sound like a great fit, but let's see whether people would still like
to stick with hwpoison, which makes some sense too.  Let's keep the
discussion in that thread.

Thanks for the pointer!

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 22/23] mm: Enable PTE markers by default
  2022-04-19 15:13   ` Johannes Weiner
@ 2022-04-19 19:59     ` Peter Xu
  2022-04-19 20:14       ` Johannes Weiner
  0 siblings, 1 reply; 63+ messages in thread
From: Peter Xu @ 2022-04-19 19:59 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-kernel, linux-mm, Mike Kravetz, Nadav Amit, Matthew Wilcox,
	Mike Rapoport, David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple

[-- Attachment #1: Type: text/plain, Size: 2850 bytes --]

On Tue, Apr 19, 2022 at 11:13:48AM -0400, Johannes Weiner wrote:
> Hi Peter,

Hi, Johannes,

> 
> On Mon, Apr 04, 2022 at 09:49:29PM -0400, Peter Xu wrote:
> > Enable PTE markers by default.  On x86_64 it means it'll auto-enable
> > PTE_MARKER_UFFD_WP as well.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  mm/Kconfig | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 6e7c2d59fa96..3eca34c864c5 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -911,12 +911,14 @@ config ANON_VMA_NAME
> >  
> >  config PTE_MARKER
> >  	bool "Marker PTEs support"
> > +	default y
> >  
> >  	help
> >  	  Allows to create marker PTEs for file-backed memory.
> 
> make oldconfig just prompted me on these:
> 
> ---
> Marker PTEs support (PTE_MARKER) [Y/n/?] (NEW) ?
> 
> CONFIG_PTE_MARKER:
> 
> Allows to create marker PTEs for file-backed memory.
> 
> Symbol: PTE_MARKER [=y]
> Type  : bool
> Defined at mm/Kconfig:1046
>   Prompt: Marker PTEs support
>   Location:
>     Main menu
>       -> Memory Management options
> ---
> 
> >  config PTE_MARKER_UFFD_WP
> >  	bool "Marker PTEs support for userfaultfd write protection"
> > +	default y
> >  	depends on PTE_MARKER && HAVE_ARCH_USERFAULTFD_WP
> 
> It's not possible to answer them without looking at the code.
> 
> But after looking at the code, I'm still not sure why it asks
> me. Isn't this infrastructure code?
> 
> Wouldn't it make more sense to remove the prompt string and have
> userfaultfd simply select those?
> 
> If this is too experimental to enable by default, a more reasonable
> question for the user would be a "userfaultfd file support" option or
> something, and have *that* select the marker code.

Thanks for raising this question.

Actually it's enabled by default right now, so I kept the options just to
make sure we can always explicitly disable them when we want.  That's
mainly why I kept this patch as a standalone one, so we can even drop it
if we want.

That said, I fully agree with you that having two options seems to be
overkill; especially, the PTE_MARKER option will be too challenging to be
correctly understood by anyone not familiar with it.

So on second thought, I'm trying to refine what I want to achieve with the
kbuild system for this new feature:

- On supported systems (x86_64), should be by default y with "make
  olddefconfig", but able to turn it off using "make oldconfig" etc., so
  the user will have a choice when they want.

- On not-supported systems (non-x86_64), should be always n without
  any prompt asking, and user won't even see this entry.

- PTE_MARKER option should always be hidden for all archs

I plan to post the patch that is attached (I also reworded the entry so it
no longer mentions pte markers).  Does that look acceptable to you?

Thanks,

-- 
Peter Xu

[-- Attachment #2: 0001-mm-uffd-Hide-PTE_MARKER-option.patch --]
[-- Type: text/plain, Size: 1501 bytes --]

From 5d25e9d685bf129a1730caa61c1545fb16c094bb Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Tue, 19 Apr 2022 15:31:12 -0400
Subject: [PATCH] mm/uffd: Hide PTE_MARKER option
Content-type: text/plain

The PTE_MARKER option does not need to be exposed to the kernel builder;
keep it internal and remove the prompt so it won't be seen.

Instead, make the PTE_MARKER_UFFD_WP option explicitly select PTE_MARKER
when necessary.

While PTE_MARKER_UFFD_WP will still prompt the user, change the wording so
that it no longer mentions PTE_MARKER at all, renaming it to "Userfaultfd
write protection support for shmem/hugetlbfs".

Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/Kconfig | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 3eca34c864c5..d740e1ff3b2f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -910,16 +910,16 @@ config ANON_VMA_NAME
 	  difference in their name.
 
 config PTE_MARKER
-	bool "Marker PTEs support"
-	default y
+	bool
 
 	help
 	  Allows to create marker PTEs for file-backed memory.
 
 config PTE_MARKER_UFFD_WP
-	bool "Marker PTEs support for userfaultfd write protection"
+	bool "Userfaultfd write protection support for shmem/hugetlbfs"
 	default y
-	depends on PTE_MARKER && HAVE_ARCH_USERFAULTFD_WP
+	depends on HAVE_ARCH_USERFAULTFD_WP
+	select PTE_MARKER
 
 	help
 	  Allows to create marker PTEs for userfaultfd write protection
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 22/23] mm: Enable PTE markers by default
  2022-04-19 19:59     ` Peter Xu
@ 2022-04-19 20:14       ` Johannes Weiner
  2022-04-19 20:28         ` Peter Xu
  0 siblings, 1 reply; 63+ messages in thread
From: Johannes Weiner @ 2022-04-19 20:14 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Mike Kravetz, Nadav Amit, Matthew Wilcox,
	Mike Rapoport, David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple

Hi Peter,

On Tue, Apr 19, 2022 at 03:59:21PM -0400, Peter Xu wrote:
> @@ -910,16 +910,16 @@ config ANON_VMA_NAME
>  	  difference in their name.
>  
>  config PTE_MARKER
> -	bool "Marker PTEs support"
> -	default y
> +	bool
>  
>  	help
>  	  Allows to create marker PTEs for file-backed memory.
>
>  config PTE_MARKER_UFFD_WP
> -	bool "Marker PTEs support for userfaultfd write protection"
> +	bool "Userfaultfd write protection support for shmem/hugetlbfs"
>  	default y
> -	depends on PTE_MARKER && HAVE_ARCH_USERFAULTFD_WP
> +	depends on HAVE_ARCH_USERFAULTFD_WP
> +	select PTE_MARKER

This is much easier to understand, thanks!

Btw, this doesn't do much without userfaultfd being enabled in
general, right? Would it make sense to have it next to 'config
USERFAULTFD' as a sub-option?


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 22/23] mm: Enable PTE markers by default
  2022-04-19 20:14       ` Johannes Weiner
@ 2022-04-19 20:28         ` Peter Xu
  2022-04-19 21:24           ` Johannes Weiner
  0 siblings, 1 reply; 63+ messages in thread
From: Peter Xu @ 2022-04-19 20:28 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-kernel, linux-mm, Mike Kravetz, Nadav Amit, Matthew Wilcox,
	Mike Rapoport, David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple

On Tue, Apr 19, 2022 at 04:14:11PM -0400, Johannes Weiner wrote:
> Hi Peter,
> 
> On Tue, Apr 19, 2022 at 03:59:21PM -0400, Peter Xu wrote:
> > @@ -910,16 +910,16 @@ config ANON_VMA_NAME
> >  	  difference in their name.
> >  
> >  config PTE_MARKER
> > -	bool "Marker PTEs support"
> > -	default y
> > +	bool
> >  
> >  	help
> >  	  Allows to create marker PTEs for file-backed memory.
> >
> >  config PTE_MARKER_UFFD_WP
> > -	bool "Marker PTEs support for userfaultfd write protection"
> > +	bool "Userfaultfd write protection support for shmem/hugetlbfs"
> >  	default y
> > -	depends on PTE_MARKER && HAVE_ARCH_USERFAULTFD_WP
> > +	depends on HAVE_ARCH_USERFAULTFD_WP
> > +	select PTE_MARKER
> 
> This is much easier to understand, thanks!

Cool!  Sent as a formal patch just now.

> 
> Btw, this doesn't do much without userfaultfd being enabled in
> general, right?

So far yes, but I'm thinking there can be other potential users of
PTE_MARKER in the mm world.  The closest discussion is on the swap read
failures and this patch proposed by Miaohe:

https://lore.kernel.org/lkml/20220416030549.60559-1-linmiaohe@huawei.com/

So I hope we can still keep them around here under mm/ if possible, and my
gut feeling is that they really should stay..

> Would it make sense to have it next to 'config USERFAULTFD' as a
> sub-option?

Yes another good question. :)

IIUC CONFIG_USERFAULTFD resides in init/Kconfig because it introduces a new
syscall.  The same goes for the rest of the uffd bits since then, namely:

  - USERFAULTFD_WP
  - USERFAULTFD_MINOR

What I am thinking now is the other way round from your suggestion: whether
we should move most of them out, at least the _WP and _MINOR configs, into
mm/?  Because IMHO they are really pure mm ideas and they're irrelevant to
syscalls and init.

Any thoughts?

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 22/23] mm: Enable PTE markers by default
  2022-04-19 20:28         ` Peter Xu
@ 2022-04-19 21:24           ` Johannes Weiner
  2022-04-19 22:01             ` Peter Xu
  0 siblings, 1 reply; 63+ messages in thread
From: Johannes Weiner @ 2022-04-19 21:24 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Mike Kravetz, Nadav Amit, Matthew Wilcox,
	Mike Rapoport, David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple

On Tue, Apr 19, 2022 at 04:28:16PM -0400, Peter Xu wrote:
> On Tue, Apr 19, 2022 at 04:14:11PM -0400, Johannes Weiner wrote:
> > On Tue, Apr 19, 2022 at 03:59:21PM -0400, Peter Xu wrote:
> > > @@ -910,16 +910,16 @@ config ANON_VMA_NAME
> > Btw, this doesn't do much without userfaultfd being enabled in
> > general, right?
> 
> > So far yes, but I'm thinking there can be other potential users of
> > PTE_MARKER in the mm world.  The closest discussion is on the swap read
> > failures and this patch proposed by Miaohe:
> >
> > https://lore.kernel.org/lkml/20220416030549.60559-1-linmiaohe@huawei.com/
> >
> > So I hope we can still keep them around here under mm/ if possible, and my
> > gut feeling is that they really should stay..

Agreed, mm/ seems a good fit for PTE_MARKER.

If it's invisible and gets selected as needed, it's less of a concern,
IMO. I'm somewhat worried about when and how the user-visible options
show up right now, though...

> > Would it make sense to have it next to 'config USERFAULTFD' as a
> > sub-option?
> 
> Yes another good question. :)
> 
> IIUC CONFIG_USERFAULTFD resides in init/Kconfig because it introduces a new
> syscall.  The same goes for the rest of the uffd bits since then, namely:
>
>   - USERFAULTFD_WP
>   - USERFAULTFD_MINOR
>
> What I am thinking now is the other way round from your suggestion: whether
> we should move most of them out, at least the _WP and _MINOR configs, into
> mm/?  Because IMHO they are really pure mm ideas and they're irrelevant to
> syscalls and init.

I'm thinking the MM submenu would probably be a better fit for all
user-visible userfaultfd options, including the syscall. Like you say,
it's an MM concept.

But if moving the syscall knob out from init isn't popular, IMO it
would be better to add the new WP option to init as well. This ensures
that when somebody selects userfaultfd, they also see the relevant
suboptions and don't have to chase them down across multiple submenus.

Conversely, they should also have the necessary depend clauses so that
suboptions aren't visible without the main feature. E.g. it asked me
for userfaultfd options even though I have CONFIG_USERFAULTFD=n.

What do you think?


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 22/23] mm: Enable PTE markers by default
  2022-04-19 21:24           ` Johannes Weiner
@ 2022-04-19 22:01             ` Peter Xu
  2022-04-20 13:46               ` Johannes Weiner
  0 siblings, 1 reply; 63+ messages in thread
From: Peter Xu @ 2022-04-19 22:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-kernel, linux-mm, Mike Kravetz, Nadav Amit, Matthew Wilcox,
	Mike Rapoport, David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple

On Tue, Apr 19, 2022 at 05:24:28PM -0400, Johannes Weiner wrote:
> On Tue, Apr 19, 2022 at 04:28:16PM -0400, Peter Xu wrote:
> > On Tue, Apr 19, 2022 at 04:14:11PM -0400, Johannes Weiner wrote:
> > > On Tue, Apr 19, 2022 at 03:59:21PM -0400, Peter Xu wrote:
> > > > @@ -910,16 +910,16 @@ config ANON_VMA_NAME
> > > Btw, this doesn't do much without userfaultfd being enabled in
> > > general, right?
> > 
> > > So far yes, but I'm thinking there can be other potential users of
> > > PTE_MARKER in the mm world.  The closest discussion is on the swap read
> > > failures and this patch proposed by Miaohe:
> > >
> > > https://lore.kernel.org/lkml/20220416030549.60559-1-linmiaohe@huawei.com/
> > >
> > > So I hope we can still keep them around here under mm/ if possible, and my
> > > gut feeling is that they really should stay..
> 
> Agreed, mm/ seems a good fit for PTE_MARKER.
> 
> If it's invisible and gets selected as needed, it's less of a concern,
> IMO. I'm somewhat worried about when and how the user-visible options
> show up right now, though...
> 
> > > Would it make sense to have it next to 'config USERFAULTFD' as a
> > > sub-option?
> > 
> > Yes another good question. :)
> > 
> > IIUC CONFIG_USERFAULTFD resides in init/Kconfig because it introduces a new
> > syscall.  The same goes for the rest of the uffd bits since then, namely:
> >
> >   - USERFAULTFD_WP
> >   - USERFAULTFD_MINOR
> >
> > What I am thinking now is the other way round from your suggestion: whether
> > we should move most of them out, at least the _WP and _MINOR configs, into
> > mm/?  Because IMHO they are really pure mm ideas and they're irrelevant to
> > syscalls and init.
> 
> I'm thinking the MM submenu would probably be a better fit for all
> user-visible userfaultfd options, including the syscall. Like you say,
> it's an MM concept.
> 
> But if moving the syscall knob out from init isn't popular, IMO it
> would be better to add the new WP option to init as well. This ensures
> that when somebody selects userfaultfd, they also see the relevant
> suboptions and don't have to chase them down across multiple submenus.
> 
> Conversely, they should also have the necessary depend clauses so that
> suboptions aren't visible without the main feature. E.g. it asked me
> for userfaultfd options even though I have CONFIG_USERFAULTFD=n.

Hmm, this is a bit weird... since we do have that dependency chain for
PTE_MARKER_UFFD_WP -> HAVE_ARCH_USERFAULTFD_WP -> USERFAULTFD:

  in arch/x86/Kconfig:
  config X86
          ...
          select HAVE_ARCH_USERFAULTFD_WP         if X86_64 && USERFAULTFD

  in mm/Kconfig (with/without the "mm/uffd: Hide PTE_MARKER" patch applied):
  config PTE_MARKER_UFFD_WP
          ...
          depends on HAVE_ARCH_USERFAULTFD_WP

So logically if !USERFAULTFD we shouldn't see PTE_MARKER_UFFD_WP at all?

That's also what I got when I tried it out with either !USERFAULTFD on x86,
or any non-x86 platform (because there we have !HAVE_ARCH_USERFAULTFD_WP
regardless of USERFAULTFD).  Though I could have missed
something..

> 
> What do you think?

I don't have a strong preference here; I think it's okay if it's preferred
that we only put user-visible configs into mm/Kconfig.  It's just that I
see we already have tons of user-invisible configs in mm/Kconfig, to list
some:

        config ARCH_HAS_HUGEPD
        config MAPPING_DIRTY_HELPERS
        config KMAP_LOCAL
        config KMAP_LOCAL_NON_LINEAR_PTE_ARRAY

But I'm not sure whether there's a rule of thumb on this somewhere else.

In the meantime, I also looked at whether syscall configs are always and
only put under init/, and funnily enough I got:

$ find . -name Kconfig | xargs grep --color -E "\".*syscall.*\""
./init/Kconfig: bool "Enable process_vm_readv/writev syscalls"
./init/Kconfig: bool "uselib syscall"
./init/Kconfig: bool "sgetmask/ssetmask syscalls support" if EXPERT
./init/Kconfig: bool "Sysfs syscall support" if EXPERT
./init/Kconfig: bool "open by fhandle syscalls" if EXPERT
./init/Kconfig: bool "Enable madvise/fadvise syscalls" if EXPERT
./arch/xtensa/Kconfig:  bool "Enable fast atomic syscalls"
./arch/xtensa/Kconfig:  bool "Enable spill registers syscall"
./arch/powerpc/Kconfig: bool "Support setting protections for 4k subpages (subpage_prot syscall)"
./arch/powerpc/Kconfig: bool "Enable filtering of RTAS syscalls"
./arch/Kconfig: bool "Support for randomizing kernel stack offset on syscall entry" if EXPERT
./arch/s390/Kconfig:    bool "Verify kernel signature during kexec_file_load() syscall"
./arch/sh/mm/Kconfig:   bool "Support vsyscall page"
./arch/x86/Kconfig:     bool "Enable vsyscall emulation" if EXPERT
./arch/x86/Kconfig:     bool "Verify kernel signature during kexec_file_load() syscall"
./arch/x86/Kconfig:     bool "Require a valid signature in kexec_file_load() syscall"
./arch/x86/Kconfig:     prompt "vsyscall table for legacy applications"
./arch/arm64/Kconfig:   bool "Verify kernel signature during kexec_file_load() syscall"
./arch/arm64/Kconfig:   bool "Enable the tagged user addresses syscall ABI"
./kernel/trace/Kconfig: bool "Trace syscalls"
./kernel/trace/Kconfig: bool "Run selftest on syscall events"

So putting aside the arch-specific lines, ftrace does have FTRACE_SYSCALLS,
which lives in the kernel/trace/ dir.. I'm not sure whether we could move
USERFAULTFD and all the rest into mm/ as well?  Or perhaps that's just a
bad example? :)

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 22/23] mm: Enable PTE markers by default
  2022-04-19 22:01             ` Peter Xu
@ 2022-04-20 13:46               ` Johannes Weiner
  2022-04-20 14:25                 ` Peter Xu
  0 siblings, 1 reply; 63+ messages in thread
From: Johannes Weiner @ 2022-04-20 13:46 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Mike Kravetz, Nadav Amit, Matthew Wilcox,
	Mike Rapoport, David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple

On Tue, Apr 19, 2022 at 06:01:14PM -0400, Peter Xu wrote:
> On Tue, Apr 19, 2022 at 05:24:28PM -0400, Johannes Weiner wrote:
> > On Tue, Apr 19, 2022 at 04:28:16PM -0400, Peter Xu wrote:
> > > On Tue, Apr 19, 2022 at 04:14:11PM -0400, Johannes Weiner wrote:
> > > > On Tue, Apr 19, 2022 at 03:59:21PM -0400, Peter Xu wrote:
> > > > > @@ -910,16 +910,16 @@ config ANON_VMA_NAME
> > > > Btw, this doesn't do much without userfaultfd being enabled in
> > > > general, right?
> > > 
> > > So far yes, but I'm thinking there can be other potential users of
> > > PTE_MARKERS in the mm world.  The closest discussion is on the swap read
> > > failures and this patch proposed by Miaohe:
> > > 
> > > https://lore.kernel.org/lkml/20220416030549.60559-1-linmiaohe@huawei.com/
> > >
> > > So I hope we can still keep them here under mm/ if possible, and my gut
> > > feeling is that they really should stay..
> > 
> > Agreed, mm/ seems a good fit for PTE_MARKER.
> > 
> > If it's invisible and gets selected as needed, it's less of a concern,
> > IMO. I'm somewhat worried about when and how the user-visible options
> > show up right now, though...
> > 
> > > > Would it make sense to have it next to 'config USERFAULTFD' as a
> > > > sub-option?
> > > 
> > > Yes another good question. :)
> > > 
> > > IIUC CONFIG_USERFAULTFD resides in init/Kconfig because it introduces a new
> > > syscall.  The same goes for the rest of the uffd bits added since, namely:
> > > 
> > >   - USERFAULTFD_WP
> > >   - USERFAULTFD_MINOR
> > > 
> > > What I am thinking now is the other way round from your suggestion: whether
> > > we should move most of them out, at least the _WP and _MINOR configs, into
> > > mm/?  Because IMHO they are really pure mm ideas, unrelated to syscalls
> > > and init.
> > 
> > I'm thinking the MM submenu would probably be a better fit for all
> > user-visible userfaultfd options, including the syscall. Like you say,
> > it's an MM concept.
> > 
> > But if moving the syscall knob out from init isn't popular, IMO it
> > would be better to add the new WP option to init as well. This ensures
> > that when somebody selects userfaultfd, they also see the relevant
> > suboptions and don't have to chase them down across multiple submenus.
> > 
> > Conversely, they should also have the necessary depend clauses so that
> > suboptions aren't visible without the main feature. E.g. it asked me
> > for userfaultfd options even though I have CONFIG_USERFAULTFD=n.
> 
> Hmm, this is a bit weird... since we do have that dependency chain for
> PTE_MARKER_UFFD_WP -> HAVE_ARCH_USERFAULTFD_WP -> USERFAULTFD:
> 
>   in arch/x86/Kconfig:
>   config X86
>           ...
>           select HAVE_ARCH_USERFAULTFD_WP         if X86_64 && USERFAULTFD
> 
>   in mm/Kconfig (with/without the "mm/uffd: Hide PTE_MARKER" patch applied):
>   config PTE_MARKER_UFFD_WP
>           ...
>           depends on HAVE_ARCH_USERFAULTFD_WP
> 
> So logically if !USERFAULTFD we shouldn't see PTE_MARKER_UFFD_WP at all?
> 
> That's also what I got when I tried it out, for either !USERFAULTFD on x86
> or any non-x86 platform (because there !HAVE_ARCH_USERFAULTFD_WP holds
> unconditionally, irrespective of USERFAULTFD).  Though I could have missed
> something..

Sorry, it asked me about PTE_MARKERS, and that conclusion got stuck in
my head. Indeed, once that symbol is invisible we should be good.

> > What do you think?
> 
> I don't have a strong preference here; I think it's okay if the preference
> is to only put user-visible configs into mm/Kconfig.  It's just that we
> already have tons of user-invisible configs in mm/Kconfig, to list some:
> 
>         config ARCH_HAS_HUGEPD
>         config MAPPING_DIRTY_HELPERS
>         config KMAP_LOCAL
>         config KMAP_LOCAL_NON_LINEAR_PTE_ARRAY
> 
> But I'm not sure whether there's a rule of thumb on this somewhere else.

I wasn't objecting to invisible symbols in mm/.

My point was simply that for the user it might be easiest and most
intuitive if userfaultfd and its related suboptions are 1) grouped
together and 2) in the MM submenu.

> In the meantime, I also looked at whether syscall configs are always and
> only put under init/, and funnily enough I got:
> 
> $ find . -name Kconfig | xargs grep --color -E "\".*syscall.*\""
> ./init/Kconfig: bool "Enable process_vm_readv/writev syscalls"
> ./init/Kconfig: bool "uselib syscall"
> ./init/Kconfig: bool "sgetmask/ssetmask syscalls support" if EXPERT
> ./init/Kconfig: bool "Sysfs syscall support" if EXPERT
> ./init/Kconfig: bool "open by fhandle syscalls" if EXPERT
> ./init/Kconfig: bool "Enable madvise/fadvise syscalls" if EXPERT
> ./arch/xtensa/Kconfig:  bool "Enable fast atomic syscalls"
> ./arch/xtensa/Kconfig:  bool "Enable spill registers syscall"
> ./arch/powerpc/Kconfig: bool "Support setting protections for 4k subpages (subpage_prot syscall)"
> ./arch/powerpc/Kconfig: bool "Enable filtering of RTAS syscalls"
> ./arch/Kconfig: bool "Support for randomizing kernel stack offset on syscall entry" if EXPERT
> ./arch/s390/Kconfig:    bool "Verify kernel signature during kexec_file_load() syscall"
> ./arch/sh/mm/Kconfig:   bool "Support vsyscall page"
> ./arch/x86/Kconfig:     bool "Enable vsyscall emulation" if EXPERT
> ./arch/x86/Kconfig:     bool "Verify kernel signature during kexec_file_load() syscall"
> ./arch/x86/Kconfig:     bool "Require a valid signature in kexec_file_load() syscall"
> ./arch/x86/Kconfig:     prompt "vsyscall table for legacy applications"
> ./arch/arm64/Kconfig:   bool "Verify kernel signature during kexec_file_load() syscall"
> ./arch/arm64/Kconfig:   bool "Enable the tagged user addresses syscall ABI"
> ./kernel/trace/Kconfig: bool "Trace syscalls"
> ./kernel/trace/Kconfig: bool "Run selftest on syscall events"
> 
> So putting aside the arch-specific lines, ftrace does have FTRACE_SYSCALLS,
> which lives in the kernel/trace/ dir.. I'm not sure whether we could move
> USERFAULTFD and all the rest into mm/ as well?  Or perhaps that's just a
> bad example? :)

Yeah it looks like there is a healthy mix ;) To add to the list:

- mm/Kconfig has CONFIG_SWAP for the swapon/swapoff syscalls.
- fs/Kconfig has CONFIG_FILE_LOCKING, which adds the flock() syscall.
- Interestingly, fs/Kconfig has CONFIG_MEMFD_CREATE for memfd_create()
  which is implemented in mm/memfd.c.

It seems reasonable to me to move the userfaultfd stuff to mm as well,
especially now that it's becoming more than a single binary question of
whether you want a syscall or not, and has MM-specific suboptions.
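
A rough sketch of what such a grouping in mm/Kconfig could look like
(illustrative only, not an actual patch):

  config USERFAULTFD
          bool "Enable userfaultfd() system call"
          depends on MMU
          help
            Enable the userfaultfd() system call that allows to intercept
            and handle page faults in userland.

  # ... with PTE_MARKER_UFFD_WP and the other uffd suboptions directly
  # below, so the syscall knob and its MM suboptions show up in one place.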


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 22/23] mm: Enable PTE markers by default
  2022-04-20 13:46               ` Johannes Weiner
@ 2022-04-20 14:25                 ` Peter Xu
  0 siblings, 0 replies; 63+ messages in thread
From: Peter Xu @ 2022-04-20 14:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-kernel, linux-mm, Mike Kravetz, Nadav Amit, Matthew Wilcox,
	Mike Rapoport, David Hildenbrand, Hugh Dickins, Jerome Glisse,
	Kirill A . Shutemov, Andrea Arcangeli, Andrew Morton,
	Axel Rasmussen, Alistair Popple

On Wed, Apr 20, 2022 at 09:46:07AM -0400, Johannes Weiner wrote:
> My point was simply that for the user it might be easiest and most
> intuitive if userfaultfd and its related suboptions are 1) grouped
> together and 2) in the MM submenu.

Very reasonable request.

> Yeah it looks like there is a healthy mix ;) To add to the list:
> 
> - mm/Kconfig has CONFIG_SWAP for the swapon/swapoff syscalls.
> - fs/Kconfig has CONFIG_FILE_LOCKING, which adds the flock() syscall.
> - Interestingly, fs/Kconfig has CONFIG_MEMFD_CREATE for memfd_create()
>   which is implemented in mm/memfd.c.
> 
> It seems reasonable to me to move the userfaultfd stuff to mm as well,
> especially now that it's becoming more than a single binary question of
> whether you want a syscall or not, and has MM-specific suboptions.

Thanks for the extra information!

As you said, it's obviously growing a little bit.  I'll give it a shot
later today.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs
  2022-04-05  1:46 [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
                   ` (23 preceding siblings ...)
  2022-04-05 22:16 ` [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs Andrew Morton
@ 2022-05-10 19:05 ` Andrew Morton
  24 siblings, 0 replies; 63+ messages in thread
From: Andrew Morton @ 2022-05-10 19:05 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Mike Kravetz, David Hildenbrand,
	Matthew Wilcox, Alistair Popple, Nadav Amit, Axel Rasmussen,
	Andrea Arcangeli, Kirill A . Shutemov, Hugh Dickins,
	Jerome Glisse, Mike Rapoport, Gerald Schaefer

On Mon,  4 Apr 2022 21:46:23 -0400 Peter Xu <peterx@redhat.com> wrote:

> This is v8 of the series to add shmem+hugetlbfs support for userfaultfd
> write protection.

I think we've finished tossing this patchset around, so I plan to feed
it into mm-stable later this week.

A few -fixes were added:

https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-introduce-pte_marker-swap-entry-fix.patch
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-teach-core-mm-about-pte-markers-fix.patch
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-check-against-orig_pte-for-finish_fault-fix.patch
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-check-against-orig_pte-for-finish_fault-fix-checkpatch-fixes.patch
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-uffd-pte_marker_uffd_wp-fix.patch
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-hugetlb-only-drop-uffd-wp-special-pte-if-required-fix.patch
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-hugetlb-only-drop-uffd-wp-special-pte-if-required-fix-fix.patch
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-shmem-handle-uffd-wp-during-fork-fix.patch
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-enable-pte-markers-by-default-fix.patch



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 06/23] mm/shmem: Handle uffd-wp special pte in page fault handler
  2022-04-05  1:48 ` [PATCH v8 06/23] mm/shmem: Handle uffd-wp special pte in page fault handler Peter Xu
@ 2022-05-11 16:30   ` David Hildenbrand
  2022-05-12 16:34     ` Peter Xu
  0 siblings, 1 reply; 63+ messages in thread
From: David Hildenbrand @ 2022-05-11 16:30 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm
  Cc: Mike Kravetz, Nadav Amit, Matthew Wilcox, Mike Rapoport,
	Hugh Dickins, Jerome Glisse, Kirill A . Shutemov,
	Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Alistair Popple

> +/*
> + * This is actually a page-missing access, but with uffd-wp special pte
> + * installed.  It means this pte was wr-protected before being unmapped.
> + */
> +static vm_fault_t pte_marker_handle_uffd_wp(struct vm_fault *vmf)
> +{
> +	/*
> +	 * Just in case there're leftover special ptes even after the region
> +	 * got unregistered - we can simply clear them.  We can also do that
> +	 * proactively, e.g. when we do UFFDIO_UNREGISTER upon some uffd-wp
> +	 * ranges, but it should be more efficient to be done lazily here.
> +	 */
> +	if (unlikely(!userfaultfd_wp(vmf->vma) || vma_is_anonymous(vmf->vma)))
> +		return pte_marker_clear(vmf);

What would happen if we do an unregister followed by a register? IMHO we
should start with a clean uffd-wp slate then. Your comment makes me
assume that we could receive stale WP events, which would be wrong?

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v8 06/23] mm/shmem: Handle uffd-wp special pte in page fault handler
  2022-05-11 16:30   ` David Hildenbrand
@ 2022-05-12 16:34     ` Peter Xu
  0 siblings, 0 replies; 63+ messages in thread
From: Peter Xu @ 2022-05-12 16:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Mike Kravetz, Nadav Amit, Matthew Wilcox,
	Mike Rapoport, Hugh Dickins, Jerome Glisse, Kirill A . Shutemov,
	Andrea Arcangeli, Andrew Morton, Axel Rasmussen, Alistair Popple

On Wed, May 11, 2022 at 06:30:59PM +0200, David Hildenbrand wrote:
> > +/*
> > + * This is actually a page-missing access, but with uffd-wp special pte
> > + * installed.  It means this pte was wr-protected before being unmapped.
> > + */
> > +static vm_fault_t pte_marker_handle_uffd_wp(struct vm_fault *vmf)
> > +{
> > +	/*
> > +	 * Just in case there're leftover special ptes even after the region
> > +	 * got unregistered - we can simply clear them.  We can also do that
> > +	 * proactively, e.g. when we do UFFDIO_UNREGISTER upon some uffd-wp
> > +	 * ranges, but it should be more efficient to be done lazily here.
> > +	 */
> > +	if (unlikely(!userfaultfd_wp(vmf->vma) || vma_is_anonymous(vmf->vma)))
> > +		return pte_marker_clear(vmf);
> 
> What would happen if we do an unregister followed by a register? IMHO we
> should start with a clean uffd-wp slate then. Your comment makes me
> assume that we could receive stale WP events, which would be wrong?

I'd say it's not wrong: it's true and actually expected.

Firstly, userfaultfd (by design) always allows false positives (getting the
same message multiple times) but has no tolerance for false negatives (a
missing event, which means data corruption).

The latter should be obvious.  For the former, the simplest example is when
two threads access the same missing page at the same time: two identical
messages will be generated.  The same applies to wr-protect faults.  And
it'll be non-trivial (or, I'd say, impossible.. IMHO) to avoid those.

In this specific case, it's about when to drop the uffd-wp bits on
unregister.  Two obvious options: (1) during unregister, or (2) lazily.

Here I chose the lazy way because the eager clearing could slow unregister
down, and unregister typically happens when the program quits.  In short,
with the current approach we quit fast.  We could have leftovers, but
we'll take care of them when needed.

One important thing is that leftover ptes should not be common in the
normal register -> wr-protect -> unprotect -> unregister sequence.
Normally the process probably won't unregister until it quits, so the
leftovers do no harm to anyone.

Meanwhile, any user who wants to avoid the lazy way can simply do a
whole-range unprotect before unregister (as sketched below).  So we leave
more choice to the user, and by default we make sure no syscall gets
needlessly slowed down.
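
A minimal illustrative sketch of that (a hypothetical helper, not part of
the series; "uffd", "start" and "len" are assumed to be set up by the
caller beforehand):

  #include <linux/userfaultfd.h>
  #include <sys/ioctl.h>

  static int uffd_unprotect_then_unregister(int uffd, __u64 start, __u64 len)
  {
          struct uffdio_writeprotect wp = {
                  .range = { .start = start, .len = len },
                  .mode = 0,      /* no _MODE_WP flag: clear write protection */
          };
          struct uffdio_range rng = { .start = start, .len = len };

          /* Eagerly drop all uffd-wp bits in the range... */
          if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
                  return -1;

          /* ...so the unregister below leaves no leftover special ptes. */
          return ioctl(uffd, UFFDIO_UNREGISTER, &rng);
  }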

Hope that answers, thanks!

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2022-05-12 16:34 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-05  1:46 [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
2022-04-05  1:46 ` [PATCH v8 01/23] mm: Introduce PTE_MARKER swap entry Peter Xu
2022-04-12  1:07   ` Alistair Popple
2022-04-12 19:45     ` Peter Xu
2022-04-13  0:30       ` Alistair Popple
2022-04-13 13:44         ` Peter Xu
2022-04-19  8:25   ` Alistair Popple
2022-04-19 19:44     ` Peter Xu
2022-04-05  1:48 ` [PATCH v8 02/23] mm: Teach core mm about pte markers Peter Xu
2022-04-12  1:22   ` Alistair Popple
2022-04-12 19:53     ` Peter Xu
2022-04-05  1:48 ` [PATCH v8 03/23] mm: Check against orig_pte for finish_fault() Peter Xu
2022-04-12  2:05   ` Alistair Popple
2022-04-12 19:54     ` Peter Xu
     [not found]   ` <CGME20220413140330eucas1p167da41e079712b829ef8237dc27b049c@eucas1p1.samsung.com>
2022-04-13 14:03     ` Marek Szyprowski
2022-04-13 16:43       ` Peter Xu
2022-04-14  7:51         ` Marek Szyprowski
2022-04-14 16:30           ` Peter Xu
2022-04-14 20:57             ` Andrew Morton
2022-04-14 21:08               ` Peter Xu
2022-04-15 14:21   ` Guenter Roeck
2022-04-15 14:41     ` Peter Xu
2022-04-05  1:48 ` [PATCH v8 04/23] mm/uffd: PTE_MARKER_UFFD_WP Peter Xu
2022-04-06  1:41   ` kernel test robot
2022-04-05  1:48 ` [PATCH v8 05/23] mm/shmem: Take care of UFFDIO_COPY_MODE_WP Peter Xu
2022-04-05  1:48 ` [PATCH v8 06/23] mm/shmem: Handle uffd-wp special pte in page fault handler Peter Xu
2022-05-11 16:30   ` David Hildenbrand
2022-05-12 16:34     ` Peter Xu
2022-04-05  1:48 ` [PATCH v8 07/23] mm/shmem: Persist uffd-wp bit across zapping for file-backed Peter Xu
2022-04-05  1:48 ` [PATCH v8 08/23] mm/shmem: Allow uffd wr-protect none pte for file-backed mem Peter Xu
2022-04-05  1:48 ` [PATCH v8 09/23] mm/shmem: Allows file-back mem to be uffd wr-protected on thps Peter Xu
2022-04-05  1:48 ` [PATCH v8 10/23] mm/shmem: Handle uffd-wp during fork() Peter Xu
2022-04-06  6:16   ` kernel test robot
2022-04-06 12:18     ` Peter Xu
2022-04-05  1:48 ` [PATCH v8 11/23] mm/hugetlb: Introduce huge pte version of uffd-wp helpers Peter Xu
2022-04-05  1:49 ` [PATCH v8 12/23] mm/hugetlb: Hook page faults for uffd write protection Peter Xu
2022-04-05  1:49 ` [PATCH v8 13/23] mm/hugetlb: Take care of UFFDIO_COPY_MODE_WP Peter Xu
2022-04-05  1:49 ` [PATCH v8 14/23] mm/hugetlb: Handle UFFDIO_WRITEPROTECT Peter Xu
2022-04-05  1:49 ` [PATCH v8 15/23] mm/hugetlb: Handle pte markers in page faults Peter Xu
2022-04-06 13:37   ` kernel test robot
2022-04-06 15:02     ` Peter Xu
2022-04-05  1:49 ` [PATCH v8 16/23] mm/hugetlb: Allow uffd wr-protect none ptes Peter Xu
2022-04-05  1:49 ` [PATCH v8 17/23] mm/hugetlb: Only drop uffd-wp special pte if required Peter Xu
2022-04-05  1:49 ` [PATCH v8 18/23] mm/hugetlb: Handle uffd-wp during fork() Peter Xu
2022-04-05  1:49 ` [PATCH v8 19/23] mm/khugepaged: Don't recycle vma pgtable if uffd-wp registered Peter Xu
2022-04-05  1:49 ` [PATCH v8 20/23] mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs Peter Xu
2022-04-05  1:49 ` [PATCH v8 21/23] mm/uffd: Enable write protection for shmem & hugetlbfs Peter Xu
2022-04-05  1:49 ` [PATCH v8 22/23] mm: Enable PTE markers by default Peter Xu
2022-04-19 15:13   ` Johannes Weiner
2022-04-19 19:59     ` Peter Xu
2022-04-19 20:14       ` Johannes Weiner
2022-04-19 20:28         ` Peter Xu
2022-04-19 21:24           ` Johannes Weiner
2022-04-19 22:01             ` Peter Xu
2022-04-20 13:46               ` Johannes Weiner
2022-04-20 14:25                 ` Peter Xu
2022-04-05  1:49 ` [PATCH v8 23/23] selftests/uffd: Enable uffd-wp for shmem/hugetlbfs Peter Xu
2022-04-05 22:16 ` [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs Andrew Morton
2022-04-05 22:42   ` Peter Xu
2022-04-05 22:49     ` Andrew Morton
2022-04-05 23:02       ` Peter Xu
2022-04-05 23:08         ` Andrew Morton
2022-05-10 19:05 ` Andrew Morton
