Linux-api Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH 0/9] userfaultfd: add minor fault handling for shmem
@ 2021-04-08 23:43 Axel Rasmussen
  2021-04-08 23:43 ` [PATCH 1/9] userfaultfd/hugetlbfs: avoid including userfaultfd_k.h in hugetlb.h Axel Rasmussen
                   ` (9 more replies)
  0 siblings, 10 replies; 21+ messages in thread
From: Axel Rasmussen @ 2021-04-08 23:43 UTC (permalink / raw)
  To: Alexander Viro, Andrea Arcangeli, Andrew Morton,
	Daniel Colascione, Hugh Dickins, Jerome Glisse, Joe Perches,
	Lokesh Gidra, Mike Kravetz, Mike Rapoport, Peter Xu, Shaohua Li,
	Shuah Khan, Stephen Rothwell, Wang Qing
  Cc: linux-api, linux-fsdevel, linux-kernel, linux-kselftest,
	linux-mm, Axel Rasmussen, Brian Geffon, Dr . David Alan Gilbert,
	Mina Almasry, Oliver Upton

Base
====

Since the original series [1] was merged into Andrew's tree, some issues were
noticed. Up to this point, we had been working on fixing what's in Andrew's
tree [2], but at this point we've changed direction enough that a lot of the
fix's delta is undoing what was done in the original series, thereby making it
hard to review.

As suggested by Hugh Dickins and Peter Xu, this series takes a step back. It can
be considered a v3 of the original series [1] - it combines those patches with
the fixes, reordered / broken up to allow for easier review.

The idea is that it will apply cleanly to akpm's tree, *replacing* the following
patches (i.e., drop these first, and then apply this series):

userfaultfd-support-minor-fault-handling-for-shmem.patch
userfaultfd-support-minor-fault-handling-for-shmem-fix.patch
userfaultfd-support-minor-fault-handling-for-shmem-fix-2.patch
userfaultfd-support-minor-fault-handling-for-shmem-fix-3.patch
userfaultfd-support-minor-fault-handling-for-shmem-fix-4.patch
userfaultfd-selftests-use-memfd_create-for-shmem-test-type.patch
userfaultfd-selftests-create-alias-mappings-in-the-shmem-test.patch
userfaultfd-selftests-reinitialize-test-context-in-each-test.patch
userfaultfd-selftests-exercise-minor-fault-handling-shmem-support.patch

Changelog
=========

Changes since the most recent fixup patch [2]:
- Squash the fixes ([2]) in with the original series ([1]). This makes reviewing
  easier, as we no longer have to sift through deltas undoing what we had done
  before. [Hugh, Peter]
- Modify shmem_mcopy_atomic_pte() to use the new mcopy_atomic_install_ptes()
  helper, reducing code duplication. [Hugh]
- Properly trigger handle_userfault() in the shmem_swapin_page() case. [Hugh]
- Use shmem_getpage() instead of find_lock_page() to lookup the existing page in
  for continue. This properly deals with swapped-out pages. [Hugh]
- Unconditionally pte_mkdirty() for anon memory (as before). [Peter]
- Don't include userfaultfd_k.h in either hugetlb.h or shmem_fs.h. [Hugh]
- Add comment for UFFD_FEATURE_MINOR_SHMEM (to match _HUGETLBFS). [Hugh]
- Fix some small cleanup issues (parens, reworded conditionals, reduced plumbing
  of some parameters, simplify labels/gotos, ...). [Hugh, Peter]

Overview
========

See the series which added minor faults for hugetlbfs [3] for a detailed
overview of minor fault handling in general. This series adds the same support
for shmem-backed areas.

This series is structured as follows:

- Commits 1 and 2 are cleanups.
- Commits 3 and 4 implement the new feature (minor fault handling for shmem).
- Commits 5, 6, 7, 8 update the userfaultfd selftest to exercise the feature.
- Commit 9 is one final cleanup, modifying an existing code path to re-use a new
  helper we've introduced. We rely on the selftest to show that this change
  doesn't break anything.

Use Case
========

In some cases it is useful to have VM memory backed by tmpfs instead of
hugetlbfs. So, this feature will be used to support the same VM live migration
use case described in my original series.

Additionally, Android folks (Lokesh Gidra <lokeshgidra@google.com>) hope to
optimize the Android Runtime garbage collector using this feature:

"The plan is to use userfaultfd for concurrently compacting the heap. With
this feature, the heap can be shared-mapped at another location where the
GC-thread(s) could continue the compaction operation without the need to
invoke userfault ioctl(UFFDIO_COPY) each time. OTOH, if and when Java threads
get faults on the heap, UFFDIO_CONTINUE can be used to resume execution.
Furthermore, this feature enables updating references in the 'non-moving'
portion of the heap efficiently. Without this feature, uneccessary page
copying (ioctl(UFFDIO_COPY)) would be required."

[1] https://lore.kernel.org/patchwork/cover/1388144/
[2] https://lore.kernel.org/patchwork/patch/1408161/
[3] https://lore.kernel.org/linux-fsdevel/20210301222728.176417-1-axelrasmussen@google.com/T/#t

Axel Rasmussen (9):
  userfaultfd/hugetlbfs: avoid including userfaultfd_k.h in hugetlb.h
  userfaultfd/shmem: combine shmem_{mcopy_atomic,mfill_zeropage}_pte
  userfaultfd/shmem: support minor fault registration for shmem
  userfaultfd/shmem: support UFFDIO_CONTINUE for shmem
  userfaultfd/selftests: use memfd_create for shmem test type
  userfaultfd/selftests: create alias mappings in the shmem test
  userfaultfd/selftests: reinitialize test context in each test
  userfaultfd/selftests: exercise minor fault handling shmem support
  userfaultfd/shmem: modify shmem_mcopy_atomic_pte to use install_ptes

 fs/userfaultfd.c                         |   6 +-
 include/linux/hugetlb.h                  |   5 +-
 include/linux/shmem_fs.h                 |  15 +-
 include/linux/userfaultfd_k.h            |   5 +
 include/uapi/linux/userfaultfd.h         |   7 +-
 mm/hugetlb.c                             |   1 +
 mm/memory.c                              |   8 +-
 mm/shmem.c                               | 122 ++++------
 mm/userfaultfd.c                         | 183 ++++++++++-----
 tools/testing/selftests/vm/userfaultfd.c | 280 +++++++++++++++--------
 10 files changed, 387 insertions(+), 245 deletions(-)

--
2.31.1.295.g9ea45b61b8-goog


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/9] userfaultfd/hugetlbfs: avoid including userfaultfd_k.h in hugetlb.h
  2021-04-08 23:43 [PATCH 0/9] userfaultfd: add minor fault handling for shmem Axel Rasmussen
@ 2021-04-08 23:43 ` Axel Rasmussen
  2021-04-12 23:47   ` Peter Xu
  2021-04-08 23:43 ` [PATCH 2/9] userfaultfd/shmem: combine shmem_{mcopy_atomic,mfill_zeropage}_pte Axel Rasmussen
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 21+ messages in thread
From: Axel Rasmussen @ 2021-04-08 23:43 UTC (permalink / raw)
  To: Alexander Viro, Andrea Arcangeli, Andrew Morton,
	Daniel Colascione, Hugh Dickins, Jerome Glisse, Joe Perches,
	Lokesh Gidra, Mike Kravetz, Mike Rapoport, Peter Xu, Shaohua Li,
	Shuah Khan, Stephen Rothwell, Wang Qing
  Cc: linux-api, linux-fsdevel, linux-kernel, linux-kselftest,
	linux-mm, Axel Rasmussen, Brian Geffon, Dr . David Alan Gilbert,
	Mina Almasry, Oliver Upton

Minimizing header file inclusion is desirable. In this case, we can do
so just by forward declaring the enumeration our signature relies upon.

Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 include/linux/hugetlb.h | 5 ++++-
 mm/hugetlb.c            | 1 +
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 1d3246b31a41..dfb749eaf348 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -11,7 +11,6 @@
 #include <linux/kref.h>
 #include <linux/pgtable.h>
 #include <linux/gfp.h>
-#include <linux/userfaultfd_k.h>
 
 struct ctl_table;
 struct user_struct;
@@ -136,6 +135,8 @@ unsigned long hugetlb_total_pages(void);
 vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, unsigned int flags);
 #ifdef CONFIG_USERFAULTFD
+enum mcopy_atomic_mode;
+
 int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
 				struct vm_area_struct *dst_vma,
 				unsigned long dst_addr,
@@ -315,6 +316,8 @@ static inline void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 }
 
 #ifdef CONFIG_USERFAULTFD
+enum mcopy_atomic_mode;
+
 static inline int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 						pte_t *dst_pte,
 						struct vm_area_struct *dst_vma,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9973dec4976c..3b93bbf8c80f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -40,6 +40,7 @@
 #include <linux/hugetlb_cgroup.h>
 #include <linux/node.h>
 #include <linux/page_owner.h>
+#include <linux/userfaultfd_k.h>
 #include "internal.h"
 
 int hugetlb_max_hstate __read_mostly;
-- 
2.31.1.295.g9ea45b61b8-goog


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 2/9] userfaultfd/shmem: combine shmem_{mcopy_atomic,mfill_zeropage}_pte
  2021-04-08 23:43 [PATCH 0/9] userfaultfd: add minor fault handling for shmem Axel Rasmussen
  2021-04-08 23:43 ` [PATCH 1/9] userfaultfd/hugetlbfs: avoid including userfaultfd_k.h in hugetlb.h Axel Rasmussen
@ 2021-04-08 23:43 ` Axel Rasmussen
  2021-04-12 23:57   ` Peter Xu
  2021-04-08 23:43 ` [PATCH 3/9] userfaultfd/shmem: support minor fault registration for shmem Axel Rasmussen
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 21+ messages in thread
From: Axel Rasmussen @ 2021-04-08 23:43 UTC (permalink / raw)
  To: Alexander Viro, Andrea Arcangeli, Andrew Morton,
	Daniel Colascione, Hugh Dickins, Jerome Glisse, Joe Perches,
	Lokesh Gidra, Mike Kravetz, Mike Rapoport, Peter Xu, Shaohua Li,
	Shuah Khan, Stephen Rothwell, Wang Qing
  Cc: linux-api, linux-fsdevel, linux-kernel, linux-kselftest,
	linux-mm, Axel Rasmussen, Brian Geffon, Dr . David Alan Gilbert,
	Mina Almasry, Oliver Upton

Previously, we did a dance where we had one calling path in
userfaultfd.c (mfill_atomic_pte), but then we split it into two in
shmem_fs.h (shmem_{mcopy_atomic,mfill_zeropage}_pte), and then rejoined
into a single shared function in shmem.c (shmem_mfill_atomic_pte).

This is all a bit overly complex. Just call the single combined shmem
function directly, allowing us to clean up various branches,
boilerplate, etc.

While we're touching this function, two other small cleanup changes:
- offset is equivalent to pgoff, so we can get rid of offset entirely.
- Split two VM_BUG_ON cases into two statements. This means the line
  number reported when the BUG is hit specifies exactly which condition
  was true.

Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 include/linux/shmem_fs.h | 15 +++++-------
 mm/shmem.c               | 52 +++++++++++++---------------------------
 mm/userfaultfd.c         | 10 +++-----
 3 files changed, 25 insertions(+), 52 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index d82b6f396588..919e36671fe6 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -122,21 +122,18 @@ static inline bool shmem_file(struct file *file)
 extern bool shmem_charge(struct inode *inode, long pages);
 extern void shmem_uncharge(struct inode *inode, long pages);
 
+#ifdef CONFIG_USERFAULTFD
 #ifdef CONFIG_SHMEM
 extern int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 				  struct vm_area_struct *dst_vma,
 				  unsigned long dst_addr,
 				  unsigned long src_addr,
+				  bool zeropage,
 				  struct page **pagep);
-extern int shmem_mfill_zeropage_pte(struct mm_struct *dst_mm,
-				    pmd_t *dst_pmd,
-				    struct vm_area_struct *dst_vma,
-				    unsigned long dst_addr);
-#else
+#else /* !CONFIG_SHMEM */
 #define shmem_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma, dst_addr, \
-			       src_addr, pagep)        ({ BUG(); 0; })
-#define shmem_mfill_zeropage_pte(dst_mm, dst_pmd, dst_vma, \
-				 dst_addr)      ({ BUG(); 0; })
-#endif
+			       src_addr, zeropage, pagep)       ({ BUG(); 0; })
+#endif /* CONFIG_SHMEM */
+#endif /* CONFIG_USERFAULTFD */
 
 #endif
diff --git a/mm/shmem.c b/mm/shmem.c
index b2db4ed0fbc7..c21f20cc4204 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2354,13 +2354,14 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
 	return inode;
 }
 
-static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
-				  pmd_t *dst_pmd,
-				  struct vm_area_struct *dst_vma,
-				  unsigned long dst_addr,
-				  unsigned long src_addr,
-				  bool zeropage,
-				  struct page **pagep)
+#ifdef CONFIG_USERFAULTFD
+int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
+			   pmd_t *dst_pmd,
+			   struct vm_area_struct *dst_vma,
+			   unsigned long dst_addr,
+			   unsigned long src_addr,
+			   bool zeropage,
+			   struct page **pagep)
 {
 	struct inode *inode = file_inode(dst_vma->vm_file);
 	struct shmem_inode_info *info = SHMEM_I(inode);
@@ -2372,7 +2373,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 	struct page *page;
 	pte_t _dst_pte, *dst_pte;
 	int ret;
-	pgoff_t offset, max_off;
+	pgoff_t max_off;
 
 	ret = -ENOMEM;
 	if (!shmem_inode_acct_block(inode, 1))
@@ -2383,7 +2384,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 		if (!page)
 			goto out_unacct_blocks;
 
-		if (!zeropage) {	/* mcopy_atomic */
+		if (!zeropage) {	/* COPY */
 			page_kaddr = kmap_atomic(page);
 			ret = copy_from_user(page_kaddr,
 					     (const void __user *)src_addr,
@@ -2397,7 +2398,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 				/* don't free the page */
 				return -ENOENT;
 			}
-		} else {		/* mfill_zeropage_atomic */
+		} else {		/* ZEROPAGE */
 			clear_highpage(page);
 		}
 	} else {
@@ -2405,15 +2406,15 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 		*pagep = NULL;
 	}
 
-	VM_BUG_ON(PageLocked(page) || PageSwapBacked(page));
+	VM_BUG_ON(PageLocked(page));
+	VM_BUG_ON(PageSwapBacked(page));
 	__SetPageLocked(page);
 	__SetPageSwapBacked(page);
 	__SetPageUptodate(page);
 
 	ret = -EFAULT;
-	offset = linear_page_index(dst_vma, dst_addr);
 	max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
-	if (unlikely(offset >= max_off))
+	if (unlikely(pgoff >= max_off))
 		goto out_release;
 
 	ret = shmem_add_to_page_cache(page, mapping, pgoff, NULL,
@@ -2439,7 +2440,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 
 	ret = -EFAULT;
 	max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
-	if (unlikely(offset >= max_off))
+	if (unlikely(pgoff >= max_off))
 		goto out_release_unlock;
 
 	ret = -EEXIST;
@@ -2476,28 +2477,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 	shmem_inode_unacct_blocks(inode, 1);
 	goto out;
 }
-
-int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
-			   pmd_t *dst_pmd,
-			   struct vm_area_struct *dst_vma,
-			   unsigned long dst_addr,
-			   unsigned long src_addr,
-			   struct page **pagep)
-{
-	return shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma,
-				      dst_addr, src_addr, false, pagep);
-}
-
-int shmem_mfill_zeropage_pte(struct mm_struct *dst_mm,
-			     pmd_t *dst_pmd,
-			     struct vm_area_struct *dst_vma,
-			     unsigned long dst_addr)
-{
-	struct page *page = NULL;
-
-	return shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma,
-				      dst_addr, 0, true, &page);
-}
+#endif /* CONFIG_USERFAULTFD */
 
 #ifdef CONFIG_TMPFS
 static const struct inode_operations shmem_symlink_inode_operations;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index ce6cb4760d2c..60ae22207761 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -440,13 +440,9 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 						 dst_vma, dst_addr);
 	} else {
 		VM_WARN_ON_ONCE(wp_copy);
-		if (!zeropage)
-			err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd,
-						     dst_vma, dst_addr,
-						     src_addr, page);
-		else
-			err = shmem_mfill_zeropage_pte(dst_mm, dst_pmd,
-						       dst_vma, dst_addr);
+		err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
+					     dst_addr, src_addr, zeropage,
+					     page);
 	}
 
 	return err;
-- 
2.31.1.295.g9ea45b61b8-goog


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 3/9] userfaultfd/shmem: support minor fault registration for shmem
  2021-04-08 23:43 [PATCH 0/9] userfaultfd: add minor fault handling for shmem Axel Rasmussen
  2021-04-08 23:43 ` [PATCH 1/9] userfaultfd/hugetlbfs: avoid including userfaultfd_k.h in hugetlb.h Axel Rasmussen
  2021-04-08 23:43 ` [PATCH 2/9] userfaultfd/shmem: combine shmem_{mcopy_atomic,mfill_zeropage}_pte Axel Rasmussen
@ 2021-04-08 23:43 ` Axel Rasmussen
  2021-04-09 16:50   ` Peter Xu
  2021-04-08 23:43 ` [PATCH 4/9] userfaultfd/shmem: support UFFDIO_CONTINUE " Axel Rasmussen
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 21+ messages in thread
From: Axel Rasmussen @ 2021-04-08 23:43 UTC (permalink / raw)
  To: Alexander Viro, Andrea Arcangeli, Andrew Morton,
	Daniel Colascione, Hugh Dickins, Jerome Glisse, Joe Perches,
	Lokesh Gidra, Mike Kravetz, Mike Rapoport, Peter Xu, Shaohua Li,
	Shuah Khan, Stephen Rothwell, Wang Qing
  Cc: linux-api, linux-fsdevel, linux-kernel, linux-kselftest,
	linux-mm, Axel Rasmussen, Brian Geffon, Dr . David Alan Gilbert,
	Mina Almasry, Oliver Upton

This patch allows shmem-backed VMAs to be registered for minor faults.
Minor faults are appropriately relayed to userspace in the fault path,
for VMAs with the relevant flag.

This commit doesn't hook up the UFFDIO_CONTINUE ioctl for shmem-backed
minor faults, though, so userspace doesn't yet have a way to resolve
such faults.

Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 fs/userfaultfd.c                 |  6 +++---
 include/uapi/linux/userfaultfd.h |  7 ++++++-
 mm/memory.c                      |  8 +++++---
 mm/shmem.c                       | 20 ++++++++++++++++----
 4 files changed, 30 insertions(+), 11 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 14f92285d04f..9f3b8684cf3c 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1267,8 +1267,7 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
 	}
 
 	if (vm_flags & VM_UFFD_MINOR) {
-		/* FIXME: Add minor fault interception for shmem. */
-		if (!is_vm_hugetlb_page(vma))
+		if (!(is_vm_hugetlb_page(vma) || vma_is_shmem(vma)))
 			return false;
 	}
 
@@ -1941,7 +1940,8 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
 	/* report all available features and ioctls to userland */
 	uffdio_api.features = UFFD_API_FEATURES;
 #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
-	uffdio_api.features &= ~UFFD_FEATURE_MINOR_HUGETLBFS;
+	uffdio_api.features &=
+		~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM);
 #endif
 	uffdio_api.ioctls = UFFD_API_IOCTLS;
 	ret = -EFAULT;
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index bafbeb1a2624..159a74e9564f 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -31,7 +31,8 @@
 			   UFFD_FEATURE_MISSING_SHMEM |		\
 			   UFFD_FEATURE_SIGBUS |		\
 			   UFFD_FEATURE_THREAD_ID |		\
-			   UFFD_FEATURE_MINOR_HUGETLBFS)
+			   UFFD_FEATURE_MINOR_HUGETLBFS |	\
+			   UFFD_FEATURE_MINOR_SHMEM)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -185,6 +186,9 @@ struct uffdio_api {
 	 * UFFD_FEATURE_MINOR_HUGETLBFS indicates that minor faults
 	 * can be intercepted (via REGISTER_MODE_MINOR) for
 	 * hugetlbfs-backed pages.
+	 *
+	 * UFFD_FEATURE_MINOR_SHMEM indicates the same support as
+	 * UFFD_FEATURE_MINOR_HUGETLBFS, but for shmem-backed pages instead.
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
@@ -196,6 +200,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_SIGBUS			(1<<7)
 #define UFFD_FEATURE_THREAD_ID			(1<<8)
 #define UFFD_FEATURE_MINOR_HUGETLBFS		(1<<9)
+#define UFFD_FEATURE_MINOR_SHMEM		(1<<10)
 	__u64 features;
 
 	__u64 ioctls;
diff --git a/mm/memory.c b/mm/memory.c
index c8e357627318..a1e5ff55027e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3929,9 +3929,11 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
 	 * something).
 	 */
 	if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
-		ret = do_fault_around(vmf);
-		if (ret)
-			return ret;
+		if (likely(!userfaultfd_minor(vmf->vma))) {
+			ret = do_fault_around(vmf);
+			if (ret)
+				return ret;
+		}
 	}
 
 	ret = __do_fault(vmf);
diff --git a/mm/shmem.c b/mm/shmem.c
index c21f20cc4204..99c54b165c16 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1785,7 +1785,7 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
  * vm. If we swap it in we mark it dirty since we also free the swap
  * entry since a page cannot live in both the swap and page cache.
  *
- * vmf and fault_type are only supplied by shmem_fault:
+ * vma, vmf, and fault_type are only supplied by shmem_fault:
  * otherwise they are NULL.
  */
 static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
@@ -1802,6 +1802,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 	pgoff_t hindex = index;
 	gfp_t huge_gfp;
 	int error;
+	bool swapped;
 	int once = 0;
 	int alloced = 0;
 
@@ -1820,16 +1821,27 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 
 	page = pagecache_get_page(mapping, index,
 					FGP_ENTRY | FGP_HEAD | FGP_LOCK, 0);
-	if (xa_is_value(page)) {
+	swapped = xa_is_value(page);
+	if (swapped) {
 		error = shmem_swapin_page(inode, index, &page,
 					  sgp, gfp, vma, fault_type);
 		if (error == -EEXIST)
 			goto repeat;
-
 		*pagep = page;
-		return error;
+		if (error)
+			return error;
 	}
 
+	if (page && vma && userfaultfd_minor(vma)) {
+		unlock_page(page);
+		put_page(page);
+		*fault_type = handle_userfault(vmf, VM_UFFD_MINOR);
+		return 0;
+	}
+
+	if (swapped)
+		return 0;
+
 	if (page)
 		hindex = page->index;
 	if (page && sgp == SGP_WRITE)
-- 
2.31.1.295.g9ea45b61b8-goog


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 4/9] userfaultfd/shmem: support UFFDIO_CONTINUE for shmem
  2021-04-08 23:43 [PATCH 0/9] userfaultfd: add minor fault handling for shmem Axel Rasmussen
                   ` (2 preceding siblings ...)
  2021-04-08 23:43 ` [PATCH 3/9] userfaultfd/shmem: support minor fault registration for shmem Axel Rasmussen
@ 2021-04-08 23:43 ` Axel Rasmussen
  2021-04-12 23:17   ` Peter Xu
  2021-04-08 23:43 ` [PATCH 5/9] userfaultfd/selftests: use memfd_create for shmem test type Axel Rasmussen
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 21+ messages in thread
From: Axel Rasmussen @ 2021-04-08 23:43 UTC (permalink / raw)
  To: Alexander Viro, Andrea Arcangeli, Andrew Morton,
	Daniel Colascione, Hugh Dickins, Jerome Glisse, Joe Perches,
	Lokesh Gidra, Mike Kravetz, Mike Rapoport, Peter Xu, Shaohua Li,
	Shuah Khan, Stephen Rothwell, Wang Qing
  Cc: linux-api, linux-fsdevel, linux-kernel, linux-kselftest,
	linux-mm, Axel Rasmussen, Brian Geffon, Dr . David Alan Gilbert,
	Mina Almasry, Oliver Upton

With this change, userspace can resolve a minor fault within a
shmem-backed area with a UFFDIO_CONTINUE ioctl. The semantics for this
match those for hugetlbfs - we look up the existing page in the page
cache, and install PTEs for it.

This commit introduces a new helper: mcopy_atomic_install_ptes.

Why handle UFFDIO_CONTINUE for shmem in mm/userfaultfd.c, instead of in
shmem.c? The existing userfault implementation only relies on shmem.c
for VM_SHARED VMAs. However, minor fault handling / CONTINUE work just
fine for !VM_SHARED VMAs as well. We'd prefer to handle CONTINUE for
shmem in one place, regardless of shared/private (to reduce code
duplication).

Why add a new mcopy_atomic_install_ptes helper? A problem we have with
continue is that shmem_mcopy_atomic_pte() and mcopy_atomic_pte() are
*close* to what we want, but not exactly. We do want to setup the PTEs
in a CONTINUE operation, but we don't want to e.g. allocate a new page,
charge it (e.g. to the shmem inode), manipulate various flags, etc. Also
we have the problem stated above: shmem_mcopy_atomic_pte() and
mcopy_atomic_pte() both handle one-half of the problem (shared /
private) continue cares about. So, introduce mcontinue_atomic_pte(), to
handle all of the shmem continue cases. Introduce the helper so it
doesn't duplicate code with mcopy_atomic_pte().

In a future commit, shmem_mcopy_atomic_pte() will also be modified to
use this new helper. However, since this is a bigger refactor, it seems
most clear to do it as a separate change.

Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 mm/userfaultfd.c | 176 +++++++++++++++++++++++++++++++++++------------
 1 file changed, 131 insertions(+), 45 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 60ae22207761..a539fe18b9a7 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -48,6 +48,87 @@ struct vm_area_struct *find_dst_vma(struct mm_struct *dst_mm,
 	return dst_vma;
 }
 
+/*
+ * Install PTEs, to map dst_addr (within dst_vma) to page.
+ *
+ * This function handles MCOPY_ATOMIC_CONTINUE (which is always file-backed),
+ * whether or not dst_vma is VM_SHARED. It also handles the more general
+ * MCOPY_ATOMIC_NORMAL case, when dst_vma is *not* VM_SHARED (it may be file
+ * backed, or not).
+ *
+ * Note that MCOPY_ATOMIC_NORMAL for a VM_SHARED dst_vma is handled by
+ * shmem_mcopy_atomic_pte instead.
+ */
+static int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd,
+				     struct vm_area_struct *dst_vma,
+				     unsigned long dst_addr, struct page *page,
+				     bool newly_allocated, bool wp_copy)
+{
+	int ret;
+	pte_t _dst_pte, *dst_pte;
+	int writable;
+	bool vm_shared = dst_vma->vm_flags & VM_SHARED;
+	spinlock_t *ptl;
+	struct inode *inode;
+	pgoff_t offset, max_off;
+
+	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
+	writable = dst_vma->vm_flags & VM_WRITE;
+	/* For private, non-anon we need CoW (don't write to page cache!) */
+	if (!vma_is_anonymous(dst_vma) && !vm_shared)
+		writable = 0;
+
+	if (writable || vma_is_anonymous(dst_vma))
+		_dst_pte = pte_mkdirty(_dst_pte);
+	if (writable) {
+		if (wp_copy)
+			_dst_pte = pte_mkuffd_wp(_dst_pte);
+		else
+			_dst_pte = pte_mkwrite(_dst_pte);
+	} else if (vm_shared) {
+		/*
+		 * Since we didn't pte_mkdirty(), mark the page dirty or it
+		 * could be freed from under us. We could do this
+		 * unconditionally, but doing it only if !writable is faster.
+		 */
+		set_page_dirty(page);
+	}
+
+	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+
+	if (vma_is_shmem(dst_vma)) {
+		/* The shmem MAP_PRIVATE case requires checking the i_size */
+		inode = dst_vma->vm_file->f_inode;
+		offset = linear_page_index(dst_vma, dst_addr);
+		max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
+		ret = -EFAULT;
+		if (unlikely(offset >= max_off))
+			goto out_unlock;
+	}
+
+	ret = -EEXIST;
+	if (!pte_none(*dst_pte))
+		goto out_unlock;
+
+	inc_mm_counter(dst_mm, mm_counter(page));
+	if (vma_is_shmem(dst_vma))
+		page_add_file_rmap(page, false);
+	else
+		page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
+
+	if (newly_allocated)
+		lru_cache_add_inactive_or_unevictable(page, dst_vma);
+
+	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+
+	/* No need to invalidate - it was non-present before */
+	update_mmu_cache(dst_vma, dst_addr, dst_pte);
+	ret = 0;
+out_unlock:
+	pte_unmap_unlock(dst_pte, ptl);
+	return ret;
+}
+
 static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 			    pmd_t *dst_pmd,
 			    struct vm_area_struct *dst_vma,
@@ -56,13 +137,9 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 			    struct page **pagep,
 			    bool wp_copy)
 {
-	pte_t _dst_pte, *dst_pte;
-	spinlock_t *ptl;
 	void *page_kaddr;
 	int ret;
 	struct page *page;
-	pgoff_t offset, max_off;
-	struct inode *inode;
 
 	if (!*pagep) {
 		ret = -ENOMEM;
@@ -99,43 +176,12 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 	if (mem_cgroup_charge(page, dst_mm, GFP_KERNEL))
 		goto out_release;
 
-	_dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
-	if (dst_vma->vm_flags & VM_WRITE) {
-		if (wp_copy)
-			_dst_pte = pte_mkuffd_wp(_dst_pte);
-		else
-			_dst_pte = pte_mkwrite(_dst_pte);
-	}
-
-	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
-	if (dst_vma->vm_file) {
-		/* the shmem MAP_PRIVATE case requires checking the i_size */
-		inode = dst_vma->vm_file->f_inode;
-		offset = linear_page_index(dst_vma, dst_addr);
-		max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
-		ret = -EFAULT;
-		if (unlikely(offset >= max_off))
-			goto out_release_uncharge_unlock;
-	}
-	ret = -EEXIST;
-	if (!pte_none(*dst_pte))
-		goto out_release_uncharge_unlock;
-
-	inc_mm_counter(dst_mm, MM_ANONPAGES);
-	page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
-	lru_cache_add_inactive_or_unevictable(page, dst_vma);
-
-	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
-
-	/* No need to invalidate - it was non-present before */
-	update_mmu_cache(dst_vma, dst_addr, dst_pte);
-
-	pte_unmap_unlock(dst_pte, ptl);
-	ret = 0;
+	ret = mcopy_atomic_install_ptes(dst_mm, dst_pmd, dst_vma, dst_addr,
+					page, true, wp_copy);
+	if (ret)
+		goto out_release;
 out:
 	return ret;
-out_release_uncharge_unlock:
-	pte_unmap_unlock(dst_pte, ptl);
 out_release:
 	put_page(page);
 	goto out;
@@ -176,6 +222,41 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm,
 	return ret;
 }
 
+/* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */
+static int mcontinue_atomic_pte(struct mm_struct *dst_mm,
+				pmd_t *dst_pmd,
+				struct vm_area_struct *dst_vma,
+				unsigned long dst_addr,
+				bool wp_copy)
+{
+	struct inode *inode = file_inode(dst_vma->vm_file);
+	pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
+	struct page *page;
+	int ret;
+
+	ret = shmem_getpage(inode, pgoff, &page, SGP_READ);
+	if (ret)
+		goto out;
+	if (!page) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	ret = mcopy_atomic_install_ptes(dst_mm, dst_pmd, dst_vma, dst_addr,
+					page, false, wp_copy);
+	if (ret)
+		goto out_release;
+
+	unlock_page(page);
+	ret = 0;
+out:
+	return ret;
+out_release:
+	unlock_page(page);
+	put_page(page);
+	goto out;
+}
+
 static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
 {
 	pgd_t *pgd;
@@ -415,11 +496,16 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 						unsigned long dst_addr,
 						unsigned long src_addr,
 						struct page **page,
-						bool zeropage,
+						enum mcopy_atomic_mode mode,
 						bool wp_copy)
 {
 	ssize_t err;
 
+	if (mode == MCOPY_ATOMIC_CONTINUE) {
+		return mcontinue_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
+					    wp_copy);
+	}
+
 	/*
 	 * The normal page fault path for a shmem will invoke the
 	 * fault, fill the hole in the file and COW it right away. The
@@ -431,7 +517,7 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 	 * and not in the radix tree.
 	 */
 	if (!(dst_vma->vm_flags & VM_SHARED)) {
-		if (!zeropage)
+		if (mode == MCOPY_ATOMIC_NORMAL)
 			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
 					       dst_addr, src_addr, page,
 					       wp_copy);
@@ -441,7 +527,8 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 	} else {
 		VM_WARN_ON_ONCE(wp_copy);
 		err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
-					     dst_addr, src_addr, zeropage,
+					     dst_addr, src_addr,
+					     mode != MCOPY_ATOMIC_NORMAL,
 					     page);
 	}
 
@@ -463,7 +550,6 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	long copied;
 	struct page *page;
 	bool wp_copy;
-	bool zeropage = (mcopy_mode == MCOPY_ATOMIC_ZEROPAGE);
 
 	/*
 	 * Sanitize the command parameters:
@@ -526,7 +612,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 
 	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
 		goto out_unlock;
-	if (mcopy_mode == MCOPY_ATOMIC_CONTINUE)
+	if (!vma_is_shmem(dst_vma) && mcopy_mode == MCOPY_ATOMIC_CONTINUE)
 		goto out_unlock;
 
 	/*
@@ -574,7 +660,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 		BUG_ON(pmd_trans_huge(*dst_pmd));
 
 		err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
-				       src_addr, &page, zeropage, wp_copy);
+				       src_addr, &page, mcopy_mode, wp_copy);
 		cond_resched();
 
 		if (unlikely(err == -ENOENT)) {
-- 
2.31.1.295.g9ea45b61b8-goog


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 5/9] userfaultfd/selftests: use memfd_create for shmem test type
  2021-04-08 23:43 [PATCH 0/9] userfaultfd: add minor fault handling for shmem Axel Rasmussen
                   ` (3 preceding siblings ...)
  2021-04-08 23:43 ` [PATCH 4/9] userfaultfd/shmem: support UFFDIO_CONTINUE " Axel Rasmussen
@ 2021-04-08 23:43 ` Axel Rasmussen
  2021-04-08 23:43 ` [PATCH 6/9] userfaultfd/selftests: create alias mappings in the shmem test Axel Rasmussen
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Axel Rasmussen @ 2021-04-08 23:43 UTC (permalink / raw)
  To: Alexander Viro, Andrea Arcangeli, Andrew Morton,
	Daniel Colascione, Hugh Dickins, Jerome Glisse, Joe Perches,
	Lokesh Gidra, Mike Kravetz, Mike Rapoport, Peter Xu, Shaohua Li,
	Shuah Khan, Stephen Rothwell, Wang Qing
  Cc: linux-api, linux-fsdevel, linux-kernel, linux-kselftest,
	linux-mm, Axel Rasmussen, Brian Geffon, Dr . David Alan Gilbert,
	Mina Almasry, Oliver Upton

This is a preparatory commit. In the future, we want to be able to setup
alias mappings for area_src and area_dst in the shmem test, like we do
in the hugetlb_shared test. With a VMA obtained via
mmap(MAP_ANONYMOUS | MAP_SHARED), it isn't clear how to do this.

So, mmap() with an fd, so we can create alias mappings. Use memfd_create
instead of actually passing in a tmpfs path like hugetlb does, since
it's more convenient / simpler to run, and works just as well.

Future commits will:

1. Setup the alias mappings.
2. Extend our tests to actually take advantage of this, to test new
   userfaultfd behavior being introduced in this series.

Also, a small fix in the area we're changing: when the hugetlb setup
fails in main(), pass in the right argv[] so we actually print out the
hugetlb file path.

Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 tools/testing/selftests/vm/userfaultfd.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 310fc617c383..b0af88b258d7 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -85,6 +85,7 @@ static bool test_uffdio_wp = false;
 static bool test_uffdio_minor = false;
 
 static bool map_shared;
+static int shm_fd;
 static int huge_fd;
 static char *huge_fd_off0;
 static unsigned long long *count_verify;
@@ -278,10 +279,13 @@ static void shmem_release_pages(char *rel_area)
 
 static void shmem_allocate_area(void **alloc_area)
 {
+	unsigned long offset =
+		alloc_area == (void **)&area_src ? 0 : nr_pages * page_size;
+
 	*alloc_area = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
-			   MAP_ANONYMOUS | MAP_SHARED, -1, 0);
+			   MAP_SHARED, shm_fd, offset);
 	if (*alloc_area == MAP_FAILED)
-		err("shared memory mmap failed");
+		err("mmap of memfd failed");
 }
 
 struct uffd_test_ops {
@@ -1446,6 +1450,16 @@ int main(int argc, char **argv)
 			err("Open of %s failed", argv[4]);
 		if (ftruncate(huge_fd, 0))
 			err("ftruncate %s to size 0 failed", argv[4]);
+	} else if (test_type == TEST_SHMEM) {
+		shm_fd = memfd_create(argv[0], 0);
+		if (shm_fd < 0)
+			err("memfd_create");
+		if (ftruncate(shm_fd, nr_pages * page_size * 2))
+			err("ftruncate");
+		if (fallocate(shm_fd,
+			      FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0,
+			      nr_pages * page_size * 2))
+			err("fallocate");
 	}
 	printf("nr_pages: %lu, nr_pages_per_cpu: %lu\n",
 	       nr_pages, nr_pages_per_cpu);
-- 
2.31.1.295.g9ea45b61b8-goog


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 6/9] userfaultfd/selftests: create alias mappings in the shmem test
  2021-04-08 23:43 [PATCH 0/9] userfaultfd: add minor fault handling for shmem Axel Rasmussen
                   ` (4 preceding siblings ...)
  2021-04-08 23:43 ` [PATCH 5/9] userfaultfd/selftests: use memfd_create for shmem test type Axel Rasmussen
@ 2021-04-08 23:43 ` Axel Rasmussen
  2021-04-08 23:43 ` [PATCH 7/9] userfaultfd/selftests: reinitialize test context in each test Axel Rasmussen
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Axel Rasmussen @ 2021-04-08 23:43 UTC (permalink / raw)
  To: Alexander Viro, Andrea Arcangeli, Andrew Morton,
	Daniel Colascione, Hugh Dickins, Jerome Glisse, Joe Perches,
	Lokesh Gidra, Mike Kravetz, Mike Rapoport, Peter Xu, Shaohua Li,
	Shuah Khan, Stephen Rothwell, Wang Qing
  Cc: linux-api, linux-fsdevel, linux-kernel, linux-kselftest,
	linux-mm, Axel Rasmussen, Brian Geffon, Dr . David Alan Gilbert,
	Mina Almasry, Oliver Upton

Previously, we just allocated two shm areas: area_src and area_dst. With
this commit, change this so we also allocate area_src_alias, and
area_dst_alias.

area_*_alias and area_* (respectively) point to the same underlying
physical pages, but are different VMAs. In a future commit in this
series, we'll leverage this setup to exercise minor fault handling
support for shmem, just like we do in the hugetlb_shared test.

Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 tools/testing/selftests/vm/userfaultfd.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index b0af88b258d7..4b49b2cf9819 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -279,13 +279,29 @@ static void shmem_release_pages(char *rel_area)
 
 static void shmem_allocate_area(void **alloc_area)
 {
-	unsigned long offset =
-		alloc_area == (void **)&area_src ? 0 : nr_pages * page_size;
+	void *area_alias = NULL;
+	bool is_src = alloc_area == (void **)&area_src;
+	unsigned long offset = is_src ? 0 : nr_pages * page_size;
 
 	*alloc_area = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
 			   MAP_SHARED, shm_fd, offset);
 	if (*alloc_area == MAP_FAILED)
 		err("mmap of memfd failed");
+
+	area_alias = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
+			  MAP_SHARED, shm_fd, offset);
+	if (area_alias == MAP_FAILED)
+		err("mmap of memfd alias failed");
+
+	if (is_src)
+		area_src_alias = area_alias;
+	else
+		area_dst_alias = area_alias;
+}
+
+static void shmem_alias_mapping(__u64 *start, size_t len, unsigned long offset)
+{
+	*start = (unsigned long)area_dst_alias + offset;
 }
 
 struct uffd_test_ops {
@@ -315,7 +331,7 @@ static struct uffd_test_ops shmem_uffd_test_ops = {
 	.expected_ioctls = SHMEM_EXPECTED_IOCTLS,
 	.allocate_area	= shmem_allocate_area,
 	.release_pages	= shmem_release_pages,
-	.alias_mapping = noop_alias_mapping,
+	.alias_mapping = shmem_alias_mapping,
 };
 
 static struct uffd_test_ops hugetlb_uffd_test_ops = {
-- 
2.31.1.295.g9ea45b61b8-goog


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 7/9] userfaultfd/selftests: reinitialize test context in each test
  2021-04-08 23:43 [PATCH 0/9] userfaultfd: add minor fault handling for shmem Axel Rasmussen
                   ` (5 preceding siblings ...)
  2021-04-08 23:43 ` [PATCH 6/9] userfaultfd/selftests: create alias mappings in the shmem test Axel Rasmussen
@ 2021-04-08 23:43 ` Axel Rasmussen
  2021-04-08 23:43 ` [PATCH 8/9] userfaultfd/selftests: exercise minor fault handling shmem support Axel Rasmussen
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Axel Rasmussen @ 2021-04-08 23:43 UTC (permalink / raw)
  To: Alexander Viro, Andrea Arcangeli, Andrew Morton,
	Daniel Colascione, Hugh Dickins, Jerome Glisse, Joe Perches,
	Lokesh Gidra, Mike Kravetz, Mike Rapoport, Peter Xu, Shaohua Li,
	Shuah Khan, Stephen Rothwell, Wang Qing
  Cc: linux-api, linux-fsdevel, linux-kernel, linux-kselftest,
	linux-mm, Axel Rasmussen, Brian Geffon, Dr . David Alan Gilbert,
	Mina Almasry, Oliver Upton

Currently, the context (fds, mmap-ed areas, etc.) are global. Each test
mutates this state in some way, in some cases really "clobbering it"
(e.g., the events test mremap-ing area_dst over the top of area_src, or
the minor faults tests overwriting the count_verify values in the test
areas). We run the tests in a particular order, each test is careful to
make the right assumptions about its starting state, etc.

But, this is fragile. It's better for a test's success or failure to not
depend on what some other prior test case did to the global state.

To that end, clear and reinitialize the test context at the start of
each test case, so whatever prior test cases did doesn't affect future
tests.

This is particularly relevant to this series because the events test's
mremap of area_dst screws up assumptions the minor fault test was
relying on. This wasn't a problem for hugetlb, as we don't mremap in
that case.

Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 tools/testing/selftests/vm/userfaultfd.c | 221 +++++++++++++----------
 1 file changed, 129 insertions(+), 92 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 4b49b2cf9819..9b032cfdc262 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -89,7 +89,8 @@ static int shm_fd;
 static int huge_fd;
 static char *huge_fd_off0;
 static unsigned long long *count_verify;
-static int uffd, uffd_flags, finished, *pipefd;
+static int uffd = -1;
+static int uffd_flags, finished, *pipefd;
 static char *area_src, *area_src_alias, *area_dst, *area_dst_alias;
 static char *zeropage;
 pthread_attr_t attr;
@@ -343,6 +344,124 @@ static struct uffd_test_ops hugetlb_uffd_test_ops = {
 
 static struct uffd_test_ops *uffd_test_ops;
 
+static int userfaultfd_open(uint64_t *features)
+{
+	struct uffdio_api uffdio_api;
+
+	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+	if (uffd < 0)
+		err("userfaultfd syscall not available in this kernel");
+	uffd_flags = fcntl(uffd, F_GETFD, NULL);
+
+	uffdio_api.api = UFFD_API;
+	uffdio_api.features = *features;
+	if (ioctl(uffd, UFFDIO_API, &uffdio_api))
+		err("UFFDIO_API failed.\nPlease make sure to "
+		    "run with either root or ptrace capability.");
+	if (uffdio_api.api != UFFD_API)
+		err("UFFDIO_API error: %" PRIu64, (uint64_t)uffdio_api.api);
+
+	*features = uffdio_api.features;
+	return 0;
+}
+
+static int uffd_test_ctx_init_ext(uint64_t *features)
+{
+	unsigned long nr, cpu;
+
+	uffd_test_ops->allocate_area((void **)&area_src);
+	if (!area_src)
+		return 1;
+	uffd_test_ops->allocate_area((void **)&area_dst);
+	if (!area_dst)
+		return 1;
+
+	if (uffd_test_ops->release_pages(area_src))
+		return 1;
+
+	if (uffd_test_ops->release_pages(area_dst))
+		return 1;
+
+	if (userfaultfd_open(features))
+		return 1;
+
+	count_verify = malloc(nr_pages * sizeof(unsigned long long));
+	if (!count_verify)
+		err("count_verify");
+
+	for (nr = 0; nr < nr_pages; nr++) {
+		*area_mutex(area_src, nr) =
+			(pthread_mutex_t)PTHREAD_MUTEX_INITIALIZER;
+		count_verify[nr] = *area_count(area_src, nr) = 1;
+		/*
+		 * In the transition between 255 to 256, powerpc will
+		 * read out of order in my_bcmp and see both bytes as
+		 * zero, so leave a placeholder below always non-zero
+		 * after the count, to avoid my_bcmp to trigger false
+		 * positives.
+		 */
+		*(area_count(area_src, nr) + 1) = 1;
+	}
+
+	pipefd = malloc(sizeof(int) * nr_cpus * 2);
+	if (!pipefd)
+		err("pipefd");
+	for (cpu = 0; cpu < nr_cpus; cpu++)
+		if (pipe2(&pipefd[cpu * 2], O_CLOEXEC | O_NONBLOCK))
+			err("pipe");
+
+	return 0;
+}
+
+static inline int uffd_test_ctx_init(uint64_t features)
+{
+	return uffd_test_ctx_init_ext(&features);
+}
+
+static inline int munmap_area(void **area)
+{
+	if (*area)
+		if (munmap(*area, nr_pages * page_size))
+			err("munmap");
+
+	*area = NULL;
+	return 0;
+}
+
+static int uffd_test_ctx_clear(void)
+{
+	int ret = 0;
+	size_t i;
+
+	if (pipefd) {
+		for (i = 0; i < nr_cpus * 2; ++i) {
+			if (close(pipefd[i]))
+				err("close pipefd");
+		}
+		free(pipefd);
+		pipefd = NULL;
+	}
+
+	if (count_verify) {
+		free(count_verify);
+		count_verify = NULL;
+	}
+
+	if (uffd != -1) {
+		if (close(uffd))
+			err("close uffd");
+		uffd = -1;
+	}
+
+	huge_fd_off0 = NULL;
+	ret |= munmap_area((void **)&area_src);
+	ret |= munmap_area((void **)&area_src_alias);
+	ret |= munmap_area((void **)&area_dst);
+	ret |= munmap_area((void **)&area_dst_alias);
+
+	return ret;
+}
+
 static int my_bcmp(char *str1, char *str2, size_t n)
 {
 	unsigned long i;
@@ -727,40 +846,6 @@ static int stress(struct uffd_stats *uffd_stats)
 	return 0;
 }
 
-static int userfaultfd_open_ext(uint64_t *features)
-{
-	struct uffdio_api uffdio_api;
-
-	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
-	if (uffd < 0) {
-		fprintf(stderr,
-			"userfaultfd syscall not available in this kernel\n");
-		return 1;
-	}
-	uffd_flags = fcntl(uffd, F_GETFD, NULL);
-
-	uffdio_api.api = UFFD_API;
-	uffdio_api.features = *features;
-	if (ioctl(uffd, UFFDIO_API, &uffdio_api)) {
-		fprintf(stderr, "UFFDIO_API failed.\nPlease make sure to "
-			"run with either root or ptrace capability.\n");
-		return 1;
-	}
-	if (uffdio_api.api != UFFD_API) {
-		fprintf(stderr, "UFFDIO_API error: %" PRIu64 "\n",
-			(uint64_t)uffdio_api.api);
-		return 1;
-	}
-
-	*features = uffdio_api.features;
-	return 0;
-}
-
-static int userfaultfd_open(uint64_t features)
-{
-	return userfaultfd_open_ext(&features);
-}
-
 sigjmp_buf jbuf, *sigbuf;
 
 static void sighndl(int sig, siginfo_t *siginfo, void *ptr)
@@ -869,6 +954,8 @@ static int faulting_process(int signal_test)
 			  MREMAP_MAYMOVE | MREMAP_FIXED, area_src);
 	if (area_dst == MAP_FAILED)
 		err("mremap");
+	/* Reset area_src since we just clobbered it */
+	area_src = NULL;
 
 	for (; nr < nr_pages; nr++) {
 		count = *area_count(area_dst, nr);
@@ -962,11 +1049,9 @@ static int userfaultfd_zeropage_test(void)
 	printf("testing UFFDIO_ZEROPAGE: ");
 	fflush(stdout);
 
-	if (uffd_test_ops->release_pages(area_dst))
+	if (uffd_test_ctx_clear() || uffd_test_ctx_init(0))
 		return 1;
 
-	if (userfaultfd_open(0))
-		return 1;
 	uffdio_register.range.start = (unsigned long) area_dst;
 	uffdio_register.range.len = nr_pages * page_size;
 	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
@@ -983,7 +1068,6 @@ static int userfaultfd_zeropage_test(void)
 		if (my_bcmp(area_dst, zeropage, page_size))
 			err("zeropage is not zero");
 
-	close(uffd);
 	printf("done.\n");
 	return 0;
 }
@@ -1001,13 +1085,11 @@ static int userfaultfd_events_test(void)
 	printf("testing events (fork, remap, remove): ");
 	fflush(stdout);
 
-	if (uffd_test_ops->release_pages(area_dst))
-		return 1;
-
 	features = UFFD_FEATURE_EVENT_FORK | UFFD_FEATURE_EVENT_REMAP |
 		UFFD_FEATURE_EVENT_REMOVE;
-	if (userfaultfd_open(features))
+	if (uffd_test_ctx_clear() || uffd_test_ctx_init(features))
 		return 1;
+
 	fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK);
 
 	uffdio_register.range.start = (unsigned long) area_dst;
@@ -1040,8 +1122,6 @@ static int userfaultfd_events_test(void)
 	if (pthread_join(uffd_mon, NULL))
 		return 1;
 
-	close(uffd);
-
 	uffd_stats_report(&stats, 1);
 
 	return stats.missing_faults != nr_pages;
@@ -1061,12 +1141,10 @@ static int userfaultfd_sig_test(void)
 	printf("testing signal delivery: ");
 	fflush(stdout);
 
-	if (uffd_test_ops->release_pages(area_dst))
-		return 1;
-
 	features = UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_SIGBUS;
-	if (userfaultfd_open(features))
+	if (uffd_test_ctx_clear() || uffd_test_ctx_init(features))
 		return 1;
+
 	fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK);
 
 	uffdio_register.range.start = (unsigned long) area_dst;
@@ -1128,10 +1206,7 @@ static int userfaultfd_minor_test(void)
 	printf("testing minor faults: ");
 	fflush(stdout);
 
-	if (uffd_test_ops->release_pages(area_dst))
-		return 1;
-
-	if (userfaultfd_open_ext(&features))
+	if (uffd_test_ctx_clear() || uffd_test_ctx_init_ext(&features))
 		return 1;
 	/* If kernel reports the feature isn't supported, skip the test. */
 	if (!(features & UFFD_FEATURE_MINOR_HUGETLBFS)) {
@@ -1186,8 +1261,6 @@ static int userfaultfd_minor_test(void)
 	if (pthread_join(uffd_mon, NULL))
 		return 1;
 
-	close(uffd);
-
 	uffd_stats_report(&stats, 1);
 
 	return stats.missing_faults != 0 || stats.minor_faults != nr_pages;
@@ -1199,44 +1272,11 @@ static int userfaultfd_stress(void)
 	char *tmp_area;
 	unsigned long nr;
 	struct uffdio_register uffdio_register;
-	unsigned long cpu;
 	struct uffd_stats uffd_stats[nr_cpus];
 
-	uffd_test_ops->allocate_area((void **)&area_src);
-	if (!area_src)
-		return 1;
-	uffd_test_ops->allocate_area((void **)&area_dst);
-	if (!area_dst)
-		return 1;
-
-	if (userfaultfd_open(0))
+	if (uffd_test_ctx_init(0))
 		return 1;
 
-	count_verify = malloc(nr_pages * sizeof(unsigned long long));
-	if (!count_verify)
-		err("count_verify");
-
-	for (nr = 0; nr < nr_pages; nr++) {
-		*area_mutex(area_src, nr) = (pthread_mutex_t)
-			PTHREAD_MUTEX_INITIALIZER;
-		count_verify[nr] = *area_count(area_src, nr) = 1;
-		/*
-		 * In the transition between 255 to 256, powerpc will
-		 * read out of order in my_bcmp and see both bytes as
-		 * zero, so leave a placeholder below always non-zero
-		 * after the count, to avoid my_bcmp to trigger false
-		 * positives.
-		 */
-		*(area_count(area_src, nr) + 1) = 1;
-	}
-
-	pipefd = malloc(sizeof(int) * nr_cpus * 2);
-	if (!pipefd)
-		err("pipefd");
-	for (cpu = 0; cpu < nr_cpus; cpu++)
-		if (pipe2(&pipefd[cpu*2], O_CLOEXEC | O_NONBLOCK))
-			err("pipe");
-
 	if (posix_memalign(&area, page_size, page_size))
 		err("out of memory");
 	zeropage = area;
@@ -1356,9 +1396,6 @@ static int userfaultfd_stress(void)
 		uffd_stats_report(uffd_stats, nr_cpus);
 	}
 
-	if (close(uffd))
-		err("close uffd");
-
 	return userfaultfd_zeropage_test() || userfaultfd_sig_test()
 		|| userfaultfd_events_test() || userfaultfd_minor_test();
 }
-- 
2.31.1.295.g9ea45b61b8-goog


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 8/9] userfaultfd/selftests: exercise minor fault handling shmem support
  2021-04-08 23:43 [PATCH 0/9] userfaultfd: add minor fault handling for shmem Axel Rasmussen
                   ` (6 preceding siblings ...)
  2021-04-08 23:43 ` [PATCH 7/9] userfaultfd/selftests: reinitialize test context in each test Axel Rasmussen
@ 2021-04-08 23:43 ` Axel Rasmussen
  2021-04-08 23:43 ` [PATCH 9/9] userfaultfd/shmem: modify shmem_mcopy_atomic_pte to use install_ptes Axel Rasmussen
  2021-04-09  5:04 ` [PATCH 0/9] userfaultfd: add minor fault handling for shmem Andrew Morton
  9 siblings, 0 replies; 21+ messages in thread
From: Axel Rasmussen @ 2021-04-08 23:43 UTC (permalink / raw)
  To: Alexander Viro, Andrea Arcangeli, Andrew Morton,
	Daniel Colascione, Hugh Dickins, Jerome Glisse, Joe Perches,
	Lokesh Gidra, Mike Kravetz, Mike Rapoport, Peter Xu, Shaohua Li,
	Shuah Khan, Stephen Rothwell, Wang Qing
  Cc: linux-api, linux-fsdevel, linux-kernel, linux-kselftest,
	linux-mm, Axel Rasmussen, Brian Geffon, Dr . David Alan Gilbert,
	Mina Almasry, Oliver Upton

Enable test_uffdio_minor for test_type == TEST_SHMEM, and modify the
test slightly to pass in / check for the right feature flags.

Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 tools/testing/selftests/vm/userfaultfd.c | 29 ++++++++++++++++++++----
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 9b032cfdc262..640d0a2d107d 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -488,6 +488,7 @@ static void wp_range(int ufd, __u64 start, __u64 len, bool wp)
 static void continue_range(int ufd, __u64 start, __u64 len)
 {
 	struct uffdio_continue req;
+	int ret;
 
 	req.range.start = start;
 	req.range.len = len;
@@ -496,6 +497,17 @@ static void continue_range(int ufd, __u64 start, __u64 len)
 	if (ioctl(ufd, UFFDIO_CONTINUE, &req))
 		err("UFFDIO_CONTINUE failed for address 0x%" PRIx64,
 		    (uint64_t)start);
+
+	/*
+	 * Error handling within the kernel for continue is subtly different
+	 * from copy or zeropage, so it may be a source of bugs. Trigger an
+	 * error (-EEXIST) on purpose, to verify doing so doesn't cause a BUG.
+	 */
+	req.mapped = 0;
+	ret = ioctl(ufd, UFFDIO_CONTINUE, &req);
+	if (ret >= 0 || req.mapped != -EEXIST)
+		err("failed to exercise UFFDIO_CONTINUE error handling, ret=%d, mapped=%" PRId64,
+		    ret, req.mapped);
 }
 
 static void *locking_thread(void *arg)
@@ -1198,7 +1210,7 @@ static int userfaultfd_minor_test(void)
 	void *expected_page;
 	char c;
 	struct uffd_stats stats = { 0 };
-	uint64_t features = UFFD_FEATURE_MINOR_HUGETLBFS;
+	uint64_t req_features, features_out;
 
 	if (!test_uffdio_minor)
 		return 0;
@@ -1206,10 +1218,18 @@ static int userfaultfd_minor_test(void)
 	printf("testing minor faults: ");
 	fflush(stdout);
 
-	if (uffd_test_ctx_clear() || uffd_test_ctx_init_ext(&features))
+	if (test_type == TEST_HUGETLB)
+		req_features = UFFD_FEATURE_MINOR_HUGETLBFS;
+	else if (test_type == TEST_SHMEM)
+		req_features = UFFD_FEATURE_MINOR_SHMEM;
+	else
+		return 1;
+
+	features_out = req_features;
+	if (uffd_test_ctx_clear() || uffd_test_ctx_init_ext(&features_out))
 		return 1;
-	/* If kernel reports the feature isn't supported, skip the test. */
-	if (!(features & UFFD_FEATURE_MINOR_HUGETLBFS)) {
+	/* If kernel reports required features aren't supported, skip test. */
+	if ((features_out & req_features) != req_features) {
 		printf("skipping test due to lack of feature support\n");
 		fflush(stdout);
 		return 0;
@@ -1444,6 +1464,7 @@ static void set_test_type(const char *type)
 		map_shared = true;
 		test_type = TEST_SHMEM;
 		uffd_test_ops = &shmem_uffd_test_ops;
+		test_uffdio_minor = true;
 	} else {
 		err("Unknown test type: %s", type);
 	}
-- 
2.31.1.295.g9ea45b61b8-goog


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 9/9] userfaultfd/shmem: modify shmem_mcopy_atomic_pte to use install_ptes
  2021-04-08 23:43 [PATCH 0/9] userfaultfd: add minor fault handling for shmem Axel Rasmussen
                   ` (7 preceding siblings ...)
  2021-04-08 23:43 ` [PATCH 8/9] userfaultfd/selftests: exercise minor fault handling shmem support Axel Rasmussen
@ 2021-04-08 23:43 ` Axel Rasmussen
  2021-04-09  5:04 ` [PATCH 0/9] userfaultfd: add minor fault handling for shmem Andrew Morton
  9 siblings, 0 replies; 21+ messages in thread
From: Axel Rasmussen @ 2021-04-08 23:43 UTC (permalink / raw)
  To: Alexander Viro, Andrea Arcangeli, Andrew Morton,
	Daniel Colascione, Hugh Dickins, Jerome Glisse, Joe Perches,
	Lokesh Gidra, Mike Kravetz, Mike Rapoport, Peter Xu, Shaohua Li,
	Shuah Khan, Stephen Rothwell, Wang Qing
  Cc: linux-api, linux-fsdevel, linux-kernel, linux-kselftest,
	linux-mm, Axel Rasmussen, Brian Geffon, Dr . David Alan Gilbert,
	Mina Almasry, Oliver Upton

In a previous commit, we added the mcopy_atomic_install_ptes() helper.
This helper does the job of setting up PTEs for an existing page, to map
it into a given VMA. It deals with both the anon and shmem cases, as
well as the shared and private cases.

In other words, shmem_mcopy_atomic_pte() duplicates a case it already
handles. So, expose it, and let shmem_mcopy_atomic_pte() use it
directly, to reduce code duplication.

This requires that we refactor shmem_mcopy_atomic-pte() a bit:

Instead of doing accounting (shmem_recalc_inode() et al) part-way
through the PTE setup, do it beforehand. This frees up
mcopy_atomic_install_ptes() from having to care about this accounting,
but it does mean we need to clean it up if we get a failure afterwards
(shmem_uncharge()).

We can *almost* use shmem_charge() to do this, reducing code
duplication. But, it does `inode->i_mapping->nrpages++`, which would
double-count since shmem_add_to_page_cache() also does this.

Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 include/linux/userfaultfd_k.h |  5 ++++
 mm/shmem.c                    | 52 +++++++----------------------------
 mm/userfaultfd.c              | 25 ++++++++---------
 3 files changed, 27 insertions(+), 55 deletions(-)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 794d1538b8ba..3e20bfa9ef80 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -53,6 +53,11 @@ enum mcopy_atomic_mode {
 	MCOPY_ATOMIC_CONTINUE,
 };
 
+extern int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd,
+				     struct vm_area_struct *dst_vma,
+				     unsigned long dst_addr, struct page *page,
+				     bool newly_allocated, bool wp_copy);
+
 extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
 			    unsigned long src_start, unsigned long len,
 			    bool *mmap_changing, __u64 mode);
diff --git a/mm/shmem.c b/mm/shmem.c
index 99c54b165c16..5d4b82e9bcb2 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2380,10 +2380,8 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	struct address_space *mapping = inode->i_mapping;
 	gfp_t gfp = mapping_gfp_mask(mapping);
 	pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
-	spinlock_t *ptl;
 	void *page_kaddr;
 	struct page *page;
-	pte_t _dst_pte, *dst_pte;
 	int ret;
 	pgoff_t max_off;
 
@@ -2393,8 +2391,10 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
 
 	if (!*pagep) {
 		page = shmem_alloc_page(gfp, info, pgoff);
-		if (!page)
-			goto out_unacct_blocks;
+		if (!page) {
+			shmem_inode_unacct_blocks(inode, 1);
+			goto out;
+		}
 
 		if (!zeropage) {	/* COPY */
 			page_kaddr = kmap_atomic(page);
@@ -2434,59 +2434,27 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	if (ret)
 		goto out_release;
 
-	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
-	if (dst_vma->vm_flags & VM_WRITE)
-		_dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
-	else {
-		/*
-		 * We don't set the pte dirty if the vma has no
-		 * VM_WRITE permission, so mark the page dirty or it
-		 * could be freed from under us. We could do it
-		 * unconditionally before unlock_page(), but doing it
-		 * only if VM_WRITE is not set is faster.
-		 */
-		set_page_dirty(page);
-	}
-
-	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
-
-	ret = -EFAULT;
-	max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
-	if (unlikely(pgoff >= max_off))
-		goto out_release_unlock;
-
-	ret = -EEXIST;
-	if (!pte_none(*dst_pte))
-		goto out_release_unlock;
-
-	lru_cache_add(page);
-
 	spin_lock_irq(&info->lock);
 	info->alloced++;
 	inode->i_blocks += BLOCKS_PER_PAGE;
 	shmem_recalc_inode(inode);
 	spin_unlock_irq(&info->lock);
 
-	inc_mm_counter(dst_mm, mm_counter_file(page));
-	page_add_file_rmap(page, false);
-	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+	ret = mcopy_atomic_install_ptes(dst_mm, dst_pmd, dst_vma, dst_addr,
+					page, true, false);
+	if (ret)
+		goto out_release_uncharge;
 
-	/* No need to invalidate - it was non-present before */
-	update_mmu_cache(dst_vma, dst_addr, dst_pte);
-	pte_unmap_unlock(dst_pte, ptl);
 	unlock_page(page);
 	ret = 0;
 out:
 	return ret;
-out_release_unlock:
-	pte_unmap_unlock(dst_pte, ptl);
-	ClearPageDirty(page);
+out_release_uncharge:
 	delete_from_page_cache(page);
+	shmem_uncharge(inode, 1);
 out_release:
 	unlock_page(page);
 	put_page(page);
-out_unacct_blocks:
-	shmem_inode_unacct_blocks(inode, 1);
 	goto out;
 }
 #endif /* CONFIG_USERFAULTFD */
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index a539fe18b9a7..8fc597782219 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -51,18 +51,13 @@ struct vm_area_struct *find_dst_vma(struct mm_struct *dst_mm,
 /*
  * Install PTEs, to map dst_addr (within dst_vma) to page.
  *
- * This function handles MCOPY_ATOMIC_CONTINUE (which is always file-backed),
- * whether or not dst_vma is VM_SHARED. It also handles the more general
- * MCOPY_ATOMIC_NORMAL case, when dst_vma is *not* VM_SHARED (it may be file
- * backed, or not).
- *
- * Note that MCOPY_ATOMIC_NORMAL for a VM_SHARED dst_vma is handled by
- * shmem_mcopy_atomic_pte instead.
+ * This function handles both MCOPY_ATOMIC_NORMAL and _CONTINUE for both shmem
+ * and anon, and for both shared and private VMAs.
  */
-static int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd,
-				     struct vm_area_struct *dst_vma,
-				     unsigned long dst_addr, struct page *page,
-				     bool newly_allocated, bool wp_copy)
+int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd,
+			      struct vm_area_struct *dst_vma,
+			      unsigned long dst_addr, struct page *page,
+			      bool newly_allocated, bool wp_copy)
 {
 	int ret;
 	pte_t _dst_pte, *dst_pte;
@@ -116,8 +111,12 @@ static int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	else
 		page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
 
-	if (newly_allocated)
-		lru_cache_add_inactive_or_unevictable(page, dst_vma);
+	if (newly_allocated) {
+		if (vma_is_shmem(dst_vma) && vm_shared)
+			lru_cache_add(page);
+		else
+			lru_cache_add_inactive_or_unevictable(page, dst_vma);
+	}
 
 	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
 
-- 
2.31.1.295.g9ea45b61b8-goog


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/9] userfaultfd: add minor fault handling for shmem
  2021-04-08 23:43 [PATCH 0/9] userfaultfd: add minor fault handling for shmem Axel Rasmussen
                   ` (8 preceding siblings ...)
  2021-04-08 23:43 ` [PATCH 9/9] userfaultfd/shmem: modify shmem_mcopy_atomic_pte to use install_ptes Axel Rasmussen
@ 2021-04-09  5:04 ` Andrew Morton
  2021-04-09 17:03   ` Axel Rasmussen
  9 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2021-04-09  5:04 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: Alexander Viro, Andrea Arcangeli, Daniel Colascione,
	Hugh Dickins, Jerome Glisse, Joe Perches, Lokesh Gidra,
	Mike Kravetz, Mike Rapoport, Peter Xu, Shaohua Li, Shuah Khan,
	Stephen Rothwell, Wang Qing, linux-api, linux-fsdevel,
	linux-kernel, linux-kselftest, linux-mm, Brian Geffon,
	Dr . David Alan Gilbert, Mina Almasry, Oliver Upton

On Thu,  8 Apr 2021 16:43:18 -0700 Axel Rasmussen <axelrasmussen@google.com> wrote:

> The idea is that it will apply cleanly to akpm's tree, *replacing* the following
> patches (i.e., drop these first, and then apply this series):
> 
> userfaultfd-support-minor-fault-handling-for-shmem.patch
> userfaultfd-support-minor-fault-handling-for-shmem-fix.patch
> userfaultfd-support-minor-fault-handling-for-shmem-fix-2.patch
> userfaultfd-support-minor-fault-handling-for-shmem-fix-3.patch
> userfaultfd-support-minor-fault-handling-for-shmem-fix-4.patch
> userfaultfd-selftests-use-memfd_create-for-shmem-test-type.patch
> userfaultfd-selftests-create-alias-mappings-in-the-shmem-test.patch
> userfaultfd-selftests-reinitialize-test-context-in-each-test.patch
> userfaultfd-selftests-exercise-minor-fault-handling-shmem-support.patch

Well.  the problem is,

> +	if (area_alias == MAP_FAILED)
> +		err("mmap of memfd alias failed");

`err' doesn't exist until eleventy patches later, in Peter's
"userfaultfd/selftests: unify error handling".  I got tired of (and
lost confidence in) replacing "err(...)" with "fprintf(stderr, ...);
exit(1)" everywhere then fixing up the fallout when Peter's patch came
along.  Shudder.

Sorry, all this material pretty clearly isn't going to make 5.12
(potentially nine days hence), so I shall drop all the userfaultfd
patches.  Let's take a fresh run at all of this after -rc1.


I have tentatively retained the first series:

userfaultfd-add-minor-fault-registration-mode.patch
userfaultfd-add-minor-fault-registration-mode-fix.patch
userfaultfd-disable-huge-pmd-sharing-for-minor-registered-vmas.patch
userfaultfd-hugetlbfs-only-compile-uffd-helpers-if-config-enabled.patch
userfaultfd-add-uffdio_continue-ioctl.patch
userfaultfd-update-documentation-to-describe-minor-fault-handling.patch
userfaultfd-selftests-add-test-exercising-minor-fault-handling.patch

but I don't believe they have had much testing standalone, without the
other userfaultfd patches present.  So I don't think it's smart to
upstream these in this cycle.  Or I could drop them so you and Peter
can have a clean shot at redoing the whole thing.  Please let me know.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/9] userfaultfd/shmem: support minor fault registration for shmem
  2021-04-08 23:43 ` [PATCH 3/9] userfaultfd/shmem: support minor fault registration for shmem Axel Rasmussen
@ 2021-04-09 16:50   ` Peter Xu
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2021-04-09 16:50 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: Alexander Viro, Andrea Arcangeli, Andrew Morton,
	Daniel Colascione, Hugh Dickins, Jerome Glisse, Joe Perches,
	Lokesh Gidra, Mike Kravetz, Mike Rapoport, Shaohua Li,
	Shuah Khan, Stephen Rothwell, Wang Qing, linux-api,
	linux-fsdevel, linux-kernel, linux-kselftest, linux-mm,
	Brian Geffon, Dr . David Alan Gilbert, Mina Almasry,
	Oliver Upton

On Thu, Apr 08, 2021 at 04:43:21PM -0700, Axel Rasmussen wrote:

[...]

> @@ -1820,16 +1821,27 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>  
>  	page = pagecache_get_page(mapping, index,
>  					FGP_ENTRY | FGP_HEAD | FGP_LOCK, 0);
> -	if (xa_is_value(page)) {
> +	swapped = xa_is_value(page);
> +	if (swapped) {
>  		error = shmem_swapin_page(inode, index, &page,
>  					  sgp, gfp, vma, fault_type);
>  		if (error == -EEXIST)
>  			goto repeat;
> -
>  		*pagep = page;
> -		return error;
> +		if (error)
> +			return error;
>  	}
>  
> +	if (page && vma && userfaultfd_minor(vma)) {
> +		unlock_page(page);
> +		put_page(page);
> +		*fault_type = handle_userfault(vmf, VM_UFFD_MINOR);
> +		return 0;
> +	}

If we need to consider swapping for UFFDIO_CONTINUE later (as Hugh pointed out
previously, which looks the right thing to do), it's indeed a bit awkward to
swapin here.  Maybe move this chunk to right after pagecache_get_page()
returns?  Then no need to touch the rest.

> +
> +	if (swapped)
> +		return 0;
> +
>  	if (page)
>  		hindex = page->index;
>  	if (page && sgp == SGP_WRITE)
> -- 
> 2.31.1.295.g9ea45b61b8-goog
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/9] userfaultfd: add minor fault handling for shmem
  2021-04-09  5:04 ` [PATCH 0/9] userfaultfd: add minor fault handling for shmem Andrew Morton
@ 2021-04-09 17:03   ` Axel Rasmussen
  2021-04-09 21:18     ` Peter Xu
  0 siblings, 1 reply; 21+ messages in thread
From: Axel Rasmussen @ 2021-04-09 17:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andrea Arcangeli, Daniel Colascione,
	Hugh Dickins, Jerome Glisse, Joe Perches, Lokesh Gidra,
	Mike Kravetz, Mike Rapoport, Peter Xu, Shaohua Li, Shuah Khan,
	Stephen Rothwell, Wang Qing, linux-api, linux-fsdevel, LKML,
	linux-kselftest, Linux MM, Brian Geffon, Dr . David Alan Gilbert,
	Mina Almasry, Oliver Upton

On Thu, Apr 8, 2021 at 10:04 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu,  8 Apr 2021 16:43:18 -0700 Axel Rasmussen <axelrasmussen@google.com> wrote:
>
> > The idea is that it will apply cleanly to akpm's tree, *replacing* the following
> > patches (i.e., drop these first, and then apply this series):
> >
> > userfaultfd-support-minor-fault-handling-for-shmem.patch
> > userfaultfd-support-minor-fault-handling-for-shmem-fix.patch
> > userfaultfd-support-minor-fault-handling-for-shmem-fix-2.patch
> > userfaultfd-support-minor-fault-handling-for-shmem-fix-3.patch
> > userfaultfd-support-minor-fault-handling-for-shmem-fix-4.patch
> > userfaultfd-selftests-use-memfd_create-for-shmem-test-type.patch
> > userfaultfd-selftests-create-alias-mappings-in-the-shmem-test.patch
> > userfaultfd-selftests-reinitialize-test-context-in-each-test.patch
> > userfaultfd-selftests-exercise-minor-fault-handling-shmem-support.patch
>
> Well.  the problem is,
>
> > +     if (area_alias == MAP_FAILED)
> > +             err("mmap of memfd alias failed");
>
> `err' doesn't exist until eleventy patches later, in Peter's
> "userfaultfd/selftests: unify error handling".  I got tired of (and
> lost confidence in) replacing "err(...)" with "fprintf(stderr, ...);
> exit(1)" everywhere then fixing up the fallout when Peter's patch came
> along.  Shudder.

Oof - sorry about that!

>
> Sorry, all this material pretty clearly isn't going to make 5.12
> (potentially nine days hence), so I shall drop all the userfaultfd
> patches.  Let's take a fresh run at all of this after -rc1.

That's okay, my understanding was already that it certainly wouldn't
be in the 5.12 release, but that we might be ready in time for 5.13.

>
>
> I have tentatively retained the first series:
>
> userfaultfd-add-minor-fault-registration-mode.patch
> userfaultfd-add-minor-fault-registration-mode-fix.patch
> userfaultfd-disable-huge-pmd-sharing-for-minor-registered-vmas.patch
> userfaultfd-hugetlbfs-only-compile-uffd-helpers-if-config-enabled.patch
> userfaultfd-add-uffdio_continue-ioctl.patch
> userfaultfd-update-documentation-to-describe-minor-fault-handling.patch
> userfaultfd-selftests-add-test-exercising-minor-fault-handling.patch
>
> but I don't believe they have had much testing standalone, without the
> other userfaultfd patches present.  So I don't think it's smart to
> upstream these in this cycle.  Or I could drop them so you and Peter
> can have a clean shot at redoing the whole thing.  Please let me know.

From my perspective, both Peter's error handling and the hugetlbfs
minor faulting patches are ready to go. (Peter's most importantly; we
should establish that as a base, and put all the burden on resolving
conflicts with it on us instead of you :).)

My memory was that Peter's patch was applied before my shmem series,
but it seems I was mistaken. So, maybe the best thing to do is to have
Peter send a version of it based on your tree, without the shmem
series? And then I'll resolve any conflicts in my tree?

It's true that we haven't tested the hugetlbfs minor faults patch
extensively *with the shmem one also applied*, but it has had more
thorough review than the shmem one at this point (e.g. by Mike
Kravetz), and they're rather separate code paths (I'd be surprised if
one breaks the other).

>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/9] userfaultfd: add minor fault handling for shmem
  2021-04-09 17:03   ` Axel Rasmussen
@ 2021-04-09 21:18     ` Peter Xu
  2021-04-09 22:16       ` Mike Kravetz
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Xu @ 2021-04-09 21:18 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: Andrew Morton, Alexander Viro, Andrea Arcangeli,
	Daniel Colascione, Hugh Dickins, Jerome Glisse, Joe Perches,
	Lokesh Gidra, Mike Kravetz, Mike Rapoport, Shaohua Li,
	Shuah Khan, Stephen Rothwell, Wang Qing, linux-api,
	linux-fsdevel, LKML, linux-kselftest, Linux MM, Brian Geffon,
	Dr . David Alan Gilbert, Mina Almasry, Oliver Upton

On Fri, Apr 09, 2021 at 10:03:53AM -0700, Axel Rasmussen wrote:
> On Thu, Apr 8, 2021 at 10:04 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Thu,  8 Apr 2021 16:43:18 -0700 Axel Rasmussen <axelrasmussen@google.com> wrote:
> >
> > > The idea is that it will apply cleanly to akpm's tree, *replacing* the following
> > > patches (i.e., drop these first, and then apply this series):
> > >
> > > userfaultfd-support-minor-fault-handling-for-shmem.patch
> > > userfaultfd-support-minor-fault-handling-for-shmem-fix.patch
> > > userfaultfd-support-minor-fault-handling-for-shmem-fix-2.patch
> > > userfaultfd-support-minor-fault-handling-for-shmem-fix-3.patch
> > > userfaultfd-support-minor-fault-handling-for-shmem-fix-4.patch
> > > userfaultfd-selftests-use-memfd_create-for-shmem-test-type.patch
> > > userfaultfd-selftests-create-alias-mappings-in-the-shmem-test.patch
> > > userfaultfd-selftests-reinitialize-test-context-in-each-test.patch
> > > userfaultfd-selftests-exercise-minor-fault-handling-shmem-support.patch
> >
> > Well.  the problem is,
> >
> > > +     if (area_alias == MAP_FAILED)
> > > +             err("mmap of memfd alias failed");
> >
> > `err' doesn't exist until eleventy patches later, in Peter's
> > "userfaultfd/selftests: unify error handling".  I got tired of (and
> > lost confidence in) replacing "err(...)" with "fprintf(stderr, ...);
> > exit(1)" everywhere then fixing up the fallout when Peter's patch came
> > along.  Shudder.
> 
> Oof - sorry about that!
> 
> >
> > Sorry, all this material pretty clearly isn't going to make 5.12
> > (potentially nine days hence), so I shall drop all the userfaultfd
> > patches.  Let's take a fresh run at all of this after -rc1.
> 
> That's okay, my understanding was already that it certainly wouldn't
> be in the 5.12 release, but that we might be ready in time for 5.13.
> 
> >
> >
> > I have tentatively retained the first series:
> >
> > userfaultfd-add-minor-fault-registration-mode.patch
> > userfaultfd-add-minor-fault-registration-mode-fix.patch
> > userfaultfd-disable-huge-pmd-sharing-for-minor-registered-vmas.patch
> > userfaultfd-hugetlbfs-only-compile-uffd-helpers-if-config-enabled.patch
> > userfaultfd-add-uffdio_continue-ioctl.patch
> > userfaultfd-update-documentation-to-describe-minor-fault-handling.patch
> > userfaultfd-selftests-add-test-exercising-minor-fault-handling.patch
> >
> > but I don't believe they have had much testing standalone, without the
> > other userfaultfd patches present.  So I don't think it's smart to
> > upstream these in this cycle.  Or I could drop them so you and Peter
> > can have a clean shot at redoing the whole thing.  Please let me know.
> 
> From my perspective, both Peter's error handling and the hugetlbfs
> minor faulting patches are ready to go. (Peter's most importantly; we
> should establish that as a base, and put all the burden on resolving
> conflicts with it on us instead of you :).)
> 
> My memory was that Peter's patch was applied before my shmem series,
> but it seems I was mistaken. So, maybe the best thing to do is to have
> Peter send a version of it based on your tree, without the shmem
> series? And then I'll resolve any conflicts in my tree?
> 
> It's true that we haven't tested the hugetlbfs minor faults patch
> extensively *with the shmem one also applied*, but it has had more
> thorough review than the shmem one at this point (e.g. by Mike
> Kravetz), and they're rather separate code paths (I'd be surprised if
> one breaks the other).

Yes I think the hugetlb part should have got more review done.  IMHO it's a
matter of whether Mike would still like to do a more thorough review, or seems
okay to keep them.

I can repost the selftest series later if needed, as long as I figured which is
the suitable base commit.  Those selftest patches are definitely not urgent for
this release, so we can wait for the next release.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/9] userfaultfd: add minor fault handling for shmem
  2021-04-09 21:18     ` Peter Xu
@ 2021-04-09 22:16       ` Mike Kravetz
  0 siblings, 0 replies; 21+ messages in thread
From: Mike Kravetz @ 2021-04-09 22:16 UTC (permalink / raw)
  To: Peter Xu, Axel Rasmussen
  Cc: Andrew Morton, Alexander Viro, Andrea Arcangeli,
	Daniel Colascione, Hugh Dickins, Jerome Glisse, Joe Perches,
	Lokesh Gidra, Mike Rapoport, Shaohua Li, Shuah Khan,
	Stephen Rothwell, Wang Qing, linux-api, linux-fsdevel, LKML,
	linux-kselftest, Linux MM, Brian Geffon, Dr . David Alan Gilbert,
	Mina Almasry, Oliver Upton

On 4/9/21 2:18 PM, Peter Xu wrote:
> On Fri, Apr 09, 2021 at 10:03:53AM -0700, Axel Rasmussen wrote:
>> On Thu, Apr 8, 2021 at 10:04 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>>>
>>> On Thu,  8 Apr 2021 16:43:18 -0700 Axel Rasmussen <axelrasmussen@google.com> wrote:
>>>
>>>> The idea is that it will apply cleanly to akpm's tree, *replacing* the following
>>>> patches (i.e., drop these first, and then apply this series):
>>>>
>>>> userfaultfd-support-minor-fault-handling-for-shmem.patch
>>>> userfaultfd-support-minor-fault-handling-for-shmem-fix.patch
>>>> userfaultfd-support-minor-fault-handling-for-shmem-fix-2.patch
>>>> userfaultfd-support-minor-fault-handling-for-shmem-fix-3.patch
>>>> userfaultfd-support-minor-fault-handling-for-shmem-fix-4.patch
>>>> userfaultfd-selftests-use-memfd_create-for-shmem-test-type.patch
>>>> userfaultfd-selftests-create-alias-mappings-in-the-shmem-test.patch
>>>> userfaultfd-selftests-reinitialize-test-context-in-each-test.patch
>>>> userfaultfd-selftests-exercise-minor-fault-handling-shmem-support.patch
>>>
>>> Well.  the problem is,
>>>
>>>> +     if (area_alias == MAP_FAILED)
>>>> +             err("mmap of memfd alias failed");
>>>
>>> `err' doesn't exist until eleventy patches later, in Peter's
>>> "userfaultfd/selftests: unify error handling".  I got tired of (and
>>> lost confidence in) replacing "err(...)" with "fprintf(stderr, ...);
>>> exit(1)" everywhere then fixing up the fallout when Peter's patch came
>>> along.  Shudder.
>>
>> Oof - sorry about that!
>>
>>>
>>> Sorry, all this material pretty clearly isn't going to make 5.12
>>> (potentially nine days hence), so I shall drop all the userfaultfd
>>> patches.  Let's take a fresh run at all of this after -rc1.
>>
>> That's okay, my understanding was already that it certainly wouldn't
>> be in the 5.12 release, but that we might be ready in time for 5.13.
>>
>>>
>>>
>>> I have tentatively retained the first series:
>>>
>>> userfaultfd-add-minor-fault-registration-mode.patch
>>> userfaultfd-add-minor-fault-registration-mode-fix.patch
>>> userfaultfd-disable-huge-pmd-sharing-for-minor-registered-vmas.patch
>>> userfaultfd-hugetlbfs-only-compile-uffd-helpers-if-config-enabled.patch
>>> userfaultfd-add-uffdio_continue-ioctl.patch
>>> userfaultfd-update-documentation-to-describe-minor-fault-handling.patch
>>> userfaultfd-selftests-add-test-exercising-minor-fault-handling.patch
>>>
>>> but I don't believe they have had much testing standalone, without the
>>> other userfaultfd patches present.  So I don't think it's smart to
>>> upstream these in this cycle.  Or I could drop them so you and Peter
>>> can have a clean shot at redoing the whole thing.  Please let me know.
>>
>> From my perspective, both Peter's error handling and the hugetlbfs
>> minor faulting patches are ready to go. (Peter's most importantly; we
>> should establish that as a base, and put all the burden on resolving
>> conflicts with it on us instead of you :).)
>>
>> My memory was that Peter's patch was applied before my shmem series,
>> but it seems I was mistaken. So, maybe the best thing to do is to have
>> Peter send a version of it based on your tree, without the shmem
>> series? And then I'll resolve any conflicts in my tree?
>>
>> It's true that we haven't tested the hugetlbfs minor faults patch
>> extensively *with the shmem one also applied*, but it has had more
>> thorough review than the shmem one at this point (e.g. by Mike
>> Kravetz), and they're rather separate code paths (I'd be surprised if
>> one breaks the other).
> 
> Yes I think the hugetlb part should have got more review done.  IMHO it's a
> matter of whether Mike would still like to do a more thorough review, or seems
> okay to keep them.

I looked pretty closely at the hugetlb specific parts of the minor fault
handling series.  I only took a high level look at the code modifying and
dealing with the userfaultfd API.  The hugetlb specific parts looked fine
to me.  I can take a closer look at the userfaultfd API modifications,
but it would take more time for me to come up to speed on the APIs.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/9] userfaultfd/shmem: support UFFDIO_CONTINUE for shmem
  2021-04-08 23:43 ` [PATCH 4/9] userfaultfd/shmem: support UFFDIO_CONTINUE " Axel Rasmussen
@ 2021-04-12 23:17   ` Peter Xu
  2021-04-13  4:40     ` Axel Rasmussen
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Xu @ 2021-04-12 23:17 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: Alexander Viro, Andrea Arcangeli, Andrew Morton,
	Daniel Colascione, Hugh Dickins, Jerome Glisse, Joe Perches,
	Lokesh Gidra, Mike Kravetz, Mike Rapoport, Shaohua Li,
	Shuah Khan, Stephen Rothwell, Wang Qing, linux-api,
	linux-fsdevel, linux-kernel, linux-kselftest, linux-mm,
	Brian Geffon, Dr . David Alan Gilbert, Mina Almasry,
	Oliver Upton

On Thu, Apr 08, 2021 at 04:43:22PM -0700, Axel Rasmussen wrote:
> +/*
> + * Install PTEs, to map dst_addr (within dst_vma) to page.
> + *
> + * This function handles MCOPY_ATOMIC_CONTINUE (which is always file-backed),
> + * whether or not dst_vma is VM_SHARED. It also handles the more general
> + * MCOPY_ATOMIC_NORMAL case, when dst_vma is *not* VM_SHARED (it may be file
> + * backed, or not).
> + *
> + * Note that MCOPY_ATOMIC_NORMAL for a VM_SHARED dst_vma is handled by
> + * shmem_mcopy_atomic_pte instead.
> + */
> +static int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> +				     struct vm_area_struct *dst_vma,
> +				     unsigned long dst_addr, struct page *page,
> +				     bool newly_allocated, bool wp_copy)
> +{
> +	int ret;
> +	pte_t _dst_pte, *dst_pte;
> +	int writable;
> +	bool vm_shared = dst_vma->vm_flags & VM_SHARED;
> +	spinlock_t *ptl;
> +	struct inode *inode;
> +	pgoff_t offset, max_off;
> +
> +	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> +	writable = dst_vma->vm_flags & VM_WRITE;
> +	/* For private, non-anon we need CoW (don't write to page cache!) */
> +	if (!vma_is_anonymous(dst_vma) && !vm_shared)
> +		writable = 0;
> +
> +	if (writable || vma_is_anonymous(dst_vma))
> +		_dst_pte = pte_mkdirty(_dst_pte);
> +	if (writable) {
> +		if (wp_copy)
> +			_dst_pte = pte_mkuffd_wp(_dst_pte);
> +		else
> +			_dst_pte = pte_mkwrite(_dst_pte);
> +	} else if (vm_shared) {
> +		/*
> +		 * Since we didn't pte_mkdirty(), mark the page dirty or it
> +		 * could be freed from under us. We could do this
> +		 * unconditionally, but doing it only if !writable is faster.
> +		 */
> +		set_page_dirty(page);
> +	}
> +
> +	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
> +
> +	if (vma_is_shmem(dst_vma)) {
> +		/* The shmem MAP_PRIVATE case requires checking the i_size */

When you start to use this function in the last patch it'll be needed too even
if MAP_SHARED?

How about directly state the reason of doing this ("serialize against truncate
with the PT lock") instead of commenting about "who will need it"?

> +		inode = dst_vma->vm_file->f_inode;
> +		offset = linear_page_index(dst_vma, dst_addr);
> +		max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> +		ret = -EFAULT;
> +		if (unlikely(offset >= max_off))
> +			goto out_unlock;
> +	}

[...]

> +/* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */
> +static int mcontinue_atomic_pte(struct mm_struct *dst_mm,
> +				pmd_t *dst_pmd,
> +				struct vm_area_struct *dst_vma,
> +				unsigned long dst_addr,
> +				bool wp_copy)
> +{
> +	struct inode *inode = file_inode(dst_vma->vm_file);
> +	pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
> +	struct page *page;
> +	int ret;
> +
> +	ret = shmem_getpage(inode, pgoff, &page, SGP_READ);

SGP_READ looks right, as we don't want page allocation.  However I noticed
there's very slight difference when the page was just fallocated:

	/* fallocated page? */
	if (page && !PageUptodate(page)) {
		if (sgp != SGP_READ)
			goto clear;
		unlock_page(page);
		put_page(page);
		page = NULL;
		hindex = index;
	}

I think it won't happen for your case since the page should be uptodate already
(the other thread should check and modify the page before CONTINUE), but still
raise this up, since if the page was allocated it smells better to still
install the fallocated page (do we need to clear the page and SetUptodate)?

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/9] userfaultfd/hugetlbfs: avoid including userfaultfd_k.h in hugetlb.h
  2021-04-08 23:43 ` [PATCH 1/9] userfaultfd/hugetlbfs: avoid including userfaultfd_k.h in hugetlb.h Axel Rasmussen
@ 2021-04-12 23:47   ` Peter Xu
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2021-04-12 23:47 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: Alexander Viro, Andrea Arcangeli, Andrew Morton,
	Daniel Colascione, Hugh Dickins, Jerome Glisse, Joe Perches,
	Lokesh Gidra, Mike Kravetz, Mike Rapoport, Shaohua Li,
	Shuah Khan, Stephen Rothwell, Wang Qing, linux-api,
	linux-fsdevel, linux-kernel, linux-kselftest, linux-mm,
	Brian Geffon, Dr . David Alan Gilbert, Mina Almasry,
	Oliver Upton

On Thu, Apr 08, 2021 at 04:43:19PM -0700, Axel Rasmussen wrote:
> Minimizing header file inclusion is desirable. In this case, we can do
> so just by forward declaring the enumeration our signature relies upon.
> 
> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
> ---
>  include/linux/hugetlb.h | 5 ++++-
>  mm/hugetlb.c            | 1 +
>  2 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 1d3246b31a41..dfb749eaf348 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -11,7 +11,6 @@
>  #include <linux/kref.h>
>  #include <linux/pgtable.h>
>  #include <linux/gfp.h>
> -#include <linux/userfaultfd_k.h>
>  
>  struct ctl_table;
>  struct user_struct;
> @@ -136,6 +135,8 @@ unsigned long hugetlb_total_pages(void);
>  vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  			unsigned long address, unsigned int flags);
>  #ifdef CONFIG_USERFAULTFD
> +enum mcopy_atomic_mode;

(I'm not 100% sure, but.. maybe this can be moved even out of ifdef?  Then you
 can define it once at the top rather than twice?)

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/9] userfaultfd/shmem: combine shmem_{mcopy_atomic,mfill_zeropage}_pte
  2021-04-08 23:43 ` [PATCH 2/9] userfaultfd/shmem: combine shmem_{mcopy_atomic,mfill_zeropage}_pte Axel Rasmussen
@ 2021-04-12 23:57   ` Peter Xu
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2021-04-12 23:57 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: Alexander Viro, Andrea Arcangeli, Andrew Morton,
	Daniel Colascione, Hugh Dickins, Jerome Glisse, Joe Perches,
	Lokesh Gidra, Mike Kravetz, Mike Rapoport, Shaohua Li,
	Shuah Khan, Stephen Rothwell, Wang Qing, linux-api,
	linux-fsdevel, linux-kernel, linux-kselftest, linux-mm,
	Brian Geffon, Dr . David Alan Gilbert, Mina Almasry,
	Oliver Upton

On Thu, Apr 08, 2021 at 04:43:20PM -0700, Axel Rasmussen wrote:
> Previously, we did a dance where we had one calling path in
> userfaultfd.c (mfill_atomic_pte), but then we split it into two in
> shmem_fs.h (shmem_{mcopy_atomic,mfill_zeropage}_pte), and then rejoined
> into a single shared function in shmem.c (shmem_mfill_atomic_pte).
> 
> This is all a bit overly complex. Just call the single combined shmem
> function directly, allowing us to clean up various branches,
> boilerplate, etc.
> 
> While we're touching this function, two other small cleanup changes:
> - offset is equivalent to pgoff, so we can get rid of offset entirely.
> - Split two VM_BUG_ON cases into two statements. This means the line
>   number reported when the BUG is hit specifies exactly which condition
>   was true.

(For my own preference, I'll avoid touching the latter one)

> 
> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
> ---
>  include/linux/shmem_fs.h | 15 +++++-------
>  mm/shmem.c               | 52 +++++++++++++---------------------------
>  mm/userfaultfd.c         | 10 +++-----
>  3 files changed, 25 insertions(+), 52 deletions(-)
> 
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index d82b6f396588..919e36671fe6 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -122,21 +122,18 @@ static inline bool shmem_file(struct file *file)
>  extern bool shmem_charge(struct inode *inode, long pages);
>  extern void shmem_uncharge(struct inode *inode, long pages);
>  
> +#ifdef CONFIG_USERFAULTFD
>  #ifdef CONFIG_SHMEM
>  extern int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>  				  struct vm_area_struct *dst_vma,
>  				  unsigned long dst_addr,
>  				  unsigned long src_addr,

Not a problem of your patch, but it's just that we passed in odd src_addr
values into mfill_atomic_pte() for zeropage case because we loop on src_addr in
__mcopy_atomic()...  Then it'll further passed into shmem_mcopy_atomic_pte()
now after this patch (as shmem_mfill_zeropage_pte() probably only did one thing
good which is to clear src_addr).  Not a big deal, though.

All the rest looks sane to me.

Reviewed-by: Peter Xu <peterx@redhat.com>

I'll wait to look at the selftests since in all cases they should be prone to
rebase (either based on the v2 cleanup I posted, or you'd need to post without
err() - then I can rebase again), so I figured maybe I just read the new
version.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/9] userfaultfd/shmem: support UFFDIO_CONTINUE for shmem
  2021-04-12 23:17   ` Peter Xu
@ 2021-04-13  4:40     ` Axel Rasmussen
  2021-04-13 18:12       ` Peter Xu
  0 siblings, 1 reply; 21+ messages in thread
From: Axel Rasmussen @ 2021-04-13  4:40 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alexander Viro, Andrea Arcangeli, Andrew Morton,
	Daniel Colascione, Hugh Dickins, Jerome Glisse, Joe Perches,
	Lokesh Gidra, Mike Kravetz, Mike Rapoport, Shaohua Li,
	Shuah Khan, Stephen Rothwell, Wang Qing, linux-api,
	linux-fsdevel, LKML, linux-kselftest, Linux MM, Brian Geffon,
	Dr . David Alan Gilbert, Mina Almasry, Oliver Upton

On Mon, Apr 12, 2021 at 4:17 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Apr 08, 2021 at 04:43:22PM -0700, Axel Rasmussen wrote:
> > +/*
> > + * Install PTEs, to map dst_addr (within dst_vma) to page.
> > + *
> > + * This function handles MCOPY_ATOMIC_CONTINUE (which is always file-backed),
> > + * whether or not dst_vma is VM_SHARED. It also handles the more general
> > + * MCOPY_ATOMIC_NORMAL case, when dst_vma is *not* VM_SHARED (it may be file
> > + * backed, or not).
> > + *
> > + * Note that MCOPY_ATOMIC_NORMAL for a VM_SHARED dst_vma is handled by
> > + * shmem_mcopy_atomic_pte instead.
> > + */
> > +static int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> > +                                  struct vm_area_struct *dst_vma,
> > +                                  unsigned long dst_addr, struct page *page,
> > +                                  bool newly_allocated, bool wp_copy)
> > +{
> > +     int ret;
> > +     pte_t _dst_pte, *dst_pte;
> > +     int writable;
> > +     bool vm_shared = dst_vma->vm_flags & VM_SHARED;
> > +     spinlock_t *ptl;
> > +     struct inode *inode;
> > +     pgoff_t offset, max_off;
> > +
> > +     _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> > +     writable = dst_vma->vm_flags & VM_WRITE;
> > +     /* For private, non-anon we need CoW (don't write to page cache!) */
> > +     if (!vma_is_anonymous(dst_vma) && !vm_shared)
> > +             writable = 0;
> > +
> > +     if (writable || vma_is_anonymous(dst_vma))
> > +             _dst_pte = pte_mkdirty(_dst_pte);
> > +     if (writable) {
> > +             if (wp_copy)
> > +                     _dst_pte = pte_mkuffd_wp(_dst_pte);
> > +             else
> > +                     _dst_pte = pte_mkwrite(_dst_pte);
> > +     } else if (vm_shared) {
> > +             /*
> > +              * Since we didn't pte_mkdirty(), mark the page dirty or it
> > +              * could be freed from under us. We could do this
> > +              * unconditionally, but doing it only if !writable is faster.
> > +              */
> > +             set_page_dirty(page);
> > +     }
> > +
> > +     dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
> > +
> > +     if (vma_is_shmem(dst_vma)) {
> > +             /* The shmem MAP_PRIVATE case requires checking the i_size */
>
> When you start to use this function in the last patch it'll be needed too even
> if MAP_SHARED?
>
> How about directly state the reason of doing this ("serialize against truncate
> with the PT lock") instead of commenting about "who will need it"?
>
> > +             inode = dst_vma->vm_file->f_inode;
> > +             offset = linear_page_index(dst_vma, dst_addr);
> > +             max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> > +             ret = -EFAULT;
> > +             if (unlikely(offset >= max_off))
> > +                     goto out_unlock;
> > +     }
>
> [...]
>
> > +/* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */
> > +static int mcontinue_atomic_pte(struct mm_struct *dst_mm,
> > +                             pmd_t *dst_pmd,
> > +                             struct vm_area_struct *dst_vma,
> > +                             unsigned long dst_addr,
> > +                             bool wp_copy)
> > +{
> > +     struct inode *inode = file_inode(dst_vma->vm_file);
> > +     pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
> > +     struct page *page;
> > +     int ret;
> > +
> > +     ret = shmem_getpage(inode, pgoff, &page, SGP_READ);
>
> SGP_READ looks right, as we don't want page allocation.  However I noticed
> there's very slight difference when the page was just fallocated:
>
>         /* fallocated page? */
>         if (page && !PageUptodate(page)) {
>                 if (sgp != SGP_READ)
>                         goto clear;
>                 unlock_page(page);
>                 put_page(page);
>                 page = NULL;
>                 hindex = index;
>         }
>
> I think it won't happen for your case since the page should be uptodate already
> (the other thread should check and modify the page before CONTINUE), but still
> raise this up, since if the page was allocated it smells better to still
> install the fallocated page (do we need to clear the page and SetUptodate)?

Sorry for the somewhat rambling thought process:

My first thought is, I don't really know what PageUptodate means for
shmem pages. If I understand correctly, normally we say PageUptodate()
if the in memory data is more recent or equivalent to the on-disk
data. But, shmem pages are entirely in memory - they are file backed
in name only, in some sense.

fallocate() does all sorts of things so the comment to me seems a bit
ambiguous, but it seems the implication is that we're worried
specifically about the case where the shmem page was recently
allocated with fallocate(mode=0)? In that case, do we use
!PageUptodate() to denote that the page has been allocated, but its
contents are undefined?

I suppose that would make sense, as the action "goto clear;" generally
memset()-s the page to zero it, and then calls SetPageUptodate().

Okay so let's say the following sequence of events happens:

1. Userspace calls fallocate(mode=0) to allocate some shmem pages.
2. Another thread, via a UFFD-registered mapping, manages to trigger a
minor fault on one such page, while we still have !PageUptodate().
(I'm not 100% sure this can happen, but let's say it can.)
3. UFFD handler thread gets the minor fault event, and for whatever
(buggy?) reason does nothing - it doesn't modify the page, it just
calls CONTINUE.

I think if we get to this point, zeroing the page, returning it, and
setting up the PTEs seems somewhat reasonable to me. I suppose
alternatively we could notice that this happened and return an error
to the caller? I'm hesitant to mess with the behavior of
shmem_getpage_gfp() to make such a thing happen though. I do think if
we're going to set up the PTEs instead of returning an error, we
definitely do need to clear and SetPageUptodate() the page first.

In conclusion, I think this behavior is correct.

>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/9] userfaultfd/shmem: support UFFDIO_CONTINUE for shmem
  2021-04-13  4:40     ` Axel Rasmussen
@ 2021-04-13 18:12       ` Peter Xu
  2021-04-15 18:18         ` Axel Rasmussen
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Xu @ 2021-04-13 18:12 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: Alexander Viro, Andrea Arcangeli, Andrew Morton,
	Daniel Colascione, Hugh Dickins, Jerome Glisse, Joe Perches,
	Lokesh Gidra, Mike Kravetz, Mike Rapoport, Shaohua Li,
	Shuah Khan, Stephen Rothwell, Wang Qing, linux-api,
	linux-fsdevel, LKML, linux-kselftest, Linux MM, Brian Geffon,
	Dr . David Alan Gilbert, Mina Almasry, Oliver Upton

On Mon, Apr 12, 2021 at 09:40:22PM -0700, Axel Rasmussen wrote:
> On Mon, Apr 12, 2021 at 4:17 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Apr 08, 2021 at 04:43:22PM -0700, Axel Rasmussen wrote:
> > > +/*
> > > + * Install PTEs, to map dst_addr (within dst_vma) to page.
> > > + *
> > > + * This function handles MCOPY_ATOMIC_CONTINUE (which is always file-backed),
> > > + * whether or not dst_vma is VM_SHARED. It also handles the more general
> > > + * MCOPY_ATOMIC_NORMAL case, when dst_vma is *not* VM_SHARED (it may be file
> > > + * backed, or not).
> > > + *
> > > + * Note that MCOPY_ATOMIC_NORMAL for a VM_SHARED dst_vma is handled by
> > > + * shmem_mcopy_atomic_pte instead.
> > > + */
> > > +static int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> > > +                                  struct vm_area_struct *dst_vma,
> > > +                                  unsigned long dst_addr, struct page *page,
> > > +                                  bool newly_allocated, bool wp_copy)
> > > +{
> > > +     int ret;
> > > +     pte_t _dst_pte, *dst_pte;
> > > +     int writable;
> > > +     bool vm_shared = dst_vma->vm_flags & VM_SHARED;
> > > +     spinlock_t *ptl;
> > > +     struct inode *inode;
> > > +     pgoff_t offset, max_off;
> > > +
> > > +     _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> > > +     writable = dst_vma->vm_flags & VM_WRITE;
> > > +     /* For private, non-anon we need CoW (don't write to page cache!) */
> > > +     if (!vma_is_anonymous(dst_vma) && !vm_shared)
> > > +             writable = 0;
> > > +
> > > +     if (writable || vma_is_anonymous(dst_vma))
> > > +             _dst_pte = pte_mkdirty(_dst_pte);
> > > +     if (writable) {
> > > +             if (wp_copy)
> > > +                     _dst_pte = pte_mkuffd_wp(_dst_pte);
> > > +             else
> > > +                     _dst_pte = pte_mkwrite(_dst_pte);
> > > +     } else if (vm_shared) {
> > > +             /*
> > > +              * Since we didn't pte_mkdirty(), mark the page dirty or it
> > > +              * could be freed from under us. We could do this
> > > +              * unconditionally, but doing it only if !writable is faster.
> > > +              */
> > > +             set_page_dirty(page);
> > > +     }
> > > +
> > > +     dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
> > > +
> > > +     if (vma_is_shmem(dst_vma)) {
> > > +             /* The shmem MAP_PRIVATE case requires checking the i_size */
> >
> > When you start to use this function in the last patch it'll be needed too even
> > if MAP_SHARED?
> >
> > How about directly state the reason of doing this ("serialize against truncate
> > with the PT lock") instead of commenting about "who will need it"?
> >
> > > +             inode = dst_vma->vm_file->f_inode;
> > > +             offset = linear_page_index(dst_vma, dst_addr);
> > > +             max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> > > +             ret = -EFAULT;
> > > +             if (unlikely(offset >= max_off))
> > > +                     goto out_unlock;
> > > +     }
> >
> > [...]
> >
> > > +/* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */
> > > +static int mcontinue_atomic_pte(struct mm_struct *dst_mm,
> > > +                             pmd_t *dst_pmd,
> > > +                             struct vm_area_struct *dst_vma,
> > > +                             unsigned long dst_addr,
> > > +                             bool wp_copy)
> > > +{
> > > +     struct inode *inode = file_inode(dst_vma->vm_file);
> > > +     pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
> > > +     struct page *page;
> > > +     int ret;
> > > +
> > > +     ret = shmem_getpage(inode, pgoff, &page, SGP_READ);
> >
> > SGP_READ looks right, as we don't want page allocation.  However I noticed
> > there's very slight difference when the page was just fallocated:
> >
> >         /* fallocated page? */
> >         if (page && !PageUptodate(page)) {
> >                 if (sgp != SGP_READ)
> >                         goto clear;

[1]

> >                 unlock_page(page);
> >                 put_page(page);
> >                 page = NULL;
> >                 hindex = index;
> >         }
> >
> > I think it won't happen for your case since the page should be uptodate already
> > (the other thread should check and modify the page before CONTINUE), but still
> > raise this up, since if the page was allocated it smells better to still
> > install the fallocated page (do we need to clear the page and SetUptodate)?
> 
> Sorry for the somewhat rambling thought process:
> 
> My first thought is, I don't really know what PageUptodate means for
> shmem pages. If I understand correctly, normally we say PageUptodate()
> if the in memory data is more recent or equivalent to the on-disk
> data. But, shmem pages are entirely in memory - they are file backed
> in name only, in some sense.
> 
> fallocate() does all sorts of things so the comment to me seems a bit
> ambiguous, but it seems the implication is that we're worried
> specifically about the case where the shmem page was recently
> allocated with fallocate(mode=0)? In that case, do we use
> !PageUptodate() to denote that the page has been allocated, but its
> contents are undefined?
> 
> I suppose that would make sense, as the action "goto clear;" generally
> memset()-s the page to zero it, and then calls SetPageUptodate().
> 
> Okay so let's say the following sequence of events happens:
> 
> 1. Userspace calls fallocate(mode=0) to allocate some shmem pages.
> 2. Another thread, via a UFFD-registered mapping, manages to trigger a
> minor fault on one such page, while we still have !PageUptodate().
> (I'm not 100% sure this can happen, but let's say it can.)
> 3. UFFD handler thread gets the minor fault event, and for whatever
> (buggy?) reason does nothing - it doesn't modify the page, it just
> calls CONTINUE.

[2]

> 
> I think if we get to this point, zeroing the page, returning it, and
> setting up the PTEs seems somewhat reasonable to me. I suppose
> alternatively we could notice that this happened and return an error
> to the caller? I'm hesitant to mess with the behavior of
> shmem_getpage_gfp() to make such a thing happen though. I do think if
> we're going to set up the PTEs instead of returning an error, we
> definitely do need to clear and SetPageUptodate() the page first.
> 
> In conclusion, I think this behavior is correct.

I agree with you (mostly :), but except one thing: you passed in SGP_READ, so
IMHO it won't do what you explained (see [1] above: "goto clear" is with "sgp
!= SGP_READ" only); instead of doing what you said, I think it'll reset page
pointer to NULL..  Then quickly in the latter block:

	if (page || sgp == SGP_READ)
		goto out;

So I think at last shmem_getpage_gfp(SGP_READ) will return NULL.

I do think I've got some confusion here regarding SGP_READ, since from the
comment in shmem_fs.h it says:

	SGP_READ,	/* don't exceed i_size, don't allocate page */

It's natural to think it as "return the fallocated page" in this case.  However
it seems not the case?  My gut feeling is the comment for SGP_READ needs a
touch up, so as to state that for newly fallocated (and not used) pages it'll
return NULL even if cache hit.

So I think you're right, for all cases this may be a trivial case.  However
I've got a lesson somewhere else that we should never overlook zero pages,
which is also related to this case - although fallocated page is still
!Uptodate so clear page happens even latter, however from userspace pov, the
user could assume it's a zero page even if the page is not accessed at all
(since any access will cause clear page).  Then the user program could avoid
modifying this page if it knows this page keeps to be zero page somehow (e.g.,
a zero page bitmap?). Then your example above [2] seems indeed a valid one
worth thinking, at least not fully paranoid.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/9] userfaultfd/shmem: support UFFDIO_CONTINUE for shmem
  2021-04-13 18:12       ` Peter Xu
@ 2021-04-15 18:18         ` Axel Rasmussen
  0 siblings, 0 replies; 21+ messages in thread
From: Axel Rasmussen @ 2021-04-15 18:18 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alexander Viro, Andrea Arcangeli, Andrew Morton,
	Daniel Colascione, Hugh Dickins, Jerome Glisse, Joe Perches,
	Lokesh Gidra, Mike Kravetz, Mike Rapoport, Shaohua Li,
	Shuah Khan, Stephen Rothwell, Wang Qing, linux-api,
	linux-fsdevel, LKML, linux-kselftest, Linux MM, Brian Geffon,
	Dr . David Alan Gilbert, Mina Almasry, Oliver Upton

On Tue, Apr 13, 2021 at 11:12 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Apr 12, 2021 at 09:40:22PM -0700, Axel Rasmussen wrote:
> > On Mon, Apr 12, 2021 at 4:17 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Thu, Apr 08, 2021 at 04:43:22PM -0700, Axel Rasmussen wrote:
> > > > +/*
> > > > + * Install PTEs, to map dst_addr (within dst_vma) to page.
> > > > + *
> > > > + * This function handles MCOPY_ATOMIC_CONTINUE (which is always file-backed),
> > > > + * whether or not dst_vma is VM_SHARED. It also handles the more general
> > > > + * MCOPY_ATOMIC_NORMAL case, when dst_vma is *not* VM_SHARED (it may be file
> > > > + * backed, or not).
> > > > + *
> > > > + * Note that MCOPY_ATOMIC_NORMAL for a VM_SHARED dst_vma is handled by
> > > > + * shmem_mcopy_atomic_pte instead.
> > > > + */
> > > > +static int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> > > > +                                  struct vm_area_struct *dst_vma,
> > > > +                                  unsigned long dst_addr, struct page *page,
> > > > +                                  bool newly_allocated, bool wp_copy)
> > > > +{
> > > > +     int ret;
> > > > +     pte_t _dst_pte, *dst_pte;
> > > > +     int writable;
> > > > +     bool vm_shared = dst_vma->vm_flags & VM_SHARED;
> > > > +     spinlock_t *ptl;
> > > > +     struct inode *inode;
> > > > +     pgoff_t offset, max_off;
> > > > +
> > > > +     _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> > > > +     writable = dst_vma->vm_flags & VM_WRITE;
> > > > +     /* For private, non-anon we need CoW (don't write to page cache!) */
> > > > +     if (!vma_is_anonymous(dst_vma) && !vm_shared)
> > > > +             writable = 0;
> > > > +
> > > > +     if (writable || vma_is_anonymous(dst_vma))
> > > > +             _dst_pte = pte_mkdirty(_dst_pte);
> > > > +     if (writable) {
> > > > +             if (wp_copy)
> > > > +                     _dst_pte = pte_mkuffd_wp(_dst_pte);
> > > > +             else
> > > > +                     _dst_pte = pte_mkwrite(_dst_pte);
> > > > +     } else if (vm_shared) {
> > > > +             /*
> > > > +              * Since we didn't pte_mkdirty(), mark the page dirty or it
> > > > +              * could be freed from under us. We could do this
> > > > +              * unconditionally, but doing it only if !writable is faster.
> > > > +              */
> > > > +             set_page_dirty(page);
> > > > +     }
> > > > +
> > > > +     dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
> > > > +
> > > > +     if (vma_is_shmem(dst_vma)) {
> > > > +             /* The shmem MAP_PRIVATE case requires checking the i_size */
> > >
> > > When you start to use this function in the last patch it'll be needed too even
> > > if MAP_SHARED?
> > >
> > > How about directly state the reason of doing this ("serialize against truncate
> > > with the PT lock") instead of commenting about "who will need it"?
> > >
> > > > +             inode = dst_vma->vm_file->f_inode;
> > > > +             offset = linear_page_index(dst_vma, dst_addr);
> > > > +             max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> > > > +             ret = -EFAULT;
> > > > +             if (unlikely(offset >= max_off))
> > > > +                     goto out_unlock;
> > > > +     }
> > >
> > > [...]
> > >
> > > > +/* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */
> > > > +static int mcontinue_atomic_pte(struct mm_struct *dst_mm,
> > > > +                             pmd_t *dst_pmd,
> > > > +                             struct vm_area_struct *dst_vma,
> > > > +                             unsigned long dst_addr,
> > > > +                             bool wp_copy)
> > > > +{
> > > > +     struct inode *inode = file_inode(dst_vma->vm_file);
> > > > +     pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
> > > > +     struct page *page;
> > > > +     int ret;
> > > > +
> > > > +     ret = shmem_getpage(inode, pgoff, &page, SGP_READ);
> > >
> > > SGP_READ looks right, as we don't want page allocation.  However I noticed
> > > there's very slight difference when the page was just fallocated:
> > >
> > >         /* fallocated page? */
> > >         if (page && !PageUptodate(page)) {
> > >                 if (sgp != SGP_READ)
> > >                         goto clear;
>
> [1]
>
> > >                 unlock_page(page);
> > >                 put_page(page);
> > >                 page = NULL;
> > >                 hindex = index;
> > >         }
> > >
> > > I think it won't happen for your case since the page should be uptodate already
> > > (the other thread should check and modify the page before CONTINUE), but still
> > > raise this up, since if the page was allocated it smells better to still
> > > install the fallocated page (do we need to clear the page and SetUptodate)?
> >
> > Sorry for the somewhat rambling thought process:
> >
> > My first thought is, I don't really know what PageUptodate means for
> > shmem pages. If I understand correctly, normally we say PageUptodate()
> > if the in memory data is more recent or equivalent to the on-disk
> > data. But, shmem pages are entirely in memory - they are file backed
> > in name only, in some sense.
> >
> > fallocate() does all sorts of things so the comment to me seems a bit
> > ambiguous, but it seems the implication is that we're worried
> > specifically about the case where the shmem page was recently
> > allocated with fallocate(mode=0)? In that case, do we use
> > !PageUptodate() to denote that the page has been allocated, but its
> > contents are undefined?
> >
> > I suppose that would make sense, as the action "goto clear;" generally
> > memset()-s the page to zero it, and then calls SetPageUptodate().
> >
> > Okay so let's say the following sequence of events happens:
> >
> > 1. Userspace calls fallocate(mode=0) to allocate some shmem pages.
> > 2. Another thread, via a UFFD-registered mapping, manages to trigger a
> > minor fault on one such page, while we still have !PageUptodate().
> > (I'm not 100% sure this can happen, but let's say it can.)
> > 3. UFFD handler thread gets the minor fault event, and for whatever
> > (buggy?) reason does nothing - it doesn't modify the page, it just
> > calls CONTINUE.
>
> [2]
>
> >
> > I think if we get to this point, zeroing the page, returning it, and
> > setting up the PTEs seems somewhat reasonable to me. I suppose
> > alternatively we could notice that this happened and return an error
> > to the caller? I'm hesitant to mess with the behavior of
> > shmem_getpage_gfp() to make such a thing happen though. I do think if
> > we're going to set up the PTEs instead of returning an error, we
> > definitely do need to clear and SetPageUptodate() the page first.
> >
> > In conclusion, I think this behavior is correct.
>
> I agree with you (mostly :), but except one thing: you passed in SGP_READ, so
> IMHO it won't do what you explained (see [1] above: "goto clear" is with "sgp
> != SGP_READ" only); instead of doing what you said, I think it'll reset page
> pointer to NULL..  Then quickly in the latter block:
>
>         if (page || sgp == SGP_READ)
>                 goto out;
>
> So I think at last shmem_getpage_gfp(SGP_READ) will return NULL.

Ah, indeed, I mistakenly read that as "sgp != SGP_READ".

>
> I do think I've got some confusion here regarding SGP_READ, since from the
> comment in shmem_fs.h it says:
>
>         SGP_READ,       /* don't exceed i_size, don't allocate page */
>
> It's natural to think it as "return the fallocated page" in this case.  However
> it seems not the case?  My gut feeling is the comment for SGP_READ needs a
> touch up, so as to state that for newly fallocated (and not used) pages it'll
> return NULL even if cache hit.
>
> So I think you're right, for all cases this may be a trivial case.  However
> I've got a lesson somewhere else that we should never overlook zero pages,
> which is also related to this case - although fallocated page is still
> !Uptodate so clear page happens even latter, however from userspace pov, the
> user could assume it's a zero page even if the page is not accessed at all
> (since any access will cause clear page).  Then the user program could avoid
> modifying this page if it knows this page keeps to be zero page somehow (e.g.,
> a zero page bitmap?). Then your example above [2] seems indeed a valid one
> worth thinking, at least not fully paranoid.

I'm kind of left thinking along the same lines though.

Modifying shmem_getpage_gfp to behave differently in this case seems
likely to be a can of worms.

With the current behavior, at least it doesn't seem to be too
problematic - we're not going to hand userspace an unzeroed page or
anything. At worst, it's a bit of a rough edge userspace might hit.
The UFFD handler thread will just get an error when it tries to
CONTINUE, and could recover by e.g. writing to the page or something
and trying again.

I'm inclined to leave it as is, and maybe look into a way to file down
the rough edge in a future patch, as well as some of the other
cleanups we've discussed elsewhere. Objections?

>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, back to index

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-08 23:43 [PATCH 0/9] userfaultfd: add minor fault handling for shmem Axel Rasmussen
2021-04-08 23:43 ` [PATCH 1/9] userfaultfd/hugetlbfs: avoid including userfaultfd_k.h in hugetlb.h Axel Rasmussen
2021-04-12 23:47   ` Peter Xu
2021-04-08 23:43 ` [PATCH 2/9] userfaultfd/shmem: combine shmem_{mcopy_atomic,mfill_zeropage}_pte Axel Rasmussen
2021-04-12 23:57   ` Peter Xu
2021-04-08 23:43 ` [PATCH 3/9] userfaultfd/shmem: support minor fault registration for shmem Axel Rasmussen
2021-04-09 16:50   ` Peter Xu
2021-04-08 23:43 ` [PATCH 4/9] userfaultfd/shmem: support UFFDIO_CONTINUE " Axel Rasmussen
2021-04-12 23:17   ` Peter Xu
2021-04-13  4:40     ` Axel Rasmussen
2021-04-13 18:12       ` Peter Xu
2021-04-15 18:18         ` Axel Rasmussen
2021-04-08 23:43 ` [PATCH 5/9] userfaultfd/selftests: use memfd_create for shmem test type Axel Rasmussen
2021-04-08 23:43 ` [PATCH 6/9] userfaultfd/selftests: create alias mappings in the shmem test Axel Rasmussen
2021-04-08 23:43 ` [PATCH 7/9] userfaultfd/selftests: reinitialize test context in each test Axel Rasmussen
2021-04-08 23:43 ` [PATCH 8/9] userfaultfd/selftests: exercise minor fault handling shmem support Axel Rasmussen
2021-04-08 23:43 ` [PATCH 9/9] userfaultfd/shmem: modify shmem_mcopy_atomic_pte to use install_ptes Axel Rasmussen
2021-04-09  5:04 ` [PATCH 0/9] userfaultfd: add minor fault handling for shmem Andrew Morton
2021-04-09 17:03   ` Axel Rasmussen
2021-04-09 21:18     ` Peter Xu
2021-04-09 22:16       ` Mike Kravetz

Linux-api Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-api/0 linux-api/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-api linux-api/ https://lore.kernel.org/linux-api \
		linux-api@vger.kernel.org
	public-inbox-index linux-api

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-api


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git