All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH v2 0/5] userfaultfd: support access/write hints
@ 2022-06-19 23:34 Nadav Amit
  2022-06-19 23:34 ` [RFC PATCH v2 1/5] userfaultfd: introduce uffd_flags Nadav Amit
                   ` (4 more replies)
  0 siblings, 5 replies; 24+ messages in thread
From: Nadav Amit @ 2022-06-19 23:34 UTC (permalink / raw)
  To: linux-mm
  Cc: Nadav Amit, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, Peter Xu, David Hildenbrand, Mike Rapoport

From: Nadav Amit <namit@vmware.com>

Setting the access-bit and dirty-bit introduces a tradeoff. When the bit
is set access/write is faster, but memory reclamation might be slower.
Currently, in the common userfaultfd cases the access-bit is not set on
and the dirty-bit is set. This is a questionable behavior.

Allow userspace to control this behavior through hints access- and
write-likely hints. These hints are used to control access- and
dirty-bits. For zero-pages that with write-likely hint, allocate a clear
page instead of mapping the zero-page.

Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>

Nadav Amit (5):
  userfaultfd: introduce uffd_flags
  userfaultfd: introduce access-likely mode for copy/wp operations
  userfaultfd: introduce write-likely mode for copy/wp operations
  userfaultfd: zero access/write hints
  selftest/userfaultfd: test read/write hints

 fs/userfaultfd.c                         |  57 ++++++++---
 include/linux/hugetlb.h                  |   4 +-
 include/linux/shmem_fs.h                 |   8 +-
 include/linux/userfaultfd_k.h            |  25 +++--
 include/uapi/linux/userfaultfd.h         |  33 ++++++-
 mm/hugetlb.c                             |   6 +-
 mm/shmem.c                               |   9 +-
 mm/userfaultfd.c                         | 118 +++++++++++++++++------
 tools/testing/selftests/vm/userfaultfd.c |  26 +++++
 9 files changed, 226 insertions(+), 60 deletions(-)

-- 
2.25.1



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC PATCH v2 1/5] userfaultfd: introduce uffd_flags
  2022-06-19 23:34 [RFC PATCH v2 0/5] userfaultfd: support access/write hints Nadav Amit
@ 2022-06-19 23:34 ` Nadav Amit
  2022-06-21  8:41   ` David Hildenbrand
  2022-06-21 15:29   ` Peter Xu
  2022-06-19 23:34 ` [RFC PATCH v2 2/5] userfaultfd: introduce access-likely mode for copy/wp operations Nadav Amit
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 24+ messages in thread
From: Nadav Amit @ 2022-06-19 23:34 UTC (permalink / raw)
  To: linux-mm
  Cc: Nadav Amit, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, Peter Xu, David Hildenbrand, Mike Rapoport

From: Nadav Amit <namit@vmware.com>

As the next patches are going to introduce more information that needs
to be propagated regarding handled user requests, introduce uffd_flags
that would be used to propagate this information.

Remove the unused UFFD_FLAGS_SET to avoid confusion in the constant
names.

Introducing uffd flags also allows to avoid mm/userfaultfd from being
using uapi (e.g., UFFDIO_COPY_MODE_WP).

Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 fs/userfaultfd.c              | 20 +++++++++----
 include/linux/hugetlb.h       |  4 +--
 include/linux/shmem_fs.h      |  8 ++++--
 include/linux/userfaultfd_k.h | 23 +++++++++------
 mm/hugetlb.c                  |  3 +-
 mm/shmem.c                    |  6 ++--
 mm/userfaultfd.c              | 53 ++++++++++++++++++-----------------
 7 files changed, 68 insertions(+), 49 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index d398f6bf6d74..5daafa54eb3f 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1700,6 +1700,8 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 	struct uffdio_copy uffdio_copy;
 	struct uffdio_copy __user *user_uffdio_copy;
 	struct userfaultfd_wake_range range;
+	bool mode_wp;
+	uffd_flags_t uffd_flags;
 
 	user_uffdio_copy = (struct uffdio_copy __user *) arg;
 
@@ -1726,10 +1728,15 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 		goto out;
 	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP))
 		goto out;
+
+	mode_wp = uffdio_copy.mode & UFFDIO_COPY_MODE_WP;
+
+	uffd_flags = mode_wp ? UFFD_FLAGS_WP : 0;
+
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
 				   uffdio_copy.len, &ctx->mmap_changing,
-				   uffdio_copy.mode);
+				   uffd_flags);
 		mmput(ctx->mm);
 	} else {
 		return -ESRCH;
@@ -1781,7 +1788,7 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mfill_zeropage(ctx->mm, uffdio_zeropage.range.start,
 				     uffdio_zeropage.range.len,
-				     &ctx->mmap_changing);
+				     &ctx->mmap_changing, 0);
 		mmput(ctx->mm);
 	} else {
 		return -ESRCH;
@@ -1810,6 +1817,7 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 	struct uffdio_writeprotect __user *user_uffdio_wp;
 	struct userfaultfd_wake_range range;
 	bool mode_wp, mode_dontwake;
+	uffd_flags_t uffd_flags;
 
 	if (atomic_read(&ctx->mmap_changing))
 		return -EAGAIN;
@@ -1835,10 +1843,12 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 	if (mode_wp && mode_dontwake)
 		return -EINVAL;
 
+	uffd_flags = (mode_wp ? UFFD_FLAGS_WP : 0);
+
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
-					  uffdio_wp.range.len, mode_wp,
-					  &ctx->mmap_changing);
+					  uffdio_wp.range.len,
+					  &ctx->mmap_changing, uffd_flags);
 		mmput(ctx->mm);
 	} else {
 		return -ESRCH;
@@ -1891,7 +1901,7 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mcopy_continue(ctx->mm, uffdio_continue.range.start,
 				     uffdio_continue.range.len,
-				     &ctx->mmap_changing);
+				     &ctx->mmap_changing, 0);
 		mmput(ctx->mm);
 	} else {
 		return -ESRCH;
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 642a39016f9a..a4f326bc2de6 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -166,7 +166,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
 				unsigned long src_addr,
 				enum mcopy_atomic_mode mode,
 				struct page **pagep,
-				bool wp_copy);
+				uffd_flags_t uffd_flags);
 #endif /* CONFIG_USERFAULTFD */
 bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
 						struct vm_area_struct *vma,
@@ -366,7 +366,7 @@ static inline int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 						unsigned long src_addr,
 						enum mcopy_atomic_mode mode,
 						struct page **pagep,
-						bool wp_copy)
+						uffd_flags_t uffd_flags)
 {
 	BUG();
 	return 0;
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index a68f982f22d1..f93a3c114002 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -9,6 +9,7 @@
 #include <linux/percpu_counter.h>
 #include <linux/xattr.h>
 #include <linux/fs_parser.h>
+#include <linux/userfaultfd_k.h>
 
 /* inode in-kernel data */
 
@@ -145,11 +146,12 @@ extern int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 				  struct vm_area_struct *dst_vma,
 				  unsigned long dst_addr,
 				  unsigned long src_addr,
-				  bool zeropage, bool wp_copy,
-				  struct page **pagep);
+				  bool zeropage,
+				  struct page **pagep,
+				  uffd_flags_t uffd_flags);
 #else /* !CONFIG_SHMEM */
 #define shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, \
-			       src_addr, zeropage, wp_copy, pagep) ({ BUG(); 0; })
+			       src_addr, zeropage, pagep, uffd_flags) ({ BUG(); 0; })
 #endif /* CONFIG_SHMEM */
 #endif /* CONFIG_USERFAULTFD */
 
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index eee374c29c85..6331148023c1 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -34,7 +34,6 @@
 #define UFFD_NONBLOCK O_NONBLOCK
 
 #define UFFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
-#define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS)
 
 extern int sysctl_unprivileged_userfaultfd;
 
@@ -56,23 +55,29 @@ enum mcopy_atomic_mode {
 	MCOPY_ATOMIC_CONTINUE,
 };
 
+typedef unsigned int __bitwise uffd_flags_t;
+
+#define UFFD_FLAGS_WP			((__force uffd_flags_t)BIT(0))
+
 extern int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 				    struct vm_area_struct *dst_vma,
 				    unsigned long dst_addr, struct page *page,
-				    bool newly_allocated, bool wp_copy);
+				    bool newly_allocated,
+				    uffd_flags_t uffd_flags);
 
 extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
 			    unsigned long src_start, unsigned long len,
-			    atomic_t *mmap_changing, __u64 mode);
-extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
-			      unsigned long dst_start,
-			      unsigned long len,
-			      atomic_t *mmap_changing);
+			    atomic_t *mmap_changing, uffd_flags_t uffd_flags);
+extern ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long dst_start,
+			      unsigned long len, atomic_t *mmap_changing,
+			      uffd_flags_t uffd_flags);
 extern ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long dst_start,
-			      unsigned long len, atomic_t *mmap_changing);
+			      unsigned long len, atomic_t *mmap_changing,
+			      uffd_flags_t uffd_flags);
 extern int mwriteprotect_range(struct mm_struct *dst_mm,
 			       unsigned long start, unsigned long len,
-			       bool enable_wp, atomic_t *mmap_changing);
+			       atomic_t *mmap_changing,
+			       uffd_flags_t uffd_flags);
 
 /* mm helpers */
 static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2bc9d1170e4f..2beff8a4bf7c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5875,9 +5875,10 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 			    unsigned long src_addr,
 			    enum mcopy_atomic_mode mode,
 			    struct page **pagep,
-			    bool wp_copy)
+			    uffd_flags_t uffd_flags)
 {
 	bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
+	bool wp_copy = uffd_flags & UFFD_FLAGS_WP;
 	struct hstate *h = hstate_vma(dst_vma);
 	struct address_space *mapping = dst_vma->vm_file->f_mapping;
 	pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
diff --git a/mm/shmem.c b/mm/shmem.c
index 12ac67dc831f..89c775275bae 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2343,8 +2343,8 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 			   struct vm_area_struct *dst_vma,
 			   unsigned long dst_addr,
 			   unsigned long src_addr,
-			   bool zeropage, bool wp_copy,
-			   struct page **pagep)
+			   bool zeropage, struct page **pagep,
+			   uffd_flags_t uffd_flags)
 {
 	struct inode *inode = file_inode(dst_vma->vm_file);
 	struct shmem_inode_info *info = SHMEM_I(inode);
@@ -2418,7 +2418,7 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 		goto out_release;
 
 	ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
-				       page, true, wp_copy);
+				       page, true, uffd_flags);
 	if (ret)
 		goto out_delete_from_cache;
 
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 07d3befc80e4..734de6aa0b8e 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -58,7 +58,7 @@ struct vm_area_struct *find_dst_vma(struct mm_struct *dst_mm,
 int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 			     struct vm_area_struct *dst_vma,
 			     unsigned long dst_addr, struct page *page,
-			     bool newly_allocated, bool wp_copy)
+			     bool newly_allocated, uffd_flags_t uffd_flags)
 {
 	int ret;
 	pte_t _dst_pte, *dst_pte;
@@ -78,7 +78,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	 * Always mark a PTE as write-protected when needed, regardless of
 	 * VM_WRITE, which the user might change.
 	 */
-	if (wp_copy) {
+	if (uffd_flags & UFFD_FLAGS_WP) {
 		_dst_pte = pte_mkuffd_wp(_dst_pte);
 		writable = false;
 	}
@@ -145,7 +145,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 			    unsigned long dst_addr,
 			    unsigned long src_addr,
 			    struct page **pagep,
-			    bool wp_copy)
+			    uffd_flags_t uffd_flags)
 {
 	void *page_kaddr;
 	int ret;
@@ -189,7 +189,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 		goto out_release;
 
 	ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
-				       page, true, wp_copy);
+				       page, true, uffd_flags);
 	if (ret)
 		goto out_release;
 out:
@@ -239,7 +239,7 @@ static int mcontinue_atomic_pte(struct mm_struct *dst_mm,
 				pmd_t *dst_pmd,
 				struct vm_area_struct *dst_vma,
 				unsigned long dst_addr,
-				bool wp_copy)
+				uffd_flags_t uffd_flags)
 {
 	struct inode *inode = file_inode(dst_vma->vm_file);
 	pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
@@ -263,7 +263,7 @@ static int mcontinue_atomic_pte(struct mm_struct *dst_mm,
 	}
 
 	ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
-				       page, false, wp_copy);
+				       page, false, uffd_flags);
 	if (ret)
 		goto out_release;
 
@@ -309,7 +309,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 					      unsigned long src_start,
 					      unsigned long len,
 					      enum mcopy_atomic_mode mode,
-					      bool wp_copy)
+					      uffd_flags_t uffd_flags)
 {
 	int vm_shared = dst_vma->vm_flags & VM_SHARED;
 	ssize_t err;
@@ -406,7 +406,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 
 		err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
 					       dst_addr, src_addr, mode, &page,
-					       wp_copy);
+					       uffd_flags);
 
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		i_mmap_unlock_read(mapping);
@@ -462,7 +462,7 @@ extern ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 				      unsigned long src_start,
 				      unsigned long len,
 				      enum mcopy_atomic_mode mode,
-				      bool wp_copy);
+				      uffd_flags_t uffd_flags);
 #endif /* CONFIG_HUGETLB_PAGE */
 
 static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
@@ -472,13 +472,13 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 						unsigned long src_addr,
 						struct page **page,
 						enum mcopy_atomic_mode mode,
-						bool wp_copy)
+						uffd_flags_t uffd_flags)
 {
 	ssize_t err;
 
 	if (mode == MCOPY_ATOMIC_CONTINUE) {
 		return mcontinue_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
-					    wp_copy);
+					    uffd_flags);
 	}
 
 	/*
@@ -495,7 +495,7 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 		if (mode == MCOPY_ATOMIC_NORMAL)
 			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
 					       dst_addr, src_addr, page,
-					       wp_copy);
+					       uffd_flags);
 		else
 			err = mfill_zeropage_pte(dst_mm, dst_pmd,
 						 dst_vma, dst_addr);
@@ -503,7 +503,7 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 		err = shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma,
 					     dst_addr, src_addr,
 					     mode != MCOPY_ATOMIC_NORMAL,
-					     wp_copy, page);
+					     page, uffd_flags);
 	}
 
 	return err;
@@ -515,7 +515,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 					      unsigned long len,
 					      enum mcopy_atomic_mode mcopy_mode,
 					      atomic_t *mmap_changing,
-					      __u64 mode)
+					      uffd_flags_t uffd_flags)
 {
 	struct vm_area_struct *dst_vma;
 	ssize_t err;
@@ -523,7 +523,6 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	unsigned long src_addr, dst_addr;
 	long copied;
 	struct page *page;
-	bool wp_copy;
 
 	/*
 	 * Sanitize the command parameters:
@@ -570,11 +569,10 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 		goto out_unlock;
 
 	/*
-	 * validate 'mode' now that we know the dst_vma: don't allow
+	 * validate 'flags' now that we know the dst_vma: don't allow
 	 * a wrprotect copy if the userfaultfd didn't register as WP.
 	 */
-	wp_copy = mode & UFFDIO_COPY_MODE_WP;
-	if (wp_copy && !(dst_vma->vm_flags & VM_UFFD_WP))
+	if ((uffd_flags & UFFD_FLAGS_WP) && !(dst_vma->vm_flags & VM_UFFD_WP))
 		goto out_unlock;
 
 	/*
@@ -583,7 +581,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	if (is_vm_hugetlb_page(dst_vma))
 		return  __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
 					       src_start, len, mcopy_mode,
-					       wp_copy);
+					       uffd_flags);
 
 	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
 		goto out_unlock;
@@ -635,7 +633,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 		BUG_ON(pmd_trans_huge(*dst_pmd));
 
 		err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
-				       src_addr, &page, mcopy_mode, wp_copy);
+				       src_addr, &page, mcopy_mode, uffd_flags);
 		cond_resched();
 
 		if (unlikely(err == -ENOENT)) {
@@ -683,30 +681,33 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 
 ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
 		     unsigned long src_start, unsigned long len,
-		     atomic_t *mmap_changing, __u64 mode)
+		     atomic_t *mmap_changing, uffd_flags_t uffd_flags)
 {
 	return __mcopy_atomic(dst_mm, dst_start, src_start, len,
-			      MCOPY_ATOMIC_NORMAL, mmap_changing, mode);
+			      MCOPY_ATOMIC_NORMAL, mmap_changing, uffd_flags);
 }
 
 ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
-		       unsigned long len, atomic_t *mmap_changing)
+		       unsigned long len, atomic_t *mmap_changing,
+		       uffd_flags_t uffd_flags)
 {
 	return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_ZEROPAGE,
 			      mmap_changing, 0);
 }
 
 ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long start,
-		       unsigned long len, atomic_t *mmap_changing)
+		       unsigned long len, atomic_t *mmap_changing,
+		       uffd_flags_t uffd_flags)
 {
 	return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_CONTINUE,
 			      mmap_changing, 0);
 }
 
 int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
-			unsigned long len, bool enable_wp,
-			atomic_t *mmap_changing)
+			unsigned long len,
+			atomic_t *mmap_changing, uffd_flags_t uffd_flags)
 {
+	bool enable_wp = uffd_flags & UFFD_FLAGS_WP;
 	struct vm_area_struct *dst_vma;
 	unsigned long page_mask;
 	struct mmu_gather tlb;
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH v2 2/5] userfaultfd: introduce access-likely mode for copy/wp operations
  2022-06-19 23:34 [RFC PATCH v2 0/5] userfaultfd: support access/write hints Nadav Amit
  2022-06-19 23:34 ` [RFC PATCH v2 1/5] userfaultfd: introduce uffd_flags Nadav Amit
@ 2022-06-19 23:34 ` Nadav Amit
  2022-06-20 10:33   ` kernel test robot
  2022-06-21  8:48   ` David Hildenbrand
  2022-06-19 23:34 ` [RFC PATCH v2 3/5] userfaultfd: introduce write-likely " Nadav Amit
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 24+ messages in thread
From: Nadav Amit @ 2022-06-19 23:34 UTC (permalink / raw)
  To: linux-mm
  Cc: Nadav Amit, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, Peter Xu, David Hildenbrand, Mike Rapoport

From: Nadav Amit <namit@vmware.com>

Using a PTE on x86 with cleared access-bit (aka young-bit)
takes ~600 cycles more than when the access bit is set. At the same
time, setting the access-bit for memory that is not used (e.g.,
prefetched) can introduce greater overheads, as the prefetched memory is
reclaimed later than it should be.

Userfaultfd currently does not set the access-bit (excluding the
huge-pages case). Arguably, it is best to let the user control whether
the access bit should be set or not. The expected use is to request
userfaultfd to set the access-bit when the copy/wp operation is done to
resolve a page-fault, and not to set the access-bit when the memory is
prefetched.

Introduce UFFDIO_COPY_MODE_ACCESS_LIKELY and
UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY to enable userspace to request
the young bit to be set. Set for UFFDIO_CONTINUE and UFFDIO_ZEROPAGE the
bit unconditionally since the former is only used to resolve page-faults
and the latter would not benefit from not setting the access-bit.

Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 fs/userfaultfd.c                 | 23 ++++++++++++++++-------
 include/linux/userfaultfd_k.h    |  1 +
 include/uapi/linux/userfaultfd.h | 20 +++++++++++++++++++-
 mm/userfaultfd.c                 | 18 ++++++++++++++++--
 4 files changed, 52 insertions(+), 10 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 5daafa54eb3f..35a8c4347c54 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1700,7 +1700,7 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 	struct uffdio_copy uffdio_copy;
 	struct uffdio_copy __user *user_uffdio_copy;
 	struct userfaultfd_wake_range range;
-	bool mode_wp;
+	bool mode_wp, mode_access_likely;
 	uffd_flags_t uffd_flags;
 
 	user_uffdio_copy = (struct uffdio_copy __user *) arg;
@@ -1726,12 +1726,15 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 	ret = -EINVAL;
 	if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
 		goto out;
-	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP))
+	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP|
+				 UFFDIO_COPY_MODE_ACCESS_LIKELY))
 		goto out;
 
 	mode_wp = uffdio_copy.mode & UFFDIO_COPY_MODE_WP;
+	mode_access_likely = uffdio_copy.mode & UFFDIO_COPY_MODE_ACCESS_LIKELY;
 
-	uffd_flags = mode_wp ? UFFD_FLAGS_WP : 0;
+	uffd_flags = (mode_wp ? UFFD_FLAGS_WP : 0) |
+		     (mode_access_likely ? UFFD_FLAGS_ACCESS_LIKELY : 0);
 
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
@@ -1816,7 +1819,7 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 	struct uffdio_writeprotect uffdio_wp;
 	struct uffdio_writeprotect __user *user_uffdio_wp;
 	struct userfaultfd_wake_range range;
-	bool mode_wp, mode_dontwake;
+	bool mode_wp, mode_dontwake, mode_access_likely;
 	uffd_flags_t uffd_flags;
 
 	if (atomic_read(&ctx->mmap_changing))
@@ -1834,16 +1837,19 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 		return ret;
 
 	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
-			       UFFDIO_WRITEPROTECT_MODE_WP))
+			       UFFDIO_WRITEPROTECT_MODE_WP |
+			       UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY))
 		return -EINVAL;
 
 	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
 	mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
+	mode_access_likely = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY;
 
 	if (mode_wp && mode_dontwake)
 		return -EINVAL;
 
-	uffd_flags = (mode_wp ? UFFD_FLAGS_WP : 0);
+	uffd_flags = (mode_wp ? UFFD_FLAGS_WP : 0) |
+		     (mode_access_likely ? UFFD_FLAGS_ACCESS_LIKELY : 0);
 
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
@@ -1871,6 +1877,7 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
 	struct uffdio_continue uffdio_continue;
 	struct uffdio_continue __user *user_uffdio_continue;
 	struct userfaultfd_wake_range range;
+	uffd_flags_t uffd_flags;
 
 	user_uffdio_continue = (struct uffdio_continue __user *)arg;
 
@@ -1898,10 +1905,12 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
 	if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE)
 		goto out;
 
+	uffd_flags = UFFD_FLAGS_ACCESS_LIKELY;
+
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mcopy_continue(ctx->mm, uffdio_continue.range.start,
 				     uffdio_continue.range.len,
-				     &ctx->mmap_changing, 0);
+				     &ctx->mmap_changing, uffd_flags);
 		mmput(ctx->mm);
 	} else {
 		return -ESRCH;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 6331148023c1..e6ac165ec044 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -58,6 +58,7 @@ enum mcopy_atomic_mode {
 typedef unsigned int __bitwise uffd_flags_t;
 
 #define UFFD_FLAGS_WP			((__force uffd_flags_t)BIT(0))
+#define UFFD_FLAGS_ACCESS_LIKELY	((__force uffd_flags_t)BIT(1))
 
 extern int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 				    struct vm_area_struct *dst_vma,
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 005e5e306266..d9c8ce9ba777 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -38,7 +38,8 @@
 			   UFFD_FEATURE_MINOR_HUGETLBFS |	\
 			   UFFD_FEATURE_MINOR_SHMEM |		\
 			   UFFD_FEATURE_EXACT_ADDRESS |		\
-			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM)
+			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM |	\
+			   UFFD_FEATURE_ACCESS_HINTS)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -203,6 +204,10 @@ struct uffdio_api {
 	 *
 	 * UFFD_FEATURE_WP_HUGETLBFS_SHMEM indicates that userfaultfd
 	 * write-protection mode is supported on both shmem and hugetlbfs.
+	 *
+	 * UFFD_FEATURE_ACCESS_HINTS indicates that the copy supports
+	 * UFFDIO_COPY_MODE_ACCESS_LIKELY supports
+	 * UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY.
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
@@ -217,6 +222,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_MINOR_SHMEM		(1<<10)
 #define UFFD_FEATURE_EXACT_ADDRESS		(1<<11)
 #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM		(1<<12)
+#define UFFD_FEATURE_ACCESS_HINTS		(1<<13)
 	__u64 features;
 
 	__u64 ioctls;
@@ -260,6 +266,13 @@ struct uffdio_copy {
 	 * copy_from_user will not read the last 8 bytes.
 	 */
 	__s64 copy;
+	/*
+	 * UFFDIO_COPY_MODE_ACCESS_LIKELY will set the mapped page as young.
+	 * This can reduce the time that the first access to the page takes.
+	 * Yet, if set opportunistically to memory that is not used, it might
+	 * extend the time before the unused memory pages are reclaimed.
+	 */
+#define UFFDIO_COPY_MODE_ACCESS_LIKELY		((__u64)1<<3)
 };
 
 struct uffdio_zeropage {
@@ -284,6 +297,10 @@ struct uffdio_writeprotect {
  * UFFDIO_WRITEPROTECT_MODE_DONTWAKE: set the flag to avoid waking up
  * any wait thread after the operation succeeds.
  *
+ * UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY: set the flag to mark the modified
+ * memory as young, which can reduce the time that the first access
+ * to the page takes.
+ *
  * NOTE: Write protecting a region (WP=1) is unrelated to page faults,
  * therefore DONTWAKE flag is meaningless with WP=1.  Removing write
  * protection (WP=0) in response to a page fault wakes the faulting
@@ -291,6 +308,7 @@ struct uffdio_writeprotect {
  */
 #define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
 #define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
+#define UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY	((__u64)1<<2)
 	__u64 mode;
 };
 
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 734de6aa0b8e..140c8d3e946e 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -92,6 +92,9 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 		 */
 		_dst_pte = pte_wrprotect(_dst_pte);
 
+	if (uffd_flags & UFFD_FLAGS_ACCESS_LIKELY)
+		_dst_pte = pte_mkyoung(_dst_pte);
+
 	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
 
 	if (vma_is_shmem(dst_vma)) {
@@ -202,7 +205,8 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 static int mfill_zeropage_pte(struct mm_struct *dst_mm,
 			      pmd_t *dst_pmd,
 			      struct vm_area_struct *dst_vma,
-			      unsigned long dst_addr)
+			      unsigned long dst_addr,
+			      uffd_flags_t uffd_flags)
 {
 	pte_t _dst_pte, *dst_pte;
 	spinlock_t *ptl;
@@ -225,6 +229,10 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm,
 	ret = -EEXIST;
 	if (!pte_none(*dst_pte))
 		goto out_unlock;
+
+	if (uffd_flags & UFFD_FLAGS_ACCESS_LIKELY)
+		_dst_pte = pte_mkyoung(_dst_pte);
+
 	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(dst_vma, dst_addr, dst_pte);
@@ -498,7 +506,7 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 					       uffd_flags);
 		else
 			err = mfill_zeropage_pte(dst_mm, dst_pmd,
-						 dst_vma, dst_addr);
+						 dst_vma, dst_addr, uffd_flags);
 	} else {
 		err = shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma,
 					     dst_addr, src_addr,
@@ -691,6 +699,9 @@ ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
 		       unsigned long len, atomic_t *mmap_changing,
 		       uffd_flags_t uffd_flags)
 {
+	/* There is no cost for setting the access bit of a zeropage */
+	uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+
 	return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_ZEROPAGE,
 			      mmap_changing, 0);
 }
@@ -699,6 +710,9 @@ ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long start,
 		       unsigned long len, atomic_t *mmap_changing,
 		       uffd_flags_t uffd_flags)
 {
+	/* The page is likely to be accessed */
+	uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+
 	return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_CONTINUE,
 			      mmap_changing, 0);
 }
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH v2 3/5] userfaultfd: introduce write-likely mode for copy/wp operations
  2022-06-19 23:34 [RFC PATCH v2 0/5] userfaultfd: support access/write hints Nadav Amit
  2022-06-19 23:34 ` [RFC PATCH v2 1/5] userfaultfd: introduce uffd_flags Nadav Amit
  2022-06-19 23:34 ` [RFC PATCH v2 2/5] userfaultfd: introduce access-likely mode for copy/wp operations Nadav Amit
@ 2022-06-19 23:34 ` Nadav Amit
  2022-06-21 16:38   ` Peter Xu
  2022-06-19 23:34 ` [RFC PATCH v2 4/5] userfaultfd: zero access/write hints Nadav Amit
  2022-06-19 23:34 ` [RFC PATCH v2 5/5] selftest/userfaultfd: test read/write hints Nadav Amit
  4 siblings, 1 reply; 24+ messages in thread
From: Nadav Amit @ 2022-06-19 23:34 UTC (permalink / raw)
  To: linux-mm
  Cc: Nadav Amit, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, Peter Xu, David Hildenbrand, Mike Rapoport

From: Nadav Amit <namit@vmware.com>

Commit 9ae0f87d009ca ("mm/shmem: unconditionally set pte dirty in
mfill_atomic_install_pte") has set PTEs as dirty as its title indicates.
However, setting read-only PTEs as dirty can have several undesired
implications.

First, setting read-only PTEs as dirty, can cause these PTEs to become
writable during mprotect() syscall. See in change_pte_range():

	/* Avoid taking write faults for known dirty pages */
	if (dirty_accountable && pte_dirty(ptent) &&
			(pte_soft_dirty(ptent) ||
			 !(vma->vm_flags & VM_SOFTDIRTY))) {
		ptent = pte_mkwrite(ptent);
	}

Second, unmapping read-only dirty PTEs often prevents TLB flush batching.
See try_to_unmap_one():

	/*
	 * Page is dirty. Flush the TLB if a writable entry
	 * potentially exists to avoid CPU writes after IO
	 * starts and then write it out here.
	 */
	try_to_unmap_flush_dirty();

Similarly batching TLB flushed might be prevented in zap_pte_range():

	if (!PageAnon(page)) {
		if (pte_dirty(ptent)) {
			force_flush = 1;
			set_page_dirty(page);
		}
	...

In general, setting a PTE as dirty seems for read-only entries might be
dangerous. It should be reminded the dirty-COW vulnerability mitigation
also relies on the dirty bit being set only after COW (although it does
not appear to apply to userfaultfd).

To summarize, setting the dirty bit for read-only PTEs is dangerous. But
even if we only consider writable pages, always setting the dirty bit or
always leaving it clear, does not seem as the best policy. Leaving the
bit clear introduces overhead on the first write-access to set the bit.
Setting the bit for pages the are eventually not written to can require
more TLB flushes.

Let the userfaultfd users control whether PTEs are marked as dirty or
clean. Introduce UFFDIO_COPY_MODE_WRITE and
UFFDIO_COPY_MODE_WRITE_LIKELY and UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY
to enable userspace to indicate whether pages are likely to be written
and set the dirty-bit if they are likely to be written.

Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 fs/userfaultfd.c                 | 22 ++++++++++++++--------
 include/linux/userfaultfd_k.h    |  1 +
 include/uapi/linux/userfaultfd.h | 27 +++++++++++++++++++--------
 mm/hugetlb.c                     |  3 +++
 mm/shmem.c                       |  3 +++
 mm/userfaultfd.c                 | 11 +++++++++--
 6 files changed, 49 insertions(+), 18 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 35a8c4347c54..a56983b594d5 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1700,7 +1700,7 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 	struct uffdio_copy uffdio_copy;
 	struct uffdio_copy __user *user_uffdio_copy;
 	struct userfaultfd_wake_range range;
-	bool mode_wp, mode_access_likely;
+	bool mode_wp, mode_access_likely, mode_write_likely;
 	uffd_flags_t uffd_flags;
 
 	user_uffdio_copy = (struct uffdio_copy __user *) arg;
@@ -1727,14 +1727,17 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 	if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
 		goto out;
 	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP|
-				 UFFDIO_COPY_MODE_ACCESS_LIKELY))
+				 UFFDIO_COPY_MODE_ACCESS_LIKELY|
+				 UFFDIO_COPY_MODE_WRITE_LIKELY))
 		goto out;
 
 	mode_wp = uffdio_copy.mode & UFFDIO_COPY_MODE_WP;
 	mode_access_likely = uffdio_copy.mode & UFFDIO_COPY_MODE_ACCESS_LIKELY;
+	mode_write_likely = uffdio_copy.mode & UFFDIO_COPY_MODE_WRITE_LIKELY;
 
 	uffd_flags = (mode_wp ? UFFD_FLAGS_WP : 0) |
-		     (mode_access_likely ? UFFD_FLAGS_ACCESS_LIKELY : 0);
+		     (mode_access_likely ? UFFD_FLAGS_ACCESS_LIKELY : 0) |
+		     (mode_write_likely ? UFFD_FLAGS_WRITE_LIKELY : 0);
 
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
@@ -1819,7 +1822,7 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 	struct uffdio_writeprotect uffdio_wp;
 	struct uffdio_writeprotect __user *user_uffdio_wp;
 	struct userfaultfd_wake_range range;
-	bool mode_wp, mode_dontwake, mode_access_likely;
+	bool mode_wp, mode_dontwake, mode_access_likely, mode_write_likely;
 	uffd_flags_t uffd_flags;
 
 	if (atomic_read(&ctx->mmap_changing))
@@ -1838,18 +1841,21 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 
 	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
 			       UFFDIO_WRITEPROTECT_MODE_WP |
-			       UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY))
+			       UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY |
+			       UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY))
 		return -EINVAL;
 
 	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
 	mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
 	mode_access_likely = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY;
+	mode_write_likely = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY;
 
 	if (mode_wp && mode_dontwake)
 		return -EINVAL;
 
 	uffd_flags = (mode_wp ? UFFD_FLAGS_WP : 0) |
-		     (mode_access_likely ? UFFD_FLAGS_ACCESS_LIKELY : 0);
+		     (mode_access_likely ? UFFD_FLAGS_ACCESS_LIKELY : 0) |
+		     (mode_write_likely ? UFFD_FLAGS_WRITE_LIKELY : 0);
 
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
@@ -1902,10 +1908,10 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
 	    uffdio_continue.range.start) {
 		goto out;
 	}
-	if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE)
+	if (uffdio_continue.mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE))
 		goto out;
 
-	uffd_flags = UFFD_FLAGS_ACCESS_LIKELY;
+	uffd_flags = UFFD_FLAGS_ACCESS_LIKELY | UFFD_FLAGS_WRITE_LIKELY;
 
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mcopy_continue(ctx->mm, uffdio_continue.range.start,
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index e6ac165ec044..261a3fa750d0 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -59,6 +59,7 @@ typedef unsigned int __bitwise uffd_flags_t;
 
 #define UFFD_FLAGS_WP			((__force uffd_flags_t)BIT(0))
 #define UFFD_FLAGS_ACCESS_LIKELY	((__force uffd_flags_t)BIT(1))
+#define UFFD_FLAGS_WRITE_LIKELY		((__force uffd_flags_t)BIT(2))
 
 extern int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 				    struct vm_area_struct *dst_vma,
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index d9c8ce9ba777..6ad93a13282e 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -267,12 +267,20 @@ struct uffdio_copy {
 	 */
 	__s64 copy;
 	/*
-	 * UFFDIO_COPY_MODE_ACCESS_LIKELY will set the mapped page as young.
-	 * This can reduce the time that the first access to the page takes.
-	 * Yet, if set opportunistically to memory that is not used, it might
-	 * extend the time before the unused memory pages are reclaimed.
+	 * UFFDIO_COPY_MODE_ACCESS_LIKELY indicates that the memory is likely to
+	 * be accessed in the near future, in contrast to memory that is
+	 * opportunistically copied and might not be accessed. The kernel will
+	 * act accordingly, for instance by setting the access-bit in the PTE to
+	 * reduce the access time to the page.
+	 *
+	 * UFFDIO_COPY_MODE_WRITE_LIKELY indicates that the memory is likely to
+	 * be written to. The kernel will act accordingly, for instance by
+	 * setting the dirty-bit in the PTE to reduce the write time to the
+	 * page. This flag will be silently ignored if UFFDIO_COPY_MODE_WP is
+	 * set.
 	 */
-#define UFFDIO_COPY_MODE_ACCESS_LIKELY		((__u64)1<<3)
+#define UFFDIO_COPY_MODE_ACCESS_LIKELY		((__u64)1<<2)
+#define UFFDIO_COPY_MODE_WRITE_LIKELY		((__u64)1<<3)
 };
 
 struct uffdio_zeropage {
@@ -297,9 +305,11 @@ struct uffdio_writeprotect {
  * UFFDIO_WRITEPROTECT_MODE_DONTWAKE: set the flag to avoid waking up
  * any wait thread after the operation succeeds.
  *
- * UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY: set the flag to mark the modified
- * memory as young, which can reduce the time that the first access
- * to the page takes.
+ * UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY: set the flag to indicate the memory
+ * is likely to be accessed in the near future.
+ *
+ * UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY: set the flag to indicate that the
+ * memory is likely to be written to in the near future.
  *
  * NOTE: Write protecting a region (WP=1) is unrelated to page faults,
  * therefore DONTWAKE flag is meaningless with WP=1.  Removing write
@@ -309,6 +319,7 @@ struct uffdio_writeprotect {
 #define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
 #define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
 #define UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY	((__u64)1<<2)
+#define UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY	((__u64)1<<3)
 	__u64 mode;
 };
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2beff8a4bf7c..46814fc7762f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5962,6 +5962,9 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		*pagep = NULL;
 	}
 
+	/* The PTE is not marked as dirty unconditionally */
+	SetPageDirty(page);
+
 	/*
 	 * The memory barrier inside __SetPageUptodate makes sure that
 	 * preceding stores to the page contents become visible before
diff --git a/mm/shmem.c b/mm/shmem.c
index 89c775275bae..7488cd186c32 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2404,6 +2404,9 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 	VM_BUG_ON(PageSwapBacked(page));
 	__SetPageLocked(page);
 	__SetPageSwapBacked(page);
+
+	/* The PTE is not marked as dirty unconditionally */
+	SetPageDirty(page);
 	__SetPageUptodate(page);
 
 	ret = -EFAULT;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 140c8d3e946e..3172158d8faa 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -70,7 +70,6 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	pgoff_t offset, max_off;
 
 	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
-	_dst_pte = pte_mkdirty(_dst_pte);
 	if (page_in_cache && !vm_shared)
 		writable = false;
 
@@ -85,13 +84,18 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 
 	if (writable)
 		_dst_pte = pte_mkwrite(_dst_pte);
-	else
+	else {
 		/*
 		 * We need this to make sure write bit removed; as mk_pte()
 		 * could return a pte with write bit set.
 		 */
 		_dst_pte = pte_wrprotect(_dst_pte);
 
+		/* Marking RO entries as dirty can mess with other code */
+		if (uffd_flags & UFFD_FLAGS_WRITE_LIKELY)
+			_dst_pte = pte_mkdirty(_dst_pte);
+	}
+
 	if (uffd_flags & UFFD_FLAGS_ACCESS_LIKELY)
 		_dst_pte = pte_mkyoung(_dst_pte);
 
@@ -180,6 +184,9 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 		*pagep = NULL;
 	}
 
+	/* The PTE is not marked as dirty unconditionally */
+	SetPageDirty(page);
+
 	/*
 	 * The memory barrier inside __SetPageUptodate makes sure that
 	 * preceding stores to the page contents become visible before
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH v2 4/5] userfaultfd: zero access/write hints
  2022-06-19 23:34 [RFC PATCH v2 0/5] userfaultfd: support access/write hints Nadav Amit
                   ` (2 preceding siblings ...)
  2022-06-19 23:34 ` [RFC PATCH v2 3/5] userfaultfd: introduce write-likely " Nadav Amit
@ 2022-06-19 23:34 ` Nadav Amit
  2022-06-20 18:06   ` kernel test robot
  2022-06-21 17:04   ` Peter Xu
  2022-06-19 23:34 ` [RFC PATCH v2 5/5] selftest/userfaultfd: test read/write hints Nadav Amit
  4 siblings, 2 replies; 24+ messages in thread
From: Nadav Amit @ 2022-06-19 23:34 UTC (permalink / raw)
  To: linux-mm
  Cc: Nadav Amit, David Hildenbrand, Mike Kravetz, Hugh Dickins,
	Andrew Morton, Axel Rasmussen, Peter Xu, Mike Rapoport

From: Nadav Amit <namit@vmware.com>

When userfaultfd provides a zeropage in response to ioctl, it provides a
readonly alias to the zero page. If the page is later written (which is
the likely scenario), page-fault occurs and the page-fault allocator
allocates a page and rewires the page-tables.

This is an expensive flow for cases in which a page is likely be written
to. Users can use the copy ioctl to initialize zero page (by copying
zeros), but this is also wasteful.

Allow userfaultfd users to efficiently map initialized zero-pages that
are writable. Introduce UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY, which, when
provided would map a clear page instead of an alias to the zero page.

For consistency, introduce also UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY.

Suggested-by: David Hildenbrand <david@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 fs/userfaultfd.c                 | 14 +++++++++++--
 include/uapi/linux/userfaultfd.h |  2 ++
 mm/userfaultfd.c                 | 36 ++++++++++++++++++++++++++++++++
 3 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index a56983b594d5..ff073de78ea8 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1770,6 +1770,8 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
 	struct uffdio_zeropage uffdio_zeropage;
 	struct uffdio_zeropage __user *user_uffdio_zeropage;
 	struct userfaultfd_wake_range range;
+	bool mode_dontwake, mode_access_likely, mode_write_likely;
+	uffd_flags_t uffd_flags;
 
 	user_uffdio_zeropage = (struct uffdio_zeropage __user *) arg;
 
@@ -1788,8 +1790,16 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
 	if (ret)
 		goto out;
 	ret = -EINVAL;
-	if (uffdio_zeropage.mode & ~UFFDIO_ZEROPAGE_MODE_DONTWAKE)
-		goto out;
+
+	mode_dontwake = uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_DONTWAKE;
+	mode_access_likely = uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY;
+	mode_write_likely = uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY;
+
+	if (mode_dontwake)
+		return -EINVAL;
+
+	uffd_flags = (mode_access_likely ? UFFD_FLAGS_ACCESS_LIKELY : 0) |
+		     (mode_write_likely ? UFFD_FLAGS_WRITE_LIKELY : 0);
 
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mfill_zeropage(ctx->mm, uffdio_zeropage.range.start,
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 6ad93a13282e..b586b7c1e265 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -286,6 +286,8 @@ struct uffdio_copy {
 struct uffdio_zeropage {
 	struct uffdio_range range;
 #define UFFDIO_ZEROPAGE_MODE_DONTWAKE		((__u64)1<<0)
+#define UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY	((__u64)1<<2)
+#define UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY	((__u64)1<<3)
 	__u64 mode;
 
 	/*
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 3172158d8faa..5dfbb1e80369 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -249,6 +249,38 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm,
 	return ret;
 }
 
+static int mfill_clearpage_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
+			       struct vm_area_struct *dst_vma,
+			       unsigned long dst_addr,
+			       uffd_flags_t uffd_flags)
+{
+	struct page *page;
+	int ret;
+
+	ret = -ENOMEM;
+	page = alloc_zeroed_user_highpage_movable(dst_vma, dst_addr);
+	if (!page)
+		goto out;
+
+	/* The PTE is not marked as dirty unconditionally */
+	SetPageDirty(page);
+	__SetPageUptodate(page);
+
+	ret = -ENOMEM;
+	if (mem_cgroup_charge(page_folio(page), dst_vma->vm_mm, GFP_KERNEL))
+		goto out_release;
+
+	ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
+				       page, true, uffd_flags);
+	if (ret)
+		goto out_release;
+out:
+	return ret;
+out_release:
+	put_page(page);
+	goto out;
+}
+
 /* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */
 static int mcontinue_atomic_pte(struct mm_struct *dst_mm,
 				pmd_t *dst_pmd,
@@ -511,6 +543,10 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
 					       dst_addr, src_addr, page,
 					       uffd_flags);
+		else if (!(uffd_flags & UFFD_FLAGS_WP) &&
+			 (uffd_flags & UFFD_FLAGS_WRITE_LIKELY))
+			err = mfill_clearpage_pte(dst_mm, dst_pmd, dst_vma,
+						  dst_addr, uffd_flags);
 		else
 			err = mfill_zeropage_pte(dst_mm, dst_pmd,
 						 dst_vma, dst_addr, uffd_flags);
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH v2 5/5] selftest/userfaultfd: test read/write hints
  2022-06-19 23:34 [RFC PATCH v2 0/5] userfaultfd: support access/write hints Nadav Amit
                   ` (3 preceding siblings ...)
  2022-06-19 23:34 ` [RFC PATCH v2 4/5] userfaultfd: zero access/write hints Nadav Amit
@ 2022-06-19 23:34 ` Nadav Amit
  4 siblings, 0 replies; 24+ messages in thread
From: Nadav Amit @ 2022-06-19 23:34 UTC (permalink / raw)
  To: linux-mm
  Cc: Nadav Amit, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, Peter Xu, David Hildenbrand, Mike Rapoport

From: Nadav Amit <namit@vmware.com>

Test UFFDIO_*_MODE_ACCESS_LIKELY and UFFDIO_*_MODE_WRITE_LIKELY.
Introduce a modifier to trigger the use of the hints.

Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 tools/testing/selftests/vm/userfaultfd.c | 26 ++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 28b881523d15..01680a7d1cdd 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -88,6 +88,8 @@ static volatile bool test_uffdio_zeropage_eexist = true;
 static bool test_uffdio_wp = true;
 /* Whether to test uffd minor faults */
 static bool test_uffdio_minor = false;
+static bool test_access_likely;
+static bool test_write_likely;
 
 static bool map_shared;
 static int shm_fd;
@@ -550,6 +552,12 @@ static void wp_range(int ufd, __u64 start, __u64 len, bool wp)
 	/* Undo write-protect, do wakeup after that */
 	prms.mode = wp ? UFFDIO_WRITEPROTECT_MODE_WP : 0;
 
+	if (test_access_likely)
+		prms.mode |= UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY;
+
+	if (test_write_likely)
+		prms.mode |= UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY;
+
 	if (ioctl(ufd, UFFDIO_WRITEPROTECT, &prms))
 		err("clear WP failed: address=0x%"PRIx64, (uint64_t)start);
 }
@@ -653,6 +661,13 @@ static int __copy_page(int ufd, unsigned long offset, bool retry)
 		uffdio_copy.mode = UFFDIO_COPY_MODE_WP;
 	else
 		uffdio_copy.mode = 0;
+
+	if (test_access_likely)
+		uffdio_copy.mode |= UFFDIO_COPY_MODE_ACCESS_LIKELY;
+
+	if (test_write_likely)
+		uffdio_copy.mode |= UFFDIO_COPY_MODE_WRITE_LIKELY;
+
 	uffdio_copy.copy = 0;
 	if (ioctl(ufd, UFFDIO_COPY, &uffdio_copy)) {
 		/* real retval in ufdio_copy.copy */
@@ -1080,6 +1095,13 @@ static int __uffdio_zeropage(int ufd, unsigned long offset, bool retry)
 	uffdio_zeropage.range.start = (unsigned long) area_dst + offset;
 	uffdio_zeropage.range.len = page_size;
 	uffdio_zeropage.mode = 0;
+
+	if (test_access_likely)
+		uffdio_zeropage.mode |= UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY;
+
+	if (test_write_likely)
+		uffdio_zeropage.mode |= UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY;
+
 	ret = ioctl(ufd, UFFDIO_ZEROPAGE, &uffdio_zeropage);
 	res = uffdio_zeropage.zeropage;
 	if (ret) {
@@ -1648,6 +1670,10 @@ static void parse_test_type_arg(const char *raw_type)
 			set_test_type(token);
 		else if (!strcmp(token, "dev"))
 			test_dev_userfaultfd = true;
+		else if (!strcmp(token, "access_likely"))
+			test_access_likely = true;
+		else if (!strcmp(token, "write_likely"))
+			test_write_likely = true;
 		else
 			err("unrecognized test mod '%s'", token);
 	}
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 2/5] userfaultfd: introduce access-likely mode for copy/wp operations
  2022-06-19 23:34 ` [RFC PATCH v2 2/5] userfaultfd: introduce access-likely mode for copy/wp operations Nadav Amit
@ 2022-06-20 10:33   ` kernel test robot
  2022-06-21  8:48   ` David Hildenbrand
  1 sibling, 0 replies; 24+ messages in thread
From: kernel test robot @ 2022-06-20 10:33 UTC (permalink / raw)
  To: Nadav Amit; +Cc: llvm, kbuild-all

Hi Nadav,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on linus/master v5.19-rc2 next-20220617]
[cannot apply to shuah-kselftest/next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Nadav-Amit/userfaultfd-support-access-write-hints/20220620-150942
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
config: hexagon-buildonly-randconfig-r001-20220620 (https://download.01.org/0day-ci/archive/20220620/202206201856.LMF3y588-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project af6d2a0b6825e71965f3e2701a63c239fa0ad70f)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/ad1b8120bf00d273f9c786fe63c5c1fbcd7ff43c
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Nadav-Amit/userfaultfd-support-access-write-hints/20220620-150942
        git checkout ad1b8120bf00d273f9c786fe63c5c1fbcd7ff43c
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> mm/userfaultfd.c:700:23: warning: parameter 'uffd_flags' set but not used [-Wunused-but-set-parameter]
                          uffd_flags_t uffd_flags)
                                       ^
   mm/userfaultfd.c:711:23: warning: parameter 'uffd_flags' set but not used [-Wunused-but-set-parameter]
                          uffd_flags_t uffd_flags)
                                       ^
   2 warnings generated.


vim +/uffd_flags +700 mm/userfaultfd.c

c1a4de99fada21 Andrea Arcangeli 2015-09-04  697  
c1a4de99fada21 Andrea Arcangeli 2015-09-04  698  ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
0b36999093ff22 Nadav Amit       2022-06-19  699  		       unsigned long len, atomic_t *mmap_changing,
0b36999093ff22 Nadav Amit       2022-06-19 @700  		       uffd_flags_t uffd_flags)
c1a4de99fada21 Andrea Arcangeli 2015-09-04  701  {
ad1b8120bf00d2 Nadav Amit       2022-06-19  702  	/* There is no cost for setting the access bit of a zeropage */
ad1b8120bf00d2 Nadav Amit       2022-06-19  703  	uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
ad1b8120bf00d2 Nadav Amit       2022-06-19  704  
f619147104c8ea Axel Rasmussen   2021-05-04  705  	return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_ZEROPAGE,
f619147104c8ea Axel Rasmussen   2021-05-04  706  			      mmap_changing, 0);
f619147104c8ea Axel Rasmussen   2021-05-04  707  }
f619147104c8ea Axel Rasmussen   2021-05-04  708  

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 4/5] userfaultfd: zero access/write hints
  2022-06-19 23:34 ` [RFC PATCH v2 4/5] userfaultfd: zero access/write hints Nadav Amit
@ 2022-06-20 18:06   ` kernel test robot
  2022-06-21 17:04   ` Peter Xu
  1 sibling, 0 replies; 24+ messages in thread
From: kernel test robot @ 2022-06-20 18:06 UTC (permalink / raw)
  To: Nadav Amit; +Cc: llvm, kbuild-all

Hi Nadav,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on linus/master v5.19-rc2 next-20220617]
[cannot apply to shuah-kselftest/next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Nadav-Amit/userfaultfd-support-access-write-hints/20220620-150942
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
config: i386-randconfig-a011-20220620 (https://download.01.org/0day-ci/archive/20220621/202206210125.0A6porLp-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project af6d2a0b6825e71965f3e2701a63c239fa0ad70f)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/e5d70abf56dd545f8ea3abce34e8af894745c2bc
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Nadav-Amit/userfaultfd-support-access-write-hints/20220620-150942
        git checkout e5d70abf56dd545f8ea3abce34e8af894745c2bc
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> fs/userfaultfd.c:1774:15: warning: variable 'uffd_flags' set but not used [-Wunused-but-set-variable]
           uffd_flags_t uffd_flags;
                        ^
   1 warning generated.


vim +/uffd_flags +1774 fs/userfaultfd.c

  1765	
  1766	static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
  1767					unsigned long arg)
  1768	{
  1769		__s64 ret;
  1770		struct uffdio_zeropage uffdio_zeropage;
  1771		struct uffdio_zeropage __user *user_uffdio_zeropage;
  1772		struct userfaultfd_wake_range range;
  1773		bool mode_dontwake, mode_access_likely, mode_write_likely;
> 1774		uffd_flags_t uffd_flags;
  1775	
  1776		user_uffdio_zeropage = (struct uffdio_zeropage __user *) arg;
  1777	
  1778		ret = -EAGAIN;
  1779		if (atomic_read(&ctx->mmap_changing))
  1780			goto out;
  1781	
  1782		ret = -EFAULT;
  1783		if (copy_from_user(&uffdio_zeropage, user_uffdio_zeropage,
  1784				   /* don't copy "zeropage" last field */
  1785				   sizeof(uffdio_zeropage)-sizeof(__s64)))
  1786			goto out;
  1787	
  1788		ret = validate_range(ctx->mm, uffdio_zeropage.range.start,
  1789				     uffdio_zeropage.range.len);
  1790		if (ret)
  1791			goto out;
  1792		ret = -EINVAL;
  1793	
  1794		mode_dontwake = uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_DONTWAKE;
  1795		mode_access_likely = uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY;
  1796		mode_write_likely = uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY;
  1797	
  1798		if (mode_dontwake)
  1799			return -EINVAL;
  1800	
  1801		uffd_flags = (mode_access_likely ? UFFD_FLAGS_ACCESS_LIKELY : 0) |
  1802			     (mode_write_likely ? UFFD_FLAGS_WRITE_LIKELY : 0);
  1803	
  1804		if (mmget_not_zero(ctx->mm)) {
  1805			ret = mfill_zeropage(ctx->mm, uffdio_zeropage.range.start,
  1806					     uffdio_zeropage.range.len,
  1807					     &ctx->mmap_changing, 0);
  1808			mmput(ctx->mm);
  1809		} else {
  1810			return -ESRCH;
  1811		}
  1812		if (unlikely(put_user(ret, &user_uffdio_zeropage->zeropage)))
  1813			return -EFAULT;
  1814		if (ret < 0)
  1815			goto out;
  1816		/* len == 0 would wake all */
  1817		BUG_ON(!ret);
  1818		range.len = ret;
  1819		if (!(uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_DONTWAKE)) {
  1820			range.start = uffdio_zeropage.range.start;
  1821			wake_userfault(ctx, &range);
  1822		}
  1823		ret = range.len == uffdio_zeropage.range.len ? 0 : -EAGAIN;
  1824	out:
  1825		return ret;
  1826	}
  1827	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 1/5] userfaultfd: introduce uffd_flags
  2022-06-19 23:34 ` [RFC PATCH v2 1/5] userfaultfd: introduce uffd_flags Nadav Amit
@ 2022-06-21  8:41   ` David Hildenbrand
  2022-06-21 15:31     ` Peter Xu
  2022-06-21 15:29   ` Peter Xu
  1 sibling, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2022-06-21  8:41 UTC (permalink / raw)
  To: Nadav Amit, linux-mm
  Cc: Nadav Amit, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, Peter Xu, Mike Rapoport

On 20.06.22 01:34, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> As the next patches are going to introduce more information that needs
> to be propagated regarding handled user requests, introduce uffd_flags
> that would be used to propagate this information.
> 
> Remove the unused UFFD_FLAGS_SET to avoid confusion in the constant
> names.
> 
> Introducing uffd flags also allows to avoid mm/userfaultfd from being
> using uapi (e.g., UFFDIO_COPY_MODE_WP).
> 
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  fs/userfaultfd.c              | 20 +++++++++----
>  include/linux/hugetlb.h       |  4 +--
>  include/linux/shmem_fs.h      |  8 ++++--
>  include/linux/userfaultfd_k.h | 23 +++++++++------
>  mm/hugetlb.c                  |  3 +-
>  mm/shmem.c                    |  6 ++--
>  mm/userfaultfd.c              | 53 ++++++++++++++++++-----------------
>  7 files changed, 68 insertions(+), 49 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index d398f6bf6d74..5daafa54eb3f 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1700,6 +1700,8 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
>  	struct uffdio_copy uffdio_copy;
>  	struct uffdio_copy __user *user_uffdio_copy;
>  	struct userfaultfd_wake_range range;
> +	bool mode_wp;
> +	uffd_flags_t uffd_flags;
>  
>  	user_uffdio_copy = (struct uffdio_copy __user *) arg;
>  
> @@ -1726,10 +1728,15 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
>  		goto out;
>  	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP))
>  		goto out;
> +
> +	mode_wp = uffdio_copy.mode & UFFDIO_COPY_MODE_WP;
> +
> +	uffd_flags = mode_wp ? UFFD_FLAGS_WP : 0;

why not simply

uffd_flags = 0;
if (uffdio_copy.mode & UFFDIO_COPY_MODE_WP)
	uffd_flags |= UFFD_FLAGS_WP;

?

[...]

> index eee374c29c85..6331148023c1 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -34,7 +34,6 @@
>  #define UFFD_NONBLOCK O_NONBLOCK
>  
>  #define UFFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
> -#define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS)
>  
>  extern int sysctl_unprivileged_userfaultfd;
>  
> @@ -56,23 +55,29 @@ enum mcopy_atomic_mode {
>  	MCOPY_ATOMIC_CONTINUE,
>  };
>  
> +typedef unsigned int __bitwise uffd_flags_t;
> +

Instead of using 0 when no flags are defined,  add

#define UFFD_FLAGS_NONE		((__force uffd_flags_t)0)


which makes it easier to understand at callsites what's happening


Apart from these two things

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 2/5] userfaultfd: introduce access-likely mode for copy/wp operations
  2022-06-19 23:34 ` [RFC PATCH v2 2/5] userfaultfd: introduce access-likely mode for copy/wp operations Nadav Amit
  2022-06-20 10:33   ` kernel test robot
@ 2022-06-21  8:48   ` David Hildenbrand
  2022-06-21 15:42     ` Peter Xu
  2022-06-21 17:27     ` Nadav Amit
  1 sibling, 2 replies; 24+ messages in thread
From: David Hildenbrand @ 2022-06-21  8:48 UTC (permalink / raw)
  To: Nadav Amit, linux-mm
  Cc: Nadav Amit, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, Peter Xu, Mike Rapoport

On 20.06.22 01:34, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> Using a PTE on x86 with cleared access-bit (aka young-bit)
> takes ~600 cycles more than when the access bit is set. At the same
> time, setting the access-bit for memory that is not used (e.g.,
> prefetched) can introduce greater overheads, as the prefetched memory is
> reclaimed later than it should be.
> 
> Userfaultfd currently does not set the access-bit (excluding the
> huge-pages case). Arguably, it is best to let the user control whether
> the access bit should be set or not. The expected use is to request
> userfaultfd to set the access-bit when the copy/wp operation is done to
> resolve a page-fault, and not to set the access-bit when the memory is
> prefetched.
> 
> Introduce UFFDIO_COPY_MODE_ACCESS_LIKELY and
> UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY to enable userspace to request
> the young bit to be set. Set for UFFDIO_CONTINUE and UFFDIO_ZEROPAGE the
> bit unconditionally since the former is only used to resolve page-faults
> and the latter would not benefit from not setting the access-bit.
> 
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  fs/userfaultfd.c                 | 23 ++++++++++++++++-------
>  include/linux/userfaultfd_k.h    |  1 +
>  include/uapi/linux/userfaultfd.h | 20 +++++++++++++++++++-
>  mm/userfaultfd.c                 | 18 ++++++++++++++++--
>  4 files changed, 52 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 5daafa54eb3f..35a8c4347c54 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1700,7 +1700,7 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
>  	struct uffdio_copy uffdio_copy;
>  	struct uffdio_copy __user *user_uffdio_copy;
>  	struct userfaultfd_wake_range range;
> -	bool mode_wp;
> +	bool mode_wp, mode_access_likely;
>  	uffd_flags_t uffd_flags;
>  
>  	user_uffdio_copy = (struct uffdio_copy __user *) arg;
> @@ -1726,12 +1726,15 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
>  	ret = -EINVAL;
>  	if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
>  		goto out;
> -	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP))
> +	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP|
> +				 UFFDIO_COPY_MODE_ACCESS_LIKELY))
>  		goto out;
>  
>  	mode_wp = uffdio_copy.mode & UFFDIO_COPY_MODE_WP;
> +	mode_access_likely = uffdio_copy.mode & UFFDIO_COPY_MODE_ACCESS_LIKELY;

I *relly* prefer just

if (uffdio_copy.mode & UFFDIO_COPY_MODE_ACCESS_LIKELY)
	uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY
[...]

>  
> -	uffd_flags = (mode_wp ? UFFD_FLAGS_WP : 0);
> +	uffd_flags = (mode_wp ? UFFD_FLAGS_WP : 0) |
> +		     (mode_access_likely ? UFFD_FLAGS_ACCESS_LIKELY : 0);
>  

Dito.

>  	if (mmget_not_zero(ctx->mm)) {
>  		ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
> @@ -1871,6 +1877,7 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
>  	struct uffdio_continue uffdio_continue;
>  	struct uffdio_continue __user *user_uffdio_continue;
>  	struct userfaultfd_wake_range range;
> +	uffd_flags_t uffd_flags;
>  
>  	user_uffdio_continue = (struct uffdio_continue __user *)arg;
>  
> @@ -1898,10 +1905,12 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
>  	if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE)
>  		goto out;
>  
> +	uffd_flags = UFFD_FLAGS_ACCESS_LIKELY;

Can we add a comment why that makes sense? I think I know why -- someone
is stuck waiting for that continue to happen :)

> +
>  	if (mmget_not_zero(ctx->mm)) {
>  		ret = mcopy_continue(ctx->mm, uffdio_continue.range.start,
>  				     uffdio_continue.range.len,
> -				     &ctx->mmap_changing, 0);
> +				     &ctx->mmap_changing, uffd_flags);
>  		mmput(ctx->mm);
>  	} else {
>  		return -ESRCH;
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index 6331148023c1..e6ac165ec044 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -58,6 +58,7 @@ enum mcopy_atomic_mode {
>  typedef unsigned int __bitwise uffd_flags_t;
>  
>  #define UFFD_FLAGS_WP			((__force uffd_flags_t)BIT(0))
> +#define UFFD_FLAGS_ACCESS_LIKELY	((__force uffd_flags_t)BIT(1))
>  
>  extern int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>  				    struct vm_area_struct *dst_vma,
> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> index 005e5e306266..d9c8ce9ba777 100644
> --- a/include/uapi/linux/userfaultfd.h
> +++ b/include/uapi/linux/userfaultfd.h
> @@ -38,7 +38,8 @@
>  			   UFFD_FEATURE_MINOR_HUGETLBFS |	\
>  			   UFFD_FEATURE_MINOR_SHMEM |		\
>  			   UFFD_FEATURE_EXACT_ADDRESS |		\
> -			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM)
> +			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM |	\
> +			   UFFD_FEATURE_ACCESS_HINTS)
>  #define UFFD_API_IOCTLS				\
>  	((__u64)1 << _UFFDIO_REGISTER |		\
>  	 (__u64)1 << _UFFDIO_UNREGISTER |	\
> @@ -203,6 +204,10 @@ struct uffdio_api {
>  	 *
>  	 * UFFD_FEATURE_WP_HUGETLBFS_SHMEM indicates that userfaultfd
>  	 * write-protection mode is supported on both shmem and hugetlbfs.
> +	 *
> +	 * UFFD_FEATURE_ACCESS_HINTS indicates that the copy supports
> +	 * UFFDIO_COPY_MODE_ACCESS_LIKELY supports
> +	 * UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY.

I think that sentence doesn't make sense.

>  	 */
>  #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
>  #define UFFD_FEATURE_EVENT_FORK			(1<<1)
> @@ -217,6 +222,7 @@ struct uffdio_api {
>  #define UFFD_FEATURE_MINOR_SHMEM		(1<<10)
>  #define UFFD_FEATURE_EXACT_ADDRESS		(1<<11)
>  #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM		(1<<12)
> +#define UFFD_FEATURE_ACCESS_HINTS		(1<<13)
>  	__u64 features;
>  
>  	__u64 ioctls;
> @@ -260,6 +266,13 @@ struct uffdio_copy {
>  	 * copy_from_user will not read the last 8 bytes.
>  	 */
>  	__s64 copy;
> +	/*
> +	 * UFFDIO_COPY_MODE_ACCESS_LIKELY will set the mapped page as young.

Setting the page young is an implementation detail. Can you phrase it
more generically what the effect of that hint might be?

> +	 * This can reduce the time that the first access to the page takes.
> +	 * Yet, if set opportunistically to memory that is not used, it might
> +	 * extend the time before the unused memory pages are reclaimed.
> +	 */
> +#define UFFDIO_COPY_MODE_ACCESS_LIKELY		((__u64)1<<3)
>  };
>  
>  struct uffdio_zeropage {
> @@ -284,6 +297,10 @@ struct uffdio_writeprotect {
>   * UFFDIO_WRITEPROTECT_MODE_DONTWAKE: set the flag to avoid waking up
>   * any wait thread after the operation succeeds.
>   *
> + * UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY: set the flag to mark the modified
> + * memory as young, which can reduce the time that the first access
> + * to the page takes.

Dito.

> + *
>   * NOTE: Write protecting a region (WP=1) is unrelated to page faults,
>   * therefore DONTWAKE flag is meaningless with WP=1.  Removing write
>   * protection (WP=0) in response to a page fault wakes the faulting
> @@ -291,6 +308,7 @@ struct uffdio_writeprotect {
>   */
>  #define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
>  #define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
> +#define UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY	((__u64)1<<2)
>  	__u64 mode;
>  };


[...]

> @@ -691,6 +699,9 @@ ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
>  		       unsigned long len, atomic_t *mmap_changing,
>  		       uffd_flags_t uffd_flags)
>  {
> +	/* There is no cost for setting the access bit of a zeropage */
> +	uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
> +
>  	return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_ZEROPAGE,
>  			      mmap_changing, 0);
>  }
> @@ -699,6 +710,9 @@ ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long start,
>  		       unsigned long len, atomic_t *mmap_changing,
>  		       uffd_flags_t uffd_flags)
>  {
> +	/* The page is likely to be accessed */
> +	uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;

Shoouldn't that be set by the caller already?

> +
>  	return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_CONTINUE,
>  			      mmap_changing, 0);
>  }


In general, LGTM.


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 1/5] userfaultfd: introduce uffd_flags
  2022-06-19 23:34 ` [RFC PATCH v2 1/5] userfaultfd: introduce uffd_flags Nadav Amit
  2022-06-21  8:41   ` David Hildenbrand
@ 2022-06-21 15:29   ` Peter Xu
  2022-06-21 17:41     ` Nadav Amit
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Xu @ 2022-06-21 15:29 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, Nadav Amit, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport

On Sun, Jun 19, 2022 at 04:34:45PM -0700, Nadav Amit wrote:
> @@ -683,30 +681,33 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
>  
>  ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
>  		     unsigned long src_start, unsigned long len,
> -		     atomic_t *mmap_changing, __u64 mode)
> +		     atomic_t *mmap_changing, uffd_flags_t uffd_flags)
>  {
>  	return __mcopy_atomic(dst_mm, dst_start, src_start, len,
> -			      MCOPY_ATOMIC_NORMAL, mmap_changing, mode);
> +			      MCOPY_ATOMIC_NORMAL, mmap_changing, uffd_flags);
>  }
>  
>  ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
> -		       unsigned long len, atomic_t *mmap_changing)
> +		       unsigned long len, atomic_t *mmap_changing,
> +		       uffd_flags_t uffd_flags)

Should this be fed into the last parameter of __mcopy_atomic() below?

>  {
>  	return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_ZEROPAGE,
>  			      mmap_changing, 0);
>  }
>  
>  ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long start,
> -		       unsigned long len, atomic_t *mmap_changing)
> +		       unsigned long len, atomic_t *mmap_changing,
> +		       uffd_flags_t uffd_flags)

Same question here..

>  {
>  	return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_CONTINUE,
>  			      mmap_changing, 0);
>  }

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 1/5] userfaultfd: introduce uffd_flags
  2022-06-21  8:41   ` David Hildenbrand
@ 2022-06-21 15:31     ` Peter Xu
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Xu @ 2022-06-21 15:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nadav Amit, linux-mm, Nadav Amit, Mike Kravetz, Hugh Dickins,
	Andrew Morton, Axel Rasmussen, Mike Rapoport

On Tue, Jun 21, 2022 at 10:41:19AM +0200, David Hildenbrand wrote:
> > @@ -1726,10 +1728,15 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
> >  		goto out;
> >  	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP))
> >  		goto out;
> > +
> > +	mode_wp = uffdio_copy.mode & UFFDIO_COPY_MODE_WP;
> > +
> > +	uffd_flags = mode_wp ? UFFD_FLAGS_WP : 0;
> 
> why not simply
> 
> uffd_flags = 0;
> if (uffdio_copy.mode & UFFDIO_COPY_MODE_WP)
> 	uffd_flags |= UFFD_FLAGS_WP;
> 
> ?

Seconded, or even one-line it?

        uffd_flags = (uffdio_copy.mode & UFFDIO_COPY_MODE_WP) ? UFFD_FLAGS_WP : 0;

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 2/5] userfaultfd: introduce access-likely mode for copy/wp operations
  2022-06-21  8:48   ` David Hildenbrand
@ 2022-06-21 15:42     ` Peter Xu
  2022-06-21 17:27     ` Nadav Amit
  1 sibling, 0 replies; 24+ messages in thread
From: Peter Xu @ 2022-06-21 15:42 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nadav Amit, linux-mm, Nadav Amit, Mike Kravetz, Hugh Dickins,
	Andrew Morton, Axel Rasmussen, Mike Rapoport

On Tue, Jun 21, 2022 at 10:48:51AM +0200, David Hildenbrand wrote:
> > @@ -1871,6 +1877,7 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
> >  	struct uffdio_continue uffdio_continue;
> >  	struct uffdio_continue __user *user_uffdio_continue;
> >  	struct userfaultfd_wake_range range;
> > +	uffd_flags_t uffd_flags;
> >  
> >  	user_uffdio_continue = (struct uffdio_continue __user *)arg;
> >  
> > @@ -1898,10 +1905,12 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
> >  	if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE)
> >  		goto out;
> >  
> > +	uffd_flags = UFFD_FLAGS_ACCESS_LIKELY;
> 
> Can we add a comment why that makes sense? I think I know why -- someone
> is stuck waiting for that continue to happen :)

I think we shouldn't apply it by default for CONTINUE at all, at least not
sololy for CONTINUE.  CONTINUE can be used similarly as COPY at least in VM
migration use case, afaict, so we can proactively install pgtables even if
the page was not faulted.

[...]

> > diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> > index 005e5e306266..d9c8ce9ba777 100644
> > --- a/include/uapi/linux/userfaultfd.h
> > +++ b/include/uapi/linux/userfaultfd.h
> > @@ -38,7 +38,8 @@
> >  			   UFFD_FEATURE_MINOR_HUGETLBFS |	\
> >  			   UFFD_FEATURE_MINOR_SHMEM |		\
> >  			   UFFD_FEATURE_EXACT_ADDRESS |		\
> > -			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM)
> > +			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM |	\
> > +			   UFFD_FEATURE_ACCESS_HINTS)

Is the access_hint feature gonna cover the next dirty bit patch?  If so I'd
suggest we add the feature declaration in a separate patch after all bits
ready.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 3/5] userfaultfd: introduce write-likely mode for copy/wp operations
  2022-06-19 23:34 ` [RFC PATCH v2 3/5] userfaultfd: introduce write-likely " Nadav Amit
@ 2022-06-21 16:38   ` Peter Xu
  2022-06-21 17:14     ` Nadav Amit
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Xu @ 2022-06-21 16:38 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, Nadav Amit, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport

Hi, Nadav,

On Sun, Jun 19, 2022 at 04:34:47PM -0700, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> Commit 9ae0f87d009ca ("mm/shmem: unconditionally set pte dirty in
> mfill_atomic_install_pte") has set PTEs as dirty as its title indicates.
> However, setting read-only PTEs as dirty can have several undesired
> implications.
> 
> First, setting read-only PTEs as dirty, can cause these PTEs to become
> writable during mprotect() syscall. See in change_pte_range():
> 
> 	/* Avoid taking write faults for known dirty pages */
> 	if (dirty_accountable && pte_dirty(ptent) &&
> 			(pte_soft_dirty(ptent) ||
> 			 !(vma->vm_flags & VM_SOFTDIRTY))) {
> 		ptent = pte_mkwrite(ptent);
> 	}

IMHO this is not really the direct reason to add the feature to "allow user
to specify whether dirty bit be set for UFFDIO_COPY/... ioctl", because
IIUC what's missing is the pte_uffd_wp() check that I should have added in
the shmem+hugetlb uffd-wp series in change_pte_range() but I missed..

But since this is fixed by David's patch to optimize mprotect() altogether
which checks pte_uffd_wp() (and afaict that's only needed after the
shmem+hugetlb patchset, not before), so I think we're safe now with all the
branches.

So IMHO we don't need to mention this as it's kind of misleading.  It'll be
welcomed if you want to recover the pte_dirty behavior in
mfill_atomic_install_pte() but probably this is not the right patch for it?

> 
> Second, unmapping read-only dirty PTEs often prevents TLB flush batching.
> See try_to_unmap_one():
> 
> 	/*
> 	 * Page is dirty. Flush the TLB if a writable entry
> 	 * potentially exists to avoid CPU writes after IO
> 	 * starts and then write it out here.
> 	 */
> 	try_to_unmap_flush_dirty();
> 
> Similarly batching TLB flushed might be prevented in zap_pte_range():
> 
> 	if (!PageAnon(page)) {
> 		if (pte_dirty(ptent)) {
> 			force_flush = 1;
> 			set_page_dirty(page);
> 		}
> 	...

I still keep the pure question here (which I asked in the private reply to
you) on why we'd like only pte_dirty() but not pte_write() && pte_dirty()
here.  I'll rewrite what I have in the private email to you here again that
I think should be the write thing to do here..

        if (!PageAnon(page)) {
                if (pte_dirty(ptent)) {
                        set_page_dirty(page);
                        if (pte_write(ptent))
                                force_flush = 1;
                }
        }

I also mentioned the other example of doing mprotect(PROT_READ) upon a
dirty pte can also create a pte with dirty=1 and write=0 which should be
the same condition we have here with uffd, afaict. So it's at least not a
solo problem for uffd, and I still think if we treat this tlb flush issue
as a perf bug we should consider fixing up the tlb flush code instead
because otherwise the "mprotect(PROT_READ) upon dirty pte" will also be
able to hit it.

Meanwhile, I'll have the same comment that this is not helping any reviewer
to understand why we need to grant user ability to conditionally set dirty
bit from uABI, so I think it's better to drop this paragraph too for this
patch alone.

> 
> In general, setting a PTE as dirty seems for read-only entries might be
> dangerous. It should be reminded the dirty-COW vulnerability mitigation
> also relies on the dirty bit being set only after COW (although it does
> not appear to apply to userfaultfd).
> 
> To summarize, setting the dirty bit for read-only PTEs is dangerous. But
> even if we only consider writable pages, always setting the dirty bit or
> always leaving it clear, does not seem as the best policy. Leaving the
> bit clear introduces overhead on the first write-access to set the bit.
> Setting the bit for pages the are eventually not written to can require
> more TLB flushes.

And IMHO only this paragraph is the real and proper reasoning for this
patch..

> 
> Let the userfaultfd users control whether PTEs are marked as dirty or
> clean. Introduce UFFDIO_COPY_MODE_WRITE and
> UFFDIO_COPY_MODE_WRITE_LIKELY and UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY
> to enable userspace to indicate whether pages are likely to be written
> and set the dirty-bit if they are likely to be written.
> 
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  fs/userfaultfd.c                 | 22 ++++++++++++++--------
>  include/linux/userfaultfd_k.h    |  1 +
>  include/uapi/linux/userfaultfd.h | 27 +++++++++++++++++++--------
>  mm/hugetlb.c                     |  3 +++
>  mm/shmem.c                       |  3 +++
>  mm/userfaultfd.c                 | 11 +++++++++--
>  6 files changed, 49 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 35a8c4347c54..a56983b594d5 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1700,7 +1700,7 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
>  	struct uffdio_copy uffdio_copy;
>  	struct uffdio_copy __user *user_uffdio_copy;
>  	struct userfaultfd_wake_range range;
> -	bool mode_wp, mode_access_likely;
> +	bool mode_wp, mode_access_likely, mode_write_likely;
>  	uffd_flags_t uffd_flags;
>  
>  	user_uffdio_copy = (struct uffdio_copy __user *) arg;
> @@ -1727,14 +1727,17 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
>  	if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
>  		goto out;
>  	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP|
> -				 UFFDIO_COPY_MODE_ACCESS_LIKELY))
> +				 UFFDIO_COPY_MODE_ACCESS_LIKELY|
> +				 UFFDIO_COPY_MODE_WRITE_LIKELY))
>  		goto out;
>  
>  	mode_wp = uffdio_copy.mode & UFFDIO_COPY_MODE_WP;
>  	mode_access_likely = uffdio_copy.mode & UFFDIO_COPY_MODE_ACCESS_LIKELY;
> +	mode_write_likely = uffdio_copy.mode & UFFDIO_COPY_MODE_WRITE_LIKELY;
>  
>  	uffd_flags = (mode_wp ? UFFD_FLAGS_WP : 0) |
> -		     (mode_access_likely ? UFFD_FLAGS_ACCESS_LIKELY : 0);
> +		     (mode_access_likely ? UFFD_FLAGS_ACCESS_LIKELY : 0) |
> +		     (mode_write_likely ? UFFD_FLAGS_WRITE_LIKELY : 0);
>  
>  	if (mmget_not_zero(ctx->mm)) {
>  		ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
> @@ -1819,7 +1822,7 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
>  	struct uffdio_writeprotect uffdio_wp;
>  	struct uffdio_writeprotect __user *user_uffdio_wp;
>  	struct userfaultfd_wake_range range;
> -	bool mode_wp, mode_dontwake, mode_access_likely;
> +	bool mode_wp, mode_dontwake, mode_access_likely, mode_write_likely;
>  	uffd_flags_t uffd_flags;
>  
>  	if (atomic_read(&ctx->mmap_changing))
> @@ -1838,18 +1841,21 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
>  
>  	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
>  			       UFFDIO_WRITEPROTECT_MODE_WP |
> -			       UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY))
> +			       UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY |
> +			       UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY))
>  		return -EINVAL;
>  
>  	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
>  	mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
>  	mode_access_likely = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY;
> +	mode_write_likely = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY;
>  
>  	if (mode_wp && mode_dontwake)
>  		return -EINVAL;
>  
>  	uffd_flags = (mode_wp ? UFFD_FLAGS_WP : 0) |
> -		     (mode_access_likely ? UFFD_FLAGS_ACCESS_LIKELY : 0);
> +		     (mode_access_likely ? UFFD_FLAGS_ACCESS_LIKELY : 0) |
> +		     (mode_write_likely ? UFFD_FLAGS_WRITE_LIKELY : 0);
>  
>  	if (mmget_not_zero(ctx->mm)) {
>  		ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
> @@ -1902,10 +1908,10 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
>  	    uffdio_continue.range.start) {
>  		goto out;
>  	}
> -	if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE)
> +	if (uffdio_continue.mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE))
>  		goto out;
>  
> -	uffd_flags = UFFD_FLAGS_ACCESS_LIKELY;
> +	uffd_flags = UFFD_FLAGS_ACCESS_LIKELY | UFFD_FLAGS_WRITE_LIKELY;

Setting dirty by default for CONTINUE may make some sense (unlike young),
at least to keep the old behavior of the code where the pte was
unconditionally set.

But there's another thought a real CONTINUE user modifies the page cache
elsewhere so logically the PageDirty should be set elsewhere already
anyways, in most cases when the userapp is updating the page cache with
another mapping before installing pgtables for this mapping.  Then this
dirty is not required, it seems.

If we're going to export this to uABI, I'm wondering maybe it'll be nicer
to not apply either young or dirty bit for CONTINUE, because fundamentally
losing dirty bit doesn't sound risky, meanwhile the user app should have
the best knowledge of what to do (whether page was requested or it's just
during a pre-faulting in the background).

Axel may have some better thoughts.

>  
>  	if (mmget_not_zero(ctx->mm)) {
>  		ret = mcopy_continue(ctx->mm, uffdio_continue.range.start,
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index e6ac165ec044..261a3fa750d0 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -59,6 +59,7 @@ typedef unsigned int __bitwise uffd_flags_t;
>  
>  #define UFFD_FLAGS_WP			((__force uffd_flags_t)BIT(0))
>  #define UFFD_FLAGS_ACCESS_LIKELY	((__force uffd_flags_t)BIT(1))
> +#define UFFD_FLAGS_WRITE_LIKELY		((__force uffd_flags_t)BIT(2))
>  
>  extern int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>  				    struct vm_area_struct *dst_vma,
> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> index d9c8ce9ba777..6ad93a13282e 100644
> --- a/include/uapi/linux/userfaultfd.h
> +++ b/include/uapi/linux/userfaultfd.h
> @@ -267,12 +267,20 @@ struct uffdio_copy {
>  	 */
>  	__s64 copy;
>  	/*
> -	 * UFFDIO_COPY_MODE_ACCESS_LIKELY will set the mapped page as young.
> -	 * This can reduce the time that the first access to the page takes.
> -	 * Yet, if set opportunistically to memory that is not used, it might
> -	 * extend the time before the unused memory pages are reclaimed.
> +	 * UFFDIO_COPY_MODE_ACCESS_LIKELY indicates that the memory is likely to
> +	 * be accessed in the near future, in contrast to memory that is
> +	 * opportunistically copied and might not be accessed. The kernel will
> +	 * act accordingly, for instance by setting the access-bit in the PTE to
> +	 * reduce the access time to the page.
> +	 *
> +	 * UFFDIO_COPY_MODE_WRITE_LIKELY indicates that the memory is likely to
> +	 * be written to. The kernel will act accordingly, for instance by
> +	 * setting the dirty-bit in the PTE to reduce the write time to the
> +	 * page. This flag will be silently ignored if UFFDIO_COPY_MODE_WP is
> +	 * set.
>  	 */
> -#define UFFDIO_COPY_MODE_ACCESS_LIKELY		((__u64)1<<3)
> +#define UFFDIO_COPY_MODE_ACCESS_LIKELY		((__u64)1<<2)
> +#define UFFDIO_COPY_MODE_WRITE_LIKELY		((__u64)1<<3)
>  };
>  
>  struct uffdio_zeropage {
> @@ -297,9 +305,11 @@ struct uffdio_writeprotect {
>   * UFFDIO_WRITEPROTECT_MODE_DONTWAKE: set the flag to avoid waking up
>   * any wait thread after the operation succeeds.
>   *
> - * UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY: set the flag to mark the modified
> - * memory as young, which can reduce the time that the first access
> - * to the page takes.
> + * UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY: set the flag to indicate the memory
> + * is likely to be accessed in the near future.
> + *
> + * UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY: set the flag to indicate that the
> + * memory is likely to be written to in the near future.
>   *
>   * NOTE: Write protecting a region (WP=1) is unrelated to page faults,
>   * therefore DONTWAKE flag is meaningless with WP=1.  Removing write
> @@ -309,6 +319,7 @@ struct uffdio_writeprotect {
>  #define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
>  #define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
>  #define UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY	((__u64)1<<2)
> +#define UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY	((__u64)1<<3)
>  	__u64 mode;
>  };
>  
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 2beff8a4bf7c..46814fc7762f 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5962,6 +5962,9 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  		*pagep = NULL;
>  	}
>  
> +	/* The PTE is not marked as dirty unconditionally */
> +	SetPageDirty(page);
> +
>  	/*
>  	 * The memory barrier inside __SetPageUptodate makes sure that
>  	 * preceding stores to the page contents become visible before
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 89c775275bae..7488cd186c32 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2404,6 +2404,9 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
>  	VM_BUG_ON(PageSwapBacked(page));
>  	__SetPageLocked(page);
>  	__SetPageSwapBacked(page);
> +
> +	/* The PTE is not marked as dirty unconditionally */
> +	SetPageDirty(page);
>  	__SetPageUptodate(page);
>  
>  	ret = -EFAULT;
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 140c8d3e946e..3172158d8faa 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -70,7 +70,6 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>  	pgoff_t offset, max_off;
>  
>  	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> -	_dst_pte = pte_mkdirty(_dst_pte);
>  	if (page_in_cache && !vm_shared)
>  		writable = false;
>  
> @@ -85,13 +84,18 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>  
>  	if (writable)
>  		_dst_pte = pte_mkwrite(_dst_pte);
> -	else
> +	else {
>  		/*
>  		 * We need this to make sure write bit removed; as mk_pte()
>  		 * could return a pte with write bit set.
>  		 */
>  		_dst_pte = pte_wrprotect(_dst_pte);
>  
> +		/* Marking RO entries as dirty can mess with other code */
> +		if (uffd_flags & UFFD_FLAGS_WRITE_LIKELY)
> +			_dst_pte = pte_mkdirty(_dst_pte);

Hmm.. what about "writable=true" ones?

> +	}
> +
>  	if (uffd_flags & UFFD_FLAGS_ACCESS_LIKELY)
>  		_dst_pte = pte_mkyoung(_dst_pte);
>  
> @@ -180,6 +184,9 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
>  		*pagep = NULL;
>  	}
>  
> +	/* The PTE is not marked as dirty unconditionally */
> +	SetPageDirty(page);
> +
>  	/*
>  	 * The memory barrier inside __SetPageUptodate makes sure that
>  	 * preceding stores to the page contents become visible before
> -- 
> 2.25.1
> 
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 4/5] userfaultfd: zero access/write hints
  2022-06-19 23:34 ` [RFC PATCH v2 4/5] userfaultfd: zero access/write hints Nadav Amit
  2022-06-20 18:06   ` kernel test robot
@ 2022-06-21 17:04   ` Peter Xu
  2022-06-21 17:17     ` Nadav Amit
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Xu @ 2022-06-21 17:04 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, Nadav Amit, David Hildenbrand, Mike Kravetz,
	Hugh Dickins, Andrew Morton, Axel Rasmussen, Mike Rapoport

On Sun, Jun 19, 2022 at 04:34:48PM -0700, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> When userfaultfd provides a zeropage in response to ioctl, it provides a
> readonly alias to the zero page. If the page is later written (which is
> the likely scenario), page-fault occurs and the page-fault allocator
> allocates a page and rewires the page-tables.
> 
> This is an expensive flow for cases in which a page is likely be written
> to. Users can use the copy ioctl to initialize zero page (by copying
> zeros), but this is also wasteful.
> 
> Allow userfaultfd users to efficiently map initialized zero-pages that
> are writable. Introduce UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY, which, when
> provided would map a clear page instead of an alias to the zero page.
> 
> For consistency, introduce also UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY.
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  fs/userfaultfd.c                 | 14 +++++++++++--
>  include/uapi/linux/userfaultfd.h |  2 ++
>  mm/userfaultfd.c                 | 36 ++++++++++++++++++++++++++++++++
>  3 files changed, 50 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index a56983b594d5..ff073de78ea8 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1770,6 +1770,8 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
>  	struct uffdio_zeropage uffdio_zeropage;
>  	struct uffdio_zeropage __user *user_uffdio_zeropage;
>  	struct userfaultfd_wake_range range;
> +	bool mode_dontwake, mode_access_likely, mode_write_likely;
> +	uffd_flags_t uffd_flags;
>  
>  	user_uffdio_zeropage = (struct uffdio_zeropage __user *) arg;
>  
> @@ -1788,8 +1790,16 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
>  	if (ret)
>  		goto out;
>  	ret = -EINVAL;
> -	if (uffdio_zeropage.mode & ~UFFDIO_ZEROPAGE_MODE_DONTWAKE)
> -		goto out;
> +
> +	mode_dontwake = uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_DONTWAKE;
> +	mode_access_likely = uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY;
> +	mode_write_likely = uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY;
> +
> +	if (mode_dontwake)
> +		return -EINVAL;

Hmm.. Why?

Note that the above uffdio_zeropage.mode check was for invalid mode flags
only, and I think that should be kept, but still I don't see why we want to
fail UFFDIO_ZEROPAGE_MODE_DONTWAKE users.

> +
> +	uffd_flags = (mode_access_likely ? UFFD_FLAGS_ACCESS_LIKELY : 0) |
> +		     (mode_write_likely ? UFFD_FLAGS_WRITE_LIKELY : 0);
>  
>  	if (mmget_not_zero(ctx->mm)) {
>  		ret = mfill_zeropage(ctx->mm, uffdio_zeropage.range.start,
> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> index 6ad93a13282e..b586b7c1e265 100644
> --- a/include/uapi/linux/userfaultfd.h
> +++ b/include/uapi/linux/userfaultfd.h
> @@ -286,6 +286,8 @@ struct uffdio_copy {
>  struct uffdio_zeropage {
>  	struct uffdio_range range;
>  #define UFFDIO_ZEROPAGE_MODE_DONTWAKE		((__u64)1<<0)
> +#define UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY	((__u64)1<<2)
> +#define UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY	((__u64)1<<3)
>  	__u64 mode;
>  
>  	/*
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 3172158d8faa..5dfbb1e80369 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -249,6 +249,38 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm,
>  	return ret;
>  }
>  
> +static int mfill_clearpage_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> +			       struct vm_area_struct *dst_vma,
> +			       unsigned long dst_addr,
> +			       uffd_flags_t uffd_flags)
> +{
> +	struct page *page;
> +	int ret;
> +
> +	ret = -ENOMEM;
> +	page = alloc_zeroed_user_highpage_movable(dst_vma, dst_addr);
> +	if (!page)
> +		goto out;
> +
> +	/* The PTE is not marked as dirty unconditionally */
> +	SetPageDirty(page);
> +	__SetPageUptodate(page);
> +
> +	ret = -ENOMEM;

Nit: can drop this since ret will always be -ENOMEM here..

Thanks,

> +	if (mem_cgroup_charge(page_folio(page), dst_vma->vm_mm, GFP_KERNEL))
> +		goto out_release;
> +
> +	ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
> +				       page, true, uffd_flags);
> +	if (ret)
> +		goto out_release;
> +out:
> +	return ret;
> +out_release:
> +	put_page(page);
> +	goto out;
> +}
> +
>  /* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */
>  static int mcontinue_atomic_pte(struct mm_struct *dst_mm,
>  				pmd_t *dst_pmd,
> @@ -511,6 +543,10 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
>  			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
>  					       dst_addr, src_addr, page,
>  					       uffd_flags);
> +		else if (!(uffd_flags & UFFD_FLAGS_WP) &&
> +			 (uffd_flags & UFFD_FLAGS_WRITE_LIKELY))
> +			err = mfill_clearpage_pte(dst_mm, dst_pmd, dst_vma,
> +						  dst_addr, uffd_flags);
>  		else
>  			err = mfill_zeropage_pte(dst_mm, dst_pmd,
>  						 dst_vma, dst_addr, uffd_flags);
> -- 
> 2.25.1
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 3/5] userfaultfd: introduce write-likely mode for copy/wp operations
  2022-06-21 16:38   ` Peter Xu
@ 2022-06-21 17:14     ` Nadav Amit
  2022-06-21 18:10       ` Peter Xu
  0 siblings, 1 reply; 24+ messages in thread
From: Nadav Amit @ 2022-06-21 17:14 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport

On Jun 21, 2022, at 9:38 AM, Peter Xu <peterx@redhat.com> wrote:

> Hi, Nadav,
> 
> On Sun, Jun 19, 2022 at 04:34:47PM -0700, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> Commit 9ae0f87d009ca ("mm/shmem: unconditionally set pte dirty in
>> mfill_atomic_install_pte") has set PTEs as dirty as its title indicates.
>> However, setting read-only PTEs as dirty can have several undesired
>> implications.
>> 
>> First, setting read-only PTEs as dirty, can cause these PTEs to become
>> writable during mprotect() syscall. See in change_pte_range():
>> 
>> 	/* Avoid taking write faults for known dirty pages */
>> 	if (dirty_accountable && pte_dirty(ptent) &&
>> 			(pte_soft_dirty(ptent) ||
>> 			 !(vma->vm_flags & VM_SOFTDIRTY))) {
>> 		ptent = pte_mkwrite(ptent);
>> 	}
> 
> IMHO this is not really the direct reason to add the feature to "allow user
> to specify whether dirty bit be set for UFFDIO_COPY/... ioctl", because
> IIUC what's missing is the pte_uffd_wp() check that I should have added in
> the shmem+hugetlb uffd-wp series in change_pte_range() but I missed..
> 
> But since this is fixed by David's patch to optimize mprotect() altogether
> which checks pte_uffd_wp() (and afaict that's only needed after the
> shmem+hugetlb patchset, not before), so I think we're safe now with all the
> branches.
> 
> So IMHO we don't need to mention this as it's kind of misleading. It'll be
> welcomed if you want to recover the pte_dirty behavior in
> mfill_atomic_install_pte() but probably this is not the right patch for it?

Sorry, I will remove it from the commit log.

I am not sure whether there should be an additional patch to revert (or
partially revert) 9ae0f87d009ca and to be backported. What do you say?

>> Second, unmapping read-only dirty PTEs often prevents TLB flush batching.
>> See try_to_unmap_one():
>> 
>> 	/*
>> 	 * Page is dirty. Flush the TLB if a writable entry
>> 	 * potentially exists to avoid CPU writes after IO
>> 	 * starts and then write it out here.
>> 	 */
>> 	try_to_unmap_flush_dirty();
>> 
>> Similarly batching TLB flushed might be prevented in zap_pte_range():
>> 
>> 	if (!PageAnon(page)) {
>> 		if (pte_dirty(ptent)) {
>> 			force_flush = 1;
>> 			set_page_dirty(page);
>> 		}
>> 	...
> 
> I still keep the pure question here (which I asked in the private reply to
> you) on why we'd like only pte_dirty() but not pte_write() && pte_dirty()
> here. I'll rewrite what I have in the private email to you here again that
> I think should be the write thing to do here..
> 
> if (!PageAnon(page)) {
> if (pte_dirty(ptent)) {
> set_page_dirty(page);
> if (pte_write(ptent))
> force_flush = 1;
> }
> }

The “Second” part regards a perf issue, not a correctness issue. As I told
you offline, I think if makes sense to look both on pte_dirty() and
pte_write(), but I am afraid of other architectures than x86, which I know.
After our discussion, I looked at other architectures, and it does look
safe. So I will add an additional patch for that (with your suggested-by).

> I also mentioned the other example of doing mprotect(PROT_READ) upon a
> dirty pte can also create a pte with dirty=1 and write=0 which should be
> the same condition we have here with uffd, afaict. So it's at least not a
> solo problem for uffd, and I still think if we treat this tlb flush issue
> as a perf bug we should consider fixing up the tlb flush code instead
> because otherwise the "mprotect(PROT_READ) upon dirty pte" will also be
> able to hit it.
> 
> Meanwhile, I'll have the same comment that this is not helping any reviewer
> to understand why we need to grant user ability to conditionally set dirty
> bit from uABI, so I think it's better to drop this paragraph too for this
> patch alone.

Ok. Point taken.

> 
>> In general, setting a PTE as dirty seems for read-only entries might be
>> dangerous. It should be reminded the dirty-COW vulnerability mitigation
>> also relies on the dirty bit being set only after COW (although it does
>> not appear to apply to userfaultfd).
>> 
>> To summarize, setting the dirty bit for read-only PTEs is dangerous. But
>> even if we only consider writable pages, always setting the dirty bit or
>> always leaving it clear, does not seem as the best policy. Leaving the
>> bit clear introduces overhead on the first write-access to set the bit.
>> Setting the bit for pages the are eventually not written to can require
>> more TLB flushes.
> 
> And IMHO only this paragraph is the real and proper reasoning for this
> patch..

Will do.

> 
>> @@ -1902,10 +1908,10 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
>> 	 uffdio_continue.range.start) {
>> 		goto out;
>> 	}
>> -	if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE)
>> +	if (uffdio_continue.mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE))
>> 		goto out;
>> 
>> -	uffd_flags = UFFD_FLAGS_ACCESS_LIKELY;
>> +	uffd_flags = UFFD_FLAGS_ACCESS_LIKELY | UFFD_FLAGS_WRITE_LIKELY;
> 
> Setting dirty by default for CONTINUE may make some sense (unlike young),
> at least to keep the old behavior of the code where the pte was
> unconditionally set.
> 
> But there's another thought a real CONTINUE user modifies the page cache
> elsewhere so logically the PageDirty should be set elsewhere already
> anyways, in most cases when the userapp is updating the page cache with
> another mapping before installing pgtables for this mapping. Then this
> dirty is not required, it seems.
> 
> If we're going to export this to uABI, I'm wondering maybe it'll be nicer
> to not apply either young or dirty bit for CONTINUE, because fundamentally
> losing dirty bit doesn't sound risky, meanwhile the user app should have
> the best knowledge of what to do (whether page was requested or it's just
> during a pre-faulting in the background).
> 
> Axel may have some better thoughts.

I think that perhaps I did not communicate well enough the reason it makes
sense to set the dirty-bit.

Setting the access-bit and dirty-bit takes quite some time. So regardless of
whether you set the PageDirty, if you didn’t see access-bit and dirty-bit
and later you access the page - you are going to pay ~550 cycles for
nothing.

For reference: https://marc.info/?l=linux-kernel&m=146582237922378&w=2

If you want me to allow userspace to provide a hint, let me know.

>> @@ -85,13 +84,18 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>> 
>> 	if (writable)
>> 		_dst_pte = pte_mkwrite(_dst_pte);
>> -	else
>> +	else {
>> 		/*
>> 		 * We need this to make sure write bit removed; as mk_pte()
>> 		 * could return a pte with write bit set.
>> 		 */
>> 		_dst_pte = pte_wrprotect(_dst_pte);
>> 
>> +		/* Marking RO entries as dirty can mess with other code */
>> +		if (uffd_flags & UFFD_FLAGS_WRITE_LIKELY)
>> +			_dst_pte = pte_mkdirty(_dst_pte);
> 
> Hmm.. what about "writable=true" ones?

Oh boy, I reverted the condition… :(

Thanks for noticing. I’ll fix it.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 4/5] userfaultfd: zero access/write hints
  2022-06-21 17:04   ` Peter Xu
@ 2022-06-21 17:17     ` Nadav Amit
  2022-06-21 17:56       ` Peter Xu
  0 siblings, 1 reply; 24+ messages in thread
From: Nadav Amit @ 2022-06-21 17:17 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux MM, David Hildenbrand, Mike Kravetz, Hugh Dickins,
	Andrew Morton, Axel Rasmussen, Mike Rapoport

On Jun 21, 2022, at 10:04 AM, Peter Xu <peterx@redhat.com> wrote:

> ⚠ External Email
> 
> On Sun, Jun 19, 2022 at 04:34:48PM -0700, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> When userfaultfd provides a zeropage in response to ioctl, it provides a
>> readonly alias to the zero page. If the page is later written (which is
>> the likely scenario), page-fault occurs and the page-fault allocator
>> allocates a page and rewires the page-tables.
>> 
>> This is an expensive flow for cases in which a page is likely be written
>> to. Users can use the copy ioctl to initialize zero page (by copying
>> zeros), but this is also wasteful.
>> 
>> Allow userfaultfd users to efficiently map initialized zero-pages that
>> are writable. Introduce UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY, which, when
>> provided would map a clear page instead of an alias to the zero page.
>> 
>> For consistency, introduce also UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY.
>> 
>> Suggested-by: David Hildenbrand <david@redhat.com>
>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Axel Rasmussen <axelrasmussen@google.com>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: Mike Rapoport <rppt@linux.ibm.com>
>> Signed-off-by: Nadav Amit <namit@vmware.com>
>> ---
>> fs/userfaultfd.c                 | 14 +++++++++++--
>> include/uapi/linux/userfaultfd.h |  2 ++
>> mm/userfaultfd.c                 | 36 ++++++++++++++++++++++++++++++++
>> 3 files changed, 50 insertions(+), 2 deletions(-)
>> 
>> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
>> index a56983b594d5..ff073de78ea8 100644
>> --- a/fs/userfaultfd.c
>> +++ b/fs/userfaultfd.c
>> @@ -1770,6 +1770,8 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
>>      struct uffdio_zeropage uffdio_zeropage;
>>      struct uffdio_zeropage __user *user_uffdio_zeropage;
>>      struct userfaultfd_wake_range range;
>> +     bool mode_dontwake, mode_access_likely, mode_write_likely;
>> +     uffd_flags_t uffd_flags;
>> 
>>      user_uffdio_zeropage = (struct uffdio_zeropage __user *) arg;
>> 
>> @@ -1788,8 +1790,16 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
>>      if (ret)
>>              goto out;
>>      ret = -EINVAL;
>> -     if (uffdio_zeropage.mode & ~UFFDIO_ZEROPAGE_MODE_DONTWAKE)
>> -             goto out;
>> +
>> +     mode_dontwake = uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_DONTWAKE;
>> +     mode_access_likely = uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY;
>> +     mode_write_likely = uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY;
>> +
>> +     if (mode_dontwake)
>> +             return -EINVAL;
> 
> Hmm.. Why?
> 
> Note that the above uffdio_zeropage.mode check was for invalid mode flags
> only, and I think that should be kept, but still I don't see why we want to
> fail UFFDIO_ZEROPAGE_MODE_DONTWAKE users.
> 
>> +
>> +     uffd_flags = (mode_access_likely ? UFFD_FLAGS_ACCESS_LIKELY : 0) |
>> +                  (mode_write_likely ? UFFD_FLAGS_WRITE_LIKELY : 0);
>> 
>>      if (mmget_not_zero(ctx->mm)) {
>>              ret = mfill_zeropage(ctx->mm, uffdio_zeropage.range.start,
>> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
>> index 6ad93a13282e..b586b7c1e265 100644
>> --- a/include/uapi/linux/userfaultfd.h
>> +++ b/include/uapi/linux/userfaultfd.h
>> @@ -286,6 +286,8 @@ struct uffdio_copy {
>> struct uffdio_zeropage {
>>      struct uffdio_range range;
>> #define UFFDIO_ZEROPAGE_MODE_DONTWAKE                ((__u64)1<<0)
>> +#define UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY   ((__u64)1<<2)
>> +#define UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY    ((__u64)1<<3)
>>      __u64 mode;
>> 
>>      /*
>> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
>> index 3172158d8faa..5dfbb1e80369 100644
>> --- a/mm/userfaultfd.c
>> +++ b/mm/userfaultfd.c
>> @@ -249,6 +249,38 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm,
>>      return ret;
>> }
>> 
>> +static int mfill_clearpage_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>> +                            struct vm_area_struct *dst_vma,
>> +                            unsigned long dst_addr,
>> +                            uffd_flags_t uffd_flags)
>> +{
>> +     struct page *page;
>> +     int ret;
>> +
>> +     ret = -ENOMEM;
>> +     page = alloc_zeroed_user_highpage_movable(dst_vma, dst_addr);
>> +     if (!page)
>> +             goto out;
>> +
>> +     /* The PTE is not marked as dirty unconditionally */
>> +     SetPageDirty(page);
>> +     __SetPageUptodate(page);
>> +
>> +     ret = -ENOMEM;
> 
> Nit: can drop this since ret will always be -ENOMEM here..

I noticed. Just thought it is clearer this way, and more robust against
future changes.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 2/5] userfaultfd: introduce access-likely mode for copy/wp operations
  2022-06-21  8:48   ` David Hildenbrand
  2022-06-21 15:42     ` Peter Xu
@ 2022-06-21 17:27     ` Nadav Amit
  1 sibling, 0 replies; 24+ messages in thread
From: Nadav Amit @ 2022-06-21 17:27 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, Peter Xu, Mike Rapoport

On Jun 21, 2022, at 1:48 AM, David Hildenbrand <david@redhat.com> wrote:

> ⚠ External Email
> 
> On 20.06.22 01:34, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> Using a PTE on x86 with cleared access-bit (aka young-bit)
>> takes ~600 cycles more than when the access bit is set. At the same
>> time, setting the access-bit for memory that is not used (e.g.,
>> prefetched) can introduce greater overheads, as the prefetched memory is
>> reclaimed later than it should be.
>> 
>> Userfaultfd currently does not set the access-bit (excluding the
>> huge-pages case). Arguably, it is best to let the user control whether
>> the access bit should be set or not. The expected use is to request
>> userfaultfd to set the access-bit when the copy/wp operation is done to
>> resolve a page-fault, and not to set the access-bit when the memory is
>> prefetched.
>> 
>> Introduce UFFDIO_COPY_MODE_ACCESS_LIKELY and
>> UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY to enable userspace to request
>> the young bit to be set. Set for UFFDIO_CONTINUE and UFFDIO_ZEROPAGE the
>> bit unconditionally since the former is only used to resolve page-faults
>> and the latter would not benefit from not setting the access-bit.
>> 
>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Axel Rasmussen <axelrasmussen@google.com>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Mike Rapoport <rppt@linux.ibm.com>
>> Signed-off-by: Nadav Amit <namit@vmware.com>
>> ---
>> fs/userfaultfd.c                 | 23 ++++++++++++++++-------
>> include/linux/userfaultfd_k.h    |  1 +
>> include/uapi/linux/userfaultfd.h | 20 +++++++++++++++++++-
>> mm/userfaultfd.c                 | 18 ++++++++++++++++--
>> 4 files changed, 52 insertions(+), 10 deletions(-)
>> 
>> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
>> index 5daafa54eb3f..35a8c4347c54 100644
>> --- a/fs/userfaultfd.c
>> +++ b/fs/userfaultfd.c
>> @@ -1700,7 +1700,7 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
>>      struct uffdio_copy uffdio_copy;
>>      struct uffdio_copy __user *user_uffdio_copy;
>>      struct userfaultfd_wake_range range;
>> -     bool mode_wp;
>> +     bool mode_wp, mode_access_likely;
>>      uffd_flags_t uffd_flags;
>> 
>>      user_uffdio_copy = (struct uffdio_copy __user *) arg;
>> @@ -1726,12 +1726,15 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
>>      ret = -EINVAL;
>>      if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
>>              goto out;
>> -     if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP))
>> +     if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP|
>> +                              UFFDIO_COPY_MODE_ACCESS_LIKELY))
>>              goto out;
>> 
>>      mode_wp = uffdio_copy.mode & UFFDIO_COPY_MODE_WP;
>> +     mode_access_likely = uffdio_copy.mode & UFFDIO_COPY_MODE_ACCESS_LIKELY;
> 
> I *relly* prefer just
> 
> if (uffdio_copy.mode & UFFDIO_COPY_MODE_ACCESS_LIKELY)
>        uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY
> [...]
> 
>> -     uffd_flags = (mode_wp ? UFFD_FLAGS_WP : 0);
>> +     uffd_flags = (mode_wp ? UFFD_FLAGS_WP : 0) |
>> +                  (mode_access_likely ? UFFD_FLAGS_ACCESS_LIKELY : 0);
> 
> Dito.
> 
>>      if (mmget_not_zero(ctx->mm)) {
>>              ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
>> @@ -1871,6 +1877,7 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
>>      struct uffdio_continue uffdio_continue;
>>      struct uffdio_continue __user *user_uffdio_continue;
>>      struct userfaultfd_wake_range range;
>> +     uffd_flags_t uffd_flags;
>> 
>>      user_uffdio_continue = (struct uffdio_continue __user *)arg;
>> 
>> @@ -1898,10 +1905,12 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
>>      if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE)
>>              goto out;
>> 
>> +     uffd_flags = UFFD_FLAGS_ACCESS_LIKELY;
> 
> Can we add a comment why that makes sense? I think I know why -- someone
> is stuck waiting for that continue to happen :)
> 
>> +
>>      if (mmget_not_zero(ctx->mm)) {
>>              ret = mcopy_continue(ctx->mm, uffdio_continue.range.start,
>>                                   uffdio_continue.range.len,
>> -                                  &ctx->mmap_changing, 0);
>> +                                  &ctx->mmap_changing, uffd_flags);
>>              mmput(ctx->mm);
>>      } else {
>>              return -ESRCH;
>> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
>> index 6331148023c1..e6ac165ec044 100644
>> --- a/include/linux/userfaultfd_k.h
>> +++ b/include/linux/userfaultfd_k.h
>> @@ -58,6 +58,7 @@ enum mcopy_atomic_mode {
>> typedef unsigned int __bitwise uffd_flags_t;
>> 
>> #define UFFD_FLAGS_WP                        ((__force uffd_flags_t)BIT(0))
>> +#define UFFD_FLAGS_ACCESS_LIKELY     ((__force uffd_flags_t)BIT(1))
>> 
>> extern int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>>                                  struct vm_area_struct *dst_vma,
>> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
>> index 005e5e306266..d9c8ce9ba777 100644
>> --- a/include/uapi/linux/userfaultfd.h
>> +++ b/include/uapi/linux/userfaultfd.h
>> @@ -38,7 +38,8 @@
>>                         UFFD_FEATURE_MINOR_HUGETLBFS |       \
>>                         UFFD_FEATURE_MINOR_SHMEM |           \
>>                         UFFD_FEATURE_EXACT_ADDRESS |         \
>> -                        UFFD_FEATURE_WP_HUGETLBFS_SHMEM)
>> +                        UFFD_FEATURE_WP_HUGETLBFS_SHMEM |    \
>> +                        UFFD_FEATURE_ACCESS_HINTS)
>> #define UFFD_API_IOCTLS                              \
>>      ((__u64)1 << _UFFDIO_REGISTER |         \
>>       (__u64)1 << _UFFDIO_UNREGISTER |       \
>> @@ -203,6 +204,10 @@ struct uffdio_api {
>>       *
>>       * UFFD_FEATURE_WP_HUGETLBFS_SHMEM indicates that userfaultfd
>>       * write-protection mode is supported on both shmem and hugetlbfs.
>> +      *
>> +      * UFFD_FEATURE_ACCESS_HINTS indicates that the copy supports
>> +      * UFFDIO_COPY_MODE_ACCESS_LIKELY supports
>> +      * UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY.
> 
> I think that sentence doesn't make sense.

English… :)

How about

         * UFFD_FEATURE_ACCESS_HINTS indicates that the ioctl operations
         * supports the UFFDIO_*_MODE_[ACCESS|WRITE]_LIKELY and
         * UFFDIO_*_MODE_[ACCESS|WRITE]_LIKELY hints.

But that would mean that for consistency, I would need to provide
zero/continue hints (which might be disregarded by the kernel)?

>>       */
>> #define UFFD_FEATURE_PAGEFAULT_FLAG_WP               (1<<0)
>> #define UFFD_FEATURE_EVENT_FORK                      (1<<1)
>> @@ -217,6 +222,7 @@ struct uffdio_api {
>> #define UFFD_FEATURE_MINOR_SHMEM             (1<<10)
>> #define UFFD_FEATURE_EXACT_ADDRESS           (1<<11)
>> #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM              (1<<12)
>> +#define UFFD_FEATURE_ACCESS_HINTS            (1<<13)
>>      __u64 features;
>> 
>>      __u64 ioctls;
>> @@ -260,6 +266,13 @@ struct uffdio_copy {
>>       * copy_from_user will not read the last 8 bytes.
>>       */
>>      __s64 copy;
>> +     /*
>> +      * UFFDIO_COPY_MODE_ACCESS_LIKELY will set the mapped page as young.
> 
> Setting the page young is an implementation detail. Can you phrase it
> more generically what the effect of that hint might be?

Err. I forgot to fix it before sending. How about:

         * UFFDIO_COPY_MODE_ACCESS_LIKELY provides a hint to the kernel
         * that the page is likely to be access in the near future. Providing
         * the hint properly can improve performance.


?
> 
>> @@ -691,6 +699,9 @@ ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
>>                     unsigned long len, atomic_t *mmap_changing,
>>                     uffd_flags_t uffd_flags)
>> {
>> +     /* There is no cost for setting the access bit of a zeropage */
>> +     uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
>> +
>>      return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_ZEROPAGE,
>>                            mmap_changing, 0);
>> }
>> @@ -699,6 +710,9 @@ ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long start,
>>                     unsigned long len, atomic_t *mmap_changing,
>>                     uffd_flags_t uffd_flags)
>> {
>> +     /* The page is likely to be accessed */
>> +     uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
> 
> Shoouldn't that be set by the caller already?

I thought that it belongs conceptually to mm/userfaultfd and not
fs/userfaultfd.

I will wait for Axel input as to how to handle the CONTINUE case and fix it
accordingly.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 1/5] userfaultfd: introduce uffd_flags
  2022-06-21 15:29   ` Peter Xu
@ 2022-06-21 17:41     ` Nadav Amit
  0 siblings, 0 replies; 24+ messages in thread
From: Nadav Amit @ 2022-06-21 17:41 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport

On Jun 21, 2022, at 8:29 AM, Peter Xu <peterx@redhat.com> wrote:

> ⚠ External Email
> 
> On Sun, Jun 19, 2022 at 04:34:45PM -0700, Nadav Amit wrote:
>> @@ -683,30 +681,33 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
>> 
>> ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
>>                   unsigned long src_start, unsigned long len,
>> -                  atomic_t *mmap_changing, __u64 mode)
>> +                  atomic_t *mmap_changing, uffd_flags_t uffd_flags)
>> {
>>      return __mcopy_atomic(dst_mm, dst_start, src_start, len,
>> -                           MCOPY_ATOMIC_NORMAL, mmap_changing, mode);
>> +                           MCOPY_ATOMIC_NORMAL, mmap_changing, uffd_flags);
>> }
>> 
>> ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
>> -                    unsigned long len, atomic_t *mmap_changing)
>> +                    unsigned long len, atomic_t *mmap_changing,
>> +                    uffd_flags_t uffd_flags)
> 
> Should this be fed into the last parameter of __mcopy_atomic() below?
> 
>> {
>>      return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_ZEROPAGE,
>>                            mmap_changing, 0);
>> }
>> 
>> ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long start,
>> -                    unsigned long len, atomic_t *mmap_changing)
>> +                    unsigned long len, atomic_t *mmap_changing,
>> +                    uffd_flags_t uffd_flags)
> 
> Same question here..

Yes it should. I think I will add flags for CONTINUE/ZERO ACCESS/WRITE
just for consistency, even if we eventually decide to disregard them.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 4/5] userfaultfd: zero access/write hints
  2022-06-21 17:17     ` Nadav Amit
@ 2022-06-21 17:56       ` Peter Xu
  2022-06-21 17:58         ` Nadav Amit
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Xu @ 2022-06-21 17:56 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux MM, David Hildenbrand, Mike Kravetz, Hugh Dickins,
	Andrew Morton, Axel Rasmussen, Mike Rapoport

On Tue, Jun 21, 2022 at 05:17:05PM +0000, Nadav Amit wrote:
> On Jun 21, 2022, at 10:04 AM, Peter Xu <peterx@redhat.com> wrote:
> 
> > ⚠ External Email
> > 
> > On Sun, Jun 19, 2022 at 04:34:48PM -0700, Nadav Amit wrote:
> >> From: Nadav Amit <namit@vmware.com>
> >> 
> >> When userfaultfd provides a zeropage in response to ioctl, it provides a
> >> readonly alias to the zero page. If the page is later written (which is
> >> the likely scenario), page-fault occurs and the page-fault allocator
> >> allocates a page and rewires the page-tables.
> >> 
> >> This is an expensive flow for cases in which a page is likely be written
> >> to. Users can use the copy ioctl to initialize zero page (by copying
> >> zeros), but this is also wasteful.
> >> 
> >> Allow userfaultfd users to efficiently map initialized zero-pages that
> >> are writable. Introduce UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY, which, when
> >> provided would map a clear page instead of an alias to the zero page.
> >> 
> >> For consistency, introduce also UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY.
> >> 
> >> Suggested-by: David Hildenbrand <david@redhat.com>
> >> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> >> Cc: Hugh Dickins <hughd@google.com>
> >> Cc: Andrew Morton <akpm@linux-foundation.org>
> >> Cc: Axel Rasmussen <axelrasmussen@google.com>
> >> Cc: Peter Xu <peterx@redhat.com>
> >> Cc: Mike Rapoport <rppt@linux.ibm.com>
> >> Signed-off-by: Nadav Amit <namit@vmware.com>
> >> ---
> >> fs/userfaultfd.c                 | 14 +++++++++++--
> >> include/uapi/linux/userfaultfd.h |  2 ++
> >> mm/userfaultfd.c                 | 36 ++++++++++++++++++++++++++++++++
> >> 3 files changed, 50 insertions(+), 2 deletions(-)
> >> 
> >> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> >> index a56983b594d5..ff073de78ea8 100644
> >> --- a/fs/userfaultfd.c
> >> +++ b/fs/userfaultfd.c
> >> @@ -1770,6 +1770,8 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
> >>      struct uffdio_zeropage uffdio_zeropage;
> >>      struct uffdio_zeropage __user *user_uffdio_zeropage;
> >>      struct userfaultfd_wake_range range;
> >> +     bool mode_dontwake, mode_access_likely, mode_write_likely;
> >> +     uffd_flags_t uffd_flags;
> >> 
> >>      user_uffdio_zeropage = (struct uffdio_zeropage __user *) arg;
> >> 
> >> @@ -1788,8 +1790,16 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
> >>      if (ret)
> >>              goto out;
> >>      ret = -EINVAL;
> >> -     if (uffdio_zeropage.mode & ~UFFDIO_ZEROPAGE_MODE_DONTWAKE)
> >> -             goto out;
> >> +
> >> +     mode_dontwake = uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_DONTWAKE;
> >> +     mode_access_likely = uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY;
> >> +     mode_write_likely = uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY;
> >> +
> >> +     if (mode_dontwake)
> >> +             return -EINVAL;
> > 
> > Hmm.. Why?
> > 
> > Note that the above uffdio_zeropage.mode check was for invalid mode flags
> > only, and I think that should be kept, but still I don't see why we want to
> > fail UFFDIO_ZEROPAGE_MODE_DONTWAKE users.

[1]

> > 
> >> +
> >> +     uffd_flags = (mode_access_likely ? UFFD_FLAGS_ACCESS_LIKELY : 0) |
> >> +                  (mode_write_likely ? UFFD_FLAGS_WRITE_LIKELY : 0);
> >> 
> >>      if (mmget_not_zero(ctx->mm)) {
> >>              ret = mfill_zeropage(ctx->mm, uffdio_zeropage.range.start,
> >> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> >> index 6ad93a13282e..b586b7c1e265 100644
> >> --- a/include/uapi/linux/userfaultfd.h
> >> +++ b/include/uapi/linux/userfaultfd.h
> >> @@ -286,6 +286,8 @@ struct uffdio_copy {
> >> struct uffdio_zeropage {
> >>      struct uffdio_range range;
> >> #define UFFDIO_ZEROPAGE_MODE_DONTWAKE                ((__u64)1<<0)
> >> +#define UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY   ((__u64)1<<2)
> >> +#define UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY    ((__u64)1<<3)
> >>      __u64 mode;
> >> 
> >>      /*
> >> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> >> index 3172158d8faa..5dfbb1e80369 100644
> >> --- a/mm/userfaultfd.c
> >> +++ b/mm/userfaultfd.c
> >> @@ -249,6 +249,38 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm,
> >>      return ret;
> >> }
> >> 
> >> +static int mfill_clearpage_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> >> +                            struct vm_area_struct *dst_vma,
> >> +                            unsigned long dst_addr,
> >> +                            uffd_flags_t uffd_flags)
> >> +{
> >> +     struct page *page;
> >> +     int ret;
> >> +
> >> +     ret = -ENOMEM;
> >> +     page = alloc_zeroed_user_highpage_movable(dst_vma, dst_addr);
> >> +     if (!page)
> >> +             goto out;
> >> +
> >> +     /* The PTE is not marked as dirty unconditionally */
> >> +     SetPageDirty(page);
> >> +     __SetPageUptodate(page);
> >> +
> >> +     ret = -ENOMEM;
> > 
> > Nit: can drop this since ret will always be -ENOMEM here..
> 
> I noticed. Just thought it is clearer this way, and more robust against
> future changes.

I'd rather leave that for future, but if you really prefer that no problem
on my side too.

Please just still check [1] above and that's the major real comment, just
to make sure it's not overlooked..

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 4/5] userfaultfd: zero access/write hints
  2022-06-21 17:56       ` Peter Xu
@ 2022-06-21 17:58         ` Nadav Amit
  0 siblings, 0 replies; 24+ messages in thread
From: Nadav Amit @ 2022-06-21 17:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux MM, David Hildenbrand, Mike Kravetz, Hugh Dickins,
	Andrew Morton, Axel Rasmussen, Mike Rapoport

On Jun 21, 2022, at 10:56 AM, Peter Xu <peterx@redhat.com> wrote:

> ⚠ External Email
>>> Nit: can drop this since ret will always be -ENOMEM here..
>> 
>> I noticed. Just thought it is clearer this way, and more robust against
>> future changes.
> 
> I'd rather leave that for future, but if you really prefer that no problem
> on my side too.
> 
> Please just still check [1] above and that's the major real comment, just
> to make sure it's not overlooked..

Yes. I noticed it just after sending my email, and didn’t want to spam.
Fixed.

Thanks for noticing it!!

Regards,
Nadav

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 3/5] userfaultfd: introduce write-likely mode for copy/wp operations
  2022-06-21 17:14     ` Nadav Amit
@ 2022-06-21 18:10       ` Peter Xu
  2022-06-21 18:30         ` Nadav Amit
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Xu @ 2022-06-21 18:10 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport

On Tue, Jun 21, 2022 at 10:14:00AM -0700, Nadav Amit wrote:
> On Jun 21, 2022, at 9:38 AM, Peter Xu <peterx@redhat.com> wrote:
> 
> > Hi, Nadav,
> > 
> > On Sun, Jun 19, 2022 at 04:34:47PM -0700, Nadav Amit wrote:
> >> From: Nadav Amit <namit@vmware.com>
> >> 
> >> Commit 9ae0f87d009ca ("mm/shmem: unconditionally set pte dirty in
> >> mfill_atomic_install_pte") has set PTEs as dirty as its title indicates.
> >> However, setting read-only PTEs as dirty can have several undesired
> >> implications.
> >> 
> >> First, setting read-only PTEs as dirty, can cause these PTEs to become
> >> writable during mprotect() syscall. See in change_pte_range():
> >> 
> >> 	/* Avoid taking write faults for known dirty pages */
> >> 	if (dirty_accountable && pte_dirty(ptent) &&
> >> 			(pte_soft_dirty(ptent) ||
> >> 			 !(vma->vm_flags & VM_SOFTDIRTY))) {
> >> 		ptent = pte_mkwrite(ptent);
> >> 	}
> > 
> > IMHO this is not really the direct reason to add the feature to "allow user
> > to specify whether dirty bit be set for UFFDIO_COPY/... ioctl", because
> > IIUC what's missing is the pte_uffd_wp() check that I should have added in
> > the shmem+hugetlb uffd-wp series in change_pte_range() but I missed..
> > 
> > But since this is fixed by David's patch to optimize mprotect() altogether
> > which checks pte_uffd_wp() (and afaict that's only needed after the
> > shmem+hugetlb patchset, not before), so I think we're safe now with all the
> > branches.
> > 
> > So IMHO we don't need to mention this as it's kind of misleading. It'll be
> > welcomed if you want to recover the pte_dirty behavior in
> > mfill_atomic_install_pte() but probably this is not the right patch for it?
> 
> Sorry, I will remove it from the commit log.
> 
> I am not sure whether there should be an additional patch to revert (or
> partially revert) 9ae0f87d009ca and to be backported. What do you say?

I think we discussed it offlist, I'd not bother but I don't have a strong
opinion.  Please feel free to go with your preference.

Just to mention that it might not be a clean revert since at that time we
don't have continue and uffd-wp on files, so we may need to add the
complete check back if we're going to make it.  Please also do proper swap
tests for both anon and shmem with things like memcg, and when you posted
the patches I can do some tests too.

> 
> >> Second, unmapping read-only dirty PTEs often prevents TLB flush batching.
> >> See try_to_unmap_one():
> >> 
> >> 	/*
> >> 	 * Page is dirty. Flush the TLB if a writable entry
> >> 	 * potentially exists to avoid CPU writes after IO
> >> 	 * starts and then write it out here.
> >> 	 */
> >> 	try_to_unmap_flush_dirty();
> >> 
> >> Similarly batching TLB flushed might be prevented in zap_pte_range():
> >> 
> >> 	if (!PageAnon(page)) {
> >> 		if (pte_dirty(ptent)) {
> >> 			force_flush = 1;
> >> 			set_page_dirty(page);
> >> 		}
> >> 	...
> > 
> > I still keep the pure question here (which I asked in the private reply to
> > you) on why we'd like only pte_dirty() but not pte_write() && pte_dirty()
> > here. I'll rewrite what I have in the private email to you here again that
> > I think should be the write thing to do here..
> > 
> > if (!PageAnon(page)) {
> > if (pte_dirty(ptent)) {
> > set_page_dirty(page);
> > if (pte_write(ptent))
> > force_flush = 1;
> > }
> > }
> 
> The “Second” part regards a perf issue, not a correctness issue. As I told
> you offline, I think if makes sense to look both on pte_dirty() and
> pte_write(), but I am afraid of other architectures than x86, which I know.

I don't really see why this logic is arch-specific.  I meant, as long as
pte_write() returns a value that means "this page is writable", I still
think it's making sense.

> After our discussion, I looked at other architectures, and it does look
> safe. So I will add an additional patch for that (with your suggested-by).

Only if you agree with what I thought, thanks.  And that could be a
separate patch for sure out of this one even if to come.

> 
> > I also mentioned the other example of doing mprotect(PROT_READ) upon a
> > dirty pte can also create a pte with dirty=1 and write=0 which should be
> > the same condition we have here with uffd, afaict. So it's at least not a
> > solo problem for uffd, and I still think if we treat this tlb flush issue
> > as a perf bug we should consider fixing up the tlb flush code instead
> > because otherwise the "mprotect(PROT_READ) upon dirty pte" will also be
> > able to hit it.
> > 
> > Meanwhile, I'll have the same comment that this is not helping any reviewer
> > to understand why we need to grant user ability to conditionally set dirty
> > bit from uABI, so I think it's better to drop this paragraph too for this
> > patch alone.
> 
> Ok. Point taken.
> 
> > 
> >> In general, setting a PTE as dirty seems for read-only entries might be
> >> dangerous. It should be reminded the dirty-COW vulnerability mitigation
> >> also relies on the dirty bit being set only after COW (although it does
> >> not appear to apply to userfaultfd).
> >> 
> >> To summarize, setting the dirty bit for read-only PTEs is dangerous. But
> >> even if we only consider writable pages, always setting the dirty bit or
> >> always leaving it clear, does not seem as the best policy. Leaving the
> >> bit clear introduces overhead on the first write-access to set the bit.
> >> Setting the bit for pages the are eventually not written to can require
> >> more TLB flushes.
> > 
> > And IMHO only this paragraph is the real and proper reasoning for this
> > patch..
> 
> Will do.
> 
> > 
> >> @@ -1902,10 +1908,10 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
> >> 	 uffdio_continue.range.start) {
> >> 		goto out;
> >> 	}
> >> -	if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE)
> >> +	if (uffdio_continue.mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE))
> >> 		goto out;
> >> 
> >> -	uffd_flags = UFFD_FLAGS_ACCESS_LIKELY;
> >> +	uffd_flags = UFFD_FLAGS_ACCESS_LIKELY | UFFD_FLAGS_WRITE_LIKELY;
> > 
> > Setting dirty by default for CONTINUE may make some sense (unlike young),
> > at least to keep the old behavior of the code where the pte was
> > unconditionally set.
> > 
> > But there's another thought a real CONTINUE user modifies the page cache
> > elsewhere so logically the PageDirty should be set elsewhere already
> > anyways, in most cases when the userapp is updating the page cache with
> > another mapping before installing pgtables for this mapping. Then this
> > dirty is not required, it seems.
> > 
> > If we're going to export this to uABI, I'm wondering maybe it'll be nicer
> > to not apply either young or dirty bit for CONTINUE, because fundamentally
> > losing dirty bit doesn't sound risky, meanwhile the user app should have
> > the best knowledge of what to do (whether page was requested or it's just
> > during a pre-faulting in the background).
> > 
> > Axel may have some better thoughts.
> 
> I think that perhaps I did not communicate well enough the reason it makes
> sense to set the dirty-bit.
> 
> Setting the access-bit and dirty-bit takes quite some time. So regardless of
> whether you set the PageDirty, if you didn’t see access-bit and dirty-bit
> and later you access the page - you are going to pay ~550 cycles for
> nothing.
> 
> For reference: https://marc.info/?l=linux-kernel&m=146582237922378&w=2
> 
> If you want me to allow userspace to provide a hint, let me know.

You did communicate well, so probably it's me that didn't. :)

Yes I think it'll be nicer to allow userspace provide the same hint for
UFFDIO_CONTINUE, the reason as I explained in the other thread on the young
bit (or say ACCESS_LIKELY), that UFFDIO_CONTINUE can (at least in my
current qemu impl [1]) proactively be used to install pgtables even if
they're code pages and they may not be accessed in the near future.  So
IMHO we should treat UFFDIO_CONTINUE the same as UFFDIO_COPY, IMHO, from
this point of view.

There's just still some differences on young/dirty here so I'm not sure
whether the idea applies to both, but I think at least it applies to young
bit (ACCESS_LIKELY).

[1] https://github.com/xzpeter/qemu/commit/b7758ad55af42b5364796362e96f4a06b51d9582

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 3/5] userfaultfd: introduce write-likely mode for copy/wp operations
  2022-06-21 18:10       ` Peter Xu
@ 2022-06-21 18:30         ` Nadav Amit
  2022-06-21 18:43           ` Peter Xu
  0 siblings, 1 reply; 24+ messages in thread
From: Nadav Amit @ 2022-06-21 18:30 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport

On Jun 21, 2022, at 11:10 AM, Peter Xu <peterx@redhat.com> wrote:

> On Tue, Jun 21, 2022 at 10:14:00AM -0700, Nadav Amit wrote:
>> On Jun 21, 2022, at 9:38 AM, Peter Xu <peterx@redhat.com> wrote:
>> 
>>> Hi, Nadav,
>>> 
>>> On Sun, Jun 19, 2022 at 04:34:47PM -0700, Nadav Amit wrote:
>>>> From: Nadav Amit <namit@vmware.com>
>>>> 
>>>> Commit 9ae0f87d009ca ("mm/shmem: unconditionally set pte dirty in
>>>> mfill_atomic_install_pte") has set PTEs as dirty as its title indicates.
>>>> However, setting read-only PTEs as dirty can have several undesired
>>>> implications.
>>>> 
>>>> First, setting read-only PTEs as dirty, can cause these PTEs to become
>>>> writable during mprotect() syscall. See in change_pte_range():
>>>> 
>>>> 	/* Avoid taking write faults for known dirty pages */
>>>> 	if (dirty_accountable && pte_dirty(ptent) &&
>>>> 			(pte_soft_dirty(ptent) ||
>>>> 			 !(vma->vm_flags & VM_SOFTDIRTY))) {
>>>> 		ptent = pte_mkwrite(ptent);
>>>> 	}
>>> 
>>> IMHO this is not really the direct reason to add the feature to "allow user
>>> to specify whether dirty bit be set for UFFDIO_COPY/... ioctl", because
>>> IIUC what's missing is the pte_uffd_wp() check that I should have added in
>>> the shmem+hugetlb uffd-wp series in change_pte_range() but I missed..
>>> 
>>> But since this is fixed by David's patch to optimize mprotect() altogether
>>> which checks pte_uffd_wp() (and afaict that's only needed after the
>>> shmem+hugetlb patchset, not before), so I think we're safe now with all the
>>> branches.
>>> 
>>> So IMHO we don't need to mention this as it's kind of misleading. It'll be
>>> welcomed if you want to recover the pte_dirty behavior in
>>> mfill_atomic_install_pte() but probably this is not the right patch for it?
>> 
>> Sorry, I will remove it from the commit log.
>> 
>> I am not sure whether there should be an additional patch to revert (or
>> partially revert) 9ae0f87d009ca and to be backported. What do you say?
> 
> I think we discussed it offlist, I'd not bother but I don't have a strong
> opinion. Please feel free to go with your preference.
> 
> Just to mention that it might not be a clean revert since at that time we
> don't have continue and uffd-wp on files, so we may need to add the
> complete check back if we're going to make it. Please also do proper swap
> tests for both anon and shmem with things like memcg, and when you posted
> the patches I can do some tests too.

That’s what I was afraid of. It moves it down in my priority list...

>>>> Second, unmapping read-only dirty PTEs often prevents TLB flush batching.
>>>> See try_to_unmap_one():
>>>> 
>>>> 	/*
>>>> 	 * Page is dirty. Flush the TLB if a writable entry
>>>> 	 * potentially exists to avoid CPU writes after IO
>>>> 	 * starts and then write it out here.
>>>> 	 */
>>>> 	try_to_unmap_flush_dirty();
>>>> 
>>>> Similarly batching TLB flushed might be prevented in zap_pte_range():
>>>> 
>>>> 	if (!PageAnon(page)) {
>>>> 		if (pte_dirty(ptent)) {
>>>> 			force_flush = 1;
>>>> 			set_page_dirty(page);
>>>> 		}
>>>> 	...
>>> 
>>> I still keep the pure question here (which I asked in the private reply to
>>> you) on why we'd like only pte_dirty() but not pte_write() && pte_dirty()
>>> here. I'll rewrite what I have in the private email to you here again that
>>> I think should be the write thing to do here..
>>> 
>>> if (!PageAnon(page)) {
>>> if (pte_dirty(ptent)) {
>>> set_page_dirty(page);
>>> if (pte_write(ptent))
>>> force_flush = 1;
>>> }
>>> }
>> 
>> The “Second” part regards a perf issue, not a correctness issue. As I told
>> you offline, I think if makes sense to look both on pte_dirty() and
>> pte_write(), but I am afraid of other architectures than x86, which I know.
> 
> I don't really see why this logic is arch-specific. I meant, as long as
> pte_write() returns a value that means "this page is writable", I still
> think it's making sense.
> 
>> After our discussion, I looked at other architectures, and it does look
>> safe. So I will add an additional patch for that (with your suggested-by).
> 
> Only if you agree with what I thought, thanks. And that could be a
> separate patch for sure out of this one even if to come.

I do agree. I did not quite understand your second sentence. I guess you
meant not to combine it into this patchset (or at least that’s what I
thought and I attribute it to you).

>> 
>> 
>> I think that perhaps I did not communicate well enough the reason it makes
>> sense to set the dirty-bit.
>> 
>> Setting the access-bit and dirty-bit takes quite some time. So regardless of
>> whether you set the PageDirty, if you didn’t see access-bit and dirty-bit
>> and later you access the page - you are going to pay ~550 cycles for
>> nothing.
>> 
>> For reference: https://marc.info/?l=linux-kernel&m=146582237922378&w=2
>> 
>> If you want me to allow userspace to provide a hint, let me know.
> 
> You did communicate well, so probably it's me that didn't. :)
> 
> Yes I think it'll be nicer to allow userspace provide the same hint for
> UFFDIO_CONTINUE, the reason as I explained in the other thread on the young
> bit (or say ACCESS_LIKELY), that UFFDIO_CONTINUE can (at least in my
> current qemu impl [1]) proactively be used to install pgtables even if
> they're code pages and they may not be accessed in the near future. So
> IMHO we should treat UFFDIO_CONTINUE the same as UFFDIO_COPY, IMHO, from
> this point of view.
> 
> There's just still some differences on young/dirty here so I'm not sure
> whether the idea applies to both, but I think at least it applies to young
> bit (ACCESS_LIKELY).
> 
> [1] https://github.com/xzpeter/qemu/commit/b7758ad55af42b5364796362e96f4a06b51d9582

We have a saying in Hebrew: “When you explain slowly, I understand quickly”.
Thanks for explaining. I will add hints for UFFDIO_CONTINUE as well.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 3/5] userfaultfd: introduce write-likely mode for copy/wp operations
  2022-06-21 18:30         ` Nadav Amit
@ 2022-06-21 18:43           ` Peter Xu
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Xu @ 2022-06-21 18:43 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport

On Tue, Jun 21, 2022 at 11:30:52AM -0700, Nadav Amit wrote:
> > Only if you agree with what I thought, thanks. And that could be a
> > separate patch for sure out of this one even if to come.
> 
> I do agree. I did not quite understand your second sentence. I guess you
> meant not to combine it into this patchset (or at least that’s what I
> thought and I attribute it to you).

Feel free to not attribute anything to me, even no need on a suggested-by.
But if you plan to post for sure I'd be glad if you copy me, I may be very
willing to either provide a r-b or read about what's missed there on the
thought if that isn't work like that.

So the only purpose of the 2nd sentense is to make sure it shouldn't block
your current series (you explained why you're hurry and catching time), and
that's all I wanted to express.

> 
> >> 
> >> 
> >> I think that perhaps I did not communicate well enough the reason it makes
> >> sense to set the dirty-bit.
> >> 
> >> Setting the access-bit and dirty-bit takes quite some time. So regardless of
> >> whether you set the PageDirty, if you didn’t see access-bit and dirty-bit
> >> and later you access the page - you are going to pay ~550 cycles for
> >> nothing.
> >> 
> >> For reference: https://marc.info/?l=linux-kernel&m=146582237922378&w=2
> >> 
> >> If you want me to allow userspace to provide a hint, let me know.
> > 
> > You did communicate well, so probably it's me that didn't. :)
> > 
> > Yes I think it'll be nicer to allow userspace provide the same hint for
> > UFFDIO_CONTINUE, the reason as I explained in the other thread on the young
> > bit (or say ACCESS_LIKELY), that UFFDIO_CONTINUE can (at least in my
> > current qemu impl [1]) proactively be used to install pgtables even if
> > they're code pages and they may not be accessed in the near future. So
> > IMHO we should treat UFFDIO_CONTINUE the same as UFFDIO_COPY, IMHO, from
> > this point of view.
> > 
> > There's just still some differences on young/dirty here so I'm not sure
> > whether the idea applies to both, but I think at least it applies to young
> > bit (ACCESS_LIKELY).
> > 
> > [1] https://github.com/xzpeter/qemu/commit/b7758ad55af42b5364796362e96f4a06b51d9582
> 
> We have a saying in Hebrew: “When you explain slowly, I understand quickly”.
> Thanks for explaining. I will add hints for UFFDIO_CONTINUE as well.

I'm always happy to learn these interesting idioms, and I'm glad we seem to
be on the same page now.  Thanks!

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2022-06-21 18:43 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-19 23:34 [RFC PATCH v2 0/5] userfaultfd: support access/write hints Nadav Amit
2022-06-19 23:34 ` [RFC PATCH v2 1/5] userfaultfd: introduce uffd_flags Nadav Amit
2022-06-21  8:41   ` David Hildenbrand
2022-06-21 15:31     ` Peter Xu
2022-06-21 15:29   ` Peter Xu
2022-06-21 17:41     ` Nadav Amit
2022-06-19 23:34 ` [RFC PATCH v2 2/5] userfaultfd: introduce access-likely mode for copy/wp operations Nadav Amit
2022-06-20 10:33   ` kernel test robot
2022-06-21  8:48   ` David Hildenbrand
2022-06-21 15:42     ` Peter Xu
2022-06-21 17:27     ` Nadav Amit
2022-06-19 23:34 ` [RFC PATCH v2 3/5] userfaultfd: introduce write-likely " Nadav Amit
2022-06-21 16:38   ` Peter Xu
2022-06-21 17:14     ` Nadav Amit
2022-06-21 18:10       ` Peter Xu
2022-06-21 18:30         ` Nadav Amit
2022-06-21 18:43           ` Peter Xu
2022-06-19 23:34 ` [RFC PATCH v2 4/5] userfaultfd: zero access/write hints Nadav Amit
2022-06-20 18:06   ` kernel test robot
2022-06-21 17:04   ` Peter Xu
2022-06-21 17:17     ` Nadav Amit
2022-06-21 17:56       ` Peter Xu
2022-06-21 17:58         ` Nadav Amit
2022-06-19 23:34 ` [RFC PATCH v2 5/5] selftest/userfaultfd: test read/write hints Nadav Amit

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.