All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v1 0/5] userfaultfd: support access/write hints
@ 2022-06-22 18:50 Nadav Amit
  2022-06-22 18:50 ` [PATCH v1 1/5] userfaultfd: introduce uffd_flags Nadav Amit
                   ` (4 more replies)
  0 siblings, 5 replies; 34+ messages in thread
From: Nadav Amit @ 2022-06-22 18:50 UTC (permalink / raw)
  To: linux-mm
  Cc: Nadav Amit, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, Peter Xu, David Hildenbrand, Mike Rapoport

From: Nadav Amit <namit@vmware.com>

Setting the access-bit and dirty-bit introduces a tradeoff. When the bit
is set access/write is faster, but memory reclamation might be slower.
Currently, in the common userfaultfd cases the access-bit is not set on
and the dirty-bit is set. This is a questionable behavior.

Allow userspace to control this behavior through hints access- and
write-likely hints. These hints are used to control access- and
dirty-bits. For zero-pages that with write-likely hint, allocate a clear
page instead of mapping the zero-page.

RFCv2 -> v1:
* Adding hints to zeropage and continue
* Fixing other issues pointed by David H. & Peter Xu
* Adding tests to ./run_vmtests.sh

Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>

Nadav Amit (5):
  userfaultfd: introduce uffd_flags
  userfaultfd: introduce access-likely mode for common operations
  userfaultfd: introduce write-likely mode for uffd operations
  userfaultfd: zero access/write hints
  selftest/userfaultfd: test read/write hints

 fs/userfaultfd.c                          |  56 +++++++++--
 include/linux/hugetlb.h                   |   4 +-
 include/linux/shmem_fs.h                  |   8 +-
 include/linux/userfaultfd_k.h             |  26 +++--
 include/uapi/linux/userfaultfd.h          |  31 +++++-
 mm/hugetlb.c                              |   6 +-
 mm/shmem.c                                |   9 +-
 mm/userfaultfd.c                          | 117 ++++++++++++++++------
 tools/testing/selftests/vm/run_vmtests.sh |  23 ++---
 tools/testing/selftests/vm/userfaultfd.c  |  32 ++++++
 10 files changed, 239 insertions(+), 73 deletions(-)

-- 
2.25.1



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v1 1/5] userfaultfd: introduce uffd_flags
  2022-06-22 18:50 [PATCH v1 0/5] userfaultfd: support access/write hints Nadav Amit
@ 2022-06-22 18:50 ` Nadav Amit
  2022-06-23 21:57   ` Peter Xu
  2022-06-22 18:50 ` [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations Nadav Amit
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 34+ messages in thread
From: Nadav Amit @ 2022-06-22 18:50 UTC (permalink / raw)
  To: linux-mm
  Cc: Nadav Amit, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, Peter Xu, Mike Rapoport, David Hildenbrand

From: Nadav Amit <namit@vmware.com>

As the next patches are going to introduce more information that needs
to be propagated regarding handled user requests, introduce uffd_flags
that would be used to propagate this information.

Remove the unused UFFD_FLAGS_SET to avoid confusion in the constant
names.

Introducing uffd flags also allows to avoid mm/userfaultfd from being
using uapi (e.g., UFFDIO_COPY_MODE_WP).

Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 fs/userfaultfd.c              | 21 ++++++++++----
 include/linux/hugetlb.h       |  4 +--
 include/linux/shmem_fs.h      |  8 ++++--
 include/linux/userfaultfd_k.h | 24 ++++++++++------
 mm/hugetlb.c                  |  3 +-
 mm/shmem.c                    |  6 ++--
 mm/userfaultfd.c              | 53 ++++++++++++++++++-----------------
 7 files changed, 70 insertions(+), 49 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index d398f6bf6d74..a44e46f8249f 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1700,6 +1700,8 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 	struct uffdio_copy uffdio_copy;
 	struct uffdio_copy __user *user_uffdio_copy;
 	struct userfaultfd_wake_range range;
+	bool mode_wp;
+	uffd_flags_t uffd_flags;
 
 	user_uffdio_copy = (struct uffdio_copy __user *) arg;
 
@@ -1726,10 +1728,15 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 		goto out;
 	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP))
 		goto out;
+
+	mode_wp = uffdio_copy.mode & UFFDIO_COPY_MODE_WP;
+
+	uffd_flags = mode_wp ? UFFD_FLAGS_WP : UFFD_FLAGS_NONE;
+
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
 				   uffdio_copy.len, &ctx->mmap_changing,
-				   uffdio_copy.mode);
+				   uffd_flags);
 		mmput(ctx->mm);
 	} else {
 		return -ESRCH;
@@ -1757,6 +1764,7 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
 	struct uffdio_zeropage uffdio_zeropage;
 	struct uffdio_zeropage __user *user_uffdio_zeropage;
 	struct userfaultfd_wake_range range;
+	uffd_flags_t uffd_flags = UFFD_FLAGS_NONE;
 
 	user_uffdio_zeropage = (struct uffdio_zeropage __user *) arg;
 
@@ -1781,7 +1789,7 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mfill_zeropage(ctx->mm, uffdio_zeropage.range.start,
 				     uffdio_zeropage.range.len,
-				     &ctx->mmap_changing);
+				     &ctx->mmap_changing, uffd_flags);
 		mmput(ctx->mm);
 	} else {
 		return -ESRCH;
@@ -1810,6 +1818,7 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 	struct uffdio_writeprotect __user *user_uffdio_wp;
 	struct userfaultfd_wake_range range;
 	bool mode_wp, mode_dontwake;
+	uffd_flags_t uffd_flags;
 
 	if (atomic_read(&ctx->mmap_changing))
 		return -EAGAIN;
@@ -1835,10 +1844,12 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 	if (mode_wp && mode_dontwake)
 		return -EINVAL;
 
+	uffd_flags = mode_wp ? UFFD_FLAGS_WP : UFFD_FLAGS_NONE;
+
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
-					  uffdio_wp.range.len, mode_wp,
-					  &ctx->mmap_changing);
+					  uffdio_wp.range.len,
+					  &ctx->mmap_changing, uffd_flags);
 		mmput(ctx->mm);
 	} else {
 		return -ESRCH;
@@ -1891,7 +1902,7 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mcopy_continue(ctx->mm, uffdio_continue.range.start,
 				     uffdio_continue.range.len,
-				     &ctx->mmap_changing);
+				     &ctx->mmap_changing, 0);
 		mmput(ctx->mm);
 	} else {
 		return -ESRCH;
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 642a39016f9a..a4f326bc2de6 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -166,7 +166,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
 				unsigned long src_addr,
 				enum mcopy_atomic_mode mode,
 				struct page **pagep,
-				bool wp_copy);
+				uffd_flags_t uffd_flags);
 #endif /* CONFIG_USERFAULTFD */
 bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
 						struct vm_area_struct *vma,
@@ -366,7 +366,7 @@ static inline int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 						unsigned long src_addr,
 						enum mcopy_atomic_mode mode,
 						struct page **pagep,
-						bool wp_copy)
+						uffd_flags_t uffd_flags)
 {
 	BUG();
 	return 0;
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index a68f982f22d1..f93a3c114002 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -9,6 +9,7 @@
 #include <linux/percpu_counter.h>
 #include <linux/xattr.h>
 #include <linux/fs_parser.h>
+#include <linux/userfaultfd_k.h>
 
 /* inode in-kernel data */
 
@@ -145,11 +146,12 @@ extern int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 				  struct vm_area_struct *dst_vma,
 				  unsigned long dst_addr,
 				  unsigned long src_addr,
-				  bool zeropage, bool wp_copy,
-				  struct page **pagep);
+				  bool zeropage,
+				  struct page **pagep,
+				  uffd_flags_t uffd_flags);
 #else /* !CONFIG_SHMEM */
 #define shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, \
-			       src_addr, zeropage, wp_copy, pagep) ({ BUG(); 0; })
+			       src_addr, zeropage, pagep, uffd_flags) ({ BUG(); 0; })
 #endif /* CONFIG_SHMEM */
 #endif /* CONFIG_USERFAULTFD */
 
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index eee374c29c85..d5b3dff48a87 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -34,7 +34,6 @@
 #define UFFD_NONBLOCK O_NONBLOCK
 
 #define UFFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
-#define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS)
 
 extern int sysctl_unprivileged_userfaultfd;
 
@@ -56,23 +55,30 @@ enum mcopy_atomic_mode {
 	MCOPY_ATOMIC_CONTINUE,
 };
 
+typedef unsigned int __bitwise uffd_flags_t;
+
+#define UFFD_FLAGS_NONE			((__force uffd_flags_t)0)
+#define UFFD_FLAGS_WP			((__force uffd_flags_t)BIT(0))
+
 extern int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 				    struct vm_area_struct *dst_vma,
 				    unsigned long dst_addr, struct page *page,
-				    bool newly_allocated, bool wp_copy);
+				    bool newly_allocated,
+				    uffd_flags_t uffd_flags);
 
 extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
 			    unsigned long src_start, unsigned long len,
-			    atomic_t *mmap_changing, __u64 mode);
-extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
-			      unsigned long dst_start,
-			      unsigned long len,
-			      atomic_t *mmap_changing);
+			    atomic_t *mmap_changing, uffd_flags_t uffd_flags);
+extern ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long dst_start,
+			      unsigned long len, atomic_t *mmap_changing,
+			      uffd_flags_t uffd_flags);
 extern ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long dst_start,
-			      unsigned long len, atomic_t *mmap_changing);
+			      unsigned long len, atomic_t *mmap_changing,
+			      uffd_flags_t uffd_flags);
 extern int mwriteprotect_range(struct mm_struct *dst_mm,
 			       unsigned long start, unsigned long len,
-			       bool enable_wp, atomic_t *mmap_changing);
+			       atomic_t *mmap_changing,
+			       uffd_flags_t uffd_flags);
 
 /* mm helpers */
 static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2bc9d1170e4f..2beff8a4bf7c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5875,9 +5875,10 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 			    unsigned long src_addr,
 			    enum mcopy_atomic_mode mode,
 			    struct page **pagep,
-			    bool wp_copy)
+			    uffd_flags_t uffd_flags)
 {
 	bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
+	bool wp_copy = uffd_flags & UFFD_FLAGS_WP;
 	struct hstate *h = hstate_vma(dst_vma);
 	struct address_space *mapping = dst_vma->vm_file->f_mapping;
 	pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
diff --git a/mm/shmem.c b/mm/shmem.c
index 12ac67dc831f..89c775275bae 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2343,8 +2343,8 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 			   struct vm_area_struct *dst_vma,
 			   unsigned long dst_addr,
 			   unsigned long src_addr,
-			   bool zeropage, bool wp_copy,
-			   struct page **pagep)
+			   bool zeropage, struct page **pagep,
+			   uffd_flags_t uffd_flags)
 {
 	struct inode *inode = file_inode(dst_vma->vm_file);
 	struct shmem_inode_info *info = SHMEM_I(inode);
@@ -2418,7 +2418,7 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 		goto out_release;
 
 	ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
-				       page, true, wp_copy);
+				       page, true, uffd_flags);
 	if (ret)
 		goto out_delete_from_cache;
 
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 07d3befc80e4..734de6aa0b8e 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -58,7 +58,7 @@ struct vm_area_struct *find_dst_vma(struct mm_struct *dst_mm,
 int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 			     struct vm_area_struct *dst_vma,
 			     unsigned long dst_addr, struct page *page,
-			     bool newly_allocated, bool wp_copy)
+			     bool newly_allocated, uffd_flags_t uffd_flags)
 {
 	int ret;
 	pte_t _dst_pte, *dst_pte;
@@ -78,7 +78,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	 * Always mark a PTE as write-protected when needed, regardless of
 	 * VM_WRITE, which the user might change.
 	 */
-	if (wp_copy) {
+	if (uffd_flags & UFFD_FLAGS_WP) {
 		_dst_pte = pte_mkuffd_wp(_dst_pte);
 		writable = false;
 	}
@@ -145,7 +145,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 			    unsigned long dst_addr,
 			    unsigned long src_addr,
 			    struct page **pagep,
-			    bool wp_copy)
+			    uffd_flags_t uffd_flags)
 {
 	void *page_kaddr;
 	int ret;
@@ -189,7 +189,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 		goto out_release;
 
 	ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
-				       page, true, wp_copy);
+				       page, true, uffd_flags);
 	if (ret)
 		goto out_release;
 out:
@@ -239,7 +239,7 @@ static int mcontinue_atomic_pte(struct mm_struct *dst_mm,
 				pmd_t *dst_pmd,
 				struct vm_area_struct *dst_vma,
 				unsigned long dst_addr,
-				bool wp_copy)
+				uffd_flags_t uffd_flags)
 {
 	struct inode *inode = file_inode(dst_vma->vm_file);
 	pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
@@ -263,7 +263,7 @@ static int mcontinue_atomic_pte(struct mm_struct *dst_mm,
 	}
 
 	ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
-				       page, false, wp_copy);
+				       page, false, uffd_flags);
 	if (ret)
 		goto out_release;
 
@@ -309,7 +309,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 					      unsigned long src_start,
 					      unsigned long len,
 					      enum mcopy_atomic_mode mode,
-					      bool wp_copy)
+					      uffd_flags_t uffd_flags)
 {
 	int vm_shared = dst_vma->vm_flags & VM_SHARED;
 	ssize_t err;
@@ -406,7 +406,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 
 		err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
 					       dst_addr, src_addr, mode, &page,
-					       wp_copy);
+					       uffd_flags);
 
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		i_mmap_unlock_read(mapping);
@@ -462,7 +462,7 @@ extern ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 				      unsigned long src_start,
 				      unsigned long len,
 				      enum mcopy_atomic_mode mode,
-				      bool wp_copy);
+				      uffd_flags_t uffd_flags);
 #endif /* CONFIG_HUGETLB_PAGE */
 
 static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
@@ -472,13 +472,13 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 						unsigned long src_addr,
 						struct page **page,
 						enum mcopy_atomic_mode mode,
-						bool wp_copy)
+						uffd_flags_t uffd_flags)
 {
 	ssize_t err;
 
 	if (mode == MCOPY_ATOMIC_CONTINUE) {
 		return mcontinue_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
-					    wp_copy);
+					    uffd_flags);
 	}
 
 	/*
@@ -495,7 +495,7 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 		if (mode == MCOPY_ATOMIC_NORMAL)
 			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
 					       dst_addr, src_addr, page,
-					       wp_copy);
+					       uffd_flags);
 		else
 			err = mfill_zeropage_pte(dst_mm, dst_pmd,
 						 dst_vma, dst_addr);
@@ -503,7 +503,7 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 		err = shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma,
 					     dst_addr, src_addr,
 					     mode != MCOPY_ATOMIC_NORMAL,
-					     wp_copy, page);
+					     page, uffd_flags);
 	}
 
 	return err;
@@ -515,7 +515,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 					      unsigned long len,
 					      enum mcopy_atomic_mode mcopy_mode,
 					      atomic_t *mmap_changing,
-					      __u64 mode)
+					      uffd_flags_t uffd_flags)
 {
 	struct vm_area_struct *dst_vma;
 	ssize_t err;
@@ -523,7 +523,6 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	unsigned long src_addr, dst_addr;
 	long copied;
 	struct page *page;
-	bool wp_copy;
 
 	/*
 	 * Sanitize the command parameters:
@@ -570,11 +569,10 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 		goto out_unlock;
 
 	/*
-	 * validate 'mode' now that we know the dst_vma: don't allow
+	 * validate 'flags' now that we know the dst_vma: don't allow
 	 * a wrprotect copy if the userfaultfd didn't register as WP.
 	 */
-	wp_copy = mode & UFFDIO_COPY_MODE_WP;
-	if (wp_copy && !(dst_vma->vm_flags & VM_UFFD_WP))
+	if ((uffd_flags & UFFD_FLAGS_WP) && !(dst_vma->vm_flags & VM_UFFD_WP))
 		goto out_unlock;
 
 	/*
@@ -583,7 +581,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	if (is_vm_hugetlb_page(dst_vma))
 		return  __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
 					       src_start, len, mcopy_mode,
-					       wp_copy);
+					       uffd_flags);
 
 	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
 		goto out_unlock;
@@ -635,7 +633,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 		BUG_ON(pmd_trans_huge(*dst_pmd));
 
 		err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
-				       src_addr, &page, mcopy_mode, wp_copy);
+				       src_addr, &page, mcopy_mode, uffd_flags);
 		cond_resched();
 
 		if (unlikely(err == -ENOENT)) {
@@ -683,30 +681,33 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 
 ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
 		     unsigned long src_start, unsigned long len,
-		     atomic_t *mmap_changing, __u64 mode)
+		     atomic_t *mmap_changing, uffd_flags_t uffd_flags)
 {
 	return __mcopy_atomic(dst_mm, dst_start, src_start, len,
-			      MCOPY_ATOMIC_NORMAL, mmap_changing, mode);
+			      MCOPY_ATOMIC_NORMAL, mmap_changing, uffd_flags);
 }
 
 ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
-		       unsigned long len, atomic_t *mmap_changing)
+		       unsigned long len, atomic_t *mmap_changing,
+		       uffd_flags_t uffd_flags)
 {
 	return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_ZEROPAGE,
 			      mmap_changing, 0);
 }
 
 ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long start,
-		       unsigned long len, atomic_t *mmap_changing)
+		       unsigned long len, atomic_t *mmap_changing,
+		       uffd_flags_t uffd_flags)
 {
 	return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_CONTINUE,
 			      mmap_changing, 0);
 }
 
 int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
-			unsigned long len, bool enable_wp,
-			atomic_t *mmap_changing)
+			unsigned long len,
+			atomic_t *mmap_changing, uffd_flags_t uffd_flags)
 {
+	bool enable_wp = uffd_flags & UFFD_FLAGS_WP;
 	struct vm_area_struct *dst_vma;
 	unsigned long page_mask;
 	struct mmu_gather tlb;
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-06-22 18:50 [PATCH v1 0/5] userfaultfd: support access/write hints Nadav Amit
  2022-06-22 18:50 ` [PATCH v1 1/5] userfaultfd: introduce uffd_flags Nadav Amit
@ 2022-06-22 18:50 ` Nadav Amit
  2022-06-23 23:24   ` Peter Xu
  2022-07-12  6:19   ` Nadav Amit
  2022-06-22 18:50 ` [PATCH v1 3/5] userfaultfd: introduce write-likely mode for uffd operations Nadav Amit
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 34+ messages in thread
From: Nadav Amit @ 2022-06-22 18:50 UTC (permalink / raw)
  To: linux-mm
  Cc: Nadav Amit, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, Peter Xu, David Hildenbrand, Mike Rapoport

From: Nadav Amit <namit@vmware.com>

Using a PTE on x86 with cleared access-bit (aka young-bit)
takes ~600 cycles more than when the access bit is set. At the same
time, setting the access-bit for memory that is not used (e.g.,
prefetched) can introduce greater overheads, as the prefetched memory is
reclaimed later than it should be.

Userfaultfd currently does not set the access-bit (excluding the
huge-pages case). Arguably, it is best to let the user control whether
the access bit should be set or not. The expected use is to request
userfaultfd to set the access-bit when the copy/wp operation is done to
resolve a page-fault, and not to set the access-bit when the memory is
prefetched.

Introduce UFFDIO_[op]_ACCESS_LIKELY to enable userspace to request the
young bit to be set.

Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 fs/userfaultfd.c                 | 25 ++++++++++++++++++++-----
 include/linux/userfaultfd_k.h    |  1 +
 include/uapi/linux/userfaultfd.h | 20 +++++++++++++++++++-
 mm/userfaultfd.c                 | 16 ++++++++++++----
 4 files changed, 52 insertions(+), 10 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index a44e46f8249f..abf176bd0349 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1726,12 +1726,15 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 	ret = -EINVAL;
 	if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
 		goto out;
-	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP))
+	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP|
+				 UFFDIO_COPY_MODE_ACCESS_LIKELY))
 		goto out;
 
 	mode_wp = uffdio_copy.mode & UFFDIO_COPY_MODE_WP;
 
 	uffd_flags = mode_wp ? UFFD_FLAGS_WP : UFFD_FLAGS_NONE;
+	if (uffdio_copy.mode & UFFDIO_COPY_MODE_ACCESS_LIKELY)
+		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
 
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
@@ -1783,9 +1786,13 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
 	if (ret)
 		goto out;
 	ret = -EINVAL;
-	if (uffdio_zeropage.mode & ~UFFDIO_ZEROPAGE_MODE_DONTWAKE)
+	if (uffdio_zeropage.mode & ~(UFFDIO_ZEROPAGE_MODE_DONTWAKE|
+				     UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY))
 		goto out;
 
+	if (uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY)
+		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mfill_zeropage(ctx->mm, uffdio_zeropage.range.start,
 				     uffdio_zeropage.range.len,
@@ -1835,7 +1842,8 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 		return ret;
 
 	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
-			       UFFDIO_WRITEPROTECT_MODE_WP))
+			       UFFDIO_WRITEPROTECT_MODE_WP |
+			       UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY))
 		return -EINVAL;
 
 	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
@@ -1845,6 +1853,8 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 		return -EINVAL;
 
 	uffd_flags = mode_wp ? UFFD_FLAGS_WP : UFFD_FLAGS_NONE;
+	if (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY)
+		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
 
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
@@ -1872,6 +1882,7 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
 	struct uffdio_continue uffdio_continue;
 	struct uffdio_continue __user *user_uffdio_continue;
 	struct userfaultfd_wake_range range;
+	uffd_flags_t uffd_flags = UFFD_FLAGS_NONE;
 
 	user_uffdio_continue = (struct uffdio_continue __user *)arg;
 
@@ -1896,13 +1907,17 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
 	    uffdio_continue.range.start) {
 		goto out;
 	}
-	if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE)
+	if (uffdio_continue.mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE|
+				     UFFDIO_CONTINUE_MODE_ACCESS_LIKELY))
 		goto out;
 
+	if (uffdio_continue.mode & UFFDIO_CONTINUE_MODE_ACCESS_LIKELY)
+		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mcopy_continue(ctx->mm, uffdio_continue.range.start,
 				     uffdio_continue.range.len,
-				     &ctx->mmap_changing, 0);
+				     &ctx->mmap_changing, uffd_flags);
 		mmput(ctx->mm);
 	} else {
 		return -ESRCH;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index d5b3dff48a87..af268b2c2b27 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -59,6 +59,7 @@ typedef unsigned int __bitwise uffd_flags_t;
 
 #define UFFD_FLAGS_NONE			((__force uffd_flags_t)0)
 #define UFFD_FLAGS_WP			((__force uffd_flags_t)BIT(0))
+#define UFFD_FLAGS_ACCESS_LIKELY	((__force uffd_flags_t)BIT(1))
 
 extern int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 				    struct vm_area_struct *dst_vma,
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 005e5e306266..ff7150c878bb 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -38,7 +38,8 @@
 			   UFFD_FEATURE_MINOR_HUGETLBFS |	\
 			   UFFD_FEATURE_MINOR_SHMEM |		\
 			   UFFD_FEATURE_EXACT_ADDRESS |		\
-			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM)
+			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM |	\
+			   UFFD_FEATURE_ACCESS_HINTS)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -203,6 +204,9 @@ struct uffdio_api {
 	 *
 	 * UFFD_FEATURE_WP_HUGETLBFS_SHMEM indicates that userfaultfd
 	 * write-protection mode is supported on both shmem and hugetlbfs.
+	 *
+	 * UFFD_FEATURE_ACCESS_HINTS indicates that the ioctl operations
+	 * support the UFFDIO_*_MODE_ACCESS_LIKELY hints.
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
@@ -217,6 +221,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_MINOR_SHMEM		(1<<10)
 #define UFFD_FEATURE_EXACT_ADDRESS		(1<<11)
 #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM		(1<<12)
+#define UFFD_FEATURE_ACCESS_HINTS		(1<<13)
 	__u64 features;
 
 	__u64 ioctls;
@@ -251,8 +256,14 @@ struct uffdio_copy {
 	 * the fly.  UFFDIO_COPY_MODE_WP is available only if the
 	 * write protected ioctl is implemented for the range
 	 * according to the uffdio_register.ioctls.
+	 *
+	 * UFFDIO_COPY_MODE_ACCESS_LIKELY provides a hint to the kernel that the
+	 * page is likely to be access in the near future. Providing the hint
+	 * properly can improve performance.
+	 *
 	 */
 #define UFFDIO_COPY_MODE_WP			((__u64)1<<1)
+#define UFFDIO_COPY_MODE_ACCESS_LIKELY		((__u64)1<<2)
 	__u64 mode;
 
 	/*
@@ -265,6 +276,7 @@ struct uffdio_copy {
 struct uffdio_zeropage {
 	struct uffdio_range range;
 #define UFFDIO_ZEROPAGE_MODE_DONTWAKE		((__u64)1<<0)
+#define UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY	((__u64)1<<1)
 	__u64 mode;
 
 	/*
@@ -284,6 +296,10 @@ struct uffdio_writeprotect {
  * UFFDIO_WRITEPROTECT_MODE_DONTWAKE: set the flag to avoid waking up
  * any wait thread after the operation succeeds.
  *
+ * UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY provides a hint to the kernel
+ * that the page is likely to be access in the near future. Providing
+ * the hint properly can improve performance.
+ *
  * NOTE: Write protecting a region (WP=1) is unrelated to page faults,
  * therefore DONTWAKE flag is meaningless with WP=1.  Removing write
  * protection (WP=0) in response to a page fault wakes the faulting
@@ -291,12 +307,14 @@ struct uffdio_writeprotect {
  */
 #define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
 #define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
+#define UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY	((__u64)1<<2)
 	__u64 mode;
 };
 
 struct uffdio_continue {
 	struct uffdio_range range;
 #define UFFDIO_CONTINUE_MODE_DONTWAKE		((__u64)1<<0)
+#define UFFDIO_CONTINUE_MODE_ACCESS_LIKELY	((__u64)1<<1)
 	__u64 mode;
 
 	/*
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 734de6aa0b8e..5051b9028722 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -92,6 +92,9 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 		 */
 		_dst_pte = pte_wrprotect(_dst_pte);
 
+	if (uffd_flags & UFFD_FLAGS_ACCESS_LIKELY)
+		_dst_pte = pte_mkyoung(_dst_pte);
+
 	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
 
 	if (vma_is_shmem(dst_vma)) {
@@ -202,7 +205,8 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 static int mfill_zeropage_pte(struct mm_struct *dst_mm,
 			      pmd_t *dst_pmd,
 			      struct vm_area_struct *dst_vma,
-			      unsigned long dst_addr)
+			      unsigned long dst_addr,
+			      uffd_flags_t uffd_flags)
 {
 	pte_t _dst_pte, *dst_pte;
 	spinlock_t *ptl;
@@ -225,6 +229,10 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm,
 	ret = -EEXIST;
 	if (!pte_none(*dst_pte))
 		goto out_unlock;
+
+	if (uffd_flags & UFFD_FLAGS_ACCESS_LIKELY)
+		_dst_pte = pte_mkyoung(_dst_pte);
+
 	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(dst_vma, dst_addr, dst_pte);
@@ -498,7 +506,7 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 					       uffd_flags);
 		else
 			err = mfill_zeropage_pte(dst_mm, dst_pmd,
-						 dst_vma, dst_addr);
+						 dst_vma, dst_addr, uffd_flags);
 	} else {
 		err = shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma,
 					     dst_addr, src_addr,
@@ -692,7 +700,7 @@ ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
 		       uffd_flags_t uffd_flags)
 {
 	return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_ZEROPAGE,
-			      mmap_changing, 0);
+			      mmap_changing, uffd_flags);
 }
 
 ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long start,
@@ -700,7 +708,7 @@ ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long start,
 		       uffd_flags_t uffd_flags)
 {
 	return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_CONTINUE,
-			      mmap_changing, 0);
+			      mmap_changing, uffd_flags);
 }
 
 int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v1 3/5] userfaultfd: introduce write-likely mode for uffd operations
  2022-06-22 18:50 [PATCH v1 0/5] userfaultfd: support access/write hints Nadav Amit
  2022-06-22 18:50 ` [PATCH v1 1/5] userfaultfd: introduce uffd_flags Nadav Amit
  2022-06-22 18:50 ` [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations Nadav Amit
@ 2022-06-22 18:50 ` Nadav Amit
  2022-06-22 18:50 ` [PATCH v1 4/5] userfaultfd: zero access/write hints Nadav Amit
  2022-06-22 18:50 ` [PATCH v1 5/5] selftest/userfaultfd: test read/write hints Nadav Amit
  4 siblings, 0 replies; 34+ messages in thread
From: Nadav Amit @ 2022-06-22 18:50 UTC (permalink / raw)
  To: linux-mm
  Cc: Nadav Amit, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, Peter Xu, David Hildenbrand, Mike Rapoport

From: Nadav Amit <namit@vmware.com>

Either always setting the dirty bit or always leaving it clear does not
seem as the best policy. Leaving the bit clear introduces overhead on
the first write-access, which is required to set the bit.  Setting the
bit for pages the are eventually not written can require more TLB
flushes.

Let the userfaultfd users control whether PTEs are marked as dirty or
clean. Introduce UFFDIO_[op]_MODE_WRITE to enable userspace to indicate
whether pages are likely to be written and set the dirty-bit if they are
likely to be written.

Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 fs/userfaultfd.c                 | 20 ++++++++++++++++----
 include/linux/userfaultfd_k.h    |  1 +
 include/uapi/linux/userfaultfd.h | 13 ++++++++++++-
 mm/hugetlb.c                     |  3 +++
 mm/shmem.c                       |  3 +++
 mm/userfaultfd.c                 | 13 ++++++++++---
 6 files changed, 45 insertions(+), 8 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index abf176bd0349..13d73e37e230 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1727,7 +1727,8 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 	if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
 		goto out;
 	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP|
-				 UFFDIO_COPY_MODE_ACCESS_LIKELY))
+				 UFFDIO_COPY_MODE_ACCESS_LIKELY|
+				 UFFDIO_COPY_MODE_WRITE_LIKELY))
 		goto out;
 
 	mode_wp = uffdio_copy.mode & UFFDIO_COPY_MODE_WP;
@@ -1735,6 +1736,8 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 	uffd_flags = mode_wp ? UFFD_FLAGS_WP : UFFD_FLAGS_NONE;
 	if (uffdio_copy.mode & UFFDIO_COPY_MODE_ACCESS_LIKELY)
 		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+	if (uffdio_copy.mode & UFFDIO_COPY_MODE_WRITE_LIKELY)
+		uffd_flags |= UFFD_FLAGS_WRITE_LIKELY;
 
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
@@ -1787,11 +1790,14 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
 		goto out;
 	ret = -EINVAL;
 	if (uffdio_zeropage.mode & ~(UFFDIO_ZEROPAGE_MODE_DONTWAKE|
-				     UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY))
+				     UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY|
+				     UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY))
 		goto out;
 
 	if (uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY)
 		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+	if (uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY)
+		uffd_flags |= UFFD_FLAGS_WRITE_LIKELY;
 
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mfill_zeropage(ctx->mm, uffdio_zeropage.range.start,
@@ -1843,7 +1849,8 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 
 	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
 			       UFFDIO_WRITEPROTECT_MODE_WP |
-			       UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY))
+			       UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY |
+			       UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY))
 		return -EINVAL;
 
 	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
@@ -1855,6 +1862,8 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 	uffd_flags = mode_wp ? UFFD_FLAGS_WP : UFFD_FLAGS_NONE;
 	if (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY)
 		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+	if (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY)
+		uffd_flags |= UFFD_FLAGS_WRITE_LIKELY;
 
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
@@ -1908,11 +1917,14 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
 		goto out;
 	}
 	if (uffdio_continue.mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE|
-				     UFFDIO_CONTINUE_MODE_ACCESS_LIKELY))
+				     UFFDIO_CONTINUE_MODE_ACCESS_LIKELY|
+				     UFFDIO_CONTINUE_MODE_WRITE_LIKELY))
 		goto out;
 
 	if (uffdio_continue.mode & UFFDIO_CONTINUE_MODE_ACCESS_LIKELY)
 		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+	if (uffdio_continue.mode & UFFDIO_CONTINUE_MODE_WRITE_LIKELY)
+		uffd_flags |= UFFD_FLAGS_WRITE_LIKELY;
 
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mcopy_continue(ctx->mm, uffdio_continue.range.start,
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index af268b2c2b27..59c43ea502e7 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -60,6 +60,7 @@ typedef unsigned int __bitwise uffd_flags_t;
 #define UFFD_FLAGS_NONE			((__force uffd_flags_t)0)
 #define UFFD_FLAGS_WP			((__force uffd_flags_t)BIT(0))
 #define UFFD_FLAGS_ACCESS_LIKELY	((__force uffd_flags_t)BIT(1))
+#define UFFD_FLAGS_WRITE_LIKELY		((__force uffd_flags_t)BIT(2))
 
 extern int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 				    struct vm_area_struct *dst_vma,
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index ff7150c878bb..7b6ab0b43475 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -206,7 +206,7 @@ struct uffdio_api {
 	 * write-protection mode is supported on both shmem and hugetlbfs.
 	 *
 	 * UFFD_FEATURE_ACCESS_HINTS indicates that the ioctl operations
-	 * support the UFFDIO_*_MODE_ACCESS_LIKELY hints.
+	 * support the UFFDIO_*_MODE_[ACCESS|WRITE]_LIKELY hints.
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
@@ -261,9 +261,13 @@ struct uffdio_copy {
 	 * page is likely to be access in the near future. Providing the hint
 	 * properly can improve performance.
 	 *
+	 * UFFDIO_COPY_MODE_WRITE_LIKELY provides a hint to the kernel that the
+	 * page is likely to be written in the near future. Providing the hint
+	 * properly can improve performance.
 	 */
 #define UFFDIO_COPY_MODE_WP			((__u64)1<<1)
 #define UFFDIO_COPY_MODE_ACCESS_LIKELY		((__u64)1<<2)
+#define UFFDIO_COPY_MODE_WRITE_LIKELY		((__u64)1<<3)
 	__u64 mode;
 
 	/*
@@ -277,6 +281,7 @@ struct uffdio_zeropage {
 	struct uffdio_range range;
 #define UFFDIO_ZEROPAGE_MODE_DONTWAKE		((__u64)1<<0)
 #define UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY	((__u64)1<<1)
+#define UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY	((__u64)1<<2)
 	__u64 mode;
 
 	/*
@@ -300,6 +305,10 @@ struct uffdio_writeprotect {
  * that the page is likely to be access in the near future. Providing
  * the hint properly can improve performance.
  *
+ * UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY: provides a hint to the kernel
+ * that the page is likely to be written in the near future. Providing
+ * the hint properly can improve performance.
+ *
  * NOTE: Write protecting a region (WP=1) is unrelated to page faults,
  * therefore DONTWAKE flag is meaningless with WP=1.  Removing write
  * protection (WP=0) in response to a page fault wakes the faulting
@@ -308,6 +317,7 @@ struct uffdio_writeprotect {
 #define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
 #define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
 #define UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY	((__u64)1<<2)
+#define UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY	((__u64)1<<3)
 	__u64 mode;
 };
 
@@ -315,6 +325,7 @@ struct uffdio_continue {
 	struct uffdio_range range;
 #define UFFDIO_CONTINUE_MODE_DONTWAKE		((__u64)1<<0)
 #define UFFDIO_CONTINUE_MODE_ACCESS_LIKELY	((__u64)1<<1)
+#define UFFDIO_CONTINUE_MODE_WRITE_LIKELY	((__u64)1<<2)
 	__u64 mode;
 
 	/*
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2beff8a4bf7c..46814fc7762f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5962,6 +5962,9 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		*pagep = NULL;
 	}
 
+	/* The PTE is not marked as dirty unconditionally */
+	SetPageDirty(page);
+
 	/*
 	 * The memory barrier inside __SetPageUptodate makes sure that
 	 * preceding stores to the page contents become visible before
diff --git a/mm/shmem.c b/mm/shmem.c
index 89c775275bae..7488cd186c32 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2404,6 +2404,9 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 	VM_BUG_ON(PageSwapBacked(page));
 	__SetPageLocked(page);
 	__SetPageSwapBacked(page);
+
+	/* The PTE is not marked as dirty unconditionally */
+	SetPageDirty(page);
 	__SetPageUptodate(page);
 
 	ret = -EFAULT;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 5051b9028722..6e767f1e7007 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -70,7 +70,6 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	pgoff_t offset, max_off;
 
 	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
-	_dst_pte = pte_mkdirty(_dst_pte);
 	if (page_in_cache && !vm_shared)
 		writable = false;
 
@@ -83,14 +82,19 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 		writable = false;
 	}
 
-	if (writable)
+	if (writable) {
 		_dst_pte = pte_mkwrite(_dst_pte);
-	else
+
+		/* Marking RO entries as dirty can mess with other code */
+		if (uffd_flags & UFFD_FLAGS_WRITE_LIKELY)
+			_dst_pte = pte_mkdirty(_dst_pte);
+	} else {
 		/*
 		 * We need this to make sure write bit removed; as mk_pte()
 		 * could return a pte with write bit set.
 		 */
 		_dst_pte = pte_wrprotect(_dst_pte);
+	}
 
 	if (uffd_flags & UFFD_FLAGS_ACCESS_LIKELY)
 		_dst_pte = pte_mkyoung(_dst_pte);
@@ -180,6 +184,9 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 		*pagep = NULL;
 	}
 
+	/* The PTE is not marked as dirty unconditionally */
+	SetPageDirty(page);
+
 	/*
 	 * The memory barrier inside __SetPageUptodate makes sure that
 	 * preceding stores to the page contents become visible before
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v1 4/5] userfaultfd: zero access/write hints
  2022-06-22 18:50 [PATCH v1 0/5] userfaultfd: support access/write hints Nadav Amit
                   ` (2 preceding siblings ...)
  2022-06-22 18:50 ` [PATCH v1 3/5] userfaultfd: introduce write-likely mode for uffd operations Nadav Amit
@ 2022-06-22 18:50 ` Nadav Amit
  2022-06-23 23:34   ` Peter Xu
  2022-06-22 18:50 ` [PATCH v1 5/5] selftest/userfaultfd: test read/write hints Nadav Amit
  4 siblings, 1 reply; 34+ messages in thread
From: Nadav Amit @ 2022-06-22 18:50 UTC (permalink / raw)
  To: linux-mm
  Cc: Nadav Amit, David Hildenbrand, Mike Kravetz, Hugh Dickins,
	Andrew Morton, Axel Rasmussen, Peter Xu, Mike Rapoport

From: Nadav Amit <namit@vmware.com>

When userfaultfd provides a zeropage in response to ioctl, it provides a
readonly alias to the zero page. If the page is later written (which is
the likely scenario), page-fault occurs and the page-fault allocator
allocates a page and rewires the page-tables.

This is an expensive flow for cases in which a page is likely be written
to. Users can use the copy ioctl to initialize zero page (by copying
zeros), but this is also wasteful.

Allow userfaultfd users to efficiently map initialized zero-pages that
are writable. IF UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY is provided would map
a clear page instead of an alias to the zero page.

For consistency, introduce also UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY.

Suggested-by: David Hildenbrand <david@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 mm/userfaultfd.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 6e767f1e7007..48286746b0af 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -249,6 +249,37 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm,
 	return ret;
 }
 
+static int mfill_clearpage_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
+			       struct vm_area_struct *dst_vma,
+			       unsigned long dst_addr,
+			       uffd_flags_t uffd_flags)
+{
+	struct page *page;
+	int ret;
+
+	ret = -ENOMEM;
+	page = alloc_zeroed_user_highpage_movable(dst_vma, dst_addr);
+	if (!page)
+		goto out;
+
+	/* The PTE is not marked as dirty unconditionally */
+	SetPageDirty(page);
+	__SetPageUptodate(page);
+
+	if (mem_cgroup_charge(page_folio(page), dst_vma->vm_mm, GFP_KERNEL))
+		goto out_release;
+
+	ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
+				       page, true, uffd_flags);
+	if (ret)
+		goto out_release;
+out:
+	return ret;
+out_release:
+	put_page(page);
+	goto out;
+}
+
 /* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */
 static int mcontinue_atomic_pte(struct mm_struct *dst_mm,
 				pmd_t *dst_pmd,
@@ -511,6 +542,10 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
 					       dst_addr, src_addr, page,
 					       uffd_flags);
+		else if (!(uffd_flags & UFFD_FLAGS_WP) &&
+			 (uffd_flags & UFFD_FLAGS_WRITE_LIKELY))
+			err = mfill_clearpage_pte(dst_mm, dst_pmd, dst_vma,
+						  dst_addr, uffd_flags);
 		else
 			err = mfill_zeropage_pte(dst_mm, dst_pmd,
 						 dst_vma, dst_addr, uffd_flags);
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v1 5/5] selftest/userfaultfd: test read/write hints
  2022-06-22 18:50 [PATCH v1 0/5] userfaultfd: support access/write hints Nadav Amit
                   ` (3 preceding siblings ...)
  2022-06-22 18:50 ` [PATCH v1 4/5] userfaultfd: zero access/write hints Nadav Amit
@ 2022-06-22 18:50 ` Nadav Amit
  4 siblings, 0 replies; 34+ messages in thread
From: Nadav Amit @ 2022-06-22 18:50 UTC (permalink / raw)
  To: linux-mm
  Cc: Nadav Amit, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, Peter Xu, David Hildenbrand, Mike Rapoport

From: Nadav Amit <namit@vmware.com>

Test UFFDIO_*_MODE_ACCESS_LIKELY and UFFDIO_*_MODE_WRITE_LIKELY.
Introduce a modifier to trigger the use of the hints.

Add the test to run_vmtests.sh and add an array to run different
userfaultfd configurations.

Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 tools/testing/selftests/vm/run_vmtests.sh | 23 ++++++++--------
 tools/testing/selftests/vm/userfaultfd.c  | 32 +++++++++++++++++++++++
 2 files changed, 43 insertions(+), 12 deletions(-)

diff --git a/tools/testing/selftests/vm/run_vmtests.sh b/tools/testing/selftests/vm/run_vmtests.sh
index 930c54eb5b4b..b90e9cf9716d 100755
--- a/tools/testing/selftests/vm/run_vmtests.sh
+++ b/tools/testing/selftests/vm/run_vmtests.sh
@@ -120,18 +120,17 @@ run_test ./gup_test -a
 # Dump pages 0, 19, and 4096, using pin_user_pages:
 run_test ./gup_test -ct -F 0x1 0 19 0x1000
 
-run_test ./userfaultfd anon 20 16
-run_test ./userfaultfd anon:dev 20 16
-# Hugetlb tests require source and destination huge pages. Pass in half the
-# size ($half_ufd_size_MB), which is used for *each*.
-run_test ./userfaultfd hugetlb "$half_ufd_size_MB" 32
-run_test ./userfaultfd hugetlb:dev "$half_ufd_size_MB" 32
-run_test ./userfaultfd hugetlb_shared "$half_ufd_size_MB" 32 "$mnt"/uffd-test
-rm -f "$mnt"/uffd-test
-run_test ./userfaultfd hugetlb_shared:dev "$half_ufd_size_MB" 32 "$mnt"/uffd-test
-rm -f "$mnt"/uffd-test
-run_test ./userfaultfd shmem 20 16
-run_test ./userfaultfd shmem:dev 20 16
+uffd_mods=("" ":dev" ":access_likely" ":access_likely:write_likely" ":write_likely")
+
+for mod in "${uffd_mods[@]}"; do
+	run_test ./userfaultfd anon${mod} 20 16
+	# Hugetlb tests require source and destination huge pages. Pass in half the
+	# size ($half_ufd_size_MB), which is used for *each*.
+	run_test ./userfaultfd hugetlb${mod} "$half_ufd_size_MB" 32
+	run_test ./userfaultfd hugetlb_shared${mod} "$half_ufd_size_MB" 32 "$mnt"/uffd-test
+	rm -f "$mnt"/uffd-test
+	run_test ./userfaultfd shmem${mod} 20 16
+done
 
 #cleanup
 umount "$mnt"
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 28b881523d15..763458ce1d52 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -88,6 +88,8 @@ static volatile bool test_uffdio_zeropage_eexist = true;
 static bool test_uffdio_wp = true;
 /* Whether to test uffd minor faults */
 static bool test_uffdio_minor = false;
+static bool test_access_likely;
+static bool test_write_likely;
 
 static bool map_shared;
 static int shm_fd;
@@ -550,6 +552,12 @@ static void wp_range(int ufd, __u64 start, __u64 len, bool wp)
 	/* Undo write-protect, do wakeup after that */
 	prms.mode = wp ? UFFDIO_WRITEPROTECT_MODE_WP : 0;
 
+	if (test_access_likely)
+		prms.mode |= UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY;
+
+	if (test_write_likely)
+		prms.mode |= UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY;
+
 	if (ioctl(ufd, UFFDIO_WRITEPROTECT, &prms))
 		err("clear WP failed: address=0x%"PRIx64, (uint64_t)start);
 }
@@ -563,6 +571,12 @@ static void continue_range(int ufd, __u64 start, __u64 len)
 	req.range.len = len;
 	req.mode = 0;
 
+	if (test_access_likely)
+		req.mode |= UFFDIO_CONTINUE_MODE_ACCESS_LIKELY;
+
+	if (test_write_likely)
+		req.mode |= UFFDIO_CONTINUE_MODE_WRITE_LIKELY;
+
 	if (ioctl(ufd, UFFDIO_CONTINUE, &req))
 		err("UFFDIO_CONTINUE failed for address 0x%" PRIx64,
 		    (uint64_t)start);
@@ -653,6 +667,13 @@ static int __copy_page(int ufd, unsigned long offset, bool retry)
 		uffdio_copy.mode = UFFDIO_COPY_MODE_WP;
 	else
 		uffdio_copy.mode = 0;
+
+	if (test_access_likely)
+		uffdio_copy.mode |= UFFDIO_COPY_MODE_ACCESS_LIKELY;
+
+	if (test_write_likely)
+		uffdio_copy.mode |= UFFDIO_COPY_MODE_WRITE_LIKELY;
+
 	uffdio_copy.copy = 0;
 	if (ioctl(ufd, UFFDIO_COPY, &uffdio_copy)) {
 		/* real retval in ufdio_copy.copy */
@@ -1080,6 +1101,13 @@ static int __uffdio_zeropage(int ufd, unsigned long offset, bool retry)
 	uffdio_zeropage.range.start = (unsigned long) area_dst + offset;
 	uffdio_zeropage.range.len = page_size;
 	uffdio_zeropage.mode = 0;
+
+	if (test_access_likely)
+		uffdio_zeropage.mode |= UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY;
+
+	if (test_write_likely)
+		uffdio_zeropage.mode |= UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY;
+
 	ret = ioctl(ufd, UFFDIO_ZEROPAGE, &uffdio_zeropage);
 	res = uffdio_zeropage.zeropage;
 	if (ret) {
@@ -1648,6 +1676,10 @@ static void parse_test_type_arg(const char *raw_type)
 			set_test_type(token);
 		else if (!strcmp(token, "dev"))
 			test_dev_userfaultfd = true;
+		else if (!strcmp(token, "access_likely"))
+			test_access_likely = true;
+		else if (!strcmp(token, "write_likely"))
+			test_write_likely = true;
 		else
 			err("unrecognized test mod '%s'", token);
 	}
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 1/5] userfaultfd: introduce uffd_flags
  2022-06-22 18:50 ` [PATCH v1 1/5] userfaultfd: introduce uffd_flags Nadav Amit
@ 2022-06-23 21:57   ` Peter Xu
  2022-06-23 22:04     ` Nadav Amit
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2022-06-23 21:57 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, Nadav Amit, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, Mike Rapoport, David Hildenbrand

On Wed, Jun 22, 2022 at 11:50:34AM -0700, Nadav Amit wrote:
> @@ -1891,7 +1902,7 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
>  	if (mmget_not_zero(ctx->mm)) {
>  		ret = mcopy_continue(ctx->mm, uffdio_continue.range.start,
>  				     uffdio_continue.range.len,
> -				     &ctx->mmap_changing);
> +				     &ctx->mmap_changing, 0);

Shall we consistently use either 0 or UFFD_FLAGS_NONE?  I'd go for 0
directly since that's clearer on having "nothing" as flag.

>  		mmput(ctx->mm);
>  	} else {
>  		return -ESRCH;

[...]

>  ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
> -		       unsigned long len, atomic_t *mmap_changing)
> +		       unsigned long len, atomic_t *mmap_changing,
> +		       uffd_flags_t uffd_flags)
>  {
>  	return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_ZEROPAGE,
>  			      mmap_changing, 0);

I think you agreed on passing the uffd_flags into __mcopy_atomic() (and
also below). Is it forgotten or plan changed?

Thanks,

>  }
>  
>  ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long start,
> -		       unsigned long len, atomic_t *mmap_changing)
> +		       unsigned long len, atomic_t *mmap_changing,
> +		       uffd_flags_t uffd_flags)
>  {
>  	return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_CONTINUE,
>  			      mmap_changing, 0);
>  }

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 1/5] userfaultfd: introduce uffd_flags
  2022-06-23 21:57   ` Peter Xu
@ 2022-06-23 22:04     ` Nadav Amit
  0 siblings, 0 replies; 34+ messages in thread
From: Nadav Amit @ 2022-06-23 22:04 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, Mike Rapoport, David Hildenbrand

On Jun 23, 2022, at 2:57 PM, Peter Xu <peterx@redhat.com> wrote:

> On Wed, Jun 22, 2022 at 11:50:34AM -0700, Nadav Amit wrote:
>> @@ -1891,7 +1902,7 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
>> 	if (mmget_not_zero(ctx->mm)) {
>> 		ret = mcopy_continue(ctx->mm, uffdio_continue.range.start,
>> 				     uffdio_continue.range.len,
>> -				     &ctx->mmap_changing);
>> +				     &ctx->mmap_changing, 0);
> 
> Shall we consistently use either 0 or UFFD_FLAGS_NONE?  I'd go for 0
> directly since that's clearer on having "nothing" as flag.
> 
>> mmput(ctx->mm);
>> 	} else {
>> 		return -ESRCH;
> 
> [...]
> 
>> ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
>> -		       unsigned long len, atomic_t *mmap_changing)
>> +		       unsigned long len, atomic_t *mmap_changing,
>> +		       uffd_flags_t uffd_flags)
>> {
>> 	return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_ZEROPAGE,
>> 			      mmap_changing, 0);
> 
> I think you agreed on passing the uffd_flags into __mcopy_atomic() (and
> also below). Is it forgotten or plan changed?

I addressed these issues in the following patches - I messed up slightly the
patch order.

I will address these two into patch 1.

Regards,
Nadav

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-06-22 18:50 ` [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations Nadav Amit
@ 2022-06-23 23:24   ` Peter Xu
  2022-06-23 23:35     ` Nadav Amit
  2022-07-12  6:19   ` Nadav Amit
  1 sibling, 1 reply; 34+ messages in thread
From: Peter Xu @ 2022-06-23 23:24 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, Nadav Amit, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport

On Wed, Jun 22, 2022 at 11:50:35AM -0700, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> Using a PTE on x86 with cleared access-bit (aka young-bit)
> takes ~600 cycles more than when the access bit is set. At the same
> time, setting the access-bit for memory that is not used (e.g.,
> prefetched) can introduce greater overheads, as the prefetched memory is
> reclaimed later than it should be.
> 
> Userfaultfd currently does not set the access-bit (excluding the
> huge-pages case). Arguably, it is best to let the user control whether
> the access bit should be set or not. The expected use is to request
> userfaultfd to set the access-bit when the copy/wp operation is done to
> resolve a page-fault, and not to set the access-bit when the memory is
> prefetched.
> 
> Introduce UFFDIO_[op]_ACCESS_LIKELY to enable userspace to request the
> young bit to be set.
> 
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Signed-off-by: Nadav Amit <namit@vmware.com>

Hmm.. is the hugetlb code overlooked (for both of the hints), or maybe I
missed it?  Do we need to cover them too?  Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 4/5] userfaultfd: zero access/write hints
  2022-06-22 18:50 ` [PATCH v1 4/5] userfaultfd: zero access/write hints Nadav Amit
@ 2022-06-23 23:34   ` Peter Xu
  0 siblings, 0 replies; 34+ messages in thread
From: Peter Xu @ 2022-06-23 23:34 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, Nadav Amit, David Hildenbrand, Mike Kravetz,
	Hugh Dickins, Andrew Morton, Axel Rasmussen, Mike Rapoport

On Wed, Jun 22, 2022 at 11:50:37AM -0700, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> When userfaultfd provides a zeropage in response to ioctl, it provides a
> readonly alias to the zero page. If the page is later written (which is
> the likely scenario), page-fault occurs and the page-fault allocator
> allocates a page and rewires the page-tables.
> 
> This is an expensive flow for cases in which a page is likely be written
> to. Users can use the copy ioctl to initialize zero page (by copying
> zeros), but this is also wasteful.
> 
> Allow userfaultfd users to efficiently map initialized zero-pages that
> are writable. IF UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY is provided would map
> a clear page instead of an alias to the zero page.
> 
> For consistency, introduce also UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY.
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Signed-off-by: Nadav Amit <namit@vmware.com>

Acked-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-06-23 23:24   ` Peter Xu
@ 2022-06-23 23:35     ` Nadav Amit
  2022-06-23 23:49       ` Peter Xu
  0 siblings, 1 reply; 34+ messages in thread
From: Nadav Amit @ 2022-06-23 23:35 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport

On Jun 23, 2022, at 4:24 PM, Peter Xu <peterx@redhat.com> wrote:

> ⚠ External Email
> 
> On Wed, Jun 22, 2022 at 11:50:35AM -0700, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> Using a PTE on x86 with cleared access-bit (aka young-bit)
>> takes ~600 cycles more than when the access bit is set. At the same
>> time, setting the access-bit for memory that is not used (e.g.,
>> prefetched) can introduce greater overheads, as the prefetched memory is
>> reclaimed later than it should be.
>> 
>> Userfaultfd currently does not set the access-bit (excluding the
>> huge-pages case). Arguably, it is best to let the user control whether
>> the access bit should be set or not. The expected use is to request
>> userfaultfd to set the access-bit when the copy/wp operation is done to
>> resolve a page-fault, and not to set the access-bit when the memory is
>> prefetched.
>> 
>> Introduce UFFDIO_[op]_ACCESS_LIKELY to enable userspace to request the
>> young bit to be set.
>> 
>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Axel Rasmussen <axelrasmussen@google.com>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Mike Rapoport <rppt@linux.ibm.com>
>> Signed-off-by: Nadav Amit <namit@vmware.com>
> 
> Hmm.. is the hugetlb code overlooked (for both of the hints), or maybe I
> missed it? Do we need to cover them too? Thanks,

I do not know what the value of not setting the PTE’s access/dirty when it
comes to performance. The pages won’t be swapped out, just as you wrote in
your comment in hugetlb_mcopy_atomic_pte():

        /*
         * Always mark UFFDIO_COPY page dirty; note that this may not be
         * extremely important for hugetlbfs for now since swapping is not
         * supported, but we should still be clear in that this page cannot be
         * thrown away at will, even if write bit not set.
         */
        _dst_pte = huge_pte_mkdirty(_dst_pte);
        _dst_pte = pte_mkyoung(_dst_pte);

If you want for consistency/robustness not to set dirty on read-only
entries, that’s something that I can do.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-06-23 23:35     ` Nadav Amit
@ 2022-06-23 23:49       ` Peter Xu
  2022-06-24  0:03         ` Nadav Amit
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2022-06-23 23:49 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport

On Thu, Jun 23, 2022 at 11:35:00PM +0000, Nadav Amit wrote:
> On Jun 23, 2022, at 4:24 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> > ⚠ External Email
> > 
> > On Wed, Jun 22, 2022 at 11:50:35AM -0700, Nadav Amit wrote:
> >> From: Nadav Amit <namit@vmware.com>
> >> 
> >> Using a PTE on x86 with cleared access-bit (aka young-bit)
> >> takes ~600 cycles more than when the access bit is set. At the same
> >> time, setting the access-bit for memory that is not used (e.g.,
> >> prefetched) can introduce greater overheads, as the prefetched memory is
> >> reclaimed later than it should be.
> >> 
> >> Userfaultfd currently does not set the access-bit (excluding the
> >> huge-pages case). Arguably, it is best to let the user control whether
> >> the access bit should be set or not. The expected use is to request
> >> userfaultfd to set the access-bit when the copy/wp operation is done to
> >> resolve a page-fault, and not to set the access-bit when the memory is
> >> prefetched.
> >> 
> >> Introduce UFFDIO_[op]_ACCESS_LIKELY to enable userspace to request the
> >> young bit to be set.
> >> 
> >> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> >> Cc: Hugh Dickins <hughd@google.com>
> >> Cc: Andrew Morton <akpm@linux-foundation.org>
> >> Cc: Axel Rasmussen <axelrasmussen@google.com>
> >> Cc: Peter Xu <peterx@redhat.com>
> >> Cc: David Hildenbrand <david@redhat.com>
> >> Cc: Mike Rapoport <rppt@linux.ibm.com>
> >> Signed-off-by: Nadav Amit <namit@vmware.com>
> > 
> > Hmm.. is the hugetlb code overlooked (for both of the hints), or maybe I
> > missed it? Do we need to cover them too? Thanks,
> 
> I do not know what the value of not setting the PTE’s access/dirty when it
> comes to performance. The pages won’t be swapped out, just as you wrote in
> your comment in hugetlb_mcopy_atomic_pte():
> 
>         /*
>          * Always mark UFFDIO_COPY page dirty; note that this may not be
>          * extremely important for hugetlbfs for now since swapping is not
>          * supported, but we should still be clear in that this page cannot be
>          * thrown away at will, even if write bit not set.
>          */
>         _dst_pte = huge_pte_mkdirty(_dst_pte);
>         _dst_pte = pte_mkyoung(_dst_pte);
> 
> If you want for consistency/robustness not to set dirty on read-only
> entries, that’s something that I can do.

We have more than one options here I guess.

We could have the hints not available for hugetlbfs, but then we'll both
need to add special document for hugetlbfs (when you write the man-page,
let's say) and also it'll be probably better to fail the ioctls with hints
when applying to hugetlb vmas.

Leaving them silent to hugetlbfs is another option, it just sounds weird to
me without any kind of documentation so I like this one least.

Or we could make them work for hugetlbfs too.  Not saying that swap will
work some day (but it still could?!) it just seems all consistent as you
said.

So I'd prefer the last one, but only if you agree.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-06-23 23:49       ` Peter Xu
@ 2022-06-24  0:03         ` Nadav Amit
  2022-06-24  2:05           ` Peter Xu
  0 siblings, 1 reply; 34+ messages in thread
From: Nadav Amit @ 2022-06-24  0:03 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport

On Jun 23, 2022, at 4:49 PM, Peter Xu <peterx@redhat.com> wrote:

> On Thu, Jun 23, 2022 at 11:35:00PM +0000, Nadav Amit wrote:
>> On Jun 23, 2022, at 4:24 PM, Peter Xu <peterx@redhat.com> wrote:
>> 
>>> ⚠ External Email
>>> 
>>> On Wed, Jun 22, 2022 at 11:50:35AM -0700, Nadav Amit wrote:
>>>> From: Nadav Amit <namit@vmware.com>
>>>> 
>>>> Using a PTE on x86 with cleared access-bit (aka young-bit)
>>>> takes ~600 cycles more than when the access bit is set. At the same
>>>> time, setting the access-bit for memory that is not used (e.g.,
>>>> prefetched) can introduce greater overheads, as the prefetched memory is
>>>> reclaimed later than it should be.
>>>> 
>>>> Userfaultfd currently does not set the access-bit (excluding the
>>>> huge-pages case). Arguably, it is best to let the user control whether
>>>> the access bit should be set or not. The expected use is to request
>>>> userfaultfd to set the access-bit when the copy/wp operation is done to
>>>> resolve a page-fault, and not to set the access-bit when the memory is
>>>> prefetched.
>>>> 
>>>> Introduce UFFDIO_[op]_ACCESS_LIKELY to enable userspace to request the
>>>> young bit to be set.
>>>> 
>>>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>>>> Cc: Hugh Dickins <hughd@google.com>
>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>> Cc: Axel Rasmussen <axelrasmussen@google.com>
>>>> Cc: Peter Xu <peterx@redhat.com>
>>>> Cc: David Hildenbrand <david@redhat.com>
>>>> Cc: Mike Rapoport <rppt@linux.ibm.com>
>>>> Signed-off-by: Nadav Amit <namit@vmware.com>
>>> 
>>> Hmm.. is the hugetlb code overlooked (for both of the hints), or maybe I
>>> missed it? Do we need to cover them too? Thanks,
>> 
>> I do not know what the value of not setting the PTE’s access/dirty when it
>> comes to performance. The pages won’t be swapped out, just as you wrote in
>> your comment in hugetlb_mcopy_atomic_pte():
>> 
>>        /*
>>         * Always mark UFFDIO_COPY page dirty; note that this may not be
>>         * extremely important for hugetlbfs for now since swapping is not
>>         * supported, but we should still be clear in that this page cannot be
>>         * thrown away at will, even if write bit not set.
>>         */
>>        _dst_pte = huge_pte_mkdirty(_dst_pte);
>>        _dst_pte = pte_mkyoung(_dst_pte);
>> 
>> If you want for consistency/robustness not to set dirty on read-only
>> entries, that’s something that I can do.
> 
> We have more than one options here I guess.
> 
> We could have the hints not available for hugetlbfs, but then we'll both
> need to add special document for hugetlbfs (when you write the man-page,
> let's say) and also it'll be probably better to fail the ioctls with hints
> when applying to hugetlb vmas.
> 
> Leaving them silent to hugetlbfs is another option, it just sounds weird to
> me without any kind of documentation so I like this one least.
> 
> Or we could make them work for hugetlbfs too.  Not saying that swap will
> work some day (but it still could?!) it just seems all consistent as you
> said.
> 
> So I'd prefer the last one, but only if you agree.

My take is that hints are hints. Following David’s (or was it yours?)
feedback, I fixed the description to indicate that this is merely a hint and
removed all references to dirty/access bits. The kernel therefore can ignore
the hint when it wants to or use it in any other way. I fully agree that
this gives the kernel the ability to change the behavior as needed.

Note that for write-protected 4KB zero-page (where we share the zero-page)
we always set the access-bit, regardless of the hint, because it makes
sense: the zero-page is not swappable and therefore the access-bit is set.

I think that the lesser user-facing documentation there is on how the
feature is *exactly* used by the kernel - is better from an API point of
view.

So I see no reason to fail or be forced not to set a page as young, just
because a hint was *not* provided. This would even be a regression in the
behavior. The hint is actually always respected right now, it is just that
even if you do not provide the hint, the access/dirty is set.

The only consistency I think worth thinking about is with the dirty-bit, and
I can add it if you want. Note that the access-bit (in x86) might be set
speculatively in contrast to the dirty-bit is only set atomically with a
real access. That’s the reason I think it may make sense not to set the
dirty without a hint.

Is that acceptable? Access-bit always set, dirty-bit according to hint?


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-06-24  0:03         ` Nadav Amit
@ 2022-06-24  2:05           ` Peter Xu
  2022-06-24  2:42             ` Nadav Amit
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2022-06-24  2:05 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport

On Fri, Jun 24, 2022 at 12:03:38AM +0000, Nadav Amit wrote:
> My take is that hints are hints. Following David’s (or was it yours?)
> feedback, I fixed the description to indicate that this is merely a hint and
> removed all references to dirty/access bits. The kernel therefore can ignore
> the hint when it wants to or use it in any other way. I fully agree that
> this gives the kernel the ability to change the behavior as needed.
> 
> Note that for write-protected 4KB zero-page (where we share the zero-page)
> we always set the access-bit, regardless of the hint, because it makes
> sense: the zero-page is not swappable and therefore the access-bit is set.

The zero-page example makes sense, and yeah that makes the hugetlb behavior
making more sense too.

> 
> I think that the lesser user-facing documentation there is on how the
> feature is *exactly* used by the kernel - is better from an API point of
> view.
> 
> So I see no reason to fail or be forced not to set a page as young, just
> because a hint was *not* provided. This would even be a regression in the
> behavior. The hint is actually always respected right now, it is just that
> even if you do not provide the hint, the access/dirty is set.
> 
> The only consistency I think worth thinking about is with the dirty-bit, and
> I can add it if you want. Note that the access-bit (in x86) might be set
> speculatively in contrast to the dirty-bit is only set atomically with a
> real access. That’s the reason I think it may make sense not to set the
> dirty without a hint.

Sorry to ask if this is (another) naive question: any link/help to explain
the speculative behavior on access bit?  Is it part of speculative
execution (which, iiuc, would it be reverted if the speculation failed)?

> 
> Is that acceptable? Access-bit always set, dirty-bit according to hint?

I'm still trying to digest what you said above, sorry.

Aren't both access and dirty bits need an atomic op to be set anyway?  Then
from perf pov should we simply keep setting them both too like what you did
with this version? because it seems that'll always avoid an extra pgtable
update access?

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-06-24  2:05           ` Peter Xu
@ 2022-06-24  2:42             ` Nadav Amit
  2022-06-24 21:58               ` Peter Xu
  0 siblings, 1 reply; 34+ messages in thread
From: Nadav Amit @ 2022-06-24  2:42 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport



> On Jun 23, 2022, at 7:05 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> On Fri, Jun 24, 2022 at 12:03:38AM +0000, Nadav Amit wrote:
>> My take is that hints are hints. Following David’s (or was it yours?)
>> feedback, I fixed the description to indicate that this is merely a hint and
>> removed all references to dirty/access bits. The kernel therefore can ignore
>> the hint when it wants to or use it in any other way. I fully agree that
>> this gives the kernel the ability to change the behavior as needed.
>> 
>> Note that for write-protected 4KB zero-page (where we share the zero-page)
>> we always set the access-bit, regardless of the hint, because it makes
>> sense: the zero-page is not swappable and therefore the access-bit is set.
> 
> The zero-page example makes sense, and yeah that makes the hugetlb behavior
> making more sense too.
> 
>> 
>> I think that the lesser user-facing documentation there is on how the
>> feature is *exactly* used by the kernel - is better from an API point of
>> view.
>> 
>> So I see no reason to fail or be forced not to set a page as young, just
>> because a hint was *not* provided. This would even be a regression in the
>> behavior. The hint is actually always respected right now, it is just that
>> even if you do not provide the hint, the access/dirty is set.
>> 
>> The only consistency I think worth thinking about is with the dirty-bit, and
>> I can add it if you want. Note that the access-bit (in x86) might be set
>> speculatively in contrast to the dirty-bit is only set atomically with a
>> real access. That’s the reason I think it may make sense not to set the
>> dirty without a hint.
> 
> Sorry to ask if this is (another) naive question: any link/help to explain
> the speculative behavior on access bit? Is it part of speculative
> execution (which, iiuc, would it be reverted if the speculation failed)?

Oh man, it is hard to find a reference. I made this claim it based on my
recollection (and logic).

The access-bit on Intel is set when the PTE is loaded into the TLB, so if you
allow speculative loading of the TLB, that’s what you get.

Googling shows Yu Zhao saying: "IIRC, there are also false positives, i.e.,
the accessed bit is set on entries used by speculative execution only.” [1]

Intel SDM says: "Whenever the processor uses a paging-structure entry as part
of linear-address translation, it sets the accessed flag in that entry...
Whenever there is a write to a linear address, the processor sets the dirty
flag (if it is not already set) in the paging- structure entry..."

You can argue that this indicates that the access-bit is updated
speculatively (translations can be speculative) and dirty-bit is on actual
write. But it is somewhat of a creative reading.

Googling further did not help much, but I found a relevant discussion on
RISC-V, in which they actually consider a similar behavior. [2]

If you want (and care), we can cc Dave Hansen to get a clear answer.

[1] https://lore.kernel.org/lkml/YE7Rk%2FYA1Uj7yFn2@google.com/
[2] https://lists.riscv.org/g/tech-virt-mem/topic/accessed_bit/77699883?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,1,80,77699883

> 
>> 
>> Is that acceptable? Access-bit always set, dirty-bit according to hint?
> 
> I'm still trying to digest what you said above, sorry.
> 
> Aren't both access and dirty bits need an atomic op to be set anyway? Then
> from perf pov should we simply keep setting them both too like what you did
> with this version? because it seems that'll always avoid an extra pgtable
> update access?

I guess by atomic-op you mean atomic-update by the hardware AD-assist.

I agree that if a page is written, the bits would need to be updated and
these would introduce an overhead. However, if the page cannot be written,
well, the dirty bit would never be set.

hugetlb_mcopy_atomic_pte() currently does the following:

        _dst_pte = huge_pte_mkdirty(_dst_pte);
        _dst_pte = pte_mkyoung(_dst_pte);

        if (wp_copy)
                _dst_pte = huge_pte_mkuffd_wp(_dst_pte);

Since you asked to update hugetlb_mcopy_atomic_pte(), I can offer three
options:

1. Do not set dirty if (wp_copy).
2. Do not set dirty if (wp_copy || !write_hint) 
3. Keep it as is.

I am fine with whatever would make you happy. :)


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-06-24  2:42             ` Nadav Amit
@ 2022-06-24 21:58               ` Peter Xu
  2022-06-24 22:17                 ` Peter Xu
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2022-06-24 21:58 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport

[Sorry for replying late]

On Fri, Jun 24, 2022 at 02:42:21AM +0000, Nadav Amit wrote:
> 
> 
> > On Jun 23, 2022, at 7:05 PM, Peter Xu <peterx@redhat.com> wrote:
> > 
> > On Fri, Jun 24, 2022 at 12:03:38AM +0000, Nadav Amit wrote:
> >> My take is that hints are hints. Following David’s (or was it yours?)
> >> feedback, I fixed the description to indicate that this is merely a hint and
> >> removed all references to dirty/access bits. The kernel therefore can ignore
> >> the hint when it wants to or use it in any other way. I fully agree that
> >> this gives the kernel the ability to change the behavior as needed.
> >> 
> >> Note that for write-protected 4KB zero-page (where we share the zero-page)
> >> we always set the access-bit, regardless of the hint, because it makes
> >> sense: the zero-page is not swappable and therefore the access-bit is set.
> > 
> > The zero-page example makes sense, and yeah that makes the hugetlb behavior
> > making more sense too.
> > 
> >> 
> >> I think that the lesser user-facing documentation there is on how the
> >> feature is *exactly* used by the kernel - is better from an API point of
> >> view.
> >> 
> >> So I see no reason to fail or be forced not to set a page as young, just
> >> because a hint was *not* provided. This would even be a regression in the
> >> behavior. The hint is actually always respected right now, it is just that
> >> even if you do not provide the hint, the access/dirty is set.
> >> 
> >> The only consistency I think worth thinking about is with the dirty-bit, and
> >> I can add it if you want. Note that the access-bit (in x86) might be set
> >> speculatively in contrast to the dirty-bit is only set atomically with a
> >> real access. That’s the reason I think it may make sense not to set the
> >> dirty without a hint.
> > 
> > Sorry to ask if this is (another) naive question: any link/help to explain
> > the speculative behavior on access bit? Is it part of speculative
> > execution (which, iiuc, would it be reverted if the speculation failed)?
> 
> Oh man, it is hard to find a reference. I made this claim it based on my
> recollection (and logic).
> 
> The access-bit on Intel is set when the PTE is loaded into the TLB, so if you
> allow speculative loading of the TLB, that’s what you get.
> 
> Googling shows Yu Zhao saying: "IIRC, there are also false positives, i.e.,
> the accessed bit is set on entries used by speculative execution only.” [1]
> 
> Intel SDM says: "Whenever the processor uses a paging-structure entry as part
> of linear-address translation, it sets the accessed flag in that entry...
> Whenever there is a write to a linear address, the processor sets the dirty
> flag (if it is not already set) in the paging- structure entry..."
> 
> You can argue that this indicates that the access-bit is updated
> speculatively (translations can be speculative) and dirty-bit is on actual
> write. But it is somewhat of a creative reading.
> 
> Googling further did not help much, but I found a relevant discussion on
> RISC-V, in which they actually consider a similar behavior. [2]
> 
> If you want (and care), we can cc Dave Hansen to get a clear answer.
> 
> [1] https://lore.kernel.org/lkml/YE7Rk%2FYA1Uj7yFn2@google.com/
> [2] https://lists.riscv.org/g/tech-virt-mem/topic/accessed_bit/77699883?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,1,80,77699883

I thought even writes can be speculatively executed too?  Though I think
when the speculation was proved wrong the write needs to be reverted along
with making sure D bit cleared if it was cleared before the speculative
operation.

So I think I get you if you meant the access bit may not be reverted even
if we hit a speculative failure (though without solid proofs, afaict..).
IOW we could have false positive access bits set even if not accessed, but
not to D bits which should be accurate.

> 
> > 
> >> 
> >> Is that acceptable? Access-bit always set, dirty-bit according to hint?
> > 
> > I'm still trying to digest what you said above, sorry.
> > 
> > Aren't both access and dirty bits need an atomic op to be set anyway? Then
> > from perf pov should we simply keep setting them both too like what you did
> > with this version? because it seems that'll always avoid an extra pgtable
> > update access?
> 
> I guess by atomic-op you mean atomic-update by the hardware AD-assist.

Yes.

Btw, since I looked at the SDM as you quoted I think that may not strictly
be like an atomic op from processor pov, I guess, since there's a NOTE:

  The accesses used by the processor to set these flags may or may not be
  exposed to the processor’s self-modifying code detection logic. If the
  processor is executing code from the same memory area that is being used
  for the paging structures, the setting of these flags may or may not
  result in an immediate change to the executing code stream.

So I read it as: even if it'll be an atomic, the op can be postponed.

> 
> I agree that if a page is written, the bits would need to be updated and
> these would introduce an overhead. However, if the page cannot be written,
> well, the dirty bit would never be set.

Ok I see what you mean now.  But honestly, I don't think it's anything
related to the speculative access bit behavior described above.. or is it?

> 
> hugetlb_mcopy_atomic_pte() currently does the following:
> 
>         _dst_pte = huge_pte_mkdirty(_dst_pte);
>         _dst_pte = pte_mkyoung(_dst_pte);
> 
>         if (wp_copy)
>                 _dst_pte = huge_pte_mkuffd_wp(_dst_pte);
> 
> Since you asked to update hugetlb_mcopy_atomic_pte(), I can offer three
> options:
> 
> 1. Do not set dirty if (wp_copy).
> 2. Do not set dirty if (wp_copy || !write_hint) 
> 3. Keep it as is.

AFAICT you already go somewhere at least not (3) with non-hugetlb pages in
current series.. because dirty bit is not always set already for them, so
I'd say we'd make them match?  Hugetlbfs shouldn't be special in this
aspect, IMHO.

Said that, I think it doesn't really necessary need to be that complex,
since make_huge_pte() already sets dirty bit when "writable=1", so IIUC
what you need to do is simply make sure dirty bit set when write_hint=1.

Does it sounds correct to you?

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-06-24 21:58               ` Peter Xu
@ 2022-06-24 22:17                 ` Peter Xu
  2022-06-25  7:49                   ` Nadav Amit
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2022-06-24 22:17 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport

On Fri, Jun 24, 2022 at 05:58:17PM -0400, Peter Xu wrote:
> [Sorry for replying late]
> 
> On Fri, Jun 24, 2022 at 02:42:21AM +0000, Nadav Amit wrote:
> > 
> > 
> > > On Jun 23, 2022, at 7:05 PM, Peter Xu <peterx@redhat.com> wrote:
> > > 
> > > On Fri, Jun 24, 2022 at 12:03:38AM +0000, Nadav Amit wrote:
> > >> My take is that hints are hints. Following David’s (or was it yours?)
> > >> feedback, I fixed the description to indicate that this is merely a hint and
> > >> removed all references to dirty/access bits. The kernel therefore can ignore
> > >> the hint when it wants to or use it in any other way. I fully agree that
> > >> this gives the kernel the ability to change the behavior as needed.
> > >> 
> > >> Note that for write-protected 4KB zero-page (where we share the zero-page)
> > >> we always set the access-bit, regardless of the hint, because it makes
> > >> sense: the zero-page is not swappable and therefore the access-bit is set.
> > > 
> > > The zero-page example makes sense, and yeah that makes the hugetlb behavior
> > > making more sense too.
> > > 
> > >> 
> > >> I think that the lesser user-facing documentation there is on how the
> > >> feature is *exactly* used by the kernel - is better from an API point of
> > >> view.
> > >> 
> > >> So I see no reason to fail or be forced not to set a page as young, just
> > >> because a hint was *not* provided. This would even be a regression in the
> > >> behavior. The hint is actually always respected right now, it is just that
> > >> even if you do not provide the hint, the access/dirty is set.
> > >> 
> > >> The only consistency I think worth thinking about is with the dirty-bit, and
> > >> I can add it if you want. Note that the access-bit (in x86) might be set
> > >> speculatively in contrast to the dirty-bit is only set atomically with a
> > >> real access. That’s the reason I think it may make sense not to set the
> > >> dirty without a hint.
> > > 
> > > Sorry to ask if this is (another) naive question: any link/help to explain
> > > the speculative behavior on access bit? Is it part of speculative
> > > execution (which, iiuc, would it be reverted if the speculation failed)?
> > 
> > Oh man, it is hard to find a reference. I made this claim it based on my
> > recollection (and logic).
> > 
> > The access-bit on Intel is set when the PTE is loaded into the TLB, so if you
> > allow speculative loading of the TLB, that’s what you get.
> > 
> > Googling shows Yu Zhao saying: "IIRC, there are also false positives, i.e.,
> > the accessed bit is set on entries used by speculative execution only.” [1]
> > 
> > Intel SDM says: "Whenever the processor uses a paging-structure entry as part
> > of linear-address translation, it sets the accessed flag in that entry...
> > Whenever there is a write to a linear address, the processor sets the dirty
> > flag (if it is not already set) in the paging- structure entry..."
> > 
> > You can argue that this indicates that the access-bit is updated
> > speculatively (translations can be speculative) and dirty-bit is on actual
> > write. But it is somewhat of a creative reading.
> > 
> > Googling further did not help much, but I found a relevant discussion on
> > RISC-V, in which they actually consider a similar behavior. [2]
> > 
> > If you want (and care), we can cc Dave Hansen to get a clear answer.
> > 
> > [1] https://lore.kernel.org/lkml/YE7Rk%2FYA1Uj7yFn2@google.com/
> > [2] https://lists.riscv.org/g/tech-virt-mem/topic/accessed_bit/77699883?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,1,80,77699883
> 
> I thought even writes can be speculatively executed too?  Though I think
> when the speculation was proved wrong the write needs to be reverted along
> with making sure D bit cleared if it was cleared before the speculative
> operation.
> 
> So I think I get you if you meant the access bit may not be reverted even
> if we hit a speculative failure (though without solid proofs, afaict..).
> IOW we could have false positive access bits set even if not accessed, but
> not to D bits which should be accurate.
> 
> > 
> > > 
> > >> 
> > >> Is that acceptable? Access-bit always set, dirty-bit according to hint?
> > > 
> > > I'm still trying to digest what you said above, sorry.
> > > 
> > > Aren't both access and dirty bits need an atomic op to be set anyway? Then
> > > from perf pov should we simply keep setting them both too like what you did
> > > with this version? because it seems that'll always avoid an extra pgtable
> > > update access?
> > 
> > I guess by atomic-op you mean atomic-update by the hardware AD-assist.
> 
> Yes.
> 
> Btw, since I looked at the SDM as you quoted I think that may not strictly
> be like an atomic op from processor pov, I guess, since there's a NOTE:
> 
>   The accesses used by the processor to set these flags may or may not be
>   exposed to the processor’s self-modifying code detection logic. If the
>   processor is executing code from the same memory area that is being used
>   for the paging structures, the setting of these flags may or may not
>   result in an immediate change to the executing code stream.
> 
> So I read it as: even if it'll be an atomic, the op can be postponed.
> 
> > 
> > I agree that if a page is written, the bits would need to be updated and
> > these would introduce an overhead. However, if the page cannot be written,
> > well, the dirty bit would never be set.
> 
> Ok I see what you mean now.  But honestly, I don't think it's anything
> related to the speculative access bit behavior described above.. or is it?
> 
> > 
> > hugetlb_mcopy_atomic_pte() currently does the following:
> > 
> >         _dst_pte = huge_pte_mkdirty(_dst_pte);
> >         _dst_pte = pte_mkyoung(_dst_pte);
> > 
> >         if (wp_copy)
> >                 _dst_pte = huge_pte_mkuffd_wp(_dst_pte);
> > 
> > Since you asked to update hugetlb_mcopy_atomic_pte(), I can offer three
> > options:
> > 
> > 1. Do not set dirty if (wp_copy).
> > 2. Do not set dirty if (wp_copy || !write_hint) 
> > 3. Keep it as is.
> 
> AFAICT you already go somewhere at least not (3) with non-hugetlb pages in
> current series.. because dirty bit is not always set already for them, so
> I'd say we'd make them match?  Hugetlbfs shouldn't be special in this
> aspect, IMHO.
> 
> Said that, I think it doesn't really necessary need to be that complex,
> since make_huge_pte() already sets dirty bit when "writable=1", so IIUC
> what you need to do is simply make sure dirty bit set when write_hint=1.
> 
> Does it sounds correct to you?

Hmm, hold on...  I failed to figure out how that write-likely hint could
help us for either huge or non-huge pages, since:

  (1) Old code always set dirty, so no perf degrade anyway with/without the
      hint

  (2) If we want to rework dirty bit (which I'm totally fine with..), then
      we don't apply it when we shouldn't, and afaict we should set D bit
      whenever we should...  if the user assumes this page is likely to be
      written but made it read-only, say, with UFFDIO_COPY(wp_mode=1),
      setting D bit will not help, instead, the user should simply use an
      UFFDIO_COPY(wp_mode=0) then the dirty will be set with write=1..

It'll be helpful but only helpful for UFFDIO_ZEROCOPY because it avoids one
COW.  But that seems to be it.

In short: I'm wondering whether we only really need the ACCESS_LIKELY hint
as you proposed earlier.  We may want UFFDIO_ZEROPAGE_MODE_ALLOCATE
separately, but keep that only for zeropage op (and it shouldn't really be
called WRITE_LIKELY)?  Or did I miss something?

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-06-24 22:17                 ` Peter Xu
@ 2022-06-25  7:49                   ` Nadav Amit
  2022-06-27 13:12                     ` Peter Xu
  0 siblings, 1 reply; 34+ messages in thread
From: Nadav Amit @ 2022-06-25  7:49 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport



> On Jun 24, 2022, at 3:17 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> On Fri, Jun 24, 2022 at 05:58:17PM -0400, Peter Xu wrote:
>> [Sorry for replying late]
>> 
>> Said that, I think it doesn't really necessary need to be that complex,
>> since make_huge_pte() already sets dirty bit when "writable=1", so IIUC
>> what you need to do is simply make sure dirty bit set when write_hint=1.
>> 
>> Does it sounds correct to you?
> 
> Hmm, hold on...  I failed to figure out how that write-likely hint could
> help us for either huge or non-huge pages, since:
> 
>  (1) Old code always set dirty, so no perf degrade anyway with/without the
>      hint
> 
>  (2) If we want to rework dirty bit (which I'm totally fine with..), then
>      we don't apply it when we shouldn't, and afaict we should set D bit
>      whenever we should...  if the user assumes this page is likely to be
>      written but made it read-only, say, with UFFDIO_COPY(wp_mode=1),
>      setting D bit will not help, instead, the user should simply use an
>      UFFDIO_COPY(wp_mode=0) then the dirty will be set with write=1..
> 
> It'll be helpful but only helpful for UFFDIO_ZEROCOPY because it avoids one
> COW.  But that seems to be it.
> 
> In short: I'm wondering whether we only really need the ACCESS_LIKELY hint
> as you proposed earlier.  We may want UFFDIO_ZEROPAGE_MODE_ALLOCATE
> separately, but keep that only for zeropage op (and it shouldn't really be
> called WRITE_LIKELY)?  Or did I miss something?

Let’s see if I get you correctly. I am not sure whether we had this
discussion before.

We are talking about a scenario in which WP=0. You argue that if the page
is already set as dirty, what is the benefit of not setting the dirty-bit,
right?

So first, IIUC, there are cases in which the page would not be set as
dirty, e.g., UFFDIO_CONTINUE. [ I am admittedly not too familiar with this
use-case, so I say it based on the comments. ]

Second, even if the page is dirty (e.g., following UFFDIO_COPY), but it
is not written by the user after UFFDI_COPY, marking the PTE as dirty
when it is mapped would induce overhead, as we discussed before, since
if/when the PTE is unmapped, TLB flush batching might not be possible.

So I don’t think there is a problem in having WRITE_LIKELY hint. Moreover,
I would reiterate my position (which you guys convinced me in!) that
having hints that indicate what the user does (WRITE_LIKELY) is a better
API than something that indicates directly what the kernel should do
(e.g., UFFDIO_ZEROPAGE_MODE_ALLOCATE).


But this discussion made me think that there are two somewhat related
matters that we may want to address:

1. mwriteprotect_range() should use MM_CP_TRY_CHANGE_WRITABLE when !wp
to proactively make entries writable and save .

2. The WRITE_LIKELY hint should be propagated from mwriteprotect_range()
to change_pte_range() through cp_flags, and the entry should be set
dirty accordingly.

With that, and with leaving hugetlbfs as it is (I meant before - leaving
it as it is in the patchset that I sent), would it be ok with you?


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-06-25  7:49                   ` Nadav Amit
@ 2022-06-27 13:12                     ` Peter Xu
  2022-06-27 13:27                       ` David Hildenbrand
  2022-06-27 23:37                       ` Nadav Amit
  0 siblings, 2 replies; 34+ messages in thread
From: Peter Xu @ 2022-06-27 13:12 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport

On Sat, Jun 25, 2022 at 07:49:54AM +0000, Nadav Amit wrote:
> 
> 
> > On Jun 24, 2022, at 3:17 PM, Peter Xu <peterx@redhat.com> wrote:
> > 
> > On Fri, Jun 24, 2022 at 05:58:17PM -0400, Peter Xu wrote:
> >> [Sorry for replying late]
> >> 
> >> Said that, I think it doesn't really necessary need to be that complex,
> >> since make_huge_pte() already sets dirty bit when "writable=1", so IIUC
> >> what you need to do is simply make sure dirty bit set when write_hint=1.
> >> 
> >> Does it sounds correct to you?
> > 
> > Hmm, hold on...  I failed to figure out how that write-likely hint could
> > help us for either huge or non-huge pages, since:
> > 
> >  (1) Old code always set dirty, so no perf degrade anyway with/without the
> >      hint
> > 
> >  (2) If we want to rework dirty bit (which I'm totally fine with..), then
> >      we don't apply it when we shouldn't, and afaict we should set D bit
> >      whenever we should...  if the user assumes this page is likely to be
> >      written but made it read-only, say, with UFFDIO_COPY(wp_mode=1),
> >      setting D bit will not help, instead, the user should simply use an
> >      UFFDIO_COPY(wp_mode=0) then the dirty will be set with write=1..
> > 
> > It'll be helpful but only helpful for UFFDIO_ZEROCOPY because it avoids one
> > COW.  But that seems to be it.
> > 
> > In short: I'm wondering whether we only really need the ACCESS_LIKELY hint
> > as you proposed earlier.  We may want UFFDIO_ZEROPAGE_MODE_ALLOCATE
> > separately, but keep that only for zeropage op (and it shouldn't really be
> > called WRITE_LIKELY)?  Or did I miss something?
> 
> Let’s see if I get you correctly. I am not sure whether we had this
> discussion before.
> 
> We are talking about a scenario in which WP=0. You argue that if the page
> is already set as dirty, what is the benefit of not setting the dirty-bit,
> right?
> 
> So first, IIUC, there are cases in which the page would not be set as
> dirty, e.g., UFFDIO_CONTINUE. [ I am admittedly not too familiar with this
> use-case, so I say it based on the comments. ]
> 
> Second, even if the page is dirty (e.g., following UFFDIO_COPY), but it
> is not written by the user after UFFDI_COPY, marking the PTE as dirty
> when it is mapped would induce overhead, as we discussed before, since
> if/when the PTE is unmapped, TLB flush batching might not be possible.

I'd hope we don't make an interface design just to service that purpose of
when write=0 and dirty=1 use case that is internal to the kernel so far,
and I still think it's the tlb flush code to change.. or do we have other
use case for this WRITE_LIKELY hint?

For UFFDIO_CONTINUE, if we want to make things clear on dirty bit, then
IMHO for UFFDIO_CONTINUE the right place for the dirty process is where the
user writes to the page in the other mapping, where PageDirty() will start
to be true already even if the pte that to be CONTINUEd will have dirty=0
in the pte entry.  From that pov I still don't see why we need to grant the
user on the dirty bit control, no matter with a hint only, or explicit.

> 
> So I don’t think there is a problem in having WRITE_LIKELY hint. Moreover,
> I would reiterate my position (which you guys convinced me in!)

David convinced you I think :)

> that having hints that indicate what the user does (WRITE_LIKELY) is a
> better API than something that indicates directly what the kernel should
> do (e.g., UFFDIO_ZEROPAGE_MODE_ALLOCATE).

The hint idea sounds good to me, it's just that we actually have two steps
here:

  (1) We think providing user the control of dirty bit makes sense, then,
  (2) We think the flag should be a hint not explicit "set dirty bit"

I agree with (2) in this case if (1) is applicable.  And now I think I'm
questioning myself on (1).

Fundamentally, access bit has more meaningful context (0 means cold, 1
means hot), for dirty it's really more a perf thing to me (when clear,
it'll take extra cycles to set it when memory write happens to it; being
clear _may_ help only for the tlb flush example you mentioned but I'm not
fully convinced that's correct).

Maybe with the to be proposed RFC patch for tlb flush we can know whether
that should be something we can rely on.  It'll add more dependency on this
work which I'm sorry to say.  It's just that IMHO we should think carefully
for the write-hint because this is a solid new uABI we're talking about.

The other option is we can introduce the access hint first and think more
on the dirty one (we can always add it when proper).  What do you think?
Also, David please chim in anytime if I missed the whole point when you
proposed the idea.

> 
> But this discussion made me think that there are two somewhat related
> matters that we may want to address:
> 
> 1. mwriteprotect_range() should use MM_CP_TRY_CHANGE_WRITABLE when !wp
> to proactively make entries writable and save .

I'm not sure I'm right here, but I think David's patch should have covered
that case?  The new helper only checks pte_uffd_wp() based on my memory,
and when resolving page faults uffd-wp bit should have been gone, so it
should be treated the same as normal ptes.

> 
> 2. The WRITE_LIKELY hint should be propagated from mwriteprotect_range()
> to change_pte_range() through cp_flags, and the entry should be set
> dirty accordingly.

Sounds correct.  Though again I hope we can think more thoroughly on
whether we need the write-hint first.  Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-06-27 13:12                     ` Peter Xu
@ 2022-06-27 13:27                       ` David Hildenbrand
  2022-06-27 14:59                         ` Peter Xu
  2022-06-27 23:37                       ` Nadav Amit
  1 sibling, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2022-06-27 13:27 UTC (permalink / raw)
  To: Peter Xu, Nadav Amit
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, Mike Rapoport

> Fundamentally, access bit has more meaningful context (0 means cold, 1
> means hot), for dirty it's really more a perf thing to me (when clear,
> it'll take extra cycles to set it when memory write happens to it; being
> clear _may_ help only for the tlb flush example you mentioned but I'm not
> fully convinced that's correct).
> 
> Maybe with the to be proposed RFC patch for tlb flush we can know whether
> that should be something we can rely on.  It'll add more dependency on this
> work which I'm sorry to say.  It's just that IMHO we should think carefully
> for the write-hint because this is a solid new uABI we're talking about.
> 
> The other option is we can introduce the access hint first and think more
> on the dirty one (we can always add it when proper).  What do you think?
> Also, David please chim in anytime if I missed the whole point when you
> proposed the idea.

Well, if we have an ABI that places pages without further information
*why* we're doing that makes us having to guess what to do or what not
to do, and I think the zeropage placement is a prime example for that.
Personally, I think communicating the intention in forms of hints is
something that doesn't leak implementation details into an ABI.

So "no planned access" vs. "read_likely" vs. "write_likely" conceptually
makes sense to me.

As I raised previously, *if* we want to let the user affect the dirty
bit setting (1) is then a pure implementation detail. Or whatever else
we might want to do.

But I also want to raise awareness that architectures that don't have a
hw-set dirty bit have to use page faults to mimic dirty tracking. IIRC,
s390x is a prime example for that: pte_mkclean() sets the WP bit and
marks the page dirty from the write fault. So it's even more expensive
than on other architectures.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-06-27 13:27                       ` David Hildenbrand
@ 2022-06-27 14:59                         ` Peter Xu
  0 siblings, 0 replies; 34+ messages in thread
From: Peter Xu @ 2022-06-27 14:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nadav Amit, Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, Mike Rapoport

On Mon, Jun 27, 2022 at 03:27:49PM +0200, David Hildenbrand wrote:
> > Fundamentally, access bit has more meaningful context (0 means cold, 1
> > means hot), for dirty it's really more a perf thing to me (when clear,
> > it'll take extra cycles to set it when memory write happens to it; being
> > clear _may_ help only for the tlb flush example you mentioned but I'm not
> > fully convinced that's correct).
> > 
> > Maybe with the to be proposed RFC patch for tlb flush we can know whether
> > that should be something we can rely on.  It'll add more dependency on this
> > work which I'm sorry to say.  It's just that IMHO we should think carefully
> > for the write-hint because this is a solid new uABI we're talking about.
> > 
> > The other option is we can introduce the access hint first and think more
> > on the dirty one (we can always add it when proper).  What do you think?
> > Also, David please chim in anytime if I missed the whole point when you
> > proposed the idea.
> 
> Well, if we have an ABI that places pages without further information
> *why* we're doing that makes us having to guess what to do or what not
> to do, and I think the zeropage placement is a prime example for that.
> Personally, I think communicating the intention in forms of hints is
> something that doesn't leak implementation details into an ABI.
> 
> So "no planned access" vs. "read_likely" vs. "write_likely" conceptually
> makes sense to me.
> 
> As I raised previously, *if* we want to let the user affect the dirty
> bit setting (1) is then a pure implementation detail. Or whatever else
> we might want to do.
> 
> But I also want to raise awareness that architectures that don't have a
> hw-set dirty bit have to use page faults to mimic dirty tracking. IIRC,
> s390x is a prime example for that: pte_mkclean() sets the WP bit and
> marks the page dirty from the write fault. So it's even more expensive
> than on other architectures.

The last input seems to be supporting that we'd better even have redundant
dirty bit in ptes rather than accidentally not having it, even when both
are safe.

So to me WRITE_LIKELY was still mostly around dirty bit besides the
ZEROPAGE case.  I don't have a strong opinion on how we should name that
flag, if we want to insist on WRITE_LIKELY but only on ZEROPAGE I think
it's fine, it's just that if I'm the user app I prefer making sure the page
is allocated after UFFDIO_ZEROPAGE returned, rather than only providing a
hint and then the kernels says "we'll do something but nothing is
guaranteed".

I also fully agree we don't want to expose impl details but my question was
more on whether we want that hint at all as a generic one, and in what case
that hint helps outside ZEROPAGE.  For "it can be accessed" hint, I have an
answer and it seems to apply to most of the uffd ioctls; but not so generic
for a "it can be written" hint.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-06-27 13:12                     ` Peter Xu
  2022-06-27 13:27                       ` David Hildenbrand
@ 2022-06-27 23:37                       ` Nadav Amit
  2022-06-28 10:55                         ` David Hildenbrand
  2022-06-28 19:15                         ` Peter Xu
  1 sibling, 2 replies; 34+ messages in thread
From: Nadav Amit @ 2022-06-27 23:37 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport, Dave Hansen

[ +Dave Hansen to say how wrong I am ] 

> On Jun 27, 2022, at 6:12 AM, Peter Xu <peterx@redhat.com> wrote:
> 
> ⚠ External Email
> 
> On Sat, Jun 25, 2022 at 07:49:54AM +0000, Nadav Amit wrote:
>> 
>> 
>>> On Jun 24, 2022, at 3:17 PM, Peter Xu <peterx@redhat.com> wrote:
>>> 
>>> On Fri, Jun 24, 2022 at 05:58:17PM -0400, Peter Xu wrote:
>>>> [Sorry for replying late]
>>>> 
>>>> Said that, I think it doesn't really necessary need to be that complex,
>>>> since make_huge_pte() already sets dirty bit when "writable=1", so IIUC
>>>> what you need to do is simply make sure dirty bit set when write_hint=1.
>>>> 
>>>> Does it sounds correct to you?
>>> 
>>> Hmm, hold on...  I failed to figure out how that write-likely hint could
>>> help us for either huge or non-huge pages, since:
>>> 
>>> (1) Old code always set dirty, so no perf degrade anyway with/without the
>>>     hint
>>> 
>>> (2) If we want to rework dirty bit (which I'm totally fine with..), then
>>>     we don't apply it when we shouldn't, and afaict we should set D bit
>>>     whenever we should...  if the user assumes this page is likely to be
>>>     written but made it read-only, say, with UFFDIO_COPY(wp_mode=1),
>>>     setting D bit will not help, instead, the user should simply use an
>>>     UFFDIO_COPY(wp_mode=0) then the dirty will be set with write=1..
>>> 
>>> It'll be helpful but only helpful for UFFDIO_ZEROCOPY because it avoids one
>>> COW.  But that seems to be it.
>>> 
>>> In short: I'm wondering whether we only really need the ACCESS_LIKELY hint
>>> as you proposed earlier.  We may want UFFDIO_ZEROPAGE_MODE_ALLOCATE
>>> separately, but keep that only for zeropage op (and it shouldn't really be
>>> called WRITE_LIKELY)?  Or did I miss something?
>> 
>> Let’s see if I get you correctly. I am not sure whether we had this
>> discussion before.
>> 
>> We are talking about a scenario in which WP=0. You argue that if the page
>> is already set as dirty, what is the benefit of not setting the dirty-bit,
>> right?
>> 
>> So first, IIUC, there are cases in which the page would not be set as
>> dirty, e.g., UFFDIO_CONTINUE. [ I am admittedly not too familiar with this
>> use-case, so I say it based on the comments. ]
>> 
>> Second, even if the page is dirty (e.g., following UFFDIO_COPY), but it
>> is not written by the user after UFFDI_COPY, marking the PTE as dirty
>> when it is mapped would induce overhead, as we discussed before, since
>> if/when the PTE is unmapped, TLB flush batching might not be possible.
> 
> I'd hope we don't make an interface design just to service that purpose of
> when write=0 and dirty=1 use case that is internal to the kernel so far,
> and I still think it's the tlb flush code to change.. or do we have other
> use case for this WRITE_LIKELY hint?
> 
> For UFFDIO_CONTINUE, if we want to make things clear on dirty bit, then
> IMHO for UFFDIO_CONTINUE the right place for the dirty process is where the
> user writes to the page in the other mapping, where PageDirty() will start
> to be true already even if the pte that to be CONTINUEd will have dirty=0
> in the pte entry.  From that pov I still don't see why we need to grant the
> user on the dirty bit control, no matter with a hint only, or explicit.
> 
>> 
>> So I don’t think there is a problem in having WRITE_LIKELY hint. Moreover,
>> I would reiterate my position (which you guys convinced me in!)
> 
> David convinced you I think :)
> 
>> that having hints that indicate what the user does (WRITE_LIKELY) is a
>> better API than something that indicates directly what the kernel should
>> do (e.g., UFFDIO_ZEROPAGE_MODE_ALLOCATE).
> 
> The hint idea sounds good to me, it's just that we actually have two steps
> here:
> 
>  (1) We think providing user the control of dirty bit makes sense, then,
>  (2) We think the flag should be a hint not explicit "set dirty bit"
> 
> I agree with (2) in this case if (1) is applicable.  And now I think I'm
> questioning myself on (1).
> 
> Fundamentally, access bit has more meaningful context (0 means cold, 1
> means hot), for dirty it's really more a perf thing to me (when clear,
> it'll take extra cycles to set it when memory write happens to it; being
> clear _may_ help only for the tlb flush example you mentioned but I'm not
> fully convinced that's correct).

I am not sure we understand each other. I think the benefit of not setting
a dirty-bit when a page is not actually written is fundamental, and has
inherit performance benefit.

When I did x86’s pte_flags_need_flush(), I was defensive, but there is a
basic optimization that is possible to avoid a TLB flush on non-dirty
writable PTEs.

In x86, consider a situation in which you use ptep_modify_prot_start()
to remove a PTE and load its old value using xchg. (A similar case happens
on reclaim). Assume you want to write-protect the entry.

If the PTE is non-dirty then you should be able to avoid a flush, even if
the PTE is writable. In x86, a write and the change of the dirty-bit are
performed both atomically. Therefore, if the dirty-bit on the old PTE was
clear, you can avoid a TLB flush.

Besides the benefit of avoiding a TLB flush, there is also the benefit
of having more precise dirty tracking. You assume UFFDIO_CONTINUE will be
preceded by memory write to the shared memory, but that does not have to
be the case. Similarly, if in the future userfaultfd would also support
memory-backed private mappings, that does not have to be the case either.

Putting all of the above aside, there is a bug in my code, but this
bug also points why dirty should not be set unconditionally. If someone
uses SOFT_DIRTY with userfaultfd, then marking the PTE as dirty (and
soft-dirty) might be misleading, causing unnecessary userspace writeback
of memory.

So I do need to fix my code so it would not write-unprotect memory if
soft-dirty is enabled and UFFD_FLAGS_WRITE_LIKELY is not provided. But
I think it emphasizes the benefit of having UFFD_FLAGS_WRITE_LIKELY.

> 
> Maybe with the to be proposed RFC patch for tlb flush we can know whether
> that should be something we can rely on.  It'll add more dependency on this
> work which I'm sorry to say.  It's just that IMHO we should think carefully
> for the write-hint because this is a solid new uABI we're talking about.
> 
> The other option is we can introduce the access hint first and think more
> on the dirty one (we can always add it when proper).  What do you think?
> Also, David please chim in anytime if I missed the whole point when you
> proposed the idea.
> 
>> 
>> But this discussion made me think that there are two somewhat related
>> matters that we may want to address:
>> 
>> 1. mwriteprotect_range() should use MM_CP_TRY_CHANGE_WRITABLE when !wp
>> to proactively make entries writable and save .
> 
> I'm not sure I'm right here, but I think David's patch should have covered
> that case?  The new helper only checks pte_uffd_wp() based on my memory,
> and when resolving page faults uffd-wp bit should have been gone, so it
> should be treated the same as normal ptes.

Let’s see we get to the same page:

mwriteprotect_range() does:

        change_protection(&tlb, dst_vma, start, start + len, newprot,
                          enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE)

As you see no use of MM_CP_TRY_CHANGE_WRITABLE.

And then change_pte_range() does:

                        if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
                            !pte_write(ptent) &&
                            can_change_pte_writable(vma, addr, ptent))
                                ptent = pte_mkwrite(ptent);

If we do not provide MM_CP_TRY_CHANGE_WRITABLE, the PTE would not be
writable.

Now for the record, we may want an additional optimization, so if a
fault handler is woken because a PTE become writable, we do not retry
the page fault (since the PTE is already writable). It is a small change,
but let’s see first we agree on the patches so far…


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-06-27 23:37                       ` Nadav Amit
@ 2022-06-28 10:55                         ` David Hildenbrand
  2022-06-28 19:15                         ` Peter Xu
  1 sibling, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2022-06-28 10:55 UTC (permalink / raw)
  To: Nadav Amit, Peter Xu
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, Mike Rapoport, Dave Hansen

On 28.06.22 01:37, Nadav Amit wrote:
> [ +Dave Hansen to say how wrong I am ] 
> 
>> On Jun 27, 2022, at 6:12 AM, Peter Xu <peterx@redhat.com> wrote:
>>
>> ⚠ External Email
>>
>> On Sat, Jun 25, 2022 at 07:49:54AM +0000, Nadav Amit wrote:
>>>
>>>
>>>> On Jun 24, 2022, at 3:17 PM, Peter Xu <peterx@redhat.com> wrote:
>>>>
>>>> On Fri, Jun 24, 2022 at 05:58:17PM -0400, Peter Xu wrote:
>>>>> [Sorry for replying late]
>>>>>
>>>>> Said that, I think it doesn't really necessary need to be that complex,
>>>>> since make_huge_pte() already sets dirty bit when "writable=1", so IIUC
>>>>> what you need to do is simply make sure dirty bit set when write_hint=1.
>>>>>
>>>>> Does it sounds correct to you?
>>>>
>>>> Hmm, hold on...  I failed to figure out how that write-likely hint could
>>>> help us for either huge or non-huge pages, since:
>>>>
>>>> (1) Old code always set dirty, so no perf degrade anyway with/without the
>>>>     hint
>>>>
>>>> (2) If we want to rework dirty bit (which I'm totally fine with..), then
>>>>     we don't apply it when we shouldn't, and afaict we should set D bit
>>>>     whenever we should...  if the user assumes this page is likely to be
>>>>     written but made it read-only, say, with UFFDIO_COPY(wp_mode=1),
>>>>     setting D bit will not help, instead, the user should simply use an
>>>>     UFFDIO_COPY(wp_mode=0) then the dirty will be set with write=1..
>>>>
>>>> It'll be helpful but only helpful for UFFDIO_ZEROCOPY because it avoids one
>>>> COW.  But that seems to be it.
>>>>
>>>> In short: I'm wondering whether we only really need the ACCESS_LIKELY hint
>>>> as you proposed earlier.  We may want UFFDIO_ZEROPAGE_MODE_ALLOCATE
>>>> separately, but keep that only for zeropage op (and it shouldn't really be
>>>> called WRITE_LIKELY)?  Or did I miss something?
>>>
>>> Let’s see if I get you correctly. I am not sure whether we had this
>>> discussion before.
>>>
>>> We are talking about a scenario in which WP=0. You argue that if the page
>>> is already set as dirty, what is the benefit of not setting the dirty-bit,
>>> right?
>>>
>>> So first, IIUC, there are cases in which the page would not be set as
>>> dirty, e.g., UFFDIO_CONTINUE. [ I am admittedly not too familiar with this
>>> use-case, so I say it based on the comments. ]
>>>
>>> Second, even if the page is dirty (e.g., following UFFDIO_COPY), but it
>>> is not written by the user after UFFDI_COPY, marking the PTE as dirty
>>> when it is mapped would induce overhead, as we discussed before, since
>>> if/when the PTE is unmapped, TLB flush batching might not be possible.
>>
>> I'd hope we don't make an interface design just to service that purpose of
>> when write=0 and dirty=1 use case that is internal to the kernel so far,
>> and I still think it's the tlb flush code to change.. or do we have other
>> use case for this WRITE_LIKELY hint?
>>
>> For UFFDIO_CONTINUE, if we want to make things clear on dirty bit, then
>> IMHO for UFFDIO_CONTINUE the right place for the dirty process is where the
>> user writes to the page in the other mapping, where PageDirty() will start
>> to be true already even if the pte that to be CONTINUEd will have dirty=0
>> in the pte entry.  From that pov I still don't see why we need to grant the
>> user on the dirty bit control, no matter with a hint only, or explicit.
>>
>>>
>>> So I don’t think there is a problem in having WRITE_LIKELY hint. Moreover,
>>> I would reiterate my position (which you guys convinced me in!)
>>
>> David convinced you I think :)
>>
>>> that having hints that indicate what the user does (WRITE_LIKELY) is a
>>> better API than something that indicates directly what the kernel should
>>> do (e.g., UFFDIO_ZEROPAGE_MODE_ALLOCATE).
>>
>> The hint idea sounds good to me, it's just that we actually have two steps
>> here:
>>
>>  (1) We think providing user the control of dirty bit makes sense, then,
>>  (2) We think the flag should be a hint not explicit "set dirty bit"
>>
>> I agree with (2) in this case if (1) is applicable.  And now I think I'm
>> questioning myself on (1).
>>
>> Fundamentally, access bit has more meaningful context (0 means cold, 1
>> means hot), for dirty it's really more a perf thing to me (when clear,
>> it'll take extra cycles to set it when memory write happens to it; being
>> clear _may_ help only for the tlb flush example you mentioned but I'm not
>> fully convinced that's correct).
> 
> I am not sure we understand each other. I think the benefit of not setting
> a dirty-bit when a page is not actually written is fundamental, and has
> inherit performance benefit.
> 
> When I did x86’s pte_flags_need_flush(), I was defensive, but there is a
> basic optimization that is possible to avoid a TLB flush on non-dirty
> writable PTEs.
> 
> In x86, consider a situation in which you use ptep_modify_prot_start()
> to remove a PTE and load its old value using xchg. (A similar case happens
> on reclaim). Assume you want to write-protect the entry.
> 
> If the PTE is non-dirty then you should be able to avoid a flush, even if
> the PTE is writable. In x86, a write and the change of the dirty-bit are
> performed both atomically. Therefore, if the dirty-bit on the old PTE was
> clear, you can avoid a TLB flush.
> 
> Besides the benefit of avoiding a TLB flush, there is also the benefit
> of having more precise dirty tracking. You assume UFFDIO_CONTINUE will be
> preceded by memory write to the shared memory, but that does not have to
> be the case. Similarly, if in the future userfaultfd would also support
> memory-backed private mappings, that does not have to be the case either.
> 
> Putting all of the above aside, there is a bug in my code, but this
> bug also points why dirty should not be set unconditionally. If someone
> uses SOFT_DIRTY with userfaultfd, then marking the PTE as dirty (and
> soft-dirty) might be misleading, causing unnecessary userspace writeback
> of memory.
> 
> So I do need to fix my code so it would not write-unprotect memory if
> soft-dirty is enabled and UFFD_FLAGS_WRITE_LIKELY is not provided. But
> I think it emphasizes the benefit of having UFFD_FLAGS_WRITE_LIKELY.
> 
>>
>> Maybe with the to be proposed RFC patch for tlb flush we can know whether
>> that should be something we can rely on.  It'll add more dependency on this
>> work which I'm sorry to say.  It's just that IMHO we should think carefully
>> for the write-hint because this is a solid new uABI we're talking about.
>>
>> The other option is we can introduce the access hint first and think more
>> on the dirty one (we can always add it when proper).  What do you think?
>> Also, David please chim in anytime if I missed the whole point when you
>> proposed the idea.
>>
>>>
>>> But this discussion made me think that there are two somewhat related
>>> matters that we may want to address:
>>>
>>> 1. mwriteprotect_range() should use MM_CP_TRY_CHANGE_WRITABLE when !wp
>>> to proactively make entries writable and save .
>>
>> I'm not sure I'm right here, but I think David's patch should have covered
>> that case?  The new helper only checks pte_uffd_wp() based on my memory,
>> and when resolving page faults uffd-wp bit should have been gone, so it
>> should be treated the same as normal ptes.
> 
> Let’s see we get to the same page:
> 
> mwriteprotect_range() does:
> 
>         change_protection(&tlb, dst_vma, start, start + len, newprot,
>                           enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE)
> 
> As you see no use of MM_CP_TRY_CHANGE_WRITABLE.
> 
> And then change_pte_range() does:
> 
>                         if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>                             !pte_write(ptent) &&
>                             can_change_pte_writable(vma, addr, ptent))
>                                 ptent = pte_mkwrite(ptent);

Right, I think in a previous version of my patch (before you guys
convinced me to introduce MM_CP_TRY_CHANGE_WRITABLE :P ) it would have
done it automatically (for private mappings). We might have to add it to
some callers now manually to not only consider mprotect.


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-06-27 23:37                       ` Nadav Amit
  2022-06-28 10:55                         ` David Hildenbrand
@ 2022-06-28 19:15                         ` Peter Xu
  2022-06-28 20:30                           ` Nadav Amit
  1 sibling, 1 reply; 34+ messages in thread
From: Peter Xu @ 2022-06-28 19:15 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport, Dave Hansen

On Mon, Jun 27, 2022 at 11:37:32PM +0000, Nadav Amit wrote:
> [ +Dave Hansen to say how wrong I am ] 
> 
> > On Jun 27, 2022, at 6:12 AM, Peter Xu <peterx@redhat.com> wrote:
> > 
> > ⚠ External Email
> > 
> > On Sat, Jun 25, 2022 at 07:49:54AM +0000, Nadav Amit wrote:
> >> 
> >> 
> >>> On Jun 24, 2022, at 3:17 PM, Peter Xu <peterx@redhat.com> wrote:
> >>> 
> >>> On Fri, Jun 24, 2022 at 05:58:17PM -0400, Peter Xu wrote:
> >>>> [Sorry for replying late]
> >>>> 
> >>>> Said that, I think it doesn't really necessary need to be that complex,
> >>>> since make_huge_pte() already sets dirty bit when "writable=1", so IIUC
> >>>> what you need to do is simply make sure dirty bit set when write_hint=1.
> >>>> 
> >>>> Does it sounds correct to you?
> >>> 
> >>> Hmm, hold on...  I failed to figure out how that write-likely hint could
> >>> help us for either huge or non-huge pages, since:
> >>> 
> >>> (1) Old code always set dirty, so no perf degrade anyway with/without the
> >>>     hint
> >>> 
> >>> (2) If we want to rework dirty bit (which I'm totally fine with..), then
> >>>     we don't apply it when we shouldn't, and afaict we should set D bit
> >>>     whenever we should...  if the user assumes this page is likely to be
> >>>     written but made it read-only, say, with UFFDIO_COPY(wp_mode=1),
> >>>     setting D bit will not help, instead, the user should simply use an
> >>>     UFFDIO_COPY(wp_mode=0) then the dirty will be set with write=1..
> >>> 
> >>> It'll be helpful but only helpful for UFFDIO_ZEROCOPY because it avoids one
> >>> COW.  But that seems to be it.
> >>> 
> >>> In short: I'm wondering whether we only really need the ACCESS_LIKELY hint
> >>> as you proposed earlier.  We may want UFFDIO_ZEROPAGE_MODE_ALLOCATE
> >>> separately, but keep that only for zeropage op (and it shouldn't really be
> >>> called WRITE_LIKELY)?  Or did I miss something?
> >> 
> >> Let’s see if I get you correctly. I am not sure whether we had this
> >> discussion before.
> >> 
> >> We are talking about a scenario in which WP=0. You argue that if the page
> >> is already set as dirty, what is the benefit of not setting the dirty-bit,
> >> right?
> >> 
> >> So first, IIUC, there are cases in which the page would not be set as
> >> dirty, e.g., UFFDIO_CONTINUE. [ I am admittedly not too familiar with this
> >> use-case, so I say it based on the comments. ]
> >> 
> >> Second, even if the page is dirty (e.g., following UFFDIO_COPY), but it
> >> is not written by the user after UFFDI_COPY, marking the PTE as dirty
> >> when it is mapped would induce overhead, as we discussed before, since
> >> if/when the PTE is unmapped, TLB flush batching might not be possible.
> > 
> > I'd hope we don't make an interface design just to service that purpose of
> > when write=0 and dirty=1 use case that is internal to the kernel so far,
> > and I still think it's the tlb flush code to change.. or do we have other
> > use case for this WRITE_LIKELY hint?
> > 
> > For UFFDIO_CONTINUE, if we want to make things clear on dirty bit, then
> > IMHO for UFFDIO_CONTINUE the right place for the dirty process is where the
> > user writes to the page in the other mapping, where PageDirty() will start
> > to be true already even if the pte that to be CONTINUEd will have dirty=0
> > in the pte entry.  From that pov I still don't see why we need to grant the
> > user on the dirty bit control, no matter with a hint only, or explicit.
> > 
> >> 
> >> So I don’t think there is a problem in having WRITE_LIKELY hint. Moreover,
> >> I would reiterate my position (which you guys convinced me in!)
> > 
> > David convinced you I think :)
> > 
> >> that having hints that indicate what the user does (WRITE_LIKELY) is a
> >> better API than something that indicates directly what the kernel should
> >> do (e.g., UFFDIO_ZEROPAGE_MODE_ALLOCATE).
> > 
> > The hint idea sounds good to me, it's just that we actually have two steps
> > here:
> > 
> >  (1) We think providing user the control of dirty bit makes sense, then,
> >  (2) We think the flag should be a hint not explicit "set dirty bit"
> > 
> > I agree with (2) in this case if (1) is applicable.  And now I think I'm
> > questioning myself on (1).
> > 
> > Fundamentally, access bit has more meaningful context (0 means cold, 1
> > means hot), for dirty it's really more a perf thing to me (when clear,
> > it'll take extra cycles to set it when memory write happens to it; being
> > clear _may_ help only for the tlb flush example you mentioned but I'm not
> > fully convinced that's correct).
> 
> I am not sure we understand each other. I think the benefit of not setting
> a dirty-bit when a page is not actually written is fundamental, and has
> inherit performance benefit.
> 
> When I did x86’s pte_flags_need_flush(), I was defensive, but there is a
> basic optimization that is possible to avoid a TLB flush on non-dirty
> writable PTEs.
> 
> In x86, consider a situation in which you use ptep_modify_prot_start()
> to remove a PTE and load its old value using xchg. (A similar case happens
> on reclaim). Assume you want to write-protect the entry.
> 
> If the PTE is non-dirty then you should be able to avoid a flush, even if
> the PTE is writable. In x86, a write and the change of the dirty-bit are
> performed both atomically. Therefore, if the dirty-bit on the old PTE was
> clear, you can avoid a TLB flush.
> 
> Besides the benefit of avoiding a TLB flush, there is also the benefit
> of having more precise dirty tracking. You assume UFFDIO_CONTINUE will be
> preceded by memory write to the shared memory, but that does not have to
> be the case. Similarly, if in the future userfaultfd would also support
> memory-backed private mappings, that does not have to be the case either.

Could I ask what's the not supported private mapping you're talking about
(and I think you meant "file-backed")?  I thought all kinds of private
mappings are supported on all modes already?

Thanks for explaining, so yes either false positive or false negative on
pte dirty bit can bring a side effect on things, and the user knows the
best.  Makes sense.  I think what I overlooked is the downside of having
both write==dirty==1 when it doesn't need to.

> 
> Putting all of the above aside, there is a bug in my code, but this
> bug also points why dirty should not be set unconditionally. If someone
> uses SOFT_DIRTY with userfaultfd, then marking the PTE as dirty (and
> soft-dirty) might be misleading, causing unnecessary userspace writeback
> of memory.

Well I never thought anyone would be using soft-dirty with uffd because
logically uffd-wp was somehow trying to replace it with a better interface,
hopefully.

If to talk about it, IMHO the most important thing is really not the dirty
bit but the write bit: even if you leave both the dirty bits cleared (so
the user specified !WRITE_HINT then we grant write bit only), it means when
the write happens for sure dirty bit will be set on demand but not
soft-dirty since we won't generate a page fault at all.  AFAICT it'll be a
false negative.

The old code is safe in that respect on that we always set dirty so we can
only have false positive (which is acceptable in this case) but not false
negative (which is not).

> 
> So I do need to fix my code so it would not write-unprotect memory if
> soft-dirty is enabled and UFFD_FLAGS_WRITE_LIKELY is not provided. But
> I think it emphasizes the benefit of having UFFD_FLAGS_WRITE_LIKELY.

Sorry to say that, but to me it seems a proof that dirty bit is even more
tricky than access bit so we need to have better thoughts, even if the
WRITE_LIKELY hint sounds straightforward.  I now can buy in all the tlb
flush points you explained, but still IMHO that's too fine granule (some
random ptes in a batched tlb range flush may not even matter!) and I'm just
naively wondering whether we need more thoughts for an uABI to be proposed
like that.  I won't ask for some use cases and especially numbers to prove
assuming I'm asking too much, but really that'll be very persuasive if we
can get.  I'd not worry for ACCESS_LIKELY on it because that's much more
straightforward to me.

> 
> > 
> > Maybe with the to be proposed RFC patch for tlb flush we can know whether
> > that should be something we can rely on.  It'll add more dependency on this
> > work which I'm sorry to say.  It's just that IMHO we should think carefully
> > for the write-hint because this is a solid new uABI we're talking about.
> > 
> > The other option is we can introduce the access hint first and think more
> > on the dirty one (we can always add it when proper).  What do you think?
> > Also, David please chim in anytime if I missed the whole point when you
> > proposed the idea.
> > 
> >> 
> >> But this discussion made me think that there are two somewhat related
> >> matters that we may want to address:
> >> 
> >> 1. mwriteprotect_range() should use MM_CP_TRY_CHANGE_WRITABLE when !wp
> >> to proactively make entries writable and save .
> > 
> > I'm not sure I'm right here, but I think David's patch should have covered
> > that case?  The new helper only checks pte_uffd_wp() based on my memory,
> > and when resolving page faults uffd-wp bit should have been gone, so it
> > should be treated the same as normal ptes.
> 
> Let’s see we get to the same page:
> 
> mwriteprotect_range() does:
> 
>         change_protection(&tlb, dst_vma, start, start + len, newprot,
>                           enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE)
> 
> As you see no use of MM_CP_TRY_CHANGE_WRITABLE.
> 
> And then change_pte_range() does:
> 
>                         if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>                             !pte_write(ptent) &&
>                             can_change_pte_writable(vma, addr, ptent))
>                                 ptent = pte_mkwrite(ptent);
> 
> If we do not provide MM_CP_TRY_CHANGE_WRITABLE, the PTE would not be
> writable.
> 
> Now for the record, we may want an additional optimization, so if a
> fault handler is woken because a PTE become writable, we do not retry
> the page fault (since the PTE is already writable). It is a small change,
> but let’s see first we agree on the patches so far…

Ah I see what I missed, thanks.

Another option is we call can_change_pte_writable() somehow in the
unprotect branch.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-06-28 19:15                         ` Peter Xu
@ 2022-06-28 20:30                           ` Nadav Amit
  2022-06-28 20:56                             ` Peter Xu
  0 siblings, 1 reply; 34+ messages in thread
From: Nadav Amit @ 2022-06-28 20:30 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport, Dave Hansen



> On Jun 28, 2022, at 12:15 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> ⚠ External Email
> 
> On Mon, Jun 27, 2022 at 11:37:32PM +0000, Nadav Amit wrote:
>> 
>> Besides the benefit of avoiding a TLB flush, there is also the benefit
>> of having more precise dirty tracking. You assume UFFDIO_CONTINUE will be
>> preceded by memory write to the shared memory, but that does not have to
>> be the case. Similarly, if in the future userfaultfd would also support
>> memory-backed private mappings, that does not have to be the case either.
> 
> Could I ask what's the not supported private mapping you're talking about
> (and I think you meant "file-backed")? I thought all kinds of private
> mappings are supported on all modes already?

Yes, I meant file-backed. See vma_can_userfault() for the supported
types of memory:

  static inline bool vma_can_userfault(struct vm_area_struct *vma,
                                     unsigned long vm_flags)
  {       
        if (vm_flags & VM_UFFD_MINOR)
                return is_vm_hugetlb_page(vma) || vma_is_shmem(vma);
                
#ifndef CONFIG_PTE_MARKER_UFFD_WP
        /*
         * If user requested uffd-wp but not enabled pte markers for
         * uffd-wp, then shmem & hugetlbfs are not supported but only
         * anonymous.
         */             
        if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
                return false;
#endif
        return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
            vma_is_shmem(vma);
  }


> 
> Thanks for explaining, so yes either false positive or false negative on
> pte dirty bit can bring a side effect on things, and the user knows the
> best. Makes sense. I think what I overlooked is the downside of having
> both write==dirty==1 when it doesn't need to.
> 
>> 
>> Putting all of the above aside, there is a bug in my code, but this
>> bug also points why dirty should not be set unconditionally. If someone
>> uses SOFT_DIRTY with userfaultfd, then marking the PTE as dirty (and
>> soft-dirty) might be misleading, causing unnecessary userspace writeback
>> of memory.
> 
> Well I never thought anyone would be using soft-dirty with uffd because
> logically uffd-wp was somehow trying to replace it with a better interface,
> hopefully.

I have heard about some who does (not me). So I do not make it up,
unfortunately.

> 
> If to talk about it, IMHO the most important thing is really not the dirty
> bit but the write bit: even if you leave both the dirty bits cleared (so
> the user specified !WRITE_HINT then we grant write bit only), it means when
> the write happens for sure dirty bit will be set on demand but not
> soft-dirty since we won't generate a page fault at all. AFAICT it'll be a
> false negative.
> 
> The old code is safe in that respect on that we always set dirty so we can
> only have false positive (which is acceptable in this case) but not false
> negative (which is not).

I agree that the PTE should be left RO in this case. I was just pointing
that the user might not be aware of soft-dirty being used, and then it
makes sense to treat the (lack of) write-hint appropriately.

In the current code, when PTEs are unconditionally marked as dirty, even
when they are write-protected, soft-dirty would produce false-positives.
Nothing would break, but it is not good for performance either.

> 
>> 
>> So I do need to fix my code so it would not write-unprotect memory if
>> soft-dirty is enabled and UFFD_FLAGS_WRITE_LIKELY is not provided. But
>> I think it emphasizes the benefit of having UFFD_FLAGS_WRITE_LIKELY.
> 
> Sorry to say that, but to me it seems a proof that dirty bit is even more
> tricky than access bit so we need to have better thoughts, even if the
> WRITE_LIKELY hint sounds straightforward. I now can buy in all the tlb
> flush points you explained, but still IMHO that's too fine granule (some
> random ptes in a batched tlb range flush may not even matter!) and I'm just
> naively wondering whether we need more thoughts for an uABI to be proposed
> like that. I won't ask for some use cases and especially numbers to prove
> assuming I'm asking too much, but really that'll be very persuasive if we
> can get. I'd not worry for ACCESS_LIKELY on it because that's much more
> straightforward to me.

Let me get back to you with numbers.

> 
>> 
>>> 
>>> Maybe with the to be proposed RFC patch for tlb flush we can know whether
>>> that should be something we can rely on. It'll add more dependency on this
>>> work which I'm sorry to say. It's just that IMHO we should think carefully
>>> for the write-hint because this is a solid new uABI we're talking about.
>>> 
>>> The other option is we can introduce the access hint first and think more
>>> on the dirty one (we can always add it when proper). What do you think?
>>> Also, David please chim in anytime if I missed the whole point when you
>>> proposed the idea.
>>> 
>>>> 
>>>> But this discussion made me think that there are two somewhat related
>>>> matters that we may want to address:
>>>> 
>>>> 1. mwriteprotect_range() should use MM_CP_TRY_CHANGE_WRITABLE when !wp
>>>> to proactively make entries writable and save .
>>> 
>>> I'm not sure I'm right here, but I think David's patch should have covered
>>> that case? The new helper only checks pte_uffd_wp() based on my memory,
>>> and when resolving page faults uffd-wp bit should have been gone, so it
>>> should be treated the same as normal ptes.
>> 
>> Let’s see we get to the same page:
>> 
>> mwriteprotect_range() does:
>> 
>> change_protection(&tlb, dst_vma, start, start + len, newprot,
>> enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE)
>> 
>> As you see no use of MM_CP_TRY_CHANGE_WRITABLE.
>> 
>> And then change_pte_range() does:
>> 
>> if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>> !pte_write(ptent) &&
>> can_change_pte_writable(vma, addr, ptent))
>> ptent = pte_mkwrite(ptent);
>> 
>> If we do not provide MM_CP_TRY_CHANGE_WRITABLE, the PTE would not be
>> writable.
>> 
>> Now for the record, we may want an additional optimization, so if a
>> fault handler is woken because a PTE become writable, we do not retry
>> the page fault (since the PTE is already writable). It is a small change,
>> but let’s see first we agree on the patches so far…
> 
> Ah I see what I missed, thanks.
> 
> Another option is we call can_change_pte_writable() somehow in the
> unprotect branch.

I don’t think that I got this one.

tl;dr: I’ll run some numbers (access when PTE-dirty/clear), address the
aforementioned issues and send v2. Based on the results we can decide
whether it makes sense.



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-06-28 20:30                           ` Nadav Amit
@ 2022-06-28 20:56                             ` Peter Xu
  2022-06-28 21:03                               ` Nadav Amit
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2022-06-28 20:56 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport, Dave Hansen

On Tue, Jun 28, 2022 at 08:30:46PM +0000, Nadav Amit wrote:
> 
> 
> > On Jun 28, 2022, at 12:15 PM, Peter Xu <peterx@redhat.com> wrote:
> > 
> > ⚠ External Email
> > 
> > On Mon, Jun 27, 2022 at 11:37:32PM +0000, Nadav Amit wrote:
> >> 
> >> Besides the benefit of avoiding a TLB flush, there is also the benefit
> >> of having more precise dirty tracking. You assume UFFDIO_CONTINUE will be
> >> preceded by memory write to the shared memory, but that does not have to
> >> be the case. Similarly, if in the future userfaultfd would also support
> >> memory-backed private mappings, that does not have to be the case either.
> > 
> > Could I ask what's the not supported private mapping you're talking about
> > (and I think you meant "file-backed")? I thought all kinds of private
> > mappings are supported on all modes already?
> 
> Yes, I meant file-backed. See vma_can_userfault() for the supported
> types of memory:
> 
>   static inline bool vma_can_userfault(struct vm_area_struct *vma,
>                                      unsigned long vm_flags)
>   {       
>         if (vm_flags & VM_UFFD_MINOR)
>                 return is_vm_hugetlb_page(vma) || vma_is_shmem(vma);
>                 
> #ifndef CONFIG_PTE_MARKER_UFFD_WP
>         /*
>          * If user requested uffd-wp but not enabled pte markers for
>          * uffd-wp, then shmem & hugetlbfs are not supported but only
>          * anonymous.
>          */             
>         if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
>                 return false;
> #endif
>         return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
>             vma_is_shmem(vma);
>   }

I'm confused.  Let me ask in another way: do you mean !VM_SHARED when you
say "private"?

Both shmem & hugetlb support private mappings for all three modes, afaict.

> > Well I never thought anyone would be using soft-dirty with uffd because
> > logically uffd-wp was somehow trying to replace it with a better interface,
> > hopefully.
> 
> I have heard about some who does (not me). So I do not make it up,
> unfortunately.

If in any form you can get the reason of using soft-dirty in that use case
please kindly share more.  I think it could be that uffd-wp is not working
as good as soft-dirty in some way but we can think about it when there's a
valid use case, so ultimately it's more possible for a full replacement.

Sync messages can be a problem and they're indeed slow, soft-dirty is by
nature async. But still I'd like to check in case you know.

> 
> > 
> > If to talk about it, IMHO the most important thing is really not the dirty
> > bit but the write bit: even if you leave both the dirty bits cleared (so
> > the user specified !WRITE_HINT then we grant write bit only), it means when
> > the write happens for sure dirty bit will be set on demand but not
> > soft-dirty since we won't generate a page fault at all. AFAICT it'll be a
> > false negative.
> > 
> > The old code is safe in that respect on that we always set dirty so we can
> > only have false positive (which is acceptable in this case) but not false
> > negative (which is not).
> 
> I agree that the PTE should be left RO in this case. I was just pointing
> that the user might not be aware of soft-dirty being used, and then it
> makes sense to treat the (lack of) write-hint appropriately.

Yes please.  Thanks for raising this topic btw, I never thought about that
side effect.

[...]

> > Another option is we call can_change_pte_writable() somehow in the
> > unprotect branch.
> 
> I don’t think that I got this one.

No worry, feel free to ignore it.  It's probably not ideal anyway because
we need an extra pte_mkwrite() there too when can_change_pte_writable()
returns true.

> tl;dr: I’ll run some numbers (access when PTE-dirty/clear), address the
> aforementioned issues and send v2. Based on the results we can decide
> whether it makes sense.

That'll be very appreciated.  Thanks a lot!

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-06-28 20:56                             ` Peter Xu
@ 2022-06-28 21:03                               ` Nadav Amit
  2022-06-28 21:12                                 ` Peter Xu
  0 siblings, 1 reply; 34+ messages in thread
From: Nadav Amit @ 2022-06-28 21:03 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport, Dave Hansen

On Jun 28, 2022, at 1:56 PM, Peter Xu <peterx@redhat.com> wrote:

> On Tue, Jun 28, 2022 at 08:30:46PM +0000, Nadav Amit wrote:
>>> On Jun 28, 2022, at 12:15 PM, Peter Xu <peterx@redhat.com> wrote:
>>> 
>>> ⚠ External Email
>>> 
>>> On Mon, Jun 27, 2022 at 11:37:32PM +0000, Nadav Amit wrote:
>>>> Besides the benefit of avoiding a TLB flush, there is also the benefit
>>>> of having more precise dirty tracking. You assume UFFDIO_CONTINUE will be
>>>> preceded by memory write to the shared memory, but that does not have to
>>>> be the case. Similarly, if in the future userfaultfd would also support
>>>> memory-backed private mappings, that does not have to be the case either.
>>> 
>>> Could I ask what's the not supported private mapping you're talking about
>>> (and I think you meant "file-backed")? I thought all kinds of private
>>> mappings are supported on all modes already?
>> 
>> Yes, I meant file-backed. See vma_can_userfault() for the supported
>> types of memory:
>> 
>> static inline bool vma_can_userfault(struct vm_area_struct *vma,
>> unsigned long vm_flags)
>> { 
>> if (vm_flags & VM_UFFD_MINOR)
>> return is_vm_hugetlb_page(vma) || vma_is_shmem(vma);
>> 
>> #ifndef CONFIG_PTE_MARKER_UFFD_WP
>> /*
>> * If user requested uffd-wp but not enabled pte markers for
>> * uffd-wp, then shmem & hugetlbfs are not supported but only
>> * anonymous.
>> */ 
>> if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
>> return false;
>> #endif
>> return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
>> vma_is_shmem(vma);
>> }
> 
> I'm confused. Let me ask in another way: do you mean !VM_SHARED when you
> say "private"?
> 
> Both shmem & hugetlb support private mappings for all three modes, afaict.

Sorry, let me more clear. I meant private (!VM_SHARED) file-backed memory.

> 
>>> Well I never thought anyone would be using soft-dirty with uffd because
>>> logically uffd-wp was somehow trying to replace it with a better interface,
>>> hopefully.
>> 
>> I have heard about some who does (not me). So I do not make it up,
>> unfortunately.
> 
> If in any form you can get the reason of using soft-dirty in that use case
> please kindly share more. I think it could be that uffd-wp is not working
> as good as soft-dirty in some way but we can think about it when there's a
> valid use case, so ultimately it's more possible for a full replacement.
> 
> Sync messages can be a problem and they're indeed slow, soft-dirty is by
> nature async. But still I'd like to check in case you know.

It’s not my thing (not VMware’s either), so I do not feel free of sharing,
but anyhow I know very little. I think that project started before uffd-wp
was implemented and that’s the reason they use this strange combination. It
should not be our main motivation - just to say that users are “creative”.

Again, I will get back to you with the rest.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-06-28 21:03                               ` Nadav Amit
@ 2022-06-28 21:12                                 ` Peter Xu
  2022-06-28 21:15                                   ` Nadav Amit
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2022-06-28 21:12 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport, Dave Hansen

On Tue, Jun 28, 2022 at 09:03:02PM +0000, Nadav Amit wrote:
> > Both shmem & hugetlb support private mappings for all three modes, afaict.
> 
> Sorry, let me more clear. I meant private (!VM_SHARED) file-backed memory.

I normally use "file-backed" to stand for "shmem + hugetlb" in any uffd
context.

If you meant anything outside shmem/hugetlb for uffd (e.g. on xfs), then we
never support them anyway, do we?

Hmm.. What am I ultimately missing?

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-06-28 21:12                                 ` Peter Xu
@ 2022-06-28 21:15                                   ` Nadav Amit
  0 siblings, 0 replies; 34+ messages in thread
From: Nadav Amit @ 2022-06-28 21:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport, Dave Hansen

On Jun 28, 2022, at 2:12 PM, Peter Xu <peterx@redhat.com> wrote:

> ⚠ External Email
> 
> On Tue, Jun 28, 2022 at 09:03:02PM +0000, Nadav Amit wrote:
>>> Both shmem & hugetlb support private mappings for all three modes, afaict.
>> 
>> Sorry, let me more clear. I meant private (!VM_SHARED) file-backed memory.
> 
> I normally use "file-backed" to stand for "shmem + hugetlb" in any uffd
> context.
> 
> If you meant anything outside shmem/hugetlb for uffd (e.g. on xfs), then we
> never support them anyway, do we?
> 
> Hmm.. What am I ultimately missing?

You are not missing anything. We just have communication problems.

I meant - we do not support uffd+xfs today, but maybe one day we will. I do
not know/think of any implications WRITE_HINT should have in such case, but
it would be useful to have the option to decide how to handle something like
that in the future (and not bind ourselves to certain behavior now).


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-06-22 18:50 ` [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations Nadav Amit
  2022-06-23 23:24   ` Peter Xu
@ 2022-07-12  6:19   ` Nadav Amit
  2022-07-12 14:56     ` Peter Xu
  1 sibling, 1 reply; 34+ messages in thread
From: Nadav Amit @ 2022-07-12  6:19 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport, Nadav Amit

On Jun 22, 2022, at 11:50 AM, Nadav Amit <nadav.amit@gmail.com> wrote:

> From: Nadav Amit <namit@vmware.com>
> 
> Using a PTE on x86 with cleared access-bit (aka young-bit)
> takes ~600 cycles more than when the access bit is set. At the same
> time, setting the access-bit for memory that is not used (e.g.,
> prefetched) can introduce greater overheads, as the prefetched memory is
> reclaimed later than it should be.
> 
> Userfaultfd currently does not set the access-bit (excluding the
> huge-pages case). Arguably, it is best to let the user control whether
> the access bit should be set or not. The expected use is to request
> userfaultfd to set the access-bit when the copy/wp operation is done to
> resolve a page-fault, and not to set the access-bit when the memory is
> prefetched.
> 
> Introduce UFFDIO_[op]_ACCESS_LIKELY to enable userspace to request the
> young bit to be set.

I reply to my own email, but this mostly addresses the concerns that Peter
has raised.

So I ran the test below on my Haswell (x86), which showed two things:

1. Accessing an address using a clean PTE or old PTE takes ~500 cycles
more than with dirty+young (depending on the access, of course: dirty
does not matter for read, dirty+young both matter for write).

2. I made a mistake in my implementation. PTEs are - at least on x86 -
created as young with mk_pte(). So the logic should be similar to
do_set_pte():

        if (prefault && arch_wants_old_prefaulted_pte())
                entry = pte_mkold(entry);
        else
                entry = pte_sw_mkyoung(entry);

Based on these results, I will send another version for both young and
dirty. Let me know if these results are not convincing.

I will add, as we discussed (well, I think I raised these things, so
hopefully you agree):

1. On x86, avoid flush if changing WP->RO and PTE is clean.

2. When write-unprotecting entry, if PTE is exclusive, set it as writable.
[ I considered not setting it as writable if write-hint is not provided, but
with the change in (1), it does not provide any real value. ]

---

#define _GNU_SOURCE
#include <sys/types.h>
#include <stdio.h>
#include <linux/userfaultfd.h>
#include <pthread.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <signal.h>
#include <poll.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <stdint.h>
#include <stdbool.h>

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
		       } while (0)

static inline uint64_t rdtscp(void)
{
	uint64_t rax, rdx;
	uint32_t aux;
	asm volatile ("rdtscp" : "=a" (rax), "=d" (rdx), "=c" (aux):: "memory");
}

int main(int argc, char *argv[])
{
	long uffd;          /* userfaultfd file descriptor */
	char *addr;         /* Start of region handled by userfaultfd */
	unsigned long len;  /* Length of region handled by userfaultfd */
	pthread_t thr;      /* ID of thread that handles page faults */
	bool young, dirty, write;
	struct uffdio_api uffdio_api;
	struct uffdio_register uffdio_register;
	int l;
	static char *page = NULL;
	struct uffdio_copy uffdio_copy;
	ssize_t nread;
	int page_size;

	if (argc != 5) {
		fprintf(stderr, "Usage: %s [num-pages] [write] [young] [dirty]\n", argv[0]);
		exit(EXIT_FAILURE);
	}

	page_size = sysconf(_SC_PAGE_SIZE);
	len = strtoul(argv[1], NULL, 0) * page_size;
	write = !!strtoul(argv[2], NULL, 0);
	young = !!strtoul(argv[3], NULL, 0);
	dirty = !!strtoul(argv[4], NULL, 0);

	page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (page == MAP_FAILED)
		errExit("mmap");

	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd == -1)
		errExit("userfaultfd");

	uffdio_api.api = UFFD_API;
	uffdio_api.features = (1<<11); //UFFD_FEATURE_EXACT_ADDRESS;
	if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1)
		errExit("ioctl-UFFDIO_API");

	addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
		MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (addr == MAP_FAILED)
		errExit("mmap");

	uffdio_register.range.start = (unsigned long) addr;
	uffdio_register.range.len = len;
	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1)
		errExit("ioctl-UFFDIO_REGISTER");

	uffdio_copy.src = (unsigned long) page;
	uffdio_copy.mode = 0;
	if (young)
		uffdio_copy.mode |= (1ul << 2);
	if (dirty)
		uffdio_copy.mode |= (1ul << 3);

	uffdio_copy.len = page_size;
	uffdio_copy.copy = 0;

	for (l = 0; l < len; l += page_size) {
		uffdio_copy.dst = (unsigned long)(&addr[l]);
		if (ioctl(uffd, UFFDIO_COPY, &uffdio_copy) == -1)
			errExit("ioctl-UFFDIO_COPY");
	}

	for (l = 0; l < len; l += page_size) {
		char c;
		uint64_t start;

		start = rdtscp();
		if (write)
			addr[l] = 5;
		else
			c = *(volatile char *)(&addr[l]);
		printf("%ld\n", rdtscp() - start);
	}

	exit(EXIT_SUCCESS);
}

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-07-12  6:19   ` Nadav Amit
@ 2022-07-12 14:56     ` Peter Xu
  2022-07-13  1:09       ` Nadav Amit
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2022-07-12 14:56 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport, Nadav Amit

Hi, Nadav,

On Tue, Jul 12, 2022 at 06:19:08AM +0000, Nadav Amit wrote:
> On Jun 22, 2022, at 11:50 AM, Nadav Amit <nadav.amit@gmail.com> wrote:
> 
> > From: Nadav Amit <namit@vmware.com>
> > 
> > Using a PTE on x86 with cleared access-bit (aka young-bit)
> > takes ~600 cycles more than when the access bit is set. At the same
> > time, setting the access-bit for memory that is not used (e.g.,
> > prefetched) can introduce greater overheads, as the prefetched memory is
> > reclaimed later than it should be.
> > 
> > Userfaultfd currently does not set the access-bit (excluding the
> > huge-pages case). Arguably, it is best to let the user control whether
> > the access bit should be set or not. The expected use is to request
> > userfaultfd to set the access-bit when the copy/wp operation is done to
> > resolve a page-fault, and not to set the access-bit when the memory is
> > prefetched.
> > 
> > Introduce UFFDIO_[op]_ACCESS_LIKELY to enable userspace to request the
> > young bit to be set.
> 
> I reply to my own email, but this mostly addresses the concerns that Peter
> has raised.
> 
> So I ran the test below on my Haswell (x86), which showed two things:
> 
> 1. Accessing an address using a clean PTE or old PTE takes ~500 cycles
> more than with dirty+young (depending on the access, of course: dirty
> does not matter for read, dirty+young both matter for write).
> 
> 2. I made a mistake in my implementation. PTEs are - at least on x86 -
> created as young with mk_pte(). So the logic should be similar to
> do_set_pte():
> 
>         if (prefault && arch_wants_old_prefaulted_pte())
>                 entry = pte_mkold(entry);
>         else
>                 entry = pte_sw_mkyoung(entry);
> 
> Based on these results, I will send another version for both young and
> dirty. Let me know if these results are not convincing.

Thanks for trying to verify this idea, but I'm not fully sure this is what
my concern was on WRITE_LIKELY.

AFAICT the test below was trying to measure the overhead of hardware
setting either access or dirty or both bits when they're not set for
read/write.

What I wanted as a justification is whether WRITE_LIKELY would be helpful
in any real world scenario at all.  AFAIK the only way to prove it so far
is to measure any tlb flush difference (probably only on x86, since that
tlb code is only compiled on x86) that may trigger with W=0,D=1 but may not
trigger with W=0,D=0 (where W stands for "write bit", and D stands for
"dirty bit").

It's not about the slowness when D is cleared.

The core thing is (sorry to rephrase, but just hope we're on the same page)
we'll set D bit always for all uffd pages so far.  Even if we want to
change that behavior so we skip setting D bit for RO pages (we'll need to
convert the dirty bit into PageDirty though), we'll still always set D bit
for writable pages.  So we always set D bit as long as possible and we'll
never suffer from hardware overhead on setting D bit for uffd pages.

The other worry of having WRITE_HINT is, after we have it we probably need
to _not_ apply dirty bit when WRITE_HINT is not set (which is actually a
very light ABI change since we used to always set it), then I'll start to
worry the hardware setting D bit overhead you just measured because we'll
have that overhead when user didn't specify WRITE_HINT with the old code.

So again, I'm totally fine if you want to start with ACCESS_HINT only, but
I still don't see why we should need WRITE_HINT too..

Thanks,

> 
> I will add, as we discussed (well, I think I raised these things, so
> hopefully you agree):
> 
> 1. On x86, avoid flush if changing WP->RO and PTE is clean.
> 
> 2. When write-unprotecting entry, if PTE is exclusive, set it as writable.
> [ I considered not setting it as writable if write-hint is not provided, but
> with the change in (1), it does not provide any real value. ]
> 
> ---
> 
> #define _GNU_SOURCE
> #include <sys/types.h>
> #include <stdio.h>
> #include <linux/userfaultfd.h>
> #include <pthread.h>
> #include <errno.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <fcntl.h>
> #include <signal.h>
> #include <poll.h>
> #include <string.h>
> #include <sys/mman.h>
> #include <sys/syscall.h>
> #include <sys/ioctl.h>
> #include <stdint.h>
> #include <stdbool.h>
> 
> #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
> 		       } while (0)
> 
> static inline uint64_t rdtscp(void)
> {
> 	uint64_t rax, rdx;
> 	uint32_t aux;
> 	asm volatile ("rdtscp" : "=a" (rax), "=d" (rdx), "=c" (aux):: "memory");
> }
> 
> int main(int argc, char *argv[])
> {
> 	long uffd;          /* userfaultfd file descriptor */
> 	char *addr;         /* Start of region handled by userfaultfd */
> 	unsigned long len;  /* Length of region handled by userfaultfd */
> 	pthread_t thr;      /* ID of thread that handles page faults */
> 	bool young, dirty, write;
> 	struct uffdio_api uffdio_api;
> 	struct uffdio_register uffdio_register;
> 	int l;
> 	static char *page = NULL;
> 	struct uffdio_copy uffdio_copy;
> 	ssize_t nread;
> 	int page_size;
> 
> 	if (argc != 5) {
> 		fprintf(stderr, "Usage: %s [num-pages] [write] [young] [dirty]\n", argv[0]);
> 		exit(EXIT_FAILURE);
> 	}
> 
> 	page_size = sysconf(_SC_PAGE_SIZE);
> 	len = strtoul(argv[1], NULL, 0) * page_size;
> 	write = !!strtoul(argv[2], NULL, 0);
> 	young = !!strtoul(argv[3], NULL, 0);
> 	dirty = !!strtoul(argv[4], NULL, 0);
> 
> 	page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
> 		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 
> 	if (page == MAP_FAILED)
> 		errExit("mmap");
> 
> 	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
> 	if (uffd == -1)
> 		errExit("userfaultfd");
> 
> 	uffdio_api.api = UFFD_API;
> 	uffdio_api.features = (1<<11); //UFFD_FEATURE_EXACT_ADDRESS;
> 	if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1)
> 		errExit("ioctl-UFFDIO_API");
> 
> 	addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
> 		MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 	if (addr == MAP_FAILED)
> 		errExit("mmap");
> 
> 	uffdio_register.range.start = (unsigned long) addr;
> 	uffdio_register.range.len = len;
> 	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
> 	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1)
> 		errExit("ioctl-UFFDIO_REGISTER");
> 
> 	uffdio_copy.src = (unsigned long) page;
> 	uffdio_copy.mode = 0;
> 	if (young)
> 		uffdio_copy.mode |= (1ul << 2);
> 	if (dirty)
> 		uffdio_copy.mode |= (1ul << 3);
> 
> 	uffdio_copy.len = page_size;
> 	uffdio_copy.copy = 0;
> 
> 	for (l = 0; l < len; l += page_size) {
> 		uffdio_copy.dst = (unsigned long)(&addr[l]);
> 		if (ioctl(uffd, UFFDIO_COPY, &uffdio_copy) == -1)
> 			errExit("ioctl-UFFDIO_COPY");
> 	}
> 
> 	for (l = 0; l < len; l += page_size) {
> 		char c;
> 		uint64_t start;
> 
> 		start = rdtscp();
> 		if (write)
> 			addr[l] = 5;
> 		else
> 			c = *(volatile char *)(&addr[l]);
> 		printf("%ld\n", rdtscp() - start);
> 	}
> 
> 	exit(EXIT_SUCCESS);
> }
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-07-12 14:56     ` Peter Xu
@ 2022-07-13  1:09       ` Nadav Amit
  2022-07-13 16:02         ` Peter Xu
  0 siblings, 1 reply; 34+ messages in thread
From: Nadav Amit @ 2022-07-13  1:09 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport

On Jul 12, 2022, at 7:56 AM, Peter Xu <peterx@redhat.com> wrote:

> Hi, Nadav,
> 
> On Tue, Jul 12, 2022 at 06:19:08AM +0000, Nadav Amit wrote:
>> On Jun 22, 2022, at 11:50 AM, Nadav Amit <nadav.amit@gmail.com> wrote:
>> 
>>> From: Nadav Amit <namit@vmware.com>
>>> 
>>> Using a PTE on x86 with cleared access-bit (aka young-bit)
>>> takes ~600 cycles more than when the access bit is set. At the same
>>> time, setting the access-bit for memory that is not used (e.g.,
>>> prefetched) can introduce greater overheads, as the prefetched memory is
>>> reclaimed later than it should be.
>>> 
>>> Userfaultfd currently does not set the access-bit (excluding the
>>> huge-pages case). Arguably, it is best to let the user control whether
>>> the access bit should be set or not. The expected use is to request
>>> userfaultfd to set the access-bit when the copy/wp operation is done to
>>> resolve a page-fault, and not to set the access-bit when the memory is
>>> prefetched.
>>> 
>>> Introduce UFFDIO_[op]_ACCESS_LIKELY to enable userspace to request the
>>> young bit to be set.
>> 
>> I reply to my own email, but this mostly addresses the concerns that Peter
>> has raised.
>> 
>> So I ran the test below on my Haswell (x86), which showed two things:
>> 
>> 1. Accessing an address using a clean PTE or old PTE takes ~500 cycles
>> more than with dirty+young (depending on the access, of course: dirty
>> does not matter for read, dirty+young both matter for write).
>> 
>> 2. I made a mistake in my implementation. PTEs are - at least on x86 -
>> created as young with mk_pte(). So the logic should be similar to
>> do_set_pte():
>> 
>> if (prefault && arch_wants_old_prefaulted_pte())
>> entry = pte_mkold(entry);
>> else
>> entry = pte_sw_mkyoung(entry);
>> 
>> Based on these results, I will send another version for both young and
>> dirty. Let me know if these results are not convincing.
> 
> Thanks for trying to verify this idea, but I'm not fully sure this is what
> my concern was on WRITE_LIKELY.
> 
> AFAICT the test below was trying to measure the overhead of hardware
> setting either access or dirty or both bits when they're not set for
> read/write.

Indeed.

> 
> What I wanted as a justification is whether WRITE_LIKELY would be helpful
> in any real world scenario at all. AFAIK the only way to prove it so far
> is to measure any tlb flush difference (probably only on x86, since that
> tlb code is only compiled on x86) that may trigger with W=0,D=1 but may not
> trigger with W=0,D=0 (where W stands for "write bit", and D stands for
> "dirty bit").
> 
> It's not about the slowness when D is cleared.
> 
> The core thing is (sorry to rephrase, but just hope we're on the same page)
> we'll set D bit always for all uffd pages so far. Even if we want to
> change that behavior so we skip setting D bit for RO pages (we'll need to
> convert the dirty bit into PageDirty though), we'll still always set D bit
> for writable pages. So we always set D bit as long as possible and we'll
> never suffer from hardware overhead on setting D bit for uffd pages.

Thanks as usual for your clarifications. As you see, I also do my best to be
on the same page with, even if from time to time I fail. I had some recent
communication challenges on lkml, so I hope that you understand that
everything I say is said with full respect, and if I use double-quotes while
arguing with you, it is in good spirit, and I really appreciate your
feedback.

Ok. So there is a lot to digest in what you just said, and I politely
disagree with some of the assertions that you made. You focus on discussing
the issue of whether we set the dirty bit for RO pages, which in my opinion
is so intuitively wrong. But, I think that discussing this issue really
digress us from the benefits of not setting the D-bit when it is unnecessary
for RW pages, which is the main question.

But before we get to it, I want to argue with some of the “facts” that you
present:

1. "D bit always for all uffd pages” - This is true for almost all UFFD
operations, but not true to write-unprotect. The moment we use David’s
MM_CP_TRY_CHANGE_WRITABLE in UFFD, the PTE would be writable but not dirty.
[Yes, the patches that I sent do not deal with that: as I noted before, I
want to send it as part of v2.] Arguably, one of the places that setting the
D-bit matters the most if when you change PTE from RO->RW. And anyhow, as
you can see the API is inconsistent.

2. "we'll still always set D bit for writable pages” - To continue my
previous bullet - why? Why would we? Besides uffd-wrunprotect path, why
would we always set it for MCOPY_CONTINUE? (more to follow on this one)

3. "measure any tlb flush difference … that may trigger with W=0,D=1 but may
not trigger with W=0,D=0” - This is really a boring case, which even if was
underoptimized could have been resolved by its own. This is certainly not
the motivation for the write hints.


So please allow me to go back to the reasons for why you want write-hints,
and RO entries would be mostly left out and implicitly included in the
other arguments - which are mainly about RW entries:

1. You can avoid TLB flushes when write-protecting clean writable PTEs (on
x86). Such PTEs can be created when userspace monitor prefaults or
speculatively write-unprotects memory that might be needed. When a monitor
removes the mapping, using MADV_DONTNEED, it would not need to flush
anything.

2. Hopefully you agree that write-hints are needed if during
uffd-write-unprotect we actually write-unprotect the PTE (not just clearing
the logical flag). So you do need write-hint for zero-page (to get a
clear-page) and for write-unprotect, so why not to be consistent and provide
it for all operations?

3. For UFFDIO_CONTINUE. Admittedly, I am not very familiar with
UFFDIO_CONTINUE, but from the discussion (and skimming the code) I
understood that you can be used for prefetching. In such case, why would you
assume that the page is dirty?

4. It allows you to treat softdirty properly. If softdirty is on,
presumably, if you did not get a WRITE_HINT, you would keep it
writeprotected.

5. Consistency with UFFDIO_ZERO that needs a write-hint to clear a page.

Now I will “kill myself” over support of write-hints. Not everything that
I mentioned here is in v1 that I sent.

But, if you decide you do not want write-hints, I would still need to leave
the UFFDIO_ZERO write-hints (that clear the page) and to find some solution
for UFFDIO_WRITEPROTECT (that should set the dirty-bit for better performance
of WP page-fault handling). 

If you decide you do want it, I would run some tests to check that indeed
the access-time of UFFDIO_WRITEPROTECT reflects the fact no TLB flush was
needed.

> The other worry of having WRITE_HINT is, after we have it we probably need
> to _not_ apply dirty bit when WRITE_HINT is not set (which is actually a
> very light ABI change since we used to always set it), then I'll start to
> worry the hardware setting D bit overhead you just measured because we'll
> have that overhead when user didn't specify WRITE_HINT with the old code.

Good point, which I have already got to. So, because I was mistaken on the
ACCESS_HINT understanding (young-bit is set by default), this would mean
that the hints should only be regarded when the “hints” feature is enabled
for backward compatibility. I got this code ready for v2.

> 
> So again, I'm totally fine if you want to start with ACCESS_HINT only, but
> I still don't see why we should need WRITE_HINT too..

I really hate how the “features” evolve that I think it is better to decide
now.

I think that in the lack of agreement, the best thing to do is to put all of
the write-hints now in the API, and only regard them for UFFDIO_ZERO (which
would clear the page) and UFFDIO_WRITE(un)PROTECT (that would set the old
bit).

Again, thanks for the feedback, and hopefully (at least) I understood you
this time.



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-07-13  1:09       ` Nadav Amit
@ 2022-07-13 16:02         ` Peter Xu
  2022-07-13 16:49           ` Nadav Amit
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2022-07-13 16:02 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport

On Tue, Jul 12, 2022 at 06:09:35PM -0700, Nadav Amit wrote:
> On Jul 12, 2022, at 7:56 AM, Peter Xu <peterx@redhat.com> wrote:
> 
> > Hi, Nadav,
> > 
> > On Tue, Jul 12, 2022 at 06:19:08AM +0000, Nadav Amit wrote:
> >> On Jun 22, 2022, at 11:50 AM, Nadav Amit <nadav.amit@gmail.com> wrote:
> >> 
> >>> From: Nadav Amit <namit@vmware.com>
> >>> 
> >>> Using a PTE on x86 with cleared access-bit (aka young-bit)
> >>> takes ~600 cycles more than when the access bit is set. At the same
> >>> time, setting the access-bit for memory that is not used (e.g.,
> >>> prefetched) can introduce greater overheads, as the prefetched memory is
> >>> reclaimed later than it should be.
> >>> 
> >>> Userfaultfd currently does not set the access-bit (excluding the
> >>> huge-pages case). Arguably, it is best to let the user control whether
> >>> the access bit should be set or not. The expected use is to request
> >>> userfaultfd to set the access-bit when the copy/wp operation is done to
> >>> resolve a page-fault, and not to set the access-bit when the memory is
> >>> prefetched.
> >>> 
> >>> Introduce UFFDIO_[op]_ACCESS_LIKELY to enable userspace to request the
> >>> young bit to be set.
> >> 
> >> I reply to my own email, but this mostly addresses the concerns that Peter
> >> has raised.
> >> 
> >> So I ran the test below on my Haswell (x86), which showed two things:
> >> 
> >> 1. Accessing an address using a clean PTE or old PTE takes ~500 cycles
> >> more than with dirty+young (depending on the access, of course: dirty
> >> does not matter for read, dirty+young both matter for write).
> >> 
> >> 2. I made a mistake in my implementation. PTEs are - at least on x86 -
> >> created as young with mk_pte(). So the logic should be similar to
> >> do_set_pte():
> >> 
> >> if (prefault && arch_wants_old_prefaulted_pte())
> >> entry = pte_mkold(entry);
> >> else
> >> entry = pte_sw_mkyoung(entry);
> >> 
> >> Based on these results, I will send another version for both young and
> >> dirty. Let me know if these results are not convincing.
> > 
> > Thanks for trying to verify this idea, but I'm not fully sure this is what
> > my concern was on WRITE_LIKELY.
> > 
> > AFAICT the test below was trying to measure the overhead of hardware
> > setting either access or dirty or both bits when they're not set for
> > read/write.
> 
> Indeed.
> 
> > 
> > What I wanted as a justification is whether WRITE_LIKELY would be helpful
> > in any real world scenario at all. AFAIK the only way to prove it so far
> > is to measure any tlb flush difference (probably only on x86, since that
> > tlb code is only compiled on x86) that may trigger with W=0,D=1 but may not
> > trigger with W=0,D=0 (where W stands for "write bit", and D stands for
> > "dirty bit").
> > 
> > It's not about the slowness when D is cleared.
> > 
> > The core thing is (sorry to rephrase, but just hope we're on the same page)
> > we'll set D bit always for all uffd pages so far. Even if we want to
> > change that behavior so we skip setting D bit for RO pages (we'll need to
> > convert the dirty bit into PageDirty though), we'll still always set D bit
> > for writable pages. So we always set D bit as long as possible and we'll
> > never suffer from hardware overhead on setting D bit for uffd pages.
> 
> Thanks as usual for your clarifications. As you see, I also do my best to be
> on the same page with, even if from time to time I fail. I had some recent
> communication challenges on lkml, so I hope that you understand that
> everything I say is said with full respect, and if I use double-quotes while
> arguing with you, it is in good spirit, and I really appreciate your
> feedback.
> 
> Ok. So there is a lot to digest in what you just said, and I politely
> disagree with some of the assertions that you made. You focus on discussing
> the issue of whether we set the dirty bit for RO pages, which in my opinion
> is so intuitively wrong. But, I think that discussing this issue really
> digress us from the benefits of not setting the D-bit when it is unnecessary
> for RW pages, which is the main question.
> 
> But before we get to it, I want to argue with some of the “facts” that you
> present:
> 
> 1. "D bit always for all uffd pages” - This is true for almost all UFFD
> operations, but not true to write-unprotect. The moment we use David’s
> MM_CP_TRY_CHANGE_WRITABLE in UFFD, the PTE would be writable but not dirty.
> [Yes, the patches that I sent do not deal with that: as I noted before, I
> want to send it as part of v2.] Arguably, one of the places that setting the
> D-bit matters the most if when you change PTE from RO->RW. And anyhow, as
> you can see the API is inconsistent.

Yes.  This point is WRITE_HINT irrelevant, afaict, but I fully agree we
should fix it.  It'll be great if you'll cover that upon David's patch.

[1]

> 
> 2. "we'll still always set D bit for writable pages” - To continue my
> previous bullet - why? Why would we? Besides uffd-wrunprotect path, why
> would we always set it for MCOPY_CONTINUE? (more to follow on this one)

I wanted to mean that we used to always set both D bits when W=1.  IIUC
that applies to not only uffd but any general pgtable accesses.  E.g. in
do_anonymous_page() we'll set both D if W=1.  Same to anywhere I can find
across the kernel.

> 
> 3. "measure any tlb flush difference … that may trigger with W=0,D=1 but may
> not trigger with W=0,D=0” - This is really a boring case, which even if was
> underoptimized could have been resolved by its own. This is certainly not
> the motivation for the write hints.

Hmm?

I don't see what benefit we can get from the new WRITE_HINT besides
avoiding tlb flushes, or do we?  See right below..

> 
> 
> So please allow me to go back to the reasons for why you want write-hints,
> and RO entries would be mostly left out and implicitly included in the
> other arguments - which are mainly about RW entries:
> 
> 1. You can avoid TLB flushes when write-protecting clean writable PTEs (on
> x86). Such PTEs can be created when userspace monitor prefaults or
> speculatively write-unprotects memory that might be needed. When a monitor
> removes the mapping, using MADV_DONTNEED, it would not need to flush
> anything.

Yep, but afaict this is the "avoid tlb flush benefit" we talked above..

> 
> 2. Hopefully you agree that write-hints are needed if during
> uffd-write-unprotect

I think I agree on the unprotect above [1] on fixing D bit set if
unprotected.  If with that fix then I think whether WRITE_HINT would help
or not here for unprotect is the same as whether WRITE_HINT would help in
other cases like UFFDIO_COPY.  So these are two problems to me:

  (1) Whether we should fix change_pte_range() to set D properly when
      unprotect (W=1), again I fully agree.

  (2) Whether WRITE_HINT helps for unprotect is to be discussed here, and
      we're trying to reach a consensus..

> we actually write-unprotect the PTE (not just clearing
> the logical flag). So you do need write-hint for zero-page (to get a
> clear-page) and for write-unprotect, so why not to be consistent and provide
> it for all operations?

I always agree on the zeropage case.  But IMHO that's the only special case
here.

> 
> 3. For UFFDIO_CONTINUE. Admittedly, I am not very familiar with
> UFFDIO_CONTINUE, but from the discussion (and skimming the code) I
> understood that you can be used for prefetching. In such case, why would you
> assume that the page is dirty?

Again, I think it's the same as when we faulting in a normal anonymous
page: when we grant write bit we always assume it's dirty.  Yes it may not
really be written, but don't you think that's a more common question to ask
rather than uffd only?  IOW, why we always set D bit in do_anonymous_page()
even if there's possibility that the page may not be written at all?

Actually I'd say we'll have problem if we don't set dirty, because then we
won't be able to track soft-dirty anymore just like below - we need an
explicit page fault to trap soft-dirty.  With W=1, how do we do that?

> 
> 4. It allows you to treat softdirty properly. If softdirty is on,
> presumably, if you did not get a WRITE_HINT, you would keep it
> writeprotected.

Sorry I don't get why WRITE_HINT helps soft-dirty.  As I mentioned before,
I think it will start to complicate soft-dirty rather than making it
better.  Basically when the hints are enabled in feature bits and when the
user has !WRITE_HINT, we'll need to wr-protect the page for soft-dirty if
it's enabled.  We don't need that extra complexity if we don't apply
WRITE_HINT outside ZEROPAGE.  That's really even more than "hardware
setting D bit overhead" because it'll be a full path page fault just to
trap soft-dirty.

I am not sure we're really on the same page here, but I think I'll just
wait and read your code if it's coming and comment there if I see anything
wrong.  Maybe that'll be more straightforward.

> 
> 5. Consistency with UFFDIO_ZERO that needs a write-hint to clear a page.

This one is actually point 2 above.  Taken.

> 
> Now I will “kill myself” over support of write-hints. Not everything that
> I mentioned here is in v1 that I sent.
> 
> But, if you decide you do not want write-hints, I would still need to leave
> the UFFDIO_ZERO write-hints (that clear the page) and to find some solution
> for UFFDIO_WRITEPROTECT (that should set the dirty-bit for better performance
> of WP page-fault handling). 
> 
> If you decide you do want it, I would run some tests to check that indeed
> the access-time of UFFDIO_WRITEPROTECT reflects the fact no TLB flush was
> needed.
> 
> > The other worry of having WRITE_HINT is, after we have it we probably need
> > to _not_ apply dirty bit when WRITE_HINT is not set (which is actually a
> > very light ABI change since we used to always set it), then I'll start to
> > worry the hardware setting D bit overhead you just measured because we'll
> > have that overhead when user didn't specify WRITE_HINT with the old code.
> 
> Good point, which I have already got to. So, because I was mistaken on the
> ACCESS_HINT understanding (young-bit is set by default), this would mean
> that the hints should only be regarded when the “hints” feature is enabled
> for backward compatibility. I got this code ready for v2.

Sounds good.

> 
> > 
> > So again, I'm totally fine if you want to start with ACCESS_HINT only, but
> > I still don't see why we should need WRITE_HINT too..
> 
> I really hate how the “features” evolve that I think it is better to decide
> now.
> 
> I think that in the lack of agreement, the best thing to do is to put all of
> the write-hints now in the API, and only regard them for UFFDIO_ZERO (which
> would clear the page) and UFFDIO_WRITE(un)PROTECT (that would set the old
> bit).

Yes let's fix unprotect on D bit first, because that's inconsistent.
Please put it as the 1st patch if you plan to post them together, I think
we should have it even earlier if not together with the uffd-hint series.

Then, there's one thing we can do as you said - we can introduce WRITE_HINT
but only apply it to ZEROPAGE only (note: I still don't see why unprotect
is special here, it just has one perf bug to fix).  It can be bound to
ACCESS_HINT with the same feature bit then we think about whether we want
to extend it to other uffd ioctls besides ZEROPAGE.  That's probably a good
split on the differences.

> 
> Again, thanks for the feedback, and hopefully (at least) I understood you
> this time.

What you suggested at the end sounds good to me, thanks for being
persistent.  The only thing uncertain seems to be whether we need
WRITE_HINT for UFFDIO_WRITEPROTECT, all the rest sounds a good plan.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-07-13 16:02         ` Peter Xu
@ 2022-07-13 16:49           ` Nadav Amit
  0 siblings, 0 replies; 34+ messages in thread
From: Nadav Amit @ 2022-07-13 16:49 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport

On Jul 13, 2022, at 9:02 AM, Peter Xu <peterx@redhat.com> wrote:

> On Tue, Jul 12, 2022 at 06:09:35PM -0700, Nadav Amit wrote:
>> On Jul 12, 2022, at 7:56 AM, Peter Xu <peterx@redhat.com> wrote:
> 
> 
> What you suggested at the end sounds good to me, thanks for being
> persistent. The only thing uncertain seems to be whether we need
> WRITE_HINT for UFFDIO_WRITEPROTECT, all the rest sounds a good plan.

Peter, thanks for the detailed feedback.

I understand and agree with most of what you wrote. I think we still have
the mprotect()/UFFDIO_WRITE(un)PROTECT issue, which I either do not
understand your position or disagree with it. :)

Anyhow, instead of too many words, I’ll send v2 based on your feedback,
which would help to refine the behavior and clear any misunderstanding
we might have.



^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2022-07-13 16:49 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-22 18:50 [PATCH v1 0/5] userfaultfd: support access/write hints Nadav Amit
2022-06-22 18:50 ` [PATCH v1 1/5] userfaultfd: introduce uffd_flags Nadav Amit
2022-06-23 21:57   ` Peter Xu
2022-06-23 22:04     ` Nadav Amit
2022-06-22 18:50 ` [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations Nadav Amit
2022-06-23 23:24   ` Peter Xu
2022-06-23 23:35     ` Nadav Amit
2022-06-23 23:49       ` Peter Xu
2022-06-24  0:03         ` Nadav Amit
2022-06-24  2:05           ` Peter Xu
2022-06-24  2:42             ` Nadav Amit
2022-06-24 21:58               ` Peter Xu
2022-06-24 22:17                 ` Peter Xu
2022-06-25  7:49                   ` Nadav Amit
2022-06-27 13:12                     ` Peter Xu
2022-06-27 13:27                       ` David Hildenbrand
2022-06-27 14:59                         ` Peter Xu
2022-06-27 23:37                       ` Nadav Amit
2022-06-28 10:55                         ` David Hildenbrand
2022-06-28 19:15                         ` Peter Xu
2022-06-28 20:30                           ` Nadav Amit
2022-06-28 20:56                             ` Peter Xu
2022-06-28 21:03                               ` Nadav Amit
2022-06-28 21:12                                 ` Peter Xu
2022-06-28 21:15                                   ` Nadav Amit
2022-07-12  6:19   ` Nadav Amit
2022-07-12 14:56     ` Peter Xu
2022-07-13  1:09       ` Nadav Amit
2022-07-13 16:02         ` Peter Xu
2022-07-13 16:49           ` Nadav Amit
2022-06-22 18:50 ` [PATCH v1 3/5] userfaultfd: introduce write-likely mode for uffd operations Nadav Amit
2022-06-22 18:50 ` [PATCH v1 4/5] userfaultfd: zero access/write hints Nadav Amit
2022-06-23 23:34   ` Peter Xu
2022-06-22 18:50 ` [PATCH v1 5/5] selftest/userfaultfd: test read/write hints Nadav Amit

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.