* [PATCH v2 0/5] userfaultfd: support access/write hints
@ 2022-07-18 11:47 Nadav Amit
  2022-07-18 11:47 ` [PATCH v2 2/5] userfaultfd: introduce access-likely mode for common operations Nadav Amit
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: Nadav Amit @ 2022-07-18 11:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Nadav Amit, Mike Kravetz, Hugh Dickins,
	Axel Rasmussen, Peter Xu, David Hildenbrand, Mike Rapoport

From: Nadav Amit <namit@vmware.com>

This patch-set introduces access/write hints for userfaultfd. Unlike the
previous versions, the use of these hints in this version is limited.
Yet, to avoid introducing new features again and again, hints are
introduced for all uffd-related ioctls.

The access-hint is currently used to set the young bit, similarly to
do_set_pte(). This has no effect on x86, but may have one on arm64.

When a write-hint is provided on zeropage ioctl, a clear page is
allocated instead of mapping the zero-page.

Future patches would use the write-hint to decide whether to map the
writable pages on write-(un)protect ioctl.

Setting the access-bit and dirty-bit introduces a tradeoff. When a bit
is set, access/write is faster, but memory reclamation might be slower.
Currently, in the common userfaultfd cases the access-bit is not set
and the dirty-bit is set. This behavior is questionable.

Allow userspace to control this behavior through access-likely and
write-likely hints. These hints are used to control the access- and
dirty-bits. For zero-pages with the write-likely hint, allocate a clear
page instead of mapping the zero-page.

v1 -> v2:
* Leave dirty-bit as it was before [Peter Xu]

RFCv2 -> v1:
* Adding hints to zeropage and continue
* Fixing other issues pointed out by David H. & Peter Xu
* Adding tests to ./run_vmtests.sh

Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>

Nadav Amit (5):
  userfaultfd: introduce uffd_flags
  userfaultfd: introduce access-likely mode for common operations
  userfaultfd: introduce write-likely mode for uffd operations
  userfaultfd: zero access/write hints
  selftest/userfaultfd: test read/write hints

 fs/userfaultfd.c                          |  77 ++++++++++++++--
 include/linux/hugetlb.h                   |   4 +-
 include/linux/shmem_fs.h                  |   8 +-
 include/linux/userfaultfd_k.h             |  26 ++++--
 include/uapi/linux/userfaultfd.h          |  31 ++++++-
 mm/hugetlb.c                              |   3 +-
 mm/internal.h                             |  13 +++
 mm/memory.c                               |  12 ---
 mm/shmem.c                                |   6 +-
 mm/userfaultfd.c                          | 103 +++++++++++++++-------
 tools/testing/selftests/vm/run_vmtests.sh |  16 ++--
 tools/testing/selftests/vm/userfaultfd.c  |  54 +++++++++++-
 12 files changed, 275 insertions(+), 78 deletions(-)

-- 
2.25.1




* [PATCH v2 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-07-18 11:47 [PATCH v2 0/5] userfaultfd: support access/write hints Nadav Amit
@ 2022-07-18 11:47 ` Nadav Amit
  2022-07-18 20:05   ` Peter Xu
  2022-07-23  9:16   ` Mike Rapoport
  2022-07-18 11:47 ` [PATCH v2 3/5] userfaultfd: introduce write-likely mode for uffd operations Nadav Amit
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 20+ messages in thread
From: Nadav Amit @ 2022-07-18 11:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Nadav Amit, Mike Kravetz, Hugh Dickins,
	Axel Rasmussen, Peter Xu, David Hildenbrand, Mike Rapoport

From: Nadav Amit <namit@vmware.com>

Introduce access-hints in userfaultfd. The expectation is that userspace
sets the access-hint when it resolves a page-fault on a page and omits
the access-hint for prefaulted memory. The exact behavior of the kernel
with regard to the hints is not part of the userfaultfd API.
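
For illustration only, a minimal sketch of how a userspace monitor might
pick the UFFDIO_COPY mode under the expectation above. The constant
values are copied from the uapi hunk in this patch (they are not in any
released header), and copy_mode_for() is a hypothetical helper, not
something this series adds:

```c
#include <stdbool.h>
#include <stdint.h>

/* Values taken from this patch's include/uapi/linux/userfaultfd.h hunk. */
#define UFFDIO_COPY_MODE_WP            ((uint64_t)1 << 1)
#define UFFDIO_COPY_MODE_ACCESS_LIKELY ((uint64_t)1 << 2)

/*
 * Hypothetical helper: build the mode for one UFFDIO_COPY request.
 * resolving_fault is true when the copy answers a reported page-fault,
 * false when the monitor is prefaulting memory nobody has touched yet.
 */
static uint64_t copy_mode_for(bool resolving_fault, bool write_protect)
{
	uint64_t mode = 0;

	if (write_protect)
		mode |= UFFDIO_COPY_MODE_WP;
	if (resolving_fault)
		mode |= UFFDIO_COPY_MODE_ACCESS_LIKELY;

	return mode;
}
```

The resulting value would go into uffdio_copy.mode, and only has the
intended meaning once UFFD_FEATURE_ACCESS_HINTS was requested through
UFFDIO_API.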

At this time the access-hint is only used to set the access-bit,
similarly to the way it is done in do_set_pte(). On x86, PTEs are
currently always marked as young, including prefaulted ones. But on
arm64, PTEs would be marked as old (when the access bit is supported).

If access hints are not enabled, the kernel would behave as if the
access-hint was provided for backward compatibility.
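
The two rules above (default-to-hint when the feature is off, and old
PTEs only for prefaults on architectures that want them) can be modeled
in plain C. This is a userspace sketch that mirrors the hunks below, not
kernel code; copy_flags() and pte_installed_young() are illustrative
names, and arch_wants_old stands in for arch_wants_old_prefaulted_pte():

```c
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the hunks in this patch. */
#define UFFD_FEATURE_ACCESS_HINTS      ((uint64_t)1 << 13)
#define UFFDIO_COPY_MODE_ACCESS_LIKELY ((uint64_t)1 << 2)
#define UFFD_FLAGS_NONE                0u
#define UFFD_FLAGS_WP                  (1u << 0)
#define UFFD_FLAGS_ACCESS_LIKELY       (1u << 1)

/* Model of the mode -> uffd_flags translation in userfaultfd_copy(). */
static unsigned int copy_flags(uint64_t features, uint64_t mode,
			       unsigned int base)
{
	unsigned int flags = base;

	if (features & UFFD_FEATURE_ACCESS_HINTS) {
		/* Hints enabled: honor exactly what userspace passed. */
		if (mode & UFFDIO_COPY_MODE_ACCESS_LIKELY)
			flags |= UFFD_FLAGS_ACCESS_LIKELY;
	} else {
		/* Hints not enabled: behave as if the hint was given. */
		flags |= UFFD_FLAGS_ACCESS_LIKELY;
	}
	return flags;
}

/* Model of the young/old choice in mfill_atomic_install_pte(). */
static bool pte_installed_young(unsigned int flags, bool arch_wants_old)
{
	bool prefault = !(flags & UFFD_FLAGS_ACCESS_LIKELY);

	return !(prefault && arch_wants_old);
}
```

With the feature disabled the PTE always ends up young in this model,
which is the backward-compatible case described above.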

Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 fs/userfaultfd.c                 | 39 ++++++++++++++++++++++++++++----
 include/linux/userfaultfd_k.h    |  1 +
 include/uapi/linux/userfaultfd.h | 20 +++++++++++++++-
 mm/internal.h                    | 13 +++++++++++
 mm/memory.c                      | 12 ----------
 mm/userfaultfd.c                 | 11 +++++++--
 6 files changed, 77 insertions(+), 19 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 2ae24327beec..8d8792b27c53 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1708,13 +1708,21 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 	ret = -EINVAL;
 	if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
 		goto out;
-	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP))
+	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP|
+				 UFFDIO_COPY_MODE_ACCESS_LIKELY))
 		goto out;
 
 	mode_wp = uffdio_copy.mode & UFFDIO_COPY_MODE_WP;
 
 	uffd_flags = mode_wp ? UFFD_FLAGS_WP : UFFD_FLAGS_NONE;
 
+	if (ctx->features & UFFD_FEATURE_ACCESS_HINTS) {
+		if (uffdio_copy.mode & UFFDIO_COPY_MODE_ACCESS_LIKELY)
+			uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+	} else {
+		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+	}
+
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
 				   uffdio_copy.len, &ctx->mmap_changing,
@@ -1765,9 +1773,17 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
 	if (ret)
 		goto out;
 	ret = -EINVAL;
-	if (uffdio_zeropage.mode & ~UFFDIO_ZEROPAGE_MODE_DONTWAKE)
+	if (uffdio_zeropage.mode & ~(UFFDIO_ZEROPAGE_MODE_DONTWAKE|
+				     UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY))
 		goto out;
 
+	if (ctx->features & UFFD_FEATURE_ACCESS_HINTS) {
+		if (uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY)
+			uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+	} else {
+		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+	}
+
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mfill_zeropage(ctx->mm, uffdio_zeropage.range.start,
 				     uffdio_zeropage.range.len,
@@ -1817,7 +1833,8 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 		return ret;
 
 	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
-			       UFFDIO_WRITEPROTECT_MODE_WP))
+			       UFFDIO_WRITEPROTECT_MODE_WP |
+			       UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY))
 		return -EINVAL;
 
 	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
@@ -1827,6 +1844,12 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 		return -EINVAL;
 
 	uffd_flags = mode_wp ? UFFD_FLAGS_WP : UFFD_FLAGS_NONE;
+	if (ctx->features & UFFD_FEATURE_ACCESS_HINTS) {
+		if (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY)
+			uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+	} else {
+		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+	}
 
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
@@ -1879,9 +1902,17 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
 	    uffdio_continue.range.start) {
 		goto out;
 	}
-	if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE)
+	if (uffdio_continue.mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE|
+				     UFFDIO_CONTINUE_MODE_ACCESS_LIKELY))
 		goto out;
 
+	if (ctx->features & UFFD_FEATURE_ACCESS_HINTS) {
+		if (uffdio_continue.mode & UFFDIO_CONTINUE_MODE_ACCESS_LIKELY)
+			uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+	} else {
+		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+	}
+
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mcopy_continue(ctx->mm, uffdio_continue.range.start,
 				     uffdio_continue.range.len,
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index a63b61823984..b326798b5677 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -59,6 +59,7 @@ typedef unsigned int __bitwise uffd_flags_t;
 
 #define UFFD_FLAGS_NONE			((__force uffd_flags_t)0)
 #define UFFD_FLAGS_WP			((__force uffd_flags_t)BIT(0))
+#define UFFD_FLAGS_ACCESS_LIKELY	((__force uffd_flags_t)BIT(1))
 
 extern int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 				    struct vm_area_struct *dst_vma,
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 7d32b1e797fb..02e0c1f56939 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -34,7 +34,8 @@
 			   UFFD_FEATURE_MINOR_HUGETLBFS |	\
 			   UFFD_FEATURE_MINOR_SHMEM |		\
 			   UFFD_FEATURE_EXACT_ADDRESS |		\
-			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM)
+			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM |	\
+			   UFFD_FEATURE_ACCESS_HINTS)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -199,6 +200,9 @@ struct uffdio_api {
 	 *
 	 * UFFD_FEATURE_WP_HUGETLBFS_SHMEM indicates that userfaultfd
 	 * write-protection mode is supported on both shmem and hugetlbfs.
+	 *
+	 * UFFD_FEATURE_ACCESS_HINTS indicates that the ioctl operations
+	 * support the UFFDIO_*_MODE_ACCESS_LIKELY hints.
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
@@ -213,6 +217,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_MINOR_SHMEM		(1<<10)
 #define UFFD_FEATURE_EXACT_ADDRESS		(1<<11)
 #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM		(1<<12)
+#define UFFD_FEATURE_ACCESS_HINTS		(1<<13)
 	__u64 features;
 
 	__u64 ioctls;
@@ -247,8 +252,14 @@ struct uffdio_copy {
 	 * the fly.  UFFDIO_COPY_MODE_WP is available only if the
 	 * write protected ioctl is implemented for the range
 	 * according to the uffdio_register.ioctls.
+	 *
+	 * UFFDIO_COPY_MODE_ACCESS_LIKELY provides a hint to the kernel that the
+	 * page is likely to be accessed in the near future. Providing the hint
+	 * properly can improve performance.
+	 *
 	 */
 #define UFFDIO_COPY_MODE_WP			((__u64)1<<1)
+#define UFFDIO_COPY_MODE_ACCESS_LIKELY		((__u64)1<<2)
 	__u64 mode;
 
 	/*
@@ -261,6 +272,7 @@ struct uffdio_copy {
 struct uffdio_zeropage {
 	struct uffdio_range range;
 #define UFFDIO_ZEROPAGE_MODE_DONTWAKE		((__u64)1<<0)
+#define UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY	((__u64)1<<1)
 	__u64 mode;
 
 	/*
@@ -280,6 +292,10 @@ struct uffdio_writeprotect {
  * UFFDIO_WRITEPROTECT_MODE_DONTWAKE: set the flag to avoid waking up
  * any wait thread after the operation succeeds.
  *
+ * UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY provides a hint to the kernel
+ * that the page is likely to be accessed in the near future. Providing
+ * the hint properly can improve performance.
+ *
  * NOTE: Write protecting a region (WP=1) is unrelated to page faults,
  * therefore DONTWAKE flag is meaningless with WP=1.  Removing write
  * protection (WP=0) in response to a page fault wakes the faulting
@@ -287,12 +303,14 @@ struct uffdio_writeprotect {
  */
 #define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
 #define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
+#define UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY	((__u64)1<<2)
 	__u64 mode;
 };
 
 struct uffdio_continue {
 	struct uffdio_range range;
 #define UFFDIO_CONTINUE_MODE_DONTWAKE		((__u64)1<<0)
+#define UFFDIO_CONTINUE_MODE_ACCESS_LIKELY	((__u64)1<<1)
 	__u64 mode;
 
 	/*
diff --git a/mm/internal.h b/mm/internal.h
index c0f8fbe0445b..d035b77b4f2f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -12,6 +12,7 @@
 #include <linux/pagemap.h>
 #include <linux/rmap.h>
 #include <linux/tracepoint-defs.h>
+#include <linux/pgtable.h>
 
 struct folio_batch;
 
@@ -861,4 +862,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags);
 
 DECLARE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
 
+#ifndef arch_wants_old_prefaulted_pte
+static inline bool arch_wants_old_prefaulted_pte(void)
+{
+	/*
+	 * Transitioning a PTE from 'old' to 'young' can be expensive on
+	 * some architectures, even if it's performed in hardware. By
+	 * default, "false" means prefaulted entries will be 'young'.
+	 */
+	return false;
+}
+#endif
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memory.c b/mm/memory.c
index 580c62febe42..31ec3f0071a2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -137,18 +137,6 @@ static inline bool arch_faults_on_old_pte(void)
 }
 #endif
 
-#ifndef arch_wants_old_prefaulted_pte
-static inline bool arch_wants_old_prefaulted_pte(void)
-{
-	/*
-	 * Transitioning a PTE from 'old' to 'young' can be expensive on
-	 * some architectures, even if it's performed in hardware. By
-	 * default, "false" means prefaulted entries will be 'young'.
-	 */
-	return false;
-}
-#endif
-
 static int __init disable_randmaps(char *s)
 {
 	randomize_va_space = 0;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 421784d26651..c15679f3eb6a 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -65,6 +65,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	bool writable = dst_vma->vm_flags & VM_WRITE;
 	bool vm_shared = dst_vma->vm_flags & VM_SHARED;
 	bool page_in_cache = page->mapping;
+	bool prefault = !(uffd_flags & UFFD_FLAGS_ACCESS_LIKELY);
 	spinlock_t *ptl;
 	struct inode *inode;
 	pgoff_t offset, max_off;
@@ -92,6 +93,11 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 		 */
 		_dst_pte = pte_wrprotect(_dst_pte);
 
+	if (prefault && arch_wants_old_prefaulted_pte())
+		_dst_pte = pte_mkold(_dst_pte);
+	else
+		_dst_pte = pte_sw_mkyoung(_dst_pte);
+
 	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
 
 	if (vma_is_shmem(dst_vma)) {
@@ -202,7 +208,8 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 static int mfill_zeropage_pte(struct mm_struct *dst_mm,
 			      pmd_t *dst_pmd,
 			      struct vm_area_struct *dst_vma,
-			      unsigned long dst_addr)
+			      unsigned long dst_addr,
+			      uffd_flags_t uffd_flags)
 {
 	pte_t _dst_pte, *dst_pte;
 	spinlock_t *ptl;
@@ -495,7 +502,7 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 					       uffd_flags);
 		else
 			err = mfill_zeropage_pte(dst_mm, dst_pmd,
-						 dst_vma, dst_addr);
+						 dst_vma, dst_addr, uffd_flags);
 	} else {
 		err = shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma,
 					     dst_addr, src_addr,
-- 
2.25.1




* [PATCH v2 3/5] userfaultfd: introduce write-likely mode for uffd operations
  2022-07-18 11:47 [PATCH v2 0/5] userfaultfd: support access/write hints Nadav Amit
  2022-07-18 11:47 ` [PATCH v2 2/5] userfaultfd: introduce access-likely mode for common operations Nadav Amit
@ 2022-07-18 11:47 ` Nadav Amit
  2022-07-18 20:12   ` Peter Xu
  2022-07-18 11:47 ` [PATCH v2 4/5] userfaultfd: zero access/write hints Nadav Amit
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 20+ messages in thread
From: Nadav Amit @ 2022-07-18 11:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Nadav Amit, Mike Kravetz, Hugh Dickins,
	Axel Rasmussen, Peter Xu, David Hildenbrand, Mike Rapoport

From: Nadav Amit <namit@vmware.com>

Introduce write-likely hints for uffd. These hints would be used in a
future patch to decide whether to attempt to map pages as writable in
the page-table or to only mark them logically as writable. This lets
userspace choose between faster access to a page and the possibility of
removing the page later without writeback and a TLB flush.

Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 fs/userfaultfd.c                 | 32 ++++++++++++++++++++++++--------
 include/linux/userfaultfd_k.h    |  1 +
 include/uapi/linux/userfaultfd.h | 13 ++++++++++++-
 3 files changed, 37 insertions(+), 9 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 8d8792b27c53..3027d228550a 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1709,7 +1709,8 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 	if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
 		goto out;
 	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP|
-				 UFFDIO_COPY_MODE_ACCESS_LIKELY))
+				 UFFDIO_COPY_MODE_ACCESS_LIKELY|
+				 UFFDIO_COPY_MODE_WRITE_LIKELY))
 		goto out;
 
 	mode_wp = uffdio_copy.mode & UFFDIO_COPY_MODE_WP;
@@ -1719,8 +1720,11 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 	if (ctx->features & UFFD_FEATURE_ACCESS_HINTS) {
 		if (uffdio_copy.mode & UFFDIO_COPY_MODE_ACCESS_LIKELY)
 			uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+		if (uffdio_copy.mode & UFFDIO_COPY_MODE_WRITE_LIKELY)
+			uffd_flags |= UFFD_FLAGS_WRITE_LIKELY;
 	} else {
-		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY |
+			      UFFD_FLAGS_WRITE_LIKELY;
 	}
 
 	if (mmget_not_zero(ctx->mm)) {
@@ -1774,14 +1778,18 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
 		goto out;
 	ret = -EINVAL;
 	if (uffdio_zeropage.mode & ~(UFFDIO_ZEROPAGE_MODE_DONTWAKE|
-				     UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY))
+				     UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY|
+				     UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY))
 		goto out;
 
 	if (ctx->features & UFFD_FEATURE_ACCESS_HINTS) {
 		if (uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY)
 			uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+		if (uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY)
+			uffd_flags |= UFFD_FLAGS_WRITE_LIKELY;
 	} else {
-		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY |
+			      UFFD_FLAGS_WRITE_LIKELY;
 	}
 
 	if (mmget_not_zero(ctx->mm)) {
@@ -1834,7 +1842,8 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 
 	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
 			       UFFDIO_WRITEPROTECT_MODE_WP |
-			       UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY))
+			       UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY |
+			       UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY))
 		return -EINVAL;
 
 	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
@@ -1847,8 +1856,11 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 	if (ctx->features & UFFD_FEATURE_ACCESS_HINTS) {
 		if (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY)
 			uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+		if (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY)
+			uffd_flags |= UFFD_FLAGS_WRITE_LIKELY;
 	} else {
-		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY |
+			      UFFD_FLAGS_WRITE_LIKELY;
 	}
 
 	if (mmget_not_zero(ctx->mm)) {
@@ -1903,14 +1915,18 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
 		goto out;
 	}
 	if (uffdio_continue.mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE|
-				     UFFDIO_CONTINUE_MODE_ACCESS_LIKELY))
+				     UFFDIO_CONTINUE_MODE_ACCESS_LIKELY|
+				     UFFDIO_CONTINUE_MODE_WRITE_LIKELY))
 		goto out;
 
 	if (ctx->features & UFFD_FEATURE_ACCESS_HINTS) {
 		if (uffdio_continue.mode & UFFDIO_CONTINUE_MODE_ACCESS_LIKELY)
 			uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+		if (uffdio_continue.mode & UFFDIO_CONTINUE_MODE_WRITE_LIKELY)
+			uffd_flags |= UFFD_FLAGS_WRITE_LIKELY;
 	} else {
-		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY |
+			      UFFD_FLAGS_WRITE_LIKELY;
 	}
 
 	if (mmget_not_zero(ctx->mm)) {
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index b326798b5677..4968c86938b2 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -60,6 +60,7 @@ typedef unsigned int __bitwise uffd_flags_t;
 #define UFFD_FLAGS_NONE			((__force uffd_flags_t)0)
 #define UFFD_FLAGS_WP			((__force uffd_flags_t)BIT(0))
 #define UFFD_FLAGS_ACCESS_LIKELY	((__force uffd_flags_t)BIT(1))
+#define UFFD_FLAGS_WRITE_LIKELY		((__force uffd_flags_t)BIT(2))
 
 extern int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 				    struct vm_area_struct *dst_vma,
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 02e0c1f56939..f52cbe4c9c44 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -202,7 +202,7 @@ struct uffdio_api {
 	 * write-protection mode is supported on both shmem and hugetlbfs.
 	 *
 	 * UFFD_FEATURE_ACCESS_HINTS indicates that the ioctl operations
-	 * support the UFFDIO_*_MODE_ACCESS_LIKELY hints.
+	 * support the UFFDIO_*_MODE_[ACCESS|WRITE]_LIKELY hints.
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
@@ -257,9 +257,13 @@ struct uffdio_copy {
 	 * page is likely to be accessed in the near future. Providing the hint
 	 * properly can improve performance.
 	 *
+	 * UFFDIO_COPY_MODE_WRITE_LIKELY provides a hint to the kernel that the
+	 * page is likely to be written in the near future. Providing the hint
+	 * properly can improve performance.
 	 */
 #define UFFDIO_COPY_MODE_WP			((__u64)1<<1)
 #define UFFDIO_COPY_MODE_ACCESS_LIKELY		((__u64)1<<2)
+#define UFFDIO_COPY_MODE_WRITE_LIKELY		((__u64)1<<3)
 	__u64 mode;
 
 	/*
@@ -273,6 +277,7 @@ struct uffdio_zeropage {
 	struct uffdio_range range;
 #define UFFDIO_ZEROPAGE_MODE_DONTWAKE		((__u64)1<<0)
 #define UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY	((__u64)1<<1)
+#define UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY	((__u64)1<<2)
 	__u64 mode;
 
 	/*
@@ -296,6 +301,10 @@ struct uffdio_writeprotect {
  * that the page is likely to be accessed in the near future. Providing
  * the hint properly can improve performance.
  *
+ * UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY: provides a hint to the kernel
+ * that the page is likely to be written in the near future. Providing
+ * the hint properly can improve performance.
+ *
  * NOTE: Write protecting a region (WP=1) is unrelated to page faults,
  * therefore DONTWAKE flag is meaningless with WP=1.  Removing write
  * protection (WP=0) in response to a page fault wakes the faulting
@@ -304,6 +313,7 @@ struct uffdio_writeprotect {
 #define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
 #define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
 #define UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY	((__u64)1<<2)
+#define UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY	((__u64)1<<3)
 	__u64 mode;
 };
 
@@ -311,6 +321,7 @@ struct uffdio_continue {
 	struct uffdio_range range;
 #define UFFDIO_CONTINUE_MODE_DONTWAKE		((__u64)1<<0)
 #define UFFDIO_CONTINUE_MODE_ACCESS_LIKELY	((__u64)1<<1)
+#define UFFDIO_CONTINUE_MODE_WRITE_LIKELY	((__u64)1<<2)
 	__u64 mode;
 
 	/*
-- 
2.25.1




* [PATCH v2 4/5] userfaultfd: zero access/write hints
  2022-07-18 11:47 [PATCH v2 0/5] userfaultfd: support access/write hints Nadav Amit
  2022-07-18 11:47 ` [PATCH v2 2/5] userfaultfd: introduce access-likely mode for common operations Nadav Amit
  2022-07-18 11:47 ` [PATCH v2 3/5] userfaultfd: introduce write-likely mode for uffd operations Nadav Amit
@ 2022-07-18 11:47 ` Nadav Amit
  2022-07-22  7:47   ` David Hildenbrand
  2022-07-18 11:47 ` [PATCH v2 5/5] selftest/userfaultfd: test read/write hints Nadav Amit
       [not found] ` <20220718114748.2623-2-namit@vmware.com>
  4 siblings, 1 reply; 20+ messages in thread
From: Nadav Amit @ 2022-07-18 11:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Nadav Amit, David Hildenbrand, Mike Kravetz,
	Hugh Dickins, Axel Rasmussen, Mike Rapoport, Peter Xu

From: Nadav Amit <namit@vmware.com>

When userfaultfd provides a zeropage in response to an ioctl, it maps a
read-only alias of the zero page. If the page is later written (which is
the likely scenario), a page-fault occurs and the page-fault handler
allocates a page and rewires the page-tables.

This is an expensive flow for cases in which a page is likely to be
written to. Users can use the copy ioctl to initialize a zero page (by
copying zeros), but this is also wasteful.

Allow userfaultfd users to efficiently map initialized zero-pages that
are writable. If UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY is provided, map a
clear page instead of an alias to the zero page.
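
As a rough model of the selection this patch adds to mfill_atomic_pte(),
the sketch below covers only the branch the hunk touches. The enum and
pick_fill_path() are illustrative names that do not exist in the kernel;
the flag values are the ones defined in patches 1-3 of this series:

```c
#include <stdbool.h>

/* Values from include/linux/userfaultfd_k.h as changed by this series. */
#define UFFD_FLAGS_WP           (1u << 0)
#define UFFD_FLAGS_WRITE_LIKELY (1u << 2)

/* Illustrative labels for the three helpers the branch can call. */
enum fill_path {
	FILL_COPY,	/* mcopy_atomic_pte()    */
	FILL_CLEARPAGE,	/* mfill_clearpage_pte() */
	FILL_ZEROPAGE,	/* mfill_zeropage_pte()  */
};

static enum fill_path pick_fill_path(bool zeropage, unsigned int uffd_flags)
{
	if (!zeropage)
		return FILL_COPY;

	/* A private clear page is only used when it will not be WP'd. */
	if (!(uffd_flags & UFFD_FLAGS_WP) &&
	    (uffd_flags & UFFD_FLAGS_WRITE_LIKELY))
		return FILL_CLEARPAGE;

	return FILL_ZEROPAGE;
}
```

In this model a zeropage request without the write-likely flag keeps the
existing shared zero-page mapping.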

Suggested-by: David Hildenbrand <david@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 mm/userfaultfd.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index c15679f3eb6a..954c6980b29f 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -241,6 +241,37 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm,
 	return ret;
 }
 
+static int mfill_clearpage_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
+			       struct vm_area_struct *dst_vma,
+			       unsigned long dst_addr,
+			       uffd_flags_t uffd_flags)
+{
+	struct page *page;
+	int ret;
+
+	ret = -ENOMEM;
+	page = alloc_zeroed_user_highpage_movable(dst_vma, dst_addr);
+	if (!page)
+		goto out;
+
+	/* The PTE is not marked as dirty unconditionally */
+	SetPageDirty(page);
+	__SetPageUptodate(page);
+
+	if (mem_cgroup_charge(page_folio(page), dst_vma->vm_mm, GFP_KERNEL))
+		goto out_release;
+
+	ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
+				       page, true, uffd_flags);
+	if (ret)
+		goto out_release;
+out:
+	return ret;
+out_release:
+	put_page(page);
+	goto out;
+}
+
 /* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */
 static int mcontinue_atomic_pte(struct mm_struct *dst_mm,
 				pmd_t *dst_pmd,
@@ -500,6 +531,10 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
 					       dst_addr, src_addr, page,
 					       uffd_flags);
+		else if (!(uffd_flags & UFFD_FLAGS_WP) &&
+			 (uffd_flags & UFFD_FLAGS_WRITE_LIKELY))
+			err = mfill_clearpage_pte(dst_mm, dst_pmd, dst_vma,
+						  dst_addr, uffd_flags);
 		else
 			err = mfill_zeropage_pte(dst_mm, dst_pmd,
 						 dst_vma, dst_addr, uffd_flags);
-- 
2.25.1




* [PATCH v2 5/5] selftest/userfaultfd: test read/write hints
  2022-07-18 11:47 [PATCH v2 0/5] userfaultfd: support access/write hints Nadav Amit
                   ` (2 preceding siblings ...)
  2022-07-18 11:47 ` [PATCH v2 4/5] userfaultfd: zero access/write hints Nadav Amit
@ 2022-07-18 11:47 ` Nadav Amit
       [not found] ` <20220718114748.2623-2-namit@vmware.com>
  4 siblings, 0 replies; 20+ messages in thread
From: Nadav Amit @ 2022-07-18 11:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Nadav Amit, Mike Kravetz, Hugh Dickins,
	Axel Rasmussen, Peter Xu, David Hildenbrand, Mike Rapoport

From: Nadav Amit <namit@vmware.com>

Test UFFDIO_*_MODE_ACCESS_LIKELY and UFFDIO_*_MODE_WRITE_LIKELY.
Introduce a modifier to trigger the use of the hints.

Add the test to run_vmtests.sh and add an array to run different
userfaultfd configurations.
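
To show the argument format the modifier uses (for example
"anon:access_likely:write_likely"), here is a standalone sketch of the
parsing. It is not the selftest code: struct uffd_test_arg and
parse_test_arg() are hypothetical, it splits with strchr() on a local
copy instead of strdup()/strsep(), and it returns -1 where the selftest
would call err():

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical result type; the selftest uses globals instead. */
struct uffd_test_arg {
	char type[32];
	bool access_likely;
	bool write_likely;
};

/* Parse "<type>[:access_likely][:write_likely]"; 0 on success, -1 on error. */
static int parse_test_arg(const char *raw, struct uffd_test_arg *out)
{
	char buf[128];
	char *cursor = buf;
	bool have_type = false;

	if (strlen(raw) >= sizeof(buf))
		return -1;
	strcpy(buf, raw);
	memset(out, 0, sizeof(*out));

	while (cursor) {
		char *token = cursor;
		char *sep = strchr(cursor, ':');

		/* Terminate the current token and advance past the ':'. */
		if (sep) {
			*sep = '\0';
			cursor = sep + 1;
		} else {
			cursor = NULL;
		}

		if (!have_type) {
			/* The first token is always the test type. */
			if (!*token || strlen(token) >= sizeof(out->type))
				return -1;
			strcpy(out->type, token);
			have_type = true;
		} else if (!strcmp(token, "access_likely")) {
			out->access_likely = true;
		} else if (!strcmp(token, "write_likely")) {
			out->write_likely = true;
		} else {
			return -1;	/* unrecognized modifier */
		}
	}
	return 0;
}
```

The modifiers can appear in any order after the type, which matches the
combinations listed in the uffd_mods array in run_vmtests.sh.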

Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 tools/testing/selftests/vm/run_vmtests.sh | 16 ++++---
 tools/testing/selftests/vm/userfaultfd.c  | 54 +++++++++++++++++++++--
 2 files changed, 62 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/vm/run_vmtests.sh b/tools/testing/selftests/vm/run_vmtests.sh
index 27c01c35c7a9..296862547ff9 100755
--- a/tools/testing/selftests/vm/run_vmtests.sh
+++ b/tools/testing/selftests/vm/run_vmtests.sh
@@ -120,11 +120,17 @@ run_test ./gup_test -a
 # Dump pages 0, 19, and 4096, using pin_user_pages:
 run_test ./gup_test -ct -F 0x1 0 19 0x1000
 
-run_test ./userfaultfd anon 20 16
-# Test requires source and destination huge pages.  Size of source
-# (half_ufd_size_MB) is passed as argument to test.
-run_test ./userfaultfd hugetlb "$half_ufd_size_MB" 32
-run_test ./userfaultfd shmem 20 16
+uffd_mods=("" ":access_likely" ":access_likely:write_likely" ":write_likely")
+
+for mod in "${uffd_mods[@]}"; do
+	run_test ./userfaultfd anon${mod} 20 16
+	# Hugetlb tests require source and destination huge pages. Pass in half the
+	# size ($half_ufd_size_MB), which is used for *each*.
+	run_test ./userfaultfd hugetlb${mod} "$half_ufd_size_MB" 32
+	run_test ./userfaultfd hugetlb_shared${mod} "$half_ufd_size_MB" 32 "$mnt"/uffd-test
+	rm -f "$mnt"/uffd-test
+	run_test ./userfaultfd shmem${mod} 20 16
+done
 
 #cleanup
 umount "$mnt"
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 7c3f1b0ab468..d54f65246bd8 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -85,6 +85,8 @@ static volatile bool test_uffdio_zeropage_eexist = true;
 static bool test_uffdio_wp = true;
 /* Whether to test uffd minor faults */
 static bool test_uffdio_minor = false;
+static bool test_access_likely;
+static bool test_write_likely;
 
 static bool map_shared;
 static int shm_fd;
@@ -518,6 +520,12 @@ static void wp_range(int ufd, __u64 start, __u64 len, bool wp)
 	/* Undo write-protect, do wakeup after that */
 	prms.mode = wp ? UFFDIO_WRITEPROTECT_MODE_WP : 0;
 
+	if (test_access_likely)
+		prms.mode |= UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY;
+
+	if (test_write_likely)
+		prms.mode |= UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY;
+
 	if (ioctl(ufd, UFFDIO_WRITEPROTECT, &prms))
 		err("clear WP failed: address=0x%"PRIx64, (uint64_t)start);
 }
@@ -531,6 +539,12 @@ static void continue_range(int ufd, __u64 start, __u64 len)
 	req.range.len = len;
 	req.mode = 0;
 
+	if (test_access_likely)
+		req.mode |= UFFDIO_CONTINUE_MODE_ACCESS_LIKELY;
+
+	if (test_write_likely)
+		req.mode |= UFFDIO_CONTINUE_MODE_WRITE_LIKELY;
+
 	if (ioctl(ufd, UFFDIO_CONTINUE, &req))
 		err("UFFDIO_CONTINUE failed for address 0x%" PRIx64,
 		    (uint64_t)start);
@@ -621,6 +635,13 @@ static int __copy_page(int ufd, unsigned long offset, bool retry)
 		uffdio_copy.mode = UFFDIO_COPY_MODE_WP;
 	else
 		uffdio_copy.mode = 0;
+
+	if (test_access_likely)
+		uffdio_copy.mode |= UFFDIO_COPY_MODE_ACCESS_LIKELY;
+
+	if (test_write_likely)
+		uffdio_copy.mode |= UFFDIO_COPY_MODE_WRITE_LIKELY;
+
 	uffdio_copy.copy = 0;
 	if (ioctl(ufd, UFFDIO_COPY, &uffdio_copy)) {
 		/* real retval in ufdio_copy.copy */
@@ -1048,6 +1069,13 @@ static int __uffdio_zeropage(int ufd, unsigned long offset, bool retry)
 	uffdio_zeropage.range.start = (unsigned long) area_dst + offset;
 	uffdio_zeropage.range.len = page_size;
 	uffdio_zeropage.mode = 0;
+
+	if (test_access_likely)
+		uffdio_zeropage.mode |= UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY;
+
+	if (test_write_likely)
+		uffdio_zeropage.mode |= UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY;
+
 	ret = ioctl(ufd, UFFDIO_ZEROPAGE, &uffdio_zeropage);
 	res = uffdio_zeropage.zeropage;
 	if (ret) {
@@ -1584,8 +1612,6 @@ unsigned long default_huge_page_size(void)
 
 static void set_test_type(const char *type)
 {
-	uint64_t features = UFFD_API_FEATURES;
-
 	if (!strcmp(type, "anon")) {
 		test_type = TEST_ANON;
 		uffd_test_ops = &anon_uffd_test_ops;
@@ -1606,6 +1632,28 @@ static void set_test_type(const char *type)
 	} else {
 		err("Unknown test type: %s", type);
 	}
+}
+
+static void parse_test_type_arg(const char *raw_type)
+{
+	char *buf = strdup(raw_type);
+	uint64_t features = UFFD_API_FEATURES;
+
+	while (buf) {
+		const char *token = strsep(&buf, ":");
+
+		if (!test_type)
+			set_test_type(token);
+		else if (!strcmp(token, "access_likely"))
+			test_access_likely = true;
+		else if (!strcmp(token, "write_likely"))
+			test_write_likely = true;
+		else
+			err("unrecognized test mod '%s'", token);
+	}
+
+	if (!test_type)
+		err("failed to parse test type argument: '%s'", raw_type);
 
 	if (test_type == TEST_HUGETLB)
 		page_size = default_huge_page_size();
@@ -1653,7 +1701,7 @@ int main(int argc, char **argv)
 		err("failed to arm SIGALRM");
 	alarm(ALARM_INTERVAL_SECS);
 
-	set_test_type(argv[1]);
+	parse_test_type_arg(argv[1]);
 
 	nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
 	nr_pages_per_cpu = atol(argv[2]) * 1024*1024 / page_size /
-- 
2.25.1
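
The colon-separated test argument handled by parse_test_type_arg() above
(for example "anon:access_likely:write_likely") can be modeled outside the
selftest. What follows is a minimal sketch, not the selftest code: struct
sketch_cfg and sketch_parse_test_arg() are hypothetical names, the helper
returns -1 instead of calling err(), and it accepts any non-empty first
token as the test type rather than validating it via set_test_type().

```c
#define _DEFAULT_SOURCE		/* for strsep()/strdup() under strict -std= modes */
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the parsed argument (not part of the patch). */
struct sketch_cfg {
	char type[32];		/* first token, e.g. "anon" or "hugetlb" */
	bool access_likely;	/* ":access_likely" modifier was given */
	bool write_likely;	/* ":write_likely" modifier was given */
};

/* Returns 0 on success, -1 on an empty type, unknown modifier or OOM. */
static int sketch_parse_test_arg(const char *raw, struct sketch_cfg *cfg)
{
	char *copy, *cursor, *token;
	int ret = 0;

	assert(raw != NULL && cfg != NULL);
	memset(cfg, 0, sizeof(*cfg));

	/* strsep() modifies its input, so work on a private copy. */
	copy = strdup(raw);
	if (!copy)
		return -1;
	cursor = copy;

	while ((token = strsep(&cursor, ":")) != NULL) {
		if (cfg->type[0] == '\0') {
			/* The first token is the test type. */
			if (token[0] == '\0' ||
			    strlen(token) >= sizeof(cfg->type)) {
				ret = -1;
				break;
			}
			strcpy(cfg->type, token);
		} else if (!strcmp(token, "access_likely")) {
			cfg->access_likely = true;
		} else if (!strcmp(token, "write_likely")) {
			cfg->write_likely = true;
		} else {
			ret = -1;	/* unrecognized modifier */
			break;
		}
	}

	free(copy);	/* keep the original pointer so the copy can be freed */
	return ret;
}
```

Keeping a separate cursor lets the duplicated string be freed after
parsing, since strsep() sets the pointer it advances to NULL at the end.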



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-07-18 11:47 ` [PATCH v2 2/5] userfaultfd: introduce access-likely mode for common operations Nadav Amit
@ 2022-07-18 20:05   ` Peter Xu
  2022-07-18 20:59     ` Nadav Amit
  2022-07-23  9:16   ` Mike Rapoport
  1 sibling, 1 reply; 20+ messages in thread
From: Peter Xu @ 2022-07-18 20:05 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, Andrew Morton, Nadav Amit, Mike Kravetz, Hugh Dickins,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport

On Mon, Jul 18, 2022 at 04:47:45AM -0700, Nadav Amit wrote:
> @@ -261,6 +272,7 @@ struct uffdio_copy {
>  struct uffdio_zeropage {
>  	struct uffdio_range range;
>  #define UFFDIO_ZEROPAGE_MODE_DONTWAKE		((__u64)1<<0)
> +#define UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY	((__u64)1<<1)

Would the access hint help the zeropage use case?  I remember you commenting
earlier that it won't help, since we won't reclaim the zero page anyway.

It won't help either if this flag is only meant to accompany the follow-up
WRITE_HINT (since then there'll be a CoW): attaching WRITE_HINT without
ACCESS_HINT doesn't make sense, so WRITE_HINT by itself seems to be enough
for ZEROPAGE to me.

[...]

> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 421784d26651..c15679f3eb6a 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -65,6 +65,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>  	bool writable = dst_vma->vm_flags & VM_WRITE;
>  	bool vm_shared = dst_vma->vm_flags & VM_SHARED;
>  	bool page_in_cache = page->mapping;
> +	bool prefault = !(uffd_flags & UFFD_FLAGS_ACCESS_LIKELY);

I think it's okay to name it "prefault" as a temp var, but ideally IMHO we
shouldn't assume what the user app is doing - it is only installing some
uffd pgtables with !ACCESS_LIKELY and it does not necessarily need to be a
prefault process..

>  	spinlock_t *ptl;
>  	struct inode *inode;
>  	pgoff_t offset, max_off;
> @@ -92,6 +93,11 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>  		 */
>  		_dst_pte = pte_wrprotect(_dst_pte);
>  
> +	if (prefault && arch_wants_old_prefaulted_pte())
> +		_dst_pte = pte_mkold(_dst_pte);
> +	else
> +		_dst_pte = pte_sw_mkyoung(_dst_pte);

Could you explain why we couldn't unconditionally mkold here even for x86?

It'll be a pity if this feature bit will only be useful on arm64 but not
covering x86 (which is so far still the majority I think).

IMHO it's slightly different here compared to kernel prefaults - the user
app may not be aware of kernel prefaults, but here !ACCESS_HINT is
user-aware, and it's what the user app explicitly provided.  IMO it's
already a stronger proof of a cold page.

The other thing that confuses me here is that arch_wants_old_prefaulted_pte()
returns true if arm64 supports hardware AF.  However, for all the other archs
(including x86_64 which, afaict, supports AF too in most models) it'll
constantly return false.  Do you know what the rationale behind that is?

> +
>  	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
>  
>  	if (vma_is_shmem(dst_vma)) {
> @@ -202,7 +208,8 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
>  static int mfill_zeropage_pte(struct mm_struct *dst_mm,
>  			      pmd_t *dst_pmd,
>  			      struct vm_area_struct *dst_vma,
> -			      unsigned long dst_addr)
> +			      unsigned long dst_addr,
> +			      uffd_flags_t uffd_flags)
>  {
>  	pte_t _dst_pte, *dst_pte;
>  	spinlock_t *ptl;
> @@ -495,7 +502,7 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
>  					       uffd_flags);
>  		else
>  			err = mfill_zeropage_pte(dst_mm, dst_pmd,
> -						 dst_vma, dst_addr);
> +						 dst_vma, dst_addr, uffd_flags);
>  	} else {
>  		err = shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma,
>  					     dst_addr, src_addr,
> -- 
> 2.25.1
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/5] userfaultfd: introduce uffd_flags
       [not found] ` <20220718114748.2623-2-namit@vmware.com>
@ 2022-07-18 20:05   ` Peter Xu
  2022-07-22  7:54   ` David Hildenbrand
  2022-07-23  9:12   ` Mike Rapoport
  2 siblings, 0 replies; 20+ messages in thread
From: Peter Xu @ 2022-07-18 20:05 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, Andrew Morton, Nadav Amit, Mike Kravetz, Hugh Dickins,
	Axel Rasmussen, Mike Rapoport, David Hildenbrand

On Mon, Jul 18, 2022 at 04:47:44AM -0700, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> As the next patches are going to introduce more information that needs
> to be propagated regarding handled user requests, introduce uffd_flags
> that would be used to propagate this information.
> 
> Remove the unused UFFD_FLAGS_SET to avoid confusion in the constant
> names.
> 
> Introducing uffd flags also allows to avoid mm/userfaultfd from being
> using uapi (e.g., UFFDIO_COPY_MODE_WP).
> 
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Nadav Amit <namit@vmware.com>

Acked-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 3/5] userfaultfd: introduce write-likely mode for uffd operations
  2022-07-18 11:47 ` [PATCH v2 3/5] userfaultfd: introduce write-likely mode for uffd operations Nadav Amit
@ 2022-07-18 20:12   ` Peter Xu
  2022-07-18 20:25     ` Nadav Amit
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Xu @ 2022-07-18 20:12 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, Andrew Morton, Nadav Amit, Mike Kravetz, Hugh Dickins,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport

On Mon, Jul 18, 2022 at 04:47:46AM -0700, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> Introduce write-likely hints for uffd. These hints would be used in a
> future patch to decide whether to attempt to map pages in the page-table
> or whether to only mark them logically as writable. This allows
> userspace to determine whether a page would be accessed faster or
> whether removal of the page would be possible, potentially, without
> writeback and TLB flush.
> 
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  fs/userfaultfd.c                 | 32 ++++++++++++++++++++++++--------
>  include/linux/userfaultfd_k.h    |  1 +
>  include/uapi/linux/userfaultfd.h | 13 ++++++++++++-
>  3 files changed, 37 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 8d8792b27c53..3027d228550a 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1709,7 +1709,8 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
>  	if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
>  		goto out;
>  	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP|
> -				 UFFDIO_COPY_MODE_ACCESS_LIKELY))
> +				 UFFDIO_COPY_MODE_ACCESS_LIKELY|
> +				 UFFDIO_COPY_MODE_WRITE_LIKELY))
>  		goto out;
>  
>  	mode_wp = uffdio_copy.mode & UFFDIO_COPY_MODE_WP;
> @@ -1719,8 +1720,11 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
>  	if (ctx->features & UFFD_FEATURE_ACCESS_HINTS) {
>  		if (uffdio_copy.mode & UFFDIO_COPY_MODE_ACCESS_LIKELY)
>  			uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
> +		if (uffdio_copy.mode & UFFDIO_COPY_MODE_WRITE_LIKELY)
> +			uffd_flags |= UFFD_FLAGS_WRITE_LIKELY;
>  	} else {
> -		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
> +		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY |
> +			      UFFD_FLAGS_WRITE_LIKELY;
>  	}
>  
>  	if (mmget_not_zero(ctx->mm)) {
> @@ -1774,14 +1778,18 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
>  		goto out;
>  	ret = -EINVAL;
>  	if (uffdio_zeropage.mode & ~(UFFDIO_ZEROPAGE_MODE_DONTWAKE|
> -				     UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY))
> +				     UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY|
> +				     UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY))
>  		goto out;
>  
>  	if (ctx->features & UFFD_FEATURE_ACCESS_HINTS) {
>  		if (uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY)
>  			uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
> +		if (uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY)
> +			uffd_flags |= UFFD_FLAGS_WRITE_LIKELY;
>  	} else {
> -		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
> +		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY |
> +			      UFFD_FLAGS_WRITE_LIKELY;
>  	}
>  
>  	if (mmget_not_zero(ctx->mm)) {
> @@ -1834,7 +1842,8 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
>  
>  	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
>  			       UFFDIO_WRITEPROTECT_MODE_WP |
> -			       UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY))
> +			       UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY |
> +			       UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY))
>  		return -EINVAL;
>  
>  	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
> @@ -1847,8 +1856,11 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
>  	if (ctx->features & UFFD_FEATURE_ACCESS_HINTS) {
>  		if (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY)
>  			uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
> +		if (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY)
> +			uffd_flags |= UFFD_FLAGS_WRITE_LIKELY;
>  	} else {
> -		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
> +		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY |
> +			      UFFD_FLAGS_WRITE_LIKELY;
>  	}
>  
>  	if (mmget_not_zero(ctx->mm)) {
> @@ -1903,14 +1915,18 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
>  		goto out;
>  	}
>  	if (uffdio_continue.mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE|
> -				     UFFDIO_CONTINUE_MODE_ACCESS_LIKELY))
> +				     UFFDIO_CONTINUE_MODE_ACCESS_LIKELY|
> +				     UFFDIO_CONTINUE_MODE_WRITE_LIKELY))
>  		goto out;
>  
>  	if (ctx->features & UFFD_FEATURE_ACCESS_HINTS) {
>  		if (uffdio_continue.mode & UFFDIO_CONTINUE_MODE_ACCESS_LIKELY)
>  			uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
> +		if (uffdio_continue.mode & UFFDIO_CONTINUE_MODE_WRITE_LIKELY)
> +			uffd_flags |= UFFD_FLAGS_WRITE_LIKELY;
>  	} else {
> -		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
> +		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY |
> +			      UFFD_FLAGS_WRITE_LIKELY;
>  	}
>  
>  	if (mmget_not_zero(ctx->mm)) {
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index b326798b5677..4968c86938b2 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -60,6 +60,7 @@ typedef unsigned int __bitwise uffd_flags_t;
>  #define UFFD_FLAGS_NONE			((__force uffd_flags_t)0)
>  #define UFFD_FLAGS_WP			((__force uffd_flags_t)BIT(0))
>  #define UFFD_FLAGS_ACCESS_LIKELY	((__force uffd_flags_t)BIT(1))
> +#define UFFD_FLAGS_WRITE_LIKELY		((__force uffd_flags_t)BIT(2))
>  
>  extern int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>  				    struct vm_area_struct *dst_vma,
> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> index 02e0c1f56939..f52cbe4c9c44 100644
> --- a/include/uapi/linux/userfaultfd.h
> +++ b/include/uapi/linux/userfaultfd.h
> @@ -202,7 +202,7 @@ struct uffdio_api {
>  	 * write-protection mode is supported on both shmem and hugetlbfs.
>  	 *
>  	 * UFFD_FEATURE_ACCESS_HINTS indicates that the ioctl operations
> -	 * support the UFFDIO_*_MODE_ACCESS_LIKELY hints.
> +	 * support the UFFDIO_*_MODE_[ACCESS|WRITE]_LIKELY hints.
>  	 */
>  #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
>  #define UFFD_FEATURE_EVENT_FORK			(1<<1)
> @@ -257,9 +257,13 @@ struct uffdio_copy {
>  	 * page is likely to be access in the near future. Providing the hint
>  	 * properly can improve performance.
>  	 *
> +	 * UFFDIO_COPY_MODE_WRITE_LIKELY provides a hint to the kernel that the
> +	 * page is likely to be written in the near future. Providing the hint
> +	 * properly can improve performance.
>  	 */
>  #define UFFDIO_COPY_MODE_WP			((__u64)1<<1)
>  #define UFFDIO_COPY_MODE_ACCESS_LIKELY		((__u64)1<<2)
> +#define UFFDIO_COPY_MODE_WRITE_LIKELY		((__u64)1<<3)
>  	__u64 mode;
>  
>  	/*
> @@ -273,6 +277,7 @@ struct uffdio_zeropage {
>  	struct uffdio_range range;
>  #define UFFDIO_ZEROPAGE_MODE_DONTWAKE		((__u64)1<<0)
>  #define UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY	((__u64)1<<1)
> +#define UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY	((__u64)1<<2)
>  	__u64 mode;
>  
>  	/*
> @@ -296,6 +301,10 @@ struct uffdio_writeprotect {
>   * that the page is likely to be access in the near future. Providing
>   * the hint properly can improve performance.
>   *
> + * UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY: provides a hint to the kernel
> + * that the page is likely to be written in the near future. Providing
> + * the hint properly can improve performance.
> + *
>   * NOTE: Write protecting a region (WP=1) is unrelated to page faults,
>   * therefore DONTWAKE flag is meaningless with WP=1.  Removing write
>   * protection (WP=0) in response to a page fault wakes the faulting
> @@ -304,6 +313,7 @@ struct uffdio_writeprotect {
>  #define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
>  #define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
>  #define UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY	((__u64)1<<2)
> +#define UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY	((__u64)1<<3)
>  	__u64 mode;
>  };
>  
> @@ -311,6 +321,7 @@ struct uffdio_continue {
>  	struct uffdio_range range;
>  #define UFFDIO_CONTINUE_MODE_DONTWAKE		((__u64)1<<0)
>  #define UFFDIO_CONTINUE_MODE_ACCESS_LIKELY	((__u64)1<<1)
> +#define UFFDIO_CONTINUE_MODE_WRITE_LIKELY	((__u64)1<<2)
>  	__u64 mode;

I thought you would have some reasoning on having the flag for unprotect
(since our last discussion you mentioned it) but it seems not there..

Then, could we only keep the zeropage write hint but drop the rest?
They're never used in this whole series besides the zeropage one, meanwhile
I think we're still not reaching consensus on whether they'll be helpful?

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 3/5] userfaultfd: introduce write-likely mode for uffd operations
  2022-07-18 20:12   ` Peter Xu
@ 2022-07-18 20:25     ` Nadav Amit
  2022-07-18 21:27       ` Peter Xu
  0 siblings, 1 reply; 20+ messages in thread
From: Nadav Amit @ 2022-07-18 20:25 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux MM, Andrew Morton, Mike Kravetz, Hugh Dickins,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport

On Jul 18, 2022, at 1:12 PM, Peter Xu <peterx@redhat.com> wrote:

> On Mon, Jul 18, 2022 at 04:47:46AM -0700, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> Introduce write-likely hints for uffd. These hints would be used in a
>> future patch to decide whether to attempt to map pages in the page-table
>> or whether to only mark them logically as writable. This allows
>> userspace to determine whether a page would be accessed faster or
>> whether removal of the page would be possible, potentially, without
>> writeback and TLB flush.
>> 
>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Axel Rasmussen <axelrasmussen@google.com>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Mike Rapoport <rppt@linux.ibm.com>
>> Signed-off-by: Nadav Amit <namit@vmware.com>
>> ---
>> fs/userfaultfd.c | 32 ++++++++++++++++++++++++--------
>> include/linux/userfaultfd_k.h | 1 +
>> include/uapi/linux/userfaultfd.h | 13 ++++++++++++-
>> 3 files changed, 37 insertions(+), 9 deletions(-)
>> 
>> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
>> index 8d8792b27c53..3027d228550a 100644
>> --- a/fs/userfaultfd.c
>> +++ b/fs/userfaultfd.c
>> @@ -1709,7 +1709,8 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
>> if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
>> goto out;
>> if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP|
>> - UFFDIO_COPY_MODE_ACCESS_LIKELY))
>> + UFFDIO_COPY_MODE_ACCESS_LIKELY|
>> + UFFDIO_COPY_MODE_WRITE_LIKELY))
>> goto out;
>> 
>> mode_wp = uffdio_copy.mode & UFFDIO_COPY_MODE_WP;
>> @@ -1719,8 +1720,11 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
>> if (ctx->features & UFFD_FEATURE_ACCESS_HINTS) {
>> if (uffdio_copy.mode & UFFDIO_COPY_MODE_ACCESS_LIKELY)
>> uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
>> + if (uffdio_copy.mode & UFFDIO_COPY_MODE_WRITE_LIKELY)
>> + uffd_flags |= UFFD_FLAGS_WRITE_LIKELY;
>> } else {
>> - uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
>> + uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY |
>> + UFFD_FLAGS_WRITE_LIKELY;
>> }
>> 
>> if (mmget_not_zero(ctx->mm)) {
>> @@ -1774,14 +1778,18 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
>> goto out;
>> ret = -EINVAL;
>> if (uffdio_zeropage.mode & ~(UFFDIO_ZEROPAGE_MODE_DONTWAKE|
>> - UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY))
>> + UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY|
>> + UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY))
>> goto out;
>> 
>> if (ctx->features & UFFD_FEATURE_ACCESS_HINTS) {
>> if (uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY)
>> uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
>> + if (uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY)
>> + uffd_flags |= UFFD_FLAGS_WRITE_LIKELY;
>> } else {
>> - uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
>> + uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY |
>> + UFFD_FLAGS_WRITE_LIKELY;
>> }
>> 
>> if (mmget_not_zero(ctx->mm)) {
>> @@ -1834,7 +1842,8 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
>> 
>> if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
>> UFFDIO_WRITEPROTECT_MODE_WP |
>> - UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY))
>> + UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY |
>> + UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY))
>> return -EINVAL;
>> 
>> mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
>> @@ -1847,8 +1856,11 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
>> if (ctx->features & UFFD_FEATURE_ACCESS_HINTS) {
>> if (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY)
>> uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
>> + if (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY)
>> + uffd_flags |= UFFD_FLAGS_WRITE_LIKELY;
>> } else {
>> - uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
>> + uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY |
>> + UFFD_FLAGS_WRITE_LIKELY;
>> }
>> 
>> if (mmget_not_zero(ctx->mm)) {
>> @@ -1903,14 +1915,18 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
>> goto out;
>> }
>> if (uffdio_continue.mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE|
>> - UFFDIO_CONTINUE_MODE_ACCESS_LIKELY))
>> + UFFDIO_CONTINUE_MODE_ACCESS_LIKELY|
>> + UFFDIO_CONTINUE_MODE_WRITE_LIKELY))
>> goto out;
>> 
>> if (ctx->features & UFFD_FEATURE_ACCESS_HINTS) {
>> if (uffdio_continue.mode & UFFDIO_CONTINUE_MODE_ACCESS_LIKELY)
>> uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
>> + if (uffdio_continue.mode & UFFDIO_CONTINUE_MODE_WRITE_LIKELY)
>> + uffd_flags |= UFFD_FLAGS_WRITE_LIKELY;
>> } else {
>> - uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
>> + uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY |
>> + UFFD_FLAGS_WRITE_LIKELY;
>> }
>> 
>> if (mmget_not_zero(ctx->mm)) {
>> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
>> index b326798b5677..4968c86938b2 100644
>> --- a/include/linux/userfaultfd_k.h
>> +++ b/include/linux/userfaultfd_k.h
>> @@ -60,6 +60,7 @@ typedef unsigned int __bitwise uffd_flags_t;
>> #define UFFD_FLAGS_NONE ((__force uffd_flags_t)0)
>> #define UFFD_FLAGS_WP ((__force uffd_flags_t)BIT(0))
>> #define UFFD_FLAGS_ACCESS_LIKELY ((__force uffd_flags_t)BIT(1))
>> +#define UFFD_FLAGS_WRITE_LIKELY ((__force uffd_flags_t)BIT(2))
>> 
>> extern int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>> struct vm_area_struct *dst_vma,
>> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
>> index 02e0c1f56939..f52cbe4c9c44 100644
>> --- a/include/uapi/linux/userfaultfd.h
>> +++ b/include/uapi/linux/userfaultfd.h
>> @@ -202,7 +202,7 @@ struct uffdio_api {
>> * write-protection mode is supported on both shmem and hugetlbfs.
>> *
>> * UFFD_FEATURE_ACCESS_HINTS indicates that the ioctl operations
>> - * support the UFFDIO_*_MODE_ACCESS_LIKELY hints.
>> + * support the UFFDIO_*_MODE_[ACCESS|WRITE]_LIKELY hints.
>> */
>> #define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0)
>> #define UFFD_FEATURE_EVENT_FORK (1<<1)
>> @@ -257,9 +257,13 @@ struct uffdio_copy {
>> * page is likely to be access in the near future. Providing the hint
>> * properly can improve performance.
>> *
>> + * UFFDIO_COPY_MODE_WRITE_LIKELY provides a hint to the kernel that the
>> + * page is likely to be written in the near future. Providing the hint
>> + * properly can improve performance.
>> */
>> #define UFFDIO_COPY_MODE_WP ((__u64)1<<1)
>> #define UFFDIO_COPY_MODE_ACCESS_LIKELY ((__u64)1<<2)
>> +#define UFFDIO_COPY_MODE_WRITE_LIKELY ((__u64)1<<3)
>> __u64 mode;
>> 
>> /*
>> @@ -273,6 +277,7 @@ struct uffdio_zeropage {
>> struct uffdio_range range;
>> #define UFFDIO_ZEROPAGE_MODE_DONTWAKE ((__u64)1<<0)
>> #define UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY ((__u64)1<<1)
>> +#define UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY ((__u64)1<<2)
>> __u64 mode;
>> 
>> /*
>> @@ -296,6 +301,10 @@ struct uffdio_writeprotect {
>> * that the page is likely to be access in the near future. Providing
>> * the hint properly can improve performance.
>> *
>> + * UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY: provides a hint to the kernel
>> + * that the page is likely to be written in the near future. Providing
>> + * the hint properly can improve performance.
>> + *
>> * NOTE: Write protecting a region (WP=1) is unrelated to page faults,
>> * therefore DONTWAKE flag is meaningless with WP=1. Removing write
>> * protection (WP=0) in response to a page fault wakes the faulting
>> @@ -304,6 +313,7 @@ struct uffdio_writeprotect {
>> #define UFFDIO_WRITEPROTECT_MODE_WP ((__u64)1<<0)
>> #define UFFDIO_WRITEPROTECT_MODE_DONTWAKE ((__u64)1<<1)
>> #define UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY ((__u64)1<<2)
>> +#define UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY ((__u64)1<<3)
>> __u64 mode;
>> };
>> 
>> @@ -311,6 +321,7 @@ struct uffdio_continue {
>> struct uffdio_range range;
>> #define UFFDIO_CONTINUE_MODE_DONTWAKE ((__u64)1<<0)
>> #define UFFDIO_CONTINUE_MODE_ACCESS_LIKELY ((__u64)1<<1)
>> +#define UFFDIO_CONTINUE_MODE_WRITE_LIKELY ((__u64)1<<2)
>> __u64 mode;
> 
> I thought you would have some reasoning on having the flag for unprotect
> (since our last discussion you mentioned it) but it seems not there..
> 
> Then, could we only keep the zeropage write hint but drop the rest?
> They're never used in this whole series besides the zeropage one, meanwhile
> I think we're still not reaching consensus on whether they'll be helpful?

I think I did not communicate two things clearly enough. First, the
access flags are used here.

Now, you are correct that although the unprotect flag is defined here, it is
not used in this patch-set. There is a reason for that.

It turns out that using David’s work to map a writable page can cause
undesired behaviors - the clean PTE, which we discussed, and additional TLB
shootdowns. Since it required a lot of changes to get rid of these
additional shootdowns, I put the unprotect changes in a different patch-set.

https://lore.kernel.org/all/20220718120212.3180-1-namit@vmware.com/

Let me know if that answers your question.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-07-18 20:05   ` Peter Xu
@ 2022-07-18 20:59     ` Nadav Amit
  2022-07-18 21:21       ` Peter Xu
  0 siblings, 1 reply; 20+ messages in thread
From: Nadav Amit @ 2022-07-18 20:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux MM, Andrew Morton, Mike Kravetz, Hugh Dickins,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport

On Jul 18, 2022, at 1:05 PM, Peter Xu <peterx@redhat.com> wrote:

> On Mon, Jul 18, 2022 at 04:47:45AM -0700, Nadav Amit wrote:
>> @@ -261,6 +272,7 @@ struct uffdio_copy {
>> struct uffdio_zeropage {
>>      struct uffdio_range range;
>> #define UFFDIO_ZEROPAGE_MODE_DONTWAKE                ((__u64)1<<0)
>> +#define UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY   ((__u64)1<<1)
> 
> Would access hint help zeropage use case?  I remembered you used to comment
> around and said it won't help since we won't reclaim zero page anyway.

I agree that the access bit has no meaning for the zero page. I just think
that it is best to have the flags for consistency. If you ask me, I would
prefer to have all the flags in a fixed place (highest bits?). Anyhow, if we
expose the hints as a feature, I do not think we would later want to say
“here is another feature that enables another hint that we thought was not
needed before”. Userfaultfd’s feature bits are already nuts, IMHO.

> It won't help either even if this flag is only used for the follow up
> WRITE_HINT (since then there'll be a CoW) because when WRITE_HINT attached
> it doesn't make sense to not have ACCESS_HINT, then it seems the WRITE_HINT
> itself would be enough for ZEROPAGE to me.

Agreed. Again, I think it is worth having for consistency.

> [...]
> 
>> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
>> index 421784d26651..c15679f3eb6a 100644
>> --- a/mm/userfaultfd.c
>> +++ b/mm/userfaultfd.c
>> @@ -65,6 +65,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>>      bool writable = dst_vma->vm_flags & VM_WRITE;
>>      bool vm_shared = dst_vma->vm_flags & VM_SHARED;
>>      bool page_in_cache = page->mapping;
>> +     bool prefault = !(uffd_flags & UFFD_FLAGS_ACCESS_LIKELY);
> 
> I think it's okay to name it "prefault" as a temp var, but ideally IMHO we
> shouldn't assume what the user app is doing - it is only installing some
> uffd pgtables with !ACCESS_LIKELY and it does not necessarily need to be a
> prefault process..
> 
>>      spinlock_t *ptl;
>>      struct inode *inode;
>>      pgoff_t offset, max_off;
>> @@ -92,6 +93,11 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>>               */
>>              _dst_pte = pte_wrprotect(_dst_pte);
>> 
>> +     if (prefault && arch_wants_old_prefaulted_pte())
>> +             _dst_pte = pte_mkold(_dst_pte);
>> +     else
>> +             _dst_pte = pte_sw_mkyoung(_dst_pte);
> 
> Could you explain why we couldn't unconditionally mkold here even for x86?

To answer this question and the previous one, please note that the logic is
“borrowed” from do_set_pte(). If you want me to refactor and extract a
function, please let me know.

Here is the deal: for x86, we don’t do pte_mkold() because having the CPU
set the access bit later is expensive (>500 cycles). For arm64 CPUs that
have a hardware access-bit we do, since (according to the arm64 code and
commit log) the cost of setting the access bit in hardware is low.
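
The resulting young/old decision can be written down as a small truth
function. This is only a userspace model of the quoted
mfill_atomic_install_pte() hunk: sketch_pte_installed_young() is a
hypothetical name, and the arch_wants_old parameter stands in for
arch_wants_old_prefaulted_pte().

```c
#include <stdbool.h>

/*
 * Returns true if the PTE would be installed young (access bit set),
 * false if it would be installed old.
 *
 * access_likely:  the ACCESS_LIKELY hint reached the install path.
 * arch_wants_old: models arch_wants_old_prefaulted_pte(); per this thread
 *                 it is true on arm64 with hardware AF and false elsewhere.
 */
static bool sketch_pte_installed_young(bool access_likely, bool arch_wants_old)
{
	bool prefault = !access_likely;

	/* Only a "cold" install on an arch that prefers old PTEs is old. */
	if (prefault && arch_wants_old)
		return false;	/* pte_mkold() branch */

	return true;		/* pte_sw_mkyoung() branch */
}
```

With arch_wants_old == false, as on x86, the function returns true for
either hint value, which is why the access hint has no effect there.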

> It'll be a pity if this feature bit will only be useful on arm64 but not
> covering x86 (which is so far still the majority I think).
> 
> IMHO it's slightly different here compared to kernel prefaults - the user
> app may not be aware of kernel prefaults, but here !ACCESS_HINT is
> user-aware, and it's what the user app explicitly provided.  IMO it's
> already a stronger proof of a cold page.

I’m ok with that if that is your choice. I actually prefer to give userspace
more control, but I tried to be consistent with other parts of the kernel.
Having said that, it’s really hard for me to see why young bit would be clear,
but dirty bit would be set...

> The other thing that confuses me here is that arch_wants_old_prefaulted_pte()
> returns true if arm64 supports hardware AF.  However, for all the other archs
> (including x86_64 which, afaict, supports AF too in most models) it'll
> constantly return false.  Do you know what the rationale behind that is?

All x86 (32/64) since 386 support access-bit in the page-tables (IIRC, 286
had access bit in the segments).

I thought we discussed it before: if you access an old PTE on x86, you pay
more than 500 cycles; this actually affected UnixBench when people tried to
change this behavior [1]. In contrast, on arm64, which I have never
profiled, you probably saw the comment saying: “Experimentally, it's cheap
to set the access flag in hardware and we benefit from prefaulting mappings
as 'old' to start with.”

I do not know what happens on other architectures.

( sorry if I have some repetitions in this email )

[1] https://marc.info/?l=linux-kernel&m=146582237922378&w=2


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-07-18 20:59     ` Nadav Amit
@ 2022-07-18 21:21       ` Peter Xu
  0 siblings, 0 replies; 20+ messages in thread
From: Peter Xu @ 2022-07-18 21:21 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux MM, Andrew Morton, Mike Kravetz, Hugh Dickins,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport

On Mon, Jul 18, 2022 at 08:59:37PM +0000, Nadav Amit wrote:
> On Jul 18, 2022, at 1:05 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> > On Mon, Jul 18, 2022 at 04:47:45AM -0700, Nadav Amit wrote:
> >> @@ -261,6 +272,7 @@ struct uffdio_copy {
> >> struct uffdio_zeropage {
> >>      struct uffdio_range range;
> >> #define UFFDIO_ZEROPAGE_MODE_DONTWAKE                ((__u64)1<<0)
> >> +#define UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY   ((__u64)1<<1)
> > 
> > Would access hint help zeropage use case?  I remembered you used to comment
> > around and said it won't help since we won't reclaim zero page anyway.
> 
> I agree that there is no meaning for access bit on zero page. I just think
> that it is best to have the flags for consistency. If you ask me, I would
> prefer to have all the flags in a fixed place (highest bits?). Anyhow, if we
> expose the hints as a feature, I do not think we would later want to say
> “here is another feature that enables another hint that we thought is not
> needed before”. Userfaultfd’s feature bits are already nuts, IMHO.
> 
> > It won't help either even if this flag is only used for the follow up
> > WRITE_HINT (since then there'll be a CoW) because when WRITE_HINT attached
> > it doesn't make sense to not have ACCESS_HINT, then it seems the WRITE_HINT
> > itself would be enough for ZEROPAGE to me.
> 
> Agreed. Again, I think it is worthy for consistency.

I'd be fine if it's kernel internal flags only.  But this is solid kernel
ABI.  Are you.. sure?

We're literally trying to introduce some flags just for "consistency" even
if we know nobody will be using them.  That really doesn't sound like the
right way to design good interfaces..

> 
> > [...]
> > 
> >> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> >> index 421784d26651..c15679f3eb6a 100644
> >> --- a/mm/userfaultfd.c
> >> +++ b/mm/userfaultfd.c
> >> @@ -65,6 +65,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> >>      bool writable = dst_vma->vm_flags & VM_WRITE;
> >>      bool vm_shared = dst_vma->vm_flags & VM_SHARED;
> >>      bool page_in_cache = page->mapping;
> >> +     bool prefault = !(uffd_flags & UFFD_FLAGS_ACCESS_LIKELY);
> > 
> > I think it's okay to name it "prefault" as a temp var, but ideally IMHO we
> > shouldn't assume what the user app is doing - it is only installing some
> > uffd pgtables with !ACCESS_LIKELY and it does not necessarily need to be a
> > prefault process..
> > 
> >>      spinlock_t *ptl;
> >>      struct inode *inode;
> >>      pgoff_t offset, max_off;
> >> @@ -92,6 +93,11 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> >>               */
> >>              _dst_pte = pte_wrprotect(_dst_pte);
> >> 
> >> +     if (prefault && arch_wants_old_prefaulted_pte())
> >> +             _dst_pte = pte_mkold(_dst_pte);
> >> +     else
> >> +             _dst_pte = pte_sw_mkyoung(_dst_pte);
> > 
> > Could you explain why we couldn't unconditionally mkold here even for x86?
> 
> To answer this question and the previous one, please note that the logic is
> “borrowed” from do_set_pte(). If you want me to refactor and extract a
> function, please let me know.
> 
> Here is the deal: for x86, we don’t do pte_mkold() because setting the
> access bit is expensive (>500 cycles). For arm64 CPUs that have a hardware
> access-bit we do, since (according to the arm64 code or commit log) the cost
> of setting the access bit on arm is low.
> 
> > It'll be a pity if this feature bit will only be useful on arm64 but not
> > covering x86 (which is so far still the majority I think).
> > 
> > IMHO it's slightly different here compared to kernel prefaults - the user
> > app may not be aware of kernel prefaults, but here !ACCESS_HINT it's
> > user-aware, and it's what user app explicitly provided.  IMO it's a
> > stronger proof of a cold page already.
> 
> I’m ok with that if that is your choice. I actually prefer to give userspace
> more control, but I tried to be consistent with other parts of the kernel.

Ah good to know, then if there's a vote I'll go for your proposal.

I'd suggest we give it strong semantics.  We used to have similar
discussions around the MADV_COLLAPSE on whether it should be restricted to
khugepaged limitations.  I think it's similar here.
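
To make the two options concrete, here is a minimal userspace model (not
kernel code) of the decision being discussed. `age_arch_gated()` mirrors the
logic quoted from the patch, where the hint only matters if the architecture
opts in; `age_strong()` is one reading of the "strong semantics" suggestion,
where the hint alone decides. The enum, the function names and the
`arch_wants_old` parameter (standing in for the result of
arch_wants_old_prefaulted_pte()) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for the young/old state of a freshly installed PTE. */
enum pte_age { PTE_OLD, PTE_YOUNG };

/*
 * Logic as quoted from the patch: a missing access hint is treated as a
 * prefault, and the PTE is made old only if the architecture asks for it.
 */
enum pte_age age_arch_gated(bool access_likely, bool arch_wants_old)
{
	bool prefault = !access_likely;

	return (prefault && arch_wants_old) ? PTE_OLD : PTE_YOUNG;
}

/* "Strong" semantics: the userspace hint alone decides, on every arch. */
enum pte_age age_strong(bool access_likely)
{
	return access_likely ? PTE_YOUNG : PTE_OLD;
}

/* The two policies differ only when the hint is absent and the arch opts out. */
void pte_age_selfcheck(void)
{
	assert(age_arch_gated(false, false) == PTE_YOUNG);
	assert(age_strong(false) == PTE_OLD);
	assert(age_arch_gated(false, true) == age_strong(false));
	assert(age_arch_gated(true, true) == age_strong(true));
}
```

With the arch-gated variant, an x86-like configuration (`arch_wants_old ==
false`) always yields a young PTE, so the hint has no effect there; the strong
variant honors it everywhere.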

> Having said that, it’s really hard for me to see why young bit would be clear,
> but dirty bit would be set...

Assume one page has both young/dirty set, the reclaim code decides to age
this page, then.. young=0 && dirty=1?

> 
> > The other thing I got confused here is arch_wants_old_prefaulted_pte()
> > returns true if arm64 supports hardware AF.  However for all the rest archs
> > (including x86_64 which, afaict, support AF too in most models) it'll
> > constantly return false.  Do you know what's the rationale behind it?
> 
> All x86 (32/64) since 386 support access-bit in the page-tables (IIRC, 286
> had access bit in the segments).
> 
> I thought we discussed it before: if you access an old PTE on x86, you pay
> >500 cycles; this actually affected UnixBench when people tried to change
> this behavior [1]. In contrast, on arm64, which I have never profiled, you
> probably saw the comment saying: "Experimentally, it's cheap to set the
> access flag in hardware and we benefit from prefaulting mappings as 'old’ to
> start with.”.

Thanks.  I'm really curious how fast aarch64 would be at setting the
hardware-assisted young bit, and why now.

> 
> I do not know what happens on other architectures.
> 
> ( sorry if I have some repetitions in this email )
> 
> [1] https://marc.info/?l=linux-kernel&m=146582237922378&w=2
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 3/5] userfaultfd: introduce write-likely mode for uffd operations
  2022-07-18 20:25     ` Nadav Amit
@ 2022-07-18 21:27       ` Peter Xu
  0 siblings, 0 replies; 20+ messages in thread
From: Peter Xu @ 2022-07-18 21:27 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux MM, Andrew Morton, Mike Kravetz, Hugh Dickins,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport

On Mon, Jul 18, 2022 at 08:25:46PM +0000, Nadav Amit wrote:
> >> @@ -311,6 +321,7 @@ struct uffdio_continue {
> >> struct uffdio_range range;
> >> #define UFFDIO_CONTINUE_MODE_DONTWAKE ((__u64)1<<0)
> >> #define UFFDIO_CONTINUE_MODE_ACCESS_LIKELY ((__u64)1<<1)
> >> +#define UFFDIO_CONTINUE_MODE_WRITE_LIKELY ((__u64)1<<2)
> >> __u64 mode;
> > 
> > I thought you would have some reasoning on having the flag for unprotect
> > (since our last discussion you mentioned it) but it seems not there..
> > 
> > Then, could we only keep the zeropage write hint but drop the rest?
> > They're never used in this whole series besides the zeropage one, meanwhile
> > I think we're still not reaching consensus on whether they'll be helpful?
> 
> I think that I didn’t communicate clearly enough two things. First, the
> access flags are used here.
> 
> Now, you are correct that although the unprotect flag is defined here, it is
> not used in this patch-set. There is a reason for that.
> 
> It turns out that using David’s work to map a writable page can cause
> undesired behaviors - the clean PTE, which we discussed, and additional TLB
> shootdowns. Since it required a lot of changes to get rid of these
> additional shootdowns, I put the unprotect changes in a different patch-set.
> 
> https://lore.kernel.org/all/20220718120212.3180-1-namit@vmware.com/
> 
> Let me know if that answers your question.

Okay, I'll read it tomorrow, thanks.  Though note that IMHO we should have
the fix without depending on WRITE_HINT at all.  I hope that's what'll
happen in the other patchset, or I can also comment there.

Btw, if there's a direct dependency on the flags I'd rather squash the two
patchsets.  The thing is that from reading this patch alone the reader will
have no idea why you wanted to have WRITE_HINT outside ZEROPAGE; at least I
didn't.
We could have introduced WRITE_HINT for ZEROPAGE in this patch (then IMO
you can squash that part with patch 4) then leave the rest for the other
patchset.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 4/5] userfaultfd: zero access/write hints
  2022-07-18 11:47 ` [PATCH v2 4/5] userfaultfd: zero access/write hints Nadav Amit
@ 2022-07-22  7:47   ` David Hildenbrand
  0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-07-22  7:47 UTC (permalink / raw)
  To: Nadav Amit, linux-mm
  Cc: Andrew Morton, Nadav Amit, Mike Kravetz, Hugh Dickins,
	Axel Rasmussen, Mike Rapoport, Peter Xu

On 18.07.22 13:47, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> When userfaultfd provides a zeropage in response to ioctl, it provides a
> readonly alias to the zero page. If the page is later written (which is
> the likely scenario), page-fault occurs and the page-fault allocator
> allocates a page and rewires the page-tables.
> 
> This is an expensive flow for cases in which a page is likely to be written
> to. Users can use the copy ioctl to initialize a zero page (by copying
> zeros), but this is also wasteful.
> 
> Allow userfaultfd users to efficiently map initialized zero-pages that
> are writable. If UFFDIO_ZEROPAGE_MODE_WRITE_LIKELY is provided, map
> a clear page instead of an alias to the zero page.
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Acked-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  mm/userfaultfd.c | 35 +++++++++++++++++++++++++++++++++++
>  1 file changed, 35 insertions(+)
> 
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index c15679f3eb6a..954c6980b29f 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -241,6 +241,37 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm,
>  	return ret;
>  }
>  
> +static int mfill_clearpage_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> +			       struct vm_area_struct *dst_vma,
> +			       unsigned long dst_addr,
> +			       uffd_flags_t uffd_flags)
> +{
> +	struct page *page;
> +	int ret;
> +
> +	ret = -ENOMEM;
> +	page = alloc_zeroed_user_highpage_movable(dst_vma, dst_addr);
> +	if (!page)
> +		goto out;
> +
> +	/* The PTE is not marked as dirty unconditionally */
> +	SetPageDirty(page);
> +	__SetPageUptodate(page);
> +
> +	if (mem_cgroup_charge(page_folio(page), dst_vma->vm_mm, GFP_KERNEL))
> +		goto out_release;
> +
> +	ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
> +				       page, true, uffd_flags);
> +	if (ret)
> +		goto out_release;
> +out:
> +	return ret;
> +out_release:
> +	put_page(page);
> +	goto out;
> +}
> +
>  /* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */
>  static int mcontinue_atomic_pte(struct mm_struct *dst_mm,
>  				pmd_t *dst_pmd,
> @@ -500,6 +531,10 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
>  			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
>  					       dst_addr, src_addr, page,
>  					       uffd_flags);
> +		else if (!(uffd_flags & UFFD_FLAGS_WP) &&
> +			 (uffd_flags & UFFD_FLAGS_WRITE_LIKELY))
> +			err = mfill_clearpage_pte(dst_mm, dst_pmd, dst_vma,
> +						  dst_addr, uffd_flags);
>  		else
>  			err = mfill_zeropage_pte(dst_mm, dst_pmd,
>  						 dst_vma, dst_addr, uffd_flags);

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/5] userfaultfd: introduce uffd_flags
       [not found] ` <20220718114748.2623-2-namit@vmware.com>
  2022-07-18 20:05   ` [PATCH v2 1/5] userfaultfd: introduce uffd_flags Peter Xu
@ 2022-07-22  7:54   ` David Hildenbrand
  2022-07-22 18:47     ` Nadav Amit
  2022-07-23  9:12   ` Mike Rapoport
  2 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2022-07-22  7:54 UTC (permalink / raw)
  To: Nadav Amit, linux-mm
  Cc: Andrew Morton, Nadav Amit, Mike Kravetz, Hugh Dickins,
	Axel Rasmussen, Peter Xu, Mike Rapoport

On 18.07.22 13:47, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> As the next patches are going to introduce more information that needs
> to be propagated regarding handled user requests, introduce uffd_flags
> that would be used to propagate this information.
> 
> Remove the unused UFFD_FLAGS_SET to avoid confusion in the constant
> names.
> 
> Introducing uffd flags also keeps mm/userfaultfd from using uapi
> definitions (e.g., UFFDIO_COPY_MODE_WP).
> 
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Nadav Amit <namit@vmware.com>

[...]

>  
>  int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
> -			unsigned long len, bool enable_wp,
> -			atomic_t *mmap_changing)
> +			unsigned long len,
> +			atomic_t *mmap_changing, uffd_flags_t uffd_flags)
>  {
> +	bool enable_wp = uffd_flags & UFFD_FLAGS_WP;

It could be that this will trigger a sparse warning, but I haven't fully
understood yet when/how sparse will start to complain. If so, this would
have to be

bool enable_wp = !!(uffd_flags & UFFD_FLAGS_WP);

I stumbled into something like that in
https://lore.kernel.org/lkml/202202252038.ij1YGn0d-lkp@intel.com/T/
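
For context, a minimal userspace sketch of what the `!!` form guards
against. In C99, assigning a masked value to a `bool` already normalizes it to
0 or 1, so both forms behave the same at runtime; the difference matters when
the destination is a narrower integer type, or when a checker objects to the
implicit conversion. The type, the flag value and the function names below are
stand-ins chosen for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for a flags type; the bit position is an assumption. */
typedef uint32_t demo_flags_t;
#define DEMO_FLAG_HIGH ((demo_flags_t)1 << 8)

/* Implicit conversion to bool: any non-zero value becomes true. */
bool flag_to_bool_implicit(demo_flags_t flags)
{
	bool set = flags & DEMO_FLAG_HIGH;

	return set;
}

/* Explicit normalization with !!: same result, intent spelled out. */
bool flag_to_bool_explicit(demo_flags_t flags)
{
	bool set = !!(flags & DEMO_FLAG_HIGH);

	return set;
}

/* Narrowing to an 8-bit integer instead of bool silently drops bit 8. */
uint8_t flag_to_u8_truncating(demo_flags_t flags)
{
	return (uint8_t)(flags & DEMO_FLAG_HIGH);
}

/* Both bool forms agree; only the narrowing conversion loses the flag. */
void flag_conversion_selfcheck(void)
{
	assert(flag_to_bool_implicit(DEMO_FLAG_HIGH) ==
	       flag_to_bool_explicit(DEMO_FLAG_HIGH));
	assert(flag_to_u8_truncating(DEMO_FLAG_HIGH) == 0);
}
```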


Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/5] userfaultfd: introduce uffd_flags
  2022-07-22  7:54   ` David Hildenbrand
@ 2022-07-22 18:47     ` Nadav Amit
  0 siblings, 0 replies; 20+ messages in thread
From: Nadav Amit @ 2022-07-22 18:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linux MM, Andrew Morton, Mike Kravetz, Hugh Dickins,
	Axel Rasmussen, Peter Xu, Mike Rapoport

On Jul 22, 2022, at 12:54 AM, David Hildenbrand <david@redhat.com> wrote:

> On 18.07.22 13:47, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> As the next patches are going to introduce more information that needs
>> to be propagated regarding handled user requests, introduce uffd_flags
>> that would be used to propagate this information.
>> 
>> Remove the unused UFFD_FLAGS_SET to avoid confusion in the constant
>> names.
>> 
>> Introducing uffd flags also keeps mm/userfaultfd from using uapi
>> definitions (e.g., UFFDIO_COPY_MODE_WP).
>> 
>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Axel Rasmussen <axelrasmussen@google.com>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: Mike Rapoport <rppt@linux.ibm.com>
>> Acked-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: Nadav Amit <namit@vmware.com>
> 
> [...]
> 
>> int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
>> -                     unsigned long len, bool enable_wp,
>> -                     atomic_t *mmap_changing)
>> +                     unsigned long len,
>> +                     atomic_t *mmap_changing, uffd_flags_t uffd_flags)
>> {
>> +     bool enable_wp = uffd_flags & UFFD_FLAGS_WP;
> 
> It could be that this will trigger a sparse warning, but I haven't fully
> understood yet when/how sparse will start to complain. If so, this would
> have to be
> 
> bool enable_wp = !!(uffd_flags & UFFD_FLAGS_WP);
> 
> I stumbled into something like that in
> https://lore.kernel.org/lkml/202202252038.ij1YGn0d-lkp@intel.com/T/

Oh, damn. Thanks for pointing it out. Sparse gives me segmentation faults
for some reason, but I guess it should be addressed - just in case.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/5] userfaultfd: introduce uffd_flags
       [not found] ` <20220718114748.2623-2-namit@vmware.com>
  2022-07-18 20:05   ` [PATCH v2 1/5] userfaultfd: introduce uffd_flags Peter Xu
  2022-07-22  7:54   ` David Hildenbrand
@ 2022-07-23  9:12   ` Mike Rapoport
  2022-07-25 17:23     ` Nadav Amit
  2 siblings, 1 reply; 20+ messages in thread
From: Mike Rapoport @ 2022-07-23  9:12 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, Andrew Morton, Nadav Amit, Mike Kravetz, Hugh Dickins,
	Axel Rasmussen, Peter Xu, David Hildenbrand

On Mon, Jul 18, 2022 at 04:47:44AM -0700, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> As the next patches are going to introduce more information that needs
> to be propagated regarding handled user requests, introduce uffd_flags
> that would be used to propagate this information.
> 
> Remove the unused UFFD_FLAGS_SET to avoid confusion in the constant
> names.
> 
> Introducing uffd flags also keeps mm/userfaultfd from using uapi
> definitions (e.g., UFFDIO_COPY_MODE_WP).
> 
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  fs/userfaultfd.c              | 22 +++++++++++---
>  include/linux/hugetlb.h       |  4 +--
>  include/linux/shmem_fs.h      |  8 +++--
>  include/linux/userfaultfd_k.h | 24 +++++++++------
>  mm/hugetlb.c                  |  3 +-
>  mm/shmem.c                    |  6 ++--
>  mm/userfaultfd.c              | 57 ++++++++++++++++++-----------------
>  7 files changed, 73 insertions(+), 51 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index e943370107d0..2ae24327beec 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1682,6 +1682,8 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
>  	struct uffdio_copy uffdio_copy;
>  	struct uffdio_copy __user *user_uffdio_copy;
>  	struct userfaultfd_wake_range range;
> +	bool mode_wp;
> +	uffd_flags_t uffd_flags;
>  
>  	user_uffdio_copy = (struct uffdio_copy __user *) arg;
>  
> @@ -1708,10 +1710,15 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
>  		goto out;
>  	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP))
>  		goto out;
> +
> +	mode_wp = uffdio_copy.mode & UFFDIO_COPY_MODE_WP;

This seems to be the only place where mode_wp is used in this function. 
I'd just drop it, and set uffd_flags directly from uffdio_copy.mode. E.g.
something like

	uffd_flags_t uffd_flags = UFFD_FLAGS_NONE;

	...

	if (uffdio_copy.mode & UFFDIO_COPY_MODE_WP)
		uffd_flags = UFFD_FLAGS_WP;

Otherwise

Acked-by: Mike Rapoport <rppt@linux.ibm.com>

> +
> +	uffd_flags = mode_wp ? UFFD_FLAGS_WP : UFFD_FLAGS_NONE;
> +
>  	if (mmget_not_zero(ctx->mm)) {
>  		ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
>  				   uffdio_copy.len, &ctx->mmap_changing,
> -				   uffdio_copy.mode);
> +				   uffd_flags);
>  		mmput(ctx->mm);
>  	} else {
>  		return -ESRCH;

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-07-18 11:47 ` [PATCH v2 2/5] userfaultfd: introduce access-likely mode for common operations Nadav Amit
  2022-07-18 20:05   ` Peter Xu
@ 2022-07-23  9:16   ` Mike Rapoport
  2022-07-25 17:18     ` Nadav Amit
  1 sibling, 1 reply; 20+ messages in thread
From: Mike Rapoport @ 2022-07-23  9:16 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, Andrew Morton, Nadav Amit, Mike Kravetz, Hugh Dickins,
	Axel Rasmussen, Peter Xu, David Hildenbrand

On Mon, Jul 18, 2022 at 04:47:45AM -0700, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> Introduce access-hints in userfaultfd. The expectation is that userspace
> would set access-hints when a page-fault occurred on a page and would
> not provide the access-hint on prefaulted memory. The exact behavior of
> the kernel in regard to the hints would not be part of userfaultfd api.
> 
> At this time the use of the access-hint is only in setting access-bit
> similarly to the way it is done in do_set_pte(). In x86, currently PTEs
> are always marked as young, including prefetched ones. But on arm64,
> PTEs would be marked as old (when access bit is supported).
> 
> If access hints are not enabled, the kernel would behave as if the
> access-hint was provided for backward compatibility.
> 
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  fs/userfaultfd.c                 | 39 ++++++++++++++++++++++++++++----
>  include/linux/userfaultfd_k.h    |  1 +
>  include/uapi/linux/userfaultfd.h | 20 +++++++++++++++-
>  mm/internal.h                    | 13 +++++++++++
>  mm/memory.c                      | 12 ----------
>  mm/userfaultfd.c                 | 11 +++++++--
>  6 files changed, 77 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 2ae24327beec..8d8792b27c53 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1708,13 +1708,21 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
>  	ret = -EINVAL;
>  	if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
>  		goto out;
> -	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP))
> +	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP|
> +				 UFFDIO_COPY_MODE_ACCESS_LIKELY))
>  		goto out;
>  
>  	mode_wp = uffdio_copy.mode & UFFDIO_COPY_MODE_WP;
>  
>  	uffd_flags = mode_wp ? UFFD_FLAGS_WP : UFFD_FLAGS_NONE;
>  
> +	if (ctx->features & UFFD_FEATURE_ACCESS_HINTS) {
> +		if (uffdio_copy.mode & UFFDIO_COPY_MODE_ACCESS_LIKELY)
> +			uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
> +	} else {
> +		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
> +	}
> +

This is quite a construct and it gets more complex in the following
patches. How about making it a static inline function?

>  	if (mmget_not_zero(ctx->mm)) {
>  		ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
>  				   uffdio_copy.len, &ctx->mmap_changing,
> @@ -1765,9 +1773,17 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
>  	if (ret)
>  		goto out;
>  	ret = -EINVAL;
> -	if (uffdio_zeropage.mode & ~UFFDIO_ZEROPAGE_MODE_DONTWAKE)
> +	if (uffdio_zeropage.mode & ~(UFFDIO_ZEROPAGE_MODE_DONTWAKE|
> +				     UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY))
>  		goto out;
>  
> +	if (ctx->features & UFFD_FEATURE_ACCESS_HINTS) {
> +		if (uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY)
> +			uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
> +	} else {
> +		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
> +	}
> +
>  	if (mmget_not_zero(ctx->mm)) {
>  		ret = mfill_zeropage(ctx->mm, uffdio_zeropage.range.start,
>  				     uffdio_zeropage.range.len,

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-07-23  9:16   ` Mike Rapoport
@ 2022-07-25 17:18     ` Nadav Amit
  2022-07-26 16:02       ` Mike Rapoport
  0 siblings, 1 reply; 20+ messages in thread
From: Nadav Amit @ 2022-07-25 17:18 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Linux MM, Andrew Morton, Mike Kravetz, Hugh Dickins,
	Axel Rasmussen, Peter Xu, David Hildenbrand



> On Jul 23, 2022, at 2:16 AM, Mike Rapoport <rppt@linux.ibm.com> wrote:
> 
> On Mon, Jul 18, 2022 at 04:47:45AM -0700, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> Introduce access-hints in userfaultfd. The expectation is that userspace
>> would set access-hints when a page-fault occurred on a page and would
>> not provide the access-hint on prefaulted memory. The exact behavior of
>> the kernel in regard to the hints would not be part of userfaultfd api.
>> 
>> At this time the use of the access-hint is only in setting access-bit
>> similarly to the way it is done in do_set_pte(). In x86, currently PTEs
>> are always marked as young, including prefetched ones. But on arm64,
>> PTEs would be marked as old (when access bit is supported).
>> 
>> If access hints are not enabled, the kernel would behave as if the
>> access-hint was provided for backward compatibility.
>> 
>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Axel Rasmussen <axelrasmussen@google.com>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Mike Rapoport <rppt@linux.ibm.com>
>> Signed-off-by: Nadav Amit <namit@vmware.com>
>> ---
>> fs/userfaultfd.c                 | 39 ++++++++++++++++++++++++++++----
>> include/linux/userfaultfd_k.h    |  1 +
>> include/uapi/linux/userfaultfd.h | 20 +++++++++++++++-
>> mm/internal.h                    | 13 +++++++++++
>> mm/memory.c                      | 12 ----------
>> mm/userfaultfd.c                 | 11 +++++++--
>> 6 files changed, 77 insertions(+), 19 deletions(-)
>> 
>> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
>> index 2ae24327beec..8d8792b27c53 100644
>> --- a/fs/userfaultfd.c
>> +++ b/fs/userfaultfd.c
>> @@ -1708,13 +1708,21 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
>> 	ret = -EINVAL;
>> 	if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
>> 		goto out;
>> -	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP))
>> +	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP|
>> +				 UFFDIO_COPY_MODE_ACCESS_LIKELY))
>> 		goto out;
>> 
>> 	mode_wp = uffdio_copy.mode & UFFDIO_COPY_MODE_WP;
>> 
>> 	uffd_flags = mode_wp ? UFFD_FLAGS_WP : UFFD_FLAGS_NONE;
>> 
>> +	if (ctx->features & UFFD_FEATURE_ACCESS_HINTS) {
>> +		if (uffdio_copy.mode & UFFDIO_COPY_MODE_ACCESS_LIKELY)
>> +			uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
>> +	} else {
>> +		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
>> +	}
>> +
> 
> This is quite a construct and it gets more complex in the following
> patches. How about making it a static inline function?

Possible. There is another option though. I think it would have been
much cleaner if some flags were at common offsets in the different
“mode” fields. It might be too late for some fields (WP), but I can
put the ACCESS/WRITE flags in the high bits, at a fixed place for
all modes, which would at least allow reusing the logic.

Is that ok?



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/5] userfaultfd: introduce uffd_flags
  2022-07-23  9:12   ` Mike Rapoport
@ 2022-07-25 17:23     ` Nadav Amit
  0 siblings, 0 replies; 20+ messages in thread
From: Nadav Amit @ 2022-07-25 17:23 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-mm, Andrew Morton, Mike Kravetz, Hugh Dickins,
	Axel Rasmussen, Peter Xu, David Hildenbrand



> On Jul 23, 2022, at 2:12 AM, Mike Rapoport <rppt@linux.ibm.com> wrote:
> 
> On Mon, Jul 18, 2022 at 04:47:44AM -0700, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> As the next patches are going to introduce more information that needs
>> to be propagated regarding handled user requests, introduce uffd_flags
>> that would be used to propagate this information.
>> 
>> Remove the unused UFFD_FLAGS_SET to avoid confusion in the constant
>> names.
>> 
>> Introducing uffd flags also keeps mm/userfaultfd from using uapi
>> definitions (e.g., UFFDIO_COPY_MODE_WP).
>> 
>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Axel Rasmussen <axelrasmussen@google.com>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: Mike Rapoport <rppt@linux.ibm.com>
>> Acked-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: Nadav Amit <namit@vmware.com>
>> ---
>> fs/userfaultfd.c              | 22 +++++++++++---
>> include/linux/hugetlb.h       |  4 +--
>> include/linux/shmem_fs.h      |  8 +++--
>> include/linux/userfaultfd_k.h | 24 +++++++++------
>> mm/hugetlb.c                  |  3 +-
>> mm/shmem.c                    |  6 ++--
>> mm/userfaultfd.c              | 57 ++++++++++++++++++-----------------
>> 7 files changed, 73 insertions(+), 51 deletions(-)
>> 
>> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
>> index e943370107d0..2ae24327beec 100644
>> --- a/fs/userfaultfd.c
>> +++ b/fs/userfaultfd.c
>> @@ -1682,6 +1682,8 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
>> 	struct uffdio_copy uffdio_copy;
>> 	struct uffdio_copy __user *user_uffdio_copy;
>> 	struct userfaultfd_wake_range range;
>> +	bool mode_wp;
>> +	uffd_flags_t uffd_flags;
>> 
>> 	user_uffdio_copy = (struct uffdio_copy __user *) arg;
>> 
>> @@ -1708,10 +1710,15 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
>> 		goto out;
>> 	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP))
>> 		goto out;
>> +
>> +	mode_wp = uffdio_copy.mode & UFFDIO_COPY_MODE_WP;
> 
> This seems to be the only place where mode_wp is used in this function. 
> I'd just drop it, and set uffd_flags directly from uffdio_copy.mode. E.g.
> something like
> 
> 	uffd_flags_t uffd_flags = UFFD_FLAGS_NONE;
> 
> 	...
> 
> 	if (uffdio_copy.mode & UFFDIO_COPY_MODE_WP)
> 		uffd_flags = UFFD_FLAGS_WP;

Good point; taken.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/5] userfaultfd: introduce access-likely mode for common operations
  2022-07-25 17:18     ` Nadav Amit
@ 2022-07-26 16:02       ` Mike Rapoport
  0 siblings, 0 replies; 20+ messages in thread
From: Mike Rapoport @ 2022-07-26 16:02 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux MM, Andrew Morton, Mike Kravetz, Hugh Dickins,
	Axel Rasmussen, Peter Xu, David Hildenbrand

On Mon, Jul 25, 2022 at 10:18:38AM -0700, Nadav Amit wrote:
> 
> > On Jul 23, 2022, at 2:16 AM, Mike Rapoport <rppt@linux.ibm.com> wrote:
> > 
> > On Mon, Jul 18, 2022 at 04:47:45AM -0700, Nadav Amit wrote:
> >> From: Nadav Amit <namit@vmware.com>
> >> 
> >> Introduce access-hints in userfaultfd. The expectation is that userspace
> >> would set access-hints when a page-fault occurred on a page and would
> >> not provide the access-hint on prefaulted memory. The exact behavior of
> >> the kernel in regard to the hints would not be part of userfaultfd api.
> >> 
> >> At this time the use of the access-hint is only in setting access-bit
> >> similarly to the way it is done in do_set_pte(). In x86, currently PTEs
> >> are always marked as young, including prefetched ones. But on arm64,
> >> PTEs would be marked as old (when access bit is supported).
> >> 
> >> If access hints are not enabled, the kernel would behave as if the
> >> access-hint was provided for backward compatibility.
> >> 
> >> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> >> Cc: Hugh Dickins <hughd@google.com>
> >> Cc: Andrew Morton <akpm@linux-foundation.org>
> >> Cc: Axel Rasmussen <axelrasmussen@google.com>
> >> Cc: Peter Xu <peterx@redhat.com>
> >> Cc: David Hildenbrand <david@redhat.com>
> >> Cc: Mike Rapoport <rppt@linux.ibm.com>
> >> Signed-off-by: Nadav Amit <namit@vmware.com>
> >> ---
> >> fs/userfaultfd.c                 | 39 ++++++++++++++++++++++++++++----
> >> include/linux/userfaultfd_k.h    |  1 +
> >> include/uapi/linux/userfaultfd.h | 20 +++++++++++++++-
> >> mm/internal.h                    | 13 +++++++++++
> >> mm/memory.c                      | 12 ----------
> >> mm/userfaultfd.c                 | 11 +++++++--
> >> 6 files changed, 77 insertions(+), 19 deletions(-)
> >> 
> >> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> >> index 2ae24327beec..8d8792b27c53 100644
> >> --- a/fs/userfaultfd.c
> >> +++ b/fs/userfaultfd.c
> >> @@ -1708,13 +1708,21 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
> >> 	ret = -EINVAL;
> >> 	if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
> >> 		goto out;
> >> -	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP))
> >> +	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP|
> >> +				 UFFDIO_COPY_MODE_ACCESS_LIKELY))
> >> 		goto out;
> >> 
> >> 	mode_wp = uffdio_copy.mode & UFFDIO_COPY_MODE_WP;
> >> 
> >> 	uffd_flags = mode_wp ? UFFD_FLAGS_WP : UFFD_FLAGS_NONE;
> >> 
> >> +	if (ctx->features & UFFD_FEATURE_ACCESS_HINTS) {
> >> +		if (uffdio_copy.mode & UFFDIO_COPY_MODE_ACCESS_LIKELY)
> >> +			uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
> >> +	} else {
> >> +		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
> >> +	}
> >> +
> > 
> > This is quite a construct and it gets more complex in the following
> > patches. How about making it a static inline function?
> 
> Possible. There is another option though. I think it would have been
> much cleaner if some flags were in common offsets in the different
> “mode” fields. It might be too late for some fields (WP), but I can
> put the ACCESS/WRITE fields in the high bits at a fixed
> place for all modes, which would at least allow reusing the logic.

So unless I'm missing something it'll be

	if (ctx->features & UFFD_FEATURE_ACCESS_HINTS)
		uffd_flags |= (uffdio_copy.mode & UFFDIO_COPY_MODE_ACCESS_MASK);
	else
		uffd_flags |= UFFD_FLAGS_ACCESS_MASK;

I still think it's worth wrapping it in a static inline with comments about
the common offsets for the 'if' clause and backward compatibility for the
'else' clause.
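
A minimal sketch of such a wrapper, written as standalone C rather than kernel
code. The bit positions, the shared UFFDIO_MODE_* names and the helper name
uffd_hint_flags() are assumptions made for illustration. They are not taken
from the patch or from the UAPI header, and which hints the backward-compatible
default should include is also an assumption here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical values: all bit positions below are illustrative
 * assumptions and do not reflect the actual kernel UAPI.
 */
#define UFFD_FEATURE_ACCESS_HINTS	(UINT64_C(1) << 13)

/* Hint bits placed at the same (high) offsets in every ioctl "mode" field. */
#define UFFDIO_MODE_ACCESS_LIKELY	(UINT64_C(1) << 63)
#define UFFDIO_MODE_WRITE_LIKELY	(UINT64_C(1) << 62)
#define UFFDIO_MODE_ACCESS_MASK \
	(UFFDIO_MODE_ACCESS_LIKELY | UFFDIO_MODE_WRITE_LIKELY)

/* Internal flags reuse the same offsets, so no per-bit translation is needed. */
#define UFFD_FLAGS_ACCESS_MASK		UFFDIO_MODE_ACCESS_MASK

/* The masking below is only valid while the two masks stay identical. */
static_assert(UFFDIO_MODE_ACCESS_MASK == UFFD_FLAGS_ACCESS_MASK,
	      "mode hint bits and internal hint flags must share offsets");

/*
 * Translate the hint bits of an ioctl mode into internal flags.
 *
 * - Feature enabled: the hints sit at common offsets in all mode fields,
 *   so they are copied over by masking.
 * - Feature disabled: behave as if the hints had been provided, which
 *   keeps the old behavior for existing users (backward compatibility).
 */
static inline uint64_t uffd_hint_flags(uint64_t features, uint64_t mode)
{
	if (features & UFFD_FEATURE_ACCESS_HINTS)
		return mode & UFFDIO_MODE_ACCESS_MASK;

	return UFFD_FLAGS_ACCESS_MASK;
}
```

Each ioctl handler could then do something like
"uffd_flags |= uffd_hint_flags(ctx->features, uffdio_copy.mode);" after its
own mode validation, instead of repeating the if/else.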

> Is that ok?
> 

-- 
Sincerely yours,
Mike.


Thread overview: 20+ messages
2022-07-18 11:47 [PATCH v2 0/5] userfaultfd: support access/write hints Nadav Amit
2022-07-18 11:47 ` [PATCH v2 2/5] userfaultfd: introduce access-likely mode for common operations Nadav Amit
2022-07-18 20:05   ` Peter Xu
2022-07-18 20:59     ` Nadav Amit
2022-07-18 21:21       ` Peter Xu
2022-07-23  9:16   ` Mike Rapoport
2022-07-25 17:18     ` Nadav Amit
2022-07-26 16:02       ` Mike Rapoport
2022-07-18 11:47 ` [PATCH v2 3/5] userfaultfd: introduce write-likely mode for uffd operations Nadav Amit
2022-07-18 20:12   ` Peter Xu
2022-07-18 20:25     ` Nadav Amit
2022-07-18 21:27       ` Peter Xu
2022-07-18 11:47 ` [PATCH v2 4/5] userfaultfd: zero access/write hints Nadav Amit
2022-07-22  7:47   ` David Hildenbrand
2022-07-18 11:47 ` [PATCH v2 5/5] selftest/userfaultfd: test read/write hints Nadav Amit
     [not found] ` <20220718114748.2623-2-namit@vmware.com>
2022-07-18 20:05   ` [PATCH v2 1/5] userfaultfd: introduce uffd_flags Peter Xu
2022-07-22  7:54   ` David Hildenbrand
2022-07-22 18:47     ` Nadav Amit
2022-07-23  9:12   ` Mike Rapoport
2022-07-25 17:23     ` Nadav Amit
