From: Nadav Amit <nadav.amit@gmail.com>
To: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Nadav Amit <namit@vmware.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Hugh Dickins <hughd@google.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Peter Xu <peterx@redhat.com>,
	David Hildenbrand <david@redhat.com>,
	Mike Rapoport <rppt@linux.ibm.com>
Subject: [PATCH v2 2/5] userfaultfd: introduce access-likely mode for common operations
Date: Mon, 18 Jul 2022 04:47:45 -0700
Message-ID: <20220718114748.2623-3-namit@vmware.com>
In-Reply-To: <20220718114748.2623-1-namit@vmware.com>

From: Nadav Amit <namit@vmware.com>

Introduce access hints in userfaultfd. Userspace is expected to set the
access hint when it resolves a page fault on a page, and to omit the
hint for prefaulted memory. The exact behavior of the kernel in response
to the hints is not part of the userfaultfd API.

For now, the access hint is only used to set the access bit, similarly
to the way it is done in do_set_pte(). On x86, PTEs are currently always
marked as young, including prefaulted ones. On arm64, however, PTEs that
are installed without the hint are marked as old (when the hardware
access bit is supported).

If access hints are not enabled, the kernel behaves as if the access
hint was provided, for backward compatibility.

Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
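Note for readers (not part of the commit message): the hint handling
described above can be modelled in plain C outside the kernel. The
sketch below is a userspace model, not kernel code. The constant values
are copied from this patch; the function names resolve_access_flag()
and pte_installed_old() are hypothetical and exist only for
illustration, and the arch_wants_old argument stands in for
arch_wants_old_prefaulted_pte().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values as defined in this patch (uapi feature and mode bits). */
#define UFFD_FEATURE_ACCESS_HINTS       (1ULL << 13)
#define UFFDIO_COPY_MODE_ACCESS_LIKELY  (1ULL << 2)

/* Kernel-internal flags, modelled here as plain unsigned ints. */
#define UFFD_FLAGS_NONE           0u
#define UFFD_FLAGS_WP             (1u << 0)
#define UFFD_FLAGS_ACCESS_LIKELY  (1u << 1)

static_assert(UFFD_FLAGS_ACCESS_LIKELY != UFFD_FLAGS_WP,
	      "the hint bit must not alias the WP bit");

/*
 * Hypothetical helper mirroring the if/else block added to
 * userfaultfd_copy(): without UFFD_FEATURE_ACCESS_HINTS the kernel
 * always assumes the page is likely to be accessed.
 */
static inline unsigned int resolve_access_flag(uint64_t features,
					       uint64_t mode)
{
	if (!(features & UFFD_FEATURE_ACCESS_HINTS))
		return UFFD_FLAGS_ACCESS_LIKELY;

	return (mode & UFFDIO_COPY_MODE_ACCESS_LIKELY) ?
		UFFD_FLAGS_ACCESS_LIKELY : UFFD_FLAGS_NONE;
}

/*
 * Hypothetical helper mirroring the pte_mkold()/pte_sw_mkyoung()
 * choice in mfill_atomic_install_pte(): returns true when the PTE
 * would be installed as 'old'.
 */
static inline bool pte_installed_old(unsigned int uffd_flags,
				     bool arch_wants_old)
{
	bool prefault = !(uffd_flags & UFFD_FLAGS_ACCESS_LIKELY);

	return prefault && arch_wants_old;
}
```

In this model, a PTE is installed as old only when the feature was
negotiated, the hint was omitted, and the architecture asks for old
prefaulted PTEs. With arch_wants_old == false (the generic default),
the result is always a young PTE.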
 fs/userfaultfd.c                 | 39 ++++++++++++++++++++++++++++----
 include/linux/userfaultfd_k.h    |  1 +
 include/uapi/linux/userfaultfd.h | 20 +++++++++++++++-
 mm/internal.h                    | 13 +++++++++++
 mm/memory.c                      | 12 ----------
 mm/userfaultfd.c                 | 11 +++++++--
 6 files changed, 77 insertions(+), 19 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 2ae24327beec..8d8792b27c53 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1708,13 +1708,21 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 	ret = -EINVAL;
 	if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
 		goto out;
-	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP))
+	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP|
+				 UFFDIO_COPY_MODE_ACCESS_LIKELY))
 		goto out;
 
 	mode_wp = uffdio_copy.mode & UFFDIO_COPY_MODE_WP;
 
 	uffd_flags = mode_wp ? UFFD_FLAGS_WP : UFFD_FLAGS_NONE;
 
+	if (ctx->features & UFFD_FEATURE_ACCESS_HINTS) {
+		if (uffdio_copy.mode & UFFDIO_COPY_MODE_ACCESS_LIKELY)
+			uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+	} else {
+		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+	}
+
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
 				   uffdio_copy.len, &ctx->mmap_changing,
@@ -1765,9 +1773,17 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
 	if (ret)
 		goto out;
 	ret = -EINVAL;
-	if (uffdio_zeropage.mode & ~UFFDIO_ZEROPAGE_MODE_DONTWAKE)
+	if (uffdio_zeropage.mode & ~(UFFDIO_ZEROPAGE_MODE_DONTWAKE|
+				     UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY))
 		goto out;
 
+	if (ctx->features & UFFD_FEATURE_ACCESS_HINTS) {
+		if (uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY)
+			uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+	} else {
+		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+	}
+
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mfill_zeropage(ctx->mm, uffdio_zeropage.range.start,
 				     uffdio_zeropage.range.len,
@@ -1817,7 +1833,8 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 		return ret;
 
 	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
-			       UFFDIO_WRITEPROTECT_MODE_WP))
+			       UFFDIO_WRITEPROTECT_MODE_WP |
+			       UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY))
 		return -EINVAL;
 
 	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
@@ -1827,6 +1844,12 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 		return -EINVAL;
 
 	uffd_flags = mode_wp ? UFFD_FLAGS_WP : UFFD_FLAGS_NONE;
+	if (ctx->features & UFFD_FEATURE_ACCESS_HINTS) {
+		if (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY)
+			uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+	} else {
+		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+	}
 
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
@@ -1879,9 +1902,17 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
 	    uffdio_continue.range.start) {
 		goto out;
 	}
-	if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE)
+	if (uffdio_continue.mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE|
+				     UFFDIO_CONTINUE_MODE_ACCESS_LIKELY))
 		goto out;
 
+	if (ctx->features & UFFD_FEATURE_ACCESS_HINTS) {
+		if (uffdio_continue.mode & UFFDIO_CONTINUE_MODE_ACCESS_LIKELY)
+			uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+	} else {
+		uffd_flags |= UFFD_FLAGS_ACCESS_LIKELY;
+	}
+
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mcopy_continue(ctx->mm, uffdio_continue.range.start,
 				     uffdio_continue.range.len,
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index a63b61823984..b326798b5677 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -59,6 +59,7 @@ typedef unsigned int __bitwise uffd_flags_t;
 
 #define UFFD_FLAGS_NONE			((__force uffd_flags_t)0)
 #define UFFD_FLAGS_WP			((__force uffd_flags_t)BIT(0))
+#define UFFD_FLAGS_ACCESS_LIKELY	((__force uffd_flags_t)BIT(1))
 
 extern int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 				    struct vm_area_struct *dst_vma,
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 7d32b1e797fb..02e0c1f56939 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -34,7 +34,8 @@
 			   UFFD_FEATURE_MINOR_HUGETLBFS |	\
 			   UFFD_FEATURE_MINOR_SHMEM |		\
 			   UFFD_FEATURE_EXACT_ADDRESS |		\
-			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM)
+			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM |	\
+			   UFFD_FEATURE_ACCESS_HINTS)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -199,6 +200,9 @@ struct uffdio_api {
 	 *
 	 * UFFD_FEATURE_WP_HUGETLBFS_SHMEM indicates that userfaultfd
 	 * write-protection mode is supported on both shmem and hugetlbfs.
+	 *
+	 * UFFD_FEATURE_ACCESS_HINTS indicates that the ioctl operations
+	 * support the UFFDIO_*_MODE_ACCESS_LIKELY hints.
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
@@ -213,6 +217,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_MINOR_SHMEM		(1<<10)
 #define UFFD_FEATURE_EXACT_ADDRESS		(1<<11)
 #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM		(1<<12)
+#define UFFD_FEATURE_ACCESS_HINTS		(1<<13)
 	__u64 features;
 
 	__u64 ioctls;
@@ -247,8 +252,14 @@ struct uffdio_copy {
 	 * the fly.  UFFDIO_COPY_MODE_WP is available only if the
 	 * write protected ioctl is implemented for the range
 	 * according to the uffdio_register.ioctls.
+	 *
+	 * UFFDIO_COPY_MODE_ACCESS_LIKELY provides a hint to the kernel that the
+	 * page is likely to be accessed in the near future. Providing the hint
+	 * properly can improve performance.
+	 *
 	 */
 #define UFFDIO_COPY_MODE_WP			((__u64)1<<1)
+#define UFFDIO_COPY_MODE_ACCESS_LIKELY		((__u64)1<<2)
 	__u64 mode;
 
 	/*
@@ -261,6 +272,7 @@ struct uffdio_copy {
 struct uffdio_zeropage {
 	struct uffdio_range range;
 #define UFFDIO_ZEROPAGE_MODE_DONTWAKE		((__u64)1<<0)
+#define UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY	((__u64)1<<1)
 	__u64 mode;
 
 	/*
@@ -280,6 +292,10 @@ struct uffdio_writeprotect {
  * UFFDIO_WRITEPROTECT_MODE_DONTWAKE: set the flag to avoid waking up
  * any wait thread after the operation succeeds.
  *
+ * UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY provides a hint to the kernel
+ * that the page is likely to be accessed in the near future. Providing
+ * the hint properly can improve performance.
+ *
  * NOTE: Write protecting a region (WP=1) is unrelated to page faults,
  * therefore DONTWAKE flag is meaningless with WP=1.  Removing write
  * protection (WP=0) in response to a page fault wakes the faulting
@@ -287,12 +303,14 @@ struct uffdio_writeprotect {
  */
 #define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
 #define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
+#define UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY	((__u64)1<<2)
 	__u64 mode;
 };
 
 struct uffdio_continue {
 	struct uffdio_range range;
 #define UFFDIO_CONTINUE_MODE_DONTWAKE		((__u64)1<<0)
+#define UFFDIO_CONTINUE_MODE_ACCESS_LIKELY	((__u64)1<<1)
 	__u64 mode;
 
 	/*
diff --git a/mm/internal.h b/mm/internal.h
index c0f8fbe0445b..d035b77b4f2f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -12,6 +12,7 @@
 #include <linux/pagemap.h>
 #include <linux/rmap.h>
 #include <linux/tracepoint-defs.h>
+#include <linux/pgtable.h>
 
 struct folio_batch;
 
@@ -861,4 +862,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags);
 
 DECLARE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
 
+#ifndef arch_wants_old_prefaulted_pte
+static inline bool arch_wants_old_prefaulted_pte(void)
+{
+	/*
+	 * Transitioning a PTE from 'old' to 'young' can be expensive on
+	 * some architectures, even if it's performed in hardware. By
+	 * default, "false" means prefaulted entries will be 'young'.
+	 */
+	return false;
+}
+#endif
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memory.c b/mm/memory.c
index 580c62febe42..31ec3f0071a2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -137,18 +137,6 @@ static inline bool arch_faults_on_old_pte(void)
 }
 #endif
 
-#ifndef arch_wants_old_prefaulted_pte
-static inline bool arch_wants_old_prefaulted_pte(void)
-{
-	/*
-	 * Transitioning a PTE from 'old' to 'young' can be expensive on
-	 * some architectures, even if it's performed in hardware. By
-	 * default, "false" means prefaulted entries will be 'young'.
-	 */
-	return false;
-}
-#endif
-
 static int __init disable_randmaps(char *s)
 {
 	randomize_va_space = 0;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 421784d26651..c15679f3eb6a 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -65,6 +65,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	bool writable = dst_vma->vm_flags & VM_WRITE;
 	bool vm_shared = dst_vma->vm_flags & VM_SHARED;
 	bool page_in_cache = page->mapping;
+	bool prefault = !(uffd_flags & UFFD_FLAGS_ACCESS_LIKELY);
 	spinlock_t *ptl;
 	struct inode *inode;
 	pgoff_t offset, max_off;
@@ -92,6 +93,11 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 		 */
 		_dst_pte = pte_wrprotect(_dst_pte);
 
+	if (prefault && arch_wants_old_prefaulted_pte())
+		_dst_pte = pte_mkold(_dst_pte);
+	else
+		_dst_pte = pte_sw_mkyoung(_dst_pte);
+
 	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
 
 	if (vma_is_shmem(dst_vma)) {
@@ -202,7 +208,8 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 static int mfill_zeropage_pte(struct mm_struct *dst_mm,
 			      pmd_t *dst_pmd,
 			      struct vm_area_struct *dst_vma,
-			      unsigned long dst_addr)
+			      unsigned long dst_addr,
+			      uffd_flags_t uffd_flags)
 {
 	pte_t _dst_pte, *dst_pte;
 	spinlock_t *ptl;
@@ -495,7 +502,7 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 					       uffd_flags);
 		else
 			err = mfill_zeropage_pte(dst_mm, dst_pmd,
-						 dst_vma, dst_addr);
+						 dst_vma, dst_addr, uffd_flags);
 	} else {
 		err = shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma,
 					     dst_addr, src_addr,
-- 
2.25.1


