linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RFC] userfaultfd: introduce UFFDIO_COPY_MODE_YOUNG
@ 2022-06-13 20:40 Nadav Amit
  2022-06-14 15:22 ` David Hildenbrand
  0 siblings, 1 reply; 18+ messages in thread
From: Nadav Amit @ 2022-06-13 20:40 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, Nadav Amit, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, David Hildenbrand, Mike Rapoport

From: Nadav Amit <namit@vmware.com>

As we know, using a PTE on x86 with cleared access-bit (aka young-bit)
takes ~600 cycles more than when the access-bit is set. At the same
time, setting the access-bit for memory that is not used (e.g.,
prefetched) can introduce greater overheads, as the prefetched memory is
reclaimed later than it should be.

Userfaultfd currently does not set the access-bit (excluding the
huge-pages case). Arguably, it is best to let the uffd monitor control
whether the access-bit should be set or not. The expected use is for the
monitor to request userfaultfd to set the access-bit when the copy
operation is done to resolve a page-fault, and not to set the young-bit
when the memory is prefetched.

Introduce UFFDIO_COPY_MODE_YOUNG to enable userspace to request the
young bit to be set. For UFFDIO_CONTINUE and UFFDIO_ZEROPAGE set the bit
unconditionally since the former is only used to resolve page-faults and
the latter would not benefit from not setting the access-bit.

Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Nadav Amit <namit@vmware.com>

---

There are 2 possible enhancements:

1. Use the flag to decide on whether to mark the PTE as dirty (for
writable PTEs). I guess that setting the dirty-bit is as expensive as
setting the access-bit, and setting it introduces similar tradeoffs,
as mentioned above.

2. Introduce a similar mode for write-protect and use this information
for setting both the young and dirty bits. Makes one wonder whether
mprotect() should also set the bit in certain cases...
---
 fs/userfaultfd.c                 |  3 +-
 include/linux/hugetlb.h          |  6 ++--
 include/linux/shmem_fs.h         |  6 ++--
 include/linux/userfaultfd_k.h    |  3 +-
 include/uapi/linux/userfaultfd.h | 14 ++++++++-
 mm/hugetlb.c                     |  7 +++--
 mm/shmem.c                       |  6 ++--
 mm/userfaultfd.c                 | 50 ++++++++++++++++++++++----------
 8 files changed, 68 insertions(+), 27 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index aa0c47cb0d16..888a1514b9af 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1721,7 +1721,8 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 	ret = -EINVAL;
 	if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
 		goto out;
-	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP))
+	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP|
+				 UFFDIO_COPY_MODE_YOUNG))
 		goto out;
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ac2a1d758a80..da8276d78464 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -160,7 +160,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
 				unsigned long dst_addr,
 				unsigned long src_addr,
 				enum mcopy_atomic_mode mode,
-				struct page **pagep);
+				struct page **pagep,
+				bool young);
 #endif /* CONFIG_USERFAULTFD */
 bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
 						struct vm_area_struct *vma,
@@ -356,7 +357,8 @@ static inline int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 						unsigned long dst_addr,
 						unsigned long src_addr,
 						enum mcopy_atomic_mode mode,
-						struct page **pagep)
+						struct page **pagep,
+						bool young)
 {
 	BUG();
 	return 0;
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index ab51d3cd39bd..99565bcbe4ba 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -146,10 +146,12 @@ extern int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 				  unsigned long dst_addr,
 				  unsigned long src_addr,
 				  bool zeropage,
-				  struct page **pagep);
+				  struct page **pagep,
+				  bool young);
 #else /* !CONFIG_SHMEM */
 #define shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, \
-			       src_addr, zeropage, pagep)       ({ BUG(); 0; })
+			       src_addr, zeropage, pagep, young) \
+								({ BUG(); 0; })
 #endif /* CONFIG_SHMEM */
 #endif /* CONFIG_USERFAULTFD */
 
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 33cea484d1ad..1ad4dc9668b4 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -56,7 +56,8 @@ enum mcopy_atomic_mode {
 extern int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 				    struct vm_area_struct *dst_vma,
 				    unsigned long dst_addr, struct page *page,
-				    bool newly_allocated, bool wp_copy);
+				    bool newly_allocated, bool wp_copy,
+				    bool young);
 
 extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
 			    unsigned long src_start, unsigned long len,
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index ef739054cb1c..4d86829a732a 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -33,7 +33,8 @@
 			   UFFD_FEATURE_THREAD_ID |		\
 			   UFFD_FEATURE_MINOR_HUGETLBFS |	\
 			   UFFD_FEATURE_MINOR_SHMEM |		\
-			   UFFD_FEATURE_EXACT_ADDRESS)
+			   UFFD_FEATURE_EXACT_ADDRESS |		\
+			   UFFD_FEATURE_YOUNG)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -194,6 +195,9 @@ struct uffdio_api {
 	 * UFFD_FEATURE_EXACT_ADDRESS indicates that the exact address of page
 	 * faults would be provided and the offset within the page would not be
 	 * masked.
+	 *
+	 * UFFD_FEATURE_YOUNG indicates that the copy supports
+	 * UFFDIO_COPY_MODE_YOUNG.
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
@@ -207,6 +211,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_MINOR_HUGETLBFS		(1<<9)
 #define UFFD_FEATURE_MINOR_SHMEM		(1<<10)
 #define UFFD_FEATURE_EXACT_ADDRESS		(1<<11)
+#define UFFD_FEATURE_YOUNG			(1<<12)
 	__u64 features;
 
 	__u64 ioctls;
@@ -250,6 +255,13 @@ struct uffdio_copy {
 	 * copy_from_user will not read the last 8 bytes.
 	 */
 	__s64 copy;
+	/*
+	 * UFFDIO_COPY_MODE_YOUNG will set the mapped page as young. This can
+	 * reduce the time that the first access to the page takes. Yet, if set
+	 * opportunistically to memory that is not used, it might lengthen the
+	 * time before the unused memory pages are reclaimed.
+	 */
+#define UFFDIO_COPY_MODE_YOUNG			((__u64)1<<2)
 };
 
 struct uffdio_zeropage {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3fc721789743..405651b117e6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5746,7 +5746,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 			    unsigned long dst_addr,
 			    unsigned long src_addr,
 			    enum mcopy_atomic_mode mode,
-			    struct page **pagep)
+			    struct page **pagep,
+			    bool young)
 {
 	bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
 	struct hstate *h = hstate_vma(dst_vma);
@@ -5895,7 +5896,9 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	_dst_pte = make_huge_pte(dst_vma, page, writable);
 	if (writable)
 		_dst_pte = huge_pte_mkdirty(_dst_pte);
-	_dst_pte = pte_mkyoung(_dst_pte);
+
+	if (young)
+		_dst_pte = pte_mkyoung(_dst_pte);
 
 	set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 4b2fea33158e..36ccc9609cea 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2319,7 +2319,8 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 			   unsigned long dst_addr,
 			   unsigned long src_addr,
 			   bool zeropage,
-			   struct page **pagep)
+			   struct page **pagep,
+			   bool young)
 {
 	struct inode *inode = file_inode(dst_vma->vm_file);
 	struct shmem_inode_info *info = SHMEM_I(inode);
@@ -2391,7 +2392,7 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 		goto out_release;
 
 	ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
-				       page, true, false);
+				       page, true, false, young);
 	if (ret)
 		goto out_delete_from_cache;
 
@@ -2412,6 +2413,7 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 	shmem_inode_unacct_blocks(inode, 1);
 	return ret;
 }
+
 #endif /* CONFIG_USERFAULTFD */
 
 #ifdef CONFIG_TMPFS
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e9bb6db002aa..9181fe4442c7 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -57,7 +57,8 @@ struct vm_area_struct *find_dst_vma(struct mm_struct *dst_mm,
 int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 			     struct vm_area_struct *dst_vma,
 			     unsigned long dst_addr, struct page *page,
-			     bool newly_allocated, bool wp_copy)
+			     bool newly_allocated, bool wp_copy,
+			     bool young)
 {
 	int ret;
 	pte_t _dst_pte, *dst_pte;
@@ -82,6 +83,9 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	else if (writable)
 		_dst_pte = pte_mkwrite(_dst_pte);
 
+	if (young)
+		pte_mkyoung(_dst_pte);
+
 	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
 
 	if (vma_is_shmem(dst_vma)) {
@@ -130,7 +134,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 			    unsigned long dst_addr,
 			    unsigned long src_addr,
 			    struct page **pagep,
-			    bool wp_copy)
+			    bool wp_copy, bool young)
 {
 	void *page_kaddr;
 	int ret;
@@ -174,7 +178,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 		goto out_release;
 
 	ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
-				       page, true, wp_copy);
+				       page, true, wp_copy, young);
 	if (ret)
 		goto out_release;
 out:
@@ -187,7 +191,8 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 static int mfill_zeropage_pte(struct mm_struct *dst_mm,
 			      pmd_t *dst_pmd,
 			      struct vm_area_struct *dst_vma,
-			      unsigned long dst_addr)
+			      unsigned long dst_addr,
+			      bool young)
 {
 	pte_t _dst_pte, *dst_pte;
 	spinlock_t *ptl;
@@ -210,6 +215,10 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm,
 	ret = -EEXIST;
 	if (!pte_none(*dst_pte))
 		goto out_unlock;
+
+	if (young)
+		pte_mkyoung(_dst_pte);
+
 	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(dst_vma, dst_addr, dst_pte);
@@ -245,7 +254,7 @@ static int mcontinue_atomic_pte(struct mm_struct *dst_mm,
 	}
 
 	ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
-				       page, false, wp_copy);
+				       page, false, wp_copy, false);
 	if (ret)
 		goto out_release;
 
@@ -290,7 +299,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 					      unsigned long dst_start,
 					      unsigned long src_start,
 					      unsigned long len,
-					      enum mcopy_atomic_mode mode)
+					      enum mcopy_atomic_mode mode,
+					      bool young)
 {
 	int vm_shared = dst_vma->vm_flags & VM_SHARED;
 	ssize_t err;
@@ -386,7 +396,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		}
 
 		err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
-					       dst_addr, src_addr, mode, &page);
+					       dst_addr, src_addr, mode, &page,
+					       young);
 
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		i_mmap_unlock_read(mapping);
@@ -441,7 +452,8 @@ extern ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 				      unsigned long dst_start,
 				      unsigned long src_start,
 				      unsigned long len,
-				      enum mcopy_atomic_mode mode);
+				      enum mcopy_atomic_mode mode,
+				      bool young);
 #endif /* CONFIG_HUGETLB_PAGE */
 
 static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
@@ -451,7 +463,8 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 						unsigned long src_addr,
 						struct page **page,
 						enum mcopy_atomic_mode mode,
-						bool wp_copy)
+						bool wp_copy,
+						bool young)
 {
 	ssize_t err;
 
@@ -474,16 +487,16 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 		if (mode == MCOPY_ATOMIC_NORMAL)
 			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
 					       dst_addr, src_addr, page,
-					       wp_copy);
+					       wp_copy, young);
 		else
 			err = mfill_zeropage_pte(dst_mm, dst_pmd,
-						 dst_vma, dst_addr);
+						 dst_vma, dst_addr, young);
 	} else {
 		VM_WARN_ON_ONCE(wp_copy);
 		err = shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma,
 					     dst_addr, src_addr,
 					     mode != MCOPY_ATOMIC_NORMAL,
-					     page);
+					     page, young);
 	}
 
 	return err;
@@ -504,6 +517,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	long copied;
 	struct page *page;
 	bool wp_copy;
+	bool young;
 
 	/*
 	 * Sanitize the command parameters:
@@ -557,12 +571,15 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	if (wp_copy && !(dst_vma->vm_flags & VM_UFFD_WP))
 		goto out_unlock;
 
+	young = mode & UFFDIO_COPY_MODE_YOUNG;
+
 	/*
 	 * If this is a HUGETLB vma, pass off to appropriate routine
 	 */
 	if (is_vm_hugetlb_page(dst_vma))
 		return  __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
-						src_start, len, mcopy_mode);
+						src_start, len, mcopy_mode,
+						young);
 
 	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
 		goto out_unlock;
@@ -614,7 +631,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 		BUG_ON(pmd_trans_huge(*dst_pmd));
 
 		err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
-				       src_addr, &page, mcopy_mode, wp_copy);
+				       src_addr, &page, mcopy_mode, wp_copy,
+				       young);
 		cond_resched();
 
 		if (unlikely(err == -ENOENT)) {
@@ -672,14 +690,14 @@ ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
 		       unsigned long len, atomic_t *mmap_changing)
 {
 	return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_ZEROPAGE,
-			      mmap_changing, 0);
+			      mmap_changing, UFFDIO_COPY_MODE_YOUNG);
 }
 
 ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long start,
 		       unsigned long len, atomic_t *mmap_changing)
 {
 	return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_CONTINUE,
-			      mmap_changing, 0);
+			      mmap_changing, UFFDIO_COPY_MODE_YOUNG);
 }
 
 int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC] userfaultfd: introduce UFFDIO_COPY_MODE_YOUNG
  2022-06-13 20:40 [PATCH RFC] userfaultfd: introduce UFFDIO_COPY_MODE_YOUNG Nadav Amit
@ 2022-06-14 15:22 ` David Hildenbrand
  2022-06-14 16:18   ` Nadav Amit
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2022-06-14 15:22 UTC (permalink / raw)
  To: Nadav Amit, Peter Xu
  Cc: linux-mm, Nadav Amit, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, Mike Rapoport

On 13.06.22 22:40, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> As we know, using a PTE on x86 with cleared access-bit (aka young-bit)
> takes ~600 cycles more than when the access-bit is set. At the same
> time, setting the access-bit for memory that is not used (e.g.,
> prefetched) can introduce greater overheads, as the prefetched memory is
> reclaimed later than it should be.
> 
> Userfaultfd currently does not set the access-bit (excluding the
> huge-pages case). Arguably, it is best to let the uffd monitor control
> whether the access-bit should be set or not. The expected use is for the
> monitor to request userfaultfd to set the access-bit when the copy
> operation is done to resolve a page-fault, and not to set the young-bit
> when the memory is prefetched.

Thinking out loud about existing users: postcopy live migration in QEMU
has two usage for placement of pages

a) Resolving a fault. E.g., a VCPU might be waiting for resolution to
   make progress.
b) Background migration to converge without faults on all relevant
   pages.

I guess in a) we'd want UFFDIO_COPY_MODE_YOUNG in b) we don't want it.


I wonder, however, instead of calling this "young", which implies what
the OS should or shouldn't do, to define this as a hint that the placed
page is very likely to be accessed next.

I'm bad at naming, UFFDIO_COPY_MODE_ACCESS_LIKELY would express what I
have in mind.

> 
> Introduce UFFDIO_COPY_MODE_YOUNG to enable userspace to request the
> young bit to be set. For UFFDIO_CONTINUE and UFFDIO_ZEROPAGE set the bit
> unconditionally since the former is only used to resolve page-faults and
> the latter would not benefit from not setting the access-bit.
> 
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> 
> ---
> 
> There are 2 possible enhancements:
> 
> 1. Use the flag to decide on whether to mark the PTE as dirty (for
> writable PTEs). I guess that setting the dirty-bit is as expensive as
> setting the access-bit, and setting it introduces similar tradeoffs,
> as mentioned above.
> 
> 2. Introduce a similar mode for write-protect and use this information
> for setting both the young and dirty bits. Makes one wonder whether
> mprotect() should also set the bit in certain cases...

I wonder if UFFDIO_COPY_MODE_READ_ACCESS_LIKELY vs.
UFFDIO_COPY_WRITE_ACCESS_LIKELY could evenmake sense. I feel like it could.

For example, QEMU knows if a page fault it's resolving was due to a read
or a write fault and could use that information accordingly. Of course,
we don't completely know if we currently have a read fault, if we could
get a write fault immediately after.

Especially in the context of UFFDIO_ZEROPAGE,
UFFDIO_ZEROPAGE_WRITE_ACCESS_LIKELY could ... not place the zeropage but
instead populate an actual page and mark it accessed+dirty. I even have
a use case for that ;)


The kernel could decide how to treat these hints -- for example, if it
doesn't want user space to mess with access/dirty bits, it could just
mostly ignore the hints.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC] userfaultfd: introduce UFFDIO_COPY_MODE_YOUNG
  2022-06-14 15:22 ` David Hildenbrand
@ 2022-06-14 16:18   ` Nadav Amit
  2022-06-14 17:14     ` David Hildenbrand
  2022-06-14 18:56     ` Mike Rapoport
  0 siblings, 2 replies; 18+ messages in thread
From: Nadav Amit @ 2022-06-14 16:18 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, Mike Rapoport

On Jun 14, 2022, at 8:22 AM, David Hildenbrand <david@redhat.com> wrote:

> On 13.06.22 22:40, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> As we know, using a PTE on x86 with cleared access-bit (aka young-bit)
>> takes ~600 cycles more than when the access-bit is set. At the same
>> time, setting the access-bit for memory that is not used (e.g.,
>> prefetched) can introduce greater overheads, as the prefetched memory is
>> reclaimed later than it should be.
>> 
>> Userfaultfd currently does not set the access-bit (excluding the
>> huge-pages case). Arguably, it is best to let the uffd monitor control
>> whether the access-bit should be set or not. The expected use is for the
>> monitor to request userfaultfd to set the access-bit when the copy
>> operation is done to resolve a page-fault, and not to set the young-bit
>> when the memory is prefetched.
> 
> Thinking out loud about existing users: postcopy live migration in QEMU
> has two usage for placement of pages
> 
> a) Resolving a fault. E.g., a VCPU might be waiting for resolution to
> make progress.
> b) Background migration to converge without faults on all relevant
> pages.
> 
> I guess in a) we'd want UFFDIO_COPY_MODE_YOUNG in b) we don't want it.
> 
> 
> I wonder, however, instead of calling this "young", which implies what
> the OS should or shouldn't do, to define this as a hint that the placed
> page is very likely to be accessed next.
> 
> I'm bad at naming, UFFDIO_COPY_MODE_ACCESS_LIKELY would express what I
> have in mind.

How about UFFDIO_COPY_MODE_WILLNEED_READ ?

> 
>> Introduce UFFDIO_COPY_MODE_YOUNG to enable userspace to request the
>> young bit to be set. For UFFDIO_CONTINUE and UFFDIO_ZEROPAGE set the bit
>> unconditionally since the former is only used to resolve page-faults and
>> the latter would not benefit from not setting the access-bit.
>> 
>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Axel Rasmussen <axelrasmussen@google.com>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Mike Rapoport <rppt@linux.ibm.com>
>> Signed-off-by: Nadav Amit <namit@vmware.com>
>> 
>> ---
>> 
>> There are 2 possible enhancements:
>> 
>> 1. Use the flag to decide on whether to mark the PTE as dirty (for
>> writable PTEs). I guess that setting the dirty-bit is as expensive as
>> setting the access-bit, and setting it introduces similar tradeoffs,
>> as mentioned above.
>> 
>> 2. Introduce a similar mode for write-protect and use this information
>> for setting both the young and dirty bits. Makes one wonder whether
>> mprotect() should also set the bit in certain cases...
> 
> I wonder if UFFDIO_COPY_MODE_READ_ACCESS_LIKELY vs.
> UFFDIO_COPY_WRITE_ACCESS_LIKELY could evenmake sense. I feel like it could.
> 
> For example, QEMU knows if a page fault it's resolving was due to a read
> or a write fault and could use that information accordingly. Of course,
> we don't completely know if we currently have a read fault, if we could
> get a write fault immediately after.
> 
> Especially in the context of UFFDIO_ZEROPAGE,
> UFFDIO_ZEROPAGE_WRITE_ACCESS_LIKELY could ... not place the zeropage but
> instead populate an actual page and mark it accessed+dirty. I even have
> a use case for that ;)
> 
> 
> The kernel could decide how to treat these hints -- for example, if it
> doesn't want user space to mess with access/dirty bits, it could just
> mostly ignore the hints.

I can do that. I think users can do the zero page-copy themselves today, but
whatever you prefer.

But, I cannot take it anymore: the list of arguments for uffd stuff is
crazy. I would like to collect all the possible arguments that are used for
uffd operation into some “struct uffd_op”.

Any objection?




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC] userfaultfd: introduce UFFDIO_COPY_MODE_YOUNG
  2022-06-14 16:18   ` Nadav Amit
@ 2022-06-14 17:14     ` David Hildenbrand
  2022-06-14 18:56     ` Mike Rapoport
  1 sibling, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2022-06-14 17:14 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Xu, Linux MM, Mike Kravetz, Hugh Dickins, Andrew Morton,
	Axel Rasmussen, Mike Rapoport

[...]

>> I'm bad at naming, UFFDIO_COPY_MODE_ACCESS_LIKELY would express what I
>> have in mind.
> 
> How about UFFDIO_COPY_MODE_WILLNEED_READ ?

Would work for me.

> 
>>
>>> Introduce UFFDIO_COPY_MODE_YOUNG to enable userspace to request the
>>> young bit to be set. For UFFDIO_CONTINUE and UFFDIO_ZEROPAGE set the bit
>>> unconditionally since the former is only used to resolve page-faults and
>>> the latter would not benefit from not setting the access-bit.
>>>
>>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>>> Cc: Hugh Dickins <hughd@google.com>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Axel Rasmussen <axelrasmussen@google.com>
>>> Cc: Peter Xu <peterx@redhat.com>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: Mike Rapoport <rppt@linux.ibm.com>
>>> Signed-off-by: Nadav Amit <namit@vmware.com>
>>>
>>> ---
>>>
>>> There are 2 possible enhancements:
>>>
>>> 1. Use the flag to decide on whether to mark the PTE as dirty (for
>>> writable PTEs). I guess that setting the dirty-bit is as expensive as
>>> setting the access-bit, and setting it introduces similar tradeoffs,
>>> as mentioned above.
>>>
>>> 2. Introduce a similar mode for write-protect and use this information
>>> for setting both the young and dirty bits. Makes one wonder whether
>>> mprotect() should also set the bit in certain cases...
>>
>> I wonder if UFFDIO_COPY_MODE_READ_ACCESS_LIKELY vs.
>> UFFDIO_COPY_WRITE_ACCESS_LIKELY could evenmake sense. I feel like it could.
>>
>> For example, QEMU knows if a page fault it's resolving was due to a read
>> or a write fault and could use that information accordingly. Of course,
>> we don't completely know if we currently have a read fault, if we could
>> get a write fault immediately after.
>>
>> Especially in the context of UFFDIO_ZEROPAGE,
>> UFFDIO_ZEROPAGE_WRITE_ACCESS_LIKELY could ... not place the zeropage but
>> instead populate an actual page and mark it accessed+dirty. I even have
>> a use case for that ;)
>>
>>
>> The kernel could decide how to treat these hints -- for example, if it
>> doesn't want user space to mess with access/dirty bits, it could just
>> mostly ignore the hints.
> 
> I can do that. I think users can do the zero page-copy themselves today, but
> whatever you prefer.

Just so we're on the same page and I'm not missing some smart way: it
would have to provide a zeroed buffer in user space, and trigger the
copy via UFFDIO_COPY.

Instead, the kernel can simply clear the user page directly when
allocating from the buddy, instead of eventually zeroing it and then
copying from a zeroed user buffer / zeropage.


> 
> But, I cannot take it anymore: the list of arguments for uffd stuff is
> crazy. I would like to collect all the possible arguments that are used for
> uffd operation into some “struct uffd_op”.
> 
> Any objection?

Not from my side, as long as it doesn't break uapi.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC] userfaultfd: introduce UFFDIO_COPY_MODE_YOUNG
  2022-06-14 16:18   ` Nadav Amit
  2022-06-14 17:14     ` David Hildenbrand
@ 2022-06-14 18:56     ` Mike Rapoport
  2022-06-14 19:25       ` Nadav Amit
  2022-06-14 20:40       ` John Hubbard
  1 sibling, 2 replies; 18+ messages in thread
From: Mike Rapoport @ 2022-06-14 18:56 UTC (permalink / raw)
  To: Nadav Amit
  Cc: David Hildenbrand, Peter Xu, Linux MM, Mike Kravetz,
	Hugh Dickins, Andrew Morton, Axel Rasmussen

On Tue, Jun 14, 2022 at 09:18:43AM -0700, Nadav Amit wrote:
> On Jun 14, 2022, at 8:22 AM, David Hildenbrand <david@redhat.com> wrote:
> 
> > On 13.06.22 22:40, Nadav Amit wrote:
> >> From: Nadav Amit <namit@vmware.com>
> >> 
> >> As we know, using a PTE on x86 with cleared access-bit (aka young-bit)
> >> takes ~600 cycles more than when the access-bit is set. At the same
> >> time, setting the access-bit for memory that is not used (e.g.,
> >> prefetched) can introduce greater overheads, as the prefetched memory is
> >> reclaimed later than it should be.
> >> 
> >> Userfaultfd currently does not set the access-bit (excluding the
> >> huge-pages case). Arguably, it is best to let the uffd monitor control
> >> whether the access-bit should be set or not. The expected use is for the
> >> monitor to request userfaultfd to set the access-bit when the copy
> >> operation is done to resolve a page-fault, and not to set the young-bit
> >> when the memory is prefetched.
> > 
> > Thinking out loud about existing users: postcopy live migration in QEMU
> > has two usage for placement of pages
> > 
> > a) Resolving a fault. E.g., a VCPU might be waiting for resolution to
> > make progress.
> > b) Background migration to converge without faults on all relevant
> > pages.
> > 
> > I guess in a) we'd want UFFDIO_COPY_MODE_YOUNG in b) we don't want it.
> > 
> > 
> > I wonder, however, instead of calling this "young", which implies what
> > the OS should or shouldn't do, to define this as a hint that the placed
> > page is very likely to be accessed next.
> > 
> > I'm bad at naming, UFFDIO_COPY_MODE_ACCESS_LIKELY would express what I
> > have in mind.
> 
> How about UFFDIO_COPY_MODE_WILLNEED_READ ?
> 
> > 
> >> Introduce UFFDIO_COPY_MODE_YOUNG to enable userspace to request the
> >> young bit to be set. For UFFDIO_CONTINUE and UFFDIO_ZEROPAGE set the bit
> >> unconditionally since the former is only used to resolve page-faults and
> >> the latter would not benefit from not setting the access-bit.
> >> 
> >> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> >> Cc: Hugh Dickins <hughd@google.com>
> >> Cc: Andrew Morton <akpm@linux-foundation.org>
> >> Cc: Axel Rasmussen <axelrasmussen@google.com>
> >> Cc: Peter Xu <peterx@redhat.com>
> >> Cc: David Hildenbrand <david@redhat.com>
> >> Cc: Mike Rapoport <rppt@linux.ibm.com>
> >> Signed-off-by: Nadav Amit <namit@vmware.com>
> >> 
> >> ---
> >> 
> >> There are 2 possible enhancements:
> >> 
> >> 1. Use the flag to decide on whether to mark the PTE as dirty (for
> >> writable PTEs). I guess that setting the dirty-bit is as expensive as
> >> setting the access-bit, and setting it introduces similar tradeoffs,
> >> as mentioned above.
> >> 
> >> 2. Introduce a similar mode for write-protect and use this information
> >> for setting both the young and dirty bits. Makes one wonder whether
> >> mprotect() should also set the bit in certain cases...
> > 
> > I wonder if UFFDIO_COPY_MODE_READ_ACCESS_LIKELY vs.
> > UFFDIO_COPY_WRITE_ACCESS_LIKELY could evenmake sense. I feel like it could.
> > 
> > For example, QEMU knows if a page fault it's resolving was due to a read
> > or a write fault and could use that information accordingly. Of course,
> > we don't completely know if we currently have a read fault, if we could
> > get a write fault immediately after.
> > 
> > Especially in the context of UFFDIO_ZEROPAGE,
> > UFFDIO_ZEROPAGE_WRITE_ACCESS_LIKELY could ... not place the zeropage but
> > instead populate an actual page and mark it accessed+dirty. I even have
> > a use case for that ;)
> > 
> > 
> > The kernel could decide how to treat these hints -- for example, if it
> > doesn't want user space to mess with access/dirty bits, it could just
> > mostly ignore the hints.
> 
> I can do that. I think users can do the zero page-copy themselves today, but
> whatever you prefer.
> 
> But, I cannot take it anymore: the list of arguments for uffd stuff is
> crazy. I would like to collect all the possible arguments that are used for
> uffd operation into some “struct uffd_op”.

Squashing boolean parameters into int flags will also reduce the insane
amount of parameters. No strong feelings though.
 
> Any objection?
> 
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC] userfaultfd: introduce UFFDIO_COPY_MODE_YOUNG
  2022-06-14 18:56     ` Mike Rapoport
@ 2022-06-14 19:25       ` Nadav Amit
  2022-06-14 20:40       ` John Hubbard
  1 sibling, 0 replies; 18+ messages in thread
From: Nadav Amit @ 2022-06-14 19:25 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: David Hildenbrand, Peter Xu, Linux MM, Mike Kravetz,
	Hugh Dickins, Andrew Morton, Axel Rasmussen

On Jun 14, 2022, at 11:56 AM, Mike Rapoport <rppt@linux.ibm.com> wrote:

> On Tue, Jun 14, 2022 at 09:18:43AM -0700, Nadav Amit wrote:
>> On Jun 14, 2022, at 8:22 AM, David Hildenbrand <david@redhat.com> wrote:
>> 
>>> On 13.06.22 22:40, Nadav Amit wrote:
>>>> From: Nadav Amit <namit@vmware.com>
>>>> 
>>>> As we know, using a PTE on x86 with cleared access-bit (aka young-bit)
>>>> takes ~600 cycles more than when the access-bit is set. At the same
>>>> time, setting the access-bit for memory that is not used (e.g.,
>>>> prefetched) can introduce greater overheads, as the prefetched memory is
>>>> reclaimed later than it should be.
>>>> 
>>>> Userfaultfd currently does not set the access-bit (excluding the
>>>> huge-pages case). Arguably, it is best to let the uffd monitor control
>>>> whether the access-bit should be set or not. The expected use is for the
>>>> monitor to request userfaultfd to set the access-bit when the copy
>>>> operation is done to resolve a page-fault, and not to set the young-bit
>>>> when the memory is prefetched.
>>> 
>>> Thinking out loud about existing users: postcopy live migration in QEMU
>>> has two usage for placement of pages
>>> 
>>> a) Resolving a fault. E.g., a VCPU might be waiting for resolution to
>>> make progress.
>>> b) Background migration to converge without faults on all relevant
>>> pages.
>>> 
>>> I guess in a) we'd want UFFDIO_COPY_MODE_YOUNG in b) we don't want it.
>>> 
>>> 
>>> I wonder, however, instead of calling this "young", which implies what
>>> the OS should or shouldn't do, to define this as a hint that the placed
>>> page is very likely to be accessed next.
>>> 
>>> I'm bad at naming, UFFDIO_COPY_MODE_ACCESS_LIKELY would express what I
>>> have in mind.
>> 
>> How about UFFDIO_COPY_MODE_WILLNEED_READ ?
>> 
>>>> Introduce UFFDIO_COPY_MODE_YOUNG to enable userspace to request the
>>>> young bit to be set. For UFFDIO_CONTINUE and UFFDIO_ZEROPAGE set the bit
>>>> unconditionally since the former is only used to resolve page-faults and
>>>> the latter would not benefit from not setting the access-bit.
>>>> 
>>>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>>>> Cc: Hugh Dickins <hughd@google.com>
>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>> Cc: Axel Rasmussen <axelrasmussen@google.com>
>>>> Cc: Peter Xu <peterx@redhat.com>
>>>> Cc: David Hildenbrand <david@redhat.com>
>>>> Cc: Mike Rapoport <rppt@linux.ibm.com>
>>>> Signed-off-by: Nadav Amit <namit@vmware.com>
>>>> 
>>>> ---
>>>> 
>>>> There are 2 possible enhancements:
>>>> 
>>>> 1. Use the flag to decide on whether to mark the PTE as dirty (for
>>>> writable PTEs). I guess that setting the dirty-bit is as expensive as
>>>> setting the access-bit, and setting it introduces similar tradeoffs,
>>>> as mentioned above.
>>>> 
>>>> 2. Introduce a similar mode for write-protect and use this information
>>>> for setting both the young and dirty bits. Makes one wonder whether
>>>> mprotect() should also set the bit in certain cases...
>>> 
>>> I wonder if UFFDIO_COPY_MODE_READ_ACCESS_LIKELY vs.
>>> UFFDIO_COPY_WRITE_ACCESS_LIKELY could evenmake sense. I feel like it could.
>>> 
>>> For example, QEMU knows if a page fault it's resolving was due to a read
>>> or a write fault and could use that information accordingly. Of course,
>>> we don't completely know if we currently have a read fault, if we could
>>> get a write fault immediately after.
>>> 
>>> Especially in the context of UFFDIO_ZEROPAGE,
>>> UFFDIO_ZEROPAGE_WRITE_ACCESS_LIKELY could ... not place the zeropage but
>>> instead populate an actual page and mark it accessed+dirty. I even have
>>> a use case for that ;)
>>> 
>>> 
>>> The kernel could decide how to treat these hints -- for example, if it
>>> doesn't want user space to mess with access/dirty bits, it could just
>>> mostly ignore the hints.
>> 
>> I can do that. I think users can do the zero page-copy themselves today, but
>> whatever you prefer.
>> 
>> But, I cannot take it anymore: the list of arguments for uffd stuff is
>> crazy. I would like to collect all the possible arguments that are used for
>> uffd operation into some “struct uffd_op”.
> 
> Squashing boolean parameters into int flags will also reduce the insane
> amount of parameters. No strong feelings though.
> 
>> Any objection?

Thanks. I also noticed a couple of embarrassing bugs that I made. Will send
v1 with fixes.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC] userfaultfd: introduce UFFDIO_COPY_MODE_YOUNG
  2022-06-14 18:56     ` Mike Rapoport
  2022-06-14 19:25       ` Nadav Amit
@ 2022-06-14 20:40       ` John Hubbard
  2022-06-14 20:56         ` Nadav Amit
  1 sibling, 1 reply; 18+ messages in thread
From: John Hubbard @ 2022-06-14 20:40 UTC (permalink / raw)
  To: Mike Rapoport, Nadav Amit
  Cc: David Hildenbrand, Peter Xu, Linux MM, Mike Kravetz,
	Hugh Dickins, Andrew Morton, Axel Rasmussen

On 6/14/22 11:56, Mike Rapoport wrote:
>> But, I cannot take it anymore: the list of arguments for uffd stuff is
>> crazy. I would like to collect all the possible arguments that are used for
>> uffd operation into some “struct uffd_op”.
> 
> Squashing boolean parameters into int flags will also reduce the insane
> amount of parameters. No strong feelings though.
>   

Just a quick drive-by comment about boolean arguments: they ruin the
readability of the call sites. In practice, sometimes a single boolean
argument can be OK-ish (still poor to read at the call site, but easier
to code initially), but once you get past one boolean argument in the
function, readability is hopeless:

     foo(ptr, true, false, a == b);

So if you have a choice, I implore you to prefer flags and/or enums. :)


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC] userfaultfd: introduce UFFDIO_COPY_MODE_YOUNG
  2022-06-14 20:40       ` John Hubbard
@ 2022-06-14 20:56         ` Nadav Amit
  2022-06-14 21:40           ` John Hubbard
  2022-06-15  7:26           ` Mike Rapoport
  0 siblings, 2 replies; 18+ messages in thread
From: Nadav Amit @ 2022-06-14 20:56 UTC (permalink / raw)
  To: John Hubbard
  Cc: Mike Rapoport, David Hildenbrand, Peter Xu, Linux MM,
	Mike Kravetz, Hugh Dickins, Andrew Morton, Axel Rasmussen

On Jun 14, 2022, at 1:40 PM, John Hubbard <jhubbard@nvidia.com> wrote:

> On 6/14/22 11:56, Mike Rapoport wrote:
>>> But, I cannot take it anymore: the list of arguments for uffd stuff is
>>> crazy. I would like to collect all the possible arguments that are used for
>>> uffd operation into some “struct uffd_op”.
>> Squashing boolean parameters into int flags will also reduce the insane
>> amount of parameters. No strong feelings though.
>>  
> 
> Just a quick drive-by comment about boolean arguments: they ruin the
> readability of the call sites. In practice, sometimes a single boolean
> argument can be OK-ish (still poor to read at the call site, but easier
> to code initially), but once you get past one boolean argument in the
> function, readability is hopeless:
> 
>    foo(ptr, true, false, a == b);
> 
> So if you have a choice, I implore you to prefer flags and/or enums. :)

Thanks for the feedback - I am aware it is very confusing to have booleans
and especially multiple ones in a func call.

Just not sure how it maps to what I proposed. I thought of passing as an
argument reference (pointer) to something similar to the following struct,
which I think is very self-descriptive:

struct uffd_op {
	/* various fields */
	struct vm_area_struct *dst_vma;
	unsigned long len;
	atomic_t *mmap_changing;

	...
	
	/* ... and some flags */
	int wp: 1;
	int zero: 1;
	int read_likely: 1;

	...
};

I think that fits what you were asking for. The only thing I am not sure of,
is whether to include in uffd_op fields that are internal to mm/userfaultfd
such as “page” and “newly_allocated”. I guess not.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC] userfaultfd: introduce UFFDIO_COPY_MODE_YOUNG
  2022-06-14 20:56         ` Nadav Amit
@ 2022-06-14 21:40           ` John Hubbard
  2022-06-14 21:52             ` Nadav Amit
  2022-06-15  7:26           ` Mike Rapoport
  1 sibling, 1 reply; 18+ messages in thread
From: John Hubbard @ 2022-06-14 21:40 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Mike Rapoport, David Hildenbrand, Peter Xu, Linux MM,
	Mike Kravetz, Hugh Dickins, Andrew Morton, Axel Rasmussen

On 6/14/22 13:56, Nadav Amit wrote:
...
>> So if you have a choice, I implore you to prefer flags and/or enums. :)
> 
> Thanks for the feedback - I am aware it is very confusing to have booleans
> and especially multiple ones in a func call.
> 
> Just not sure how it maps to what I proposed. I thought of passing as an
> argument reference (pointer) to something similar to the following struct,
> which I think is very self-descriptive:
> 
> struct uffd_op {
> 	/* various fields */
> 	struct vm_area_struct *dst_vma;
> 	unsigned long len;
> 	atomic_t *mmap_changing;
> 
> 	...
> 	
> 	/* ... and some flags */
> 	int wp: 1;
> 	int zero: 1;
> 	int read_likely: 1;

I am more accustomed to seeing:
	unsigned int flags;

...and then some #defines or enums nearby that are used for .flags.
The bitfields are not used as much, Linus wrote some words about why,
(which I'm not hopeful I can find). Basically they are not a very
robust C feature, and the kernel has good support for dealing with
flags within a word.

> 
> 	...
> };
> 
> I think that fits what you were asking for. The only thing I am not sure of,
> is whether to include in uffd_op fields that are internal to mm/userfaultfd
> such as “page” and “newly_allocated”. I guess not.
> 

Actually, I think passing around a struct might be overkill, when you can
simply collapse the various boolean args into a single flags arg. It looked
like a lot of the new args were bools...

thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC] userfaultfd: introduce UFFDIO_COPY_MODE_YOUNG
  2022-06-14 21:40           ` John Hubbard
@ 2022-06-14 21:52             ` Nadav Amit
  2022-06-14 21:59               ` John Hubbard
  0 siblings, 1 reply; 18+ messages in thread
From: Nadav Amit @ 2022-06-14 21:52 UTC (permalink / raw)
  To: John Hubbard
  Cc: Mike Rapoport, David Hildenbrand, Peter Xu, Linux MM,
	Mike Kravetz, Hugh Dickins, Andrew Morton, Axel Rasmussen

On Jun 14, 2022, at 2:40 PM, John Hubbard <jhubbard@nvidia.com> wrote:

> On 6/14/22 13:56, Nadav Amit wrote:
> ...
>>> So if you have a choice, I implore you to prefer flags and/or enums. :)
>> Thanks for the feedback - I am aware it is very confusing to have booleans
>> and especially multiple ones in a func call.
>> Just not sure how it maps to what I proposed. I thought of passing as an
>> argument reference (pointer) to something similar to the following struct,
>> which I think is very self-descriptive:
>> struct uffd_op {
>> 	/* various fields */
>> 	struct vm_area_struct *dst_vma;
>> 	unsigned long len;
>> 	atomic_t *mmap_changing;
>> 	...
>> 	
>> 	/* ... and some flags */
>> 	int wp: 1;
>> 	int zero: 1;
>> 	int read_likely: 1;
> 
> I am more accustomed to seeing:
> 	unsigned int flags;
> 
> ...and then some #defines or enums nearby that are used for .flags.
> The bitfields are not used as much, Linus wrote some words about why,
> (which I'm not hopeful I can find). Basically they are not a very
> robust C feature, and the kernel has good support for dealing with
> flags within a word.
> 

Yes, Linus wrote some harsh words about it to me specifically. But my
understanding was the he opposes to the use of bitfields if they are only
used for flags. In this case it is not only flags, so there are not
problems, for instance in initialization of parameter. There are many
instances for such use (e.g., tcp_options_received)..

>> ...
>> };
>> I think that fits what you were asking for. The only thing I am not sure of,
>> is whether to include in uffd_op fields that are internal to mm/userfaultfd
>> such as “page” and “newly_allocated”. I guess not.
> 
> Actually, I think passing around a struct might be overkill, when you can
> simply collapse the various boolean args into a single flags arg. It looked
> like a lot of the new args were bools...

Ok. Whatever you prefer. I thought that having something similar to “struct
vm_fault” makes sense, especially since it would allow to avoid propagating
ugly arguments like mmap_changing. But I do not care enough to argue.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC] userfaultfd: introduce UFFDIO_COPY_MODE_YOUNG
  2022-06-14 21:52             ` Nadav Amit
@ 2022-06-14 21:59               ` John Hubbard
  0 siblings, 0 replies; 18+ messages in thread
From: John Hubbard @ 2022-06-14 21:59 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Mike Rapoport, David Hildenbrand, Peter Xu, Linux MM,
	Mike Kravetz, Hugh Dickins, Andrew Morton, Axel Rasmussen

On 6/14/22 14:52, Nadav Amit wrote:
>> Actually, I think passing around a struct might be overkill, when you can
>> simply collapse the various boolean args into a single flags arg. It looked
>> like a lot of the new args were bools...
> 
> Ok. Whatever you prefer. I thought that having something similar to “struct
> vm_fault” makes sense, especially since it would allow to avoid propagating
> ugly arguments like mmap_changing. But I do not care enough to argue.

Actually, I think you could make a case for passing around a struct.
The argument list is getting longer. I don't actually have an opinion
on that point, other than to point out the obvious: if you usually need
most of the struct fields (args) in each function call, then a struct
might be an improvement. But if you only need a few, then a struct passes
up the opportunity to have some function calls that are much simpler
and shorter.

It's the flags point that I really think you want to stick with, either
a flags arg, or a .flags field.

thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC] userfaultfd: introduce UFFDIO_COPY_MODE_YOUNG
  2022-06-14 20:56         ` Nadav Amit
  2022-06-14 21:40           ` John Hubbard
@ 2022-06-15  7:26           ` Mike Rapoport
  2022-06-15 15:43             ` Peter Xu
  1 sibling, 1 reply; 18+ messages in thread
From: Mike Rapoport @ 2022-06-15  7:26 UTC (permalink / raw)
  To: Nadav Amit
  Cc: John Hubbard, David Hildenbrand, Peter Xu, Linux MM,
	Mike Kravetz, Hugh Dickins, Andrew Morton, Axel Rasmussen

On Tue, Jun 14, 2022 at 01:56:56PM -0700, Nadav Amit wrote:
> On Jun 14, 2022, at 1:40 PM, John Hubbard <jhubbard@nvidia.com> wrote:
> 
> > On 6/14/22 11:56, Mike Rapoport wrote:
> >>> But, I cannot take it anymore: the list of arguments for uffd stuff is
> >>> crazy. I would like to collect all the possible arguments that are used for
> >>> uffd operation into some “struct uffd_op”.
> >> Squashing boolean parameters into int flags will also reduce the insane
> >> amount of parameters. No strong feelings though.
> >>  
> > 
> > Just a quick drive-by comment about boolean arguments: they ruin the
> > readability of the call sites. In practice, sometimes a single boolean
> > argument can be OK-ish (still poor to read at the call site, but easier
> > to code initially), but once you get past one boolean argument in the
> > function, readability is hopeless:
> > 
> >    foo(ptr, true, false, a == b);
> > 
> > So if you have a choice, I implore you to prefer flags and/or enums. :)
> 
> Thanks for the feedback - I am aware it is very confusing to have booleans
> and especially multiple ones in a func call.
> 
> Just not sure how it maps to what I proposed. I thought of passing as an
> argument reference (pointer) to something similar to the following struct,
> which I think is very self-descriptive:
> 
> struct uffd_op {
> 	/* various fields */
> 	struct vm_area_struct *dst_vma;
> 	unsigned long len;
> 	atomic_t *mmap_changing;
> 
> 	...
> 	
> 	/* ... and some flags */
> 	int wp: 1;
> 	int zero: 1;
> 	int read_likely: 1;
> 
> 	...
> };
> 
> I think that fits what you were asking for. The only thing I am not sure of,
> is whether to include in uffd_op fields that are internal to mm/userfaultfd
> such as “page” and “newly_allocated”. I guess not.

mfill_atomic_install_pte() is called by shmem_mfill_atomic_pte() so it's
not entirely internal to mm/userfaultfd.c.

Another thing is that with all the parameters packed into a struct, the
call sites could become really hairy, so maybe the best way would be to
pack some of the parameters and leave the others.

But you'll never know until you try :)

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC] userfaultfd: introduce UFFDIO_COPY_MODE_YOUNG
  2022-06-15  7:26           ` Mike Rapoport
@ 2022-06-15 15:43             ` Peter Xu
  2022-06-15 16:58               ` Nadav Amit
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Xu @ 2022-06-15 15:43 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Nadav Amit, John Hubbard, David Hildenbrand, Linux MM,
	Mike Kravetz, Hugh Dickins, Andrew Morton, Axel Rasmussen

On Wed, Jun 15, 2022 at 10:26:21AM +0300, Mike Rapoport wrote:
> On Tue, Jun 14, 2022 at 01:56:56PM -0700, Nadav Amit wrote:
> > On Jun 14, 2022, at 1:40 PM, John Hubbard <jhubbard@nvidia.com> wrote:
> > 
> > > On 6/14/22 11:56, Mike Rapoport wrote:
> > >>> But, I cannot take it anymore: the list of arguments for uffd stuff is
> > >>> crazy. I would like to collect all the possible arguments that are used for
> > >>> uffd operation into some “struct uffd_op”.
> > >> Squashing boolean parameters into int flags will also reduce the insane
> > >> amount of parameters. No strong feelings though.
> > >>  
> > > 
> > > Just a quick drive-by comment about boolean arguments: they ruin the
> > > readability of the call sites. In practice, sometimes a single boolean
> > > argument can be OK-ish (still poor to read at the call site, but easier
> > > to code initially), but once you get past one boolean argument in the
> > > function, readability is hopeless:
> > > 
> > >    foo(ptr, true, false, a == b);
> > > 
> > > So if you have a choice, I implore you to prefer flags and/or enums. :)
> > 
> > Thanks for the feedback - I am aware it is very confusing to have booleans
> > and especially multiple ones in a func call.
> > 
> > Just not sure how it maps to what I proposed. I thought of passing as an
> > argument reference (pointer) to something similar to the following struct,
> > which I think is very self-descriptive:
> > 
> > struct uffd_op {
> > 	/* various fields */
> > 	struct vm_area_struct *dst_vma;
> > 	unsigned long len;
> > 	atomic_t *mmap_changing;
> > 
> > 	...
> > 	
> > 	/* ... and some flags */
> > 	int wp: 1;
> > 	int zero: 1;
> > 	int read_likely: 1;
> > 
> > 	...
> > };
> > 
> > I think that fits what you were asking for. The only thing I am not sure of,
> > is whether to include in uffd_op fields that are internal to mm/userfaultfd
> > such as “page” and “newly_allocated”. I guess not.
> 
> mfill_atomic_install_pte() is called by shmem_mfill_atomic_pte() so it's
> not entirely internal to mm/userfaultfd.c.
> 
> Another thing is that with all the parameters packed into a struct, the
> call sites could become really hairy, so maybe the best way would be to
> pack some of the parameters and leave the others.
> 
> But you'll never know until you try :)

Yeh.  Axel packed some booleans in f619147104c8e into mcopy_atomic_mode.
The other option (besides uffd_ops) could be making mcopy_atomic_mode a
bitmask and keep the rest, the mode itself only took 2 bits.

uffd_ops sounds good too if the final outcome looks clean, since we do pass
quite a few things over and over deep into the stack.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC] userfaultfd: introduce UFFDIO_COPY_MODE_YOUNG
  2022-06-15 15:43             ` Peter Xu
@ 2022-06-15 16:58               ` Nadav Amit
  2022-06-15 18:39                 ` Peter Xu
  0 siblings, 1 reply; 18+ messages in thread
From: Nadav Amit @ 2022-06-15 16:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Rapoport, John Hubbard, David Hildenbrand, Linux MM,
	Mike Kravetz, Hugh Dickins, Andrew Morton, Axel Rasmussen

On Jun 15, 2022, at 8:43 AM, Peter Xu <peterx@redhat.com> wrote:

> On Wed, Jun 15, 2022 at 10:26:21AM +0300, Mike Rapoport wrote:
>> On Tue, Jun 14, 2022 at 01:56:56PM -0700, Nadav Amit wrote:
>>> On Jun 14, 2022, at 1:40 PM, John Hubbard <jhubbard@nvidia.com> wrote:
>>> 
>>>> On 6/14/22 11:56, Mike Rapoport wrote:
>>>>>> But, I cannot take it anymore: the list of arguments for uffd stuff is
>>>>>> crazy. I would like to collect all the possible arguments that are used for
>>>>>> uffd operation into some “struct uffd_op”.
>>>>> Squashing boolean parameters into int flags will also reduce the insane
>>>>> amount of parameters. No strong feelings though.
>>>> 
>>>> Just a quick drive-by comment about boolean arguments: they ruin the
>>>> readability of the call sites. In practice, sometimes a single boolean
>>>> argument can be OK-ish (still poor to read at the call site, but easier
>>>> to code initially), but once you get past one boolean argument in the
>>>> function, readability is hopeless:
>>>> 
>>>>   foo(ptr, true, false, a == b);
>>>> 
>>>> So if you have a choice, I implore you to prefer flags and/or enums. :)
>>> 
>>> Thanks for the feedback - I am aware it is very confusing to have booleans
>>> and especially multiple ones in a func call.
>>> 
>>> Just not sure how it maps to what I proposed. I thought of passing as an
>>> argument reference (pointer) to something similar to the following struct,
>>> which I think is very self-descriptive:
>>> 
>>> struct uffd_op {
>>> 	/* various fields */
>>> 	struct vm_area_struct *dst_vma;
>>> 	unsigned long len;
>>> 	atomic_t *mmap_changing;
>>> 
>>> 	...
>>> 	
>>> 	/* ... and some flags */
>>> 	int wp: 1;
>>> 	int zero: 1;
>>> 	int read_likely: 1;
>>> 
>>> 	...
>>> };
>>> 
>>> I think that fits what you were asking for. The only thing I am not sure of,
>>> is whether to include in uffd_op fields that are internal to mm/userfaultfd
>>> such as “page” and “newly_allocated”. I guess not.
>> 
>> mfill_atomic_install_pte() is called by shmem_mfill_atomic_pte() so it's
>> not entirely internal to mm/userfaultfd.c.
>> 
>> Another thing is that with all the parameters packed into a struct, the
>> call sites could become really hairy, so maybe the best way would be to
>> pack some of the parameters and leave the others.
>> 
>> But you'll never know until you try :)
> 
> Yeh.  Axel packed some booleans in f619147104c8e into mcopy_atomic_mode.
> The other option (besides uffd_ops) could be making mcopy_atomic_mode a
> bitmask and keep the rest, the mode itself only took 2 bits.
> 
> uffd_ops sounds good too if the final outcome looks clean, since we do pass
> quite a few things over and over deep into the stack.

Thanks.

I see 3 options:

1. Pack only fs/mm flags: WP, read-likely, write-likely.
2. (1) + as part of the flags internally include Axel’s copy_atomic_mode.
3. The uffd_op approach: include all relevant fields.

For the time being I’m going with (1) since I do not have too much time to
finish all of that and upstream the rest of my work (Broadcom is knocking).

(3) also has the downside of stack-protector that would be added due to
stack-protector strong, which is not-that-bad, but I hate it.

Three more points for consideration in future cleanups:

1. This __always_inline thingy is crazy IMHO. The size of the compilation
unit is almost double because of it, and I saw no explanation for its use in
the commit log (unless I missed it). The overheads in userfaultfd are mostly
due to memory copying, scheduling, IPIs.

2. I think it makes more sense to strive not to have more than 6 arguments
for each function (as supported in registers on x86). For that it is possible
to get rid of dst_mm when it can be retrieved from dst_vma. Anyhow we access
dst_vma->vm_flags which share a cache-line with dst_vma->vm_mm.

3. These BUG_ON()s all around are also ... excessive. I guess they were
introduced before the age in which Linus got angry on each BUG_ON(). Is
there any good reason not to change them into VM_BUG_ON()?



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC] userfaultfd: introduce UFFDIO_COPY_MODE_YOUNG
  2022-06-15 16:58               ` Nadav Amit
@ 2022-06-15 18:39                 ` Peter Xu
  2022-06-15 19:42                   ` Nadav Amit
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Xu @ 2022-06-15 18:39 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Mike Rapoport, John Hubbard, David Hildenbrand, Linux MM,
	Mike Kravetz, Hugh Dickins, Andrew Morton, Axel Rasmussen

On Wed, Jun 15, 2022 at 09:58:27AM -0700, Nadav Amit wrote:
> On Jun 15, 2022, at 8:43 AM, Peter Xu <peterx@redhat.com> wrote:
> 
> > On Wed, Jun 15, 2022 at 10:26:21AM +0300, Mike Rapoport wrote:
> >> On Tue, Jun 14, 2022 at 01:56:56PM -0700, Nadav Amit wrote:
> >>> On Jun 14, 2022, at 1:40 PM, John Hubbard <jhubbard@nvidia.com> wrote:
> >>> 
> >>>> On 6/14/22 11:56, Mike Rapoport wrote:
> >>>>>> But, I cannot take it anymore: the list of arguments for uffd stuff is
> >>>>>> crazy. I would like to collect all the possible arguments that are used for
> >>>>>> uffd operation into some “struct uffd_op”.
> >>>>> Squashing boolean parameters into int flags will also reduce the insane
> >>>>> amount of parameters. No strong feelings though.
> >>>> 
> >>>> Just a quick drive-by comment about boolean arguments: they ruin the
> >>>> readability of the call sites. In practice, sometimes a single boolean
> >>>> argument can be OK-ish (still poor to read at the call site, but easier
> >>>> to code initially), but once you get past one boolean argument in the
> >>>> function, readability is hopeless:
> >>>> 
> >>>>   foo(ptr, true, false, a == b);
> >>>> 
> >>>> So if you have a choice, I implore you to prefer flags and/or enums. :)
> >>> 
> >>> Thanks for the feedback - I am aware it is very confusing to have booleans
> >>> and especially multiple ones in a func call.
> >>> 
> >>> Just not sure how it maps to what I proposed. I thought of passing as an
> >>> argument reference (pointer) to something similar to the following struct,
> >>> which I think is very self-descriptive:
> >>> 
> >>> struct uffd_op {
> >>> 	/* various fields */
> >>> 	struct vm_area_struct *dst_vma;
> >>> 	unsigned long len;
> >>> 	atomic_t *mmap_changing;
> >>> 
> >>> 	...
> >>> 	
> >>> 	/* ... and some flags */
> >>> 	int wp: 1;
> >>> 	int zero: 1;
> >>> 	int read_likely: 1;
> >>> 
> >>> 	...
> >>> };
> >>> 
> >>> I think that fits what you were asking for. The only thing I am not sure of,
> >>> is whether to include in uffd_op fields that are internal to mm/userfaultfd
> >>> such as “page” and “newly_allocated”. I guess not.
> >> 
> >> mfill_atomic_install_pte() is called by shmem_mfill_atomic_pte() so it's
> >> not entirely internal to mm/userfaultfd.c.
> >> 
> >> Another thing is that with all the parameters packed into a struct, the
> >> call sites could become really hairy, so maybe the best way would be to
> >> pack some of the parameters and leave the others.
> >> 
> >> But you'll never know until you try :)
> > 
> > Yeh.  Axel packed some booleans in f619147104c8e into mcopy_atomic_mode.
> > The other option (besides uffd_ops) could be making mcopy_atomic_mode a
> > bitmask and keep the rest, the mode itself only took 2 bits.
> > 
> > uffd_ops sounds good too if the final outcome looks clean, since we do pass
> > quite a few things over and over deep into the stack.
> 
> Thanks.
> 
> I see 3 options:
> 
> 1. Pack only fs/mm flags: WP, read-likely, write-likely.
> 2. (1) + as part of the flags internally include Axel’s copy_atomic_mode.
> 3. The uffd_op approach: include all relevant fields.
> 
> For the time being I’m going with (1) since I do not have too much time to
> finish all of that and upstream the rest of my work (Broadcom is knocking).
> 
> (3) also has the downside of stack-protector that would be added due to
> stack-protector strong, which is not-that-bad, but I hate it.

Any short (but further) explanations?

> 
> 
> Three more points for consideration in future cleanups:
> 
> 1. This __always_inline thingy is crazy IMHO. The size of the compilation
> unit is almost double because of it, and I saw no explanation for its use in
> the commit log (unless I missed it). The overheads in userfaultfd are mostly
> due to memory copying, scheduling, IPIs.

Do you meant __mcopy_atomic()?  I agree it's suspicious.  That's a huge
function with three callers.  Even if to mark it inline I think it's more
proper to mark it at the three callers, or I'd think we can leave that for
the compilers.

> 
> 2. I think it makes more sense to strive not to have more than 6 arguments
> for each function (as supported in registers on x86). For that it is possible
> to get rid of dst_mm when it can be retrieved from dst_vma. Anyhow we access
> dst_vma->vm_flags which share a cache-line with dst_vma->vm_mm.
> 
> 3. These BUG_ON()s all around are also ... excessive. I guess they were
> introduced before the age in which Linus got angry on each BUG_ON(). Is
> there any good reason not to change them into VM_BUG_ON()?

I'm not sure how that was handled, I'd personally think it's fine to keep
them if they're there for ages.  They're gone already if they can be
triggered in any bad ways anyway..  But for sure we'd keep an eye on new
ones when reviewing.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC] userfaultfd: introduce UFFDIO_COPY_MODE_YOUNG
  2022-06-15 18:39                 ` Peter Xu
@ 2022-06-15 19:42                   ` Nadav Amit
  2022-06-15 20:56                     ` Peter Xu
  0 siblings, 1 reply; 18+ messages in thread
From: Nadav Amit @ 2022-06-15 19:42 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Rapoport, John Hubbard, David Hildenbrand, Linux MM,
	Mike Kravetz, Hugh Dickins, Andrew Morton, Axel Rasmussen

On Jun 15, 2022, at 11:39 AM, Peter Xu <peterx@redhat.com> wrote:

> On Wed, Jun 15, 2022 at 09:58:27AM -0700, Nadav Amit wrote:
>> On Jun 15, 2022, at 8:43 AM, Peter Xu <peterx@redhat.com> wrote:
>> 
>>> On Wed, Jun 15, 2022 at 10:26:21AM +0300, Mike Rapoport wrote:
>>>> On Tue, Jun 14, 2022 at 01:56:56PM -0700, Nadav Amit wrote:
>>>>> On Jun 14, 2022, at 1:40 PM, John Hubbard <jhubbard@nvidia.com> wrote:
>>>>> 
>>>>>> On 6/14/22 11:56, Mike Rapoport wrote:
>>>>>>>> But, I cannot take it anymore: the list of arguments for uffd stuff is
>>>>>>>> crazy. I would like to collect all the possible arguments that are used for
>>>>>>>> uffd operation into some “struct uffd_op”.
>>>>>>> Squashing boolean parameters into int flags will also reduce the insane
>>>>>>> amount of parameters. No strong feelings though.
>>>>>> 
>>>>>> Just a quick drive-by comment about boolean arguments: they ruin the
>>>>>> readability of the call sites. In practice, sometimes a single boolean
>>>>>> argument can be OK-ish (still poor to read at the call site, but easier
>>>>>> to code initially), but once you get past one boolean argument in the
>>>>>> function, readability is hopeless:
>>>>>> 
>>>>>>  foo(ptr, true, false, a == b);
>>>>>> 
>>>>>> So if you have a choice, I implore you to prefer flags and/or enums. :)
>>>>> 
>>>>> Thanks for the feedback - I am aware it is very confusing to have booleans
>>>>> and especially multiple ones in a func call.
>>>>> 
>>>>> Just not sure how it maps to what I proposed. I thought of passing as an
>>>>> argument reference (pointer) to something similar to the following struct,
>>>>> which I think is very self-descriptive:
>>>>> 
>>>>> struct uffd_op {
>>>>> 	/* various fields */
>>>>> 	struct vm_area_struct *dst_vma;
>>>>> 	unsigned long len;
>>>>> 	atomic_t *mmap_changing;
>>>>> 
>>>>> 	...
>>>>> 	
>>>>> 	/* ... and some flags */
>>>>> 	int wp: 1;
>>>>> 	int zero: 1;
>>>>> 	int read_likely: 1;
>>>>> 
>>>>> 	...
>>>>> };
>>>>> 
>>>>> I think that fits what you were asking for. The only thing I am not sure of,
>>>>> is whether to include in uffd_op fields that are internal to mm/userfaultfd
>>>>> such as “page” and “newly_allocated”. I guess not.
>>>> 
>>>> mfill_atomic_install_pte() is called by shmem_mfill_atomic_pte() so it's
>>>> not entirely internal to mm/userfaultfd.c.
>>>> 
>>>> Another thing is that with all the parameters packed into a struct, the
>>>> call sites could become really hairy, so maybe the best way would be to
>>>> pack some of the parameters and leave the others.
>>>> 
>>>> But you'll never know until you try :)
>>> 
>>> Yeh.  Axel packed some booleans in f619147104c8e into mcopy_atomic_mode.
>>> The other option (besides uffd_ops) could be making mcopy_atomic_mode a
>>> bitmask and keep the rest, the mode itself only took 2 bits.
>>> 
>>> uffd_ops sounds good too if the final outcome looks clean, since we do pass
>>> quite a few things over and over deep into the stack.
>> 
>> Thanks.
>> 
>> I see 3 options:
>> 
>> 1. Pack only fs/mm flags: WP, read-likely, write-likely.
>> 2. (1) + as part of the flags internally include Axel’s copy_atomic_mode.
>> 3. The uffd_op approach: include all relevant fields.
>> 
>> For the time being I’m going with (1) since I do not have too much time to
>> finish all of that and upstream the rest of my work (Broadcom is knocking).
>> 
>> (3) also has the downside of stack-protector that would be added due to
>> stack-protector strong, which is not-that-bad, but I hate it.
> 
> Any short (but further) explanations?

Sure, but it might not be short.

So one time Mathew Wilcox tried to convert some function arguments into a
struct that would hold the arguments and transfer it as a single argument,
and he found out that the binary size actually increased. Since I like to
waste my time, I analyzed why.

IIRC, there were two main reasons.

The first one is that the kernel by default uses the “strong”
stack-protector. It means that not all functions would have a stack
protector, and actually only quite few would. One main reason that you have
a stack protector is that you provide dereference a local variable. So if
you have a local struct that hold the arguments - you get a stack protector,
and it does introduce slight overhead (~10ns IIRC). There may be some pragma
to prevent the stack protector, but clearly I will be shot if I used it.

The second reason is that the compiler either reloads data from the struct
you use to hold the arguments or might spill it to the stack if you try to
cache it.

Consider the following two scenarios:

A. You access an argument multiple times:

	local1 = args->arg1;
	another_fn();	// Or some store to the heap
	local2 = args->arg1;

	// You use local1 and local2

In this case the compiler would reload args->arg1 from memory, even if there
is a register that holds the value. The compiler is concerned that another_fn()
might have overwritten args->arg1 or - in the case of a store - that the value
was overwritten. The reload might prevent further optimizations of the compiler.

B. You cache the argument locally (aka, you decided to be “smart”):

	arg1 = args->arg1;
	local1 = arg1;
	another_fn();
	local2 = arg1;

You may think that this prevents the reload. But this might even be worse.
The compiler might run out of registers spill arg1 to the stack and then
access it from the stack. Or it might need to spill something else, or
shuffle registers around.

So what can you do? You can mark another_fn() as pure if it is so, which
does help in very certain cases. There are various limitations though. IIRC,
gcc (or is it clang?) ignores it for inline functions. So if you have an
inline function which does some write that you don’t care about you cannot
force the compiler to ignore it.

Note that at least gcc (IIRC) regards inline assembly as something that might
write to arbitrary memory address. So having BUG_ON() would require a reload
of the argument from the struct.

You may say: “what about the restrict keyword?” Well, this is nice in
theory, but it does not always help and the implementation of gcc and clang
are not even the same in this regard. IIRC, in one of them (I forgot which),
the restrict keyword will prevent the reload if instead another_fn() there
was a store, but once you call a different function, that compiler says “all
bets are off, I ignore the restrict” and would reload/spill arg1 (depending
on A or B).

>> Three more points for consideration in future cleanups:
>> 
>> 1. This __always_inline thingy is crazy IMHO. The size of the compilation
>> unit is almost double because of it, and I saw no explanation for its use in
>> the commit log (unless I missed it). The overheads in userfaultfd are mostly
>> due to memory copying, scheduling, IPIs.
> 
> Do you meant __mcopy_atomic()?  I agree it's suspicious.  That's a huge
> function with three callers.  Even if to mark it inline I think it's more
> proper to mark it at the three callers, or I'd think we can leave that for
> the compilers.

Among others. But yes, __mcopy_atomic() is the craziest one.

> 
>> 2. I think it makes more sense to strive not to have more than 6 arguments
>> for each function (as supported in registers on x86). For that it is possible
>> to get rid of dst_mm when it can be retrieved from dst_vma. Anyhow we access
>> dst_vma->vm_flags which share a cache-line with dst_vma->vm_mm.
>> 
>> 3. These BUG_ON()s all around are also ... excessive. I guess they were
>> introduced before the age in which Linus got angry on each BUG_ON(). Is
>> there any good reason not to change them into VM_BUG_ON()?
> 
> I'm not sure how that was handled, I'd personally think it's fine to keep
> them if they're there for ages.  They're gone already if they can be
> triggered in any bad ways anyway..  But for sure we'd keep an eye on new
> ones when reviewing.

Whatever. They might just cause slightly less optimized code to be
generated.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC] userfaultfd: introduce UFFDIO_COPY_MODE_YOUNG
  2022-06-15 19:42                   ` Nadav Amit
@ 2022-06-15 20:56                     ` Peter Xu
  2022-06-16  5:24                       ` Nadav Amit
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Xu @ 2022-06-15 20:56 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Mike Rapoport, John Hubbard, David Hildenbrand, Linux MM,
	Mike Kravetz, Hugh Dickins, Andrew Morton, Axel Rasmussen

On Wed, Jun 15, 2022 at 12:42:13PM -0700, Nadav Amit wrote:
> >> (3) also has the downside of stack-protector that would be added due to
> >> stack-protector strong, which is not-that-bad, but I hate it.
> > 
> > Any short (but further) explanations?
> 
> Sure, but it might not be short.
> 
> So one time Mathew Wilcox tried to convert some function arguments into a
> struct that would hold the arguments and transfer it as a single argument,
> and he found out that the binary size actually increased. Since I like to
> waste my time, I analyzed why.
> 
> IIRC, there were two main reasons.
> 
> The first one is that the kernel by default uses the “strong”
> stack-protector. It means that not all functions would have a stack
> protector, and actually only quite few would. One main reason that you have
> a stack protector is that you provide dereference a local variable. So if
> you have a local struct that hold the arguments - you get a stack protector,
> and it does introduce slight overhead (~10ns IIRC). There may be some pragma
> to prevent the stack protector, but clearly I will be shot if I used it.
> 
> The second reason is that the compiler either reloads data from the struct
> you use to hold the arguments or might spill it to the stack if you try to
> cache it.
> 
> Consider the following two scenarios:
> 
> A. You access an argument multiple times:
> 
> 	local1 = args->arg1;
> 	another_fn();	// Or some store to the heap
> 	local2 = args->arg1;
> 
> 	// You use local1 and local2
> 
> In this case the compiler would reload args->arg1 from memory, even if there
> is a register that holds the value. The compiler is concerned that another_fn()
> might have overwritten args->arg1 or - in the case of a store - that the value
> was overwritten. The reload might prevent further optimizations of the compiler.
> 
> B. You cache the argument locally (aka, you decided to be “smart”):
> 
> 	arg1 = args->arg1;
> 	local1 = arg1;
> 	another_fn();
> 	local2 = arg1;
> 
> You may think that this prevents the reload. But this might even be worse.
> The compiler might run out of registers spill arg1 to the stack and then
> access it from the stack. Or it might need to spill something else, or
> shuffle registers around.
> 
> So what can you do? You can mark another_fn() as pure if it is so, which
> does help in very certain cases. There are various limitations though. IIRC,
> gcc (or is it clang?) ignores it for inline functions. So if you have an
> inline function which does some write that you don’t care about you cannot
> force the compiler to ignore it.
> 
> Note that at least gcc (IIRC) regards inline assembly as something that might
> write to arbitrary memory address. So having BUG_ON() would require a reload
> of the argument from the struct.

Ah, I never knew that side of BUG_ON()..

> 
> You may say: “what about the restrict keyword?” Well, this is nice in
> theory, but it does not always help and the implementation of gcc and clang
> are not even the same in this regard. IIRC, in one of them (I forgot which),
> the restrict keyword will prevent the reload if instead another_fn() there
> was a store, but once you call a different function, that compiler says “all
> bets are off, I ignore the restrict” and would reload/spill arg1 (depending
> on A or B).

Thanks, I understand now.

I did a check-up and my most-used distro (fedora) doesn't really have the
strong stack protector enabled.. centos/rhel has.

It seems to me besides the two points you mentioned above to increase the
obj being generated (and also the performance impact), afaiu it also
shrinks the number of vars to push/pop, so even if the lump-sum would be
similar to before there's still a valid reason to have it since it provides
more safety on stack checks.

But I really am not an expert on this so I'd not be able to make any
appropriate conclusion or provide any useful input..  It's just that it'll
not be a question to answer to uffd code but a more general question, as
using a struct pointer to pass over things are really common, afaict, in
not only mm but the Linux repository as a whole.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC] userfaultfd: introduce UFFDIO_COPY_MODE_YOUNG
  2022-06-15 20:56                     ` Peter Xu
@ 2022-06-16  5:24                       ` Nadav Amit
  0 siblings, 0 replies; 18+ messages in thread
From: Nadav Amit @ 2022-06-16  5:24 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Rapoport, John Hubbard, David Hildenbrand, Linux MM,
	Mike Kravetz, Hugh Dickins, Andrew Morton, Axel Rasmussen

On Jun 15, 2022, at 1:56 PM, Peter Xu <peterx@redhat.com> wrote:

> On Wed, Jun 15, 2022 at 12:42:13PM -0700, Nadav Amit wrote:
>>>> (3) also has the downside of stack-protector that would be added due to
>>>> stack-protector strong, which is not-that-bad, but I hate it.
>>> 
>>> Any short (but further) explanations?
>> 
>> Sure, but it might not be short.
>> 
>> So one time Mathew Wilcox tried to convert some function arguments into a
>> struct that would hold the arguments and transfer it as a single argument,
>> and he found out that the binary size actually increased. Since I like to
>> waste my time, I analyzed why.
>> 
>> IIRC, there were two main reasons.
>> 
>> The first one is that the kernel by default uses the “strong”
>> stack-protector. It means that not all functions would have a stack
>> protector, and actually only quite few would. One main reason that you have
>> a stack protector is that you provide dereference a local variable. So if
>> you have a local struct that hold the arguments - you get a stack protector,
>> and it does introduce slight overhead (~10ns IIRC). There may be some pragma
>> to prevent the stack protector, but clearly I will be shot if I used it.
>> 
>> The second reason is that the compiler either reloads data from the struct
>> you use to hold the arguments or might spill it to the stack if you try to
>> cache it.
>> 
>> Consider the following two scenarios:
>> 
>> A. You access an argument multiple times:
>> 
>> 	local1 = args->arg1;
>> 	another_fn();	// Or some store to the heap
>> 	local2 = args->arg1;
>> 
>> 	// You use local1 and local2
>> 
>> In this case the compiler would reload args->arg1 from memory, even if there
>> is a register that holds the value. The compiler is concerned that another_fn()
>> might have overwritten args->arg1 or - in the case of a store - that the value
>> was overwritten. The reload might prevent further optimizations of the compiler.
>> 
>> B. You cache the argument locally (aka, you decided to be “smart”):
>> 
>> 	arg1 = args->arg1;
>> 	local1 = arg1;
>> 	another_fn();
>> 	local2 = arg1;
>> 
>> You may think that this prevents the reload. But this might even be worse.
>> The compiler might run out of registers spill arg1 to the stack and then
>> access it from the stack. Or it might need to spill something else, or
>> shuffle registers around.
>> 
>> So what can you do? You can mark another_fn() as pure if it is so, which
>> does help in very certain cases. There are various limitations though. IIRC,
>> gcc (or is it clang?) ignores it for inline functions. So if you have an
>> inline function which does some write that you don’t care about you cannot
>> force the compiler to ignore it.
>> 
>> Note that at least gcc (IIRC) regards inline assembly as something that might
>> write to arbitrary memory address. So having BUG_ON() would require a reload
>> of the argument from the struct.
> 
> Ah, I never knew that side of BUG_ON()..

I was wrong about this part. Actually BUG_ON() does not have this effect.

Sorry for misleading you.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2022-06-16  5:24 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-13 20:40 [PATCH RFC] userfaultfd: introduce UFFDIO_COPY_MODE_YOUNG Nadav Amit
2022-06-14 15:22 ` David Hildenbrand
2022-06-14 16:18   ` Nadav Amit
2022-06-14 17:14     ` David Hildenbrand
2022-06-14 18:56     ` Mike Rapoport
2022-06-14 19:25       ` Nadav Amit
2022-06-14 20:40       ` John Hubbard
2022-06-14 20:56         ` Nadav Amit
2022-06-14 21:40           ` John Hubbard
2022-06-14 21:52             ` Nadav Amit
2022-06-14 21:59               ` John Hubbard
2022-06-15  7:26           ` Mike Rapoport
2022-06-15 15:43             ` Peter Xu
2022-06-15 16:58               ` Nadav Amit
2022-06-15 18:39                 ` Peter Xu
2022-06-15 19:42                   ` Nadav Amit
2022-06-15 20:56                     ` Peter Xu
2022-06-16  5:24                       ` Nadav Amit

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).