* [PATCH 0/3] mm/uffd: Fix missing markers on hugetlb
@ 2023-01-04 22:52 Peter Xu
  2023-01-04 22:52 ` [PATCH 1/3] mm/hugetlb: Pre-allocate pgtable pages for uffd wr-protects Peter Xu
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Peter Xu @ 2023-01-04 22:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Mike Kravetz, Muchun Song, peterx, Nadav Amit, Andrea Arcangeli,
	David Hildenbrand, James Houghton, Axel Rasmussen, Andrew Morton

When James was developing the vma split fix for hugetlb pmd sharing, he
found that hugetlb uffd-wp is broken with the test case he developed [1]:

https://lore.kernel.org/r/CADrL8HWSym93=yNpTUdWebOEzUOTR2ffbfUk04XdK6O+PNJNoA@mail.gmail.com

Missing hugetlb pgtable pages caused uffd-wp to lose messages when the vma
split happens to be across a shared huge pmd range in the test.

The issue is that pgtable pre-allocation on the hugetlb path was
overlooked.  That is fixed in patch 1.

Meanwhile there's another issue with properly reporting pgtable allocation
failures during UFFDIO_WRITEPROTECT.  When pgtable allocation fails during
ioctl(UFFDIO_WRITEPROTECT), we silently swallow the error so the user
cannot detect it (even if such failures are extremely rare).  This issue
can happen not only on hugetlb but also on shmem.  Anon is not affected
because anon doesn't require pgtable allocation during wr-protection.
Patch 2 prepares for such a change, then patch 3 allows the error to be
reported to the users.

This set only marks patch 1 for stable, because it's a real bug that needs
fixing for all kernels 5.19+.

Patches 2-3 are an enhancement for handling pgtable allocation errors.
That path should hardly ever be hit; it never triggered in my past tests
even under heavy workloads, but the change makes the interface clearer.
Patches 2-3 are not copied to stable due to that.  I'll prepare a man page
update after patches 2-3 land.
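
For illustration only (a userspace sketch, not part of this set): after
patch 3, a caller that has registered the range with
UFFDIO_REGISTER_MODE_WP could handle the new failure mode like this:

  struct uffdio_writeprotect wp = {
      .range = { .start = (__u64)(uintptr_t)addr, .len = len },
      .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
  };

  if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) == -1) {
      if (errno == ENOMEM)
          /* pgtable allocation failed: retry later, or bail out */;
      else if (errno == EAGAIN)
          /* the address space is changing: retry */;
  }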

Tested with:

  - James's reproducer above [1], which starts to pass together with the
    vma split fix:
    https://lore.kernel.org/r/20230101230042.244286-1-jthoughton@google.com
  - Faked memory pressure to make sure -ENOMEM is returned with both shmem
    and hugetlbfs
  - Some uffd general routines

Peter Xu (3):
  mm/hugetlb: Pre-allocate pgtable pages for uffd wr-protects
  mm/mprotect: Use long for page accountings and retval
  mm/uffd: Detect pgtable allocation failures

 include/linux/hugetlb.h       |  4 +-
 include/linux/mm.h            |  2 +-
 include/linux/userfaultfd_k.h |  2 +-
 mm/hugetlb.c                  | 21 +++++++--
 mm/mempolicy.c                |  4 +-
 mm/mprotect.c                 | 89 ++++++++++++++++++++++-------------
 mm/userfaultfd.c              | 16 +++++--
 7 files changed, 88 insertions(+), 50 deletions(-)

-- 
2.37.3



* [PATCH 1/3] mm/hugetlb: Pre-allocate pgtable pages for uffd wr-protects
  2023-01-04 22:52 [PATCH 0/3] mm/uffd: Fix missing markers on hugetlb Peter Xu
@ 2023-01-04 22:52 ` Peter Xu
  2023-01-05  1:50   ` James Houghton
                     ` (2 more replies)
  2023-01-04 22:52 ` [PATCH 2/3] mm/mprotect: Use long for page accountings and retval Peter Xu
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 21+ messages in thread
From: Peter Xu @ 2023-01-04 22:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Mike Kravetz, Muchun Song, peterx, Nadav Amit, Andrea Arcangeli,
	David Hildenbrand, James Houghton, Axel Rasmussen, Andrew Morton,
	linux-stable

Userfaultfd-wp uses pte markers to mark wr-protected pages for both shmem
and hugetlb.  Shmem has pre-allocation ready for markers, but the hugetlb
path was overlooked.

Fix it by calling huge_pte_alloc() if the initial pgtable walk fails to
find the huge ptep.  It's possible for huge_pte_alloc() to fail under high
memory pressure; in that case, stop the loop immediately and fail silently.
This is not the most ideal solution, but it matches what we do with shmem,
and it avoids the splat in dmesg.

Cc: linux-stable <stable@vger.kernel.org> # 5.19+
Fixes: 60dfaad65aa9 ("mm/hugetlb: allow uffd wr-protect none ptes")
Reported-by: James Houghton <jthoughton@google.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bf7a1f628357..017d9159cddf 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6649,8 +6649,17 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 		spinlock_t *ptl;
 		ptep = hugetlb_walk(vma, address, psize);
 		if (!ptep) {
-			address |= last_addr_mask;
-			continue;
+			if (!uffd_wp) {
+				address |= last_addr_mask;
+				continue;
+			}
+			/*
+			 * Userfaultfd wr-protect requires pgtable
+			 * pre-allocations to install pte markers.
+			 */
+			ptep = huge_pte_alloc(mm, vma, address, psize);
+			if (!ptep)
+				break;
 		}
 		ptl = huge_pte_lock(h, mm, ptep);
 		if (huge_pmd_unshare(mm, vma, address, ptep)) {
-- 
2.37.3



* [PATCH 2/3] mm/mprotect: Use long for page accountings and retval
  2023-01-04 22:52 [PATCH 0/3] mm/uffd: Fix missing markers on hugetlb Peter Xu
  2023-01-04 22:52 ` [PATCH 1/3] mm/hugetlb: Pre-allocate pgtable pages for uffd wr-protects Peter Xu
@ 2023-01-04 22:52 ` Peter Xu
  2023-01-05  1:51   ` James Houghton
                     ` (2 more replies)
  2023-01-04 22:52 ` [PATCH 3/3] mm/uffd: Detect pgtable allocation failures Peter Xu
  2023-01-05  8:16 ` [PATCH 0/3] mm/uffd: Fix missing markers on hugetlb David Hildenbrand
  3 siblings, 3 replies; 21+ messages in thread
From: Peter Xu @ 2023-01-04 22:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Mike Kravetz, Muchun Song, peterx, Nadav Amit, Andrea Arcangeli,
	David Hildenbrand, James Houghton, Axel Rasmussen, Andrew Morton

Switch to type "long" for page accounting and the return value across the
whole procedure of change_protection().

The change halves the maximum page count we can represent (ULONG_MAX / 2
instead of ULONG_MAX), but that cannot overflow on any system because the
maximum possible number of pages touched by change protection is
ULONG_MAX / PAGE_SIZE.

Two reasons to switch from "unsigned long" to "long":

  1. It suits count_vm_numa_events() better, whose 2nd parameter takes a
     long type.

  2. It paves the way for returning negative (error) values in the future.

Currently the only caller that consumes this retval is change_prot_numa(),
where the unsigned long was converted to an int.  While at it, touch up the
numa code to also take a long, so it avoids any possible overflow during
the int-size conversion.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/hugetlb.h |  4 ++--
 include/linux/mm.h      |  2 +-
 mm/hugetlb.c            |  4 ++--
 mm/mempolicy.c          |  2 +-
 mm/mprotect.c           | 26 +++++++++++++-------------
 5 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b6b10101bea7..e3aa336df900 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -248,7 +248,7 @@ void hugetlb_vma_lock_release(struct kref *kref);
 
 int pmd_huge(pmd_t pmd);
 int pud_huge(pud_t pud);
-unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
+long hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long address, unsigned long end, pgprot_t newprot,
 		unsigned long cp_flags);
 
@@ -437,7 +437,7 @@ static inline void move_hugetlb_state(struct folio *old_folio,
 {
 }
 
-static inline unsigned long hugetlb_change_protection(
+static inline long hugetlb_change_protection(
 			struct vm_area_struct *vma, unsigned long address,
 			unsigned long end, pgprot_t newprot,
 			unsigned long cp_flags)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c37f9330f14e..86fe17e6ded7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2132,7 +2132,7 @@ static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma
 }
 bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 			     pte_t pte);
-extern unsigned long change_protection(struct mmu_gather *tlb,
+extern long change_protection(struct mmu_gather *tlb,
 			      struct vm_area_struct *vma, unsigned long start,
 			      unsigned long end, unsigned long cp_flags);
 extern int mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 017d9159cddf..84bc665c7c86 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6613,7 +6613,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	return i ? i : err;
 }
 
-unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
+long hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long address, unsigned long end,
 		pgprot_t newprot, unsigned long cp_flags)
 {
@@ -6622,7 +6622,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	pte_t *ptep;
 	pte_t pte;
 	struct hstate *h = hstate_vma(vma);
-	unsigned long pages = 0, psize = huge_page_size(h);
+	long pages = 0, psize = huge_page_size(h);
 	bool shared_pmd = false;
 	struct mmu_notifier_range range;
 	unsigned long last_addr_mask;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d3558248a0f0..a86b8f15e2f0 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -631,7 +631,7 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
 			unsigned long addr, unsigned long end)
 {
 	struct mmu_gather tlb;
-	int nr_updated;
+	long nr_updated;
 
 	tlb_gather_mmu(&tlb, vma->vm_mm);
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 71358e45a742..0af22ab59ea8 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -80,13 +80,13 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 	return pte_dirty(pte);
 }
 
-static unsigned long change_pte_range(struct mmu_gather *tlb,
+static long change_pte_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
 {
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
-	unsigned long pages = 0;
+	long pages = 0;
 	int target_node = NUMA_NO_NODE;
 	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
@@ -353,13 +353,13 @@ uffd_wp_protect_file(struct vm_area_struct *vma, unsigned long cp_flags)
 		}							\
 	} while (0)
 
-static inline unsigned long change_pmd_range(struct mmu_gather *tlb,
+static inline long change_pmd_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, pud_t *pud, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
 {
 	pmd_t *pmd;
 	unsigned long next;
-	unsigned long pages = 0;
+	long pages = 0;
 	unsigned long nr_huge_updates = 0;
 	struct mmu_notifier_range range;
 
@@ -367,7 +367,7 @@ static inline unsigned long change_pmd_range(struct mmu_gather *tlb,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		unsigned long this_pages;
+		long this_pages;
 
 		next = pmd_addr_end(addr, end);
 
@@ -437,13 +437,13 @@ static inline unsigned long change_pmd_range(struct mmu_gather *tlb,
 	return pages;
 }
 
-static inline unsigned long change_pud_range(struct mmu_gather *tlb,
+static inline long change_pud_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, p4d_t *p4d, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
 {
 	pud_t *pud;
 	unsigned long next;
-	unsigned long pages = 0;
+	long pages = 0;
 
 	pud = pud_offset(p4d, addr);
 	do {
@@ -458,13 +458,13 @@ static inline unsigned long change_pud_range(struct mmu_gather *tlb,
 	return pages;
 }
 
-static inline unsigned long change_p4d_range(struct mmu_gather *tlb,
+static inline long change_p4d_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, pgd_t *pgd, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
 {
 	p4d_t *p4d;
 	unsigned long next;
-	unsigned long pages = 0;
+	long pages = 0;
 
 	p4d = p4d_offset(pgd, addr);
 	do {
@@ -479,14 +479,14 @@ static inline unsigned long change_p4d_range(struct mmu_gather *tlb,
 	return pages;
 }
 
-static unsigned long change_protection_range(struct mmu_gather *tlb,
+static long change_protection_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long pages = 0;
+	long pages = 0;
 
 	BUG_ON(addr >= end);
 	pgd = pgd_offset(mm, addr);
@@ -505,12 +505,12 @@ static unsigned long change_protection_range(struct mmu_gather *tlb,
 	return pages;
 }
 
-unsigned long change_protection(struct mmu_gather *tlb,
+long change_protection(struct mmu_gather *tlb,
 		       struct vm_area_struct *vma, unsigned long start,
 		       unsigned long end, unsigned long cp_flags)
 {
 	pgprot_t newprot = vma->vm_page_prot;
-	unsigned long pages;
+	long pages;
 
 	BUG_ON((cp_flags & MM_CP_UFFD_WP_ALL) == MM_CP_UFFD_WP_ALL);
 
-- 
2.37.3



* [PATCH 3/3] mm/uffd: Detect pgtable allocation failures
  2023-01-04 22:52 [PATCH 0/3] mm/uffd: Fix missing markers on hugetlb Peter Xu
  2023-01-04 22:52 ` [PATCH 1/3] mm/hugetlb: Pre-allocate pgtable pages for uffd wr-protects Peter Xu
  2023-01-04 22:52 ` [PATCH 2/3] mm/mprotect: Use long for page accountings and retval Peter Xu
@ 2023-01-04 22:52 ` Peter Xu
  2023-01-05  1:52   ` James Houghton
                     ` (2 more replies)
  2023-01-05  8:16 ` [PATCH 0/3] mm/uffd: Fix missing markers on hugetlb David Hildenbrand
  3 siblings, 3 replies; 21+ messages in thread
From: Peter Xu @ 2023-01-04 22:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Mike Kravetz, Muchun Song, peterx, Nadav Amit, Andrea Arcangeli,
	David Hildenbrand, James Houghton, Axel Rasmussen, Andrew Morton

Before this patch, when any pgtable allocation failure happens during
change_protection(), the error is ignored by the syscall.  For shmem, an
error is dumped into the host dmesg.  Two issues with that:

  (1) Doing a trace dump when allocation fails is not anything close to
      grace..

  (2) The user should be notified of any such error, so the user can trap
      it and decide what to do next: retry, stop the process properly, or
      anything else.

For userfault users, this changes the API of UFFDIO_WRITEPROTECT when a
pgtable allocation failure happens.  It should not normally break anyone,
though; if it breaks, it breaks in good ways.

A man-page update will be on the way to introduce the new -ENOMEM for
UFFDIO_WRITEPROTECT.  Not marking for stable, so we keep the old behavior
on the 5.19-till-now kernels.

Reported-by: James Houghton <jthoughton@google.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/userfaultfd_k.h |  2 +-
 mm/hugetlb.c                  |  6 ++-
 mm/mempolicy.c                |  2 +-
 mm/mprotect.c                 | 69 +++++++++++++++++++++++------------
 mm/userfaultfd.c              | 16 +++++---
 5 files changed, 62 insertions(+), 33 deletions(-)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 9df0b9a762cc..3767f18114ef 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -73,7 +73,7 @@ extern ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long dst_start,
 extern int mwriteprotect_range(struct mm_struct *dst_mm,
 			       unsigned long start, unsigned long len,
 			       bool enable_wp, atomic_t *mmap_changing);
-extern void uffd_wp_range(struct mm_struct *dst_mm, struct vm_area_struct *vma,
+extern long uffd_wp_range(struct mm_struct *dst_mm, struct vm_area_struct *vma,
 			  unsigned long start, unsigned long len, bool enable_wp);
 
 /* mm helpers */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 84bc665c7c86..d82d97e03eae 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6658,8 +6658,10 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
 			 * pre-allocations to install pte markers.
 			 */
 			ptep = huge_pte_alloc(mm, vma, address, psize);
-			if (!ptep)
+			if (!ptep) {
+				pages = -ENOMEM;
 				break;
+			}
 		}
 		ptl = huge_pte_lock(h, mm, ptep);
 		if (huge_pmd_unshare(mm, vma, address, ptep)) {
@@ -6749,7 +6751,7 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
 	hugetlb_vma_unlock_write(vma);
 	mmu_notifier_invalidate_range_end(&range);
 
-	return pages << h->order;
+	return pages > 0 ? (pages << h->order) : pages;
 }
 
 /* Return true if reservation was successful, false otherwise.  */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index a86b8f15e2f0..85a34f1f3ab8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -636,7 +636,7 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
 	tlb_gather_mmu(&tlb, vma->vm_mm);
 
 	nr_updated = change_protection(&tlb, vma, addr, end, MM_CP_PROT_NUMA);
-	if (nr_updated)
+	if (nr_updated > 0)
 		count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated);
 
 	tlb_finish_mmu(&tlb);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 0af22ab59ea8..ade0d5f85a36 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -330,28 +330,34 @@ uffd_wp_protect_file(struct vm_area_struct *vma, unsigned long cp_flags)
 /*
  * If wr-protecting the range for file-backed, populate pgtable for the case
  * when pgtable is empty but page cache exists.  When {pte|pmd|...}_alloc()
- * failed it means no memory, we don't have a better option but stop.
+ * failed we treat it the same way as pgtable allocation failures during
+ * page faults by kicking OOM and returning error.
  */
 #define  change_pmd_prepare(vma, pmd, cp_flags)				\
-	do {								\
+	({								\
+		long err = 0;						\
 		if (unlikely(uffd_wp_protect_file(vma, cp_flags))) {	\
-			if (WARN_ON_ONCE(pte_alloc(vma->vm_mm, pmd)))	\
-				break;					\
+			if (pte_alloc(vma->vm_mm, pmd))			\
+				err = -ENOMEM;				\
 		}							\
-	} while (0)
+		err;							\
+	})
+
 /*
  * This is the general pud/p4d/pgd version of change_pmd_prepare(). We need to
  * have separate change_pmd_prepare() because pte_alloc() returns 0 on success,
  * while {pmd|pud|p4d}_alloc() returns the valid pointer on success.
  */
 #define  change_prepare(vma, high, low, addr, cp_flags)			\
-	do {								\
-		if (unlikely(uffd_wp_protect_file(vma, cp_flags))) {	\
-			low##_t *p = low##_alloc(vma->vm_mm, high, addr); \
-			if (WARN_ON_ONCE(p == NULL))			\
-				break;					\
-		}							\
-	} while (0)
+	  ({								\
+		  long err = 0;						\
+		  if (unlikely(uffd_wp_protect_file(vma, cp_flags))) {	\
+			  low##_t *p = low##_alloc(vma->vm_mm, high, addr); \
+			  if (p == NULL)				\
+				  err = -ENOMEM;			\
+		  }							\
+		  err;							\
+	  })
 
 static inline long change_pmd_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, pud_t *pud, unsigned long addr,
@@ -367,11 +373,15 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		long this_pages;
+		long ret;
 
 		next = pmd_addr_end(addr, end);
 
-		change_pmd_prepare(vma, pmd, cp_flags);
+		ret = change_pmd_prepare(vma, pmd, cp_flags);
+		if (ret) {
+			pages = ret;
+			break;
+		}
 		/*
 		 * Automatic NUMA balancing walks the tables with mmap_lock
 		 * held for read. It's possible a parallel update to occur
@@ -401,7 +411,11 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 				 * cleared; make sure pmd populated if
 				 * necessary, then fall-through to pte level.
 				 */
-				change_pmd_prepare(vma, pmd, cp_flags);
+				ret = change_pmd_prepare(vma, pmd, cp_flags);
+				if (ret) {
+					pages = ret;
+					break;
+				}
 			} else {
 				/*
 				 * change_huge_pmd() does not defer TLB flushes,
@@ -422,9 +436,8 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 			}
 			/* fall through, the trans huge pmd just split */
 		}
-		this_pages = change_pte_range(tlb, vma, pmd, addr, next,
-					      newprot, cp_flags);
-		pages += this_pages;
+		pages += change_pte_range(tlb, vma, pmd, addr, next,
+					  newprot, cp_flags);
 next:
 		cond_resched();
 	} while (pmd++, addr = next, addr != end);
@@ -443,12 +456,14 @@ static inline long change_pud_range(struct mmu_gather *tlb,
 {
 	pud_t *pud;
 	unsigned long next;
-	long pages = 0;
+	long pages = 0, ret;
 
 	pud = pud_offset(p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
-		change_prepare(vma, pud, pmd, addr, cp_flags);
+		ret = change_prepare(vma, pud, pmd, addr, cp_flags);
+		if (ret)
+			return ret;
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		pages += change_pmd_range(tlb, vma, pud, addr, next, newprot,
@@ -464,12 +479,14 @@ static inline long change_p4d_range(struct mmu_gather *tlb,
 {
 	p4d_t *p4d;
 	unsigned long next;
-	long pages = 0;
+	long pages = 0, ret;
 
 	p4d = p4d_offset(pgd, addr);
 	do {
 		next = p4d_addr_end(addr, end);
-		change_prepare(vma, p4d, pud, addr, cp_flags);
+		ret = change_prepare(vma, p4d, pud, addr, cp_flags);
+		if (ret)
+			return ret;
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		pages += change_pud_range(tlb, vma, p4d, addr, next, newprot,
@@ -486,14 +503,18 @@ static long change_protection_range(struct mmu_gather *tlb,
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
 	unsigned long next;
-	long pages = 0;
+	long pages = 0, ret;
 
 	BUG_ON(addr >= end);
 	pgd = pgd_offset(mm, addr);
 	tlb_start_vma(tlb, vma);
 	do {
 		next = pgd_addr_end(addr, end);
-		change_prepare(vma, pgd, p4d, addr, cp_flags);
+		ret = change_prepare(vma, pgd, p4d, addr, cp_flags);
+		if (ret) {
+			pages = ret;
+			break;
+		}
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		pages += change_p4d_range(tlb, vma, pgd, addr, next, newprot,
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 65ad172add27..53c3d916ff66 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -710,11 +710,12 @@ ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long start,
 			      mmap_changing, 0);
 }
 
-void uffd_wp_range(struct mm_struct *dst_mm, struct vm_area_struct *dst_vma,
+long uffd_wp_range(struct mm_struct *dst_mm, struct vm_area_struct *dst_vma,
 		   unsigned long start, unsigned long len, bool enable_wp)
 {
 	unsigned int mm_cp_flags;
 	struct mmu_gather tlb;
+	long ret;
 
 	if (enable_wp)
 		mm_cp_flags = MM_CP_UFFD_WP;
@@ -730,8 +731,10 @@ void uffd_wp_range(struct mm_struct *dst_mm, struct vm_area_struct *dst_vma,
 	if (!enable_wp && vma_wants_manual_pte_write_upgrade(dst_vma))
 		mm_cp_flags |= MM_CP_TRY_CHANGE_WRITABLE;
 	tlb_gather_mmu(&tlb, dst_mm);
-	change_protection(&tlb, dst_vma, start, start + len, mm_cp_flags);
+	ret = change_protection(&tlb, dst_vma, start, start + len, mm_cp_flags);
 	tlb_finish_mmu(&tlb);
+
+	return ret;
 }
 
 int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
@@ -740,7 +743,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 {
 	struct vm_area_struct *dst_vma;
 	unsigned long page_mask;
-	int err;
+	long err;
 
 	/*
 	 * Sanitize the command parameters:
@@ -779,9 +782,12 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 			goto out_unlock;
 	}
 
-	uffd_wp_range(dst_mm, dst_vma, start, len, enable_wp);
+	err = uffd_wp_range(dst_mm, dst_vma, start, len, enable_wp);
+
+	/* Return 0 on success, <0 on failures */
+	if (err > 0)
+		err = 0;
 
-	err = 0;
 out_unlock:
 	mmap_read_unlock(dst_mm);
 	return err;
-- 
2.37.3



* Re: [PATCH 1/3] mm/hugetlb: Pre-allocate pgtable pages for uffd wr-protects
  2023-01-04 22:52 ` [PATCH 1/3] mm/hugetlb: Pre-allocate pgtable pages for uffd wr-protects Peter Xu
@ 2023-01-05  1:50   ` James Houghton
  2023-01-05  8:39   ` David Hildenbrand
  2023-01-05 18:37   ` Mike Kravetz
  2 siblings, 0 replies; 21+ messages in thread
From: James Houghton @ 2023-01-05  1:50 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Mike Kravetz, Muchun Song, Nadav Amit,
	Andrea Arcangeli, David Hildenbrand, Axel Rasmussen,
	Andrew Morton, linux-stable

On Wed, Jan 4, 2023 at 10:52 PM Peter Xu <peterx@redhat.com> wrote:
>
> Userfaultfd-wp uses pte markers to mark wr-protected pages for both shmem
> and hugetlb.  Shmem has pre-allocation ready for markers, but hugetlb path
> was overlooked.
>
> Doing so by calling huge_pte_alloc() if the initial pgtable walk fails to
> find the huge ptep.  It's possible that huge_pte_alloc() can fail with high
> memory pressure, in that case stop the loop immediately and fail silently.
> This is not the most ideal solution but it matches with what we do with
> shmem meanwhile it avoids the splat in dmesg.
>
> Cc: linux-stable <stable@vger.kernel.org> # 5.19+
> Fixes: 60dfaad65aa9 ("mm/hugetlb: allow uffd wr-protect none ptes")
> Reported-by: James Houghton <jthoughton@google.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/hugetlb.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)

Acked-by: James Houghton <jthoughton@google.com>


* Re: [PATCH 2/3] mm/mprotect: Use long for page accountings and retval
  2023-01-04 22:52 ` [PATCH 2/3] mm/mprotect: Use long for page accountings and retval Peter Xu
@ 2023-01-05  1:51   ` James Houghton
  2023-01-05  8:44   ` David Hildenbrand
  2023-01-05 18:48   ` Mike Kravetz
  2 siblings, 0 replies; 21+ messages in thread
From: James Houghton @ 2023-01-05  1:51 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Mike Kravetz, Muchun Song, Nadav Amit,
	Andrea Arcangeli, David Hildenbrand, Axel Rasmussen,
	Andrew Morton

On Wed, Jan 4, 2023 at 10:52 PM Peter Xu <peterx@redhat.com> wrote:
>
> Switch to use type "long" for page accountings and retval across the whole
> procedure of change_protection().
>
> The change should have shrinked the possible maximum page number to be half
> comparing to previous (ULONG_MAX / 2), but it shouldn't overflow on any
> system either because the maximum possible pages touched by change
> protection should be ULONG_MAX / PAGE_SIZE.
>
> Two reasons to switch from "unsigned long" to "long":
>
>   1. It suites better on count_vm_numa_events(), whose 2nd parameter takes
>      a long type.
>
>   2. It paves way for returning negative (error) values in the future.
>
> Currently the only caller that consumes this retval is change_prot_numa(),
> where the unsigned long was converted to an int.  Since at it, touching up
> the numa code to also take a long, so it'll avoid any possible overflow too
> during the int-size convertion.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/hugetlb.h |  4 ++--
>  include/linux/mm.h      |  2 +-
>  mm/hugetlb.c            |  4 ++--
>  mm/mempolicy.c          |  2 +-
>  mm/mprotect.c           | 26 +++++++++++++-------------
>  5 files changed, 19 insertions(+), 19 deletions(-)

Acked-by: James Houghton <jthoughton@google.com>


* Re: [PATCH 3/3] mm/uffd: Detect pgtable allocation failures
  2023-01-04 22:52 ` [PATCH 3/3] mm/uffd: Detect pgtable allocation failures Peter Xu
@ 2023-01-05  1:52   ` James Houghton
  2023-01-05  3:10   ` Nadav Amit
  2023-01-05  8:47   ` David Hildenbrand
  2 siblings, 0 replies; 21+ messages in thread
From: James Houghton @ 2023-01-05  1:52 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Mike Kravetz, Muchun Song, Nadav Amit,
	Andrea Arcangeli, David Hildenbrand, Axel Rasmussen,
	Andrew Morton

On Wed, Jan 4, 2023 at 10:52 PM Peter Xu <peterx@redhat.com> wrote:
>
> Before this patch, when there's any pgtable allocation issues happened
> during change_protection(), the error will be ignored from the syscall.
> For shmem, there will be an error dumped into the host dmesg.  Two issues
> with that:
>
>   (1) Doing a trace dump when allocation fails is not anything close to
>       grace..
>
>   (2) The user should be notified with any kind of such error, so the user
>       can trap it and decide what to do next, either by retrying, or stop
>       the process properly, or anything else.
>
> For userfault users, this will change the API of UFFDIO_WRITEPROTECT when
> pgtable allocation failure happened.  It should not normally break anyone,
> though.  If it breaks, then in good ways.
>
> One man-page update will be on the way to introduce the new -ENOMEM for
> UFFDIO_WRITEPROTECT.  Not marking stable so we keep the old behavior on the
> 5.19-till-now kernels.
>
> Reported-by: James Houghton <jthoughton@google.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Acked-by: James Houghton <jthoughton@google.com>

Thanks Peter! :)


* Re: [PATCH 3/3] mm/uffd: Detect pgtable allocation failures
  2023-01-04 22:52 ` [PATCH 3/3] mm/uffd: Detect pgtable allocation failures Peter Xu
  2023-01-05  1:52   ` James Houghton
@ 2023-01-05  3:10   ` Nadav Amit
  2023-01-05  8:59     ` David Hildenbrand
  2023-01-05  8:47   ` David Hildenbrand
  2 siblings, 1 reply; 21+ messages in thread
From: Nadav Amit @ 2023-01-05  3:10 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux-MM, kernel list, Mike Kravetz, Muchun Song,
	Andrea Arcangeli, David Hildenbrand, James Houghton,
	Axel Rasmussen, Andrew Morton


> On Jan 4, 2023, at 2:52 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> Before this patch, when there's any pgtable allocation issues happened
> during change_protection(), the error will be ignored from the syscall.
> For shmem, there will be an error dumped into the host dmesg.  Two issues
> with that:
> 
>  (1) Doing a trace dump when allocation fails is not anything close to
>      grace..
> 
>  (2) The user should be notified with any kind of such error, so the user
>      can trap it and decide what to do next, either by retrying, or stop
>      the process properly, or anything else.
> 
> For userfault users, this will change the API of UFFDIO_WRITEPROTECT when
> pgtable allocation failure happened.  It should not normally break anyone,
> though.  If it breaks, then in good ways.
> 
> One man-page update will be on the way to introduce the new -ENOMEM for
> UFFDIO_WRITEPROTECT.  Not marking stable so we keep the old behavior on the
> 5.19-till-now kernels.

I understand that the assumption is that change_protection() should either
fully succeed or fail entirely, and I guess this is the current behavior.

However, to be more “future-proof” perhaps this needs to be revisited.

For instance, UFFDIO_WRITEPROTECT can benefit from the ability to (based on
userspace request) prevent write-protection of pages that are pinned. This is
necessary to allow userspace uffd monitor to avoid write-protection of
O_DIRECT’d memory, for instance, that might change even if a uffd monitor
considers it write-protected.

In such a case, a “partial failure” is possible, since only part of the memory
was write-protected. The uffd monitor should be allowed to continue
execution, but it has to know the part of the memory that was successfully
write-protected. 

To support “partial failure”, the kernel should return to
UFFDIO_WRITEPROTECT users the number of pages/bytes that were not
successfully write-protected, unless no memory was successfully
write-protected at all.  (Unlike NUMA, pages that were skipped should be
accounted as “successfully write-protected”.)
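
Just to illustrate (a hypothetical ABI sketch; the "bytes_skipped" field
is made up for this discussion):

  struct uffdio_writeprotect wp = {
      .range = { .start = start, .len = len },
      .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
  };

  ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
  /*
   * A hypothetical wp.bytes_skipped output field would tell the uffd
   * monitor how much of the range was not write-protected, so it can
   * treat only [start, start + len - wp.bytes_skipped) as protected.
   */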

I am only raising this subject to avoid multiple API changes.



* Re: [PATCH 0/3] mm/uffd: Fix missing markers on hugetlb
  2023-01-04 22:52 [PATCH 0/3] mm/uffd: Fix missing markers on hugetlb Peter Xu
                   ` (2 preceding siblings ...)
  2023-01-04 22:52 ` [PATCH 3/3] mm/uffd: Detect pgtable allocation failures Peter Xu
@ 2023-01-05  8:16 ` David Hildenbrand
  3 siblings, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2023-01-05  8:16 UTC (permalink / raw)
  To: Peter Xu, linux-mm, linux-kernel
  Cc: Mike Kravetz, Muchun Song, Nadav Amit, Andrea Arcangeli,
	James Houghton, Axel Rasmussen, Andrew Morton

On 04.01.23 23:52, Peter Xu wrote:
> When James was developing the vma split fix for hugetlb pmd sharing, he
> found that hugetlb uffd-wp is broken with the test case he developed [1]:
> 
> https://lore.kernel.org/r/CADrL8HWSym93=yNpTUdWebOEzUOTR2ffbfUk04XdK6O+PNJNoA@mail.gmail.com
> 
> Missing hugetlb pgtable pages caused uffd-wp to lose message when vma split
> happens to be across a shared huge pmd range in the test.
> 
> The issue is pgtable pre-allocation on hugetlb path was overlooked.  That
> was fixed in patch 1.

Nice timing, I stumbled over that while adjusting background snapshot 
code in QEMU and wondered why we are not allocating page tables in that 
case -- and wanted to ask you why :)

-- 
Thanks,

David / dhildenb



* Re: [PATCH 1/3] mm/hugetlb: Pre-allocate pgtable pages for uffd wr-protects
  2023-01-04 22:52 ` [PATCH 1/3] mm/hugetlb: Pre-allocate pgtable pages for uffd wr-protects Peter Xu
  2023-01-05  1:50   ` James Houghton
@ 2023-01-05  8:39   ` David Hildenbrand
  2023-01-05 18:37   ` Mike Kravetz
  2 siblings, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2023-01-05  8:39 UTC (permalink / raw)
  To: Peter Xu, linux-mm, linux-kernel
  Cc: Mike Kravetz, Muchun Song, Nadav Amit, Andrea Arcangeli,
	James Houghton, Axel Rasmussen, Andrew Morton, linux-stable

On 04.01.23 23:52, Peter Xu wrote:
> Userfaultfd-wp uses pte markers to mark wr-protected pages for both shmem
> and hugetlb.  Shmem has pre-allocation ready for markers, but hugetlb path
> was overlooked.
> 
> Doing so by calling huge_pte_alloc() if the initial pgtable walk fails to
> find the huge ptep.  It's possible that huge_pte_alloc() can fail with high
> memory pressure, in that case stop the loop immediately and fail silently.
> This is not the most ideal solution but it matches with what we do with
> shmem meanwhile it avoids the splat in dmesg.
> 
> Cc: linux-stable <stable@vger.kernel.org> # 5.19+
> Fixes: 60dfaad65aa9 ("mm/hugetlb: allow uffd wr-protect none ptes")
> Reported-by: James Houghton <jthoughton@google.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   mm/hugetlb.c | 13 +++++++++++--
>   1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index bf7a1f628357..017d9159cddf 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6649,8 +6649,17 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>   		spinlock_t *ptl;
>   		ptep = hugetlb_walk(vma, address, psize);

if (!ptep && likely(!uffd_wp)) {
	/* Nothing to protect. */
	address |= last_addr_mask;
	continue;
} else if (!ptep) {
	...
}

Might look slightly more readable and would minimize changes.  This should 
work, so

Acked-by: David Hildenbrand <david@redhat.com>

>   		if (!ptep) {
> -			address |= last_addr_mask;
> -			continue;
> +			if (!uffd_wp) {
> +				address |= last_addr_mask;
> +				continue;
> +			}
> +			/*
> +			 * Userfaultfd wr-protect requires pgtable
> +			 * pre-allocations to install pte markers.
> +			 */
> +			ptep = huge_pte_alloc(mm, vma, address, psize);
> +			if (!ptep)
> +				break;
>   		}
>   		ptl = huge_pte_lock(h, mm, ptep);
>   		if (huge_pmd_unshare(mm, vma, address, ptep)) {

-- 
Thanks,

David / dhildenb



* Re: [PATCH 2/3] mm/mprotect: Use long for page accountings and retval
  2023-01-04 22:52 ` [PATCH 2/3] mm/mprotect: Use long for page accountings and retval Peter Xu
  2023-01-05  1:51   ` James Houghton
@ 2023-01-05  8:44   ` David Hildenbrand
  2023-01-05 19:22     ` Peter Xu
  2023-01-05 18:48   ` Mike Kravetz
  2 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2023-01-05  8:44 UTC (permalink / raw)
  To: Peter Xu, linux-mm, linux-kernel
  Cc: Mike Kravetz, Muchun Song, Nadav Amit, Andrea Arcangeli,
	James Houghton, Axel Rasmussen, Andrew Morton

On 04.01.23 23:52, Peter Xu wrote:
> Switch to use type "long" for page accountings and retval across the whole
> procedure of change_protection().
> 
> The change should have shrinked the possible maximum page number to be half
> comparing to previous (ULONG_MAX / 2), but it shouldn't overflow on any
> system either because the maximum possible pages touched by change
> protection should be ULONG_MAX / PAGE_SIZE.

Yeah, highly unlikely.

> 
> Two reasons to switch from "unsigned long" to "long":
> 
>    1. It suites better on count_vm_numa_events(), whose 2nd parameter takes
>       a long type.
> 
>    2. It paves way for returning negative (error) values in the future.
> 
> Currently the only caller that consumes this retval is change_prot_numa(),
> where the unsigned long was converted to an int.  Since at it, touching up
> the numa code to also take a long, so it'll avoid any possible overflow too
> during the int-size convertion.

I'm wondering if we should just return the number of changed pages via a 
separate pointer and later use an int for returning errors -- since we're 
touching this interface already.

Only callers actually interested in the number of pages would pass a 
pointer to an unsigned long (NUMA).

And code that expects that there never ever are failures (mprotect, 
NUMA) could simply check for WARN_ON_ONCE(ret).
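
Roughly what I have in mind (untested; signature invented just for this 
discussion):

int change_protection(struct mmu_gather *tlb, struct vm_area_struct *vma,
		      unsigned long start, unsigned long end,
		      unsigned long cp_flags, unsigned long *pages);

/* NUMA passes a pointer because it consumes the count: */
ret = change_protection(&tlb, vma, addr, end, MM_CP_PROT_NUMA, &nr_updated);

/* mprotect ignores the count and never expects failures: */
WARN_ON_ONCE(change_protection(&tlb, vma, start, end, cp_flags, NULL));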

I assume you evaluated that option as well, what was your conclusion?

-- 
Thanks,

David / dhildenb



* Re: [PATCH 3/3] mm/uffd: Detect pgtable allocation failures
  2023-01-04 22:52 ` [PATCH 3/3] mm/uffd: Detect pgtable allocation failures Peter Xu
  2023-01-05  1:52   ` James Houghton
  2023-01-05  3:10   ` Nadav Amit
@ 2023-01-05  8:47   ` David Hildenbrand
  2 siblings, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2023-01-05  8:47 UTC (permalink / raw)
  To: Peter Xu, linux-mm, linux-kernel
  Cc: Mike Kravetz, Muchun Song, Nadav Amit, Andrea Arcangeli,
	James Houghton, Axel Rasmussen, Andrew Morton

On 04.01.23 23:52, Peter Xu wrote:
> Before this patch, when there's any pgtable allocation issues happened
> during change_protection(), the error will be ignored from the syscall.
> For shmem, there will be an error dumped into the host dmesg.  Two issues
> with that:
> 
>    (1) Doing a trace dump when allocation fails is not anything close to
>        grace..

s/..//

> 
>    (2) The user should be notified with any kind of such error, so the user
>        can trap it and decide what to do next, either by retrying, or stop
>        the process properly, or anything else.
> 
> For userfault users, this will change the API of UFFDIO_WRITEPROTECT when
> pgtable allocation failure happened.  It should not normally break anyone,
> though.  If it breaks, then in good ways.
> 
> One man-page update will be on the way to introduce the new -ENOMEM for
> UFFDIO_WRITEPROTECT.  Not marking stable so we keep the old behavior on the
> 5.19-till-now kernels.

We'd now fail after already having modified some state (protected some 
PTEs). I assume that can already happen when protecting across multiple 
VMAs and is expected, right?

-- 
Thanks,

David / dhildenb



* Re: [PATCH 3/3] mm/uffd: Detect pgtable allocation failures
  2023-01-05  3:10   ` Nadav Amit
@ 2023-01-05  8:59     ` David Hildenbrand
  2023-01-05 18:01       ` Nadav Amit
  0 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2023-01-05  8:59 UTC (permalink / raw)
  To: Nadav Amit, Peter Xu
  Cc: Linux-MM, kernel list, Mike Kravetz, Muchun Song,
	Andrea Arcangeli, James Houghton, Axel Rasmussen, Andrew Morton

On 05.01.23 04:10, Nadav Amit wrote:
> 
>> On Jan 4, 2023, at 2:52 PM, Peter Xu <peterx@redhat.com> wrote:
>>
>> Before this patch, when there's any pgtable allocation issues happened
>> during change_protection(), the error will be ignored from the syscall.
>> For shmem, there will be an error dumped into the host dmesg.  Two issues
>> with that:
>>
>>   (1) Doing a trace dump when allocation fails is not anything close to
>>       grace..
>>
>>   (2) The user should be notified with any kind of such error, so the user
>>       can trap it and decide what to do next, either by retrying, or stop
>>       the process properly, or anything else.
>>
>> For userfault users, this will change the API of UFFDIO_WRITEPROTECT when
>> pgtable allocation failure happened.  It should not normally break anyone,
>> though.  If it breaks, then in good ways.
>>
>> One man-page update will be on the way to introduce the new -ENOMEM for
>> UFFDIO_WRITEPROTECT.  Not marking stable so we keep the old behavior on the
>> 5.19-till-now kernels.
> 
> I understand that the current assumption is that change_protection() should
> fully succeed or fail, and I guess this is the current behavior.
> 
> However, to be more “future-proof” perhaps this needs to be revisited.
> 
> For instance, UFFDIO_WRITEPROTECT can benefit from the ability to (based on
> userspace request) prevent write-protection of pages that are pinned. This is
> necessary to allow userspace uffd monitor to avoid write-protection of
> O_DIRECT’d memory, for instance, that might change even if a uffd monitor
> considers it write-protected.

Just a note that this is pretty tricky IMHO, because:

a) We cannot distinguished "pinned readable" from "pinned writable"
b) We can have false positives ("pinned") even for compound pages due to
    concurrent GUP-fast.
c) Synchronizing against GUP-fast is pretty tricky ... as we learned.
    Concurrent pinning is usually problematic.
d) O_DIRECT still uses FOLL_GET and we cannot identify that. (at least
    that should be figured out at one point)
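
To spell out why any such check stays a heuristic -- just a sketch, not 
something I'd propose merging:

/*
 * page_maybe_dma_pinned() can only ever say "maybe": it is based on
 * a refcount heuristic, so concurrent GUP-fast can make it report
 * true for pages that are not pinned at all, and O_DIRECT going
 * through FOLL_GET is invisible to it in the first place.
 */
if (page_maybe_dma_pinned(page)) {
	/* Skipping wr-protection here may also skip unpinned pages. */
}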

I have a patch lying around for a very long time that removes that 
special-pinned handling from softdirty code, because of the above 
reasons (and because it forgets THP). For now I didn't send it because 
for softdirty, it's acceptable to over-indicate and it hasn't been 
reported to be an actual problem so far.

For UFFDIO_WRITEPROTECT, however, it might be very harmful (especially 
for existing users) to get false protection errors.  Failing due to 
ENOMEM is different from failing due to some temporary concurrency issue.

Having said that, I started thinking about alternative ways of detecting 
that in the past, without much outcome so far: the latest idea was 
indicating "this MM has had pinned pages at one point, be careful because 
any techniques that use write-protection (softdirty, mprotect, uffd-wp) 
won't be able to catch writes via pinned pages reliably".

Hm.

-- 
Thanks,

David / dhildenb



* Re: [PATCH 3/3] mm/uffd: Detect pgtable allocation failures
  2023-01-05  8:59     ` David Hildenbrand
@ 2023-01-05 18:01       ` Nadav Amit
  2023-01-05 19:51         ` Peter Xu
  2023-01-09  8:36         ` David Hildenbrand
  0 siblings, 2 replies; 21+ messages in thread
From: Nadav Amit @ 2023-01-05 18:01 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, Linux-MM, kernel list, Mike Kravetz, Muchun Song,
	Andrea Arcangeli, James Houghton, Axel Rasmussen, Andrew Morton



> On Jan 5, 2023, at 12:59 AM, David Hildenbrand <david@redhat.com> wrote:
> 
> On 05.01.23 04:10, Nadav Amit wrote:
>>> On Jan 4, 2023, at 2:52 PM, Peter Xu <peterx@redhat.com> wrote:
>>> 
>>> Before this patch, when there's any pgtable allocation issues happened
>>> during change_protection(), the error will be ignored from the syscall.
>>> For shmem, there will be an error dumped into the host dmesg.  Two issues
>>> with that:
>>> 
>>>  (1) Doing a trace dump when allocation fails is not anything close to
>>>      grace..
>>> 
>>>  (2) The user should be notified with any kind of such error, so the user
>>>      can trap it and decide what to do next, either by retrying, or stop
>>>      the process properly, or anything else.
>>> 
>>> For userfault users, this will change the API of UFFDIO_WRITEPROTECT when
>>> pgtable allocation failure happened.  It should not normally break anyone,
>>> though.  If it breaks, then in good ways.
>>> 
>>> One man-page update will be on the way to introduce the new -ENOMEM for
>>> UFFDIO_WRITEPROTECT.  Not marking stable so we keep the old behavior on the
>>> 5.19-till-now kernels.
>> I understand that the current assumption is that change_protection() should
>> fully succeed or fail, and I guess this is the current behavior.
>> However, to be more “future-proof” perhaps this needs to be revisited.
>> For instance, UFFDIO_WRITEPROTECT can benefit from the ability to (based on
>> userspace request) prevent write-protection of pages that are pinned. This is
>> necessary to allow userspace uffd monitor to avoid write-protection of
>> O_DIRECT’d memory, for instance, that might change even if a uffd monitor
>> considers it write-protected.
> 
> Just a note that this is pretty tricky IMHO, because:
> 
> a) We cannot distinguished "pinned readable" from "pinned writable"
> b) We can have false positives ("pinned") even for compound pages due to
>   concurrent GUP-fast.
> c) Synchronizing against GUP-fast is pretty tricky ... as we learned.
>   Concurrent pinning is usually problematic.
> d) O_DIRECT still uses FOLL_GET and we cannot identify that. (at least
>   that should be figured out at one point)

My prototype used the page-count IIRC, so it had false positives (but
addressed O_DIRECT).  And yes, precise refinement is complicated.  However,
if you need to uffd-wp memory, then without such a mechanism you need to
ensure no kernel/DMA write to these pages is possible.  The only other
option I can think of is interposing/seccomp on a variety of syscalls,
to prevent uffd-wp of such memory.

> 
> I have a patch lying around for a very long time that removes that special-pinned handling from softdirty code, because of the above reasons (and because it forgets THP). For now I didn't send it because for softdirty, it's acceptable to over-indicate and it hasn't been reported to be an actual problem so far.
> 
> For existing UFFDIO_WRITEPROTECT users, however, it might be very harmful (especially for existing users) to get false protection errors. Failing due to ENOMEM is different to failing due to some temporary concurrency issues.

Yes, I propose it as an optional flag for UFFD-WP. Anyhow, I believe
the UFFD-WP as implemented now is not efficient and should’ve been
vectored to allow one TLB shootdown for many non-consecutive pages. 
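
Something along these lines (a made-up ABI sketch just to show the idea,
not a concrete proposal):

	struct uffdio_writeprotect_vec {
		__u64 mode;
		__u64 nr_ranges;
		struct uffdio_range ranges[];	/* non-consecutive ranges */
	};

One ioctl could then protect all the ranges and issue a single TLB
shootdown at the end, instead of flushing per UFFDIO_WRITEPROTECT call.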

> 
> Having that said, I started thinking about alternative ways of detecting that in that past, without much outcome so far: that latest idea was indicating "this MM has had pinned pages at one point, be careful because any techniques that use write-protection (softdirty, mprotect, uffd-wp) won't be able to catch writes via pinned pages reliably".

I am not sure what the best way is to reliably detect that a page is
write-pinned.  My point was that if a change is already being carried to
write-protect mechanisms, then this issue should be considered, because
otherwise many use-cases of uffd-wp would encounter implementation
issues.

I will not “kill” myself over it now, but I think it is worth considering.



* Re: [PATCH 1/3] mm/hugetlb: Pre-allocate pgtable pages for uffd wr-protects
  2023-01-04 22:52 ` [PATCH 1/3] mm/hugetlb: Pre-allocate pgtable pages for uffd wr-protects Peter Xu
  2023-01-05  1:50   ` James Houghton
  2023-01-05  8:39   ` David Hildenbrand
@ 2023-01-05 18:37   ` Mike Kravetz
  2 siblings, 0 replies; 21+ messages in thread
From: Mike Kravetz @ 2023-01-05 18:37 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Muchun Song, Nadav Amit,
	Andrea Arcangeli, David Hildenbrand, James Houghton,
	Axel Rasmussen, Andrew Morton, linux-stable

On 01/04/23 17:52, Peter Xu wrote:
> Userfaultfd-wp uses pte markers to mark wr-protected pages for both shmem
> and hugetlb.  Shmem has pre-allocation ready for markers, but hugetlb path
> was overlooked.
> 
> Doing so by calling huge_pte_alloc() if the initial pgtable walk fails to
> find the huge ptep.  It's possible that huge_pte_alloc() can fail with high
> memory pressure, in that case stop the loop immediately and fail silently.
> This is not the most ideal solution but it matches with what we do with
> shmem meanwhile it avoids the splat in dmesg.
> 
> Cc: linux-stable <stable@vger.kernel.org> # 5.19+
> Fixes: 60dfaad65aa9 ("mm/hugetlb: allow uffd wr-protect none ptes")
> Reported-by: James Houghton <jthoughton@google.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/hugetlb.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)

Thanks Peter and James!

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
-- 
Mike Kravetz

> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index bf7a1f628357..017d9159cddf 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6649,8 +6649,17 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  		spinlock_t *ptl;
>  		ptep = hugetlb_walk(vma, address, psize);
>  		if (!ptep) {
> -			address |= last_addr_mask;
> -			continue;
> +			if (!uffd_wp) {
> +				address |= last_addr_mask;
> +				continue;
> +			}
> +			/*
> +			 * Userfaultfd wr-protect requires pgtable
> +			 * pre-allocations to install pte markers.
> +			 */
> +			ptep = huge_pte_alloc(mm, vma, address, psize);
> +			if (!ptep)
> +				break;
>  		}
>  		ptl = huge_pte_lock(h, mm, ptep);
>  		if (huge_pmd_unshare(mm, vma, address, ptep)) {
> -- 
> 2.37.3
> 


* Re: [PATCH 2/3] mm/mprotect: Use long for page accountings and retval
  2023-01-04 22:52 ` [PATCH 2/3] mm/mprotect: Use long for page accountings and retval Peter Xu
  2023-01-05  1:51   ` James Houghton
  2023-01-05  8:44   ` David Hildenbrand
@ 2023-01-05 18:48   ` Mike Kravetz
  2 siblings, 0 replies; 21+ messages in thread
From: Mike Kravetz @ 2023-01-05 18:48 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Muchun Song, Nadav Amit,
	Andrea Arcangeli, David Hildenbrand, James Houghton,
	Axel Rasmussen, Andrew Morton

On 01/04/23 17:52, Peter Xu wrote:
> Switch to use type "long" for page accountings and retval across the whole
> procedure of change_protection().
> 
> The change should have shrinked the possible maximum page number to be half
> comparing to previous (ULONG_MAX / 2), but it shouldn't overflow on any
> system either because the maximum possible pages touched by change
> protection should be ULONG_MAX / PAGE_SIZE.
> 
> Two reasons to switch from "unsigned long" to "long":
> 
>   1. It suites better on count_vm_numa_events(), whose 2nd parameter takes
>      a long type.
> 
>   2. It paves way for returning negative (error) values in the future.
> 
> Currently the only caller that consumes this retval is change_prot_numa(),
> where the unsigned long was converted to an int.  Since at it, touching up
> the numa code to also take a long, so it'll avoid any possible overflow too
> during the int-size convertion.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/hugetlb.h |  4 ++--
>  include/linux/mm.h      |  2 +-
>  mm/hugetlb.c            |  4 ++--
>  mm/mempolicy.c          |  2 +-
>  mm/mprotect.c           | 26 +++++++++++++-------------
>  5 files changed, 19 insertions(+), 19 deletions(-)

Acked-by: Mike Kravetz <mike.kravetz@oracle.com>

> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index b6b10101bea7..e3aa336df900 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -248,7 +248,7 @@ void hugetlb_vma_lock_release(struct kref *kref);
>  
>  int pmd_huge(pmd_t pmd);
>  int pud_huge(pud_t pud);
> -unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> +long hugetlb_change_protection(struct vm_area_struct *vma,
>  		unsigned long address, unsigned long end, pgprot_t newprot,
>  		unsigned long cp_flags);
>  
> @@ -437,7 +437,7 @@ static inline void move_hugetlb_state(struct folio *old_folio,
>  {
>  }
>  
> -static inline unsigned long hugetlb_change_protection(
> +static inline long hugetlb_change_protection(
>  			struct vm_area_struct *vma, unsigned long address,
>  			unsigned long end, pgprot_t newprot,
>  			unsigned long cp_flags)
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c37f9330f14e..86fe17e6ded7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2132,7 +2132,7 @@ static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma
>  }
>  bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  			     pte_t pte);
> -extern unsigned long change_protection(struct mmu_gather *tlb,
> +extern long change_protection(struct mmu_gather *tlb,
>  			      struct vm_area_struct *vma, unsigned long start,
>  			      unsigned long end, unsigned long cp_flags);
>  extern int mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 017d9159cddf..84bc665c7c86 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6613,7 +6613,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	return i ? i : err;
>  }
>  
> -unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> +long hugetlb_change_protection(struct vm_area_struct *vma,
>  		unsigned long address, unsigned long end,
>  		pgprot_t newprot, unsigned long cp_flags)
>  {
> @@ -6622,7 +6622,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  	pte_t *ptep;
>  	pte_t pte;
>  	struct hstate *h = hstate_vma(vma);
> -	unsigned long pages = 0, psize = huge_page_size(h);
> +	long pages = 0, psize = huge_page_size(h);

Small nit: psize is passed to routines as an unsigned long argument.
Arithmetic should always be correct, but I am not sure whether some
static checkers may complain.
-- 
Mike Kravetz

>  	bool shared_pmd = false;
>  	struct mmu_notifier_range range;
>  	unsigned long last_addr_mask;


* Re: [PATCH 2/3] mm/mprotect: Use long for page accountings and retval
  2023-01-05  8:44   ` David Hildenbrand
@ 2023-01-05 19:22     ` Peter Xu
  2023-01-09  8:04       ` David Hildenbrand
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Xu @ 2023-01-05 19:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Mike Kravetz, Muchun Song, Nadav Amit,
	Andrea Arcangeli, James Houghton, Axel Rasmussen, Andrew Morton

On Thu, Jan 05, 2023 at 09:44:16AM +0100, David Hildenbrand wrote:
> I'm wondering if we should just return the number of changed pages via a
> separate pointer and later using an int for returning errors -- when
> touching this interface already.
> 
> Only who's actually interested in the number of pages would pass a pointer
> to an unsigned long (NUMA).
> 
> And code that expects that there never ever are failures (mprotect, NUMA)
> could simply check for WARN_ON_ONCE(ret).
> 
> I assume you evaluated that option as well, what was your conclusion?

Since a single long can cover both things as the retval, I think it's
better to keep it simple.  Thanks,

-- 
Peter Xu



* Re: [PATCH 3/3] mm/uffd: Detect pgtable allocation failures
  2023-01-05 18:01       ` Nadav Amit
@ 2023-01-05 19:51         ` Peter Xu
  2023-01-18 21:51           ` Nadav Amit
  2023-01-09  8:36         ` David Hildenbrand
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Xu @ 2023-01-05 19:51 UTC (permalink / raw)
  To: Nadav Amit
  Cc: David Hildenbrand, Linux-MM, kernel list, Mike Kravetz,
	Muchun Song, Andrea Arcangeli, James Houghton, Axel Rasmussen,
	Andrew Morton

On Thu, Jan 05, 2023 at 10:01:46AM -0800, Nadav Amit wrote:
> 
> 
> > On Jan 5, 2023, at 12:59 AM, David Hildenbrand <david@redhat.com> wrote:
> > 
> > On 05.01.23 04:10, Nadav Amit wrote:
> >>> On Jan 4, 2023, at 2:52 PM, Peter Xu <peterx@redhat.com> wrote:
> >>> 
> >>> Before this patch, when a pgtable allocation failure happened during
> >>> change_protection(), the error was ignored by the syscall.  For shmem,
> >>> an error would be dumped into the host dmesg.  Two issues with that:
> >>> 
> >>>  (1) Doing a trace dump when allocation fails is nothing close to
> >>>      graceful..
> >>> 
> >>>  (2) The user should be notified of any such error, so the user can
> >>>      trap it and decide what to do next, whether by retrying, stopping
> >>>      the process properly, or anything else.
> >>> 
> >>> For userfault users, this will change the API of UFFDIO_WRITEPROTECT
> >>> when a pgtable allocation failure happens.  It should not normally
> >>> break anyone, though.  If it breaks, it breaks in good ways.
> >>> 
> >>> One man-page update will be on the way to introduce the new -ENOMEM
> >>> for UFFDIO_WRITEPROTECT.  Not marking stable so we keep the old
> >>> behavior on the 5.19-till-now kernels.
> >> I understand that the current assumption is that change_protection() should
> >> fully succeed or fail, and I guess this is the current behavior.
> >> However, to be more “future-proof” perhaps this needs to be revisited.
> >> For instance, UFFDIO_WRITEPROTECT can benefit from the ability to (based on
> >> userspace request) prevent write-protection of pages that are pinned. This is
> >> necessary to allow a userspace uffd monitor to avoid write-protection of
> >> O_DIRECT’d memory, for instance, that might change even if a uffd monitor
> >> considers it write-protected.
> > 
> > Just a note that this is pretty tricky IMHO, because:
> > 
> > a) We cannot distinguish "pinned readable" from "pinned writable"
> > b) We can have false positives ("pinned") even for compound pages due to
> >   concurrent GUP-fast.
> > c) Synchronizing against GUP-fast is pretty tricky ... as we learned.
> >   Concurrent pinning is usually problematic.
> > d) O_DIRECT still uses FOLL_GET and we cannot identify that. (at least
> >   that should be figured out at one point)
> 
> My prototype used the page-count IIRC, so it had false-positives (but
> addressed O_DIRECT).

I think this means the app is fine with not being able to write-protect
some of the requested pages?  For a swap framework I think that's fine,
but maybe not for taking a snapshot, so I agree it should be an optional
flag as you mentioned below.
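
A purely hypothetical kernel-side fragment (not a real patch) of what
such an opt-out could look like inside the wr-protect loop;
UFFD_WP_SKIP_PINNED is an invented flag, while page_maybe_dma_pinned()
is real but, as noted above, can return false positives:

/* hypothetical: when userspace opted in, leave ptes whose page may be
 * DMA-pinned writable instead of wr-protecting them */
if ((cp_flags & UFFD_WP_SKIP_PINNED) &&
    page_maybe_dma_pinned(pte_page(pte)))
        continue;
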

> And yes, precise refinement is complicated. However,
> if you need to uffd-wp memory, then without such a mechanism you need to
> ensure no kernel/DMA write to these pages is possible. The only other
> option I can think of is interposing/seccomp on a variety of syscalls,
> to prevent uffd-wp of such memory.
> 
> > 
> > I have had a patch lying around for a very long time that removes that special pinned handling from the softdirty code, because of the above reasons (and because it forgets THP). So far I haven't sent it because, for softdirty, it's acceptable to over-indicate, and it hasn't been reported as an actual problem.
> > 
> > For UFFDIO_WRITEPROTECT, however, it might be very harmful (especially for existing users) to get false protection errors. Failing due to ENOMEM is different from failing due to some temporary concurrency issue.
> 
> Yes, I propose it as an optional flag for UFFD-WP.
> Anyhow, I believe the UFFD-WP as implemented now is not efficient and
> should’ve been vectored to allow one TLB shootdown for many
> non-consecutive pages.

Agreed.  Would providing a vector of ranges help too for a few uffd ioctls?

I'm also curious whether you're still actively developing (or running) your
iouring series.
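
Just to make the vectored idea concrete, a purely hypothetical uAPI
sketch -- nothing like this exists today and all names are invented:

/* invented structure: wr-protect all ranges in one ioctl so the kernel
 * could batch the work under a single TLB shootdown */
struct uffdio_writeprotect_vec {
        __u64 nr_ranges;
        __u64 ranges;   /* userspace pointer to struct uffdio_range[] */
        __u64 mode;     /* same modes as UFFDIO_WRITEPROTECT */
};
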

> 
> > 
> > Having said that, I started thinking about alternative ways of detecting that in the past, without much outcome so far: the latest idea was indicating "this MM has had pinned pages at one point, be careful because any techniques that use write-protection (softdirty, mprotect, uffd-wp) won't be able to catch writes via pinned pages reliably".
> 
> I am not sure what the best way is to reliably detect that a page is
> write-pinned. My point was that if a change is already being carried into
> the write-protect mechanisms, then this issue should be considered,
> because otherwise many use-cases of uffd-wp would encounter
> implementation issues.
> 
> I will not “kill” myself over it now, but I think it is worth considering.

The current interface change is small and limited only to the extra -ENOMEM
retval under memory pressure (personally I don't really know how to trigger
this, as I never succeeded myself even under heavy memory pressure..).  What
you said does sound like a new area to explore, and I think it's fine to
change the interface again.
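
For reference, a minimal userspace sketch of handling the new retval
(assuming the range was already registered with
UFFDIO_REGISTER_MODE_WP):

#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* wr-protect [addr, addr+len); returns 0 on success, -errno on failure */
static int wp_range(int uffd, unsigned long addr, unsigned long len)
{
        struct uffdio_writeprotect wp;

        memset(&wp, 0, sizeof(wp));
        wp.range.start = addr;
        wp.range.len = len;
        wp.mode = UFFDIO_WRITEPROTECT_MODE_WP;

        if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) < 0) {
                /* new with this series: pgtable allocation failed; the
                 * caller may retry later or bail out gracefully */
                if (errno == ENOMEM)
                        return -ENOMEM;
                return -errno;
        }
        return 0;
}
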

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/3] mm/mprotect: Use long for page accountings and retval
  2023-01-05 19:22     ` Peter Xu
@ 2023-01-09  8:04       ` David Hildenbrand
  0 siblings, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2023-01-09  8:04 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Mike Kravetz, Muchun Song, Nadav Amit,
	Andrea Arcangeli, James Houghton, Axel Rasmussen, Andrew Morton

On 05.01.23 20:22, Peter Xu wrote:
> On Thu, Jan 05, 2023 at 09:44:16AM +0100, David Hildenbrand wrote:
>> I'm wondering if we should just return the number of changed pages via a
>> separate pointer and use an int for returning errors -- while we're
>> touching this interface already.
>>
>> Only callers actually interested in the number of pages would pass a
>> pointer to an unsigned long (NUMA).
>>
>> And code that expects that there are never ever failures (mprotect, NUMA)
>> could simply check for WARN_ON_ONCE(ret).
>>
>> I assume you evaluated that option as well, what was your conclusion?
> 
> Since a single long can cover both things as retval, it's better to keep it
> simple?  Thanks,
> 

Fine with me.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] mm/uffd: Detect pgtable allocation failures
  2023-01-05 18:01       ` Nadav Amit
  2023-01-05 19:51         ` Peter Xu
@ 2023-01-09  8:36         ` David Hildenbrand
  1 sibling, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2023-01-09  8:36 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Xu, Linux-MM, kernel list, Mike Kravetz, Muchun Song,
	Andrea Arcangeli, James Houghton, Axel Rasmussen, Andrew Morton

On 05.01.23 19:01, Nadav Amit wrote:
> 
> 
>> On Jan 5, 2023, at 12:59 AM, David Hildenbrand <david@redhat.com> wrote:
>>
>> On 05.01.23 04:10, Nadav Amit wrote:
>>>> On Jan 4, 2023, at 2:52 PM, Peter Xu <peterx@redhat.com> wrote:
>>>>
>>>> Before this patch, when a pgtable allocation failure happened during
>>>> change_protection(), the error was ignored by the syscall.  For shmem,
>>>> an error would be dumped into the host dmesg.  Two issues with that:
>>>>
>>>>   (1) Doing a trace dump when allocation fails is nothing close to
>>>>       graceful..
>>>>
>>>>   (2) The user should be notified of any such error, so the user can
>>>>       trap it and decide what to do next, whether by retrying, stopping
>>>>       the process properly, or anything else.
>>>>
>>>> For userfault users, this will change the API of UFFDIO_WRITEPROTECT
>>>> when a pgtable allocation failure happens.  It should not normally
>>>> break anyone, though.  If it breaks, it breaks in good ways.
>>>>
>>>> One man-page update will be on the way to introduce the new -ENOMEM
>>>> for UFFDIO_WRITEPROTECT.  Not marking stable so we keep the old
>>>> behavior on the 5.19-till-now kernels.
>>> I understand that the current assumption is that change_protection() should
>>> fully succeed or fail, and I guess this is the current behavior.
>>> However, to be more “future-proof” perhaps this needs to be revisited.
>>> For instance, UFFDIO_WRITEPROTECT can benefit from the ability to (based on
>>> userspace request) prevent write-protection of pages that are pinned. This is
>>> necessary to allow a userspace uffd monitor to avoid write-protection of
>>> O_DIRECT’d memory, for instance, that might change even if a uffd monitor
>>> considers it write-protected.
>>
>> Just a note that this is pretty tricky IMHO, because:
>>
>> a) We cannot distinguish "pinned readable" from "pinned writable"
>> b) We can have false positives ("pinned") even for compound pages due to
>>    concurrent GUP-fast.
>> c) Synchronizing against GUP-fast is pretty tricky ... as we learned.
>>    Concurrent pinning is usually problematic.
>> d) O_DIRECT still uses FOLL_GET and we cannot identify that. (at least
>>    that should be figured out at one point)
> 
> My prototype used the page-count IIRC, so it had false-positives (but

I suspect GUP-fast is still problematic, but I might be wrong.

> addressed O_DIRECT). And yes, precise refinement is complicated. However,
> if you need to uffd-wp memory, then without such a mechanism you need to
> ensure no kernel/DMA write to these pages is possible. The only other
> option I can think of is interposing/seccomp on a variety of syscalls,
> to prevent uffd-wp of such memory.

The whole thing reminds me of MADV_DONTNEED+pinning: an application 
shouldn't do it, because you can only get it wrong :) I know, that's a 
bad answer.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] mm/uffd: Detect pgtable allocation failures
  2023-01-05 19:51         ` Peter Xu
@ 2023-01-18 21:51           ` Nadav Amit
  0 siblings, 0 replies; 21+ messages in thread
From: Nadav Amit @ 2023-01-18 21:51 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, Linux-MM, kernel list, Mike Kravetz,
	Muchun Song, Andrea Arcangeli, James Houghton, Axel Rasmussen,
	Andrew Morton

Sorry for the late response.

>> Yes, I propose it as an optional flag for UFFD-WP.
>> Anyhow, I believe the UFFD-WP as implemented now is not efficient and
>> should’ve been vectored to allow one TLB shootdown for many
>> non-consecutive pages.
> 
> Agreed.  Would providing a vector of ranges help too for a few uffd ioctls?
> 
> I'm also curious whether you're still actively developing (or running) your
> iouring series.

So I finished building a prototype some time ago, and one of the benefits
was in reducing memory reclamation time. Unfortunately, MGLRU then came and
took away a lot of the benefit.

A colleague of mine had a slightly different use-case, so I gave him the
code and he showed interest in upstreaming it. After some probing, it
turned out he decided he was not up for the effort of upstreaming it. I
can upstream the vectored WP once I write some tests.

>> 
>> I am not sure what the best way is to reliably detect that a page is
>> write-pinned. My point was that if a change is already being carried into
>> the write-protect mechanisms, then this issue should be considered,
>> because otherwise many use-cases of uffd-wp would encounter
>> implementation issues.
>> 
>> I will not “kill” myself over it now, but I think it is worth considering.
> 
> The current interface change is small and limited only to the extra -ENOMEM
> retval under memory pressure (personally I don't really know how to trigger
> this, as I never succeeded myself even under heavy memory pressure..).  What
> you said does sound like a new area to explore, and I think it's fine to
> change the interface again.

Understood.

Thanks and sorry again for the late response,
Nadav

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2023-01-18 21:52 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-04 22:52 [PATCH 0/3] mm/uffd: Fix missing markers on hugetlb Peter Xu
2023-01-04 22:52 ` [PATCH 1/3] mm/hugetlb: Pre-allocate pgtable pages for uffd wr-protects Peter Xu
2023-01-05  1:50   ` James Houghton
2023-01-05  8:39   ` David Hildenbrand
2023-01-05 18:37   ` Mike Kravetz
2023-01-04 22:52 ` [PATCH 2/3] mm/mprotect: Use long for page accountings and retval Peter Xu
2023-01-05  1:51   ` James Houghton
2023-01-05  8:44   ` David Hildenbrand
2023-01-05 19:22     ` Peter Xu
2023-01-09  8:04       ` David Hildenbrand
2023-01-05 18:48   ` Mike Kravetz
2023-01-04 22:52 ` [PATCH 3/3] mm/uffd: Detect pgtable allocation failures Peter Xu
2023-01-05  1:52   ` James Houghton
2023-01-05  3:10   ` Nadav Amit
2023-01-05  8:59     ` David Hildenbrand
2023-01-05 18:01       ` Nadav Amit
2023-01-05 19:51         ` Peter Xu
2023-01-18 21:51           ` Nadav Amit
2023-01-09  8:36         ` David Hildenbrand
2023-01-05  8:47   ` David Hildenbrand
2023-01-05  8:16 ` [PATCH 0/3] mm/uffd: Fix missing markers on hugetlb David Hildenbrand
