linux-mm.kvack.org archive mirror
* [PATCH RESEND v6 00/16] mm: Page fault enhancements
@ 2020-02-20 15:53 Peter Xu
  2020-02-20 15:53 ` [PATCH RESEND v6 01/16] mm/gup: Rename "nonblocking" to "locked" where proper Peter Xu
                   ` (18 more replies)
  0 siblings, 19 replies; 31+ messages in thread
From: Peter Xu @ 2020-02-20 15:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Martin Cracauer, Linus Torvalds, Mike Rapoport,
	Kirill A . Shutemov, Johannes Weiner, Dr . David Alan Gilbert,
	David Hildenbrand, Bobby Powers, Maya Gokhale, Jerome Glisse,
	Mike Kravetz, Matthew Wilcox, Marty McFadden, Mel Gorman, peterx,
	Hugh Dickins, Brian Geffon, Denis Plotnikov, Pavel Emelyanov

[Resend v6]

This is v6 of the series.  It is mostly a rebase onto 5.6-rc2; nothing
else has changed (plus some tests after the rebase).  Instead of
rewriting the cover letter, I decided to reuse the one from v5.

Adding extra CCs for both Bobby Powers <bobbypowers@gmail.com> and
Brian Geffon <bgeffon@google.com>.

Online repo: https://github.com/xzpeter/linux/tree/mm-pf-signal-retry

Any review comment is appreciated.  Thanks,

=============== v5 cover letter ==================

This is v5 of the series.  As Matthew suggested, I split the previous
patch "mm: Return faster for non-fatal signals in user mode faults"
into a few smaller ones:

  1. One patch to introduce fault_signal_pending(), and use it in the
     archs where it can be applied directly

  2. A few more patches to let the rest of the archs use the new
     helper.  With that, we can have a unified entry for signal
     detection

  3. One last patch to change fault_signal_pending() to also detect
     userspace non-fatal signals

Nothing should have changed in the rest of the patches.  Because the
fault retry patches depend on the previous ones, I decided to simply
repost all the patches.

Here's the new patchset layout:

Patch 1-2:      cleanups, and a potential bugfix of hugetlbfs on fault retry

Patch 3-9:      let the page fault handler respond to non-fatal signals faster

Patch 10:       remove the userfaultfd NOPAGE emulation

Patch 11-14:    allow the page fault to retry more than once

Patch 15-16:    let the GUP code use FAULT_FLAG_KILLABLE too

I would really appreciate any review comments for the series,
especially for the first two patches, which IMHO are not even specific
to this patchset: they are standalone cleanups and fixes.

Smoke tested on x86 only.

Thanks,

v5:
- split "mm: Return faster for non-fatal signals in user mode faults"
  into a few more patches, letting all archs use a unified entry for
  fast signal handling (fault_signal_pending)

v4:
- use lore.kernel.org for all the links in commit messages [Kirill]
- one more patch ("mm/gup: Fix __get_user_pages() on fault retry of
  hugetlb") to fix hugetlb path on fault retry
- one more patch ("mm/gup: Allow to react to fatal signals") to:
  - use down_read_killable() properly [Linus]
  - pass in FAULT_FLAG_KILLABLE for all GUP [Linus]
- one more patch ("mm/userfaultfd: Honor FAULT_FLAG_KILLABLE in fault
  path") to let handle_userfaultfd() respect FAULT_FLAG_KILLABLE.
  Should have no functional change after previous two new patches.

v3:
- check fatal signals in __get_user_page_locked() [Linus]
- add r-bs

v2:
- resent previous version, rebase only

=============== v1 cover letter ==================

This series is split out of the userfaultfd-wp series to cover only
the general page fault changes, since it seems to make sense on its own.

Basically it does two things:

  (a) Allows the page fault handlers to be more responsive to not
      only SIGKILL, but also the rest of the userspace signals
      (especially for user-mode faults), and,

  (b) Allows the page fault retry (VM_FAULT_RETRY) to happen more
      than once.

I'm keeping the CC list as in uffd-wp v5; hopefully I'm not sending
too much spam...

And, instead of writing the cover letter again, I'm just copy-pasting
my previous link here, which has more details on why we do this:

  https://patchwork.kernel.org/cover/10691991/

The major change from that latest version is that we introduced a new
page fault flag FAULT_FLAG_INTERRUPTIBLE, as suggested by Linus [1],
to represent that we would like the fault handler to respond to
non-fatal signals.  Also, we're more careful now about when to return
immediately from the page fault for such signals.  For example, now
we only check against signal_pending() for user-mode page faults, and
we keep the kernel-mode page fault path untouched.  More information
can be found in the separate patches.
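
For reference, the unified signal check that the series converges on
(the final form of fault_signal_pending() after patch 09) looks like
this:

	static inline bool fault_signal_pending(unsigned int fault_flags,
						struct pt_regs *regs)
	{
		return unlikely((fault_flags & VM_FAULT_RETRY) &&
				(fatal_signal_pending(current) ||
				 (user_mode(regs) && signal_pending(current))));
	}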

The patchset is only lightly tested on x86.

All comments are greatly welcomed.  Thanks,

[1] https://lkml.org/lkml/2019/6/25/1382

Peter Xu (16):
  mm/gup: Rename "nonblocking" to "locked" where proper
  mm/gup: Fix __get_user_pages() on fault retry of hugetlb
  mm: Introduce fault_signal_pending()
  x86/mm: Use helper fault_signal_pending()
  arc/mm: Use helper fault_signal_pending()
  arm64/mm: Use helper fault_signal_pending()
  powerpc/mm: Use helper fault_signal_pending()
  sh/mm: Use helper fault_signal_pending()
  mm: Return faster for non-fatal signals in user mode faults
  userfaultfd: Don't retake mmap_sem to emulate NOPAGE
  mm: Introduce FAULT_FLAG_DEFAULT
  mm: Introduce FAULT_FLAG_INTERRUPTIBLE
  mm: Allow VM_FAULT_RETRY for multiple times
  mm/gup: Allow VM_FAULT_RETRY for multiple times
  mm/gup: Allow to react to fatal signals
  mm/userfaultfd: Honor FAULT_FLAG_KILLABLE in fault path

 arch/alpha/mm/fault.c           |  6 +--
 arch/arc/mm/fault.c             | 35 +++++--------
 arch/arm/mm/fault.c             |  7 +--
 arch/arm64/mm/fault.c           | 26 +++------
 arch/hexagon/mm/vm_fault.c      |  5 +-
 arch/ia64/mm/fault.c            |  5 +-
 arch/m68k/mm/fault.c            |  7 +--
 arch/microblaze/mm/fault.c      |  5 +-
 arch/mips/mm/fault.c            |  5 +-
 arch/nds32/mm/fault.c           |  5 +-
 arch/nios2/mm/fault.c           |  7 +--
 arch/openrisc/mm/fault.c        |  5 +-
 arch/parisc/mm/fault.c          |  8 ++-
 arch/powerpc/mm/fault.c         | 20 ++-----
 arch/riscv/mm/fault.c           |  9 +---
 arch/s390/mm/fault.c            | 10 ++--
 arch/sh/mm/fault.c              | 13 +++--
 arch/sparc/mm/fault_32.c        |  5 +-
 arch/sparc/mm/fault_64.c        |  5 +-
 arch/um/kernel/trap.c           |  3 +-
 arch/unicore32/mm/fault.c       |  8 ++-
 arch/x86/mm/fault.c             | 30 +++++------
 arch/xtensa/mm/fault.c          |  5 +-
 drivers/gpu/drm/ttm/ttm_bo_vm.c | 12 +++--
 fs/userfaultfd.c                | 62 ++++++++++------------
 include/linux/mm.h              | 81 ++++++++++++++++++++++++----
 include/linux/sched/signal.h    | 14 +++++
 mm/filemap.c                    |  2 +-
 mm/gup.c                        | 93 +++++++++++++++++++++------------
 mm/hugetlb.c                    | 17 +++---
 mm/internal.h                   |  6 +--
 31 files changed, 281 insertions(+), 240 deletions(-)

-- 
2.24.1




* [PATCH RESEND v6 01/16] mm/gup: Rename "nonblocking" to "locked" where proper
  2020-02-20 15:53 [PATCH RESEND v6 00/16] mm: Page fault enhancements Peter Xu
@ 2020-02-20 15:53 ` Peter Xu
  2020-02-20 15:53 ` [PATCH RESEND v6 02/16] mm/gup: Fix __get_user_pages() on fault retry of hugetlb Peter Xu
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: Peter Xu @ 2020-02-20 15:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Martin Cracauer, Linus Torvalds, Mike Rapoport,
	Kirill A . Shutemov, Johannes Weiner, Dr . David Alan Gilbert,
	David Hildenbrand, Bobby Powers, Maya Gokhale, Jerome Glisse,
	Mike Kravetz, Matthew Wilcox, Marty McFadden, Mel Gorman, peterx,
	Hugh Dickins, Brian Geffon, Denis Plotnikov, Pavel Emelyanov

There are plenty of places around __get_user_pages() that have a
parameter "nonblocking" which does not really mean "it won't block"
(because it can block), but instead shows whether the mmap_sem is
released by up_read() during the page fault handling, mostly when
VM_FAULT_RETRY is returned.

We have the correct naming in e.g. get_user_pages_locked() or
get_user_pages_remote() as "locked", however there are still many
places that use "nonblocking" as the name.

Rename those places to "locked" where proper, to better suit the
functionality of the variable.  While at it, fix up some of the
comments accordingly.
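
For reference, a simplified sketch of the caller-side convention that
the "locked" name describes (modeled on __get_user_pages_locked(); not
the literal code):

	int locked = 1;

	down_read(&mm->mmap_sem);
	ret = __get_user_pages(tsk, mm, start, nr_pages, gup_flags,
			       pages, NULL, &locked);
	/*
	 * On VM_FAULT_RETRY the mmap_sem has already been released
	 * by the fault path and *locked was cleared to 0.
	 */
	if (locked)
		up_read(&mm->mmap_sem);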

Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/gup.c     | 44 +++++++++++++++++++++-----------------------
 mm/hugetlb.c |  8 ++++----
 2 files changed, 25 insertions(+), 27 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 1b521e0ac1de..1b4411bd0042 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -630,12 +630,12 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 }
 
 /*
- * mmap_sem must be held on entry.  If @nonblocking != NULL and
- * *@flags does not include FOLL_NOWAIT, the mmap_sem may be released.
- * If it is, *@nonblocking will be set to 0 and -EBUSY returned.
+ * mmap_sem must be held on entry.  If @locked != NULL and *@flags
+ * does not include FOLL_NOWAIT, the mmap_sem may be released.  If it
+ * is, *@locked will be set to 0 and -EBUSY returned.
  */
 static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
-		unsigned long address, unsigned int *flags, int *nonblocking)
+		unsigned long address, unsigned int *flags, int *locked)
 {
 	unsigned int fault_flags = 0;
 	vm_fault_t ret;
@@ -647,7 +647,7 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 		fault_flags |= FAULT_FLAG_WRITE;
 	if (*flags & FOLL_REMOTE)
 		fault_flags |= FAULT_FLAG_REMOTE;
-	if (nonblocking)
+	if (locked)
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY;
 	if (*flags & FOLL_NOWAIT)
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
@@ -673,8 +673,8 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 	}
 
 	if (ret & VM_FAULT_RETRY) {
-		if (nonblocking && !(fault_flags & FAULT_FLAG_RETRY_NOWAIT))
-			*nonblocking = 0;
+		if (locked && !(fault_flags & FAULT_FLAG_RETRY_NOWAIT))
+			*locked = 0;
 		return -EBUSY;
 	}
 
@@ -751,7 +751,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
  *		only intends to ensure the pages are faulted in.
  * @vmas:	array of pointers to vmas corresponding to each page.
  *		Or NULL if the caller does not require them.
- * @nonblocking: whether waiting for disk IO or mmap_sem contention
+ * @locked:     whether we're still with the mmap_sem held
  *
  * Returns either number of pages pinned (which may be less than the
  * number requested), or an error. Details about the return value:
@@ -786,13 +786,11 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
  * appropriate) must be called after the page is finished with, and
  * before put_page is called.
  *
- * If @nonblocking != NULL, __get_user_pages will not wait for disk IO
- * or mmap_sem contention, and if waiting is needed to pin all pages,
- * *@nonblocking will be set to 0.  Further, if @gup_flags does not
- * include FOLL_NOWAIT, the mmap_sem will be released via up_read() in
- * this case.
+ * If @locked != NULL, *@locked will be set to 0 when mmap_sem is
+ * released by an up_read().  That can happen if @gup_flags does not
+ * have FOLL_NOWAIT.
  *
- * A caller using such a combination of @nonblocking and @gup_flags
+ * A caller using such a combination of @locked and @gup_flags
  * must therefore hold the mmap_sem for reading only, and recognize
  * when it's been released.  Otherwise, it must be held for either
  * reading or writing and will not be released.
@@ -804,7 +802,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned long start, unsigned long nr_pages,
 		unsigned int gup_flags, struct page **pages,
-		struct vm_area_struct **vmas, int *nonblocking)
+		struct vm_area_struct **vmas, int *locked)
 {
 	long ret = 0, i = 0;
 	struct vm_area_struct *vma = NULL;
@@ -850,7 +848,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			if (is_vm_hugetlb_page(vma)) {
 				i = follow_hugetlb_page(mm, vma, pages, vmas,
 						&start, &nr_pages, i,
-						gup_flags, nonblocking);
+						gup_flags, locked);
 				continue;
 			}
 		}
@@ -868,7 +866,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		page = follow_page_mask(vma, start, foll_flags, &ctx);
 		if (!page) {
 			ret = faultin_page(tsk, vma, start, &foll_flags,
-					nonblocking);
+					   locked);
 			switch (ret) {
 			case 0:
 				goto retry;
@@ -1129,7 +1127,7 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
  * @vma:   target vma
  * @start: start address
  * @end:   end address
- * @nonblocking:
+ * @locked: whether the mmap_sem is still held
  *
  * This takes care of mlocking the pages too if VM_LOCKED is set.
  *
@@ -1137,14 +1135,14 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
  *
  * vma->vm_mm->mmap_sem must be held.
  *
- * If @nonblocking is NULL, it may be held for read or write and will
+ * If @locked is NULL, it may be held for read or write and will
  * be unperturbed.
  *
- * If @nonblocking is non-NULL, it must held for read only and may be
- * released.  If it's released, *@nonblocking will be set to 0.
+ * If @locked is non-NULL, it must held for read only and may be
+ * released.  If it's released, *@locked will be set to 0.
  */
 long populate_vma_page_range(struct vm_area_struct *vma,
-		unsigned long start, unsigned long end, int *nonblocking)
+		unsigned long start, unsigned long end, int *locked)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long nr_pages = (end - start) / PAGE_SIZE;
@@ -1179,7 +1177,7 @@ long populate_vma_page_range(struct vm_area_struct *vma,
 	 * not result in a stack expansion that recurses back here.
 	 */
 	return __get_user_pages(current, mm, start, nr_pages, gup_flags,
-				NULL, NULL, nonblocking);
+				NULL, NULL, locked);
 }
 
 /*
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dd8737a94bec..c84f721db020 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4266,7 +4266,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			 struct page **pages, struct vm_area_struct **vmas,
 			 unsigned long *position, unsigned long *nr_pages,
-			 long i, unsigned int flags, int *nonblocking)
+			 long i, unsigned int flags, int *locked)
 {
 	unsigned long pfn_offset;
 	unsigned long vaddr = *position;
@@ -4337,7 +4337,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				spin_unlock(ptl);
 			if (flags & FOLL_WRITE)
 				fault_flags |= FAULT_FLAG_WRITE;
-			if (nonblocking)
+			if (locked)
 				fault_flags |= FAULT_FLAG_ALLOW_RETRY;
 			if (flags & FOLL_NOWAIT)
 				fault_flags |= FAULT_FLAG_ALLOW_RETRY |
@@ -4354,9 +4354,9 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				break;
 			}
 			if (ret & VM_FAULT_RETRY) {
-				if (nonblocking &&
+				if (locked &&
 				    !(fault_flags & FAULT_FLAG_RETRY_NOWAIT))
-					*nonblocking = 0;
+					*locked = 0;
 				*nr_pages = 0;
 				/*
 				 * VM_FAULT_RETRY must not return an
-- 
2.24.1




* [PATCH RESEND v6 02/16] mm/gup: Fix __get_user_pages() on fault retry of hugetlb
  2020-02-20 15:53 [PATCH RESEND v6 00/16] mm: Page fault enhancements Peter Xu
  2020-02-20 15:53 ` [PATCH RESEND v6 01/16] mm/gup: Rename "nonblocking" to "locked" where proper Peter Xu
@ 2020-02-20 15:53 ` Peter Xu
  2020-03-02 19:02   ` David Hildenbrand
  2020-02-20 15:53 ` [PATCH RESEND v6 03/16] mm: Introduce fault_signal_pending() Peter Xu
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 31+ messages in thread
From: Peter Xu @ 2020-02-20 15:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Martin Cracauer, Linus Torvalds, Mike Rapoport,
	Kirill A . Shutemov, Johannes Weiner, Dr . David Alan Gilbert,
	David Hildenbrand, Bobby Powers, Maya Gokhale, Jerome Glisse,
	Mike Kravetz, Matthew Wilcox, Marty McFadden, Mel Gorman, peterx,
	Hugh Dickins, Brian Geffon, Denis Plotnikov, Pavel Emelyanov

When follow_hugetlb_page() returns with *locked==0, it means we've got
a VM_FAULT_RETRY within the faulting process and we've released the
mmap_sem.  When that happens, we should stop and bail out.
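
To illustrate the problem, a simplified sketch of the loop in
__get_user_pages() before this fix (not the literal code): the loop
would keep walking, still using "vma", after the mmap_sem was dropped:

	if (is_vm_hugetlb_page(vma)) {
		i = follow_hugetlb_page(mm, vma, pages, vmas,
					&start, &nr_pages, i,
					gup_flags, locked);
		/* mmap_sem may be gone here if *locked == 0 ... */
		continue;	/* ... yet the loop keeps using "vma" */
	}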

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/gup.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index 1b4411bd0042..76cb420c0fb7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -849,6 +849,16 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 				i = follow_hugetlb_page(mm, vma, pages, vmas,
 						&start, &nr_pages, i,
 						gup_flags, locked);
+				if (locked && *locked == 0) {
+					/*
+					 * We've got a VM_FAULT_RETRY
+					 * and we've lost mmap_sem.
+					 * We must stop here.
+					 */
+					BUG_ON(gup_flags & FOLL_NOWAIT);
+					BUG_ON(ret != 0);
+					goto out;
+				}
 				continue;
 			}
 		}
-- 
2.24.1




* [PATCH RESEND v6 03/16] mm: Introduce fault_signal_pending()
  2020-02-20 15:53 [PATCH RESEND v6 00/16] mm: Page fault enhancements Peter Xu
  2020-02-20 15:53 ` [PATCH RESEND v6 01/16] mm/gup: Rename "nonblocking" to "locked" where proper Peter Xu
  2020-02-20 15:53 ` [PATCH RESEND v6 02/16] mm/gup: Fix __get_user_pages() on fault retry of hugetlb Peter Xu
@ 2020-02-20 15:53 ` Peter Xu
  2020-03-02 19:04   ` David Hildenbrand
  2020-02-20 15:53 ` [PATCH RESEND v6 04/16] x86/mm: Use helper fault_signal_pending() Peter Xu
                   ` (15 subsequent siblings)
  18 siblings, 1 reply; 31+ messages in thread
From: Peter Xu @ 2020-02-20 15:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Martin Cracauer, Linus Torvalds, Mike Rapoport,
	Kirill A . Shutemov, Johannes Weiner, Dr . David Alan Gilbert,
	David Hildenbrand, Bobby Powers, Maya Gokhale, Jerome Glisse,
	Mike Kravetz, Matthew Wilcox, Marty McFadden, Mel Gorman, peterx,
	Hugh Dickins, Brian Geffon, Denis Plotnikov, Pavel Emelyanov

For most architectures, we've got a quick path to detect a fatal
signal after handle_mm_fault().  Introduce a helper for that quick
path.

It cleans up the current code a bit so we don't need to duplicate the
same check across archs.  More importantly, this will be a unified
place where we handle the signal immediately after an interrupted page
fault, so it'll be much easier for us to change the signal handling
behavior later on for all the archs.

Note that currently only some of the archs use this new helper,
because some archs have their own way to handle signals.  In the
follow-up patches, we'll apply this helper to the rest of the archs.

Another note is that the "regs" parameter in the new helper is not
used yet.  It'll be used very soon.  We keep it in this patch only to
avoid touching all the archs again in the follow-up patches.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/alpha/mm/fault.c        |  2 +-
 arch/arm/mm/fault.c          |  2 +-
 arch/hexagon/mm/vm_fault.c   |  2 +-
 arch/ia64/mm/fault.c         |  2 +-
 arch/m68k/mm/fault.c         |  2 +-
 arch/microblaze/mm/fault.c   |  2 +-
 arch/mips/mm/fault.c         |  2 +-
 arch/nds32/mm/fault.c        |  2 +-
 arch/nios2/mm/fault.c        |  2 +-
 arch/openrisc/mm/fault.c     |  2 +-
 arch/parisc/mm/fault.c       |  2 +-
 arch/riscv/mm/fault.c        |  2 +-
 arch/s390/mm/fault.c         |  3 +--
 arch/sparc/mm/fault_32.c     |  2 +-
 arch/sparc/mm/fault_64.c     |  2 +-
 arch/unicore32/mm/fault.c    |  2 +-
 arch/xtensa/mm/fault.c       |  2 +-
 include/linux/sched/signal.h | 13 +++++++++++++
 18 files changed, 30 insertions(+), 18 deletions(-)

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index 741e61ef9d3f..aea33b599037 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -150,7 +150,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
 	   the fault.  */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if (fault_signal_pending(fault, regs))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index bd0f4821f7e1..937b81ff8649 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -295,7 +295,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	 * signal first. We do not need to release the mmap_sem because
 	 * it would already be released in __lock_page_or_retry in
 	 * mm/filemap.c. */
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
+	if (fault_signal_pending(fault, regs)) {
 		if (!user_mode(regs))
 			goto no_context;
 		return 0;
diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
index b3bc71680ae4..d19beaf11b4c 100644
--- a/arch/hexagon/mm/vm_fault.c
+++ b/arch/hexagon/mm/vm_fault.c
@@ -91,7 +91,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
 
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if (fault_signal_pending(fault, regs))
 		return;
 
 	/* The most common case -- we are done. */
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index c2f299fe9e04..211b4f439384 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -141,7 +141,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
 	 */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if (fault_signal_pending(fault, regs))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index e9b1d7585b43..a455e202691b 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -138,7 +138,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 	fault = handle_mm_fault(vma, address, flags);
 	pr_debug("handle_mm_fault returns %x\n", fault);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if (fault_signal_pending(fault, regs))
 		return 0;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index e6a810b0c7ad..cdde01dcdfc3 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -217,7 +217,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
 	 */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if (fault_signal_pending(fault, regs))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 1e8d00793784..0ee6fafc57bc 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -154,7 +154,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
 	 */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if (fault_signal_pending(fault, regs))
 		return;
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
diff --git a/arch/nds32/mm/fault.c b/arch/nds32/mm/fault.c
index 906dfb25353c..0e63f81eff5b 100644
--- a/arch/nds32/mm/fault.c
+++ b/arch/nds32/mm/fault.c
@@ -214,7 +214,7 @@ void do_page_fault(unsigned long entry, unsigned long addr,
 	 * signal first. We do not need to release the mmap_sem because it
 	 * would already be released in __lock_page_or_retry in mm/filemap.c.
 	 */
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
+	if (fault_signal_pending(fault, regs)) {
 		if (!user_mode(regs))
 			goto no_context;
 		return;
diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c
index 6a2e716b959f..704ace8ca0ee 100644
--- a/arch/nios2/mm/fault.c
+++ b/arch/nios2/mm/fault.c
@@ -133,7 +133,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause,
 	 */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if (fault_signal_pending(fault, regs))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index 5d4d3a9691d0..85c7eb0c0186 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -161,7 +161,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
 
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if (fault_signal_pending(fault, regs))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index adbd5e2144a3..f9be1d1cb43f 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -304,7 +304,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
 
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if (fault_signal_pending(fault, regs))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index cf7248e07f43..1d3869e9ddef 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -117,7 +117,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
 	 * signal first. We do not need to release the mmap_sem because it
 	 * would already be released in __lock_page_or_retry in mm/filemap.c.
 	 */
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(tsk))
+	if (fault_signal_pending(fault, regs))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index 7b0bb475c166..179cf92a56e5 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -480,8 +480,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 	 * the fault.
 	 */
 	fault = handle_mm_fault(vma, address, flags);
-	/* No reason to continue if interrupted by SIGKILL. */
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
+	if (fault_signal_pending(fault, regs)) {
 		fault = VM_FAULT_SIGNAL;
 		if (flags & FAULT_FLAG_RETRY_NOWAIT)
 			goto out_up;
diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index 89976c9b936c..6efbeb227644 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -237,7 +237,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 	 */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if (fault_signal_pending(fault, regs))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index 8b7ddbd14b65..dd1ed6555831 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -425,7 +425,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if (fault_signal_pending(fault, regs))
 		goto exit_exception;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index 76342de9cf8c..59d0e6ec2cfc 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -250,7 +250,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	 * signal first. We do not need to release the mmap_sem because
 	 * it would already be released in __lock_page_or_retry in
 	 * mm/filemap.c. */
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if (fault_signal_pending(fault, regs))
 		return 0;
 
 	if (!(fault & VM_FAULT_ERROR) && (flags & FAULT_FLAG_ALLOW_RETRY)) {
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index bee30a77cd70..59515905d4ad 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -110,7 +110,7 @@ void do_page_fault(struct pt_regs *regs)
 	 */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if (fault_signal_pending(fault, regs))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 88050259c466..4c87ffce64d1 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -369,6 +369,19 @@ static inline int signal_pending_state(long state, struct task_struct *p)
 	return (state & TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);
 }
 
+/*
+ * This should only be used in fault handlers to decide whether we
+ * should stop the current fault routine to handle the signals
+ * instead, especially with the case where we've got interrupted with
+ * a VM_FAULT_RETRY.
+ */
+static inline bool fault_signal_pending(unsigned int fault_flags,
+					struct pt_regs *regs)
+{
+	return unlikely((fault_flags & VM_FAULT_RETRY) &&
+			fatal_signal_pending(current));
+}
+
 /*
  * Reevaluate whether the task has signals pending delivery.
  * Wake the task if so.
-- 
2.24.1




* [PATCH RESEND v6 04/16] x86/mm: Use helper fault_signal_pending()
  2020-02-20 15:53 [PATCH RESEND v6 00/16] mm: Page fault enhancements Peter Xu
                   ` (2 preceding siblings ...)
  2020-02-20 15:53 ` [PATCH RESEND v6 03/16] mm: Introduce fault_signal_pending() Peter Xu
@ 2020-02-20 15:53 ` Peter Xu
  2020-02-20 15:58 ` [PATCH RESEND v6 05/16] arc/mm: " Peter Xu
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: Peter Xu @ 2020-02-20 15:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Martin Cracauer, Linus Torvalds, Mike Rapoport,
	Kirill A . Shutemov, Johannes Weiner, Dr . David Alan Gilbert,
	David Hildenbrand, Bobby Powers, Maya Gokhale, Jerome Glisse,
	Mike Kravetz, Matthew Wilcox, Marty McFadden, Mel Gorman, peterx,
	Hugh Dickins, Brian Geffon, Denis Plotnikov, Pavel Emelyanov

Let's move the fatal signal check even earlier so that we can directly
use the new fault_signal_pending() in x86 mm code.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/mm/fault.c | 28 +++++++++++++---------------
 1 file changed, 13 insertions(+), 15 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index fa4ea09593ab..6a00bc8d047f 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1442,27 +1442,25 @@ void do_user_addr_fault(struct pt_regs *regs,
 	fault = handle_mm_fault(vma, address, flags);
 	major |= fault & VM_FAULT_MAJOR;
 
+	/* Quick path to respond to signals */
+	if (fault_signal_pending(fault, regs)) {
+		if (!user_mode(regs))
+			no_context(regs, hw_error_code, address, SIGBUS,
+				   BUS_ADRERR);
+		return;
+	}
+
 	/*
 	 * If we need to retry the mmap_sem has already been released,
 	 * and if there is a fatal signal pending there is no guarantee
 	 * that we made any progress. Handle this case first.
 	 */
-	if (unlikely(fault & VM_FAULT_RETRY)) {
+	if (unlikely((fault & VM_FAULT_RETRY) &&
+		     (flags & FAULT_FLAG_ALLOW_RETRY))) {
 		/* Retry at most once */
-		if (flags & FAULT_FLAG_ALLOW_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
-			flags |= FAULT_FLAG_TRIED;
-			if (!fatal_signal_pending(tsk))
-				goto retry;
-		}
-
-		/* User mode? Just return to handle the fatal exception */
-		if (flags & FAULT_FLAG_USER)
-			return;
-
-		/* Not returning to user mode? Handle exceptions or die: */
-		no_context(regs, hw_error_code, address, SIGBUS, BUS_ADRERR);
-		return;
+		flags &= ~FAULT_FLAG_ALLOW_RETRY;
+		flags |= FAULT_FLAG_TRIED;
+		goto retry;
 	}
 
 	up_read(&mm->mmap_sem);
-- 
2.24.1




* [PATCH RESEND v6 05/16] arc/mm: Use helper fault_signal_pending()
  2020-02-20 15:53 [PATCH RESEND v6 00/16] mm: Page fault enhancements Peter Xu
                   ` (3 preceding siblings ...)
  2020-02-20 15:53 ` [PATCH RESEND v6 04/16] x86/mm: Use helper fault_signal_pending() Peter Xu
@ 2020-02-20 15:58 ` Peter Xu
  2020-02-20 15:59 ` [PATCH RESEND v6 06/16] arm64/mm: " Peter Xu
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: Peter Xu @ 2020-02-20 15:58 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Peter Xu, Martin Cracauer, Mike Rapoport, Hugh Dickins,
	Jerome Glisse, Kirill A . Shutemov, Matthew Wilcox,
	Pavel Emelyanov, Brian Geffon, Maya Gokhale, Denis Plotnikov,
	Andrea Arcangeli, Johannes Weiner, Dr . David Alan Gilbert,
	Linus Torvalds, Mike Kravetz, Marty McFadden, David Hildenbrand,
	Bobby Powers, Mel Gorman

Let ARC use the new helper fault_signal_pending() by moving the signal
check out of the retry logic as a standalone check.  This should also
help simplify the code a bit.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/arc/mm/fault.c | 34 +++++++++++++---------------------
 1 file changed, 13 insertions(+), 21 deletions(-)

diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index fb86bc3e9b35..6eb821a59b49 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -133,29 +133,21 @@ void do_page_fault(unsigned long address, struct pt_regs *regs)
 
 	fault = handle_mm_fault(vma, address, flags);
 
+	/* Quick path to respond to signals */
+	if (fault_signal_pending(fault, regs)) {
+		if (!user_mode(regs))
+			goto no_context;
+		return;
+	}
+
 	/*
-	 * Fault retry nuances
+	 * Fault retry nuances, mmap_sem already relinquished by core mm
 	 */
-	if (unlikely(fault & VM_FAULT_RETRY)) {
-
-		/*
-		 * If fault needs to be retried, handle any pending signals
-		 * first (by returning to user mode).
-		 * mmap_sem already relinquished by core mm for RETRY case
-		 */
-		if (fatal_signal_pending(current)) {
-			if (!user_mode(regs))
-				goto no_context;
-			return;
-		}
-		/*
-		 * retry state machine
-		 */
-		if (flags & FAULT_FLAG_ALLOW_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
-			flags |= FAULT_FLAG_TRIED;
-			goto retry;
-		}
+	if (unlikely((fault & VM_FAULT_RETRY) &&
+		     (flags & FAULT_FLAG_ALLOW_RETRY))) {
+		flags &= ~FAULT_FLAG_ALLOW_RETRY;
+		flags |= FAULT_FLAG_TRIED;
+		goto retry;
 	}
 
 bad_area:
-- 
2.24.1




* [PATCH RESEND v6 06/16] arm64/mm: Use helper fault_signal_pending()
  2020-02-20 15:53 [PATCH RESEND v6 00/16] mm: Page fault enhancements Peter Xu
                   ` (4 preceding siblings ...)
  2020-02-20 15:58 ` [PATCH RESEND v6 05/16] arc/mm: " Peter Xu
@ 2020-02-20 15:59 ` Peter Xu
  2020-02-20 16:02 ` [PATCH RESEND v6 07/16] powerpc/mm: " Peter Xu
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: Peter Xu @ 2020-02-20 15:59 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Peter Xu, Martin Cracauer, Mike Rapoport, Hugh Dickins,
	Jerome Glisse, Kirill A . Shutemov, Matthew Wilcox,
	Pavel Emelyanov, Brian Geffon, Maya Gokhale, Denis Plotnikov,
	Andrea Arcangeli, Johannes Weiner, Dr . David Alan Gilbert,
	Linus Torvalds, Mike Kravetz, Marty McFadden, David Hildenbrand,
	Bobby Powers, Mel Gorman

Let the arm64 fault handling code use the new fault_signal_pending()
helper, by moving the signal handling out of the retry logic.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/arm64/mm/fault.c | 19 +++++++------------
 1 file changed, 7 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 85566d32958f..6f4b69d712b1 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -513,19 +513,14 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
 	fault = __do_page_fault(mm, addr, mm_flags, vm_flags);
 	major |= fault & VM_FAULT_MAJOR;
 
-	if (fault & VM_FAULT_RETRY) {
-		/*
-		 * If we need to retry but a fatal signal is pending,
-		 * handle the signal first. We do not need to release
-		 * the mmap_sem because it would already be released
-		 * in __lock_page_or_retry in mm/filemap.c.
-		 */
-		if (fatal_signal_pending(current)) {
-			if (!user_mode(regs))
-				goto no_context;
-			return 0;
-		}
+	/* Quick path to respond to signals */
+	if (fault_signal_pending(fault, regs)) {
+		if (!user_mode(regs))
+			goto no_context;
+		return 0;
+	}
 
+	if (fault & VM_FAULT_RETRY) {
 		/*
 		 * Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk of
 		 * starvation.
-- 
2.24.1




* [PATCH RESEND v6 07/16] powerpc/mm: Use helper fault_signal_pending()
  2020-02-20 15:53 [PATCH RESEND v6 00/16] mm: Page fault enhancements Peter Xu
                   ` (5 preceding siblings ...)
  2020-02-20 15:59 ` [PATCH RESEND v6 06/16] arm64/mm: " Peter Xu
@ 2020-02-20 16:02 ` Peter Xu
  2020-02-20 16:02 ` [PATCH RESEND v6 08/16] sh/mm: " Peter Xu
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: Peter Xu @ 2020-02-20 16:02 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Peter Xu, Martin Cracauer, Mike Rapoport, Hugh Dickins,
	Jerome Glisse, Kirill A . Shutemov, Matthew Wilcox,
	Pavel Emelyanov, Brian Geffon, Maya Gokhale, Denis Plotnikov,
	Andrea Arcangeli, Johannes Weiner, Dr . David Alan Gilbert,
	Linus Torvalds, Mike Kravetz, Marty McFadden, David Hildenbrand,
	Bobby Powers, Mel Gorman

Let the powerpc code use the new helper, by moving the signal handling
earlier, before the retry logic.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/powerpc/mm/fault.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 8db0507619e2..0868172ce4e3 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -582,6 +582,9 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
 
 	major |= fault & VM_FAULT_MAJOR;
 
+	if (fault_signal_pending(fault, regs))
+		return user_mode(regs) ? 0 : SIGBUS;
+
 	/*
 	 * Handle the retry right now, the mmap_sem has been released in that
 	 * case.
@@ -595,15 +598,8 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
 			 */
 			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
-			if (!fatal_signal_pending(current))
-				goto retry;
+			goto retry;
 		}
-
-		/*
-		 * User mode? Just return to handle the fatal exception otherwise
-		 * return to bad_page_fault
-		 */
-		return is_user ? 0 : SIGBUS;
 	}
 
 	up_read(&current->mm->mmap_sem);
-- 
2.24.1




* [PATCH RESEND v6 08/16] sh/mm: Use helper fault_signal_pending()
  2020-02-20 15:53 [PATCH RESEND v6 00/16] mm: Page fault enhancements Peter Xu
                   ` (6 preceding siblings ...)
  2020-02-20 16:02 ` [PATCH RESEND v6 07/16] powerpc/mm: " Peter Xu
@ 2020-02-20 16:02 ` Peter Xu
  2020-02-20 16:02 ` [PATCH RESEND v6 09/16] mm: Return faster for non-fatal signals in user mode faults Peter Xu
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: Peter Xu @ 2020-02-20 16:02 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Peter Xu, Martin Cracauer, Mike Rapoport, Hugh Dickins,
	Jerome Glisse, Kirill A . Shutemov, Matthew Wilcox,
	Pavel Emelyanov, Brian Geffon, Maya Gokhale, Denis Plotnikov,
	Andrea Arcangeli, Johannes Weiner, Dr . David Alan Gilbert,
	Linus Torvalds, Mike Kravetz, Marty McFadden, David Hildenbrand,
	Bobby Powers, Mel Gorman

Let SH use the new fault_signal_pending() helper.  Here we'll need to
move the up_read() out because it's actually needed in all !RETRY
cases.  Meanwhile, we can drop all the rest of the up_read()s (which
seems cleaner).

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/sh/mm/fault.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index 5f51456f4fc7..eb4048ad0b38 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -302,25 +302,25 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 	 * Pagefault was interrupted by SIGKILL. We have no reason to
 	 * continue pagefault.
 	 */
-	if (fatal_signal_pending(current)) {
-		if (!(fault & VM_FAULT_RETRY))
-			up_read(&current->mm->mmap_sem);
+	if (fault_signal_pending(fault, regs)) {
 		if (!user_mode(regs))
 			no_context(regs, error_code, address);
 		return 1;
 	}
 
+	/* Release mmap_sem first if necessary */
+	if (!(fault & VM_FAULT_RETRY))
+		up_read(&current->mm->mmap_sem);
+
 	if (!(fault & VM_FAULT_ERROR))
 		return 0;
 
 	if (fault & VM_FAULT_OOM) {
 		/* Kernel mode? Handle exceptions or die: */
 		if (!user_mode(regs)) {
-			up_read(&current->mm->mmap_sem);
 			no_context(regs, error_code, address);
 			return 1;
 		}
-		up_read(&current->mm->mmap_sem);
 
 		/*
 		 * We ran out of memory, call the OOM killer, and return the
-- 
2.24.1




* [PATCH RESEND v6 09/16] mm: Return faster for non-fatal signals in user mode faults
  2020-02-20 15:53 [PATCH RESEND v6 00/16] mm: Page fault enhancements Peter Xu
                   ` (7 preceding siblings ...)
  2020-02-20 16:02 ` [PATCH RESEND v6 08/16] sh/mm: " Peter Xu
@ 2020-02-20 16:02 ` Peter Xu
  2020-02-20 16:02 ` [PATCH RESEND v6 10/16] userfaultfd: Don't retake mmap_sem to emulate NOPAGE Peter Xu
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: Peter Xu @ 2020-02-20 16:02 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Peter Xu, Martin Cracauer, Mike Rapoport, Hugh Dickins,
	Jerome Glisse, Kirill A . Shutemov, Matthew Wilcox,
	Pavel Emelyanov, Brian Geffon, Maya Gokhale, Denis Plotnikov,
	Andrea Arcangeli, Johannes Weiner, Dr . David Alan Gilbert,
	Linus Torvalds, Mike Kravetz, Marty McFadden, David Hildenbrand,
	Bobby Powers, Mel Gorman

The idea comes from the upstream discussion between Linus and Andrea:

  https://lore.kernel.org/lkml/20171102193644.GB22686@redhat.com/

A summary of the issue: there was a special path in handle_userfault()
in the past where we would return VM_FAULT_NOPAGE when we detected
non-fatal signals while waiting for the userfault to be handled.  We
did that by reacquiring the mmap_sem before returning.  However, that
brings a risk: the vmas might have changed when we retake the
mmap_sem, and we could even be holding an invalid vma structure.

This patch is a preparation for removing that special path, by
allowing the page fault to return even faster if we were interrupted
by a non-fatal signal during a user-mode page fault handling routine.
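
For reference, the special path being removed in the next patch looked
roughly like this (simplified from the handle_userfault() tail that
patch 10 deletes):

	if (return_to_userland) {
		if (signal_pending(current) &&
		    !fatal_signal_pending(current)) {
			/* The vmas may have changed while we slept! */
			down_read(&mm->mmap_sem);
			ret = VM_FAULT_NOPAGE;
		}
	}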

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/sched/signal.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 4c87ffce64d1..09d40ce6a162 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -379,7 +379,8 @@ static inline bool fault_signal_pending(unsigned int fault_flags,
 					struct pt_regs *regs)
 {
 	return unlikely((fault_flags & VM_FAULT_RETRY) &&
-			fatal_signal_pending(current));
+			(fatal_signal_pending(current) ||
+			 (user_mode(regs) && signal_pending(current))));
 }
 
 /*
-- 
2.24.1




* [PATCH RESEND v6 10/16] userfaultfd: Don't retake mmap_sem to emulate NOPAGE
  2020-02-20 15:53 [PATCH RESEND v6 00/16] mm: Page fault enhancements Peter Xu
                   ` (8 preceding siblings ...)
  2020-02-20 16:02 ` [PATCH RESEND v6 09/16] mm: Return faster for non-fatal signals in user mode faults Peter Xu
@ 2020-02-20 16:02 ` Peter Xu
  2020-02-20 16:02 ` [PATCH RESEND v6 11/16] mm: Introduce FAULT_FLAG_DEFAULT Peter Xu
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: Peter Xu @ 2020-02-20 16:02 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Peter Xu, Martin Cracauer, Mike Rapoport, Hugh Dickins,
	Jerome Glisse, Kirill A . Shutemov, Matthew Wilcox,
	Pavel Emelyanov, Brian Geffon, Maya Gokhale, Denis Plotnikov,
	Andrea Arcangeli, Johannes Weiner, Dr . David Alan Gilbert,
	Linus Torvalds, Mike Kravetz, Marty McFadden, David Hildenbrand,
	Bobby Powers, Mel Gorman

This patch removes the risky path in handle_userfault(), so we can be
sure that the callers of handle_mm_fault() will know that the VMAs
might have changed.  Meanwhile, with the previous patch, we don't lose
responsiveness either, since the core mm code can now handle non-fatal
userspace signals even if we return VM_FAULT_RETRY.

Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/userfaultfd.c | 24 ------------------------
 1 file changed, 24 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 37df7c9eedb1..888272621f38 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -524,30 +524,6 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 
 	__set_current_state(TASK_RUNNING);
 
-	if (return_to_userland) {
-		if (signal_pending(current) &&
-		    !fatal_signal_pending(current)) {
-			/*
-			 * If we got a SIGSTOP or SIGCONT and this is
-			 * a normal userland page fault, just let
-			 * userland return so the signal will be
-			 * handled and gdb debugging works.  The page
-			 * fault code immediately after we return from
-			 * this function is going to release the
-			 * mmap_sem and it's not depending on it
-			 * (unlike gup would if we were not to return
-			 * VM_FAULT_RETRY).
-			 *
-			 * If a fatal signal is pending we still take
-			 * the streamlined VM_FAULT_RETRY failure path
-			 * and there's no need to retake the mmap_sem
-			 * in such case.
-			 */
-			down_read(&mm->mmap_sem);
-			ret = VM_FAULT_NOPAGE;
-		}
-	}
-
 	/*
 	 * Here we race with the list_del; list_add in
 	 * userfaultfd_ctx_read(), however because we don't ever run
-- 
2.24.1




* [PATCH RESEND v6 11/16] mm: Introduce FAULT_FLAG_DEFAULT
  2020-02-20 15:53 [PATCH RESEND v6 00/16] mm: Page fault enhancements Peter Xu
                   ` (9 preceding siblings ...)
  2020-02-20 16:02 ` [PATCH RESEND v6 10/16] userfaultfd: Don't retake mmap_sem to emulate NOPAGE Peter Xu
@ 2020-02-20 16:02 ` Peter Xu
  2020-02-20 16:02 ` [PATCH RESEND v6 13/16] mm: Allow VM_FAULT_RETRY for multiple times Peter Xu
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: Peter Xu @ 2020-02-20 16:02 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Peter Xu, Martin Cracauer, Mike Rapoport, Hugh Dickins,
	Jerome Glisse, Kirill A . Shutemov, Matthew Wilcox,
	Pavel Emelyanov, Brian Geffon, Maya Gokhale, Denis Plotnikov,
	Andrea Arcangeli, Johannes Weiner, Dr . David Alan Gilbert,
	Linus Torvalds, Mike Kravetz, Marty McFadden, David Hildenbrand,
	Bobby Powers, Mel Gorman

Although there are tons of arch-specific page fault handlers, most of
them still share the same initial value of the page fault flags.  Say,
nearly all of the page fault handlers would allow the fault to be
retried, and they also allow the fault to respond to SIGKILL.

Let's define a default value for the fault flags to replace those
initial page fault flags that were copied over.  With this, it'll be
far easier to introduce a new fault flag that can be used by all the
architectures instead of touching all the archs.
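
The include/linux/mm.h hunk is not visible in this archive, but
judging from the per-arch conversions below (each of which replaces
FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE), the new default
presumably reads:

	#define FAULT_FLAG_DEFAULT	(FAULT_FLAG_ALLOW_RETRY | \
					 FAULT_FLAG_KILLABLE)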

Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/alpha/mm/fault.c      | 2 +-
 arch/arc/mm/fault.c        | 2 +-
 arch/arm/mm/fault.c        | 2 +-
 arch/arm64/mm/fault.c      | 2 +-
 arch/hexagon/mm/vm_fault.c | 2 +-
 arch/ia64/mm/fault.c       | 2 +-
 arch/m68k/mm/fault.c       | 2 +-
 arch/microblaze/mm/fault.c | 2 +-
 arch/mips/mm/fault.c       | 2 +-
 arch/nds32/mm/fault.c      | 2 +-
 arch/nios2/mm/fault.c      | 2 +-
 arch/openrisc/mm/fault.c   | 2 +-
 arch/parisc/mm/fault.c     | 2 +-
 arch/powerpc/mm/fault.c    | 2 +-
 arch/riscv/mm/fault.c      | 2 +-
 arch/s390/mm/fault.c       | 2 +-
 arch/sh/mm/fault.c         | 2 +-
 arch/sparc/mm/fault_32.c   | 2 +-
 arch/sparc/mm/fault_64.c   | 2 +-
 arch/um/kernel/trap.c      | 2 +-
 arch/unicore32/mm/fault.c  | 2 +-
 arch/x86/mm/fault.c        | 2 +-
 arch/xtensa/mm/fault.c     | 2 +-
 include/linux/mm.h         | 7 +++++++
 24 files changed, 30 insertions(+), 23 deletions(-)

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index aea33b599037..fcfa229cc1e7 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -89,7 +89,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
 	const struct exception_table_entry *fixup;
 	int si_code = SEGV_MAPERR;
 	vm_fault_t fault;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	/* As of EV6, a load into $31/$f31 is a prefetch, and never faults
 	   (or is suppressed by the PALcode).  Support that for older CPUs
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index 6eb821a59b49..643fad774071 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -100,7 +100,7 @@ void do_page_fault(unsigned long address, struct pt_regs *regs)
 	         (regs->ecr_cause == ECR_C_PROTV_INST_FETCH))
 		exec = 1;
 
-	flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	flags = FAULT_FLAG_DEFAULT;
 	if (user_mode(regs))
 		flags |= FAULT_FLAG_USER;
 	if (write)
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 937b81ff8649..18ef0b143ac2 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -241,7 +241,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	struct mm_struct *mm;
 	int sig, code;
 	vm_fault_t fault;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	if (kprobe_page_fault(regs, fsr))
 		return 0;
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 6f4b69d712b1..cbb29a43aa7f 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -446,7 +446,7 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
 	struct mm_struct *mm = current->mm;
 	vm_fault_t fault, major = 0;
 	unsigned long vm_flags = VM_READ | VM_WRITE | VM_EXEC;
-	unsigned int mm_flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int mm_flags = FAULT_FLAG_DEFAULT;
 
 	if (kprobe_page_fault(regs, esr))
 		return 0;
diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
index d19beaf11b4c..d9e15d941bdb 100644
--- a/arch/hexagon/mm/vm_fault.c
+++ b/arch/hexagon/mm/vm_fault.c
@@ -41,7 +41,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
 	int si_code = SEGV_MAPERR;
 	vm_fault_t fault;
 	const struct exception_table_entry *fixup;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	/*
 	 * If we're in an interrupt or have no user context,
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 211b4f439384..b5aa4e80c762 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -65,7 +65,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
 	struct mm_struct *mm = current->mm;
 	unsigned long mask;
 	vm_fault_t fault;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	mask = ((((isr >> IA64_ISR_X_BIT) & 1UL) << VM_EXEC_BIT)
 		| (((isr >> IA64_ISR_W_BIT) & 1UL) << VM_WRITE_BIT));
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index a455e202691b..182799fd9987 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -71,7 +71,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct * vma;
 	vm_fault_t fault;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	pr_debug("do page fault:\nregs->sr=%#x, regs->pc=%#lx, address=%#lx, %ld, %p\n",
 		regs->sr, regs->pc, address, error_code, mm ? mm->pgd : NULL);
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index cdde01dcdfc3..32da02778a63 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -91,7 +91,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
 	int code = SEGV_MAPERR;
 	int is_write = error_code & ESR_S;
 	vm_fault_t fault;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	regs->ear = address;
 	regs->esr = error_code;
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 0ee6fafc57bc..f177da67d940 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -44,7 +44,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
 	const int field = sizeof(unsigned long) * 2;
 	int si_code;
 	vm_fault_t fault;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	static DEFINE_RATELIMIT_STATE(ratelimit_state, 5 * HZ, 10);
 
diff --git a/arch/nds32/mm/fault.c b/arch/nds32/mm/fault.c
index 0e63f81eff5b..2810a4e5ab27 100644
--- a/arch/nds32/mm/fault.c
+++ b/arch/nds32/mm/fault.c
@@ -80,7 +80,7 @@ void do_page_fault(unsigned long entry, unsigned long addr,
 	int si_code;
 	vm_fault_t fault;
 	unsigned int mask = VM_READ | VM_WRITE | VM_EXEC;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	error_code = error_code & (ITYPE_mskINST | ITYPE_mskETYPE);
 	tsk = current;
diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c
index 704ace8ca0ee..c38bea4220fb 100644
--- a/arch/nios2/mm/fault.c
+++ b/arch/nios2/mm/fault.c
@@ -47,7 +47,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause,
 	struct mm_struct *mm = tsk->mm;
 	int code = SEGV_MAPERR;
 	vm_fault_t fault;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	cause >>= 2;
 
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index 85c7eb0c0186..30d5c51e9d40 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -50,7 +50,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
 	struct vm_area_struct *vma;
 	int si_code;
 	vm_fault_t fault;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	tsk = current;
 
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index f9be1d1cb43f..8e88e5c5f26a 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -274,7 +274,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
 	if (!mm)
 		goto no_context;
 
-	flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	flags = FAULT_FLAG_DEFAULT;
 	if (user_mode(regs))
 		flags |= FAULT_FLAG_USER;
 
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 0868172ce4e3..d7e1f8dc7e4c 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -434,7 +434,7 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
 {
 	struct vm_area_struct * vma;
 	struct mm_struct *mm = current->mm;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_DEFAULT;
  	int is_exec = TRAP(regs) == 0x400;
 	int is_user = user_mode(regs);
 	int is_write = page_fault_is_write(error_code);
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index 1d3869e9ddef..a252d9e38561 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -30,7 +30,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
 	struct vm_area_struct *vma;
 	struct mm_struct *mm;
 	unsigned long addr, cause;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_DEFAULT;
 	int code = SEGV_MAPERR;
 	vm_fault_t fault;
 
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index 179cf92a56e5..551ac311bd35 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -429,7 +429,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 
 	address = trans_exc_code & __FAIL_ADDR_MASK;
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
-	flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	flags = FAULT_FLAG_DEFAULT;
 	if (user_mode(regs))
 		flags |= FAULT_FLAG_USER;
 	if (access == VM_WRITE || (trans_exc_code & store_indication) == 0x400)
diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index eb4048ad0b38..d9c8f2d00a54 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -380,7 +380,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
 	struct mm_struct *mm;
 	struct vm_area_struct * vma;
 	vm_fault_t fault;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	tsk = current;
 	mm = tsk->mm;
diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index 6efbeb227644..a91b0c2d84f8 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -168,7 +168,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 	int from_user = !(regs->psr & PSR_PS);
 	int code;
 	vm_fault_t fault;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	if (text_fault)
 		address = regs->pc;
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index dd1ed6555831..30653418a672 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -271,7 +271,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 	int si_code, fault_code;
 	vm_fault_t fault;
 	unsigned long address, mm_rss;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	fault_code = get_thread_fault_code();
 
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index 818553064f04..c59ad37eacda 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -33,7 +33,7 @@ int handle_page_fault(unsigned long address, unsigned long ip,
 	pmd_t *pmd;
 	pte_t *pte;
 	int err = -EFAULT;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	*code_out = SEGV_MAPERR;
 
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index 59d0e6ec2cfc..34a90453ca18 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -202,7 +202,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	struct mm_struct *mm;
 	int sig, code;
 	vm_fault_t fault;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	tsk = current;
 	mm = tsk->mm;
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 6a00bc8d047f..7b6f65333355 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1288,7 +1288,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	struct task_struct *tsk;
 	struct mm_struct *mm;
 	vm_fault_t fault, major = 0;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	tsk = current;
 	mm = tsk->mm;
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index 59515905d4ad..7d196dc951e8 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -43,7 +43,7 @@ void do_page_fault(struct pt_regs *regs)
 
 	int is_write, is_exec;
 	vm_fault_t fault;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	code = SEGV_MAPERR;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 52269e56c514..08ca35d3c341 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -388,6 +388,13 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_REMOTE	0x80	/* faulting for non current tsk/mm */
 #define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
 
+/*
+ * The default fault flags that should be used by most of the
+ * arch-specific page fault handlers.
+ */
+#define FAULT_FLAG_DEFAULT  (FAULT_FLAG_ALLOW_RETRY | \
+			     FAULT_FLAG_KILLABLE)
+
 #define FAULT_FLAG_TRACE \
 	{ FAULT_FLAG_WRITE,		"WRITE" }, \
 	{ FAULT_FLAG_MKWRITE,		"MKWRITE" }, \
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH RESEND v6 13/16] mm: Allow VM_FAULT_RETRY for multiple times
  2020-02-20 15:53 [PATCH RESEND v6 00/16] mm: Page fault enhancements Peter Xu
                   ` (10 preceding siblings ...)
  2020-02-20 16:02 ` [PATCH RESEND v6 11/16] mm: Introduce FAULT_FLAG_DEFAULT Peter Xu
@ 2020-02-20 16:02 ` Peter Xu
  2020-02-20 16:02 ` [PATCH RESEND v6 15/16] mm/gup: Allow to react to fatal signals Peter Xu
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: Peter Xu @ 2020-02-20 16:02 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Peter Xu, Martin Cracauer, Mike Rapoport, Hugh Dickins,
	Jerome Glisse, Kirill A . Shutemov, Matthew Wilcox,
	Pavel Emelyanov, Brian Geffon, Maya Gokhale, Denis Plotnikov,
	Andrea Arcangeli, Johannes Weiner, Dr . David Alan Gilbert,
	Linus Torvalds, Mike Kravetz, Marty McFadden, David Hildenbrand,
	Bobby Powers, Mel Gorman

The idea comes from a discussion between Linus and Andrea [1].

Before this patch we only allowed a page fault to retry once.  We
achieved this by clearing the FAULT_FLAG_ALLOW_RETRY flag when calling
handle_mm_fault() the second time.  This was mainly meant to avoid
unexpected starvation of the system by looping forever on the page
fault of a single page.  However that should hardly happen, because
every code path that returns VM_FAULT_RETRY first waits for a
condition to change (during which time we should yield the cpu)
before VM_FAULT_RETRY is actually returned.

This patch removes the restriction by keeping the
FAULT_FLAG_ALLOW_RETRY flag set when we receive VM_FAULT_RETRY.  It
means that the page fault handler can now retry the page fault
multiple times if necessary, without the need to generate another
page fault event.  Meanwhile we still keep the FAULT_FLAG_TRIED flag,
so the page fault handler can still identify whether a page fault is
the first attempt or not.
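
After this change, a typical arch fault handler retry path looks
roughly like below (a minimal sketch, not any specific arch's code;
see the per-arch hunks for the real changes):

retry:
	down_read(&mm->mmap_sem);
	...
	fault = handle_mm_fault(vma, address, flags);

	if (fault & VM_FAULT_RETRY) {
		/*
		 * mmap_sem was already released for us; keep
		 * FAULT_FLAG_ALLOW_RETRY and only mark that we have
		 * tried at least once.
		 */
		flags |= FAULT_FLAG_TRIED;
		goto retry;
	}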

Then we'll have these combinations of fault flags (considering only
the ALLOW_RETRY and TRIED flags):

  - ALLOW_RETRY and !TRIED:  this means the page fault allows retry,
                             and this is the first try

  - ALLOW_RETRY and TRIED:   this means the page fault allows retry,
                             and this is not the first try

  - !ALLOW_RETRY and !TRIED: this means the page fault does not allow
                             retry at all

  - !ALLOW_RETRY and TRIED:  this is forbidden and should never be used

In the existing code we have multiple places that take special care
of the first condition above by checking against (fault_flags &
FAULT_FLAG_ALLOW_RETRY).  Since now even the 2nd try will have
ALLOW_RETRY set, this patch introduces a simple helper to detect the
first attempt of a page fault by checking both (fault_flags &
FAULT_FLAG_ALLOW_RETRY) and !(fault_flags & FAULT_FLAG_TRIED), and
uses that helper in all the existing special paths.  One example is
__lock_page_or_retry(): now we'll drop the mmap_sem only on the first
attempt of the page fault and keep it across follow-up retries, so
the old locking behavior is retained.
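
For instance, the __lock_page_or_retry() logic becomes roughly the
following (a simplified sketch of the mm/filemap.c change; the
killable and NOWAIT variants are omitted):

	if (fault_flag_allow_retry_first(flags)) {
		/*
		 * First attempt: release mmap_sem, wait for the page,
		 * and let the fault return VM_FAULT_RETRY.
		 */
		up_read(&mm->mmap_sem);
		wait_on_page_locked(page);
		return 0;
	}
	/* Follow-up retries: keep mmap_sem and lock the page. */
	__lock_page(page);
	return 1;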

This will be a nice enhancement for the current code [2], and at the
same time supporting material for the future userfaultfd-writeprotect
work, since in that work there will always be an explicit userfault
writeprotect retry for protected pages, and if that cannot resolve the
page fault (e.g., when userfaultfd-writeprotect is used in conjunction
with swapped pages) then we'll possibly need a 3rd retry of the page
fault.  It might also benefit other potential users who have similar
requirements to userfault write-protection.

The GUP code is not touched yet; it will be covered in a follow-up
patch.

Please read the thread below for more information.

[1] https://lore.kernel.org/lkml/20171102193644.GB22686@redhat.com/
[2] https://lore.kernel.org/lkml/20181230154648.GB9832@redhat.com/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/alpha/mm/fault.c           |  2 +-
 arch/arc/mm/fault.c             |  1 -
 arch/arm/mm/fault.c             |  3 ---
 arch/arm64/mm/fault.c           |  5 -----
 arch/hexagon/mm/vm_fault.c      |  1 -
 arch/ia64/mm/fault.c            |  1 -
 arch/m68k/mm/fault.c            |  3 ---
 arch/microblaze/mm/fault.c      |  1 -
 arch/mips/mm/fault.c            |  1 -
 arch/nds32/mm/fault.c           |  1 -
 arch/nios2/mm/fault.c           |  3 ---
 arch/openrisc/mm/fault.c        |  1 -
 arch/parisc/mm/fault.c          |  4 +---
 arch/powerpc/mm/fault.c         |  6 ------
 arch/riscv/mm/fault.c           |  5 -----
 arch/s390/mm/fault.c            |  5 +----
 arch/sh/mm/fault.c              |  1 -
 arch/sparc/mm/fault_32.c        |  1 -
 arch/sparc/mm/fault_64.c        |  1 -
 arch/um/kernel/trap.c           |  1 -
 arch/unicore32/mm/fault.c       |  4 +---
 arch/x86/mm/fault.c             |  2 --
 arch/xtensa/mm/fault.c          |  1 -
 drivers/gpu/drm/ttm/ttm_bo_vm.c | 12 ++++++++---
 include/linux/mm.h              | 37 +++++++++++++++++++++++++++++++++
 mm/filemap.c                    |  2 +-
 mm/internal.h                   |  6 +++---
 27 files changed, 54 insertions(+), 57 deletions(-)

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index fcfa229cc1e7..c2d7b6d7bac7 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -169,7 +169,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
+			flags |= FAULT_FLAG_TRIED;
 
 			 /* No need to up_read(&mm->mmap_sem) as we would
 			 * have already released it in __lock_page_or_retry
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index 643fad774071..92b339c7adba 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -145,7 +145,6 @@ void do_page_fault(unsigned long address, struct pt_regs *regs)
 	 */
 	if (unlikely((fault & VM_FAULT_RETRY) &&
 		     (flags & FAULT_FLAG_ALLOW_RETRY))) {
-		flags &= ~FAULT_FLAG_ALLOW_RETRY;
 		flags |= FAULT_FLAG_TRIED;
 		goto retry;
 	}
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 18ef0b143ac2..b598e6978b29 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -319,9 +319,6 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 					regs, addr);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			* of starvation. */
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 			goto retry;
 		}
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index cbb29a43aa7f..1027851d469a 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -521,12 +521,7 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
 	}
 
 	if (fault & VM_FAULT_RETRY) {
-		/*
-		 * Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk of
-		 * starvation.
-		 */
 		if (mm_flags & FAULT_FLAG_ALLOW_RETRY) {
-			mm_flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			mm_flags |= FAULT_FLAG_TRIED;
 			goto retry;
 		}
diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
index d9e15d941bdb..72334b26317a 100644
--- a/arch/hexagon/mm/vm_fault.c
+++ b/arch/hexagon/mm/vm_fault.c
@@ -102,7 +102,6 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
 			else
 				current->min_flt++;
 			if (fault & VM_FAULT_RETRY) {
-				flags &= ~FAULT_FLAG_ALLOW_RETRY;
 				flags |= FAULT_FLAG_TRIED;
 				goto retry;
 			}
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index b5aa4e80c762..30d0c1fca99e 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -167,7 +167,6 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			 /* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index 182799fd9987..f7afb9897966 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -162,9 +162,6 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			 * of starvation. */
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index 32da02778a63..3248141f8ed5 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -236,7 +236,6 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index f177da67d940..fd64b135fd7b 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -178,7 +178,6 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
 			tsk->min_flt++;
 		}
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/nds32/mm/fault.c b/arch/nds32/mm/fault.c
index 2810a4e5ab27..0cf0c08c7da2 100644
--- a/arch/nds32/mm/fault.c
+++ b/arch/nds32/mm/fault.c
@@ -246,7 +246,6 @@ void do_page_fault(unsigned long entry, unsigned long addr,
 				      1, regs, addr);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c
index c38bea4220fb..ec9d8a9c426f 100644
--- a/arch/nios2/mm/fault.c
+++ b/arch/nios2/mm/fault.c
@@ -157,9 +157,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause,
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			 * of starvation. */
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index 30d5c51e9d40..8af1cc78c4fb 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -181,7 +181,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
 		else
 			tsk->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			 /* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index 8e88e5c5f26a..86e8c848f3d7 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -328,14 +328,12 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
-
 			/*
 			 * No need to up_read(&mm->mmap_sem) as we would
 			 * have already released it in __lock_page_or_retry
 			 * in mm/filemap.c.
 			 */
-
+			flags |= FAULT_FLAG_TRIED;
 			goto retry;
 		}
 	}
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index d7e1f8dc7e4c..d15f0f0ee806 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -590,13 +590,7 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
 	 * case.
 	 */
 	if (unlikely(fault & VM_FAULT_RETRY)) {
-		/* We retry only once */
 		if (flags & FAULT_FLAG_ALLOW_RETRY) {
-			/*
-			 * Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			 * of starvation.
-			 */
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 			goto retry;
 		}
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index a252d9e38561..be84e32adc4c 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -144,11 +144,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
 				      1, regs, addr);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			/*
-			 * Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			 * of starvation.
-			 */
-			flags &= ~(FAULT_FLAG_ALLOW_RETRY);
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index 551ac311bd35..aeccdb30899a 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -513,10 +513,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 				fault = VM_FAULT_PFAULT;
 				goto out_up;
 			}
-			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			 * of starvation. */
-			flags &= ~(FAULT_FLAG_ALLOW_RETRY |
-				   FAULT_FLAG_RETRY_NOWAIT);
+			flags &= ~FAULT_FLAG_RETRY_NOWAIT;
 			flags |= FAULT_FLAG_TRIED;
 			down_read(&mm->mmap_sem);
 			goto retry;
diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index d9c8f2d00a54..13ee4d20e622 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -481,7 +481,6 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
 				      regs, address);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/*
diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index a91b0c2d84f8..f6e0e601f857 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -261,7 +261,6 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 				      1, regs, address);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index 30653418a672..c0c0dd471b6b 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -449,7 +449,6 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 				      1, regs, address);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			/* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index c59ad37eacda..8f18cf56b3dd 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -97,7 +97,6 @@ int handle_page_fault(unsigned long address, unsigned long ip,
 			else
 				current->min_flt++;
 			if (fault & VM_FAULT_RETRY) {
-				flags &= ~FAULT_FLAG_ALLOW_RETRY;
 				flags |= FAULT_FLAG_TRIED;
 
 				goto retry;
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index 34a90453ca18..a9bd08fbe588 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -259,9 +259,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 		else
 			tsk->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			* of starvation. */
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
+			flags |= FAULT_FLAG_TRIED;
 			goto retry;
 		}
 	}
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 7b6f65333355..4ce647bbe546 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1457,8 +1457,6 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 */
 	if (unlikely((fault & VM_FAULT_RETRY) &&
 		     (flags & FAULT_FLAG_ALLOW_RETRY))) {
-		/* Retry at most once */
-		flags &= ~FAULT_FLAG_ALLOW_RETRY;
 		flags |= FAULT_FLAG_TRIED;
 		goto retry;
 	}
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index 7d196dc951e8..e7172bd53ced 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -128,7 +128,6 @@ void do_page_fault(struct pt_regs *regs)
 		else
 			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 
 			 /* No need to up_read(&mm->mmap_sem) as we would
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 389128b8c4dd..cb8829ca6c7f 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -59,9 +59,10 @@ static vm_fault_t ttm_bo_vm_fault_idle(struct ttm_buffer_object *bo,
 
 	/*
 	 * If possible, avoid waiting for GPU with mmap_sem
-	 * held.
+	 * held.  We only do this if the fault allows retry and this
+	 * is the first attempt.
 	 */
-	if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
+	if (fault_flag_allow_retry_first(vmf->flags)) {
 		ret = VM_FAULT_RETRY;
 		if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
 			goto out_unlock;
@@ -135,7 +136,12 @@ vm_fault_t ttm_bo_vm_reserve(struct ttm_buffer_object *bo,
 	 * for the buffer to become unreserved.
 	 */
 	if (unlikely(!dma_resv_trylock(bo->base.resv))) {
-		if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
+		/*
+		 * If the fault allows retry and this is the first
+		 * fault attempt, we try to release the mmap_sem
+		 * before waiting
+		 */
+		if (fault_flag_allow_retry_first(vmf->flags)) {
 			if (!(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
 				ttm_bo_get(bo);
 				up_read(&vmf->vma->vm_mm->mmap_sem);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ff653f9136dd..51a886d50758 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -391,6 +391,25 @@ extern pgprot_t protection_map[16];
  * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
  * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
  * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ *
+ * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
+ * whether we would allow page faults to retry by specifying these two
+ * fault flags correctly.  Currently there can be three legal combinations:
+ *
+ * (a) ALLOW_RETRY and !TRIED:  this means the page fault allows retry, and
+ *                              this is the first try
+ *
+ * (b) ALLOW_RETRY and TRIED:   this means the page fault allows retry, and
+ *                              we've already tried at least once
+ *
+ * (c) !ALLOW_RETRY and !TRIED: this means the page fault does not allow retry
+ *
+ * The unlisted combination (!ALLOW_RETRY && TRIED) is illegal and should never
+ * be used.  Note that page faults can be allowed to retry for multiple times,
+ * in which case we'll have an initial fault with flags (a) then later on
+ * continuous faults with flags (b).  We should always try to detect pending
+ * signals before a retry to make sure the continuous page faults can still be
+ * interrupted if necessary.
  */
 #define FAULT_FLAG_WRITE			0x01
 #define FAULT_FLAG_MKWRITE			0x02
@@ -411,6 +430,24 @@ extern pgprot_t protection_map[16];
 			     FAULT_FLAG_KILLABLE | \
 			     FAULT_FLAG_INTERRUPTIBLE)
 
+/**
+ * fault_flag_allow_retry_first - check ALLOW_RETRY the first time
+ *
+ * This is mostly used for places where we want to try to avoid taking
+ * the mmap_sem for too long a time when waiting for another condition
+ * to change, in which case we can try to be polite to release the
+ * mmap_sem in the first round to avoid potential starvation of other
+ * processes that would also want the mmap_sem.
+ *
+ * Return: true if the page fault allows retry and this is the first
+ * attempt of the fault handling; false otherwise.
+ */
+static inline bool fault_flag_allow_retry_first(unsigned int flags)
+{
+	return (flags & FAULT_FLAG_ALLOW_RETRY) &&
+	    (!(flags & FAULT_FLAG_TRIED));
+}
+
 #define FAULT_FLAG_TRACE \
 	{ FAULT_FLAG_WRITE,		"WRITE" }, \
 	{ FAULT_FLAG_MKWRITE,		"MKWRITE" }, \
diff --git a/mm/filemap.c b/mm/filemap.c
index 1784478270e1..590ec3a9f5da 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1386,7 +1386,7 @@ EXPORT_SYMBOL_GPL(__lock_page_killable);
 int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 			 unsigned int flags)
 {
-	if (flags & FAULT_FLAG_ALLOW_RETRY) {
+	if (fault_flag_allow_retry_first(flags)) {
 		/*
 		 * CAUTION! In this case, mmap_sem is not released
 		 * even though return 0.
diff --git a/mm/internal.h b/mm/internal.h
index 3cf20ab3ca01..5958cfe50a0c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -377,10 +377,10 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
 	/*
 	 * FAULT_FLAG_RETRY_NOWAIT means we don't want to wait on page locks or
 	 * anything, so we only pin the file and drop the mmap_sem if only
-	 * FAULT_FLAG_ALLOW_RETRY is set.
+	 * FAULT_FLAG_ALLOW_RETRY is set, while this is the first attempt.
 	 */
-	if ((flags & (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT)) ==
-	    FAULT_FLAG_ALLOW_RETRY) {
+	if (fault_flag_allow_retry_first(flags) &&
+	    !(flags & FAULT_FLAG_RETRY_NOWAIT)) {
 		fpin = get_file(vmf->vma->vm_file);
 		up_read(&vmf->vma->vm_mm->mmap_sem);
 	}
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH RESEND v6 15/16] mm/gup: Allow to react to fatal signals
  2020-02-20 15:53 [PATCH RESEND v6 00/16] mm: Page fault enhancements Peter Xu
                   ` (11 preceding siblings ...)
  2020-02-20 16:02 ` [PATCH RESEND v6 13/16] mm: Allow VM_FAULT_RETRY for multiple times Peter Xu
@ 2020-02-20 16:02 ` Peter Xu
  2020-02-20 16:03 ` [PATCH RESEND v6 16/16] mm/userfaultfd: Honor FAULT_FLAG_KILLABLE in fault path Peter Xu
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: Peter Xu @ 2020-02-20 16:02 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Peter Xu, Martin Cracauer, Mike Rapoport, Hugh Dickins,
	Jerome Glisse, Kirill A . Shutemov, Matthew Wilcox,
	Pavel Emelyanov, Brian Geffon, Maya Gokhale, Denis Plotnikov,
	Andrea Arcangeli, Johannes Weiner, Dr . David Alan Gilbert,
	Linus Torvalds, Mike Kravetz, Marty McFadden, David Hildenbrand,
	Bobby Powers, Mel Gorman

The existing gup code does not react to fatal signals in many code
paths.  For example, in one retry path of gup we're still using
down_read() rather than down_read_killable().  Also, when doing page
faults we don't pass in FAULT_FLAG_KILLABLE either, which means that
within the faulting process we'll wait in a non-killable way.  These
were spotted by Linus during the code review of some other patches.

Let's allow the gup code to react to fatal signals, to improve the
responsiveness of threads that get killed during gup.
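
Concretely, the retry path can take the mmap_sem in a killable way
(a sketch mirroring the mm/gup.c hunk below):

	ret = down_read_killable(&mm->mmap_sem);
	if (ret) {
		/* ret is -EINTR: a fatal signal is pending, bail out */
		if (!pages_done)
			pages_done = ret;
		break;
	}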

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/gup.c     | 12 +++++++++---
 mm/hugetlb.c |  3 ++-
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index ec2b76f44a01..3f0cb14334ac 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -648,7 +648,7 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 	if (*flags & FOLL_REMOTE)
 		fault_flags |= FAULT_FLAG_REMOTE;
 	if (locked)
-		fault_flags |= FAULT_FLAG_ALLOW_RETRY;
+		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 	if (*flags & FOLL_NOWAIT)
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
 	if (*flags & FOLL_TRIED) {
@@ -991,7 +991,7 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 	address = untagged_addr(address);
 
 	if (unlocked)
-		fault_flags |= FAULT_FLAG_ALLOW_RETRY;
+		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
 retry:
 	vma = find_extend_vma(mm, address);
@@ -1113,7 +1113,13 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
 			break;
 
 		*locked = 1;
-		down_read(&mm->mmap_sem);
+		ret = down_read_killable(&mm->mmap_sem);
+		if (ret) {
+			BUG_ON(ret > 0);
+			if (!pages_done)
+				pages_done = ret;
+			break;
+		}
 
 		ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED,
 				       pages, NULL, locked);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ac9a28d51674..c342b091a7a4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4338,7 +4338,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			if (flags & FOLL_WRITE)
 				fault_flags |= FAULT_FLAG_WRITE;
 			if (locked)
-				fault_flags |= FAULT_FLAG_ALLOW_RETRY;
+				fault_flags |= FAULT_FLAG_ALLOW_RETRY |
+					FAULT_FLAG_KILLABLE;
 			if (flags & FOLL_NOWAIT)
 				fault_flags |= FAULT_FLAG_ALLOW_RETRY |
 					FAULT_FLAG_RETRY_NOWAIT;
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH RESEND v6 16/16] mm/userfaultfd: Honor FAULT_FLAG_KILLABLE in fault path
  2020-02-20 15:53 [PATCH RESEND v6 00/16] mm: Page fault enhancements Peter Xu
                   ` (12 preceding siblings ...)
  2020-02-20 16:02 ` [PATCH RESEND v6 15/16] mm/gup: Allow to react to fatal signals Peter Xu
@ 2020-02-20 16:03 ` Peter Xu
  2020-02-20 19:53 ` [PATCH RESEND v6 12/16] mm: Introduce FAULT_FLAG_INTERRUPTIBLE Peter Xu
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: Peter Xu @ 2020-02-20 16:03 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Peter Xu, Martin Cracauer, Mike Rapoport, Hugh Dickins,
	Jerome Glisse, Kirill A . Shutemov, Matthew Wilcox,
	Pavel Emelyanov, Brian Geffon, Maya Gokhale, Denis Plotnikov,
	Andrea Arcangeli, Johannes Weiner, Dr . David Alan Gilbert,
	Linus Torvalds, Mike Kravetz, Marty McFadden, David Hildenbrand,
	Bobby Powers, Mel Gorman

The userfaultfd fault path was by default killable even if the caller
did not pass FAULT_FLAG_KILLABLE.  That made sense before, because
the gup code did not have FAULT_FLAG_KILLABLE properly set.  Now that
the previous patch applies FAULT_FLAG_KILLABLE to the gup code as
well, it makes sense to let userfaultfd honor FAULT_FLAG_KILLABLE
too.

Because we're unconditionally setting FAULT_FLAG_KILLABLE in the gup
code right now, this patch should have no functional change.  It also
cleans up the code a little bit by introducing some helpers.
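
The resulting mapping from fault flags to the blocking behavior (as
implemented by the new helpers in the hunk below) is:

	FAULT_FLAG_INTERRUPTIBLE -> TASK_INTERRUPTIBLE,
	                            woken by signal_pending()
	FAULT_FLAG_KILLABLE      -> TASK_KILLABLE,
	                            woken by fatal_signal_pending()
	(neither)                -> TASK_UNINTERRUPTIBLE,
	                            not woken by signals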

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/userfaultfd.c | 36 ++++++++++++++++++++++++++++--------
 1 file changed, 28 insertions(+), 8 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index c076d3295958..703c1c3faa6e 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -334,6 +334,30 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 	return ret;
 }
 
+/* Should pair with userfaultfd_signal_pending() */
+static inline long userfaultfd_get_blocking_state(unsigned int flags)
+{
+	if (flags & FAULT_FLAG_INTERRUPTIBLE)
+		return TASK_INTERRUPTIBLE;
+
+	if (flags & FAULT_FLAG_KILLABLE)
+		return TASK_KILLABLE;
+
+	return TASK_UNINTERRUPTIBLE;
+}
+
+/* Should pair with userfaultfd_get_blocking_state() */
+static inline bool userfaultfd_signal_pending(unsigned int flags)
+{
+	if (flags & FAULT_FLAG_INTERRUPTIBLE)
+		return signal_pending(current);
+
+	if (flags & FAULT_FLAG_KILLABLE)
+		return fatal_signal_pending(current);
+
+	return false;
+}
+
 /*
  * The locking rules involved in returning VM_FAULT_RETRY depending on
  * FAULT_FLAG_ALLOW_RETRY, FAULT_FLAG_RETRY_NOWAIT and
@@ -355,7 +379,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	struct userfaultfd_ctx *ctx;
 	struct userfaultfd_wait_queue uwq;
 	vm_fault_t ret = VM_FAULT_SIGBUS;
-	bool must_wait, return_to_userland;
+	bool must_wait;
 	long blocking_state;
 
 	/*
@@ -462,9 +486,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	uwq.ctx = ctx;
 	uwq.waken = false;
 
-	return_to_userland = vmf->flags & FAULT_FLAG_INTERRUPTIBLE;
-	blocking_state = return_to_userland ? TASK_INTERRUPTIBLE :
-			 TASK_KILLABLE;
+	blocking_state = userfaultfd_get_blocking_state(vmf->flags);
 
 	spin_lock_irq(&ctx->fault_pending_wqh.lock);
 	/*
@@ -490,8 +512,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	up_read(&mm->mmap_sem);
 
 	if (likely(must_wait && !READ_ONCE(ctx->released) &&
-		   (return_to_userland ? !signal_pending(current) :
-		    !fatal_signal_pending(current)))) {
+		   !userfaultfd_signal_pending(vmf->flags))) {
 		wake_up_poll(&ctx->fd_wqh, EPOLLIN);
 		schedule();
 		ret |= VM_FAULT_MAJOR;
@@ -513,8 +534,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 			set_current_state(blocking_state);
 			if (READ_ONCE(uwq.waken) ||
 			    READ_ONCE(ctx->released) ||
-			    (return_to_userland ? signal_pending(current) :
-			     fatal_signal_pending(current)))
+			    userfaultfd_signal_pending(vmf->flags))
 				break;
 			schedule();
 		}
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH RESEND v6 12/16] mm: Introduce FAULT_FLAG_INTERRUPTIBLE
  2020-02-20 15:53 [PATCH RESEND v6 00/16] mm: Page fault enhancements Peter Xu
                   ` (13 preceding siblings ...)
  2020-02-20 16:03 ` [PATCH RESEND v6 16/16] mm/userfaultfd: Honor FAULT_FLAG_KILLABLE in fault path Peter Xu
@ 2020-02-20 19:53 ` Peter Xu
  2020-02-20 19:53 ` [PATCH RESEND v6 14/16] mm/gup: Allow VM_FAULT_RETRY for multiple times Peter Xu
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: Peter Xu @ 2020-02-20 19:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Peter Xu, Martin Cracauer, Mike Rapoport, Hugh Dickins,
	Jerome Glisse, Kirill A . Shutemov, Matthew Wilcox,
	Pavel Emelyanov, Brian Geffon, Maya Gokhale, Denis Plotnikov,
	Andrea Arcangeli, Johannes Weiner, Dr . David Alan Gilbert,
	Linus Torvalds, Mike Kravetz, Marty McFadden, David Hildenbrand,
	Bobby Powers, Mel Gorman

handle_userfault() is currently the only place in the kernel page
fault procedures that can respond to non-fatal userspace signals.  It
was trying to detect such an allowance by checking against the USER &
KILLABLE flags, which was "un-official".

In this patch, we introduce a new flag (FAULT_FLAG_INTERRUPTIBLE) to
show that the fault handler allows the fault procedure to respond
even to non-fatal signals.  Meanwhile, add this new flag to the
default fault flags so that all the page fault handlers can benefit
from it.  With that, replace the userfaultfd check with the new flag.

Since the lines are getting even longer, clean up the fault flag
definitions a bit too, to make them easier to read on a terminal.

Although we've introduced the new flag and applied it, there should
be no functional change with this patch so far.
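
Since the arch handlers initialize their flags with
FAULT_FLAG_DEFAULT (introduced in the previous patch), they all pick
the new flag up automatically (sketch of a typical handler):

	unsigned int flags = FAULT_FLAG_DEFAULT;
	/* now: ALLOW_RETRY | KILLABLE | INTERRUPTIBLE */

	if (user_mode(regs))
		flags |= FAULT_FLAG_USER;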

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/userfaultfd.c   |  4 +---
 include/linux/mm.h | 39 ++++++++++++++++++++++++++++-----------
 2 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 888272621f38..c076d3295958 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -462,9 +462,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	uwq.ctx = ctx;
 	uwq.waken = false;
 
-	return_to_userland =
-		(vmf->flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
-		(FAULT_FLAG_USER|FAULT_FLAG_KILLABLE);
+	return_to_userland = vmf->flags & FAULT_FLAG_INTERRUPTIBLE;
 	blocking_state = return_to_userland ? TASK_INTERRUPTIBLE :
 			 TASK_KILLABLE;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 08ca35d3c341..ff653f9136dd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -378,22 +378,38 @@ extern unsigned int kobjsize(const void *objp);
  */
 extern pgprot_t protection_map[16];
 
-#define FAULT_FLAG_WRITE	0x01	/* Fault was a write access */
-#define FAULT_FLAG_MKWRITE	0x02	/* Fault was mkwrite of existing pte */
-#define FAULT_FLAG_ALLOW_RETRY	0x04	/* Retry fault if blocking */
-#define FAULT_FLAG_RETRY_NOWAIT	0x08	/* Don't drop mmap_sem and wait when retrying */
-#define FAULT_FLAG_KILLABLE	0x10	/* The fault task is in SIGKILL killable region */
-#define FAULT_FLAG_TRIED	0x20	/* Second try */
-#define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
-#define FAULT_FLAG_REMOTE	0x80	/* faulting for non current tsk/mm */
-#define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
+/**
+ * Fault flag definitions.
+ *
+ * @FAULT_FLAG_WRITE: Fault was a write fault.
+ * @FAULT_FLAG_MKWRITE: Fault was mkwrite of existing PTE.
+ * @FAULT_FLAG_ALLOW_RETRY: Allow to retry the fault if blocked.
+ * @FAULT_FLAG_RETRY_NOWAIT: Don't drop mmap_sem and wait when retrying.
+ * @FAULT_FLAG_KILLABLE: The fault task is in SIGKILL killable region.
+ * @FAULT_FLAG_TRIED: The fault has been tried once.
+ * @FAULT_FLAG_USER: The fault originated in userspace.
+ * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
+ * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
+ * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ */
+#define FAULT_FLAG_WRITE			0x01
+#define FAULT_FLAG_MKWRITE			0x02
+#define FAULT_FLAG_ALLOW_RETRY			0x04
+#define FAULT_FLAG_RETRY_NOWAIT			0x08
+#define FAULT_FLAG_KILLABLE			0x10
+#define FAULT_FLAG_TRIED			0x20
+#define FAULT_FLAG_USER				0x40
+#define FAULT_FLAG_REMOTE			0x80
+#define FAULT_FLAG_INSTRUCTION  		0x100
+#define FAULT_FLAG_INTERRUPTIBLE		0x200
 
 /*
  * The default fault flags that should be used by most of the
  * arch-specific page fault handlers.
  */
 #define FAULT_FLAG_DEFAULT  (FAULT_FLAG_ALLOW_RETRY | \
-			     FAULT_FLAG_KILLABLE)
+			     FAULT_FLAG_KILLABLE | \
+			     FAULT_FLAG_INTERRUPTIBLE)
 
 #define FAULT_FLAG_TRACE \
 	{ FAULT_FLAG_WRITE,		"WRITE" }, \
@@ -404,7 +420,8 @@ extern pgprot_t protection_map[16];
 	{ FAULT_FLAG_TRIED,		"TRIED" }, \
 	{ FAULT_FLAG_USER,		"USER" }, \
 	{ FAULT_FLAG_REMOTE,		"REMOTE" }, \
-	{ FAULT_FLAG_INSTRUCTION,	"INSTRUCTION" }
+	{ FAULT_FLAG_INSTRUCTION,	"INSTRUCTION" }, \
+	{ FAULT_FLAG_INTERRUPTIBLE,	"INTERRUPTIBLE" }
 
 /*
  * vm_fault is filled by the the pagefault handler and passed to the vma's
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH RESEND v6 14/16] mm/gup: Allow VM_FAULT_RETRY for multiple times
  2020-02-20 15:53 [PATCH RESEND v6 00/16] mm: Page fault enhancements Peter Xu
                   ` (14 preceding siblings ...)
  2020-02-20 19:53 ` [PATCH RESEND v6 12/16] mm: Introduce FAULT_FLAG_INTERRUPTIBLE Peter Xu
@ 2020-02-20 19:53 ` Peter Xu
  2020-02-21 19:26 ` [PATCH RESEND v6 00/16] mm: Page fault enhancements Brian Geffon
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 31+ messages in thread
From: Peter Xu @ 2020-02-20 19:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Peter Xu, Martin Cracauer, Mike Rapoport, Hugh Dickins,
	Jerome Glisse, Kirill A . Shutemov, Matthew Wilcox,
	Pavel Emelyanov, Brian Geffon, Maya Gokhale, Denis Plotnikov,
	Andrea Arcangeli, Johannes Weiner, Dr . David Alan Gilbert,
	Linus Torvalds, Mike Kravetz, Marty McFadden, David Hildenbrand,
	Bobby Powers, Mel Gorman

This is the gup counterpart of the change that allows VM_FAULT_RETRY
to happen more than once.  One thing to mention is that we must check
for fatal signals here before retrying, because GUP can be
interrupted by them; otherwise we could loop forever.
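
The core of the change is a retry loop of this shape (a simplified
sketch of the __get_user_pages_locked() hunk below):

retry:
	if (fatal_signal_pending(current))
		break;		/* don't spin forever if killed */

	*locked = 1;
	down_read(&mm->mmap_sem);
	ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED,
			       pages, NULL, locked);
	if (!*locked) {
		/* Another VM_FAULT_RETRY: continue to retry */
		BUG_ON(ret != 0);
		goto retry;
	}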

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/gup.c     | 27 +++++++++++++++++++++------
 mm/hugetlb.c |  6 ++++--
 2 files changed, 25 insertions(+), 8 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 76cb420c0fb7..ec2b76f44a01 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -652,7 +652,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 	if (*flags & FOLL_NOWAIT)
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
 	if (*flags & FOLL_TRIED) {
-		VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
+		/*
+		 * Note: FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_TRIED
+		 * can co-exist
+		 */
 		fault_flags |= FAULT_FLAG_TRIED;
 	}
 
@@ -1012,7 +1015,6 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 		down_read(&mm->mmap_sem);
 		if (!(fault_flags & FAULT_FLAG_TRIED)) {
 			*unlocked = true;
-			fault_flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			fault_flags |= FAULT_FLAG_TRIED;
 			goto retry;
 		}
@@ -1096,17 +1098,30 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
 		if (likely(pages))
 			pages += ret;
 		start += ret << PAGE_SHIFT;
+		lock_dropped = true;
 
+retry:
 		/*
 		 * Repeat on the address that fired VM_FAULT_RETRY
-		 * without FAULT_FLAG_ALLOW_RETRY but with
-		 * FAULT_FLAG_TRIED.
+		 * with both FAULT_FLAG_ALLOW_RETRY and
+		 * FAULT_FLAG_TRIED.  Note that GUP can be interrupted
+		 * by fatal signals, so we need to check it before we
+		 * start trying again otherwise it can loop forever.
 		 */
+
+		if (fatal_signal_pending(current))
+			break;
+
 		*locked = 1;
-		lock_dropped = true;
 		down_read(&mm->mmap_sem);
+
 		ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED,
-				       pages, NULL, NULL);
+				       pages, NULL, locked);
+		if (!*locked) {
+			/* Continue to retry until we succeeded */
+			BUG_ON(ret != 0);
+			goto retry;
+		}
 		if (ret != 1) {
 			BUG_ON(ret > 1);
 			if (!pages_done)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c84f721db020..ac9a28d51674 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4343,8 +4343,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				fault_flags |= FAULT_FLAG_ALLOW_RETRY |
 					FAULT_FLAG_RETRY_NOWAIT;
 			if (flags & FOLL_TRIED) {
-				VM_WARN_ON_ONCE(fault_flags &
-						FAULT_FLAG_ALLOW_RETRY);
+				/*
+				 * Note: FAULT_FLAG_ALLOW_RETRY and
+				 * FAULT_FLAG_TRIED can co-exist
+				 */
 				fault_flags |= FAULT_FLAG_TRIED;
 			}
 			ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH RESEND v6 00/16] mm: Page fault enhancements
  2020-02-20 15:53 [PATCH RESEND v6 00/16] mm: Page fault enhancements Peter Xu
                   ` (15 preceding siblings ...)
  2020-02-20 19:53 ` [PATCH RESEND v6 14/16] mm/gup: Allow VM_FAULT_RETRY for multiple times Peter Xu
@ 2020-02-21 19:26 ` Brian Geffon
  2020-03-02 17:31   ` Peter Xu
  2020-02-21 19:32 ` Linus Torvalds
  2020-03-07 20:33 ` David Hildenbrand
  18 siblings, 1 reply; 31+ messages in thread
From: Brian Geffon @ 2020-02-21 19:26 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, LKML, Andrea Arcangeli, Martin Cracauer,
	Linus Torvalds, Mike Rapoport, Kirill A . Shutemov,
	Johannes Weiner, Dr . David Alan Gilbert, David Hildenbrand,
	Bobby Powers, Maya Gokhale, Jerome Glisse, Mike Kravetz,
	Matthew Wilcox, Marty McFadden, Mel Gorman, Hugh Dickins,
	Denis Plotnikov, Pavel Emelyanov

I tested the entire patchset because I'm very interested in fault
retries with userfaultfd, and the series has been stable and has
worked well on x86.

Tested-by: Brian Geffon <bgeffon@google.com>




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RESEND v6 00/16] mm: Page fault enhancements
  2020-02-20 15:53 [PATCH RESEND v6 00/16] mm: Page fault enhancements Peter Xu
                   ` (16 preceding siblings ...)
  2020-02-21 19:26 ` [PATCH RESEND v6 00/16] mm: Page fault enhancements Brian Geffon
@ 2020-02-21 19:32 ` Linus Torvalds
  2020-02-21 20:11   ` Peter Xu
  2020-03-07 20:33 ` David Hildenbrand
  18 siblings, 1 reply; 31+ messages in thread
From: Linus Torvalds @ 2020-02-21 19:32 UTC (permalink / raw)
  To: Peter Xu, Andrew Morton
  Cc: Linux-MM, Linux Kernel Mailing List, Andrea Arcangeli,
	Martin Cracauer, Mike Rapoport, Kirill A . Shutemov,
	Johannes Weiner, Dr . David Alan Gilbert, David Hildenbrand,
	Bobby Powers, Maya Gokhale, Jerome Glisse, Mike Kravetz,
	Matthew Wilcox, Marty McFadden, Mel Gorman, Hugh Dickins,
	Brian Geffon, Denis Plotnikov, Pavel Emelyanov

On Thu, Feb 20, 2020 at 7:54 AM Peter Xu <peterx@redhat.com> wrote:
>
> This is v6 of the series.  It is majorly a rebase to 5.6-rc2, nothing
> else to be expected (plus some tests after the rebase).  Instead of
> rewrite the cover letter I decided to use what we have for v5.

I continue to think this is the right thing to do, and the series
looks good to me.

I'd love for it to get more testing, but realistically I suspect that
being in linux-next will be the right thing.

I've been assuming this will go through Andrew. He's not explicitly
cc'd, though (although maybe he does read all of linux-mm and has seen
this several times as a result).

               Linus


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RESEND v6 00/16] mm: Page fault enhancements
  2020-02-21 19:32 ` Linus Torvalds
@ 2020-02-21 20:11   ` Peter Xu
  0 siblings, 0 replies; 31+ messages in thread
From: Peter Xu @ 2020-02-21 20:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Linux-MM, Linux Kernel Mailing List,
	Andrea Arcangeli, Martin Cracauer, Mike Rapoport,
	Kirill A . Shutemov, Johannes Weiner, Dr . David Alan Gilbert,
	David Hildenbrand, Bobby Powers, Maya Gokhale, Jerome Glisse,
	Mike Kravetz, Matthew Wilcox, Marty McFadden, Mel Gorman,
	Hugh Dickins, Brian Geffon, Denis Plotnikov, Pavel Emelyanov

On Fri, Feb 21, 2020 at 11:32:36AM -0800, Linus Torvalds wrote:
> On Thu, Feb 20, 2020 at 7:54 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > This is v6 of the series.  It is majorly a rebase to 5.6-rc2, nothing
> > else to be expected (plus some tests after the rebase).  Instead of
> > rewrite the cover letter I decided to use what we have for v5.
> 
> I continue to think this is the right thing to do, and the series
> looks good to me.
> 
> I'd love for it to get more testing, but realistically I suspect that
> being in linux-next will be the right thing.
> 
> I've been assuming this will go through Andrew. He's not explicitly
> cc'd, though (although maybe he does read all of linux-mm and has seen
> this several times as a result).

I'm very sorry for missing that.  I should have CCed Andrew on every
mm patch.  Hope that's not a problem for this series.  I'll remember
to do that from now on.  Thanks!

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RESEND v6 00/16] mm: Page fault enhancements
  2020-02-21 19:26 ` [PATCH RESEND v6 00/16] mm: Page fault enhancements Brian Geffon
@ 2020-03-02 17:31   ` Peter Xu
  0 siblings, 0 replies; 31+ messages in thread
From: Peter Xu @ 2020-03-02 17:31 UTC (permalink / raw)
  To: Brian Geffon, Andrew Morton
  Cc: linux-mm, LKML, Andrea Arcangeli, Martin Cracauer,
	Linus Torvalds, Mike Rapoport, Kirill A . Shutemov,
	Johannes Weiner, Dr . David Alan Gilbert, David Hildenbrand,
	Bobby Powers, Maya Gokhale, Jerome Glisse, Mike Kravetz,
	Matthew Wilcox, Marty McFadden, Mel Gorman, Hugh Dickins,
	Denis Plotnikov, Pavel Emelyanov

On Fri, Feb 21, 2020 at 11:26:12AM -0800, Brian Geffon wrote:
> I tested the entire patchset because I'm very interested in fault
> retries with userfaultfd and the series has been stable and worked
> well on x86.
> 
> Tested-by: Brian Geffon <bgeffon@google.com>

(Thanks again for Brian's quick follow up, and adding Andrew in again)

Hi, Andrew,

Do you have plans to queue this series for linux-next?  Please let me
know if you want me to repost the whole series with you CCed.

Thanks!

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RESEND v6 02/16] mm/gup: Fix __get_user_pages() on fault retry of hugetlb
  2020-02-20 15:53 ` [PATCH RESEND v6 02/16] mm/gup: Fix __get_user_pages() on fault retry of hugetlb Peter Xu
@ 2020-03-02 19:02   ` David Hildenbrand
  2020-03-02 20:07     ` Peter Xu
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2020-03-02 19:02 UTC (permalink / raw)
  To: Peter Xu, linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Martin Cracauer, Linus Torvalds, Mike Rapoport,
	Kirill A . Shutemov, Johannes Weiner, Dr . David Alan Gilbert,
	Bobby Powers, Maya Gokhale, Jerome Glisse, Mike Kravetz,
	Matthew Wilcox, Marty McFadden, Mel Gorman, Hugh Dickins,
	Brian Geffon, Denis Plotnikov, Pavel Emelyanov

On 20.02.20 16:53, Peter Xu wrote:
> When follow_hugetlb_page() returns with *locked==0, it means we've got
> a VM_FAULT_RETRY within the faulting process and we've released the
> mmap_sem.  When that happens, we should stop and bail out.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/gup.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 1b4411bd0042..76cb420c0fb7 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -849,6 +849,16 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  				i = follow_hugetlb_page(mm, vma, pages, vmas,
>  						&start, &nr_pages, i,
>  						gup_flags, locked);
> +				if (locked && *locked == 0) {
> +					/*
> +					 * We've got a VM_FAULT_RETRY
> +					 * and we've lost mmap_sem.
> +					 * We must stop here.
> +					 */
> +					BUG_ON(gup_flags & FOLL_NOWAIT);
> +					BUG_ON(ret != 0);

Can we be sure ret is really set to != 0 at this point?  At least
from reading the code, this is not clear to me.

Shouldn't we set "ret = i" and assert that i is an error (e.g.,
-EBUSY)?  Or set -EBUSY explicitly?
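
(For concreteness, a hypothetical sketch of that alternative, reusing
the names from the hunk above — not tested:)

	if (locked && *locked == 0) {
		BUG_ON(gup_flags & FOLL_NOWAIT);
		ret = -EBUSY;	/* instead of BUG_ON(ret != 0) */
	}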

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RESEND v6 03/16] mm: Introduce fault_signal_pending()
  2020-02-20 15:53 ` [PATCH RESEND v6 03/16] mm: Introduce fault_signal_pending() Peter Xu
@ 2020-03-02 19:04   ` David Hildenbrand
  0 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2020-03-02 19:04 UTC (permalink / raw)
  To: Peter Xu, linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Martin Cracauer, Linus Torvalds, Mike Rapoport,
	Kirill A . Shutemov, Johannes Weiner, Dr . David Alan Gilbert,
	Bobby Powers, Maya Gokhale, Jerome Glisse, Mike Kravetz,
	Matthew Wilcox, Marty McFadden, Mel Gorman, Hugh Dickins,
	Brian Geffon, Denis Plotnikov, Pavel Emelyanov

On 20.02.20 16:53, Peter Xu wrote:
> For most architectures, we've got a quick path to detect a fatal
> signal after handle_mm_fault().  Introduce a helper for that quick
> path.
> 
> It cleans up the current code a bit so we don't need to duplicate the
> same check across archs.  More importantly, this will be a unified
> place where we handle the signal immediately after an interrupted
> page fault, so it'll be much easier for us if we want to change how
> signals are handled later on for all the archs.
> 
> Note that currently only some of the archs use this new helper,
> because the others have their own ways to handle signals.  In the
> follow-up patches, we'll try to apply this helper to all the rest of
> the archs.
> 
> Another note is that the "regs" parameter in the new helper is not
> used yet.  It'll be used very soon.  We keep it in this patch only
> to avoid touching all the archs again in the follow-up patches.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/alpha/mm/fault.c        |  2 +-
>  arch/arm/mm/fault.c          |  2 +-
>  arch/hexagon/mm/vm_fault.c   |  2 +-
>  arch/ia64/mm/fault.c         |  2 +-
>  arch/m68k/mm/fault.c         |  2 +-
>  arch/microblaze/mm/fault.c   |  2 +-
>  arch/mips/mm/fault.c         |  2 +-
>  arch/nds32/mm/fault.c        |  2 +-
>  arch/nios2/mm/fault.c        |  2 +-
>  arch/openrisc/mm/fault.c     |  2 +-
>  arch/parisc/mm/fault.c       |  2 +-
>  arch/riscv/mm/fault.c        |  2 +-
>  arch/s390/mm/fault.c         |  3 +--
>  arch/sparc/mm/fault_32.c     |  2 +-
>  arch/sparc/mm/fault_64.c     |  2 +-
>  arch/unicore32/mm/fault.c    |  2 +-
>  arch/xtensa/mm/fault.c       |  2 +-
>  include/linux/sched/signal.h | 13 +++++++++++++
>  18 files changed, 30 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
> index 741e61ef9d3f..aea33b599037 100644
> --- a/arch/alpha/mm/fault.c
> +++ b/arch/alpha/mm/fault.c
> @@ -150,7 +150,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
>  	   the fault.  */
>  	fault = handle_mm_fault(vma, address, flags);
>  
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> +	if (fault_signal_pending(fault, regs))
>  		return;
>  
>  	if (unlikely(fault & VM_FAULT_ERROR)) {
> diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
> index bd0f4821f7e1..937b81ff8649 100644
> --- a/arch/arm/mm/fault.c
> +++ b/arch/arm/mm/fault.c
> @@ -295,7 +295,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
>  	 * signal first. We do not need to release the mmap_sem because
>  	 * it would already be released in __lock_page_or_retry in
>  	 * mm/filemap.c. */
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
> +	if (fault_signal_pending(fault, regs)) {
>  		if (!user_mode(regs))
>  			goto no_context;
>  		return 0;
> diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
> index b3bc71680ae4..d19beaf11b4c 100644
> --- a/arch/hexagon/mm/vm_fault.c
> +++ b/arch/hexagon/mm/vm_fault.c
> @@ -91,7 +91,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
>  
>  	fault = handle_mm_fault(vma, address, flags);
>  
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> +	if (fault_signal_pending(fault, regs))
>  		return;
>  
>  	/* The most common case -- we are done. */
> diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
> index c2f299fe9e04..211b4f439384 100644
> --- a/arch/ia64/mm/fault.c
> +++ b/arch/ia64/mm/fault.c
> @@ -141,7 +141,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
>  	 */
>  	fault = handle_mm_fault(vma, address, flags);
>  
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> +	if (fault_signal_pending(fault, regs))
>  		return;
>  
>  	if (unlikely(fault & VM_FAULT_ERROR)) {
> diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
> index e9b1d7585b43..a455e202691b 100644
> --- a/arch/m68k/mm/fault.c
> +++ b/arch/m68k/mm/fault.c
> @@ -138,7 +138,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
>  	fault = handle_mm_fault(vma, address, flags);
>  	pr_debug("handle_mm_fault returns %x\n", fault);
>  
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> +	if (fault_signal_pending(fault, regs))
>  		return 0;
>  
>  	if (unlikely(fault & VM_FAULT_ERROR)) {
> diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
> index e6a810b0c7ad..cdde01dcdfc3 100644
> --- a/arch/microblaze/mm/fault.c
> +++ b/arch/microblaze/mm/fault.c
> @@ -217,7 +217,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
>  	 */
>  	fault = handle_mm_fault(vma, address, flags);
>  
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> +	if (fault_signal_pending(fault, regs))
>  		return;
>  
>  	if (unlikely(fault & VM_FAULT_ERROR)) {
> diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
> index 1e8d00793784..0ee6fafc57bc 100644
> --- a/arch/mips/mm/fault.c
> +++ b/arch/mips/mm/fault.c
> @@ -154,7 +154,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
>  	 */
>  	fault = handle_mm_fault(vma, address, flags);
>  
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> +	if (fault_signal_pending(fault, regs))
>  		return;
>  
>  	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
> diff --git a/arch/nds32/mm/fault.c b/arch/nds32/mm/fault.c
> index 906dfb25353c..0e63f81eff5b 100644
> --- a/arch/nds32/mm/fault.c
> +++ b/arch/nds32/mm/fault.c
> @@ -214,7 +214,7 @@ void do_page_fault(unsigned long entry, unsigned long addr,
>  	 * signal first. We do not need to release the mmap_sem because it
>  	 * would already be released in __lock_page_or_retry in mm/filemap.c.
>  	 */
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
> +	if (fault_signal_pending(fault, regs)) {
>  		if (!user_mode(regs))
>  			goto no_context;
>  		return;
> diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c
> index 6a2e716b959f..704ace8ca0ee 100644
> --- a/arch/nios2/mm/fault.c
> +++ b/arch/nios2/mm/fault.c
> @@ -133,7 +133,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause,
>  	 */
>  	fault = handle_mm_fault(vma, address, flags);
>  
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> +	if (fault_signal_pending(fault, regs))
>  		return;
>  
>  	if (unlikely(fault & VM_FAULT_ERROR)) {
> diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
> index 5d4d3a9691d0..85c7eb0c0186 100644
> --- a/arch/openrisc/mm/fault.c
> +++ b/arch/openrisc/mm/fault.c
> @@ -161,7 +161,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
>  
>  	fault = handle_mm_fault(vma, address, flags);
>  
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> +	if (fault_signal_pending(fault, regs))
>  		return;
>  
>  	if (unlikely(fault & VM_FAULT_ERROR)) {
> diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
> index adbd5e2144a3..f9be1d1cb43f 100644
> --- a/arch/parisc/mm/fault.c
> +++ b/arch/parisc/mm/fault.c
> @@ -304,7 +304,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
>  
>  	fault = handle_mm_fault(vma, address, flags);
>  
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> +	if (fault_signal_pending(fault, regs))
>  		return;
>  
>  	if (unlikely(fault & VM_FAULT_ERROR)) {
> diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
> index cf7248e07f43..1d3869e9ddef 100644
> --- a/arch/riscv/mm/fault.c
> +++ b/arch/riscv/mm/fault.c
> @@ -117,7 +117,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
>  	 * signal first. We do not need to release the mmap_sem because it
>  	 * would already be released in __lock_page_or_retry in mm/filemap.c.
>  	 */
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(tsk))
> +	if (fault_signal_pending(fault, regs))
>  		return;
>  
>  	if (unlikely(fault & VM_FAULT_ERROR)) {
> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> index 7b0bb475c166..179cf92a56e5 100644
> --- a/arch/s390/mm/fault.c
> +++ b/arch/s390/mm/fault.c
> @@ -480,8 +480,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>  	 * the fault.
>  	 */
>  	fault = handle_mm_fault(vma, address, flags);
> -	/* No reason to continue if interrupted by SIGKILL. */
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
> +	if (fault_signal_pending(fault, regs)) {
>  		fault = VM_FAULT_SIGNAL;
>  		if (flags & FAULT_FLAG_RETRY_NOWAIT)
>  			goto out_up;
> diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
> index 89976c9b936c..6efbeb227644 100644
> --- a/arch/sparc/mm/fault_32.c
> +++ b/arch/sparc/mm/fault_32.c
> @@ -237,7 +237,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
>  	 */
>  	fault = handle_mm_fault(vma, address, flags);
>  
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> +	if (fault_signal_pending(fault, regs))
>  		return;
>  
>  	if (unlikely(fault & VM_FAULT_ERROR)) {
> diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
> index 8b7ddbd14b65..dd1ed6555831 100644
> --- a/arch/sparc/mm/fault_64.c
> +++ b/arch/sparc/mm/fault_64.c
> @@ -425,7 +425,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
>  
>  	fault = handle_mm_fault(vma, address, flags);
>  
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> +	if (fault_signal_pending(fault, regs))
>  		goto exit_exception;
>  
>  	if (unlikely(fault & VM_FAULT_ERROR)) {
> diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
> index 76342de9cf8c..59d0e6ec2cfc 100644
> --- a/arch/unicore32/mm/fault.c
> +++ b/arch/unicore32/mm/fault.c
> @@ -250,7 +250,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
>  	 * signal first. We do not need to release the mmap_sem because
>  	 * it would already be released in __lock_page_or_retry in
>  	 * mm/filemap.c. */
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> +	if (fault_signal_pending(fault, regs))
>  		return 0;
>  
>  	if (!(fault & VM_FAULT_ERROR) && (flags & FAULT_FLAG_ALLOW_RETRY)) {
> diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
> index bee30a77cd70..59515905d4ad 100644
> --- a/arch/xtensa/mm/fault.c
> +++ b/arch/xtensa/mm/fault.c
> @@ -110,7 +110,7 @@ void do_page_fault(struct pt_regs *regs)
>  	 */
>  	fault = handle_mm_fault(vma, address, flags);
>  
> -	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> +	if (fault_signal_pending(fault, regs))
>  		return;
>  
>  	if (unlikely(fault & VM_FAULT_ERROR)) {
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 88050259c466..4c87ffce64d1 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -369,6 +369,19 @@ static inline int signal_pending_state(long state, struct task_struct *p)
>  	return (state & TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);
>  }
>  
> +/*
> + * This should only be used in fault handlers to decide whether we
> + * should stop the current fault routine to handle the signals
> + * instead, especially in the case where we've been interrupted with
> + * a VM_FAULT_RETRY.
> + */
> +static inline bool fault_signal_pending(unsigned int fault_flags,
> +					struct pt_regs *regs)
> +{
> +	return unlikely((fault_flags & VM_FAULT_RETRY) &&
> +			fatal_signal_pending(current));
> +}
> +
>  /*
>   * Reevaluate whether the task has signals pending delivery.
>   * Wake the task if so.
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb




* Re: [PATCH RESEND v6 02/16] mm/gup: Fix __get_user_pages() on fault retry of hugetlb
  2020-03-02 19:02   ` David Hildenbrand
@ 2020-03-02 20:07     ` Peter Xu
  2020-03-02 20:22       ` David Hildenbrand
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Xu @ 2020-03-02 20:07 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Martin Cracauer,
	Linus Torvalds, Mike Rapoport, Kirill A . Shutemov,
	Johannes Weiner, Dr . David Alan Gilbert, Bobby Powers,
	Maya Gokhale, Jerome Glisse, Mike Kravetz, Matthew Wilcox,
	Marty McFadden, Mel Gorman, Hugh Dickins, Brian Geffon,
	Denis Plotnikov, Pavel Emelyanov

On Mon, Mar 02, 2020 at 08:02:34PM +0100, David Hildenbrand wrote:
> On 20.02.20 16:53, Peter Xu wrote:
> > When follow_hugetlb_page() returns with *locked==0, it means we've got
> > a VM_FAULT_RETRY within the faulting process and we've released the
> > mmap_sem.  When that happens, we should stop and bail out.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  mm/gup.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 1b4411bd0042..76cb420c0fb7 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -849,6 +849,16 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> >  				i = follow_hugetlb_page(mm, vma, pages, vmas,
> >  						&start, &nr_pages, i,
> >  						gup_flags, locked);
> > +				if (locked && *locked == 0) {
> > +					/*
> > +					 * We've got a VM_FAULT_RETRY
> > +					 * and we've lost mmap_sem.
> > +					 * We must stop here.
> > +					 */
> > +					BUG_ON(gup_flags & FOLL_NOWAIT);
> > +					BUG_ON(ret != 0);
> 
> Can we be sure ret is really set to != 0 at this point? At least,
> reading the code, this is not clear to me.

Here I wanted to make sure ret is zero (it's a BUG_ON, not an assert).

"ret" is the fallback return value, used only if an error happens while
i==0.  Here we want to make sure that even if no page is pinned, we'll
still return zero gracefully when VM_FAULT_RETRY happened while
following the hugetlb pages.

> 
> Shouldn't we set "ret = i" and assert that i is an error (e.g., -EBUSY)?
> Or set -EBUSY explicitly?

No.  Here "i" could only be either positive (when we've got some pages
pinned no matter where), or zero (when follow_hugetlb_page released
the mmap_sem on the first page that it wants to pin).  So imo "i"
should never be negative instead.
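
To illustrate, the invariants at that point boil down to roughly the
following (a minimal sketch of the hunk's intent, not the literal
patch; returning "i" directly after the BUG_ONs is the sketch's
assumption):

	i = follow_hugetlb_page(mm, vma, pages, vmas,
				&start, &nr_pages, i,
				gup_flags, locked);
	if (locked && *locked == 0) {
		/* VM_FAULT_RETRY happened and mmap_sem was dropped */
		BUG_ON(gup_flags & FOLL_NOWAIT);  /* shouldn't happen with NOWAIT */
		BUG_ON(ret != 0);                 /* no error recorded so far */
		return i;  /* i >= 0: pages pinned before the retry, possibly 0 */
	}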

Thanks,

-- 
Peter Xu




* Re: [PATCH RESEND v6 02/16] mm/gup: Fix __get_user_pages() on fault retry of hugetlb
  2020-03-02 20:07     ` Peter Xu
@ 2020-03-02 20:22       ` David Hildenbrand
  0 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2020-03-02 20:22 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, linux-mm, linux-kernel, Andrea Arcangeli,
	Martin Cracauer, Linus Torvalds, Mike Rapoport,
	Kirill A . Shutemov, Johannes Weiner, Dr . David Alan Gilbert,
	Bobby Powers, Maya Gokhale, Jerome Glisse, Mike Kravetz,
	Matthew Wilcox, Marty McFadden, Mel Gorman, Hugh Dickins,
	Brian Geffon, Denis Plotnikov, Pavel Emelyanov



> Am 02.03.2020 um 21:07 schrieb Peter Xu <peterx@redhat.com>:
> 
> On Mon, Mar 02, 2020 at 08:02:34PM +0100, David Hildenbrand wrote:
>>> On 20.02.20 16:53, Peter Xu wrote:
>>> When follow_hugetlb_page() returns with *locked==0, it means we've got
>>> a VM_FAULT_RETRY within the faulting process and we've released the
>>> mmap_sem.  When that happens, we should stop and bail out.
>>> 
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>> mm/gup.c | 10 ++++++++++
>>> 1 file changed, 10 insertions(+)
>>> 
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index 1b4411bd0042..76cb420c0fb7 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -849,6 +849,16 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>>>                i = follow_hugetlb_page(mm, vma, pages, vmas,
>>>                        &start, &nr_pages, i,
>>>                        gup_flags, locked);
>>> +                if (locked && *locked == 0) {
>>> +                    /*
>>> +                     * We've got a VM_FAULT_RETRY
>>> +                     * and we've lost mmap_sem.
>>> +                     * We must stop here.
>>> +                     */
>>> +                    BUG_ON(gup_flags & FOLL_NOWAIT);
>>> +                    BUG_ON(ret != 0);
>> 
>> Can we be sure ret is really set to != 0 at this point? At least,
>> reading the code, this is not clear to me.
> 
> Here I wanted to make sure ret is zero (it's a BUG_ON, not an assert).

Sorry, I completely misread that BUG_ON for whatever reason; maybe I was staring at my computer screen for too long :)

> 
> "ret" is the fallback return value only if error happens when i==0.
> Here we want to make sure even if no page is pinned we'll return zero
> gracefully when VM_FAULT_RETRY happened when following the hugetlb
> pages.

Makes sense!

> 
>> 
>> Shouldn't we set "ret = i" and assert that i is an error (e.g., -EBUSY)?
>> Or set -EBUSY explicitly?
> 
> No.  Here "i" could only be either positive (when we've got some pages
> pinned no matter where), or zero (when follow_hugetlb_page released
> the mmap_sem on the first page that it wants to pin).  So imo "i"
> should never be negative instead.

I briefly scanned the function and spotted some errors being returned; that's why I was wondering.

Thanks!




* Re: [PATCH RESEND v6 00/16] mm: Page fault enhancements
  2020-02-20 15:53 [PATCH RESEND v6 00/16] mm: Page fault enhancements Peter Xu
                   ` (17 preceding siblings ...)
  2020-02-21 19:32 ` Linus Torvalds
@ 2020-03-07 20:33 ` David Hildenbrand
  2020-03-07 21:47   ` Peter Xu
  2020-03-08 12:49   ` David Hildenbrand
  18 siblings, 2 replies; 31+ messages in thread
From: David Hildenbrand @ 2020-03-07 20:33 UTC (permalink / raw)
  To: Peter Xu, linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Martin Cracauer, Linus Torvalds, Mike Rapoport,
	Kirill A . Shutemov, Johannes Weiner, Dr . David Alan Gilbert,
	Bobby Powers, Maya Gokhale, Jerome Glisse, Mike Kravetz,
	Matthew Wilcox, Marty McFadden, Mel Gorman, Hugh Dickins,
	Brian Geffon, Denis Plotnikov, Pavel Emelyanov, dgilbert

On 20.02.20 16:53, Peter Xu wrote:
> [Resend v6]
> 
> This is v6 of the series.  It is majorly a rebase to 5.6-rc2, nothing
> else to be expected (plus some tests after the rebase).  Instead of
> rewrite the cover letter I decided to use what we have for v5.
> 
> Adding extra CCs for both Bobby Powers <bobbypowers@gmail.com> and
> Brian Geffon <bgeffon@google.com>.
> 
> Online repo: https://github.com/xzpeter/linux/tree/mm-pf-signal-retry
> 
> Any review comment is appreciated.  Thanks,

If I am not completely missing something (and all my testing today
wasn't wrong), there is a very simple reason why I *LOVE* this series,
and it made my weekend: it makes userfaultfd with concurrent discarding
(e.g., MADV_DONTNEED) of pages actually usable.

The issue in the current code is that between placing a page and waking
up a waiter, somebody can zap the newly placed page and trigger a
re-fault, raising a SIGBUS and crashing an application where all
memory is supposed to be accessible. And there is no real way to protect
against that, because it is not deterministic when the fault handler
will be woken up and retried (e.g., making madvise(MADV_DONTNEED) and
UFFDIO_ZEROPAGE mutually exclusive does not help).

Find a simple reproducer at the end of this mail.

Before this series:
[root@localhost ~]# ./a.out 
Progress!
Progress!
Progress!
Progress!
Progress!
Progress!
Progress!
Progress!
Progress!
Progress!
Progress!
Progress!
[   34.849604] FAULT_FLAG_ALLOW_RETRY missing 70
[   34.850466] CPU: 1 PID: 651 Comm: a.out Not tainted 5.6.0-rc2+ #92
[   34.851525] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4
[   34.852818] Call Trace:
[   34.853045]  dump_stack+0x8f/0xd0
[   34.853338]  handle_userfault.cold+0x1a/0x2e
[   34.853704]  ? find_held_lock+0x2b/0x80
[   34.854031]  ? __handle_mm_fault+0x18c5/0x1900
[   34.854409]  __handle_mm_fault+0x18d4/0x1900
[   34.854784]  handle_mm_fault+0x169/0x360
[   34.855120]  do_user_addr_fault+0x20d/0x490
[   34.855478]  async_page_fault+0x43/0x50
[   34.855809] RIP: 0033:0x401659
[   34.856069] Code: ba 1f 00 00 00 be 01 00 00 00 bf 10 21 40 00 e8 ad fa ff ff bf ff ff ff ff e8 93 fa ff ff 48 8b8
[   34.857629] RSP: 002b:00007ffcfd536ec0 EFLAGS: 00010246
[   34.858076] RAX: 00007fcba86a4000 RBX: 0000000000000000 RCX: 00007fcba85784ef
[   34.858675] RDX: 00007fcba86a4007 RSI: 00000000016524e0 RDI: 00007fcba864b320
[   34.859272] RBP: 00007ffcfd536f20 R08: 000000000000000a R09: 0000000000000070
[   34.859876] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000401120
[   34.860472] R13: 00007ffcfd537000 R14: 0000000000000000 R15: 0000000000000000

After this series:
Well, "Progress!" all day long.


Can we please have a way to identify that this "feature" is available?
I'd appreciate a new read-only UFFD_FEAT_, so we can detect this from
user space easily and use concurrent discards without crashing our applications.
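
Something like the usual UFFDIO_API handshake would be enough to probe
for it; a minimal sketch, where UFFD_FEATURE_FAULT_RETRY is just a
placeholder name I made up for whatever flag would advertise the new
behavior:

    struct uffdio_api api = {
        .api = UFFD_API,
        .features = 0,    /* kernel reports back what it supports */
    };

    if (ioctl(uffd, UFFDIO_API, &api) == -1) {
        fprintf(stderr, "UFFDIO_API failed: %s\n", strerror(errno));
        exit(-1);
    }
    if (api.features & UFFD_FEATURE_FAULT_RETRY) {  /* hypothetical flag */
        /* safe to race MADV_DONTNEED against UFFDIO_ZEROPAGE */
    }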


Questions:
1. I assume KVM will do multiple retries as well, and have the same behavior, right?

2. What will happen if I don't place a page on a pagefault, but only do a UFFDIO_WAKE?
   For now we were able to trigger a signal this way. If the behavior is changed, can
   we make this configurable via a UFFD_FEAT?

--- snip ---
#include <string.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <stdio.h>
#include <pthread.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <poll.h>
#include <linux/userfaultfd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>

static int page_size;

static void *fault_handler_thread(void *arg)
{
    const long uffd = (long) arg;
    struct pollfd pollfd = {
        .fd = uffd,
        .events = POLLIN,
    };
    int ret;

    while (true) {
        struct uffdio_zeropage zeropage = {};
        struct uffd_msg msg;
        ssize_t nread;

        if (poll(&pollfd, 1, -1) == -1) {
            fprintf(stderr, "POLL failed: %s\n", strerror(errno));
            exit(-1);
        }
        if (read(uffd, &msg, sizeof(msg)) != sizeof(msg)) {
            fprintf(stderr, "READ failed\n");
            exit(-1);
        }
        if (msg.event != UFFD_EVENT_PAGEFAULT) {
            fprintf(stderr, "Not UFFD_EVENT_PAGEFAULT\n");
            exit(-1);
        }

        zeropage.range.start = msg.arg.pagefault.address;
        zeropage.range.len = page_size;
        do {
            ret = ioctl(uffd, UFFDIO_ZEROPAGE, &zeropage);
            if (ret && errno != EAGAIN) {
                fprintf(stderr, "UFFDIO_ZEROPAGE failed:%s\n", strerror(errno));
                exit(-1);
            }
        } while (ret);
    }
}
static void *discard_thread(void *arg)
{
    while (true) {
        if (madvise(arg, page_size, MADV_DONTNEED)) {
            fprintf(stderr, "MADV_DONTNEED failed:%s\n", strerror(errno));
            exit(-1);
        }
        usleep(1000);
    }
}

int main(void)
{
    struct uffdio_register reg;
    struct uffdio_api api = {
        .api = UFFD_API,
    };
    pthread_t fault, discard;
    long uffd;
    char *area;

    page_size = sysconf(_SC_PAGE_SIZE);

    uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd == -1) {
        fprintf(stderr, "Could not create uffd: %s\n", strerror(errno));
        exit(-1);
    }
    if (ioctl(uffd, UFFDIO_API, &api) == -1) {
        fprintf(stderr, "UFFDIO_API failed: %s\n", strerror(errno));
        exit(-1);
    }

    area = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (area == MAP_FAILED) {
        fprintf(stderr, "Could not allocate memory");
        exit(-1);
    }

    reg.range.start = (uint64_t) area;
    reg.range.len = page_size,
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1) {
        fprintf(stderr, "UFFDIO_REGISTER failed: %s\n", strerror(errno));
        exit(-1);
    }

    /* thread to provide zeropages */
    if (pthread_create(&fault, NULL, fault_handler_thread,
                       (void *) uffd)) {
        fprintf(stderr, "Could not create fault handling thread");
        exit(-1);
    }

    /* thread to discard the page */
    if (pthread_create(&discard, NULL, discard_thread,
                       (void *) area)) {
        fprintf(stderr, "Could not create discard thread");
        exit(-1);
    }

    /* keep reading/writing the page */
    while (true) {
        area[7] = area[1];
        usleep(10000);
        printf("Progress!\n");
    }
    return 0;
}


-- 
Thanks,

David / dhildenb




* Re: [PATCH RESEND v6 00/16] mm: Page fault enhancements
  2020-03-07 20:33 ` David Hildenbrand
@ 2020-03-07 21:47   ` Peter Xu
  2020-03-08 12:12     ` David Hildenbrand
  2020-03-08 12:49   ` David Hildenbrand
  1 sibling, 1 reply; 31+ messages in thread
From: Peter Xu @ 2020-03-07 21:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Martin Cracauer,
	Linus Torvalds, Mike Rapoport, Kirill A . Shutemov,
	Johannes Weiner, Dr . David Alan Gilbert, Bobby Powers,
	Maya Gokhale, Jerome Glisse, Mike Kravetz, Matthew Wilcox,
	Marty McFadden, Mel Gorman, Hugh Dickins, Brian Geffon,
	Denis Plotnikov, Pavel Emelyanov

On Sat, Mar 07, 2020 at 09:33:08PM +0100, David Hildenbrand wrote:
> On 20.02.20 16:53, Peter Xu wrote:
> > [Resend v6]
> > 
> > This is v6 of the series.  It is majorly a rebase to 5.6-rc2, nothing
> > else to be expected (plus some tests after the rebase).  Instead of
> > rewrite the cover letter I decided to use what we have for v5.
> > 
> > Adding extra CCs for both Bobby Powers <bobbypowers@gmail.com> and
> > Brian Geffon <bgeffon@google.com>.
> > 
> > Online repo: https://github.com/xzpeter/linux/tree/mm-pf-signal-retry
> > 
> > Any review comment is appreciated.  Thanks,
> 
> If I am not completely missing something (and all my testing today
> wasn't wrong), there is a very simple reason why I *LOVE* this series,
> and it made my weekend: it makes userfaultfd with concurrent discarding
> (e.g., MADV_DONTNEED) of pages actually usable.

Hi, David,

Thanks for doing further test against this branch!

> 
> The issue in the current code is that between placing a page and waking
> up a waiter, somebody can zap the newly placed page and trigger a
> re-fault, raising a SIGBUS and crashing an application where all
> memory is supposed to be accessible. And there is no real way to protect
> against that, because it is not deterministic when the fault handler
> will be woken up and retried (e.g., making madvise(MADV_DONTNEED) and
> UFFDIO_ZEROPAGE mutually exclusive does not help).
> 
> Find a simple reproducer at the end of this mail.
> 
> Before this series:
> [root@localhost ~]# ./a.out 
> Progress!
> Progress!
> Progress!
> Progress!
> Progress!
> Progress!
> Progress!
> Progress!
> Progress!
> Progress!
> Progress!
> Progress!
> [   34.849604] FAULT_FLAG_ALLOW_RETRY missing 70
> [   34.850466] CPU: 1 PID: 651 Comm: a.out Not tainted 5.6.0-rc2+ #92
> [   34.851525] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4
> [   34.852818] Call Trace:
> [   34.853045]  dump_stack+0x8f/0xd0
> [   34.853338]  handle_userfault.cold+0x1a/0x2e
> [   34.853704]  ? find_held_lock+0x2b/0x80
> [   34.854031]  ? __handle_mm_fault+0x18c5/0x1900
> [   34.854409]  __handle_mm_fault+0x18d4/0x1900
> [   34.854784]  handle_mm_fault+0x169/0x360
> [   34.855120]  do_user_addr_fault+0x20d/0x490
> [   34.855478]  async_page_fault+0x43/0x50
> [   34.855809] RIP: 0033:0x401659
> [   34.856069] Code: ba 1f 00 00 00 be 01 00 00 00 bf 10 21 40 00 e8 ad fa ff ff bf ff ff ff ff e8 93 fa ff ff 48 8b8
> [   34.857629] RSP: 002b:00007ffcfd536ec0 EFLAGS: 00010246
> [   34.858076] RAX: 00007fcba86a4000 RBX: 0000000000000000 RCX: 00007fcba85784ef
> [   34.858675] RDX: 00007fcba86a4007 RSI: 00000000016524e0 RDI: 00007fcba864b320
> [   34.859272] RBP: 00007ffcfd536f20 R08: 000000000000000a R09: 0000000000000070
> [   34.859876] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000401120
> [   34.860472] R13: 00007ffcfd537000 R14: 0000000000000000 R15: 0000000000000000
> 
> After this series:
> Well, "Progress!" all day long.

Yes, IIUC the race can happen like this in your test below:

     main thread          uffd thread             discard thread
     ===========          ===========             ==============
     access page
       uffd page fault
         wait for page
                          UFFDIO_ZEROPAGE
                            put a page P there
                                                  MADV_DONTNEED on P
                            wakeup main thread
         return from fault
       page still missing
       uffd page fault again
       (without ALLOW_RETRY)
       --> SIGBUS.

And I agree this should be one of the major issues that this series
wants to address.

> 
> 
> Can we please have a way to identify that this "feature" is available?
> I'd appreciate a new read-only UFFD_FEAT_, so we can detect this from
> user space easily and use concurrent discards without crashing our applications.

I'm not sure how others think about it, but to me this still falls
into the bucket of "solving an existing problem" rather than a
feature.  Also note that this changes the behavior of the page
fault logic in general, rather than being an uffd-only change. So I'm
also not sure whether UFFD_FEAT_* suits here even if we want it.

> 
> 
> Questions:
> 1. I assume KVM will do multiple retries as well, and have the same behavior, right?

Yes, I believe so, or we'll need to fix it.

> 
> 2. What will happen if I don't place a page on a pagefault, but only do a UFFDIO_WAKE?
>    For now we were able to trigger a signal this way.

If I'm not mistaken, UFFDIO_WAKE will directly trigger the SIGBUS
even without the help of the MADV_DONTNEED race.

> If the behavior is changed, can
>    we make this configurable via a UFFD_FEAT?

I still think that could be overkill, but I'll leave the
discussion to the experts.

Thanks,

> 
> --- snip ---
> #include <string.h>
> #include <stdbool.h>
> #include <stdint.h>
> #include <sys/types.h>
> #include <stdio.h>
> #include <pthread.h>
> #include <errno.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <fcntl.h>
> #include <poll.h>
> #include <linux/userfaultfd.h>
> #include <sys/mman.h>
> #include <sys/syscall.h>
> #include <sys/ioctl.h>
> 
> static int page_size;
> 
> static void *fault_handler_thread(void *arg)
> {
>     const long uffd = (long) arg;
>     struct pollfd pollfd = {
>         .fd = uffd,
>         .events = POLLIN,
>     };
>     int ret;
> 
>     while (true) {
>         struct uffdio_zeropage zeropage = {};
>         struct uffd_msg msg;
>         ssize_t nread;
> 
>         if (poll(&pollfd, 1, -1) == -1) {
>             fprintf(stderr, "POLL failed: %s\n", strerror(errno));
>             exit(-1);
>         }
>         if (read(uffd, &msg, sizeof(msg)) != sizeof(msg)) {
>             fprintf(stderr, "READ failed\n");
>             exit(-1);
>         }
>         if (msg.event != UFFD_EVENT_PAGEFAULT) {
>             fprintf(stderr, "Not UFFD_EVENT_PAGEFAULT\n");
>             exit(-1);
>         }
> 
>         zeropage.range.start = msg.arg.pagefault.address;
>         zeropage.range.len = page_size;
>         do {
>             ret = ioctl(uffd, UFFDIO_ZEROPAGE, &zeropage);
>             if (ret && errno != EAGAIN) {
>                 fprintf(stderr, "UFFDIO_ZEROPAGE failed:%s\n", strerror(errno));
>                 exit(-1);
>             }
>         } while (ret);
>     }
> }
> static void *discard_thread(void *arg)
> {
>     while (true) {
>         if (madvise(arg, page_size, MADV_DONTNEED)) {
>             fprintf(stderr, "MADV_DONTNEED failed:%s\n", strerror(errno));
>             exit(-1);
>         }
>         usleep(1000);
>     }
> }
> 
> int main(void)
> {
>     struct uffdio_register reg;
>     struct uffdio_api api = {
>         .api = UFFD_API,
>     };
>     pthread_t fault, discard;
>     long uffd;
>     char *area;
> 
>     page_size = sysconf(_SC_PAGE_SIZE);
> 
>     uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
>     if (uffd == -1) {
>         fprintf(stderr, "Could not create uffd: %s\n", strerror(errno));
>         exit(-1);
>     }
>     if (ioctl(uffd, UFFDIO_API, &api) == -1) {
>         fprintf(stderr, "UFFDIO_API failed: %s\n", strerror(errno));
>         exit(-1);
>     }
> 
>     area = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
>                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>     if (area == MAP_FAILED) {
>         fprintf(stderr, "Could not allocate memory");
>         exit(-1);
>     }
> 
>     reg.range.start = (uint64_t) area;
>     reg.range.len = page_size,
>     reg.mode = UFFDIO_REGISTER_MODE_MISSING;
>     if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1) {
>         fprintf(stderr, "UFFDIO_REGISTER failed: %s\n", strerror(errno));
>         exit(-1);
>     }
> 
>     /* thread to provide zeropages */
>     if (pthread_create(&fault, NULL, fault_handler_thread,
>                        (void *) uffd)) {
>         fprintf(stderr, "Could not create fault handling thread");
>         exit(-1);
>     }
> 
>     /* thread to discard the page */
>     if (pthread_create(&discard, NULL, discard_thread,
>                        (void *) area)) {
>         fprintf(stderr, "Could not create discard thread");
>         exit(-1);
>     }
> 
>     /* keep reading/writing the page */
>     while (true) {
>         area[7] = area[1];
>         usleep(10000);
>         printf("Progress!\n");
>     }
>     return 0;
> }
> 
> 
> -- 
> Thanks,
> 
> David / dhildenb
> 

-- 
Peter Xu




* Re: [PATCH RESEND v6 00/16] mm: Page fault enhancements
  2020-03-07 21:47   ` Peter Xu
@ 2020-03-08 12:12     ` David Hildenbrand
  2020-03-09 19:51       ` Peter Xu
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2020-03-08 12:12 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Martin Cracauer,
	Linus Torvalds, Mike Rapoport, Kirill A . Shutemov,
	Johannes Weiner, Dr . David Alan Gilbert, Bobby Powers,
	Maya Gokhale, Jerome Glisse, Mike Kravetz, Matthew Wilcox,
	Marty McFadden, Mel Gorman, Hugh Dickins, Brian Geffon,
	Denis Plotnikov, Pavel Emelyanov

[...]

> Yes, IIUC the race can happen like this in your test below:
> 
>      main thread          uffd thread             discard thread
>      ===========          ===========             ==============
>      access page
>        uffd page fault
>          wait for page
>                           UFFDIO_ZEROPAGE
>                             put a page P there
>                                                   MADV_DONTNEED on P
>                             wakeup main thread
>          return from fault
>        page still missing
>        uffd page fault again
>        (without ALLOW_RETRY)
>        --> SIGBUS.

Exactly!

>> Can we please have a way to identify that this "feature" is available?
>> I'd appreciate a new read-only UFFD_FEAT_, so we can detect this from
>> user space easily and use concurrent discards without crashing our applications.
> 
> I'm not sure how others think about it, but to me this still falls
> into the bucket of "solving an existing problem" rather than a
> feature.  Also note that this changes the behavior of the page
> fault logic in general, rather than being an uffd-only change. So I'm
> also not sure whether UFFD_FEAT_* suits here even if we want it.

So, are we planning on backporting this to stable kernels?

Imagine using this in QEMU/KVM to allow discards (e.g., balloon
inflation) while postcopy is active. You certainly don't want random
guest crashes. So either we treat this as a fix (and backport it), or
as a change in behavior/feature.

[...]

>>
>> 2. What will happen if I don't place a page on a pagefault, but only do a UFFDIO_WAKE?
>>    For now we were able to trigger a signal this way.
> 
> If I'm not mistaken, UFFDIO_WAKE will directly trigger the SIGBUS
> even without the help of the MADV_DONTNEED race.

Yes, that's the current way of injecting a SIGBUS instead of resolving
the pagefault. And AFAIK, you're changing that behavior. (I am not
aware of a user; there could be use cases, but somehow it's strange to
get a signal when accessing memory that is mapped READ|WRITE and also
represented like this in, e.g., /proc/$PID/maps.) So IMHO, only the new
behavior really makes sense.
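
(For reference, the idiom in question is a bare wake without placing a
page; a sketch using the same handler variables as in my reproducer:)

    struct uffdio_range range = {
        .start = msg.arg.pagefault.address,
        .len = page_size,
    };

    /* wake the faulting thread without resolving the missing page */
    if (ioctl(uffd, UFFDIO_WAKE, &range) == -1) {
        fprintf(stderr, "UFFDIO_WAKE failed: %s\n", strerror(errno));
        exit(-1);
    }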

> 
>> If the behavior is changed, can
>>    we make this configurable via a UFFD_FEAT?
> 
> I still think that could be overkill, but I'll leave the
> discussion to the experts.

I'll be happy to hear what Andrea et al. think. At least I really want
to see the new behavior - and if it's not a fix, then I want some way to
detect whether a kernel has this new (fixed?) behavior.

Thanks a lot for this work!

-- 
Thanks,

David / dhildenb




* Re: [PATCH RESEND v6 00/16] mm: Page fault enhancements
  2020-03-07 20:33 ` David Hildenbrand
  2020-03-07 21:47   ` Peter Xu
@ 2020-03-08 12:49   ` David Hildenbrand
  1 sibling, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2020-03-08 12:49 UTC (permalink / raw)
  To: Peter Xu, linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Martin Cracauer, Linus Torvalds, Mike Rapoport,
	Kirill A . Shutemov, Johannes Weiner, Dr . David Alan Gilbert,
	Bobby Powers, Maya Gokhale, Jerome Glisse, Mike Kravetz,
	Matthew Wilcox, Marty McFadden, Mel Gorman, Hugh Dickins,
	Brian Geffon, Denis Plotnikov, Pavel Emelyanov

On 07.03.20 21:33, David Hildenbrand wrote:
> On 20.02.20 16:53, Peter Xu wrote:
>> [Resend v6]
>>
>> This is v6 of the series.  It is majorly a rebase to 5.6-rc2, nothing
>> else to be expected (plus some tests after the rebase).  Instead of
>> rewrite the cover letter I decided to use what we have for v5.
>>
>> Adding extra CCs for both Bobby Powers <bobbypowers@gmail.com> and
>> Brian Geffon <bgeffon@google.com>.
>>
>> Online repo: https://github.com/xzpeter/linux/tree/mm-pf-signal-retry
>>
>> Any review comment is appreciated.  Thanks,
> 
> If I am not completely missing something (and all my testing today
> wasn't wrong), there is a very simple reason why I *LOVE* this series,
> and it made my weekend: it makes userfaultfd with concurrent discarding
> (e.g., MADV_DONTNEED) of pages actually usable.
> 
> The issue in the current code is that between placing a page and waking
> up a waiter, somebody can zap the newly placed page and trigger a
> re-fault, raising a SIGBUS and crashing an application where all
> memory is supposed to be accessible. And there is no real way to protect
> against that, because it is not deterministic when the fault handler
> will be woken up and retried (e.g., making madvise(MADV_DONTNEED) and
> UFFDIO_ZEROPAGE mutually exclusive does not help).
> 
> Find a simple reproducer at the end of this mail.

See below for another test case. It's not immediately clear why we can
get a SIGBUS in this example - I thought UFFD_FEATURE_EVENT_REMOVE was
supposed to protect us from this. I think it's because the actual
discard is delayed until after the UFFD_EVENT_REMOVE has been processed,
when the next UFFDIO_ZEROPAGE takes place. Then we have the same race
as in the other test case.

It also showcases why using UFFD_FEATURE_EVENT_REMOVE in combination
with placing pages in a handler can easily deadlock.

Before this series: SIGBUS - FAULT_FLAG_ALLOW_RETRY missing 70
After this series: Keeps on running

But the general behavior of UFFD_FEATURE_EVENT_REMOVE is not
really useful in practice due to the possible deadlock that I work
around in this test. You can drop the read() etc. from the loop
to observe the deadlock.

---
#include <string.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <stdio.h>
#include <pthread.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <poll.h>
#include <linux/userfaultfd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>

static int page_size;

static void *fault_handler_thread(void *arg)
{
    const long uffd = (long) arg;
    struct pollfd pollfd = {
        .fd = uffd,
        .events = POLLIN,
    };
    int ret;

    while (true) {
        struct uffdio_zeropage zeropage = {};
        struct uffd_msg msg;
        ssize_t nread;

        if (poll(&pollfd, 1, -1) == -1) {
            fprintf(stderr, "POLL failed: %s\n", strerror(errno));
            exit(-1);
        }
        if (read(uffd, &msg, sizeof(msg)) != sizeof(msg)) {
            fprintf(stderr, "READ failed\n");
            exit(-1);
        }
        if (msg.event == UFFD_EVENT_REMOVE) {
            continue;
        }
        if (msg.event != UFFD_EVENT_PAGEFAULT) {
            fprintf(stderr, "Not UFFD_EVENT_PAGEFAULT\n");
            exit(-1);
        }

        zeropage.range.start = msg.arg.pagefault.address;
        zeropage.range.len = page_size;
        /*
         * The following loop would deadlock in case we get a MADV_DONTNEED
         * at the wrong time: UFFDIO_ZEROPAGE won't be able to make progress
         * until we have processed the UFFD_EVENT_REMOVE. So we manually have
         * to read and process the UFFD_EVENT_REMOVE. This is nasty, because
         * once we can get multiple UFFD_EVENT_PAGEFAULT events, this would
         * no longer be usable - we would get another pagefault request,
         * which we cannot process, and so on ...
         */
        do {
            ret = ioctl(uffd, UFFDIO_ZEROPAGE, &zeropage);
            if (ret && errno != EAGAIN) {
                fprintf(stderr, "UFFDIO_ZEROPAGE failed:%s\n", strerror(errno));
                exit(-1);
            }
            /*
             * We're expecting a UFFD_EVENT_REMOVE event. Sometimes we
             * get none ... ?
             */
            if (read(uffd, &msg, sizeof(msg)) != sizeof(msg)) {
                continue;
            }
            if (msg.event != UFFD_EVENT_REMOVE) {
                fprintf(stderr, "Not a remove event\n");
                continue;
            }
        } while (ret);
    }
}

static void *discard_thread(void *arg)
{
    while (true) {
        if (madvise(arg, page_size, MADV_DONTNEED)) {
            fprintf(stderr, "MADV_DONTNEED failed:%s\n", strerror(errno));
            exit(-1);
        }
        printf("Discard!\n");
        usleep(1000);
    }
}

int main(void)
{
    struct uffdio_register reg;
    struct uffdio_api api = {
        .api = UFFD_API,
        .features = UFFD_FEATURE_EVENT_REMOVE,
    };
    pthread_t fault, discard;
    long uffd;
    char *area;

    page_size = sysconf(_SC_PAGE_SIZE);

    uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd == -1) {
        fprintf(stderr, "Could not create uffd: %s\n", strerror(errno));
        exit(-1);
    }
    if (ioctl(uffd, UFFDIO_API, &api) == -1) {
        fprintf(stderr, "UFFDIO_API failed: %s\n", strerror(errno));
        exit(-1);
    }

    area = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (area == MAP_FAILED) {
        fprintf(stderr, "Could not allocate memory");
        exit(-1);
    }

    reg.range.start = (uint64_t) area;
    reg.range.len = page_size,
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1) {
        fprintf(stderr, "UFFDIO_REGISTER failed: %s\n", strerror(errno));
        exit(-1);
    }

    /* thread to provide zeropages */
    if (pthread_create(&fault, NULL, fault_handler_thread,
                       (void *) uffd)) {
        fprintf(stderr, "Could not create fault handing thread");
        exit(-1);
    }
    /* thread to discard the page */
    if (pthread_create(&discard, NULL, discard_thread,
                       (void *) area)) {
        fprintf(stderr, "Could not create discard thread");
        exit(-1);
    }

    /* keep reading/writing the page */
    while (true) {
        area[7] = area[1];
        usleep(10000);
        printf("Progress!\n");
    }
    return 0;
}


-- 
Thanks,

David / dhildenb




* Re: [PATCH RESEND v6 00/16] mm: Page fault enhancements
  2020-03-08 12:12     ` David Hildenbrand
@ 2020-03-09 19:51       ` Peter Xu
  2020-03-09 20:06         ` David Hildenbrand
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Xu @ 2020-03-09 19:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Martin Cracauer,
	Linus Torvalds, Mike Rapoport, Kirill A . Shutemov,
	Johannes Weiner, Dr . David Alan Gilbert, Bobby Powers,
	Maya Gokhale, Jerome Glisse, Mike Kravetz, Matthew Wilcox,
	Marty McFadden, Mel Gorman, Hugh Dickins, Brian Geffon,
	Denis Plotnikov, Pavel Emelyanov

On Sun, Mar 08, 2020 at 01:12:34PM +0100, David Hildenbrand wrote:
> [...]
> 
> > Yes, IIUC the race can happen like this in your test below:
> > 
> >      main thread          uffd thread             discard thread
> >      ===========          ===========             ==============
> >      access page
> >        uffd page fault
> >          wait for page
> >                           UFFDIO_ZEROPAGE
> >                             put a page P there
> >                                                   MADV_DONTNEED on P
> >                             wakeup main thread
> >          return from fault
> >        page still missing
> >        uffd page fault again
> >        (without ALLOW_RETRY)
> >        --> SIGBUS.
> 
> Exactly!
> 
> >> Can we please have a way to identify that this "feature" is available?
> >> I'd appreciate a new read-only UFFD_FEAT_, so we can detect this from
> >> user space easily and use concurrent discards without crashing our applications.
> > 
> > I'm not sure how others think about it, but to me this still falls
> > into the bucket of "solving an existing problem" rather than a
> > feature.  Also note that this changes the behavior of the page
> > fault logic in general, rather than being an uffd-only change. So I'm
> > also not sure whether UFFD_FEAT_* suits here even if we want it.
> 
> So, are we planning on backporting this to stable kernels?

I don't have a plan so far.  I'm still at the phase of only worrying
about whether it can at least be merged into master.. :)

I would think it won't be worth it to backport this to stable though,
considering that it could potentially change quite a bit in the faulting
procedures, and after all, the issues we're fixing shouldn't be common
for general users.

> 
> Imagine using this in QEMU/KVM to allow discards (e.g., balloon
> inflation) while postcopy is active . You certainly don't want random
> guest crashes. So either, we treat this as a fix (and backport) or as a
> change in behavior/feature.

I think we don't need to worry about that - QEMU has prohibited
ballooning during postcopy since day one.  Feel free to see QEMU
commit 371ff5a3f04cd7 ("Inhibit ballooning during postcopy").

> 
> [...]
> 
> >>
> >> 2. What will happen if I don't place a page on a pagefault, but only do a UFFDIO_WAKE?
> >>    For now we were able to trigger a signal this way.
> > 
> > If I'm not mistaken, UFFDIO_WAKE will directly trigger the SIGBUS
> > even without the help of the MADV_DONTNEED race.
> 
> Yes, that's the current way of injecting a SIGBUS instead of resolving
> the pagefault. And AFAIK, you're changing that behavior. (I am not
> aware of a user; there could be use cases, but somehow it's strange to
> get a signal when accessing memory that is mapped READ|WRITE and also
> represented like this in, e.g., /proc/$PID/maps.) So IMHO, only the new
> behavior really makes sense.

I agree. I'm not sure how other people think about ABI stability,
but... my own preference is not to worry much about ABI breakage for a
problem like this.

Thanks,

-- 
Peter Xu




* Re: [PATCH RESEND v6 00/16] mm: Page fault enhancements
  2020-03-09 19:51       ` Peter Xu
@ 2020-03-09 20:06         ` David Hildenbrand
  0 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2020-03-09 20:06 UTC (permalink / raw)
  To: Peter Xu, Andrea Arcangeli
  Cc: David Hildenbrand, linux-mm, linux-kernel, Andrea Arcangeli,
	Martin Cracauer, Linus Torvalds, Mike Rapoport,
	Kirill A . Shutemov, Johannes Weiner, Dr . David Alan Gilbert,
	Bobby Powers, Maya Gokhale, Jerome Glisse, Mike Kravetz,
	Matthew Wilcox, Marty McFadden, Mel Gorman, Hugh Dickins,
	Brian Geffon, Denis Plotnikov, Pavel Emelyanov



> Am 09.03.2020 um 20:51 schrieb Peter Xu <peterx@redhat.com>:
> 
> On Sun, Mar 08, 2020 at 01:12:34PM +0100, David Hildenbrand wrote:
>> [...]
>> 
>>> Yes, IIUC the race can happen like this in your test below:
>>> 
>>>     main thread          uffd thread             discard thread
>>>     ===========          ===========             ==============
>>>     access page
>>>       uffd page fault
>>>         wait for page
>>>                          UFFDIO_ZEROPAGE
>>>                            put a page P there
>>>                                                  MADV_DONTNEED on P
>>>                            wakeup main thread
>>>         return from fault
>>>       page still missing
>>>       uffd page fault again
>>>       (without ALLOW_RETRY)
>>>       --> SIGBUS.
>> 
>> Exactly!
>> 
>>>> Can we please have a way to identify that this "feature" is available?
>>>> I'd appreciate a new read-only UFFD_FEAT_, so we can detect this from
>>>> user space easily and use concurrent discards without crashing our applications.
>>> 
>>> I'm not sure how others think about it, but to me this still falls
>>> into the bucket of "solving an existing problem" rather than a
>>> feature.  Also note that this changes the behavior of the page
>>> fault logic in general, rather than being an uffd-only change. So I'm
>>> also not sure whether UFFD_FEAT_* suits here even if we want it.
>> 
>> So, are we planning on backporting this to stable kernels?
> 
> I don't have a plan so far.  I'm still at the phase of only worrying
> about whether it can at least be merged into master.. :)
> 
> I would think it won't be worth it to backport this to stable though,
> considering that it could potentially change quite a bit in the faulting
> procedures, and after all, the issues we're fixing shouldn't be common
> for general users.
> 
>> 
>> Imagine using this in QEMU/KVM to allow discards (e.g., balloon
>> inflation) while postcopy is active. You certainly don't want random
>> guest crashes. So either we treat this as a fix (and backport it), or
>> as a change in behavior/feature.
> 
> I think we don't need to worry about that - QEMU has prohibited
> ballooning during postcopy since day one.  Feel free to see QEMU
> commit 371ff5a3f04cd7 ("Inhibit ballooning during postcopy").

Imagine I want to change that, or imagine I have another user that heavily depends on such races never happening.

IOW, I want to know for sure whether my application can crash or not.

@Andrea, what are your thoughts on a new feature flag to identify this behavior?



end of thread

Thread overview: 31+ messages
2020-02-20 15:53 [PATCH RESEND v6 00/16] mm: Page fault enhancements Peter Xu
2020-02-20 15:53 ` [PATCH RESEND v6 01/16] mm/gup: Rename "nonblocking" to "locked" where proper Peter Xu
2020-02-20 15:53 ` [PATCH RESEND v6 02/16] mm/gup: Fix __get_user_pages() on fault retry of hugetlb Peter Xu
2020-03-02 19:02   ` David Hildenbrand
2020-03-02 20:07     ` Peter Xu
2020-03-02 20:22       ` David Hildenbrand
2020-02-20 15:53 ` [PATCH RESEND v6 03/16] mm: Introduce fault_signal_pending() Peter Xu
2020-03-02 19:04   ` David Hildenbrand
2020-02-20 15:53 ` [PATCH RESEND v6 04/16] x86/mm: Use helper fault_signal_pending() Peter Xu
2020-02-20 15:58 ` [PATCH RESEND v6 05/16] arc/mm: " Peter Xu
2020-02-20 15:59 ` [PATCH RESEND v6 06/16] arm64/mm: " Peter Xu
2020-02-20 16:02 ` [PATCH RESEND v6 07/16] powerpc/mm: " Peter Xu
2020-02-20 16:02 ` [PATCH RESEND v6 08/16] sh/mm: " Peter Xu
2020-02-20 16:02 ` [PATCH RESEND v6 09/16] mm: Return faster for non-fatal signals in user mode faults Peter Xu
2020-02-20 16:02 ` [PATCH RESEND v6 10/16] userfaultfd: Don't retake mmap_sem to emulate NOPAGE Peter Xu
2020-02-20 16:02 ` [PATCH RESEND v6 11/16] mm: Introduce FAULT_FLAG_DEFAULT Peter Xu
2020-02-20 16:02 ` [PATCH RESEND v6 13/16] mm: Allow VM_FAULT_RETRY for multiple times Peter Xu
2020-02-20 16:02 ` [PATCH RESEND v6 15/16] mm/gup: Allow to react to fatal signals Peter Xu
2020-02-20 16:03 ` [PATCH RESEND v6 16/16] mm/userfaultfd: Honor FAULT_FLAG_KILLABLE in fault path Peter Xu
2020-02-20 19:53 ` [PATCH RESEND v6 12/16] mm: Introduce FAULT_FLAG_INTERRUPTIBLE Peter Xu
2020-02-20 19:53 ` [PATCH RESEND v6 14/16] mm/gup: Allow VM_FAULT_RETRY for multiple times Peter Xu
2020-02-21 19:26 ` [PATCH RESEND v6 00/16] mm: Page fault enhancements Brian Geffon
2020-03-02 17:31   ` Peter Xu
2020-02-21 19:32 ` Linus Torvalds
2020-02-21 20:11   ` Peter Xu
2020-03-07 20:33 ` David Hildenbrand
2020-03-07 21:47   ` Peter Xu
2020-03-08 12:12     ` David Hildenbrand
2020-03-09 19:51       ` Peter Xu
2020-03-09 20:06         ` David Hildenbrand
2020-03-08 12:49   ` David Hildenbrand
