* [RFC 0/8] userfaultfd: add write protect support
@ 2015-11-19 22:33 Shaohua Li
  2015-11-19 22:33 ` [RFC 1/8] userfaultfd: add helper for writeprotect check Shaohua Li
                   ` (8 more replies)
  0 siblings, 9 replies; 13+ messages in thread
From: Shaohua Li @ 2015-11-19 22:33 UTC (permalink / raw)
  To: linux-mm; +Cc: kernel-team, Andrew Morton

Hi,

There has been a plan to support write protect faults in userfaultfd, but it
has not been implemented yet. I'm working on a library that supports different
types of buffers, such as compressed buffers and file-backed buffers,
something like a page cache implementation in userspace. The buffer enables
userfaultfd and does something like decompression in the userfault handler.
When memory size exceeds a threshold, madvise is used to reclaim memory. The
problem is that without write protection support, data can be corrupted during
reclaim.

For example, in the compressed buffer case, reclaim does:
1. compress the memory range and store the compressed data elsewhere
2. madvise the memory range

But if the memory is changed between steps 1 and 2, the new change is lost.
Memory write protection can solve the issue. With it, reclaim does:
1. write protect the memory range
2. compress the memory range and store the compressed data elsewhere
3. madvise the memory range
4. undo write protection on the memory range and wake up tasks waiting in
write protect faults.
If a task changes memory before step 3, a write protect fault is triggered; we
can, for example, put the task to sleep until step 4 runs. This way memory
changes are not lost.

This patch set adds write protect support to userfaultfd. One issue is that a
write protect fault can happen even when write protection was never enabled
through userfaultfd, for example on a write to an address backed by the zero
page. There is no way to distinguish such a fault from one the application
expects, so this patch set blindly forwards every write protect fault to
userspace whenever the corresponding vma has VM_UFFD_WP set. Applications
should be prepared to handle such write protect faults.
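This caveat shapes what a fault handler must do: any write fault in a
VM_UFFD_WP vma may be reported, not just the ones reclaim armed. A minimal
classification helper is sketched below; note that UFFD_PAGEFAULT_FLAG_WP is a
hypothetical local name, since the uapi diffs in this series don't show which
message flag marks a wp fault (UFFD_EVENT_PAGEFAULT matches the existing uapi):

```c
#include <stdint.h>

/* UFFD_EVENT_PAGEFAULT matches the existing userfaultfd uapi; the WP flag
 * below is a hypothetical local name, not taken from this series' diffs. */
#define UFFD_EVENT_PAGEFAULT	0x12
#define UFFD_PAGEFAULT_FLAG_WP	((uint64_t)1 << 0)

/*
 * Classify a message read from the userfaultfd file descriptor.  A wp
 * fault needs no page to be filled in: the faulting task stays asleep
 * until reclaim undoes the protection and wakes it, so the handler only
 * has to avoid confusing it with a missing-page fault.
 */
static int is_wp_fault(uint8_t event, uint64_t flags)
{
	return event == UFFD_EVENT_PAGEFAULT &&
	       (flags & UFFD_PAGEFAULT_FLAG_WP) != 0;
}
```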

Thanks,
Shaohua


Shaohua Li (8):
  userfaultfd: add helper for writeprotect check
  userfaultfd: support write protection for userfault vma range
  userfaultfd: expose writeprotect API to ioctl
  userfaultfd: allow userfaultfd register success with writeprotection
  userfaultfd: undo write protection in unregister
  userfaultfd: hook userfault handler to write protection fault
  userfaultfd: fault try one more time
  userfaultfd: enable write protection in userfaultfd API

 arch/alpha/mm/fault.c            |  8 ++++-
 arch/arc/mm/fault.c              |  8 ++++-
 arch/arm/mm/fault.c              |  8 ++++-
 arch/arm64/mm/fault.c            |  8 ++++-
 arch/avr32/mm/fault.c            |  8 ++++-
 arch/cris/mm/fault.c             |  8 ++++-
 arch/hexagon/mm/vm_fault.c       |  8 ++++-
 arch/ia64/mm/fault.c             |  8 ++++-
 arch/m68k/mm/fault.c             |  8 ++++-
 arch/metag/mm/fault.c            |  8 ++++-
 arch/microblaze/mm/fault.c       |  8 ++++-
 arch/mips/mm/fault.c             |  8 ++++-
 arch/mn10300/mm/fault.c          |  8 ++++-
 arch/nios2/mm/fault.c            |  8 ++++-
 arch/openrisc/mm/fault.c         |  8 ++++-
 arch/parisc/mm/fault.c           |  8 ++++-
 arch/powerpc/mm/fault.c          |  8 ++++-
 arch/s390/mm/fault.c             |  9 +++++-
 arch/sh/mm/fault.c               |  8 ++++-
 arch/sparc/mm/fault_32.c         |  8 ++++-
 arch/sparc/mm/fault_64.c         |  8 ++++-
 arch/tile/mm/fault.c             |  8 ++++-
 arch/um/kernel/trap.c            |  8 ++++-
 arch/unicore32/mm/fault.c        |  8 ++++-
 arch/x86/mm/fault.c              |  9 +++++-
 arch/xtensa/mm/fault.c           |  8 ++++-
 fs/userfaultfd.c                 | 66 ++++++++++++++++++++++++++++++++++------
 include/linux/mm.h               |  3 +-
 include/linux/userfaultfd_k.h    | 12 ++++++++
 include/uapi/linux/userfaultfd.h | 17 +++++++++--
 mm/memory.c                      | 66 +++++++++++++++++++++++++++++-----------
 mm/userfaultfd.c                 | 52 +++++++++++++++++++++++++++++++
 32 files changed, 369 insertions(+), 57 deletions(-)

-- 
2.4.6

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* [RFC 1/8] userfaultfd: add helper for writeprotect check
  2015-11-19 22:33 [RFC 0/8] userfaultfd: add write protect support Shaohua Li
@ 2015-11-19 22:33 ` Shaohua Li
  2015-11-19 22:33 ` [RFC 2/8] userfaultfd: support write protection for userfault vma range Shaohua Li
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2015-11-19 22:33 UTC (permalink / raw)
  To: linux-mm
  Cc: kernel-team, Andrew Morton, Andrea Arcangeli, Pavel Emelyanov,
	Rik van Riel, Kirill A. Shutemov, Mel Gorman, Hugh Dickins,
	Johannes Weiner

Add a helper that checks whether write protection is enabled on a vma. It
will be used by later patches.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 include/linux/userfaultfd_k.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 587480a..4e22d11 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -48,6 +48,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_UFFD_MISSING;
 }
 
+static inline bool userfaultfd_wp(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & VM_UFFD_WP;
+}
+
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
 	return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
@@ -75,6 +80,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
 	return false;
 }
 
+static inline bool userfaultfd_wp(struct vm_area_struct *vma)
+{
+	return false;
+}
+
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
 	return false;
-- 
2.4.6



* [RFC 2/8] userfaultfd: support write protection for userfault vma range
  2015-11-19 22:33 [RFC 0/8] userfaultfd: add write protect support Shaohua Li
  2015-11-19 22:33 ` [RFC 1/8] userfaultfd: add helper for writeprotect check Shaohua Li
@ 2015-11-19 22:33 ` Shaohua Li
  2016-04-14 21:07   ` Andrea Arcangeli
  2015-11-19 22:33 ` [RFC 3/8] userfaultfd: expose writeprotect API to ioctl Shaohua Li
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 13+ messages in thread
From: Shaohua Li @ 2015-11-19 22:33 UTC (permalink / raw)
  To: linux-mm
  Cc: kernel-team, Andrew Morton, Andrea Arcangeli, Pavel Emelyanov,
	Rik van Riel, Kirill A. Shutemov, Mel Gorman, Hugh Dickins,
	Johannes Weiner

Add an API to enable or disable write protection on a vma range. Unlike
mprotect, this doesn't split or merge vmas.
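The sanity checks at the top of mwriteprotect_range() below boil down to
requiring a page-aligned, non-empty, non-wrapping range. As a sketch, a
userspace caller of the eventual ioctl could pre-validate the same way; the
predicate below mirrors those checks, with a 4096-byte page size assumed for
illustration:

```c
#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

/* Mirrors the BUG_ON() checks at the top of mwriteprotect_range(): the
 * range must be page aligned and must not be empty or wrap around. */
static int wp_range_valid(unsigned long start, unsigned long len)
{
	if (start & ~SKETCH_PAGE_MASK)	/* start not page aligned */
		return 0;
	if (len & ~SKETCH_PAGE_MASK)	/* length not page aligned */
		return 0;
	if (start + len <= start)	/* zero-sized span, or wraps */
		return 0;
	return 1;
}
```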

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 include/linux/userfaultfd_k.h |  2 ++
 mm/userfaultfd.c              | 52 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 4e22d11..44f5b09 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -35,6 +35,8 @@ extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
 extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
 			      unsigned long dst_start,
 			      unsigned long len);
+extern int mwriteprotect_range(struct mm_struct *dst_mm,
+		unsigned long start, unsigned long len, bool enable_wp);
 
 /* mm helpers */
 static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 77fee93..3e97e24 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -306,3 +306,55 @@ ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
 {
 	return __mcopy_atomic(dst_mm, start, 0, len, true);
 }
+
+int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
+	unsigned long len, bool enable_wp)
+{
+	struct vm_area_struct *dst_vma;
+	pgprot_t newprot;
+	int err;
+
+	/*
+	 * Sanitize the command parameters:
+	 */
+	BUG_ON(start & ~PAGE_MASK);
+	BUG_ON(len & ~PAGE_MASK);
+
+	/* Does the address range wrap, or is the span zero-sized? */
+	BUG_ON(start + len <= start);
+
+	down_read(&dst_mm->mmap_sem);
+
+	/*
+	 * Make sure the vma is not shared, that the dst range is
+	 * both valid and fully within a single existing vma.
+	 */
+	err = -EINVAL;
+	dst_vma = find_vma(dst_mm, start);
+	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
+		goto out_unlock;
+	if (start < dst_vma->vm_start ||
+	    start + len > dst_vma->vm_end)
+		goto out_unlock;
+
+	if (!dst_vma->vm_userfaultfd_ctx.ctx)
+		goto out_unlock;
+	if (!userfaultfd_wp(dst_vma))
+		goto out_unlock;
+
+	if (dst_vma->vm_ops)
+		goto out_unlock;
+
+	if (enable_wp)
+		newprot = vm_get_page_prot(dst_vma->vm_flags & ~(VM_WRITE));
+	else
+		newprot = vm_get_page_prot(dst_vma->vm_flags);
+
+	change_protection(dst_vma, start, start + len, newprot,
+				!enable_wp, 0);
+
+	err = 0;
+out_unlock:
+	up_read(&dst_mm->mmap_sem);
+	return err;
+}
-- 
2.4.6



* [RFC 3/8] userfaultfd: expose writeprotect API to ioctl
  2015-11-19 22:33 [RFC 0/8] userfaultfd: add write protect support Shaohua Li
  2015-11-19 22:33 ` [RFC 1/8] userfaultfd: add helper for writeprotect check Shaohua Li
  2015-11-19 22:33 ` [RFC 2/8] userfaultfd: support write protection for userfault vma range Shaohua Li
@ 2015-11-19 22:33 ` Shaohua Li
  2015-11-19 22:33 ` [RFC 4/8] userfaultfd: allow userfaultfd register success with writeprotection Shaohua Li
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2015-11-19 22:33 UTC (permalink / raw)
  To: linux-mm
  Cc: kernel-team, Andrew Morton, Andrea Arcangeli, Pavel Emelyanov,
	Rik van Riel, Kirill A. Shutemov, Mel Gorman, Hugh Dickins,
	Johannes Weiner

Add the writeprotect API to the userfaultfd ioctl interface.
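As a sketch, the mode validation this ioctl performs can be summarized as a
standalone predicate; the two mode bits are local copies of the ones defined
in struct uffdio_writeprotect below. Unknown bits are rejected, and WP
combined with DONTWAKE is rejected, since DONTWAKE is only meaningful when
undoing write protection:

```c
#include <stdint.h>

/* Local copies of the mode bits from struct uffdio_writeprotect. */
#define WP_MODE_WP	 ((uint64_t)1 << 0)
#define WP_MODE_DONTWAKE ((uint64_t)1 << 1)

/* Mirrors userfaultfd_writeprotect()'s mode validation: reject unknown
 * bits, and reject DONTWAKE combined with WP, because waiters are only
 * woken when protection is being undone. */
static int wp_mode_valid(uint64_t mode)
{
	if (mode & ~(WP_MODE_WP | WP_MODE_DONTWAKE))
		return 0;
	if ((mode & WP_MODE_WP) && (mode & WP_MODE_DONTWAKE))
		return 0;
	return 1;
}
```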

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 fs/userfaultfd.c                 | 45 ++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/userfaultfd.h | 10 +++++++++
 2 files changed, 55 insertions(+)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 5031170..eaa5086 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1122,6 +1122,49 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
 	return ret;
 }
 
+static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
+				    unsigned long arg)
+{
+	int ret;
+	struct uffdio_writeprotect uffdio_wp;
+	struct uffdio_writeprotect __user *user_uffdio_wp;
+	struct userfaultfd_wake_range range;
+
+	user_uffdio_wp = (struct uffdio_writeprotect __user *) arg;
+
+	if (copy_from_user(&uffdio_wp, user_uffdio_wp,
+			   sizeof(struct uffdio_writeprotect)))
+		return -EFAULT;
+
+	ret = validate_range(ctx->mm, uffdio_wp.range.start,
+			     uffdio_wp.range.len);
+	if (ret)
+		return ret;
+
+	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
+			       UFFDIO_WRITEPROTECT_MODE_WP))
+		return -EINVAL;
+	if ((uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP) &&
+	     (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE))
+		return -EINVAL;
+
+	if (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP)
+		ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
+			uffdio_wp.range.len, true);
+	else
+		ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
+			uffdio_wp.range.len, false);
+	if (ret)
+		return ret;
+
+	if (!(uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE)) {
+		range.start = uffdio_wp.range.start;
+		range.len = uffdio_wp.range.len;
+		wake_userfault(ctx, &range);
+	}
+	return ret;
+}
+
 /*
  * userland asks for a certain API version and we return which bits
  * and ioctl commands are implemented in this kernel for such API
@@ -1186,6 +1229,8 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
 	case UFFDIO_ZEROPAGE:
 		ret = userfaultfd_zeropage(ctx, arg);
 		break;
+	case UFFDIO_WRITEPROTECT:
+		ret = userfaultfd_writeprotect(ctx, arg);
 	}
 	return ret;
 }
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 9057d7a..8898bd7 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -40,6 +40,7 @@
 #define _UFFDIO_WAKE			(0x02)
 #define _UFFDIO_COPY			(0x03)
 #define _UFFDIO_ZEROPAGE		(0x04)
+#define _UFFDIO_WRITEPROTECT		(0x05)
 #define _UFFDIO_API			(0x3F)
 
 /* userfaultfd ioctl ids */
@@ -56,6 +57,8 @@
 				      struct uffdio_copy)
 #define UFFDIO_ZEROPAGE		_IOWR(UFFDIO, _UFFDIO_ZEROPAGE,	\
 				      struct uffdio_zeropage)
+#define UFFDIO_WRITEPROTECT	_IOWR(UFFDIO, _UFFDIO_WRITEPROTECT, \
+				      struct uffdio_writeprotect)
 
 /* read() structure */
 struct uffd_msg {
@@ -164,4 +167,11 @@ struct uffdio_zeropage {
 	__s64 zeropage;
 };
 
+struct uffdio_writeprotect {
+	struct uffdio_range range;
+	/* !WP means undo writeprotect. DONTWAKE is valid only with !WP */
+#define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
+#define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
+	__u64 mode;
+};
 #endif /* _LINUX_USERFAULTFD_H */
-- 
2.4.6



* [RFC 4/8] userfaultfd: allow userfaultfd register success with writeprotection
  2015-11-19 22:33 [RFC 0/8] userfaultfd: add write protect support Shaohua Li
                   ` (2 preceding siblings ...)
  2015-11-19 22:33 ` [RFC 3/8] userfaultfd: expose writeprotect API to ioctl Shaohua Li
@ 2015-11-19 22:33 ` Shaohua Li
  2015-11-19 22:33 ` [RFC 5/8] userfaultfd: undo write protection in unregister Shaohua Li
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2015-11-19 22:33 UTC (permalink / raw)
  To: linux-mm
  Cc: kernel-team, Andrew Morton, Andrea Arcangeli, Pavel Emelyanov,
	Rik van Riel, Kirill A. Shutemov, Mel Gorman, Hugh Dickins,
	Johannes Weiner

The userfaultfd register ioctl currently fails with -EINVAL when write
protect mode is requested. Allow such registrations to succeed.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 fs/userfaultfd.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index eaa5086..12176b5 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -236,6 +236,8 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 	 */
 	if (pte_none(*pte))
 		ret = true;
+	if (!pte_write(*pte) && (reason & VM_UFFD_WP))
+		ret = true;
 	pte_unmap(pte);
 
 out:
@@ -736,15 +738,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 	vm_flags = 0;
 	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING)
 		vm_flags |= VM_UFFD_MISSING;
-	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP) {
+	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP)
 		vm_flags |= VM_UFFD_WP;
-		/*
-		 * FIXME: remove the below error constraint by
-		 * implementing the wprotect tracking mode.
-		 */
-		ret = -EINVAL;
-		goto out;
-	}
 
 	ret = validate_range(mm, uffdio_register.range.start,
 			     uffdio_register.range.len);
@@ -784,6 +779,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 		ret = -EINVAL;
 		if (cur->vm_ops)
 			goto out_unlock;
+		if ((vm_flags & VM_UFFD_WP) && !(cur->vm_flags & VM_WRITE))
+			goto out_unlock;
 
 		/*
 		 * Check that this vma isn't already owned by a
-- 
2.4.6



* [RFC 5/8] userfaultfd: undo write protection in unregister
  2015-11-19 22:33 [RFC 0/8] userfaultfd: add write protect support Shaohua Li
                   ` (3 preceding siblings ...)
  2015-11-19 22:33 ` [RFC 4/8] userfaultfd: allow userfaultfd register success with writeprotection Shaohua Li
@ 2015-11-19 22:33 ` Shaohua Li
  2015-11-19 22:33 ` [RFC 6/8] userfaultfd: hook userfault handler to write protection fault Shaohua Li
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2015-11-19 22:33 UTC (permalink / raw)
  To: linux-mm
  Cc: kernel-team, Andrew Morton, Andrea Arcangeli, Pavel Emelyanov,
	Rik van Riel, Kirill A. Shutemov, Mel Gorman, Hugh Dickins,
	Johannes Weiner

When a range is unregistered from userfaultfd, make sure write permission is
restored in any ptes that were write protected.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 fs/userfaultfd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 12176b5..c79a3fd 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -953,6 +953,9 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 		if (vma->vm_start > start)
 			start = vma->vm_start;
 		vma_end = min(end, vma->vm_end);
+		if (userfaultfd_wp(vma))
+			change_protection(vma, start, vma_end,
+				vm_get_page_prot(vma->vm_flags), 1, 0);
 
 		new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
 		prev = vma_merge(mm, prev, start, vma_end, new_flags,
-- 
2.4.6



* [RFC 6/8] userfaultfd: hook userfault handler to write protection fault
  2015-11-19 22:33 [RFC 0/8] userfaultfd: add write protect support Shaohua Li
                   ` (4 preceding siblings ...)
  2015-11-19 22:33 ` [RFC 5/8] userfaultfd: undo write protection in unregister Shaohua Li
@ 2015-11-19 22:33 ` Shaohua Li
  2015-11-20  2:54   ` Jerome Glisse
  2015-11-19 22:33 ` [RFC 7/8] userfaultfd: fault try one more time Shaohua Li
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 13+ messages in thread
From: Shaohua Li @ 2015-11-19 22:33 UTC (permalink / raw)
  To: linux-mm
  Cc: kernel-team, Andrew Morton, Andrea Arcangeli, Pavel Emelyanov,
	Rik van Riel, Kirill A. Shutemov, Mel Gorman, Hugh Dickins,
	Johannes Weiner

A write protection fault can happen in several cases: a write to the zero
page, to a swapped-out page, or to a userfault write protected page. When the
fault happens, there is no way to know whether userspace write protected the
page beforehand. Here we blindly issue a userfault notification for any vma
with VM_UFFD_WP set, regardless of whether the application actually write
protected the page. Applications should be ready to handle such wp faults.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 mm/memory.c | 66 +++++++++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 49 insertions(+), 17 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index deb679c..5d16a31 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1994,10 +1994,11 @@ static inline int wp_page_reuse(struct mm_struct *mm,
 			struct vm_area_struct *vma, unsigned long address,
 			pte_t *page_table, spinlock_t *ptl, pte_t orig_pte,
 			struct page *page, int page_mkwrite,
-			int dirty_shared)
+			int dirty_shared, unsigned int flags)
 	__releases(ptl)
 {
 	pte_t entry;
+	bool do_uffd = false;
 	/*
 	 * Clear the pages cpupid information as the existing
 	 * information potentially belongs to a now completely
@@ -2008,10 +2009,16 @@ static inline int wp_page_reuse(struct mm_struct *mm,
 
 	flush_cache_page(vma, address, pte_pfn(orig_pte));
 	entry = pte_mkyoung(orig_pte);
-	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+	if (userfaultfd_wp(vma) && page) {
+		entry = pte_mkdirty(entry);
+		do_uffd = true;
+	} else
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 	if (ptep_set_access_flags(vma, address, page_table, entry, 1))
 		update_mmu_cache(vma, address, page_table);
 	pte_unmap_unlock(page_table, ptl);
+	if (do_uffd)
+		return handle_userfault(vma, address, flags, VM_UFFD_WP);
 
 	if (dirty_shared) {
 		struct address_space *mapping;
@@ -2059,7 +2066,7 @@ static inline int wp_page_reuse(struct mm_struct *mm,
  */
 static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, pte_t *page_table, pmd_t *pmd,
-			pte_t orig_pte, struct page *old_page)
+			pte_t orig_pte, struct page *old_page, unsigned int flags)
 {
 	struct page *new_page = NULL;
 	spinlock_t *ptl = NULL;
@@ -2068,6 +2075,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 	const unsigned long mmun_start = address & PAGE_MASK;	/* For mmu_notifiers */
 	const unsigned long mmun_end = mmun_start + PAGE_SIZE;	/* For mmu_notifiers */
 	struct mem_cgroup *memcg;
+	bool do_uffd = false;
 
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
@@ -2105,7 +2113,15 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 		flush_cache_page(vma, address, pte_pfn(orig_pte));
 		entry = mk_pte(new_page, vma->vm_page_prot);
-		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		/*
+		 * there is no way to know if we should do writeprotect here,
+		 * force a writeprotect
+		 */
+		if (userfaultfd_wp(vma)) {
+			entry = pte_mkdirty(entry);
+			do_uffd = true;
+		} else
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		/*
 		 * Clear the pte entry and flush it first, before updating the
 		 * pte with the new entry. This will avoid a race condition
@@ -2173,6 +2189,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 		page_cache_release(old_page);
 	}
+	if (do_uffd)
+		return handle_userfault(vma, address, flags, VM_UFFD_WP);
 	return page_copied ? VM_FAULT_WRITE : 0;
 oom_free_new:
 	page_cache_release(new_page);
@@ -2189,7 +2207,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 static int wp_pfn_shared(struct mm_struct *mm,
 			struct vm_area_struct *vma, unsigned long address,
 			pte_t *page_table, spinlock_t *ptl, pte_t orig_pte,
-			pmd_t *pmd)
+			pmd_t *pmd, unsigned int flags)
 {
 	if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
 		struct vm_fault vmf = {
@@ -2215,13 +2233,13 @@ static int wp_pfn_shared(struct mm_struct *mm,
 		}
 	}
 	return wp_page_reuse(mm, vma, address, page_table, ptl, orig_pte,
-			     NULL, 0, 0);
+			     NULL, 0, 0, flags);
 }
 
 static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
 			  unsigned long address, pte_t *page_table,
 			  pmd_t *pmd, spinlock_t *ptl, pte_t orig_pte,
-			  struct page *old_page)
+			  struct page *old_page, unsigned int flags)
 	__releases(ptl)
 {
 	int page_mkwrite = 0;
@@ -2261,7 +2279,7 @@ static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	return wp_page_reuse(mm, vma, address, page_table, ptl,
-			     orig_pte, old_page, page_mkwrite, 1);
+			     orig_pte, old_page, page_mkwrite, 1, flags);
 }
 
 /*
@@ -2284,7 +2302,7 @@ static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
  */
 static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, pte_t *page_table, pmd_t *pmd,
-		spinlock_t *ptl, pte_t orig_pte)
+		spinlock_t *ptl, pte_t orig_pte, unsigned int flags)
 	__releases(ptl)
 {
 	struct page *old_page;
@@ -2301,11 +2319,11 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
 				     (VM_WRITE|VM_SHARED))
 			return wp_pfn_shared(mm, vma, address, page_table, ptl,
-					     orig_pte, pmd);
+					     orig_pte, pmd, flags);
 
 		pte_unmap_unlock(page_table, ptl);
 		return wp_page_copy(mm, vma, address, page_table, pmd,
-				    orig_pte, old_page);
+				    orig_pte, old_page, flags);
 	}
 
 	/*
@@ -2336,13 +2354,13 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			page_move_anon_rmap(old_page, vma, address);
 			unlock_page(old_page);
 			return wp_page_reuse(mm, vma, address, page_table, ptl,
-					     orig_pte, old_page, 0, 0);
+					     orig_pte, old_page, 0, 0, flags);
 		}
 		unlock_page(old_page);
 	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
 					(VM_WRITE|VM_SHARED))) {
 		return wp_page_shared(mm, vma, address, page_table, pmd,
-				      ptl, orig_pte, old_page);
+				      ptl, orig_pte, old_page, flags);
 	}
 
 	/*
@@ -2352,7 +2370,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	pte_unmap_unlock(page_table, ptl);
 	return wp_page_copy(mm, vma, address, page_table, pmd,
-			    orig_pte, old_page);
+			    orig_pte, old_page, flags);
 }
 
 static void unmap_mapping_range_vma(struct vm_area_struct *vma,
@@ -2455,6 +2473,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	int locked;
 	int exclusive = 0;
 	int ret = 0;
+	bool do_uffd = false;
 
 	if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
 		goto out;
@@ -2559,7 +2578,15 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	dec_mm_counter_fast(mm, MM_SWAPENTS);
 	pte = mk_pte(page, vma->vm_page_prot);
 	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
-		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+		/*
+		 * there is no way to know if we should do writeprotect here,
+		 * force a writeprotect
+		 */
+		if (userfaultfd_wp(vma)) {
+			pte = pte_mkdirty(pte);
+			do_uffd = true;
+		} else
+			pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
 		exclusive = 1;
@@ -2595,7 +2622,8 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	if (flags & FAULT_FLAG_WRITE) {
-		ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
+		ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl,
+					pte, flags);
 		if (ret & VM_FAULT_ERROR)
 			ret &= VM_FAULT_ERROR;
 		goto out;
@@ -2603,6 +2631,10 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(vma, address, page_table);
+	if (do_uffd) {
+		pte_unmap_unlock(page_table, ptl);
+		return handle_userfault(vma, address, flags, VM_UFFD_WP);
+	}
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 out:
@@ -3309,7 +3341,7 @@ static int handle_pte_fault(struct mm_struct *mm,
 	if (flags & FAULT_FLAG_WRITE) {
 		if (!pte_write(entry))
 			return do_wp_page(mm, vma, address,
-					pte, pmd, ptl, entry);
+					pte, pmd, ptl, entry, flags);
 		entry = pte_mkdirty(entry);
 	}
 	entry = pte_mkyoung(entry);
-- 
2.4.6



* [RFC 7/8] userfaultfd: fault try one more time
  2015-11-19 22:33 [RFC 0/8] userfaultfd: add write protect support Shaohua Li
                   ` (5 preceding siblings ...)
  2015-11-19 22:33 ` [RFC 6/8] userfaultfd: hook userfault handler to write protection fault Shaohua Li
@ 2015-11-19 22:33 ` Shaohua Li
  2015-11-20  3:04   ` Jerome Glisse
  2015-11-19 22:33 ` [RFC 8/8] userfaultfd: enable write protection in userfaultfd API Shaohua Li
  2015-11-20  3:13 ` [RFC 0/8] userfaultfd: add write protect support Jerome Glisse
  8 siblings, 1 reply; 13+ messages in thread
From: Shaohua Li @ 2015-11-19 22:33 UTC (permalink / raw)
  To: linux-mm
  Cc: kernel-team, Andrew Morton, Andrea Arcangeli, Pavel Emelyanov,
	Rik van Riel, Kirill A. Shutemov, Mel Gorman, Hugh Dickins,
	Johannes Weiner

For a write fault on swapped-out memory, the fault handler already retries
once to read the page in, so userfaultfd cannot use that retry and the fault
would fail. Give userfaultfd one extra retry in this case. gup isn't fixed
yet, so it will return -EBUSY.
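The per-arch retry dance below follows one pattern: retry only if both the
allow flag and the fault result bit are set, and clear the allow flag before
retrying so the loop runs at most once more. A sketch of that logic with
hypothetical bit values (the real ones live in include/linux/mm.h) and a tiny
stand-in fault handler:

```c
/* Hypothetical bit values; the real flags live in include/linux/mm.h. */
#define FAULT_FLAG_ALLOW_UFFD_RETRY	0x100
#define VM_FAULT_UFFD_RETRY		0x200

/* Mirrors the per-arch retry logic: retry at most once, because the
 * allow flag is cleared before jumping back. */
static int fault_with_uffd_retry(int (*do_fault)(unsigned int flags))
{
	unsigned int flags = FAULT_FLAG_ALLOW_UFFD_RETRY;
	int fault;

retry:
	fault = do_fault(flags);
	if ((flags & FAULT_FLAG_ALLOW_UFFD_RETRY) &&
	    (fault & VM_FAULT_UFFD_RETRY)) {
		flags &= ~FAULT_FLAG_ALLOW_UFFD_RETRY;
		goto retry;
	}
	return fault;
}

/* Stand-in fault handler for demonstration: always asks for a uffd
 * retry, and counts how often it is invoked. */
static int demo_calls;
static int demo_fault(unsigned int flags)
{
	(void)flags;
	demo_calls++;
	return VM_FAULT_UFFD_RETRY;
}
```

Even when the handler keeps asking for a retry, the fault path runs exactly
twice before giving up.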

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 arch/alpha/mm/fault.c      | 8 +++++++-
 arch/arc/mm/fault.c        | 8 +++++++-
 arch/arm/mm/fault.c        | 8 +++++++-
 arch/arm64/mm/fault.c      | 8 +++++++-
 arch/avr32/mm/fault.c      | 8 +++++++-
 arch/cris/mm/fault.c       | 8 +++++++-
 arch/hexagon/mm/vm_fault.c | 8 +++++++-
 arch/ia64/mm/fault.c       | 8 +++++++-
 arch/m68k/mm/fault.c       | 8 +++++++-
 arch/metag/mm/fault.c      | 8 +++++++-
 arch/microblaze/mm/fault.c | 8 +++++++-
 arch/mips/mm/fault.c       | 8 +++++++-
 arch/mn10300/mm/fault.c    | 8 +++++++-
 arch/nios2/mm/fault.c      | 8 +++++++-
 arch/openrisc/mm/fault.c   | 8 +++++++-
 arch/parisc/mm/fault.c     | 8 +++++++-
 arch/powerpc/mm/fault.c    | 8 +++++++-
 arch/s390/mm/fault.c       | 9 ++++++++-
 arch/sh/mm/fault.c         | 8 +++++++-
 arch/sparc/mm/fault_32.c   | 8 +++++++-
 arch/sparc/mm/fault_64.c   | 8 +++++++-
 arch/tile/mm/fault.c       | 8 +++++++-
 arch/um/kernel/trap.c      | 8 +++++++-
 arch/unicore32/mm/fault.c  | 8 +++++++-
 arch/x86/mm/fault.c        | 9 ++++++++-
 arch/xtensa/mm/fault.c     | 8 +++++++-
 fs/userfaultfd.c           | 5 +++--
 include/linux/mm.h         | 3 ++-
 28 files changed, 189 insertions(+), 29 deletions(-)

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index 4a905bd..ba5de3e 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -88,7 +88,8 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
 	const struct exception_table_entry *fixup;
 	int fault, si_code = SEGV_MAPERR;
 	siginfo_t info;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
+			     FAULT_FLAG_ALLOW_UFFD_RETRY;
 
 	/* As of EV6, a load into $31/$f31 is a prefetch, and never faults
 	   (or is suppressed by the PALcode).  Support that for older CPUs
@@ -178,6 +179,11 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
 			goto retry;
 		}
 	}
+	if ((flags & FAULT_FLAG_ALLOW_UFFD_RETRY) &&
+	    (fault & VM_FAULT_UFFD_RETRY)) {
+		flags &= ~FAULT_FLAG_ALLOW_UFFD_RETRY;
+		goto retry;
+	}
 
 	up_read(&mm->mmap_sem);
 
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index af63f4a..7dc8f791 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -68,7 +68,8 @@ void do_page_fault(unsigned long address, struct pt_regs *regs)
 	siginfo_t info;
 	int fault, ret;
 	int write = regs->ecr_cause & ECR_C_PROTV_STORE;  /* ST/EX */
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
+			     FAULT_FLAG_ALLOW_UFFD_RETRY;
 
 	/*
 	 * We fault-in kernel-space virtual memory on-demand. The
@@ -168,6 +169,11 @@ void do_page_fault(unsigned long address, struct pt_regs *regs)
 				goto retry;
 			}
 		}
+		if ((flags & FAULT_FLAG_ALLOW_UFFD_RETRY) &&
+		    (fault & VM_FAULT_UFFD_RETRY)) {
+			flags &= ~FAULT_FLAG_ALLOW_UFFD_RETRY;
+			goto retry;
+		}
 
 		/* Fault Handled Gracefully */
 		up_read(&mm->mmap_sem);
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index daafcf1..59c1f64 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -260,7 +260,8 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	struct task_struct *tsk;
 	struct mm_struct *mm;
 	int fault, sig, code;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
+			     FAULT_FLAG_ALLOW_UFFD_RETRY;
 
 	if (notify_page_fault(regs, fsr))
 		return 0;
@@ -342,6 +343,11 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 			goto retry;
 		}
 	}
+	if (!(fault & VM_FAULT_ERROR) && (flags & FAULT_FLAG_ALLOW_UFFD_RETRY) &&
+	    (fault & VM_FAULT_UFFD_RETRY)) {
+		flags &= ~FAULT_FLAG_ALLOW_UFFD_RETRY;
+		goto retry;
+	}
 
 	up_read(&mm->mmap_sem);
 
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 19211c4..d66dfbc 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -199,7 +199,8 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
 	struct mm_struct *mm;
 	int fault, sig, code;
 	unsigned long vm_flags = VM_READ | VM_WRITE | VM_EXEC;
-	unsigned int mm_flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int mm_flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
+				FAULT_FLAG_ALLOW_UFFD_RETRY;
 
 	tsk = current;
 	mm  = tsk->mm;
@@ -291,6 +292,11 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
 			goto retry;
 		}
 	}
+	if ((mm_flags & FAULT_FLAG_ALLOW_UFFD_RETRY) &&
+	    (fault & VM_FAULT_UFFD_RETRY)) {
+		mm_flags &= ~FAULT_FLAG_ALLOW_UFFD_RETRY;
+		goto retry;
+	}
 
 	up_read(&mm->mmap_sem);
 
diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c
index c035339..d15f7ef 100644
--- a/arch/avr32/mm/fault.c
+++ b/arch/avr32/mm/fault.c
@@ -64,7 +64,8 @@ asmlinkage void do_page_fault(unsigned long ecr, struct pt_regs *regs)
 	long signr;
 	int code;
 	int fault;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
+			     FAULT_FLAG_ALLOW_UFFD_RETRY;
 
 	if (notify_page_fault(regs, ecr))
 		return;
@@ -166,6 +167,11 @@ asmlinkage void do_page_fault(unsigned long ecr, struct pt_regs *regs)
 			goto retry;
 		}
 	}
+	if ((flags & FAULT_FLAG_ALLOW_UFFD_RETRY) &&
+	    (fault & VM_FAULT_UFFD_RETRY)) {
+		flags &= ~FAULT_FLAG_ALLOW_UFFD_RETRY;
+		goto retry;
+	}
 
 	up_read(&mm->mmap_sem);
 	return;
diff --git a/arch/cris/mm/fault.c b/arch/cris/mm/fault.c
index 3066d40..62dde48 100644
--- a/arch/cris/mm/fault.c
+++ b/arch/cris/mm/fault.c
@@ -58,7 +58,8 @@ do_page_fault(unsigned long address, struct pt_regs *regs,
 	struct vm_area_struct * vma;
 	siginfo_t info;
 	int fault;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
+			     FAULT_FLAG_ALLOW_UFFD_RETRY;
 
 	D(printk(KERN_DEBUG
 		 "Page fault for %lX on %X at %lX, prot %d write %d\n",
@@ -201,6 +202,11 @@ do_page_fault(unsigned long address, struct pt_regs *regs,
 			goto retry;
 		}
 	}
+	if ((flags & FAULT_FLAG_ALLOW_UFFD_RETRY) &&
+	    (fault & VM_FAULT_UFFD_RETRY)) {
+		flags &= ~FAULT_FLAG_ALLOW_UFFD_RETRY;
+		goto retry;
+	}
 
 	up_read(&mm->mmap_sem);
 	return;
diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
index 8704c93..9046ffd 100644
--- a/arch/hexagon/mm/vm_fault.c
+++ b/arch/hexagon/mm/vm_fault.c
@@ -53,7 +53,8 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
 	int si_code = SEGV_MAPERR;
 	int fault;
 	const struct exception_table_entry *fixup;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
+			     FAULT_FLAG_ALLOW_UFFD_RETRY;
 
 	/*
 	 * If we're in an interrupt or have no user context,
@@ -119,6 +120,11 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
 				goto retry;
 			}
 		}
+		if ((flags & FAULT_FLAG_ALLOW_UFFD_RETRY) &&
+		    (fault & VM_FAULT_UFFD_RETRY)) {
+			flags &= ~FAULT_FLAG_ALLOW_UFFD_RETRY;
+			goto retry;
+		}
 
 		up_read(&mm->mmap_sem);
 		return;
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 70b40d1..ca3008d 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -85,7 +85,8 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
 	struct siginfo si;
 	unsigned long mask;
 	int fault;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
+			     FAULT_FLAG_ALLOW_UFFD_RETRY;
 
 	mask = ((((isr >> IA64_ISR_X_BIT) & 1UL) << VM_EXEC_BIT)
 		| (((isr >> IA64_ISR_W_BIT) & 1UL) << VM_WRITE_BIT));
@@ -198,6 +199,11 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
 			goto retry;
 		}
 	}
+	if ((flags & FAULT_FLAG_ALLOW_UFFD_RETRY) &&
+	    (fault & VM_FAULT_UFFD_RETRY)) {
+		flags &= ~FAULT_FLAG_ALLOW_UFFD_RETRY;
+		goto retry;
+	}
 
 	up_read(&mm->mmap_sem);
 	return;
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index 6a94cdd..ecaf9fb 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -72,7 +72,8 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct * vma;
 	int fault;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
+			     FAULT_FLAG_ALLOW_UFFD_RETRY;
 
 	pr_debug("do page fault:\nregs->sr=%#x, regs->pc=%#lx, address=%#lx, %ld, %p\n",
 		regs->sr, regs->pc, address, error_code, mm ? mm->pgd : NULL);
@@ -177,6 +178,11 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 			goto retry;
 		}
 	}
+	if ((flags & FAULT_FLAG_ALLOW_UFFD_RETRY) &&
+	    (fault & VM_FAULT_UFFD_RETRY)) {
+		flags &= ~FAULT_FLAG_ALLOW_UFFD_RETRY;
+		goto retry;
+	}
 
 	up_read(&mm->mmap_sem);
 	return 0;
diff --git a/arch/metag/mm/fault.c b/arch/metag/mm/fault.c
index f57edca..be053cf 100644
--- a/arch/metag/mm/fault.c
+++ b/arch/metag/mm/fault.c
@@ -53,7 +53,8 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 	struct vm_area_struct *vma, *prev_vma;
 	siginfo_t info;
 	int fault;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
+			     FAULT_FLAG_ALLOW_UFFD_RETRY;
 
 	tsk = current;
 
@@ -165,6 +166,11 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 			goto retry;
 		}
 	}
+	if ((flags & FAULT_FLAG_ALLOW_UFFD_RETRY) &&
+	    (fault & VM_FAULT_UFFD_RETRY)) {
+		flags &= ~FAULT_FLAG_ALLOW_UFFD_RETRY;
+		goto retry;
+	}
 
 	up_read(&mm->mmap_sem);
 	return 0;
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index 177dfc0..2121910 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -92,7 +92,8 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
 	int code = SEGV_MAPERR;
 	int is_write = error_code & ESR_S;
 	int fault;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
+			     FAULT_FLAG_ALLOW_UFFD_RETRY;
 
 	regs->ear = address;
 	regs->esr = error_code;
@@ -249,6 +250,11 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
 			goto retry;
 		}
 	}
+	if ((flags & FAULT_FLAG_ALLOW_UFFD_RETRY) &&
+	    (fault & VM_FAULT_UFFD_RETRY)) {
+		flags &= ~FAULT_FLAG_ALLOW_UFFD_RETRY;
+		goto retry;
+	}
 
 	up_read(&mm->mmap_sem);
 
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 4b88fa0..f7cd73a 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -45,7 +45,8 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
 	const int field = sizeof(unsigned long) * 2;
 	siginfo_t info;
 	int fault;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
+			     FAULT_FLAG_ALLOW_UFFD_RETRY;
 
 	static DEFINE_RATELIMIT_STATE(ratelimit_state, 5 * HZ, 10);
 
@@ -191,6 +192,11 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
 			goto retry;
 		}
 	}
+	if ((flags & FAULT_FLAG_ALLOW_UFFD_RETRY) &&
+	    (fault & VM_FAULT_UFFD_RETRY)) {
+		flags &= ~FAULT_FLAG_ALLOW_UFFD_RETRY;
+		goto retry;
+	}
 
 	up_read(&mm->mmap_sem);
 	return;
diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c
index 4a1d181..2ea4ec7 100644
--- a/arch/mn10300/mm/fault.c
+++ b/arch/mn10300/mm/fault.c
@@ -124,7 +124,8 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long fault_code,
 	unsigned long page;
 	siginfo_t info;
 	int fault;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
+			     FAULT_FLAG_ALLOW_UFFD_RETRY;
 
 #ifdef CONFIG_GDBSTUB
 	/* handle GDB stub causing a fault */
@@ -284,6 +285,11 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long fault_code,
 			goto retry;
 		}
 	}
+	if ((flags & FAULT_FLAG_ALLOW_UFFD_RETRY) &&
+	    (fault & VM_FAULT_UFFD_RETRY)) {
+		flags &= ~FAULT_FLAG_ALLOW_UFFD_RETRY;
+		goto retry;
+	}
 
 	up_read(&mm->mmap_sem);
 	return;
diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c
index b51878b..0166754 100644
--- a/arch/nios2/mm/fault.c
+++ b/arch/nios2/mm/fault.c
@@ -47,7 +47,8 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause,
 	struct mm_struct *mm = tsk->mm;
 	int code = SEGV_MAPERR;
 	int fault;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
+			     FAULT_FLAG_ALLOW_UFFD_RETRY;
 
 	cause >>= 2;
 
@@ -171,6 +172,11 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause,
 			goto retry;
 		}
 	}
+	if ((flags & FAULT_FLAG_ALLOW_UFFD_RETRY) &&
+	    (fault & VM_FAULT_UFFD_RETRY)) {
+		flags &= ~FAULT_FLAG_ALLOW_UFFD_RETRY;
+		goto retry;
+	}
 
 	up_read(&mm->mmap_sem);
 	return;
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index 230ac20..e6049e4 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -54,7 +54,8 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
 	struct vm_area_struct *vma;
 	siginfo_t info;
 	int fault;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
+			     FAULT_FLAG_ALLOW_UFFD_RETRY;
 
 	tsk = current;
 
@@ -196,6 +197,11 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
 			goto retry;
 		}
 	}
+	if ((flags & FAULT_FLAG_ALLOW_UFFD_RETRY) &&
+	    (fault & VM_FAULT_UFFD_RETRY)) {
+		flags &= ~FAULT_FLAG_ALLOW_UFFD_RETRY;
+		goto retry;
+	}
 
 	up_read(&mm->mmap_sem);
 	return;
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index a762864..8b98cb2 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -215,7 +215,8 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
 	if (!mm)
 		goto no_context;
 
-	flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
+		FAULT_FLAG_ALLOW_UFFD_RETRY;
 	if (user_mode(regs))
 		flags |= FAULT_FLAG_USER;
 
@@ -279,6 +280,11 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
 			goto retry;
 		}
 	}
+	if ((flags & FAULT_FLAG_ALLOW_UFFD_RETRY) &&
+	    (fault & VM_FAULT_UFFD_RETRY)) {
+		flags &= ~FAULT_FLAG_ALLOW_UFFD_RETRY;
+		goto retry;
+	}
 	up_read(&mm->mmap_sem);
 	return;
 
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index a67c6d7..e84b4ef2 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -211,7 +211,8 @@ int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address,
 	enum ctx_state prev_state = exception_enter();
 	struct vm_area_struct * vma;
 	struct mm_struct *mm = current->mm;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
+			     FAULT_FLAG_ALLOW_UFFD_RETRY;
 	int code = SEGV_MAPERR;
 	int is_write = 0;
 	int trap = TRAP(regs);
@@ -474,6 +475,11 @@ int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address,
 			goto retry;
 		}
 	}
+	if ((flags & FAULT_FLAG_ALLOW_UFFD_RETRY) &&
+	    (fault & VM_FAULT_UFFD_RETRY)) {
+		flags &= ~FAULT_FLAG_ALLOW_UFFD_RETRY;
+		goto retry;
+	}
 
 	up_read(&mm->mmap_sem);
 	goto bail;
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index ec1a30d..a5e34cb 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -405,7 +405,8 @@ static inline int do_exception(struct pt_regs *regs, int access)
 
 	address = trans_exc_code & __FAIL_ADDR_MASK;
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
-	flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
+		FAULT_FLAG_ALLOW_UFFD_RETRY;
 	if (user_mode(regs))
 		flags |= FAULT_FLAG_USER;
 	if (access == VM_WRITE || (trans_exc_code & store_indication) == 0x400)
@@ -498,6 +499,12 @@ static inline int do_exception(struct pt_regs *regs, int access)
 			goto retry;
 		}
 	}
+	if ((flags & FAULT_FLAG_ALLOW_UFFD_RETRY) &&
+	    (fault & VM_FAULT_UFFD_RETRY)) {
+		flags &= ~FAULT_FLAG_ALLOW_UFFD_RETRY;
+		down_read(&mm->mmap_sem);
+		goto retry;
+	}
 #ifdef CONFIG_PGSTE
 	if (gmap) {
 		address =  __gmap_link(gmap, current->thread.gmap_addr,
diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index 79d8276..a4b19cf 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -403,7 +403,8 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
 	struct mm_struct *mm;
 	struct vm_area_struct * vma;
 	int fault;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
+			     FAULT_FLAG_ALLOW_UFFD_RETRY;
 
 	tsk = current;
 	mm = tsk->mm;
@@ -515,6 +516,11 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
 			goto retry;
 		}
 	}
+	if ((flags & FAULT_FLAG_ALLOW_UFFD_RETRY) &&
+	    (fault & VM_FAULT_UFFD_RETRY)) {
+		flags &= ~FAULT_FLAG_ALLOW_UFFD_RETRY;
+		goto retry;
+	}
 
 	up_read(&mm->mmap_sem);
 }
diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index c399e7b..024b798 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -174,7 +174,8 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 	unsigned long g2;
 	int from_user = !(regs->psr & PSR_PS);
 	int fault, code;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
+			     FAULT_FLAG_ALLOW_UFFD_RETRY;
 
 	if (text_fault)
 		address = regs->pc;
@@ -278,6 +279,11 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 			goto retry;
 		}
 	}
+	if ((flags & FAULT_FLAG_ALLOW_UFFD_RETRY) &&
+	    (fault & VM_FAULT_UFFD_RETRY)) {
+		flags &= ~FAULT_FLAG_ALLOW_UFFD_RETRY;
+		goto retry;
+	}
 
 	up_read(&mm->mmap_sem);
 	return;
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index dbabe57..453b975 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -287,7 +287,8 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 	unsigned int insn = 0;
 	int si_code, fault_code, fault;
 	unsigned long address, mm_rss;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
+			     FAULT_FLAG_ALLOW_UFFD_RETRY;
 
 	fault_code = get_thread_fault_code();
 
@@ -476,6 +477,11 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 			goto retry;
 		}
 	}
+	if ((flags & FAULT_FLAG_ALLOW_UFFD_RETRY) &&
+	    (fault & VM_FAULT_UFFD_RETRY)) {
+		flags &= ~FAULT_FLAG_ALLOW_UFFD_RETRY;
+		goto retry;
+	}
 	up_read(&mm->mmap_sem);
 
 	mm_rss = get_mm_rss(mm);
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index 13eac59..39b2dce 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -278,7 +278,8 @@ static int handle_page_fault(struct pt_regs *regs,
 	if (!is_page_fault)
 		write = 1;
 
-	flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
+		FAULT_FLAG_ALLOW_UFFD_RETRY;
 
 	is_kernel_mode = !user_mode(regs);
 
@@ -466,6 +467,11 @@ static int handle_page_fault(struct pt_regs *regs,
 			goto retry;
 		}
 	}
+	if ((flags & FAULT_FLAG_ALLOW_UFFD_RETRY) &&
+	    (fault & VM_FAULT_UFFD_RETRY)) {
+		flags &= ~FAULT_FLAG_ALLOW_UFFD_RETRY;
+		goto retry;
+	}
 
 #if CHIP_HAS_TILE_DMA()
 	/* If this was a DMA TLB fault, restart the DMA engine. */
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index 98783dd..0bb4f3d 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -31,7 +31,8 @@ int handle_page_fault(unsigned long address, unsigned long ip,
 	pmd_t *pmd;
 	pte_t *pte;
 	int err = -EFAULT;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
+			     FAULT_FLAG_ALLOW_UFFD_RETRY;
 
 	*code_out = SEGV_MAPERR;
 
@@ -101,6 +102,11 @@ int handle_page_fault(unsigned long address, unsigned long ip,
 				goto retry;
 			}
 		}
+		if ((flags & FAULT_FLAG_ALLOW_UFFD_RETRY) &&
+		    (fault & VM_FAULT_UFFD_RETRY)) {
+			flags &= ~FAULT_FLAG_ALLOW_UFFD_RETRY;
+			goto retry;
+		}
 
 		pgd = pgd_offset(mm, address);
 		pud = pud_offset(pgd, address);
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index afccef552..546e7dc 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -209,7 +209,8 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	struct task_struct *tsk;
 	struct mm_struct *mm;
 	int fault, sig, code;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
+			     FAULT_FLAG_ALLOW_UFFD_RETRY;
 
 	tsk = current;
 	mm = tsk->mm;
@@ -272,6 +273,11 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 			goto retry;
 		}
 	}
+	if (!(fault & VM_FAULT_ERROR) && (flags & FAULT_FLAG_ALLOW_UFFD_RETRY) &&
+	    (fault & VM_FAULT_UFFD_RETRY)) {
+		flags &= ~FAULT_FLAG_ALLOW_UFFD_RETRY;
+		goto retry;
+	}
 
 	up_read(&mm->mmap_sem);
 
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index eef44d9..4732f60 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1062,7 +1062,8 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	struct task_struct *tsk;
 	struct mm_struct *mm;
 	int fault, major = 0;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
+			     FAULT_FLAG_ALLOW_UFFD_RETRY;
 
 	tsk = current;
 	mm = tsk->mm;
@@ -1251,6 +1252,12 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 			if (!fatal_signal_pending(tsk))
 				goto retry;
 		}
+		if ((flags & FAULT_FLAG_ALLOW_UFFD_RETRY) &&
+		    (fault & VM_FAULT_UFFD_RETRY)) {
+			flags &= ~FAULT_FLAG_ALLOW_UFFD_RETRY;
+			if (!fatal_signal_pending(tsk))
+				goto retry;
+		}
 
 		/* User mode? Just return to handle the fatal exception */
 		if (flags & FAULT_FLAG_USER)
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index c9784c1..b6c19ce 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -45,7 +45,8 @@ void do_page_fault(struct pt_regs *regs)
 
 	int is_write, is_exec;
 	int fault;
-	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
+			     FAULT_FLAG_ALLOW_UFFD_RETRY;
 
 	info.si_code = SEGV_MAPERR;
 
@@ -141,6 +142,11 @@ void do_page_fault(struct pt_regs *regs)
 			goto retry;
 		}
 	}
+	if ((flags & FAULT_FLAG_ALLOW_UFFD_RETRY) &&
+	    (fault & VM_FAULT_UFFD_RETRY)) {
+		flags &= ~FAULT_FLAG_ALLOW_UFFD_RETRY;
+		goto retry;
+	}
 
 	up_read(&mm->mmap_sem);
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index c79a3fd..bbf0ef2 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -298,7 +298,8 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 	 * without first stopping userland access to the memory. For
 	 * VM_UFFD_MISSING userfaults this is enough for now.
 	 */
-	if (unlikely(!(flags & FAULT_FLAG_ALLOW_RETRY))) {
+	if (unlikely(!(flags & (FAULT_FLAG_ALLOW_RETRY |
+			FAULT_FLAG_ALLOW_UFFD_RETRY)))) {
 		/*
 		 * Validate the invariant that nowait must allow retry
 		 * to be sure not to return SIGBUS erroneously on
@@ -357,7 +358,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 		    !fatal_signal_pending(current)))) {
 		wake_up_poll(&ctx->fd_wqh, POLLIN);
 		schedule();
-		ret |= VM_FAULT_MAJOR;
+		ret |= VM_FAULT_MAJOR | VM_FAULT_UFFD_RETRY;
 	}
 
 	__set_current_state(TASK_RUNNING);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 00bad77..b4c6e44 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -219,6 +219,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_KILLABLE	0x10	/* The fault task is in SIGKILL killable region */
 #define FAULT_FLAG_TRIED	0x20	/* Second try */
 #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
+#define FAULT_FLAG_ALLOW_UFFD_RETRY 0x80/* userfault retry */
 
 /*
  * vm_fault is filled by the the pagefault handler and passed to the vma's
@@ -1027,7 +1028,7 @@ static inline void clear_page_pfmemalloc(struct page *page)
 #define VM_FAULT_HWPOISON 0x0010	/* Hit poisoned small page */
 #define VM_FAULT_HWPOISON_LARGE 0x0020  /* Hit poisoned large page. Index encoded in upper bits */
 #define VM_FAULT_SIGSEGV 0x0040
-
+#define VM_FAULT_UFFD_RETRY 0x0080
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
 #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
-- 
2.4.6

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC 8/8] userfaultfd: enabled write protection in userfaultfd API
  2015-11-19 22:33 [RFC 0/8] userfaultfd: add write protect support Shaohua Li
                   ` (6 preceding siblings ...)
  2015-11-19 22:33 ` [RFC 7/8] userfaultfd: fault try one more time Shaohua Li
@ 2015-11-19 22:33 ` Shaohua Li
  2015-11-20  3:13 ` [RFC 0/8] userfaultfd: add write protect support Jerome Glisse
  8 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2015-11-19 22:33 UTC (permalink / raw)
  To: linux-mm
  Cc: kernel-team, Andrew Morton, Andrea Arcangeli, Pavel Emelyanov,
	Rik van Riel, Kirill A. Shutemov, Mel Gorman, Hugh Dickins,
	Johannes Weiner

Now it's safe to enable write protection in the userfaultfd API.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 include/uapi/linux/userfaultfd.h | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 8898bd7..12c7a36 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -17,7 +17,7 @@
  * #define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP | \
  *			      UFFD_FEATURE_EVENT_FORK)
  */
-#define UFFD_API_FEATURES (0)
+#define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -25,7 +25,8 @@
 #define UFFD_API_RANGE_IOCTLS			\
 	((__u64)1 << _UFFDIO_WAKE |		\
 	 (__u64)1 << _UFFDIO_COPY |		\
-	 (__u64)1 << _UFFDIO_ZEROPAGE)
+	 (__u64)1 << _UFFDIO_ZEROPAGE |		\
+	 (__u64)1 << _UFFDIO_WRITEPROTECT)
 
 /*
  * Valid ioctl command number range with this API is from 0x00 to
@@ -108,8 +109,8 @@ struct uffdio_api {
 	 * are to be considered implicitly always enabled in all kernels as
 	 * long as the uffdio_api.api requested matches UFFD_API.
 	 */
-#if 0 /* not available yet */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
+#if 0 /* not available yet */
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
 #endif
 	__u64 features;
-- 
2.4.6


* Re: [RFC 6/8] userfaultfd: hook userfault handler to write protection fault
  2015-11-19 22:33 ` [RFC 6/8] userfaultfd: hook userfault handler to write protection fault Shaohua Li
@ 2015-11-20  2:54   ` Jerome Glisse
  0 siblings, 0 replies; 13+ messages in thread
From: Jerome Glisse @ 2015-11-20  2:54 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, kernel-team, Andrew Morton, Andrea Arcangeli,
	Pavel Emelyanov, Rik van Riel, Kirill A. Shutemov, Mel Gorman,
	Hugh Dickins, Johannes Weiner

On Thu, Nov 19, 2015 at 02:33:51PM -0800, Shaohua Li wrote:
> There are several cases where a write protection fault happens. It could be
> a write to a zero page, a swapped page, or a userfault write-protected page.
> When the fault happens, there is no way to know whether userfault
> write-protected the page before. Here we just blindly issue a userfault
> notification for any vma with VM_UFFD_WP, regardless of whether the app has
> write-protected it yet. The application should be ready to handle such a wp
> fault.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Pavel Emelyanov <xemul@parallels.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  mm/memory.c | 66 +++++++++++++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 49 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index deb679c..5d16a31 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1994,10 +1994,11 @@ static inline int wp_page_reuse(struct mm_struct *mm,
>  			struct vm_area_struct *vma, unsigned long address,
>  			pte_t *page_table, spinlock_t *ptl, pte_t orig_pte,
>  			struct page *page, int page_mkwrite,
> -			int dirty_shared)
> +			int dirty_shared, unsigned int flags)
>  	__releases(ptl)
>  {
>  	pte_t entry;
> +	bool do_uffd = false;
>  	/*
>  	 * Clear the pages cpupid information as the existing
>  	 * information potentially belongs to a now completely
> @@ -2008,10 +2009,16 @@ static inline int wp_page_reuse(struct mm_struct *mm,
>  
>  	flush_cache_page(vma, address, pte_pfn(orig_pte));
>  	entry = pte_mkyoung(orig_pte);
> -	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +	if (userfaultfd_wp(vma) && page) {
> +		entry = pte_mkdirty(entry);


Why do you pte_mkdirty()? It makes no sense to me unless I am missing
something. In fact, IIRC, userfaultfd only concerns private anonymous VMAs,
so you should only need to modify three places: do_anonymous_page(),
do_swap_page() and do_wp_page().

You also want to hook in wp_huge_pmd() and __do_huge_pmd_anonymous_page() to
properly cover THP.

So I think you need to simplify this patch and make sure you handle THP properly.

Cheers,
Jerome


* Re: [RFC 7/8] userfaultfd: fault try one more time
  2015-11-19 22:33 ` [RFC 7/8] userfaultfd: fault try one more time Shaohua Li
@ 2015-11-20  3:04   ` Jerome Glisse
  0 siblings, 0 replies; 13+ messages in thread
From: Jerome Glisse @ 2015-11-20  3:04 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, kernel-team, Andrew Morton, Andrea Arcangeli,
	Pavel Emelyanov, Rik van Riel, Kirill A. Shutemov, Mel Gorman,
	Hugh Dickins, Johannes Weiner

On Thu, Nov 19, 2015 at 02:33:52PM -0800, Shaohua Li wrote:
> For a swapin memory write fault, fault handler already retry once to
> read the page in. userfaultfd can't do the retry again and fail. Give
> another retry for userfaultfd in such case. gup isn't fixed yet, so will
> return -EBUSY.

This whole patch makes me nervous; I do not see the point of it. On a
page fault, in the first pass you have the RETRY flag set, and you can
return VM_FAULT_RETRY either because of (1) lock_page_or_retry() in
do_swap_page() or because of (2) handle_userfault().

In the second case, on retry you already have a valid read-only pte, so
you go directly to do_wp_page(), and that is properly handled by the
current handle_userfault() code. So it does not make sense to add
complexity for that case.

You seem to hint that you are doing this for the first case (1), but even
there it does not make sense. If we fail to lock the page, it is because
someone else is doing something with that page, and most likely it is
already related to the userfaultfd (e.g. another thread took the fault
and is doing all the steps you need). So you just want a regular retry,
i.e. do_swap_page() returns retry, and on retry it is likely that
everything is already fine. If not, it takes the slow, painful wait code
path.

I genuinely do not see what benefit or reason there is for this new
special userfaultfd retry flag.

Cheers,
Jerome


* Re: [RFC 0/8] userfaultfd: add write protect support
  2015-11-19 22:33 [RFC 0/8] userfaultfd: add write protect support Shaohua Li
                   ` (7 preceding siblings ...)
  2015-11-19 22:33 ` [RFC 8/8] userfaultfd: enabled write protection in userfaultfd API Shaohua Li
@ 2015-11-20  3:13 ` Jerome Glisse
  8 siblings, 0 replies; 13+ messages in thread
From: Jerome Glisse @ 2015-11-20  3:13 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-mm, kernel-team, Andrew Morton

On Thu, Nov 19, 2015 at 02:33:45PM -0800, Shaohua Li wrote:
> Hi,
> 
> There has been a plan to support write protect faults in userfaultfd, but it
> is not implemented yet. I'm working on a library to support different types of
> buffers, like compressed buffers and file buffers, something like a page cache
> implementation in userspace. The buffer enables userfaultfd and does something
> like decompression in the userfault handler. When the memory size exceeds a
> threshold, madvise is used to reclaim memory. The problem is that data can be
> corrupted during reclaim without memory protection support.
> 
> For example, in the compressed buffer case, reclaim does:
> 1. compress the memory range and store the compressed data elsewhere
> 2. madvise the memory range
> 
> But if the memory is changed before step 2, the new change is lost. Memory
> write protection can solve the issue. With it, the reclaim does:
> 1. write protect the memory range
> 2. compress the memory range and store the compressed data elsewhere
> 3. madvise the memory range
> 4. undo write protection on the memory range and wake up tasks waiting in the
> write protect fault.
> If a task changes memory before step 3, a write protect fault is triggered. We
> can, for example, put the task to sleep until step 4 runs. In this way memory
> changes will not be lost.

While I understand the whole concept of write protection while doing
compression, I do not see a valid use case for this. Inside the kernel we
already have things like zswap that do what you seem to want to do (i.e.
compress a memory range and transparently decompress it on the next CPU
access).

I fail to see a use case where we would really want to do this in userspace.

> 
> This patch set adds write protect support to userfaultfd. One issue is that a
> write protect fault can happen even without enabling write protection in
> userfault. For example, a write to an address backed by the zero page. There
> is no way to distinguish whether such a fault is a write protect fault
> expected by userfault. This patch just blindly delivers the write protect
> fault to userfault if the corresponding vma enables VM_UFFD_WP. The
> application should be prepared to handle such write protect faults.
> 
> Thanks,
> Shaohua
> 
> 
> Shaohua Li (8):
>   userfaultfd: add helper for writeprotect check
>   userfaultfd: support write protection for userfault vma range
>   userfaultfd: expose writeprotect API to ioctl
>   userfaultfd: allow userfaultfd register success with writeprotection
>   userfaultfd: undo write proctection in unregister
>   userfaultfd: hook userfault handler to write protection fault
>   userfaultfd: fault try one more time
>   userfaultfd: enabled write protection in userfaultfd API

From an organizational point of view, I would put the "expose writeprotect API
to ioctl" patch last in the series, after all the plumbing is done. This would
make "enabled write protection in userfaultfd API" unnecessary and avoid
awkward changes in some of the other patches where you add commented-out or
disabled code.

Also you want to handle GUP: you want the write protection to fail if there is
a GUP, and you want GUP to force breaking the write protection; otherwise this
will be broken if anyone mixes it with something that triggers GUP.

Cheers,
Jerome

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC 2/8] userfaultfd: support write protection for userfault vma range
  2015-11-19 22:33 ` [RFC 2/8] userfaultfd: support write protection for userfault vma range Shaohua Li
@ 2016-04-14 21:07   ` Andrea Arcangeli
  0 siblings, 0 replies; 13+ messages in thread
From: Andrea Arcangeli @ 2016-04-14 21:07 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, kernel-team, Andrew Morton, Pavel Emelyanov,
	Rik van Riel, Kirill A. Shutemov, Mel Gorman, Hugh Dickins,
	Johannes Weiner

Hello,

Do you have a more recent version of this patchset?

On Thu, Nov 19, 2015 at 02:33:47PM -0800, Shaohua Li wrote:
> +	down_read(&dst_mm->mmap_sem);

[..]

> +	if (enable_wp)
> +		newprot = vm_get_page_prot(dst_vma->vm_flags & ~(VM_WRITE));
> +	else
> +		newprot = vm_get_page_prot(dst_vma->vm_flags);

The vm_flags for anon vmas are always wrprotected; we only mark the ptes
writable during fault or during COW, when VM_WRITE is set in vm_flags and
we know the page is not shared. So this requires checking the mapcount
somewhere while fork cannot run, or the above won't properly
unprotect?

> +
> +	change_protection(dst_vma, start, start + len, newprot,
> +				!enable_wp, 0);

change_protection() with prot_numa==0 assumes the mmap_sem is held for
writing, which breaks here:

	 /* !prot_numa is protected by mmap_sem held for write */
	if (!prot_numa)
		return pte_offset_map_lock(vma->vm_mm, pmd, addr, ptl);

	pmdl = pmd_lock(vma->vm_mm, pmd);
	if (unlikely(pmd_trans_huge(*pmd) || pmd_none(*pmd))) {
		spin_unlock(pmdl);
		return NULL;
	}

	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, ptl);
	spin_unlock(pmdl);

With userfaultfd the pmd can be trans unstable as we only hold the
mmap_sem for reading.

In short, calling change_protection() with prot_numa==0 while holding only
the mmap_sem for reading looks wrong...

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2016-04-14 21:07 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-19 22:33 [RFC 0/8] userfaultfd: add write protect support Shaohua Li
2015-11-19 22:33 ` [RFC 1/8] userfaultfd: add helper for writeprotect check Shaohua Li
2015-11-19 22:33 ` [RFC 2/8] userfaultfd: support write protection for userfault vma range Shaohua Li
2016-04-14 21:07   ` Andrea Arcangeli
2015-11-19 22:33 ` [RFC 3/8] userfaultfd: expose writeprotect API to ioctl Shaohua Li
2015-11-19 22:33 ` [RFC 4/8] userfaultfd: allow userfaultfd register success with writeprotection Shaohua Li
2015-11-19 22:33 ` [RFC 5/8] userfaultfd: undo write proctection in unregister Shaohua Li
2015-11-19 22:33 ` [RFC 6/8] userfaultfd: hook userfault handler to write protection fault Shaohua Li
2015-11-20  2:54   ` Jerome Glisse
2015-11-19 22:33 ` [RFC 7/8] userfaultfd: fault try one more time Shaohua Li
2015-11-20  3:04   ` Jerome Glisse
2015-11-19 22:33 ` [RFC 8/8] userfaultfd: enabled write protection in userfaultfd API Shaohua Li
2015-11-20  3:13 ` [RFC 0/8] userfaultfd: add write protect support Jerome Glisse
