linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks
@ 2021-08-03 19:18 Andreas Gruenbacher
  2021-08-03 19:18 ` [PATCH v5 01/12] iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value Andreas Gruenbacher
                   ` (12 more replies)
  0 siblings, 13 replies; 23+ messages in thread
From: Andreas Gruenbacher @ 2021-08-03 19:18 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig,
	Darrick J. Wong, Paul Mackerras
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher, kvm-ppc

Hi all,

here's another update on top of v5.14-rc4.  There seems to be a bug in
get_user_pages_fast when called with FOLL_FAST_ONLY; please see below.

Changes:

 * Change fault_in_pages_{readable,writeable} to return the number of
   bytes that should be accessible instead of failing outright when
   part of the requested region cannot be faulted in.  Change
   iov_iter_fault_in_readable to those same semantics.

 * Add fault_in_iov_iter_writeable for safely faulting in pages for
   writing without modifying the pages.


With this patch queue, fstest generic/208 (aio-dio-invalidate-failure.c)
endlessly spins in gfs2_file_direct_write.  It looks as if there's a bug
in get_user_pages_fast when called with FOLL_FAST_ONLY:

 (1) The test case performs an aio write into a 32 MB buffer.

 (2) The buffer is initially not in memory, so when iomap_dio_rw() ->
     ... -> bio_iov_iter_get_pages() is called with the iter->noio flag
     set, we get to get_user_pages_fast() with FOLL_FAST_ONLY set.
     get_user_pages_fast() returns 0, which causes
     bio_iov_iter_get_pages to return -EFAULT.

 (3) Then gfs2_file_direct_write faults in the entire buffer with
     fault_in_iov_iter_readable(), which succeeds.

 (4) With the buffer in memory, we retry the iomap_dio_rw() ->
     ... -> bio_iov_iter_get_pages() -> ... -> get_user_pages_fast().
     This should succeed now, but get_user_pages_fast() still returns 0.

 (5) Thus we end up in step (3) again.

The buffered writes generic/208 performs are unrelated to this hang.


Apart from the generic/208 hang, gfs2 still needs a better strategy for
faulting in more reasonable chunks of memory at a time and for resuming
requests after faulting in pages.  We've got some of the pieces in place
for safely allowing that, but more work remains to be done.


For immediate consideration by Al Viro:

  iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value


For immediate consideration by Paul Mackerras:

  powerpc/kvm: Fix kvm_use_magic_page


Thanks,
Andreas


Andreas Gruenbacher (12):
  iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value
  powerpc/kvm: Fix kvm_use_magic_page
  Turn fault_in_pages_{readable,writeable} into
    fault_in_{readable,writeable}
  Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable
  iov_iter: Introduce fault_in_iov_iter_writeable
  gfs2: Add wrapper for iomap_file_buffered_write
  gfs2: Fix mmap + page fault deadlocks for buffered I/O
  iomap: Fix iomap_dio_rw return value for user copies
  iomap: Support restarting direct I/O requests after user copy failures
  iomap: Add done_before argument to iomap_dio_rw
  iov_iter: Introduce noio flag to disable page faults
  gfs2: Fix mmap + page fault deadlocks for direct I/O

 arch/powerpc/kernel/kvm.c           |   3 +-
 arch/powerpc/kernel/signal_32.c     |   4 +-
 arch/powerpc/kernel/signal_64.c     |   2 +-
 arch/x86/kernel/fpu/signal.c        |   8 +-
 drivers/gpu/drm/armada/armada_gem.c |   7 +-
 fs/btrfs/file.c                     |   8 +-
 fs/btrfs/ioctl.c                    |   7 +-
 fs/ext4/file.c                      |   5 +-
 fs/f2fs/file.c                      |   6 +-
 fs/fuse/file.c                      |   2 +-
 fs/gfs2/file.c                      |  95 ++++++++++++++++++++---
 fs/iomap/buffered-io.c              |   2 +-
 fs/iomap/direct-io.c                |  28 +++++--
 fs/ntfs/file.c                      |   2 +-
 fs/xfs/xfs_file.c                   |   6 +-
 fs/zonefs/super.c                   |   4 +-
 include/linux/iomap.h               |  11 ++-
 include/linux/pagemap.h             |  58 +-------------
 include/linux/uio.h                 |   4 +-
 lib/iov_iter.c                      | 107 ++++++++++++++++++++------
 mm/filemap.c                        |   4 +-
 mm/gup.c                            | 113 ++++++++++++++++++++++++++++
 22 files changed, 360 insertions(+), 126 deletions(-)

-- 
2.26.3


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v5 01/12] iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value
  2021-08-03 19:18 [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
@ 2021-08-03 19:18 ` Andreas Gruenbacher
  2021-08-03 19:18 ` [PATCH v5 02/12] powerpc/kvm: Fix kvm_use_magic_page Andreas Gruenbacher
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 23+ messages in thread
From: Andreas Gruenbacher @ 2021-08-03 19:18 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher

Both iov_iter_get_pages and iov_iter_get_pages_alloc return the number of bytes
of the iovec they could get the pages for.  When they cannot get any pages,
they're supposed to return 0, but when the start of the iovec isn't page
aligned, the calculation goes wrong and they return a negative value.  Fix both
functions.

In addition, change iov_iter_get_pages_alloc to return NULL in that case to
prevent resource leaks.

It seems that the cifs and nfs filesystems don't handle the zero case very
well.  Could the maintainers please have a look?

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 lib/iov_iter.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index e23123ae3a13..25dfc48536d7 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1484,7 +1484,7 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 		res = get_user_pages_fast(addr, n,
 				iov_iter_rw(i) != WRITE ?  FOLL_WRITE : 0,
 				pages);
-		if (unlikely(res < 0))
+		if (unlikely(res <= 0))
 			return res;
 		return (res == n ? len : res * PAGE_SIZE) - *start;
 	}
@@ -1608,8 +1608,9 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 			return -ENOMEM;
 		res = get_user_pages_fast(addr, n,
 				iov_iter_rw(i) != WRITE ?  FOLL_WRITE : 0, p);
-		if (unlikely(res < 0)) {
+		if (unlikely(res <= 0)) {
 			kvfree(p);
+			*pages = NULL;
 			return res;
 		}
 		*pages = p;
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 02/12] powerpc/kvm: Fix kvm_use_magic_page
  2021-08-03 19:18 [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
  2021-08-03 19:18 ` [PATCH v5 01/12] iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value Andreas Gruenbacher
@ 2021-08-03 19:18 ` Andreas Gruenbacher
  2021-08-03 19:18 ` [PATCH v5 03/12] Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable} Andreas Gruenbacher
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 23+ messages in thread
From: Andreas Gruenbacher @ 2021-08-03 19:18 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig,
	Darrick J. Wong, Paul Mackerras
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher, kvm-ppc, stable

When switching from __get_user to fault_in_pages_readable, commit 9f9eae5ce717
broke kvm_use_magic_page: fault_in_pages_readable returns 0 on success.  Fix
that.

Fixes: 9f9eae5ce717 ("powerpc/kvm: Prefer fault_in_pages_readable function")
Cc: stable@vger.kernel.org # v4.18+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 arch/powerpc/kernel/kvm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c
index 617eba82531c..d89cf802d9aa 100644
--- a/arch/powerpc/kernel/kvm.c
+++ b/arch/powerpc/kernel/kvm.c
@@ -669,7 +669,7 @@ static void __init kvm_use_magic_page(void)
 	on_each_cpu(kvm_map_magic_page, &features, 1);
 
 	/* Quick self-test to see if the mapping works */
-	if (!fault_in_pages_readable((const char *)KVM_MAGIC_PAGE, sizeof(u32))) {
+	if (fault_in_pages_readable((const char *)KVM_MAGIC_PAGE, sizeof(u32))) {
 		kvm_patching_worked = false;
 		return;
 	}
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 03/12] Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable}
  2021-08-03 19:18 [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
  2021-08-03 19:18 ` [PATCH v5 01/12] iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value Andreas Gruenbacher
  2021-08-03 19:18 ` [PATCH v5 02/12] powerpc/kvm: Fix kvm_use_magic_page Andreas Gruenbacher
@ 2021-08-03 19:18 ` Andreas Gruenbacher
  2021-08-03 19:43   ` Linus Torvalds
  2021-08-03 20:57   ` Al Viro
  2021-08-03 19:18 ` [PATCH v5 04/12] Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable Andreas Gruenbacher
                   ` (9 subsequent siblings)
  12 siblings, 2 replies; 23+ messages in thread
From: Andreas Gruenbacher @ 2021-08-03 19:18 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher

Turn fault_in_pages_{readable,writeable} into versions that return the number
of bytes faulted in instead of returning a non-zero value when any of the
requested pages couldn't be faulted in.  This supports the existing users that
require all pages to be faulted in, but also new users that are happy if any
pages can be faulted in.

Neither of these functions is entirely trivial and it doesn't seem useful to
inline them, so move them to mm/gup.c.

Rename the functions to fault_in_{readable,writeable} to make sure that code
that uses them can be fixed instead of breaking silently.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 arch/powerpc/kernel/kvm.c           |  3 +-
 arch/powerpc/kernel/signal_32.c     |  4 +-
 arch/powerpc/kernel/signal_64.c     |  2 +-
 arch/x86/kernel/fpu/signal.c        |  8 ++--
 drivers/gpu/drm/armada/armada_gem.c |  7 ++--
 fs/btrfs/ioctl.c                    |  7 ++--
 include/linux/pagemap.h             | 57 ++---------------------------
 lib/iov_iter.c                      | 12 +++---
 mm/filemap.c                        |  2 +-
 mm/gup.c                            | 52 ++++++++++++++++++++++++++
 10 files changed, 78 insertions(+), 76 deletions(-)

diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c
index d89cf802d9aa..b8fe9f16dec2 100644
--- a/arch/powerpc/kernel/kvm.c
+++ b/arch/powerpc/kernel/kvm.c
@@ -669,7 +669,8 @@ static void __init kvm_use_magic_page(void)
 	on_each_cpu(kvm_map_magic_page, &features, 1);
 
 	/* Quick self-test to see if the mapping works */
-	if (fault_in_pages_readable((const char *)KVM_MAGIC_PAGE, sizeof(u32))) {
+	if (fault_in_readable((const char __user *)KVM_MAGIC_PAGE,
+			sizeof(u32)) != sizeof(u32)) {
 		kvm_patching_worked = false;
 		return;
 	}
diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index 0608581967f0..4619604b85f6 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -1048,7 +1048,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx,
 	if (new_ctx == NULL)
 		return 0;
 	if (!access_ok(new_ctx, ctx_size) ||
-	    fault_in_pages_readable((u8 __user *)new_ctx, ctx_size))
+	    fault_in_readable((char __user *)new_ctx, ctx_size) != ctx_size)
 		return -EFAULT;
 
 	/*
@@ -1237,7 +1237,7 @@ SYSCALL_DEFINE3(debug_setcontext, struct ucontext __user *, ctx,
 #endif
 
 	if (!access_ok(ctx, sizeof(*ctx)) ||
-	    fault_in_pages_readable((u8 __user *)ctx, sizeof(*ctx)))
+	    fault_in_readable((char __user *)ctx, sizeof(*ctx)) != sizeof(*ctx))
 		return -EFAULT;
 
 	/*
diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 1831bba0582e..889612ac23ca 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -688,7 +688,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx,
 	if (new_ctx == NULL)
 		return 0;
 	if (!access_ok(new_ctx, ctx_size) ||
-	    fault_in_pages_readable((u8 __user *)new_ctx, ctx_size))
+	    fault_in_readable((char __user *)new_ctx, ctx_size) != ctx_size)
 		return -EFAULT;
 
 	/*
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index 445c57c9c539..9cb7c4d2c6d2 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -205,7 +205,8 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size)
 	fpregs_unlock();
 
 	if (ret) {
-		if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
+		if (fault_in_writeable(buf_fx, fpu_user_xstate_size) ==
+				fpu_user_xstate_size)
 			goto retry;
 		return -EFAULT;
 	}
@@ -278,10 +279,9 @@ static int restore_fpregs_from_user(void __user *buf, u64 xrestore,
 		if (ret != -EFAULT)
 			return -EINVAL;
 
-		ret = fault_in_pages_readable(buf, size);
-		if (!ret)
+		if (fault_in_readable(buf, size) == size)
 			goto retry;
-		return ret;
+		return -EFAULT;
 	}
 
 	/*
diff --git a/drivers/gpu/drm/armada/armada_gem.c b/drivers/gpu/drm/armada/armada_gem.c
index 21909642ee4c..ceb68a5ee31f 100644
--- a/drivers/gpu/drm/armada/armada_gem.c
+++ b/drivers/gpu/drm/armada/armada_gem.c
@@ -336,7 +336,7 @@ int armada_gem_pwrite_ioctl(struct drm_device *dev, void *data,
 	struct drm_armada_gem_pwrite *args = data;
 	struct armada_gem_object *dobj;
 	char __user *ptr;
-	int ret;
+	int ret = 0;
 
 	DRM_DEBUG_DRIVER("handle %u off %u size %u ptr 0x%llx\n",
 		args->handle, args->offset, args->size, args->ptr);
@@ -349,9 +349,8 @@ int armada_gem_pwrite_ioctl(struct drm_device *dev, void *data,
 	if (!access_ok(ptr, args->size))
 		return -EFAULT;
 
-	ret = fault_in_pages_readable(ptr, args->size);
-	if (ret)
-		return ret;
+	if (fault_in_readable(ptr, args->size) != args->size)
+		return -EFAULT;
 
 	dobj = armada_gem_object_lookup(file, args->handle);
 	if (dobj == NULL)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0ba98e08a029..c30382f89544 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2244,9 +2244,10 @@ static noinline int search_ioctl(struct inode *inode,
 	key.offset = sk->min_offset;
 
 	while (1) {
-		ret = fault_in_pages_writeable(ubuf + sk_offset,
-					       *buf_size - sk_offset);
-		if (ret)
+		size_t size = *buf_size - sk_offset;
+
+		ret = -EFAULT;
+		if (fault_in_writeable(ubuf + sk_offset, size) != size)
 			break;
 
 		ret = btrfs_search_forward(root, &key, path, sk->min_transid);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index ed02aa522263..7c9edc9694d9 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -734,61 +734,10 @@ int wait_on_page_private_2_killable(struct page *page);
 extern void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter);
 
 /*
- * Fault everything in given userspace address range in.
+ * Fault in userspace address range.
  */
-static inline int fault_in_pages_writeable(char __user *uaddr, int size)
-{
-	char __user *end = uaddr + size - 1;
-
-	if (unlikely(size == 0))
-		return 0;
-
-	if (unlikely(uaddr > end))
-		return -EFAULT;
-	/*
-	 * Writing zeroes into userspace here is OK, because we know that if
-	 * the zero gets there, we'll be overwriting it.
-	 */
-	do {
-		if (unlikely(__put_user(0, uaddr) != 0))
-			return -EFAULT;
-		uaddr += PAGE_SIZE;
-	} while (uaddr <= end);
-
-	/* Check whether the range spilled into the next page. */
-	if (((unsigned long)uaddr & PAGE_MASK) ==
-			((unsigned long)end & PAGE_MASK))
-		return __put_user(0, end);
-
-	return 0;
-}
-
-static inline int fault_in_pages_readable(const char __user *uaddr, int size)
-{
-	volatile char c;
-	const char __user *end = uaddr + size - 1;
-
-	if (unlikely(size == 0))
-		return 0;
-
-	if (unlikely(uaddr > end))
-		return -EFAULT;
-
-	do {
-		if (unlikely(__get_user(c, uaddr) != 0))
-			return -EFAULT;
-		uaddr += PAGE_SIZE;
-	} while (uaddr <= end);
-
-	/* Check whether the range spilled into the next page. */
-	if (((unsigned long)uaddr & PAGE_MASK) ==
-			((unsigned long)end & PAGE_MASK)) {
-		return __get_user(c, end);
-	}
-
-	(void)c;
-	return 0;
-}
+size_t fault_in_writeable(char __user *uaddr, size_t size);
+size_t fault_in_readable(const char __user *uaddr, size_t size);
 
 int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 25dfc48536d7..92d877a698f0 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -191,7 +191,8 @@ static size_t copy_page_to_iter_iovec(struct page *page, size_t offset, size_t b
 	buf = iov->iov_base + skip;
 	copy = min(bytes, iov->iov_len - skip);
 
-	if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_pages_writeable(buf, copy)) {
+	if (IS_ENABLED(CONFIG_HIGHMEM) &&
+	    fault_in_writeable(buf, copy) == copy) {
 		kaddr = kmap_atomic(page);
 		from = kaddr + offset;
 
@@ -275,7 +276,8 @@ static size_t copy_page_from_iter_iovec(struct page *page, size_t offset, size_t
 	buf = iov->iov_base + skip;
 	copy = min(bytes, iov->iov_len - skip);
 
-	if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_pages_readable(buf, copy)) {
+	if (IS_ENABLED(CONFIG_HIGHMEM) &&
+	    fault_in_readable(buf, copy) == copy) {
 		kaddr = kmap_atomic(page);
 		to = kaddr + offset;
 
@@ -446,13 +448,11 @@ int iov_iter_fault_in_readable(const struct iov_iter *i, size_t bytes)
 			bytes = i->count;
 		for (p = i->iov, skip = i->iov_offset; bytes; p++, skip = 0) {
 			size_t len = min(bytes, p->iov_len - skip);
-			int err;
 
 			if (unlikely(!len))
 				continue;
-			err = fault_in_pages_readable(p->iov_base + skip, len);
-			if (unlikely(err))
-				return err;
+			if (fault_in_readable(p->iov_base + skip, len) != len)
+				return -EFAULT;
 			bytes -= len;
 		}
 	}
diff --git a/mm/filemap.c b/mm/filemap.c
index d1458ecf2f51..4dec3bc7752e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -88,7 +88,7 @@
  *    ->lock_page		(access_process_vm)
  *
  *  ->i_mutex			(generic_perform_write)
- *    ->mmap_lock		(fault_in_pages_readable->do_page_fault)
+ *    ->mmap_lock		(fault_in_readable->do_page_fault)
  *
  *  bdi->wb.list_lock
  *    sb_lock			(fs/fs-writeback.c)
diff --git a/mm/gup.c b/mm/gup.c
index 42b8b1fa6521..d04984d5d93c 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1669,6 +1669,58 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
 }
 #endif /* !CONFIG_MMU */
 
+size_t fault_in_writeable(char __user *uaddr, size_t size)
+{
+	char __user *start = uaddr, *end;
+
+	if (unlikely(size == 0))
+		return 0;
+	if (!PAGE_ALIGNED(uaddr)) {
+		if (unlikely(__put_user(0, uaddr) != 0))
+			return 0;
+		uaddr = (char __user *)PAGE_ALIGN((unsigned long)uaddr);
+	}
+	end = (char __user *)PAGE_ALIGN((unsigned long)start + size);
+	if (unlikely(end < start))
+		end = NULL;
+	while (uaddr != end) {
+		if (unlikely(__put_user(0, uaddr) != 0))
+			goto out;
+		uaddr += PAGE_SIZE;
+	}
+
+out:
+	return min_t(size_t, uaddr - start, size);
+}
+EXPORT_SYMBOL(fault_in_writeable);
+
+size_t fault_in_readable(const char __user *uaddr, size_t size)
+{
+	const char __user *start = uaddr, *end;
+	volatile char c;
+
+	if (unlikely(size == 0))
+		return 0;
+	if (!PAGE_ALIGNED(uaddr)) {
+		if (unlikely(__get_user(c, uaddr) != 0))
+			return 0;
+		uaddr = (const char __user *)PAGE_ALIGN((unsigned long)uaddr);
+	}
+	end = (const char __user *)PAGE_ALIGN((unsigned long)start + size);
+	if (unlikely(end < start))
+		end = NULL;
+	while (uaddr != end) {
+		if (unlikely(__get_user(c, uaddr) != 0))
+			goto out;
+		uaddr += PAGE_SIZE;
+	}
+
+out:
+	(void)c;
+	return min_t(size_t, uaddr - start, size);
+}
+EXPORT_SYMBOL(fault_in_readable);
+
 /**
  * get_dump_page() - pin user page in memory while writing it to core dump
  * @addr: user address
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 04/12] Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable
  2021-08-03 19:18 [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (2 preceding siblings ...)
  2021-08-03 19:18 ` [PATCH v5 03/12] Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable} Andreas Gruenbacher
@ 2021-08-03 19:18 ` Andreas Gruenbacher
  2021-08-03 19:18 ` [PATCH v5 05/12] iov_iter: Introduce fault_in_iov_iter_writeable Andreas Gruenbacher
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 23+ messages in thread
From: Andreas Gruenbacher @ 2021-08-03 19:18 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher

Turn iov_iter_fault_in_readable into a function that returns the number of
bytes faulted in instead of returning a non-zero value when any of the
requested pages couldn't be faulted in.  This supports the existing users that
require all pages to be faulted in, but also new users that are happy if any
pages can be faulted in.

Rename iov_iter_fault_in_readable to an unfortunately clumsy
fault_in_iov_iter_readable to make sure that code that uses it can be fixed
instead of breaking silently.

Fix up the existing users.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/btrfs/file.c        |  3 ++-
 fs/f2fs/file.c         |  6 +++---
 fs/fuse/file.c         |  2 +-
 fs/iomap/buffered-io.c |  2 +-
 fs/ntfs/file.c         |  2 +-
 include/linux/uio.h    |  2 +-
 lib/iov_iter.c         | 33 ++++++++++++++++++++++-----------
 mm/filemap.c           |  2 +-
 8 files changed, 32 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index ee34497500e1..8ff9e0bb5b0f 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1698,7 +1698,8 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
 		 * Fault pages before locking them in prepare_pages
 		 * to avoid recursive lock
 		 */
-		if (unlikely(iov_iter_fault_in_readable(i, write_bytes))) {
+		if (unlikely(fault_in_iov_iter_readable(i, write_bytes) !=
+				write_bytes)) {
 			ret = -EFAULT;
 			break;
 		}
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 6afd4562335f..7c172573f18a 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -4255,16 +4255,16 @@ static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
 	ret = generic_write_checks(iocb, from);
 	if (ret > 0) {
+		size_t count = iov_iter_count(from);
 		bool preallocated = false;
 		size_t target_size = 0;
 		int err;
 
-		if (iov_iter_fault_in_readable(from, iov_iter_count(from)))
+		if (fault_in_iov_iter_readable(from, count) != count)
 			set_inode_flag(inode, FI_NO_PREALLOC);
 
 		if ((iocb->ki_flags & IOCB_NOWAIT)) {
-			if (!f2fs_overwrite_io(inode, iocb->ki_pos,
-						iov_iter_count(from)) ||
+			if (!f2fs_overwrite_io(inode, iocb->ki_pos, count) ||
 				f2fs_has_inline_data(inode) ||
 				f2fs_force_buffered_io(inode, iocb, from)) {
 				clear_inode_flag(inode, FI_NO_PREALLOC);
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 97f860cfc195..d5dd01f20f1e 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1160,7 +1160,7 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
 
  again:
 		err = -EFAULT;
-		if (iov_iter_fault_in_readable(ii, bytes))
+		if (fault_in_iov_iter_readable(ii, bytes) != bytes)
 			break;
 
 		err = -ENOMEM;
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 87ccb3438bec..d5de094fef73 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -749,7 +749,7 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
 		 */
-		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
+		if (unlikely(fault_in_iov_iter_readable(i, bytes) != bytes)) {
 			status = -EFAULT;
 			break;
 		}
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index ab4f3362466d..cddac274c35a 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -1829,7 +1829,7 @@ static ssize_t ntfs_perform_write(struct file *file, struct iov_iter *i,
 		 * pages being swapped out between us bringing them into memory
 		 * and doing the actual copying.
 		 */
-		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
+		if (unlikely(fault_in_iov_iter_readable(i, bytes) != bytes)) {
 			status = -EFAULT;
 			break;
 		}
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 82c3c3e819e0..12d30246c2e9 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -119,7 +119,7 @@ size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
 				  size_t bytes, struct iov_iter *i);
 void iov_iter_advance(struct iov_iter *i, size_t bytes);
 void iov_iter_revert(struct iov_iter *i, size_t bytes);
-int iov_iter_fault_in_readable(const struct iov_iter *i, size_t bytes);
+size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t bytes);
 size_t iov_iter_single_seg_count(const struct iov_iter *i);
 size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 92d877a698f0..c0fa1618561c 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -432,33 +432,44 @@ static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t by
 }
 
 /*
+ * fault_in_iov_iter_readable - fault in iov iterator for reading
+ * @i: iterator
+ * @size: maximum length
+ *
  * Fault in one or more iovecs of the given iov_iter, to a maximum length of
- * bytes.  For each iovec, fault in each page that constitutes the iovec.
+ * @size.  For each iovec, fault in each page that constitutes the iovec.
+ *
+ * Returns the number of bytes faulted in, or 0 if no bytes could be faulted in
+ * (i.e., because the address is invalid).
  *
- * Return 0 on success, or non-zero if the memory could not be accessed (i.e.
- * because it is an invalid address).
+ * Always returns the number of avaliable bytes for non-user space iterators.
  */
-int iov_iter_fault_in_readable(const struct iov_iter *i, size_t bytes)
+size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t size)
 {
+	if (size > i->count)
+		size = i->count;
+
 	if (iter_is_iovec(i)) {
 		const struct iovec *p;
+		size_t bytes = size;
 		size_t skip;
 
-		if (bytes > i->count)
-			bytes = i->count;
 		for (p = i->iov, skip = i->iov_offset; bytes; p++, skip = 0) {
 			size_t len = min(bytes, p->iov_len - skip);
+			size_t ret;
 
 			if (unlikely(!len))
 				continue;
-			if (fault_in_readable(p->iov_base + skip, len) != len)
-				return -EFAULT;
-			bytes -= len;
+			ret = fault_in_readable(p->iov_base + skip, len);
+			bytes -= ret;
+			if (ret != len)
+				break;
 		}
+		return size - bytes;
 	}
-	return 0;
+	return size;
 }
-EXPORT_SYMBOL(iov_iter_fault_in_readable);
+EXPORT_SYMBOL(fault_in_iov_iter_readable);
 
 void iov_iter_init(struct iov_iter *i, unsigned int direction,
 			const struct iovec *iov, unsigned long nr_segs,
diff --git a/mm/filemap.c b/mm/filemap.c
index 4dec3bc7752e..5f5aed060c9e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3643,7 +3643,7 @@ ssize_t generic_perform_write(struct file *file,
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
 		 */
-		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
+		if (unlikely(fault_in_iov_iter_readable(i, bytes) != bytes)) {
 			status = -EFAULT;
 			break;
 		}
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 05/12] iov_iter: Introduce fault_in_iov_iter_writeable
  2021-08-03 19:18 [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (3 preceding siblings ...)
  2021-08-03 19:18 ` [PATCH v5 04/12] Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable Andreas Gruenbacher
@ 2021-08-03 19:18 ` Andreas Gruenbacher
  2021-08-04 15:24   ` Andreas Gruenbacher
  2021-08-03 19:18 ` [PATCH v5 06/12] gfs2: Add wrapper for iomap_file_buffered_write Andreas Gruenbacher
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 23+ messages in thread
From: Andreas Gruenbacher @ 2021-08-03 19:18 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher

Introduce a new fault_in_iov_iter_writeable helper for safely faulting in an
iterator for writing.  Uses get_user_pages() to fault in the pages without
actually writing to them, which would be destructive.

We'll use fault_in_iov_iter_writeable in gfs2 once we've determined that the
iterator passed to .read_iter isn't in memory.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 include/linux/pagemap.h |  1 +
 include/linux/uio.h     |  1 +
 lib/iov_iter.c          | 41 +++++++++++++++++++++++++++
 mm/gup.c                | 61 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 104 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 7c9edc9694d9..a629807edb8c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -737,6 +737,7 @@ extern void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter);
  * Fault in userspace address range.
  */
 size_t fault_in_writeable(char __user *uaddr, size_t size);
+size_t fault_in_safe_writeable(const char __user *uaddr, size_t size);
 size_t fault_in_readable(const char __user *uaddr, size_t size);
 
 int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 12d30246c2e9..ffa431aeb067 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -120,6 +120,7 @@ size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
 void iov_iter_advance(struct iov_iter *i, size_t bytes);
 void iov_iter_revert(struct iov_iter *i, size_t bytes);
 size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t bytes);
+size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t bytes);
 size_t iov_iter_single_seg_count(const struct iov_iter *i);
 size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index c0fa1618561c..4ffc76801eaa 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -471,6 +471,47 @@ size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t size)
 }
 EXPORT_SYMBOL(fault_in_iov_iter_readable);
 
+/*
+ * fault_in_iov_iter_writeable - fault in iov iterator for writing
+ * @i: iterator
+ * @size: maximum length
+ *
+ * Faults in the iterator using get_user_pages(), i.e., without triggering
+ * hardware page faults.  This is primarily useful when we know that some or
+ * all of the pages in @i aren't in memory.
+ *
+ * Returns the number of bytes faulted in, or 0 if no bytes could be faulted in
+ * (i.e., because the address is invalid).
+ *
+ * Always returns the number of avaliable bytes for non-user space iterators.
+ */
+size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t size)
+{
+	if (size > i->count)
+		size = i->count;
+
+	if (iter_is_iovec(i)) {
+		const struct iovec *p;
+		size_t bytes = size;
+		size_t skip;
+
+		for (p = i->iov, skip = i->iov_offset; bytes; p++, skip = 0) {
+			size_t len = min(bytes, p->iov_len - skip);
+			size_t ret;
+
+			if (unlikely(!len))
+				continue;
+			ret = fault_in_safe_writeable(p->iov_base + skip, len);
+			bytes -= ret;
+			if (ret != len)
+				break;
+		}
+		return size - bytes;
+	}
+	return size;
+}
+EXPORT_SYMBOL(fault_in_iov_iter_writeable);
+
 void iov_iter_init(struct iov_iter *i, unsigned int direction,
 			const struct iovec *iov, unsigned long nr_segs,
 			size_t count)
diff --git a/mm/gup.c b/mm/gup.c
index d04984d5d93c..7218e27c2481 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1694,6 +1694,67 @@ size_t fault_in_writeable(char __user *uaddr, size_t size)
 }
 EXPORT_SYMBOL(fault_in_writeable);
 
+/**
+ * fault_in_safe_writeable - fault in an address range for writing
+ * @uaddr: start of address range
+ * @size: length of address range
+ *
+ * Faults in an address range using get_user_pages, i.e., without triggering
+ * hardware page faults.  This is primarily useful when we know that some or
+ * all of the pages in the address range aren't in memory.
+ *
+ * Other than fault_in_writeable(), this function is non-destructive.
+ *
+ * Note that we don't pin or otherwise hold the pages referenced that we fault
+ * in.  There's no guarantee that they'll stay in memory for any duration of
+ * time.
+ *
+ * Returns the number of bytes faulted in from @uaddr.
+ */
+size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
+{
+	unsigned long start = (unsigned long)uaddr;
+	unsigned long end, nstart, nend;
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma = NULL;
+	int locked = 0;
+
+	/* FIXME: Protect against overflow! */
+
+	end = PAGE_ALIGN(start + size);
+	for (nstart = start & PAGE_MASK; nstart < end; nstart = nend) {
+		unsigned long nr_pages;
+		long ret;
+
+		if (!locked) {
+			locked = 1;
+			mmap_read_lock(mm);
+			vma = find_vma(mm, nstart);
+		} else if (nstart >= vma->vm_end)
+			vma = vma->vm_next;
+		if (!vma || vma->vm_start >= end)
+			break;
+		nend = min(end, vma->vm_end);
+		if (vma->vm_flags & (VM_IO | VM_PFNMAP))
+			continue;
+		if (nstart < vma->vm_start)
+			nstart = vma->vm_start;
+		nr_pages = (nend - nstart) / PAGE_SIZE;
+		ret = __get_user_pages_locked(mm, nstart, nr_pages,
+					      NULL, NULL, &locked,
+					      FOLL_TOUCH | FOLL_WRITE);
+		if (ret <= 0)
+			break;
+		nend = nstart + ret * PAGE_SIZE;
+	}
+	if (locked)
+		mmap_read_unlock(mm);
+	if (nstart > start)
+		return min(nstart - start, size);
+	return 0;
+}
+EXPORT_SYMBOL(fault_in_safe_writeable);
+
 size_t fault_in_readable(const char __user *uaddr, size_t size)
 {
 	const char __user *start = uaddr, *end;
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 06/12] gfs2: Add wrapper for iomap_file_buffered_write
  2021-08-03 19:18 [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (4 preceding siblings ...)
  2021-08-03 19:18 ` [PATCH v5 05/12] iov_iter: Introduce fault_in_iov_iter_writeable Andreas Gruenbacher
@ 2021-08-03 19:18 ` Andreas Gruenbacher
  2021-08-03 19:18 ` [PATCH v5 07/12] gfs2: Fix mmap + page fault deadlocks for buffered I/O Andreas Gruenbacher
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 23+ messages in thread
From: Andreas Gruenbacher @ 2021-08-03 19:18 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher

Add a wrapper around iomap_file_buffered_write.  We'll add code for when
the operation needs to be retried here later.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/file.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 84ec053d43b4..55ec1cadc9e6 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -876,6 +876,18 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return written ? written : ret;
 }
 
+static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct file *file = iocb->ki_filp;
+	struct inode *inode = file_inode(file);
+	ssize_t ret;
+
+	current->backing_dev_info = inode_to_bdi(inode);
+	ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
+	current->backing_dev_info = NULL;
+	return ret;
+}
+
 /**
  * gfs2_file_write_iter - Perform a write to a file
  * @iocb: The io context
@@ -927,9 +939,7 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			goto out_unlock;
 
 		iocb->ki_flags |= IOCB_DSYNC;
-		current->backing_dev_info = inode_to_bdi(inode);
-		buffered = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
-		current->backing_dev_info = NULL;
+		buffered = gfs2_file_buffered_write(iocb, from);
 		if (unlikely(buffered <= 0)) {
 			if (!ret)
 				ret = buffered;
@@ -951,9 +961,7 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		if (!ret || ret2 > 0)
 			ret += ret2;
 	} else {
-		current->backing_dev_info = inode_to_bdi(inode);
-		ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
-		current->backing_dev_info = NULL;
+		ret = gfs2_file_buffered_write(iocb, from);
 		if (likely(ret > 0)) {
 			iocb->ki_pos += ret;
 			ret = generic_write_sync(iocb, ret);
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 07/12] gfs2: Fix mmap + page fault deadlocks for buffered I/O
  2021-08-03 19:18 [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (5 preceding siblings ...)
  2021-08-03 19:18 ` [PATCH v5 06/12] gfs2: Add wrapper for iomap_file_buffered_write Andreas Gruenbacher
@ 2021-08-03 19:18 ` Andreas Gruenbacher
  2021-08-03 19:18 ` [PATCH v5 08/12] iomap: Fix iomap_dio_rw return value for user copies Andreas Gruenbacher
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 23+ messages in thread
From: Andreas Gruenbacher @ 2021-08-03 19:18 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher

In the .read_iter and .write_iter file operations, we're accessing
user-space memory while holding the inodes glock.  There's a possibility
that the memory is mapped to the same file, in which case we'd recurse on
the same glock.

More complex scenarios can involve multiple glocks, processes, and even cluster
nodes.

Avoids these kinds of problems by disabling page faults while holding a glock.
If a page fault occurs, we either end up with a partial read or write, or with
-EFAULT if nothing could be read or written.  In that case, we drop the glock,
fault in the requested pages manually, and repeat the operation.

This locking problem in gfs2 was originally reported by Jan Kara.  Linus came
up with the proposal to disable page faults.  Many thanks to Al Viro and
Matthew Wilcox for their feedback as well.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/file.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 55ec1cadc9e6..c0f86a28f1bf 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -843,6 +843,12 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	size_t written = 0;
 	ssize_t ret;
 
+	/*
+	 * In this function, we disable page faults when we're holding the
+	 * inode glock while doing I/O.  If a page fault occurs, we drop the
+	 * inode glock, fault in the pages manually, and retry.
+	 */
+
 	if (iocb->ki_flags & IOCB_DIRECT) {
 		ret = gfs2_file_direct_read(iocb, to, &gh);
 		if (likely(ret != -ENOTBLK))
@@ -864,13 +870,20 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	}
 	ip = GFS2_I(iocb->ki_filp->f_mapping->host);
 	gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
+retry:
 	ret = gfs2_glock_nq(&gh);
 	if (ret)
 		goto out_uninit;
+	pagefault_disable();
 	ret = generic_file_read_iter(iocb, to);
+	pagefault_enable();
 	if (ret > 0)
 		written += ret;
 	gfs2_glock_dq(&gh);
+	if (unlikely(iov_iter_count(to) && (ret > 0 || ret == -EFAULT)) &&
+	    iter_is_iovec(to) &&
+	    fault_in_iov_iter_writeable(to, SIZE_MAX) != 0)
+		goto retry;
 out_uninit:
 	gfs2_holder_uninit(&gh);
 	return written ? written : ret;
@@ -882,9 +895,22 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *fro
 	struct inode *inode = file_inode(file);
 	ssize_t ret;
 
+	/*
+	 * In this function, we disable page faults when we're holding the
+	 * inode glock while doing I/O.  If a page fault occurs, we drop the
+	 * inode glock, fault in the pages manually, and retry.
+	 */
+
+retry:
 	current->backing_dev_info = inode_to_bdi(inode);
+	pagefault_disable();
 	ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
+	pagefault_enable();
 	current->backing_dev_info = NULL;
+	if (unlikely(ret == -EFAULT) &&
+	    iter_is_iovec(from) &&
+	    fault_in_iov_iter_readable(from, SIZE_MAX) != 0)
+		goto retry;
 	return ret;
 }
 
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 08/12] iomap: Fix iomap_dio_rw return value for user copies
  2021-08-03 19:18 [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (6 preceding siblings ...)
  2021-08-03 19:18 ` [PATCH v5 07/12] gfs2: Fix mmap + page fault deadlocks for buffered I/O Andreas Gruenbacher
@ 2021-08-03 19:18 ` Andreas Gruenbacher
  2021-08-03 19:18 ` [PATCH v5 09/12] iomap: Support restarting direct I/O requests after user copy failures Andreas Gruenbacher
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 23+ messages in thread
From: Andreas Gruenbacher @ 2021-08-03 19:18 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher

When a user copy fails in one of the helpers of iomap_dio_rw, fail with -EFAULT
instead of returning 0.  This matches what iomap_dio_bio_actor returns when it
gets an -EFAULT from bio_iov_iter_get_pages.  With these changes,
iomap_dio_actor consistently fails with -EFAULT when a user page cannot be
faulted in.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/iomap/direct-io.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 9398b8c31323..8054f5d6c273 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -370,7 +370,7 @@ iomap_dio_hole_actor(loff_t length, struct iomap_dio *dio)
 {
 	length = iov_iter_zero(length, dio->submit.iter);
 	dio->size += length;
-	return length;
+	return length ? length : -EFAULT;
 }
 
 static loff_t
@@ -397,7 +397,7 @@ iomap_dio_inline_actor(struct inode *inode, loff_t pos, loff_t length,
 		copied = copy_to_iter(iomap->inline_data + pos, length, iter);
 	}
 	dio->size += copied;
-	return copied;
+	return copied ? copied : -EFAULT;
 }
 
 static loff_t
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 09/12] iomap: Support restarting direct I/O requests after user copy failures
  2021-08-03 19:18 [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (7 preceding siblings ...)
  2021-08-03 19:18 ` [PATCH v5 08/12] iomap: Fix iomap_dio_rw return value for user copies Andreas Gruenbacher
@ 2021-08-03 19:18 ` Andreas Gruenbacher
  2021-08-03 19:18 ` [PATCH v5 10/12] iomap: Add done_before argument to iomap_dio_rw Andreas Gruenbacher
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 23+ messages in thread
From: Andreas Gruenbacher @ 2021-08-03 19:18 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher

In iomap_dio_rw, when iomap_apply returns an -EFAULT error, complete the
request synchronously.  Either return a partial result (when the
IOMAP_DIO_FAULT_RETRY flag is set and the caller is thus prepared to handle
partial results), or reset the iterator to the start and fail to allow
restarting the request.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/iomap/direct-io.c  | 13 +++++++++++++
 include/linux/iomap.h |  7 +++++++
 2 files changed, 20 insertions(+)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 8054f5d6c273..35c3f2bae65a 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -561,6 +561,19 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		ret = iomap_apply(inode, pos, count, iomap_flags, ops, dio,
 				iomap_dio_actor);
 		if (ret <= 0) {
+			if (ret == -EFAULT) {
+				/*
+				 * Finish synchronously and revert the iterator
+				 * when failing the request to allow a retry.
+				 */
+				wait_for_completion = true;
+				if (dio->size &&
+				    (dio_flags & IOMAP_DIO_PARTIAL))
+					ret = 0;
+				else
+					iov_iter_revert(iter, dio->size);
+			}
+
 			/* magic error code to fall back to buffered I/O */
 			if (ret == -ENOTBLK) {
 				wait_for_completion = true;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 479c1da3e221..bcae4814b8e3 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -267,6 +267,13 @@ struct iomap_dio_ops {
   */
 #define IOMAP_DIO_OVERWRITE_ONLY	(1 << 1)
 
+/*
+ * When a page fault occurs, return a partial synchronous result and allow
+ * the caller to retry the rest of the operation after dealing with the page
+ * fault.
+ */
+#define IOMAP_DIO_PARTIAL		(1 << 2)
+
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
 		unsigned int dio_flags);
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 10/12] iomap: Add done_before argument to iomap_dio_rw
  2021-08-03 19:18 [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (8 preceding siblings ...)
  2021-08-03 19:18 ` [PATCH v5 09/12] iomap: Support restarting direct I/O requests after user copy failures Andreas Gruenbacher
@ 2021-08-03 19:18 ` Andreas Gruenbacher
  2021-08-03 19:18 ` [PATCH v5 11/12] iov_iter: Introduce noio flag to disable page faults Andreas Gruenbacher
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 23+ messages in thread
From: Andreas Gruenbacher @ 2021-08-03 19:18 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher

Add a done_before argument to iomap_dio_rw that indicates how much of the
request has already been transferred.  When the request succeeds, we report
that done_before additional bytes were tranferred.  This is useful for
finishing a request asynchronously when part of the request has already been
completed synchronously.

We'll use that to allow iomap_dio_rw to be used with page faults disabled: when
a page fault occurs while submitting a request, we synchronously complete the
part of the request that has already been submitted.  The caller can then take
care of the page fault and call iomap_dio_rw again for the rest of the request,
passing in the number of bytes already tranferred.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/btrfs/file.c       |  5 +++--
 fs/ext4/file.c        |  5 +++--
 fs/gfs2/file.c        |  4 ++--
 fs/iomap/direct-io.c  | 11 ++++++++---
 fs/xfs/xfs_file.c     |  6 +++---
 fs/zonefs/super.c     |  4 ++--
 include/linux/iomap.h |  4 ++--
 7 files changed, 23 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 8ff9e0bb5b0f..c20cc0fc61d9 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1946,7 +1946,7 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	}
 
 	dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
-			     0);
+			     0, 0);
 
 	btrfs_inode_unlock(inode, ilock_flags);
 
@@ -3638,7 +3638,8 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
 		return 0;
 
 	btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
-	ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops, 0);
+	ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
+			   0, 0);
 	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
 	return ret;
 }
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 816dedcbd541..4a5e7fd31fb5 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -74,7 +74,7 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		return generic_file_read_iter(iocb, to);
 	}
 
-	ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0);
+	ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0, 0);
 	inode_unlock_shared(inode);
 
 	file_accessed(iocb->ki_filp);
@@ -566,7 +566,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (ilock_shared)
 		iomap_ops = &ext4_iomap_overwrite_ops;
 	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
-			   (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0);
+			   (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0,
+			   0);
 	if (ret == -ENOTBLK)
 		ret = 0;
 
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index c0f86a28f1bf..d98f690097e2 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -792,7 +792,7 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
 	if (ret)
 		goto out_uninit;
 
-	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0);
+	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0, 0);
 	gfs2_glock_dq(gh);
 out_uninit:
 	gfs2_holder_uninit(gh);
@@ -826,7 +826,7 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 	if (offset + len > i_size_read(&ip->i_inode))
 		goto out;
 
-	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0);
+	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0, 0);
 	if (ret == -ENOTBLK)
 		ret = 0;
 out:
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 35c3f2bae65a..7564e740aff8 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -31,6 +31,7 @@ struct iomap_dio {
 	atomic_t		ref;
 	unsigned		flags;
 	int			error;
+	size_t			done_before;
 	bool			wait_for_completion;
 
 	union {
@@ -126,6 +127,9 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
 	if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
 		ret = generic_write_sync(iocb, ret);
 
+	if (ret >= 0)
+		ret += dio->done_before;
+
 	kfree(dio);
 
 	return ret;
@@ -450,7 +454,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 struct iomap_dio *
 __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
-		unsigned int dio_flags)
+		unsigned int dio_flags, size_t done_before)
 {
 	struct address_space *mapping = iocb->ki_filp->f_mapping;
 	struct inode *inode = file_inode(iocb->ki_filp);
@@ -477,6 +481,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	dio->dops = dops;
 	dio->error = 0;
 	dio->flags = 0;
+	dio->done_before = done_before;
 
 	dio->submit.iter = iter;
 	dio->submit.waiter = current;
@@ -655,11 +660,11 @@ EXPORT_SYMBOL_GPL(__iomap_dio_rw);
 ssize_t
 iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
-		unsigned int dio_flags)
+		unsigned int dio_flags, size_t done_before)
 {
 	struct iomap_dio *dio;
 
-	dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags);
+	dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags, done_before);
 	if (IS_ERR_OR_NULL(dio))
 		return PTR_ERR_OR_ZERO(dio);
 	return iomap_dio_complete(dio);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index cc3cfb12df53..3103d9bda466 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -259,7 +259,7 @@ xfs_file_dio_read(
 	ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
 	if (ret)
 		return ret;
-	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0);
+	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0, 0);
 	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
 
 	return ret;
@@ -569,7 +569,7 @@ xfs_file_dio_write_aligned(
 	}
 	trace_xfs_file_direct_write(iocb, from);
 	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
-			   &xfs_dio_write_ops, 0);
+			   &xfs_dio_write_ops, 0, 0);
 out_unlock:
 	if (iolock)
 		xfs_iunlock(ip, iolock);
@@ -647,7 +647,7 @@ xfs_file_dio_write_unaligned(
 
 	trace_xfs_file_direct_write(iocb, from);
 	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
-			   &xfs_dio_write_ops, flags);
+			   &xfs_dio_write_ops, flags, 0);
 
 	/*
 	 * Retry unaligned I/O with exclusive blocking semantics if the DIO
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 70055d486bf7..85ca2f5fe06e 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -864,7 +864,7 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
 		ret = zonefs_file_dio_append(iocb, from);
 	else
 		ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops,
-				   &zonefs_write_dio_ops, 0);
+				   &zonefs_write_dio_ops, 0, 0);
 	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
 	    (ret > 0 || ret == -EIOCBQUEUED)) {
 		if (ret > 0)
@@ -999,7 +999,7 @@ static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		}
 		file_accessed(iocb->ki_filp);
 		ret = iomap_dio_rw(iocb, to, &zonefs_iomap_ops,
-				   &zonefs_read_dio_ops, 0);
+				   &zonefs_read_dio_ops, 0, 0);
 	} else {
 		ret = generic_file_read_iter(iocb, to);
 		if (ret == -EIO)
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index bcae4814b8e3..908bda10024c 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -276,10 +276,10 @@ struct iomap_dio_ops {
 
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
-		unsigned int dio_flags);
+		unsigned int dio_flags, size_t done_before);
 struct iomap_dio *__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
-		unsigned int dio_flags);
+		unsigned int dio_flags, size_t done_before);
 ssize_t iomap_dio_complete(struct iomap_dio *dio);
 int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);
 
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 11/12] iov_iter: Introduce noio flag to disable page faults
  2021-08-03 19:18 [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (9 preceding siblings ...)
  2021-08-03 19:18 ` [PATCH v5 10/12] iomap: Add done_before argument to iomap_dio_rw Andreas Gruenbacher
@ 2021-08-03 19:18 ` Andreas Gruenbacher
  2021-08-03 19:18 ` [PATCH v5 12/12] gfs2: Fix mmap + page fault deadlocks for direct I/O Andreas Gruenbacher
  2021-08-03 19:45 ` [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks Linus Torvalds
  12 siblings, 0 replies; 23+ messages in thread
From: Andreas Gruenbacher @ 2021-08-03 19:18 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher

Introduce a new noio flag to indicate to get_user_pages to use the
FOLL_FAST_ONLY flag.  This will cause get_user_pages to fail when it
would otherwise fault in a page.

Currently, the noio flag is only checked in iov_iter_get_pages and
iov_iter_get_pages_alloc.  This is enough for iomaop_dio_rw, but it
may make sense to check for this flag in other contexts as well.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 include/linux/uio.h |  1 +
 lib/iov_iter.c      | 20 +++++++++++++++-----
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index ffa431aeb067..679e48454497 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -29,6 +29,7 @@ enum iter_type {
 
 struct iov_iter {
 	u8 iter_type;
+	bool noio;
 	bool data_source;
 	size_t iov_offset;
 	size_t count;
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 4ffc76801eaa..66f0c9362bb5 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -519,6 +519,7 @@ void iov_iter_init(struct iov_iter *i, unsigned int direction,
 	WARN_ON(direction & ~(READ | WRITE));
 	*i = (struct iov_iter) {
 		.iter_type = ITER_IOVEC,
+		.noio = false,
 		.data_source = direction,
 		.iov = iov,
 		.nr_segs = nr_segs,
@@ -1529,13 +1530,17 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 		return 0;
 
 	if (likely(iter_is_iovec(i))) {
+		unsigned int gup_flags = 0;
 		unsigned long addr;
 
+		if (iov_iter_rw(i) != WRITE)
+			gup_flags |= FOLL_WRITE;
+		if (i->noio)
+			gup_flags |= FOLL_FAST_ONLY;
+
 		addr = first_iovec_segment(i, &len, start, maxsize, maxpages);
 		n = DIV_ROUND_UP(len, PAGE_SIZE);
-		res = get_user_pages_fast(addr, n,
-				iov_iter_rw(i) != WRITE ?  FOLL_WRITE : 0,
-				pages);
+		res = get_user_pages_fast(addr, n, gup_flags, pages);
 		if (unlikely(res <= 0))
 			return res;
 		return (res == n ? len : res * PAGE_SIZE) - *start;
@@ -1651,15 +1656,20 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		return 0;
 
 	if (likely(iter_is_iovec(i))) {
+		unsigned int gup_flags = 0;
 		unsigned long addr;
 
+		if (iov_iter_rw(i) != WRITE)
+			gup_flags |= FOLL_WRITE;
+		if (i->noio)
+			gup_flags |= FOLL_FAST_ONLY;
+
 		addr = first_iovec_segment(i, &len, start, maxsize, ~0U);
 		n = DIV_ROUND_UP(len, PAGE_SIZE);
 		p = get_pages_array(n);
 		if (!p)
 			return -ENOMEM;
-		res = get_user_pages_fast(addr, n,
-				iov_iter_rw(i) != WRITE ?  FOLL_WRITE : 0, p);
+		res = get_user_pages_fast(addr, n, gup_flags, p);
 		if (unlikely(res <= 0)) {
 			kvfree(p);
 			*pages = NULL;
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 12/12] gfs2: Fix mmap + page fault deadlocks for direct I/O
  2021-08-03 19:18 [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (10 preceding siblings ...)
  2021-08-03 19:18 ` [PATCH v5 11/12] iov_iter: Introduce noio flag to disable page faults Andreas Gruenbacher
@ 2021-08-03 19:18 ` Andreas Gruenbacher
  2021-08-03 19:45 ` [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks Linus Torvalds
  12 siblings, 0 replies; 23+ messages in thread
From: Andreas Gruenbacher @ 2021-08-03 19:18 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher

Also disable page faults during direct I/O requests and implement the same
kind of retry logic as in the buffered I/O case.

Direct I/O requests differ from buffered I/O requests in that they use
bio_iov_iter_get_pages for grabbing page references and faulting in pages
instead of triggering physical page faults.  Those manual page faults can be
disabled with the new iocb->noio flag.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/file.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index d98f690097e2..ed42b7675551 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -782,21 +782,47 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
 	struct file *file = iocb->ki_filp;
 	struct gfs2_inode *ip = GFS2_I(file->f_mapping->host);
 	size_t count = iov_iter_count(to);
+	size_t written = 0;
 	ssize_t ret;
 
+	/*
+	 * In this function, we disable page faults when we're holding the
+	 * inode glock while doing I/O.  If a page fault occurs, we drop the
+	 * inode glock, fault in the pages manually, and retry.
+	 *
+	 * Unlike generic_file_read_iter, for reads, iomap_dio_rw can trigger
+	 * physical as well as manual page faults, and we need to disable both
+	 * kinds.
+	 */
+
 	if (!count)
 		return 0; /* skip atime */
 
 	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh);
+retry:
 	ret = gfs2_glock_nq(gh);
 	if (ret)
 		goto out_uninit;
 
-	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0, 0);
+	pagefault_disable();
+	to->noio = true;
+	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL,
+			   IOMAP_DIO_PARTIAL, written);
+	to->noio = false;
+	pagefault_enable();
+
 	gfs2_glock_dq(gh);
+	if (ret > 0)
+		written = ret;
+	if (unlikely(iov_iter_count(to) && (ret > 0 || ret == -EFAULT)) &&
+	    iter_is_iovec(to) &&
+	    fault_in_iov_iter_writeable(to, SIZE_MAX) != 0)
+		goto retry;
 out_uninit:
 	gfs2_holder_uninit(gh);
-	return ret;
+	if (ret < 0)
+		return ret;
+	return written;
 }
 
 static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
@@ -809,6 +835,15 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 	loff_t offset = iocb->ki_pos;
 	ssize_t ret;
 
+	/*
+	 * In this function, we disable page faults when we're holding the
+	 * inode glock while doing I/O.  If a page fault occurs, we drop the
+	 * inode glock, fault in the pages manually, and retry.
+	 *
+	 * For writes, iomap_dio_rw only triggers manual page faults, so we
+	 * don't need to disable physical ones.
+	 */
+
 	/*
 	 * Deferred lock, even if its a write, since we do no allocation on
 	 * this path. All we need to change is the atime, and this lock mode
@@ -818,6 +853,7 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 	 * VFS does.
 	 */
 	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh);
+retry:
 	ret = gfs2_glock_nq(gh);
 	if (ret)
 		goto out_uninit;
@@ -826,11 +862,18 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 	if (offset + len > i_size_read(&ip->i_inode))
 		goto out;
 
+	from->noio = true;
 	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0, 0);
+	from->noio = false;
+
 	if (ret == -ENOTBLK)
 		ret = 0;
 out:
 	gfs2_glock_dq(gh);
+	if (unlikely(ret == -EFAULT) &&
+	    iter_is_iovec(from) &&
+	    fault_in_iov_iter_readable(from, len) == len)
+		goto retry;
 out_uninit:
 	gfs2_holder_uninit(gh);
 	return ret;
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 03/12] Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable}
  2021-08-03 19:18 ` [PATCH v5 03/12] Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable} Andreas Gruenbacher
@ 2021-08-03 19:43   ` Linus Torvalds
  2021-08-03 20:57   ` Al Viro
  1 sibling, 0 replies; 23+ messages in thread
From: Linus Torvalds @ 2021-08-03 19:43 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Alexander Viro, Christoph Hellwig, Darrick J. Wong, Jan Kara,
	Matthew Wilcox, cluster-devel, linux-fsdevel,
	Linux Kernel Mailing List, ocfs2-devel

On Tue, Aug 3, 2021 at 12:18 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> Turn fault_in_pages_{readable,writeable} into versions that return the number
> of bytes faulted in instead of returning a non-zero value when any of the
> requested pages couldn't be faulted in.

Ugh. This ends up making most users have horribly nasty conditionals.

I think I suggested (or maybe that was just my internal voice and I
never actually vocalized it) using the same semantics as
"copy_to/from_user()" for fault_in_pages_{readable|writable}().

Namely to return the number of bytes *not* faulted in.

That makes it trivial to test "did I get everything" - becasue zero
means "nothing failed" and remains the "complete success" marker.

And it still allows for the (much less common) case of "ok, I need to
know about partial failures".

So instead of this horror:

-               if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
+               if (fault_in_writeable(buf_fx, fpu_user_xstate_size) ==
+                               fpu_user_xstate_size)

you'd just have

-               if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
+               if (!fault_in_writeable(buf_fx, fpu_user_xstate_size))

because zero would continue to be a marker of success.

                 Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks
  2021-08-03 19:18 [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (11 preceding siblings ...)
  2021-08-03 19:18 ` [PATCH v5 12/12] gfs2: Fix mmap + page fault deadlocks for direct I/O Andreas Gruenbacher
@ 2021-08-03 19:45 ` Linus Torvalds
  2021-08-16 19:14   ` Andreas Gruenbacher
  12 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2021-08-03 19:45 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Alexander Viro, Christoph Hellwig, Darrick J. Wong,
	Paul Mackerras, Jan Kara, Matthew Wilcox, cluster-devel,
	linux-fsdevel, Linux Kernel Mailing List, ocfs2-devel, kvm-ppc

On Tue, Aug 3, 2021 at 12:18 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
>
> With this patch queue, fstest generic/208 (aio-dio-invalidate-failure.c)
> endlessly spins in gfs2_file_direct_write.  It looks as if there's a bug
> in get_user_pages_fast when called with FOLL_FAST_ONLY:
>
>  (1) The test case performs an aio write into a 32 MB buffer.
>
>  (2) The buffer is initially not in memory, so when iomap_dio_rw() ->
>      ... -> bio_iov_iter_get_pages() is called with the iter->noio flag
>      set, we get to get_user_pages_fast() with FOLL_FAST_ONLY set.
>      get_user_pages_fast() returns 0, which causes
>      bio_iov_iter_get_pages to return -EFAULT.
>
>  (3) Then gfs2_file_direct_write faults in the entire buffer with
>      fault_in_iov_iter_readable(), which succeeds.
>
>  (4) With the buffer in memory, we retry the iomap_dio_rw() ->
>      ... -> bio_iov_iter_get_pages() -> ... -> get_user_pages_fast().
>      This should succeed now, but get_user_pages_fast() still returns 0.

Hmm. Have you tried to figure out why that "still returns 0" happens?

One option - for debugging only - would be to introduce a new flag to
get_user_pages_fast() tyhat says "print out reason if failed" and make
the retry (but not the original one) have that flag set.

There are a couple of things of note when it comes to "get_user_pages_fast()":

 (a) some architectures don't even enable it

 (b) it can be very picky about the page table contents, and wants the
accessed bit to already be set (or the dirty bit, in the case of a
write).

but (a) shouldn't be an issue on any common platform and (b) shouldn't
be an issue with  fault_in_iov_iter_readable() that actually does a
__get_user() so it will access through the page tables.

(It might be more of an issue with fault_in_iov_iter_writable() due to
walking the page tables by hand - if we don't do the proper
access/dirty setting, I could see get_user_pages_fast() failing).

Anyway, for reason (a) I do think that eventually we should probably
introduce FOLL_NOFAULT, and allow the full "slow" page table walk -
just not calling down to handle_mm_fault() if it fails.

But (a) should be a non-issue in your test environment, and so it
would be interesting to hear what it is that fails. Because scanning
through the patches, they all _look_ fine to me (apart from the one
comment about return values, which is more about being consistent with
copy_to/from_user() and making the code simpler - not about
correctness)

                       Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 03/12] Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable}
  2021-08-03 19:18 ` [PATCH v5 03/12] Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable} Andreas Gruenbacher
  2021-08-03 19:43   ` Linus Torvalds
@ 2021-08-03 20:57   ` Al Viro
  2021-08-03 21:38     ` Andreas Gruenbacher
  1 sibling, 1 reply; 23+ messages in thread
From: Al Viro @ 2021-08-03 20:57 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Linus Torvalds, Christoph Hellwig, Darrick J. Wong, Jan Kara,
	Matthew Wilcox, cluster-devel, linux-fsdevel, linux-kernel,
	ocfs2-devel

On Tue, Aug 03, 2021 at 09:18:09PM +0200, Andreas Gruenbacher wrote:
> Turn fault_in_pages_{readable,writeable} into versions that return the number
> of bytes faulted in instead of returning a non-zero value when any of the
> requested pages couldn't be faulted in.  This supports the existing users that
> require all pages to be faulted in, but also new users that are happy if any
> pages can be faulted in.
> 
> Neither of these functions is entirely trivial and it doesn't seem useful to
> inline them, so move them to mm/gup.c.
> 
> Rename the functions to fault_in_{readable,writeable} to make sure that code
> that uses them can be fixed instead of breaking silently.

Sigh...  We'd already discussed that; it's bloody pointless.  Making short
fault-in return success - absolutely.  Reporting exact number of bytes is
not going to be of any use to callers.

Please, don't.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 03/12] Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable}
  2021-08-03 20:57   ` Al Viro
@ 2021-08-03 21:38     ` Andreas Gruenbacher
  0 siblings, 0 replies; 23+ messages in thread
From: Andreas Gruenbacher @ 2021-08-03 21:38 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Christoph Hellwig, Darrick J. Wong, Jan Kara,
	Matthew Wilcox, cluster-devel, linux-fsdevel, LKML, ocfs2-devel

On Tue, Aug 3, 2021 at 10:57 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Aug 03, 2021 at 09:18:09PM +0200, Andreas Gruenbacher wrote:
> > Turn fault_in_pages_{readable,writeable} into versions that return the number
> > of bytes faulted in instead of returning a non-zero value when any of the
> > requested pages couldn't be faulted in.  This supports the existing users that
> > require all pages to be faulted in, but also new users that are happy if any
> > pages can be faulted in.
> >
> > Neither of these functions is entirely trivial and it doesn't seem useful to
> > inline them, so move them to mm/gup.c.
> >
> > Rename the functions to fault_in_{readable,writeable} to make sure that code
> > that uses them can be fixed instead of breaking silently.
>
> Sigh...  We'd already discussed that; it's bloody pointless.  Making short
> fault-in return success - absolutely.  Reporting exact number of bytes is
> not going to be of any use to callers.

I'm not actually convinced that you're right about this. Let's see
what we'll end up with; we can always strip things down in the end.

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 05/12] iov_iter: Introduce fault_in_iov_iter_writeable
  2021-08-03 19:18 ` [PATCH v5 05/12] iov_iter: Introduce fault_in_iov_iter_writeable Andreas Gruenbacher
@ 2021-08-04 15:24   ` Andreas Gruenbacher
  0 siblings, 0 replies; 23+ messages in thread
From: Andreas Gruenbacher @ 2021-08-04 15:24 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel, LKML,
	ocfs2-devel

On Tue, Aug 3, 2021 at 9:18 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> Introduce a new fault_in_iov_iter_writeable helper for safely faulting in an
> iterator for writing.  Uses get_user_pages() to fault in the pages without
> actually writing to them, which would be destructive.
>
> We'll use fault_in_iov_iter_writeable in gfs2 once we've determined that the
> iterator passed to .read_iter isn't in memory.
>
> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
> ---
>  include/linux/pagemap.h |  1 +
>  include/linux/uio.h     |  1 +
>  lib/iov_iter.c          | 41 +++++++++++++++++++++++++++
>  mm/gup.c                | 61 +++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 104 insertions(+)
>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 7c9edc9694d9..a629807edb8c 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -737,6 +737,7 @@ extern void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter);
>   * Fault in userspace address range.
>   */
>  size_t fault_in_writeable(char __user *uaddr, size_t size);
> +size_t fault_in_safe_writeable(const char __user *uaddr, size_t size);
>  size_t fault_in_readable(const char __user *uaddr, size_t size);
>
>  int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> diff --git a/include/linux/uio.h b/include/linux/uio.h
> index 12d30246c2e9..ffa431aeb067 100644
> --- a/include/linux/uio.h
> +++ b/include/linux/uio.h
> @@ -120,6 +120,7 @@ size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
>  void iov_iter_advance(struct iov_iter *i, size_t bytes);
>  void iov_iter_revert(struct iov_iter *i, size_t bytes);
>  size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t bytes);
> +size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t bytes);
>  size_t iov_iter_single_seg_count(const struct iov_iter *i);
>  size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
>                          struct iov_iter *i);
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index c0fa1618561c..4ffc76801eaa 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -471,6 +471,47 @@ size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t size)
>  }
>  EXPORT_SYMBOL(fault_in_iov_iter_readable);
>
> +/*
> + * fault_in_iov_iter_writeable - fault in iov iterator for writing
> + * @i: iterator
> + * @size: maximum length
> + *
> + * Faults in the iterator using get_user_pages(), i.e., without triggering
> + * hardware page faults.  This is primarily useful when we know that some or
> + * all of the pages in @i aren't in memory.
> + *
> + * Returns the number of bytes faulted in, or 0 if no bytes could be faulted in
> + * (i.e., because the address is invalid).
> + *
> + * Always returns the number of avaliable bytes for non-user space iterators.
> + */
> +size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t size)
> +{
> +       if (size > i->count)
> +               size = i->count;
> +
> +       if (iter_is_iovec(i)) {
> +               const struct iovec *p;
> +               size_t bytes = size;
> +               size_t skip;
> +
> +               for (p = i->iov, skip = i->iov_offset; bytes; p++, skip = 0) {
> +                       size_t len = min(bytes, p->iov_len - skip);
> +                       size_t ret;
> +
> +                       if (unlikely(!len))
> +                               continue;
> +                       ret = fault_in_safe_writeable(p->iov_base + skip, len);
> +                       bytes -= ret;
> +                       if (ret != len)
> +                               break;
> +               }
> +               return size - bytes;
> +       }
> +       return size;
> +}
> +EXPORT_SYMBOL(fault_in_iov_iter_writeable);
> +
>  void iov_iter_init(struct iov_iter *i, unsigned int direction,
>                         const struct iovec *iov, unsigned long nr_segs,
>                         size_t count)
> diff --git a/mm/gup.c b/mm/gup.c
> index d04984d5d93c..7218e27c2481 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1694,6 +1694,67 @@ size_t fault_in_writeable(char __user *uaddr, size_t size)
>  }
>  EXPORT_SYMBOL(fault_in_writeable);
>
> +/**
> + * fault_in_safe_writeable - fault in an address range for writing
> + * @uaddr: start of address range
> + * @size: length of address range
> + *
> + * Faults in an address range using get_user_pages, i.e., without triggering
> + * hardware page faults.  This is primarily useful when we know that some or
> + * all of the pages in the address range aren't in memory.
> + *
> + * Other than fault_in_writeable(), this function is non-destructive.
> + *
> + * Note that we don't pin or otherwise hold the pages referenced that we fault
> + * in.  There's no guarantee that they'll stay in memory for any duration of
> + * time.
> + *
> + * Returns the number of bytes faulted in from @uaddr.
> + */
> +size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
> +{
> +       unsigned long start = (unsigned long)uaddr;
> +       unsigned long end, nstart, nend;
> +       struct mm_struct *mm = current->mm;
> +       struct vm_area_struct *vma = NULL;
> +       int locked = 0;
> +
> +       /* FIXME: Protect against overflow! */
> +
> +       end = PAGE_ALIGN(start + size);
> +       for (nstart = start & PAGE_MASK; nstart < end; nstart = nend) {
> +               unsigned long nr_pages;
> +               long ret;
> +
> +               if (!locked) {
> +                       locked = 1;
> +                       mmap_read_lock(mm);
> +                       vma = find_vma(mm, nstart);
> +               } else if (nstart >= vma->vm_end)
> +                       vma = vma->vm_next;
> +               if (!vma || vma->vm_start >= end)
> +                       break;
> +               nend = min(end, vma->vm_end);
> +               if (vma->vm_flags & (VM_IO | VM_PFNMAP))
> +                       continue;

Shouldn't we disallow read()ing into those kinds of vmas? If we skip
over them here and then the actual write results in -EFAULT, we'll end
up in a loop.

> +               if (nstart < vma->vm_start)
> +                       nstart = vma->vm_start;

Likewise, shouldn't we fail for memory ranges not covered by a vma?

> +               nr_pages = (nend - nstart) / PAGE_SIZE;
> +               ret = __get_user_pages_locked(mm, nstart, nr_pages,
> +                                             NULL, NULL, &locked,
> +                                             FOLL_TOUCH | FOLL_WRITE);
> +               if (ret <= 0)
> +                       break;
> +               nend = nstart + ret * PAGE_SIZE;
> +       }
> +       if (locked)
> +               mmap_read_unlock(mm);
> +       if (nstart > start)
> +               return min(nstart - start, size);
> +       return 0;
> +}
> +EXPORT_SYMBOL(fault_in_safe_writeable);
> +
>  size_t fault_in_readable(const char __user *uaddr, size_t size)
>  {
>         const char __user *start = uaddr, *end;
> --
> 2.26.3
>

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks
  2021-08-03 19:45 ` [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks Linus Torvalds
@ 2021-08-16 19:14   ` Andreas Gruenbacher
  2021-08-18 21:49     ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Andreas Gruenbacher @ 2021-08-16 19:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexander Viro, Christoph Hellwig, Darrick J. Wong,
	Paul Mackerras, Jan Kara, Matthew Wilcox, cluster-devel,
	linux-fsdevel, Linux Kernel Mailing List, ocfs2-devel, kvm-ppc

On Tue, Aug 3, 2021 at 9:45 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Aug 3, 2021 at 12:18 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > With this patch queue, fstest generic/208 (aio-dio-invalidate-failure.c)
> > endlessly spins in gfs2_file_direct_write.  It looks as if there's a bug
> > in get_user_pages_fast when called with FOLL_FAST_ONLY:
> >
> >  (1) The test case performs an aio write into a 32 MB buffer.
> >
> >  (2) The buffer is initially not in memory, so when iomap_dio_rw() ->
> >      ... -> bio_iov_iter_get_pages() is called with the iter->noio flag
> >      set, we get to get_user_pages_fast() with FOLL_FAST_ONLY set.
> >      get_user_pages_fast() returns 0, which causes
> >      bio_iov_iter_get_pages to return -EFAULT.
> >
> >  (3) Then gfs2_file_direct_write faults in the entire buffer with
> >      fault_in_iov_iter_readable(), which succeeds.
> >
> >  (4) With the buffer in memory, we retry the iomap_dio_rw() ->
> >      ... -> bio_iov_iter_get_pages() -> ... -> get_user_pages_fast().
> >      This should succeed now, but get_user_pages_fast() still returns 0.
>
> Hmm. Have you tried to figure out why that "still returns 0" happens?

The call stack is:

gup_pte_range
gup_pmd_range
gup_pud_range
gup_p4d_range
gup_pgd_range
lockless_pages_from_mm
internal_get_user_pages_fast
get_user_pages_fast
iov_iter_get_pages
__bio_iov_iter_get_pages
bio_iov_iter_get_pages
iomap_dio_bio_actor
iomap_dio_actor
iomap_apply
iomap_dio_rw
gfs2_file_direct_write

In gup_pte_range, pte_special(pte) is true and so we return 0.

> One option - for debugging only - would be to introduce a new flag to
> get_user_pages_fast() that says "print out reason if failed" and make
> the retry (but not the original one) have that flag set.
>
> There are a couple of things of note when it comes to "get_user_pages_fast()":
>
>  (a) some architectures don't even enable it
>
>  (b) it can be very picky about the page table contents, and wants the
> accessed bit to already be set (or the dirty bit, in the case of a
> write).
>
> but (a) shouldn't be an issue on any common platform and (b) shouldn't
> be an issue with  fault_in_iov_iter_readable() that actually does a
> __get_user() so it will access through the page tables.
>
> (It might be more of an issue with fault_in_iov_iter_writable() due to
> walking the page tables by hand - if we don't do the proper
> access/dirty setting, I could see get_user_pages_fast() failing).
>
> Anyway, for reason (a) I do think that eventually we should probably
> introduce FOLL_NOFAULT, and allow the full "slow" page table walk -
> just not calling down to handle_mm_fault() if it fails.
>
> But (a) should be a non-issue in your test environment, and so it
> would be interesting to hear what it is that fails. Because scanning
> through the patches, they all _look_ fine to me (apart from the one
> comment about return values, which is more about being consistent with
> copy_to/from_user() and making the code simpler - not about
> correctness)
>
>                        Linus
>

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks
  2021-08-16 19:14   ` Andreas Gruenbacher
@ 2021-08-18 21:49     ` Linus Torvalds
  2021-08-19 19:40       ` Andreas Gruenbacher
  0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2021-08-18 21:49 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Alexander Viro, Christoph Hellwig, Darrick J. Wong,
	Paul Mackerras, Jan Kara, Matthew Wilcox, cluster-devel,
	linux-fsdevel, Linux Kernel Mailing List, ocfs2-devel, kvm-ppc

[ Sorry for the delay, I was on the road and this fell through the cracks ]

On Mon, Aug 16, 2021 at 12:14 PM Andreas Gruenbacher
<agruenba@redhat.com> wrote:
>
> On Tue, Aug 3, 2021 at 9:45 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Hmm. Have you tried to figure out why that "still returns 0" happens?
>
> The call stack is:
>
> gup_pte_range
> gup_pmd_range
> gup_pud_range
> gup_p4d_range
> gup_pgd_range
> lockless_pages_from_mm
> internal_get_user_pages_fast
> get_user_pages_fast
> iov_iter_get_pages
> __bio_iov_iter_get_pages
> bio_iov_iter_get_pages
> iomap_dio_bio_actor
> iomap_dio_actor
> iomap_apply
> iomap_dio_rw
> gfs2_file_direct_write
>
> In gup_pte_range, pte_special(pte) is true and so we return 0.

Ok, so that is indeed something that the fast-case can't handle,
because some of the special code wants to have the mm_lock so that it
can look at the vma flags (eg "vm_normal_page()" and friends.

That said, some of these cases even the full GUP won't ever handle,
simply because a mapping doesn't necessarily even _have_ a 'struct
page' associated with it if it's a VM_IO mapping.

So it turns out that you can't just always do
fault_in_iov_iter_readable() and then assume that you can do
iov_iter_get_pages() and repeat until successful.

We could certainly make get_user_pages_fast() handle a few more cases,
but I get the feeling that we need to have separate error cases for
EFAULT - no page exists - and the "page exists, but cannot be mapped
as a 'struct page'" case.

I also do still think that even regardless of that, we want to just
add a FOLL_NOFAULT flag that just disables calling handle_mm_fault(),
and then you can use the regular get_user_pages().

That at least gives us the full _normal_ page handling stuff.

                   Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks
  2021-08-18 21:49     ` Linus Torvalds
@ 2021-08-19 19:40       ` Andreas Gruenbacher
  2021-08-19 20:14         ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Andreas Gruenbacher @ 2021-08-19 19:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexander Viro, Christoph Hellwig, Darrick J. Wong,
	Paul Mackerras, Jan Kara, Matthew Wilcox, cluster-devel,
	linux-fsdevel, Linux Kernel Mailing List, ocfs2-devel, kvm-ppc

On Wed, Aug 18, 2021 at 11:49 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> [ Sorry for the delay, I was on the road and this fell through the cracks ]

No harm done, I was busy enough implementing your previous suggestions.

> On Mon, Aug 16, 2021 at 12:14 PM Andreas Gruenbacher
> <agruenba@redhat.com> wrote:
> >
> > On Tue, Aug 3, 2021 at 9:45 PM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > Hmm. Have you tried to figure out why that "still returns 0" happens?
> >
> > The call stack is:
> >
> > gup_pte_range
> > gup_pmd_range
> > gup_pud_range
> > gup_p4d_range
> > gup_pgd_range
> > lockless_pages_from_mm
> > internal_get_user_pages_fast
> > get_user_pages_fast
> > iov_iter_get_pages
> > __bio_iov_iter_get_pages
> > bio_iov_iter_get_pages
> > iomap_dio_bio_actor
> > iomap_dio_actor
> > iomap_apply
> > iomap_dio_rw
> > gfs2_file_direct_write
> >
> > In gup_pte_range, pte_special(pte) is true and so we return 0.
>
> Ok, so that is indeed something that the fast-case can't handle,
> because some of the special code wants to have the mm_lock so that it
> can look at the vma flags (eg "vm_normal_page()" and friends.
>
> That said, some of these cases even the full GUP won't ever handle,
> simply because a mapping doesn't necessarily even _have_ a 'struct
> page' associated with it if it's a VM_IO mapping.
>
> So it turns out that you can't just always do
> fault_in_iov_iter_readable() and then assume that you can do
> iov_iter_get_pages() and repeat until successful.
>
> We could certainly make get_user_pages_fast() handle a few more cases,
> but I get the feeling that we need to have separate error cases for
> EFAULT - no page exists - and the "page exists, but cannot be mapped
> as a 'struct page'" case.

Hmm, what if GUP is made to skip VM_IO vmas without adding anything to
the pages array? That would match fault_in_iov_iter_writeable, which
is modeled after __mm_populate and which skips VM_IO and VM_PFNMAP
vmas.

The other strategy I've added is to scale back the page fault windows
to a single page if faulting in multiple pages didn't help, and to
give up if the I/O operation still fails after that. So pathological
cases won't loop indefinitely anymore at least.

> I also do still think that even regardless of that, we want to just
> add a FOLL_NOFAULT flag that just disables calling handle_mm_fault(),
> and then you can use the regular get_user_pages().
>
> That at least gives us the full _normal_ page handling stuff.

And it does fix the generic/208 failure.

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks
  2021-08-19 19:40       ` Andreas Gruenbacher
@ 2021-08-19 20:14         ` Linus Torvalds
  2021-08-19 21:39           ` Andreas Gruenbacher
  0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2021-08-19 20:14 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Alexander Viro, Christoph Hellwig, Darrick J. Wong,
	Paul Mackerras, Jan Kara, Matthew Wilcox, cluster-devel,
	linux-fsdevel, Linux Kernel Mailing List, ocfs2-devel, kvm-ppc

On Thu, Aug 19, 2021 at 12:41 PM Andreas Gruenbacher
<agruenba@redhat.com> wrote:
>
> Hmm, what if GUP is made to skip VM_IO vmas without adding anything to
> the pages array? That would match fault_in_iov_iter_writeable, which
> is modeled after __mm_populate and which skips VM_IO and VM_PFNMAP
> vmas.

I don't understand what you mean.. GUP already skips VM_IO (and
VM_PFNMAP) pages. It just returns EFAULT.

We could make it return another error. We already have DAX and
FOLL_LONGTERM returning -EOPNOTSUPP.

Of course, I think some code ends up always just returning "number of
pages looked up" and might return 0 for "no pages" rather than the
error for the first page.

So we may end up having interfaces that then lose that explanation
error code, but I didn't check.

But we couldn't make it just say "skip them and try later addresses",
if that is what you meant. THAT makes no sense - that would just make
GUP look up some other address than what was asked for.

> > I also do still think that even regardless of that, we want to just
> > add a FOLL_NOFAULT flag that just disables calling handle_mm_fault(),
> > and then you can use the regular get_user_pages().
> >
> > That at least gives us the full _normal_ page handling stuff.
>
> And it does fix the generic/208 failure.

Good. So I think the approach is usable, even if we might have corner
cases left.

So I think the remaining issue is exactly things like VM_IO and
VM_PFNMAP. Do the fstests have test-cases for things like this? It
_is_ quite specialized, it might be a good idea to have that.

Of course, doing direct-IO from special memory regions with zerocopy
might be something special people actually want to do. But I think
we've had that VM_IO flag testing there basically forever, so I don't
think it has ever worked (for some definition of "ever").

            Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks
  2021-08-19 20:14         ` Linus Torvalds
@ 2021-08-19 21:39           ` Andreas Gruenbacher
  0 siblings, 0 replies; 23+ messages in thread
From: Andreas Gruenbacher @ 2021-08-19 21:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexander Viro, Christoph Hellwig, Darrick J. Wong,
	Paul Mackerras, Jan Kara, Matthew Wilcox, cluster-devel,
	linux-fsdevel, Linux Kernel Mailing List, ocfs2-devel, kvm-ppc

On Thu, Aug 19, 2021 at 10:14 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Aug 19, 2021 at 12:41 PM Andreas Gruenbacher
> <agruenba@redhat.com> wrote:
> >
> > Hmm, what if GUP is made to skip VM_IO vmas without adding anything to
> > the pages array? That would match fault_in_iov_iter_writeable, which
> > is modeled after __mm_populate and which skips VM_IO and VM_PFNMAP
> > vmas.
>
> I don't understand what you mean.. GUP already skips VM_IO (and
> VM_PFNMAP) pages. It just returns EFAULT.
>
> We could make it return another error. We already have DAX and
> FOLL_LONGTERM returning -EOPNOTSUPP.
>
> Of course, I think some code ends up always just returning "number of
> pages looked up" and might return 0 for "no pages" rather than the
> error for the first page.
>
> So we may end up having interfaces that then lose that explanation
> error code, but I didn't check.
>
> But we couldn't make it just say "skip them and try later addresses",
> if that is what you meant. THAT makes no sense - that would just make
> GUP look up some other address than what was asked for.

get_user_pages has a start and a nr_pages argument, which specifies an
address range from start to start + nr_pages * PAGE_SIZE. If pages !=
NULL, it adds a pointer to that array for each PAGE_SIZE subpage. I
was thinking of skipping over VM_IO vmas in that process, so when the
range starts in a mappable vma, runs into a VM_IO vma, and ends in a
mappable vma, the pages in the pages array would be discontiguous;
they would only cover the mappable vmas. But that would make it
difficult to make sense of what's in the pages array. So scratch that
idea.

> > > I also do still think that even regardless of that, we want to just
> > > add a FOLL_NOFAULT flag that just disables calling handle_mm_fault(),
> > > and then you can use the regular get_user_pages().
> > >
> > > That at least gives us the full _normal_ page handling stuff.
> >
> > And it does fix the generic/208 failure.
>
> Good. So I think the approach is usable, even if we might have corner
> cases left.
>
> So I think the remaining issue is exactly things like VM_IO and
> VM_PFNMAP. Do the fstests have test-cases for things like this? It
> _is_ quite specialized, it might be a good idea to have that.
>
> Of course, doing direct-IO from special memory regions with zerocopy
> might be something special people actually want to do. But I think
> we've had that VM_IO flag testing there basically forever, so I don't
> think it has ever worked (for some definition of "ever").

The v6 patch queue should handle those cases acceptably well for now,
but I don't think we have tests covering that at all.

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2021-08-19 21:40 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-03 19:18 [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
2021-08-03 19:18 ` [PATCH v5 01/12] iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value Andreas Gruenbacher
2021-08-03 19:18 ` [PATCH v5 02/12] powerpc/kvm: Fix kvm_use_magic_page Andreas Gruenbacher
2021-08-03 19:18 ` [PATCH v5 03/12] Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable} Andreas Gruenbacher
2021-08-03 19:43   ` Linus Torvalds
2021-08-03 20:57   ` Al Viro
2021-08-03 21:38     ` Andreas Gruenbacher
2021-08-03 19:18 ` [PATCH v5 04/12] Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable Andreas Gruenbacher
2021-08-03 19:18 ` [PATCH v5 05/12] iov_iter: Introduce fault_in_iov_iter_writeable Andreas Gruenbacher
2021-08-04 15:24   ` Andreas Gruenbacher
2021-08-03 19:18 ` [PATCH v5 06/12] gfs2: Add wrapper for iomap_file_buffered_write Andreas Gruenbacher
2021-08-03 19:18 ` [PATCH v5 07/12] gfs2: Fix mmap + page fault deadlocks for buffered I/O Andreas Gruenbacher
2021-08-03 19:18 ` [PATCH v5 08/12] iomap: Fix iomap_dio_rw return value for user copies Andreas Gruenbacher
2021-08-03 19:18 ` [PATCH v5 09/12] iomap: Support restarting direct I/O requests after user copy failures Andreas Gruenbacher
2021-08-03 19:18 ` [PATCH v5 10/12] iomap: Add done_before argument to iomap_dio_rw Andreas Gruenbacher
2021-08-03 19:18 ` [PATCH v5 11/12] iov_iter: Introduce noio flag to disable page faults Andreas Gruenbacher
2021-08-03 19:18 ` [PATCH v5 12/12] gfs2: Fix mmap + page fault deadlocks for direct I/O Andreas Gruenbacher
2021-08-03 19:45 ` [PATCH v5 00/12] gfs2: Fix mmap + page fault deadlocks Linus Torvalds
2021-08-16 19:14   ` Andreas Gruenbacher
2021-08-18 21:49     ` Linus Torvalds
2021-08-19 19:40       ` Andreas Gruenbacher
2021-08-19 20:14         ` Linus Torvalds
2021-08-19 21:39           ` Andreas Gruenbacher

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).