ocfs2-devel.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
* [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
@ 2021-10-19 13:41 Andreas Gruenbacher
  2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 01/17] iov_iter: Fix iov_iter_get_pages{, _alloc} page fault return value Andreas Gruenbacher
                   ` (17 more replies)
  0 siblings, 18 replies; 47+ messages in thread
From: Andreas Gruenbacher @ 2021-10-19 13:41 UTC (permalink / raw)
  To: Linus Torvalds, Catalin Marinas, Paul Mackerras
  Cc: kvm-ppc, cluster-devel, Jan Kara, Andreas Gruenbacher,
	linux-kernel, Christoph Hellwig, Alexander Viro, linux-fsdevel,
	linux-btrfs, ocfs2-devel

Hi Linus and all,

here's another update of this patch queue on top of v5.15-rc5.  Changes:

 * Rebase on top of v5.15-rc5.

 * Bump iocb->ki_pos in gfs2_file_buffered_write to make sure
   partial writes go to the correct file offset.

 * Some comment and commit message improvements.


I've pushed this here:

  https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git/log/?h=for-next.mmap-fault
  28ea41433945ed1d442d2a131d1bcbfa24646c6f


>From my point of view, the following questions remain:

 * I hope these patches will be merged for v5.16, but what process
   should I follow for that?  The patch queue contains mm and iomap
   changes, so a pull request from the gfs2 tree would be unusual.

 * Will Catalin Marinas's work for supporting arm64 sub-page faults
   be queued behind these patches?  We have an overlap in
   fault_in_[pages_]readable fault_in_[pages_]writeable, so one of
   the two patch queues will need some adjustments.


Thanks,
Andreas


Andreas Gruenbacher (16):
  iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value
  powerpc/kvm: Fix kvm_use_magic_page
  gup: Turn fault_in_pages_{readable,writeable} into
    fault_in_{readable,writeable}
  iov_iter: Turn iov_iter_fault_in_readable into
    fault_in_iov_iter_readable
  iov_iter: Introduce fault_in_iov_iter_writeable
  gfs2: Add wrapper for iomap_file_buffered_write
  gfs2: Clean up function may_grant
  gfs2: Move the inode glock locking to gfs2_file_buffered_write
  gfs2: Eliminate ip->i_gh
  gfs2: Fix mmap + page fault deadlocks for buffered I/O
  iomap: Fix iomap_dio_rw return value for user copies
  iomap: Support partial direct I/O on user copy failures
  iomap: Add done_before argument to iomap_dio_rw
  gup: Introduce FOLL_NOFAULT flag to disable page faults
  iov_iter: Introduce nofault flag to disable page faults
  gfs2: Fix mmap + page fault deadlocks for direct I/O

Bob Peterson (1):
  gfs2: Introduce flag for glock holder auto-demotion

 arch/powerpc/kernel/kvm.c           |   3 +-
 arch/powerpc/kernel/signal_32.c     |   4 +-
 arch/powerpc/kernel/signal_64.c     |   2 +-
 arch/x86/kernel/fpu/signal.c        |   7 +-
 drivers/gpu/drm/armada/armada_gem.c |   7 +-
 fs/btrfs/file.c                     |   7 +-
 fs/btrfs/ioctl.c                    |   5 +-
 fs/erofs/data.c                     |   2 +-
 fs/ext4/file.c                      |   5 +-
 fs/f2fs/file.c                      |   2 +-
 fs/fuse/file.c                      |   2 +-
 fs/gfs2/bmap.c                      |  60 +----
 fs/gfs2/file.c                      | 256 +++++++++++++++++++--
 fs/gfs2/glock.c                     | 331 +++++++++++++++++++++-------
 fs/gfs2/glock.h                     |  20 ++
 fs/gfs2/incore.h                    |   4 +-
 fs/iomap/buffered-io.c              |   2 +-
 fs/iomap/direct-io.c                |  21 +-
 fs/ntfs/file.c                      |   2 +-
 fs/ntfs3/file.c                     |   2 +-
 fs/xfs/xfs_file.c                   |   6 +-
 fs/zonefs/super.c                   |   4 +-
 include/linux/iomap.h               |  11 +-
 include/linux/mm.h                  |   3 +-
 include/linux/pagemap.h             |  58 +----
 include/linux/uio.h                 |   4 +-
 lib/iov_iter.c                      | 103 +++++++--
 mm/filemap.c                        |   4 +-
 mm/gup.c                            | 139 +++++++++++-
 29 files changed, 790 insertions(+), 286 deletions(-)

-- 
2.26.3


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [Ocfs2-devel] [PATCH v8 01/17] iov_iter: Fix iov_iter_get_pages{, _alloc} page fault return value
  2021-10-19 13:41 [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
@ 2021-10-19 13:41 ` Andreas Gruenbacher
  2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 02/17] powerpc/kvm: Fix kvm_use_magic_page Andreas Gruenbacher
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 47+ messages in thread
From: Andreas Gruenbacher @ 2021-10-19 13:41 UTC (permalink / raw)
  To: Linus Torvalds, Catalin Marinas
  Cc: kvm-ppc, Paul Mackerras, cluster-devel, Jan Kara,
	Andreas Gruenbacher, linux-kernel, Christoph Hellwig,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

Both iov_iter_get_pages and iov_iter_get_pages_alloc return the number
of bytes of the iovec they could get the pages for.  When they cannot
get any pages, they're supposed to return 0, but when the start of the
iovec isn't page aligned, the calculation goes wrong and they return a
negative value.  Fix both functions.

In addition, change iov_iter_get_pages_alloc to return NULL in that case
to prevent resource leaks.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 lib/iov_iter.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 755c10c5138c..60b5e6edfbaa 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1488,7 +1488,7 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 		res = get_user_pages_fast(addr, n,
 				iov_iter_rw(i) != WRITE ?  FOLL_WRITE : 0,
 				pages);
-		if (unlikely(res < 0))
+		if (unlikely(res <= 0))
 			return res;
 		return (res == n ? len : res * PAGE_SIZE) - *start;
 	}
@@ -1612,8 +1612,9 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 			return -ENOMEM;
 		res = get_user_pages_fast(addr, n,
 				iov_iter_rw(i) != WRITE ?  FOLL_WRITE : 0, p);
-		if (unlikely(res < 0)) {
+		if (unlikely(res <= 0)) {
 			kvfree(p);
+			*pages = NULL;
 			return res;
 		}
 		*pages = p;
-- 
2.26.3


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Ocfs2-devel] [PATCH v8 02/17] powerpc/kvm: Fix kvm_use_magic_page
  2021-10-19 13:41 [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
  2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 01/17] iov_iter: Fix iov_iter_get_pages{, _alloc} page fault return value Andreas Gruenbacher
@ 2021-10-19 13:41 ` Andreas Gruenbacher
  2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 03/17] gup: Turn fault_in_pages_{readable, writeable} into fault_in_{readable, writeable} Andreas Gruenbacher
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 47+ messages in thread
From: Andreas Gruenbacher @ 2021-10-19 13:41 UTC (permalink / raw)
  To: Linus Torvalds, Catalin Marinas, Paul Mackerras
  Cc: kvm-ppc, cluster-devel, Jan Kara, Andreas Gruenbacher,
	linux-kernel, Christoph Hellwig, Alexander Viro, linux-fsdevel,
	stable, linux-btrfs, ocfs2-devel

When switching from __get_user to fault_in_pages_readable, commit
9f9eae5ce717 broke kvm_use_magic_page: like __get_user,
fault_in_pages_readable returns 0 on success.

Fixes: 9f9eae5ce717 ("powerpc/kvm: Prefer fault_in_pages_readable function")
Cc: stable@vger.kernel.org # v4.18+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 arch/powerpc/kernel/kvm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c
index 617eba82531c..d89cf802d9aa 100644
--- a/arch/powerpc/kernel/kvm.c
+++ b/arch/powerpc/kernel/kvm.c
@@ -669,7 +669,7 @@ static void __init kvm_use_magic_page(void)
 	on_each_cpu(kvm_map_magic_page, &features, 1);
 
 	/* Quick self-test to see if the mapping works */
-	if (!fault_in_pages_readable((const char *)KVM_MAGIC_PAGE, sizeof(u32))) {
+	if (fault_in_pages_readable((const char *)KVM_MAGIC_PAGE, sizeof(u32))) {
 		kvm_patching_worked = false;
 		return;
 	}
-- 
2.26.3


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Ocfs2-devel] [PATCH v8 03/17] gup: Turn fault_in_pages_{readable, writeable} into fault_in_{readable, writeable}
  2021-10-19 13:41 [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
  2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 01/17] iov_iter: Fix iov_iter_get_pages{, _alloc} page fault return value Andreas Gruenbacher
  2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 02/17] powerpc/kvm: Fix kvm_use_magic_page Andreas Gruenbacher
@ 2021-10-19 13:41 ` Andreas Gruenbacher
  2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 04/17] iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable Andreas Gruenbacher
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 47+ messages in thread
From: Andreas Gruenbacher @ 2021-10-19 13:41 UTC (permalink / raw)
  To: Linus Torvalds, Catalin Marinas
  Cc: kvm-ppc, Paul Mackerras, cluster-devel, Jan Kara,
	Andreas Gruenbacher, linux-kernel, Christoph Hellwig,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

Turn fault_in_pages_{readable,writeable} into versions that return the
number of bytes not faulted in, similar to copy_to_user, instead of
returning a non-zero value when any of the requested pages couldn't be
faulted in.  This supports the existing users that require all pages to
be faulted in as well as new users that are happy if any pages can be
faulted in.

Rename the functions to fault_in_{readable,writeable} to make sure
this change doesn't silently break things.

Neither of these functions is entirely trivial and it doesn't seem
useful to inline them, so move them to mm/gup.c.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 arch/powerpc/kernel/kvm.c           |  3 +-
 arch/powerpc/kernel/signal_32.c     |  4 +-
 arch/powerpc/kernel/signal_64.c     |  2 +-
 arch/x86/kernel/fpu/signal.c        |  7 ++-
 drivers/gpu/drm/armada/armada_gem.c |  7 ++-
 fs/btrfs/ioctl.c                    |  5 +-
 include/linux/pagemap.h             | 57 ++---------------------
 lib/iov_iter.c                      | 10 ++--
 mm/filemap.c                        |  2 +-
 mm/gup.c                            | 72 +++++++++++++++++++++++++++++
 10 files changed, 93 insertions(+), 76 deletions(-)

diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c
index d89cf802d9aa..6568823cf306 100644
--- a/arch/powerpc/kernel/kvm.c
+++ b/arch/powerpc/kernel/kvm.c
@@ -669,7 +669,8 @@ static void __init kvm_use_magic_page(void)
 	on_each_cpu(kvm_map_magic_page, &features, 1);
 
 	/* Quick self-test to see if the mapping works */
-	if (fault_in_pages_readable((const char *)KVM_MAGIC_PAGE, sizeof(u32))) {
+	if (fault_in_readable((const char __user *)KVM_MAGIC_PAGE,
+			      sizeof(u32))) {
 		kvm_patching_worked = false;
 		return;
 	}
diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index 0608581967f0..38c3eae40c14 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -1048,7 +1048,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx,
 	if (new_ctx == NULL)
 		return 0;
 	if (!access_ok(new_ctx, ctx_size) ||
-	    fault_in_pages_readable((u8 __user *)new_ctx, ctx_size))
+	    fault_in_readable((char __user *)new_ctx, ctx_size))
 		return -EFAULT;
 
 	/*
@@ -1237,7 +1237,7 @@ SYSCALL_DEFINE3(debug_setcontext, struct ucontext __user *, ctx,
 #endif
 
 	if (!access_ok(ctx, sizeof(*ctx)) ||
-	    fault_in_pages_readable((u8 __user *)ctx, sizeof(*ctx)))
+	    fault_in_readable((char __user *)ctx, sizeof(*ctx)))
 		return -EFAULT;
 
 	/*
diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 1831bba0582e..9f471b4a11e3 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -688,7 +688,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx,
 	if (new_ctx == NULL)
 		return 0;
 	if (!access_ok(new_ctx, ctx_size) ||
-	    fault_in_pages_readable((u8 __user *)new_ctx, ctx_size))
+	    fault_in_readable((char __user *)new_ctx, ctx_size))
 		return -EFAULT;
 
 	/*
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index fa17a27390ab..164c96434704 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -205,7 +205,7 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size)
 	fpregs_unlock();
 
 	if (ret) {
-		if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
+		if (!fault_in_writeable(buf_fx, fpu_user_xstate_size))
 			goto retry;
 		return -EFAULT;
 	}
@@ -278,10 +278,9 @@ static int restore_fpregs_from_user(void __user *buf, u64 xrestore,
 		if (ret != -EFAULT)
 			return -EINVAL;
 
-		ret = fault_in_pages_readable(buf, size);
-		if (!ret)
+		if (!fault_in_readable(buf, size))
 			goto retry;
-		return ret;
+		return -EFAULT;
 	}
 
 	/*
diff --git a/drivers/gpu/drm/armada/armada_gem.c b/drivers/gpu/drm/armada/armada_gem.c
index 21909642ee4c..8fbb25913327 100644
--- a/drivers/gpu/drm/armada/armada_gem.c
+++ b/drivers/gpu/drm/armada/armada_gem.c
@@ -336,7 +336,7 @@ int armada_gem_pwrite_ioctl(struct drm_device *dev, void *data,
 	struct drm_armada_gem_pwrite *args = data;
 	struct armada_gem_object *dobj;
 	char __user *ptr;
-	int ret;
+	int ret = 0;
 
 	DRM_DEBUG_DRIVER("handle %u off %u size %u ptr 0x%llx\n",
 		args->handle, args->offset, args->size, args->ptr);
@@ -349,9 +349,8 @@ int armada_gem_pwrite_ioctl(struct drm_device *dev, void *data,
 	if (!access_ok(ptr, args->size))
 		return -EFAULT;
 
-	ret = fault_in_pages_readable(ptr, args->size);
-	if (ret)
-		return ret;
+	if (fault_in_readable(ptr, args->size))
+		return -EFAULT;
 
 	dobj = armada_gem_object_lookup(file, args->handle);
 	if (dobj == NULL)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index cc61813213d8..c0739f0af634 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2261,9 +2261,8 @@ static noinline int search_ioctl(struct inode *inode,
 	key.offset = sk->min_offset;
 
 	while (1) {
-		ret = fault_in_pages_writeable(ubuf + sk_offset,
-					       *buf_size - sk_offset);
-		if (ret)
+		ret = -EFAULT;
+		if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset))
 			break;
 
 		ret = btrfs_search_forward(root, &key, path, sk->min_transid);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 62db6b0176b9..9fe94f7a4f7e 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -733,61 +733,10 @@ int wait_on_page_private_2_killable(struct page *page);
 extern void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter);
 
 /*
- * Fault everything in given userspace address range in.
+ * Fault in userspace address range.
  */
-static inline int fault_in_pages_writeable(char __user *uaddr, size_t size)
-{
-	char __user *end = uaddr + size - 1;
-
-	if (unlikely(size == 0))
-		return 0;
-
-	if (unlikely(uaddr > end))
-		return -EFAULT;
-	/*
-	 * Writing zeroes into userspace here is OK, because we know that if
-	 * the zero gets there, we'll be overwriting it.
-	 */
-	do {
-		if (unlikely(__put_user(0, uaddr) != 0))
-			return -EFAULT;
-		uaddr += PAGE_SIZE;
-	} while (uaddr <= end);
-
-	/* Check whether the range spilled into the next page. */
-	if (((unsigned long)uaddr & PAGE_MASK) ==
-			((unsigned long)end & PAGE_MASK))
-		return __put_user(0, end);
-
-	return 0;
-}
-
-static inline int fault_in_pages_readable(const char __user *uaddr, size_t size)
-{
-	volatile char c;
-	const char __user *end = uaddr + size - 1;
-
-	if (unlikely(size == 0))
-		return 0;
-
-	if (unlikely(uaddr > end))
-		return -EFAULT;
-
-	do {
-		if (unlikely(__get_user(c, uaddr) != 0))
-			return -EFAULT;
-		uaddr += PAGE_SIZE;
-	} while (uaddr <= end);
-
-	/* Check whether the range spilled into the next page. */
-	if (((unsigned long)uaddr & PAGE_MASK) ==
-			((unsigned long)end & PAGE_MASK)) {
-		return __get_user(c, end);
-	}
-
-	(void)c;
-	return 0;
-}
+size_t fault_in_writeable(char __user *uaddr, size_t size);
+size_t fault_in_readable(const char __user *uaddr, size_t size);
 
 int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 60b5e6edfbaa..c88908f0f138 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -191,7 +191,7 @@ static size_t copy_page_to_iter_iovec(struct page *page, size_t offset, size_t b
 	buf = iov->iov_base + skip;
 	copy = min(bytes, iov->iov_len - skip);
 
-	if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_pages_writeable(buf, copy)) {
+	if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_writeable(buf, copy)) {
 		kaddr = kmap_atomic(page);
 		from = kaddr + offset;
 
@@ -275,7 +275,7 @@ static size_t copy_page_from_iter_iovec(struct page *page, size_t offset, size_t
 	buf = iov->iov_base + skip;
 	copy = min(bytes, iov->iov_len - skip);
 
-	if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_pages_readable(buf, copy)) {
+	if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_readable(buf, copy)) {
 		kaddr = kmap_atomic(page);
 		to = kaddr + offset;
 
@@ -446,13 +446,11 @@ int iov_iter_fault_in_readable(const struct iov_iter *i, size_t bytes)
 			bytes = i->count;
 		for (p = i->iov, skip = i->iov_offset; bytes; p++, skip = 0) {
 			size_t len = min(bytes, p->iov_len - skip);
-			int err;
 
 			if (unlikely(!len))
 				continue;
-			err = fault_in_pages_readable(p->iov_base + skip, len);
-			if (unlikely(err))
-				return err;
+			if (fault_in_readable(p->iov_base + skip, len))
+				return -EFAULT;
 			bytes -= len;
 		}
 	}
diff --git a/mm/filemap.c b/mm/filemap.c
index dae481293b5d..ff34f4087f87 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -90,7 +90,7 @@
  *      ->lock_page		(filemap_fault, access_process_vm)
  *
  *  ->i_rwsem			(generic_perform_write)
- *    ->mmap_lock		(fault_in_pages_readable->do_page_fault)
+ *    ->mmap_lock		(fault_in_readable->do_page_fault)
  *
  *  bdi->wb.list_lock
  *    sb_lock			(fs/fs-writeback.c)
diff --git a/mm/gup.c b/mm/gup.c
index 886d6148d3d0..a7efb027d6cf 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1656,6 +1656,78 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
 }
 #endif /* !CONFIG_MMU */
 
+/**
+ * fault_in_writeable - fault in userspace address range for writing
+ * @uaddr: start of address range
+ * @size: size of address range
+ *
+ * Returns the number of bytes not faulted in (like copy_to_user() and
+ * copy_from_user()).
+ */
+size_t fault_in_writeable(char __user *uaddr, size_t size)
+{
+	char __user *start = uaddr, *end;
+
+	if (unlikely(size == 0))
+		return 0;
+	if (!PAGE_ALIGNED(uaddr)) {
+		if (unlikely(__put_user(0, uaddr) != 0))
+			return size;
+		uaddr = (char __user *)PAGE_ALIGN((unsigned long)uaddr);
+	}
+	end = (char __user *)PAGE_ALIGN((unsigned long)start + size);
+	if (unlikely(end < start))
+		end = NULL;
+	while (uaddr != end) {
+		if (unlikely(__put_user(0, uaddr) != 0))
+			goto out;
+		uaddr += PAGE_SIZE;
+	}
+
+out:
+	if (size > uaddr - start)
+		return size - (uaddr - start);
+	return 0;
+}
+EXPORT_SYMBOL(fault_in_writeable);
+
+/**
+ * fault_in_readable - fault in userspace address range for reading
+ * @uaddr: start of user address range
+ * @size: size of user address range
+ *
+ * Returns the number of bytes not faulted in (like copy_to_user() and
+ * copy_from_user()).
+ */
+size_t fault_in_readable(const char __user *uaddr, size_t size)
+{
+	const char __user *start = uaddr, *end;
+	volatile char c;
+
+	if (unlikely(size == 0))
+		return 0;
+	if (!PAGE_ALIGNED(uaddr)) {
+		if (unlikely(__get_user(c, uaddr) != 0))
+			return size;
+		uaddr = (const char __user *)PAGE_ALIGN((unsigned long)uaddr);
+	}
+	end = (const char __user *)PAGE_ALIGN((unsigned long)start + size);
+	if (unlikely(end < start))
+		end = NULL;
+	while (uaddr != end) {
+		if (unlikely(__get_user(c, uaddr) != 0))
+			goto out;
+		uaddr += PAGE_SIZE;
+	}
+
+out:
+	(void)c;
+	if (size > uaddr - start)
+		return size - (uaddr - start);
+	return 0;
+}
+EXPORT_SYMBOL(fault_in_readable);
+
 /**
  * get_dump_page() - pin user page in memory while writing it to core dump
  * @addr: user address
-- 
2.26.3


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Ocfs2-devel] [PATCH v8 04/17] iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable
  2021-10-19 13:41 [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (2 preceding siblings ...)
  2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 03/17] gup: Turn fault_in_pages_{readable, writeable} into fault_in_{readable, writeable} Andreas Gruenbacher
@ 2021-10-19 13:41 ` Andreas Gruenbacher
  2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 05/17] iov_iter: Introduce fault_in_iov_iter_writeable Andreas Gruenbacher
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 47+ messages in thread
From: Andreas Gruenbacher @ 2021-10-19 13:41 UTC (permalink / raw)
  To: Linus Torvalds, Catalin Marinas
  Cc: kvm-ppc, Paul Mackerras, cluster-devel, Jan Kara,
	Andreas Gruenbacher, linux-kernel, Christoph Hellwig,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

Turn iov_iter_fault_in_readable into a function that returns the number
of bytes not faulted in, similar to copy_to_user, instead of returning a
non-zero value when any of the requested pages couldn't be faulted in.
This supports the existing users that require all pages to be faulted in
as well as new users that are happy if any pages can be faulted in.

Rename iov_iter_fault_in_readable to fault_in_iov_iter_readable to make
sure this change doesn't silently break things.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/btrfs/file.c        |  2 +-
 fs/f2fs/file.c         |  2 +-
 fs/fuse/file.c         |  2 +-
 fs/iomap/buffered-io.c |  2 +-
 fs/ntfs/file.c         |  2 +-
 fs/ntfs3/file.c        |  2 +-
 include/linux/uio.h    |  2 +-
 lib/iov_iter.c         | 33 +++++++++++++++++++++------------
 mm/filemap.c           |  2 +-
 9 files changed, 29 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 7ff577005d0f..f37211d3bb69 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1710,7 +1710,7 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
 		 * Fault pages before locking them in prepare_pages
 		 * to avoid recursive lock
 		 */
-		if (unlikely(iov_iter_fault_in_readable(i, write_bytes))) {
+		if (unlikely(fault_in_iov_iter_readable(i, write_bytes))) {
 			ret = -EFAULT;
 			break;
 		}
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 9c8ef33bd8d3..eb971e1e7227 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -4276,7 +4276,7 @@ static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		size_t target_size = 0;
 		int err;
 
-		if (iov_iter_fault_in_readable(from, iov_iter_count(from)))
+		if (fault_in_iov_iter_readable(from, iov_iter_count(from)))
 			set_inode_flag(inode, FI_NO_PREALLOC);
 
 		if ((iocb->ki_flags & IOCB_NOWAIT)) {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 11404f8c21c7..4b6d8e13322d 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1164,7 +1164,7 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
 
  again:
 		err = -EFAULT;
-		if (iov_iter_fault_in_readable(ii, bytes))
+		if (fault_in_iov_iter_readable(ii, bytes))
 			break;
 
 		err = -ENOMEM;
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 9cc5798423d1..1753c26c8e76 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -750,7 +750,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
 		 */
-		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
+		if (unlikely(fault_in_iov_iter_readable(i, bytes))) {
 			status = -EFAULT;
 			break;
 		}
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index ab4f3362466d..a43adeacd930 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -1829,7 +1829,7 @@ static ssize_t ntfs_perform_write(struct file *file, struct iov_iter *i,
 		 * pages being swapped out between us bringing them into memory
 		 * and doing the actual copying.
 		 */
-		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
+		if (unlikely(fault_in_iov_iter_readable(i, bytes))) {
 			status = -EFAULT;
 			break;
 		}
diff --git a/fs/ntfs3/file.c b/fs/ntfs3/file.c
index 424450e77ad5..a52388387175 100644
--- a/fs/ntfs3/file.c
+++ b/fs/ntfs3/file.c
@@ -987,7 +987,7 @@ static ssize_t ntfs_compress_write(struct kiocb *iocb, struct iov_iter *from)
 		frame_vbo = pos & ~(frame_size - 1);
 		index = frame_vbo >> PAGE_SHIFT;
 
-		if (unlikely(iov_iter_fault_in_readable(from, bytes))) {
+		if (unlikely(fault_in_iov_iter_readable(from, bytes))) {
 			err = -EFAULT;
 			goto out;
 		}
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 207101a9c5c3..d18458af6681 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -133,7 +133,7 @@ size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
 				  size_t bytes, struct iov_iter *i);
 void iov_iter_advance(struct iov_iter *i, size_t bytes);
 void iov_iter_revert(struct iov_iter *i, size_t bytes);
-int iov_iter_fault_in_readable(const struct iov_iter *i, size_t bytes);
+size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t bytes);
 size_t iov_iter_single_seg_count(const struct iov_iter *i);
 size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index c88908f0f138..ce3d4f610626 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -430,33 +430,42 @@ static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t by
 }
 
 /*
+ * fault_in_iov_iter_readable - fault in iov iterator for reading
+ * @i: iterator
+ * @size: maximum length
+ *
  * Fault in one or more iovecs of the given iov_iter, to a maximum length of
- * bytes.  For each iovec, fault in each page that constitutes the iovec.
+ * @size.  For each iovec, fault in each page that constitutes the iovec.
+ *
+ * Returns the number of bytes not faulted in (like copy_to_user() and
+ * copy_from_user()).
  *
- * Return 0 on success, or non-zero if the memory could not be accessed (i.e.
- * because it is an invalid address).
+ * Always returns 0 for non-userspace iterators.
  */
-int iov_iter_fault_in_readable(const struct iov_iter *i, size_t bytes)
+size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t size)
 {
 	if (iter_is_iovec(i)) {
+		size_t count = min(size, iov_iter_count(i));
 		const struct iovec *p;
 		size_t skip;
 
-		if (bytes > i->count)
-			bytes = i->count;
-		for (p = i->iov, skip = i->iov_offset; bytes; p++, skip = 0) {
-			size_t len = min(bytes, p->iov_len - skip);
+		size -= count;
+		for (p = i->iov, skip = i->iov_offset; count; p++, skip = 0) {
+			size_t len = min(count, p->iov_len - skip);
+			size_t ret;
 
 			if (unlikely(!len))
 				continue;
-			if (fault_in_readable(p->iov_base + skip, len))
-				return -EFAULT;
-			bytes -= len;
+			ret = fault_in_readable(p->iov_base + skip, len);
+			count -= len - ret;
+			if (ret)
+				break;
 		}
+		return count + size;
 	}
 	return 0;
 }
-EXPORT_SYMBOL(iov_iter_fault_in_readable);
+EXPORT_SYMBOL(fault_in_iov_iter_readable);
 
 void iov_iter_init(struct iov_iter *i, unsigned int direction,
 			const struct iovec *iov, unsigned long nr_segs,
diff --git a/mm/filemap.c b/mm/filemap.c
index ff34f4087f87..4dd5edcd39fd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3757,7 +3757,7 @@ ssize_t generic_perform_write(struct file *file,
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
 		 */
-		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
+		if (unlikely(fault_in_iov_iter_readable(i, bytes))) {
 			status = -EFAULT;
 			break;
 		}
-- 
2.26.3


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Ocfs2-devel] [PATCH v8 05/17] iov_iter: Introduce fault_in_iov_iter_writeable
  2021-10-19 13:41 [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (3 preceding siblings ...)
  2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 04/17] iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable Andreas Gruenbacher
@ 2021-10-19 13:41 ` Andreas Gruenbacher
  2021-10-20 16:25   ` Catalin Marinas
  2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 06/17] gfs2: Add wrapper for iomap_file_buffered_write Andreas Gruenbacher
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 47+ messages in thread
From: Andreas Gruenbacher @ 2021-10-19 13:41 UTC (permalink / raw)
  To: Linus Torvalds, Catalin Marinas
  Cc: kvm-ppc, Paul Mackerras, cluster-devel, Jan Kara,
	Andreas Gruenbacher, linux-kernel, Christoph Hellwig,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

Introduce a new fault_in_iov_iter_writeable helper for safely faulting
in an iterator for writing.  Uses get_user_pages() to fault in the pages
without actually writing to them, which would be destructive.

We'll use fault_in_iov_iter_writeable in gfs2 once we've determined that
the iterator passed to .read_iter isn't in memory.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 include/linux/pagemap.h |  1 +
 include/linux/uio.h     |  1 +
 lib/iov_iter.c          | 39 +++++++++++++++++++++++++
 mm/gup.c                | 63 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 104 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 9fe94f7a4f7e..2f7dd14083d9 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -736,6 +736,7 @@ extern void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter);
  * Fault in userspace address range.
  */
 size_t fault_in_writeable(char __user *uaddr, size_t size);
+size_t fault_in_safe_writeable(const char __user *uaddr, size_t size);
 size_t fault_in_readable(const char __user *uaddr, size_t size);
 
 int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
diff --git a/include/linux/uio.h b/include/linux/uio.h
index d18458af6681..25d1c24fd829 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -134,6 +134,7 @@ size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
 void iov_iter_advance(struct iov_iter *i, size_t bytes);
 void iov_iter_revert(struct iov_iter *i, size_t bytes);
 size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t bytes);
+size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t bytes);
 size_t iov_iter_single_seg_count(const struct iov_iter *i);
 size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index ce3d4f610626..ac9a87e727a3 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -467,6 +467,45 @@ size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t size)
 }
 EXPORT_SYMBOL(fault_in_iov_iter_readable);
 
+/*
+ * fault_in_iov_iter_writeable - fault in iov iterator for writing
+ * @i: iterator
+ * @size: maximum length
+ *
+ * Faults in the iterator using get_user_pages(), i.e., without triggering
+ * hardware page faults.  This is primarily useful when we already know that
+ * some or all of the pages in @i aren't in memory.
+ *
+ * Returns the number of bytes not faulted in, like copy_to_user() and
+ * copy_from_user().
+ *
+ * Always returns 0 for non-user-space iterators.
+ */
+size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t size)
+{
+	if (iter_is_iovec(i)) {
+		size_t count = min(size, iov_iter_count(i));
+		const struct iovec *p;
+		size_t skip;
+
+		size -= count;
+		for (p = i->iov, skip = i->iov_offset; count; p++, skip = 0) {
+			size_t len = min(count, p->iov_len - skip);
+			size_t ret;
+
+			if (unlikely(!len))
+				continue;
+			ret = fault_in_safe_writeable(p->iov_base + skip, len);
+			count -= len - ret;
+			if (ret)
+				break;
+		}
+		return count + size;
+	}
+	return 0;
+}
+EXPORT_SYMBOL(fault_in_iov_iter_writeable);
+
 void iov_iter_init(struct iov_iter *i, unsigned int direction,
 			const struct iovec *iov, unsigned long nr_segs,
 			size_t count)
diff --git a/mm/gup.c b/mm/gup.c
index a7efb027d6cf..614b8536b3b6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1691,6 +1691,69 @@ size_t fault_in_writeable(char __user *uaddr, size_t size)
 }
 EXPORT_SYMBOL(fault_in_writeable);
 
+/*
+ * fault_in_safe_writeable - fault in an address range for writing
+ * @uaddr: start of address range
+ * @size: length of address range
+ *
+ * Faults in an address range using get_user_pages, i.e., without triggering
+ * hardware page faults.  This is primarily useful when we already know that
+ * some or all of the pages in the address range aren't in memory.
+ *
+ * Other than fault_in_writeable(), this function is non-destructive.
+ *
+ * Note that we don't pin or otherwise hold the pages referenced that we fault
+ * in.  There's no guarantee that they'll stay in memory for any duration of
+ * time.
+ *
+ * Returns the number of bytes not faulted in, like copy_to_user() and
+ * copy_from_user().
+ */
+size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
+{
+	unsigned long start = (unsigned long)uaddr;
+	unsigned long end, nstart, nend;
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma = NULL;
+	int locked = 0;
+
+	nstart = start & PAGE_MASK;
+	end = PAGE_ALIGN(start + size);
+	if (end < nstart)
+		end = 0;
+	for (; nstart != end; nstart = nend) {
+		unsigned long nr_pages;
+		long ret;
+
+		if (!locked) {
+			locked = 1;
+			mmap_read_lock(mm);
+			vma = find_vma(mm, nstart);
+		} else if (nstart >= vma->vm_end)
+			vma = vma->vm_next;
+		if (!vma || vma->vm_start >= end)
+			break;
+		nend = end ? min(end, vma->vm_end) : vma->vm_end;
+		if (vma->vm_flags & (VM_IO | VM_PFNMAP))
+			continue;
+		if (nstart < vma->vm_start)
+			nstart = vma->vm_start;
+		nr_pages = (nend - nstart) / PAGE_SIZE;
+		ret = __get_user_pages_locked(mm, nstart, nr_pages,
+					      NULL, NULL, &locked,
+					      FOLL_TOUCH | FOLL_WRITE);
+		if (ret <= 0)
+			break;
+		nend = nstart + ret * PAGE_SIZE;
+	}
+	if (locked)
+		mmap_read_unlock(mm);
+	if (nstart == end)
+		return 0;
+	return size - min_t(size_t, nstart - start, size);
+}
+EXPORT_SYMBOL(fault_in_safe_writeable);
+
 /**
  * fault_in_readable - fault in userspace address range for reading
  * @uaddr: start of user address range
-- 
2.26.3


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Ocfs2-devel] [PATCH v8 06/17] gfs2: Add wrapper for iomap_file_buffered_write
  2021-10-19 13:41 [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (4 preceding siblings ...)
  2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 05/17] iov_iter: Introduce fault_in_iov_iter_writeable Andreas Gruenbacher
@ 2021-10-19 13:41 ` Andreas Gruenbacher
  2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 07/17] gfs2: Clean up function may_grant Andreas Gruenbacher
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 47+ messages in thread
From: Andreas Gruenbacher @ 2021-10-19 13:41 UTC (permalink / raw)
  To: Linus Torvalds, Catalin Marinas
  Cc: kvm-ppc, Paul Mackerras, cluster-devel, Jan Kara,
	Andreas Gruenbacher, linux-kernel, Christoph Hellwig,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

Add a wrapper around iomap_file_buffered_write.  We'll add code for when
the operation needs to be retried here later.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/file.c | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index c559827cb6f9..da742b470f23 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -876,6 +876,20 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return written ? written : ret;
 }
 
+static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct file *file = iocb->ki_filp;
+	struct inode *inode = file_inode(file);
+	ssize_t ret;
+
+	current->backing_dev_info = inode_to_bdi(inode);
+	ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
+	current->backing_dev_info = NULL;
+	if (ret > 0)
+		iocb->ki_pos += ret;
+	return ret;
+}
+
 /**
  * gfs2_file_write_iter - Perform a write to a file
  * @iocb: The io context
@@ -927,9 +941,7 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			goto out_unlock;
 
 		iocb->ki_flags |= IOCB_DSYNC;
-		current->backing_dev_info = inode_to_bdi(inode);
-		buffered = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
-		current->backing_dev_info = NULL;
+		buffered = gfs2_file_buffered_write(iocb, from);
 		if (unlikely(buffered <= 0)) {
 			if (!ret)
 				ret = buffered;
@@ -943,7 +955,6 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		 * the direct I/O range as we don't know if the buffered pages
 		 * made it to disk.
 		 */
-		iocb->ki_pos += buffered;
 		ret2 = generic_write_sync(iocb, buffered);
 		invalidate_mapping_pages(mapping,
 				(iocb->ki_pos - buffered) >> PAGE_SHIFT,
@@ -951,13 +962,9 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		if (!ret || ret2 > 0)
 			ret += ret2;
 	} else {
-		current->backing_dev_info = inode_to_bdi(inode);
-		ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
-		current->backing_dev_info = NULL;
-		if (likely(ret > 0)) {
-			iocb->ki_pos += ret;
+		ret = gfs2_file_buffered_write(iocb, from);
+		if (likely(ret > 0))
 			ret = generic_write_sync(iocb, ret);
-		}
 	}
 
 out_unlock:
-- 
2.26.3


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Ocfs2-devel] [PATCH v8 07/17] gfs2: Clean up function may_grant
  2021-10-19 13:41 [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (5 preceding siblings ...)
  2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 06/17] gfs2: Add wrapper for iomap_file_buffered_write Andreas Gruenbacher
@ 2021-10-19 13:41 ` Andreas Gruenbacher
  2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 08/17] gfs2: Introduce flag for glock holder auto-demotion Andreas Gruenbacher
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 47+ messages in thread
From: Andreas Gruenbacher @ 2021-10-19 13:41 UTC (permalink / raw)
  To: Linus Torvalds, Catalin Marinas
  Cc: kvm-ppc, Paul Mackerras, cluster-devel, Jan Kara,
	Andreas Gruenbacher, linux-kernel, Christoph Hellwig,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

Pass the first current glock holder into function may_grant and
deobfuscate the logic there.

We're now using function find_first_holder in do_promote, so move the
function's definition above do_promote.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/glock.c | 120 ++++++++++++++++++++++++++++--------------------
 1 file changed, 70 insertions(+), 50 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index e0eaa9cf9fb6..f24db2ececfb 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -301,46 +301,59 @@ void gfs2_glock_put(struct gfs2_glock *gl)
 }
 
 /**
- * may_grant - check if its ok to grant a new lock
+ * may_grant - check if it's ok to grant a new lock
  * @gl: The glock
+ * @current_gh: One of the current holders of @gl
  * @gh: The lock request which we wish to grant
  *
- * Returns: true if its ok to grant the lock
+ * With our current compatibility rules, if a glock has one or more active
+ * holders (HIF_HOLDER flag set), any of those holders can be passed in as
+ * @current_gh; they are all the same as far as compatibility with the new @gh
+ * goes.
+ *
+ * Returns true if it's ok to grant the lock.
  */
 
-static inline int may_grant(const struct gfs2_glock *gl, const struct gfs2_holder *gh)
-{
-	const struct gfs2_holder *gh_head = list_first_entry(&gl->gl_holders, const struct gfs2_holder, gh_list);
+static inline bool may_grant(const struct gfs2_glock *gl,
+			     const struct gfs2_holder *current_gh,
+			     const struct gfs2_holder *gh)
+{
+	if (current_gh) {
+		BUG_ON(!test_bit(HIF_HOLDER, &current_gh->gh_iflags));
+
+		switch(current_gh->gh_state) {
+		case LM_ST_EXCLUSIVE:
+			/*
+			 * Here we make a special exception to grant holders
+			 * who agree to share the EX lock with other holders
+			 * who also have the bit set. If the original holder
+			 * has the LM_FLAG_NODE_SCOPE bit set, we grant more
+			 * holders with the bit set.
+			 */
+			return gh->gh_state == LM_ST_EXCLUSIVE &&
+			       (current_gh->gh_flags & LM_FLAG_NODE_SCOPE) &&
+			       (gh->gh_flags & LM_FLAG_NODE_SCOPE);
 
-	if (gh != gh_head) {
-		/**
-		 * Here we make a special exception to grant holders who agree
-		 * to share the EX lock with other holders who also have the
-		 * bit set. If the original holder has the LM_FLAG_NODE_SCOPE bit
-		 * is set, we grant more holders with the bit set.
-		 */
-		if (gh_head->gh_state == LM_ST_EXCLUSIVE &&
-		    (gh_head->gh_flags & LM_FLAG_NODE_SCOPE) &&
-		    gh->gh_state == LM_ST_EXCLUSIVE &&
-		    (gh->gh_flags & LM_FLAG_NODE_SCOPE))
-			return 1;
-		if ((gh->gh_state == LM_ST_EXCLUSIVE ||
-		     gh_head->gh_state == LM_ST_EXCLUSIVE))
-			return 0;
+		case LM_ST_SHARED:
+		case LM_ST_DEFERRED:
+			return gh->gh_state == current_gh->gh_state;
+
+		default:
+			return false;
+		}
 	}
+
 	if (gl->gl_state == gh->gh_state)
-		return 1;
+		return true;
 	if (gh->gh_flags & GL_EXACT)
-		return 0;
+		return false;
 	if (gl->gl_state == LM_ST_EXCLUSIVE) {
-		if (gh->gh_state == LM_ST_SHARED && gh_head->gh_state == LM_ST_SHARED)
-			return 1;
-		if (gh->gh_state == LM_ST_DEFERRED && gh_head->gh_state == LM_ST_DEFERRED)
-			return 1;
+		return gh->gh_state == LM_ST_SHARED ||
+		       gh->gh_state == LM_ST_DEFERRED;
 	}
-	if (gl->gl_state != LM_ST_UNLOCKED && (gh->gh_flags & LM_FLAG_ANY))
-		return 1;
-	return 0;
+	if (gh->gh_flags & LM_FLAG_ANY)
+		return gl->gl_state != LM_ST_UNLOCKED;
+	return false;
 }
 
 static void gfs2_holder_wake(struct gfs2_holder *gh)
@@ -380,6 +393,24 @@ static void do_error(struct gfs2_glock *gl, const int ret)
 	}
 }
 
+/**
+ * find_first_holder - find the first "holder" gh
+ * @gl: the glock
+ */
+
+static inline struct gfs2_holder *find_first_holder(const struct gfs2_glock *gl)
+{
+	struct gfs2_holder *gh;
+
+	if (!list_empty(&gl->gl_holders)) {
+		gh = list_first_entry(&gl->gl_holders, struct gfs2_holder,
+				      gh_list);
+		if (test_bit(HIF_HOLDER, &gh->gh_iflags))
+			return gh;
+	}
+	return NULL;
+}
+
 /**
  * do_promote - promote as many requests as possible on the current queue
  * @gl: The glock
@@ -393,14 +424,16 @@ __releases(&gl->gl_lockref.lock)
 __acquires(&gl->gl_lockref.lock)
 {
 	const struct gfs2_glock_operations *glops = gl->gl_ops;
-	struct gfs2_holder *gh, *tmp;
+	struct gfs2_holder *gh, *tmp, *first_gh;
 	int ret;
 
+	first_gh = find_first_holder(gl);
+
 restart:
 	list_for_each_entry_safe(gh, tmp, &gl->gl_holders, gh_list) {
 		if (test_bit(HIF_HOLDER, &gh->gh_iflags))
 			continue;
-		if (may_grant(gl, gh)) {
+		if (may_grant(gl, first_gh, gh)) {
 			if (gh->gh_list.prev == &gl->gl_holders &&
 			    glops->go_lock) {
 				spin_unlock(&gl->gl_lockref.lock);
@@ -722,23 +755,6 @@ __acquires(&gl->gl_lockref.lock)
 	spin_lock(&gl->gl_lockref.lock);
 }
 
-/**
- * find_first_holder - find the first "holder" gh
- * @gl: the glock
- */
-
-static inline struct gfs2_holder *find_first_holder(const struct gfs2_glock *gl)
-{
-	struct gfs2_holder *gh;
-
-	if (!list_empty(&gl->gl_holders)) {
-		gh = list_first_entry(&gl->gl_holders, struct gfs2_holder, gh_list);
-		if (test_bit(HIF_HOLDER, &gh->gh_iflags))
-			return gh;
-	}
-	return NULL;
-}
-
 /**
  * run_queue - do all outstanding tasks related to a glock
  * @gl: The glock in question
@@ -1354,8 +1370,12 @@ __acquires(&gl->gl_lockref.lock)
 		GLOCK_BUG_ON(gl, true);
 
 	if (gh->gh_flags & (LM_FLAG_TRY | LM_FLAG_TRY_1CB)) {
-		if (test_bit(GLF_LOCK, &gl->gl_flags))
-			try_futile = !may_grant(gl, gh);
+		if (test_bit(GLF_LOCK, &gl->gl_flags)) {
+			struct gfs2_holder *first_gh;
+
+			first_gh = find_first_holder(gl);
+			try_futile = !may_grant(gl, first_gh, gh);
+		}
 		if (test_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags))
 			goto fail;
 	}
-- 
2.26.3


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Ocfs2-devel] [PATCH v8 08/17] gfs2: Introduce flag for glock holder auto-demotion
  2021-10-19 13:41 [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (6 preceding siblings ...)
  2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 07/17] gfs2: Clean up function may_grant Andreas Gruenbacher
@ 2021-10-19 13:41 ` Andreas Gruenbacher
  2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 09/17] gfs2: Move the inode glock locking to gfs2_file_buffered_write Andreas Gruenbacher
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 47+ messages in thread
From: Andreas Gruenbacher @ 2021-10-19 13:41 UTC (permalink / raw)
  To: Linus Torvalds, Catalin Marinas
  Cc: kvm-ppc, Paul Mackerras, cluster-devel, Jan Kara,
	Andreas Gruenbacher, linux-kernel, Christoph Hellwig,
	Bob Peterson, Alexander Viro, linux-fsdevel, linux-btrfs,
	ocfs2-devel

From: Bob Peterson <rpeterso@redhat.com>

This patch introduces a new HIF_MAY_DEMOTE flag and infrastructure that
will allow glocks to be demoted automatically on locking conflicts.
When a locking request comes in that isn't compatible with the locking
state of an active holder and that holder has the HIF_MAY_DEMOTE flag
set, the holder will be demoted before the incoming locking request is
granted.

Note that this mechanism demotes active holders (with the HIF_HOLDER
flag set), while before we were only demoting glocks without any active
holders.  This allows processes to keep hold of locks that may form a
cyclic locking dependency; the core glock logic will then break those
dependencies in case a conflicting locking request occurs.  We'll use
this to avoid giving up the inode glock proactively before faulting in
pages.

Processes that allow a glock holder to be taken away indicate this by
calling gfs2_holder_allow_demote(), which sets the HIF_MAY_DEMOTE flag.
Later, they call gfs2_holder_disallow_demote() to clear the flag again,
and then they check if their holder is still queued: if it is, they are
still holding the glock; if it isn't, they can re-acquire the glock (or
abort).

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/glock.c  | 215 +++++++++++++++++++++++++++++++++++++++--------
 fs/gfs2/glock.h  |  20 +++++
 fs/gfs2/incore.h |   1 +
 3 files changed, 200 insertions(+), 36 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index f24db2ececfb..6a595ba5a979 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -58,6 +58,7 @@ struct gfs2_glock_iter {
 typedef void (*glock_examiner) (struct gfs2_glock * gl);
 
 static void do_xmote(struct gfs2_glock *gl, struct gfs2_holder *gh, unsigned int target);
+static void __gfs2_glock_dq(struct gfs2_holder *gh);
 
 static struct dentry *gfs2_root;
 static struct workqueue_struct *glock_workqueue;
@@ -197,6 +198,12 @@ static int demote_ok(const struct gfs2_glock *gl)
 
 	if (gl->gl_state == LM_ST_UNLOCKED)
 		return 0;
+	/*
+	 * Note that demote_ok is used for the lru process of disposing of
+	 * glocks. For this purpose, we don't care if the glock's holders
+	 * have the HIF_MAY_DEMOTE flag set or not. If someone is using
+	 * them, don't demote.
+	 */
 	if (!list_empty(&gl->gl_holders))
 		return 0;
 	if (glops->go_demote_ok)
@@ -379,7 +386,7 @@ static void do_error(struct gfs2_glock *gl, const int ret)
 	struct gfs2_holder *gh, *tmp;
 
 	list_for_each_entry_safe(gh, tmp, &gl->gl_holders, gh_list) {
-		if (test_bit(HIF_HOLDER, &gh->gh_iflags))
+		if (!test_bit(HIF_WAIT, &gh->gh_iflags))
 			continue;
 		if (ret & LM_OUT_ERROR)
 			gh->gh_error = -EIO;
@@ -393,6 +400,40 @@ static void do_error(struct gfs2_glock *gl, const int ret)
 	}
 }
 
+/**
+ * demote_incompat_holders - demote incompatible demoteable holders
+ * @gl: the glock we want to promote
+ * @new_gh: the new holder to be promoted
+ */
+static void demote_incompat_holders(struct gfs2_glock *gl,
+				    struct gfs2_holder *new_gh)
+{
+	struct gfs2_holder *gh;
+
+	/*
+	 * Demote incompatible holders before we make ourselves eligible.
+	 * (This holder may or may not allow auto-demoting, but we don't want
+	 * to demote the new holder before it's even granted.)
+	 */
+	list_for_each_entry(gh, &gl->gl_holders, gh_list) {
+		/*
+		 * Since holders are at the front of the list, we stop when we
+		 * find the first non-holder.
+		 */
+		if (!test_bit(HIF_HOLDER, &gh->gh_iflags))
+			return;
+		if (test_bit(HIF_MAY_DEMOTE, &gh->gh_iflags) &&
+		    !may_grant(gl, new_gh, gh)) {
+			/*
+			 * We should not recurse into do_promote because
+			 * __gfs2_glock_dq only calls handle_callback,
+			 * gfs2_glock_add_to_lru and __gfs2_glock_queue_work.
+			 */
+			__gfs2_glock_dq(gh);
+		}
+	}
+}
+
 /**
  * find_first_holder - find the first "holder" gh
  * @gl: the glock
@@ -411,6 +452,26 @@ static inline struct gfs2_holder *find_first_holder(const struct gfs2_glock *gl)
 	return NULL;
 }
 
+/**
+ * find_first_strong_holder - find the first non-demoteable holder
+ * @gl: the glock
+ *
+ * Find the first holder that doesn't have the HIF_MAY_DEMOTE flag set.
+ */
+static inline struct gfs2_holder
+*find_first_strong_holder(struct gfs2_glock *gl)
+{
+	struct gfs2_holder *gh;
+
+	list_for_each_entry(gh, &gl->gl_holders, gh_list) {
+		if (!test_bit(HIF_HOLDER, &gh->gh_iflags))
+			return NULL;
+		if (!test_bit(HIF_MAY_DEMOTE, &gh->gh_iflags))
+			return gh;
+	}
+	return NULL;
+}
+
 /**
  * do_promote - promote as many requests as possible on the current queue
  * @gl: The glock
@@ -425,15 +486,21 @@ __acquires(&gl->gl_lockref.lock)
 {
 	const struct gfs2_glock_operations *glops = gl->gl_ops;
 	struct gfs2_holder *gh, *tmp, *first_gh;
+	bool incompat_holders_demoted = false;
 	int ret;
 
-	first_gh = find_first_holder(gl);
+	first_gh = find_first_strong_holder(gl);
 
 restart:
 	list_for_each_entry_safe(gh, tmp, &gl->gl_holders, gh_list) {
-		if (test_bit(HIF_HOLDER, &gh->gh_iflags))
+		if (!test_bit(HIF_WAIT, &gh->gh_iflags))
 			continue;
 		if (may_grant(gl, first_gh, gh)) {
+			if (!incompat_holders_demoted) {
+				demote_incompat_holders(gl, first_gh);
+				incompat_holders_demoted = true;
+				first_gh = gh;
+			}
 			if (gh->gh_list.prev == &gl->gl_holders &&
 			    glops->go_lock) {
 				spin_unlock(&gl->gl_lockref.lock);
@@ -459,6 +526,11 @@ __acquires(&gl->gl_lockref.lock)
 			gfs2_holder_wake(gh);
 			continue;
 		}
+		/*
+		 * If we get here, it means we may not grant this holder for
+		 * some reason. If this holder is the head of the list, it
+		 * means we have a blocked holder at the head, so return 1.
+		 */
 		if (gh->gh_list.prev == &gl->gl_holders)
 			return 1;
 		do_error(gl, 0);
@@ -1373,7 +1445,7 @@ __acquires(&gl->gl_lockref.lock)
 		if (test_bit(GLF_LOCK, &gl->gl_flags)) {
 			struct gfs2_holder *first_gh;
 
-			first_gh = find_first_holder(gl);
+			first_gh = find_first_strong_holder(gl);
 			try_futile = !may_grant(gl, first_gh, gh);
 		}
 		if (test_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags))
@@ -1382,7 +1454,8 @@ __acquires(&gl->gl_lockref.lock)
 
 	list_for_each_entry(gh2, &gl->gl_holders, gh_list) {
 		if (unlikely(gh2->gh_owner_pid == gh->gh_owner_pid &&
-		    (gh->gh_gl->gl_ops->go_type != LM_TYPE_FLOCK)))
+		    (gh->gh_gl->gl_ops->go_type != LM_TYPE_FLOCK) &&
+		    !test_bit(HIF_MAY_DEMOTE, &gh2->gh_iflags)))
 			goto trap_recursive;
 		if (try_futile &&
 		    !(gh2->gh_flags & (LM_FLAG_TRY | LM_FLAG_TRY_1CB))) {
@@ -1478,51 +1551,83 @@ int gfs2_glock_poll(struct gfs2_holder *gh)
 	return test_bit(HIF_WAIT, &gh->gh_iflags) ? 0 : 1;
 }
 
-/**
- * gfs2_glock_dq - dequeue a struct gfs2_holder from a glock (release a glock)
- * @gh: the glock holder
- *
- */
+static inline bool needs_demote(struct gfs2_glock *gl)
+{
+	return (test_bit(GLF_DEMOTE, &gl->gl_flags) ||
+		test_bit(GLF_PENDING_DEMOTE, &gl->gl_flags));
+}
 
-void gfs2_glock_dq(struct gfs2_holder *gh)
+static void __gfs2_glock_dq(struct gfs2_holder *gh)
 {
 	struct gfs2_glock *gl = gh->gh_gl;
 	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
 	unsigned delay = 0;
 	int fast_path = 0;
 
-	spin_lock(&gl->gl_lockref.lock);
 	/*
-	 * If we're in the process of file system withdraw, we cannot just
-	 * dequeue any glocks until our journal is recovered, lest we
-	 * introduce file system corruption. We need two exceptions to this
-	 * rule: We need to allow unlocking of nondisk glocks and the glock
-	 * for our own journal that needs recovery.
+	 * This while loop is similar to function demote_incompat_holders:
+	 * If the glock is due to be demoted (which may be from another node
+	 * or even if this holder is GL_NOCACHE), the weak holders are
+	 * demoted as well, allowing the glock to be demoted.
 	 */
-	if (test_bit(SDF_WITHDRAW_RECOVERY, &sdp->sd_flags) &&
-	    glock_blocked_by_withdraw(gl) &&
-	    gh->gh_gl != sdp->sd_jinode_gl) {
-		sdp->sd_glock_dqs_held++;
-		spin_unlock(&gl->gl_lockref.lock);
-		might_sleep();
-		wait_on_bit(&sdp->sd_flags, SDF_WITHDRAW_RECOVERY,
-			    TASK_UNINTERRUPTIBLE);
-		spin_lock(&gl->gl_lockref.lock);
-	}
-	if (gh->gh_flags & GL_NOCACHE)
-		handle_callback(gl, LM_ST_UNLOCKED, 0, false);
+	while (gh) {
+		/*
+		 * If we're in the process of file system withdraw, we cannot
+		 * just dequeue any glocks until our journal is recovered, lest
+		 * we introduce file system corruption. We need two exceptions
+		 * to this rule: We need to allow unlocking of nondisk glocks
+		 * and the glock for our own journal that needs recovery.
+		 */
+		if (test_bit(SDF_WITHDRAW_RECOVERY, &sdp->sd_flags) &&
+		    glock_blocked_by_withdraw(gl) &&
+		    gh->gh_gl != sdp->sd_jinode_gl) {
+			sdp->sd_glock_dqs_held++;
+			spin_unlock(&gl->gl_lockref.lock);
+			might_sleep();
+			wait_on_bit(&sdp->sd_flags, SDF_WITHDRAW_RECOVERY,
+				    TASK_UNINTERRUPTIBLE);
+			spin_lock(&gl->gl_lockref.lock);
+		}
+
+		/*
+		 * This holder should not be cached, so mark it for demote.
+		 * Note: this should be done before the check for needs_demote
+		 * below.
+		 */
+		if (gh->gh_flags & GL_NOCACHE)
+			handle_callback(gl, LM_ST_UNLOCKED, 0, false);
+
+		list_del_init(&gh->gh_list);
+		clear_bit(HIF_HOLDER, &gh->gh_iflags);
+		trace_gfs2_glock_queue(gh, 0);
+
+		/*
+		 * If there hasn't been a demote request we are done.
+		 * (Let the remaining holders, if any, keep holding it.)
+		 */
+		if (!needs_demote(gl)) {
+			if (list_empty(&gl->gl_holders))
+				fast_path = 1;
+			break;
+		}
+		/*
+		 * If we have another strong holder (we cannot auto-demote)
+		 * we are done. It keeps holding it until it is done.
+		 */
+		if (find_first_strong_holder(gl))
+			break;
 
-	list_del_init(&gh->gh_list);
-	clear_bit(HIF_HOLDER, &gh->gh_iflags);
-	if (list_empty(&gl->gl_holders) &&
-	    !test_bit(GLF_PENDING_DEMOTE, &gl->gl_flags) &&
-	    !test_bit(GLF_DEMOTE, &gl->gl_flags))
-		fast_path = 1;
+		/*
+		 * If we have a weak holder at the head of the list, it
+		 * (and all others like it) must be auto-demoted. If there
+		 * are no more weak holders, we exit the while loop.
+		 */
+		gh = find_first_holder(gl);
+	}
 
 	if (!test_bit(GLF_LFLUSH, &gl->gl_flags) && demote_ok(gl))
 		gfs2_glock_add_to_lru(gl);
 
-	trace_gfs2_glock_queue(gh, 0);
 	if (unlikely(!fast_path)) {
 		gl->gl_lockref.count++;
 		if (test_bit(GLF_PENDING_DEMOTE, &gl->gl_flags) &&
@@ -1531,6 +1636,19 @@ void gfs2_glock_dq(struct gfs2_holder *gh)
 			delay = gl->gl_hold_time;
 		__gfs2_glock_queue_work(gl, delay);
 	}
+}
+
+/**
+ * gfs2_glock_dq - dequeue a struct gfs2_holder from a glock (release a glock)
+ * @gh: the glock holder
+ *
+ */
+void gfs2_glock_dq(struct gfs2_holder *gh)
+{
+	struct gfs2_glock *gl = gh->gh_gl;
+
+	spin_lock(&gl->gl_lockref.lock);
+	__gfs2_glock_dq(gh);
 	spin_unlock(&gl->gl_lockref.lock);
 }
 
@@ -1693,6 +1811,7 @@ void gfs2_glock_dq_m(unsigned int num_gh, struct gfs2_holder *ghs)
 
 void gfs2_glock_cb(struct gfs2_glock *gl, unsigned int state)
 {
+	struct gfs2_holder mock_gh = { .gh_gl = gl, .gh_state = state, };
 	unsigned long delay = 0;
 	unsigned long holdtime;
 	unsigned long now = jiffies;
@@ -1707,6 +1826,28 @@ void gfs2_glock_cb(struct gfs2_glock *gl, unsigned int state)
 		if (test_bit(GLF_REPLY_PENDING, &gl->gl_flags))
 			delay = gl->gl_hold_time;
 	}
+	/*
+	 * Note 1: We cannot call demote_incompat_holders from handle_callback
+	 * or gfs2_set_demote due to recursion problems like: gfs2_glock_dq ->
+	 * handle_callback -> demote_incompat_holders -> gfs2_glock_dq
+	 * Plus, we only want to demote the holders if the request comes from
+	 * a remote cluster node because local holder conflicts are resolved
+	 * elsewhere.
+	 *
+	 * Note 2: if a remote node wants this glock in EX mode, lock_dlm will
+	 * request that we set our state to UNLOCKED. Here we mock up a holder
+	 * to make it look like someone wants the lock EX locally. Any SH
+	 * and DF requests should be able to share the lock without demoting.
+	 *
+	 * Note 3: We only want to demote the demoteable holders when there
+	 * are no more strong holders. The demoteable holders might as well
+	 * keep the glock until the last strong holder is done with it.
+	 */
+	if (!find_first_strong_holder(gl)) {
+		if (state == LM_ST_UNLOCKED)
+			mock_gh.gh_state = LM_ST_EXCLUSIVE;
+		demote_incompat_holders(gl, &mock_gh);
+	}
 	handle_callback(gl, state, delay, true);
 	__gfs2_glock_queue_work(gl, delay);
 	spin_unlock(&gl->gl_lockref.lock);
@@ -2096,6 +2237,8 @@ static const char *hflags2str(char *buf, u16 flags, unsigned long iflags)
 		*p++ = 'H';
 	if (test_bit(HIF_WAIT, &iflags))
 		*p++ = 'W';
+	if (test_bit(HIF_MAY_DEMOTE, &iflags))
+		*p++ = 'D';
 	*p = 0;
 	return buf;
 }
diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
index 31a8f2f649b5..9012487da4c6 100644
--- a/fs/gfs2/glock.h
+++ b/fs/gfs2/glock.h
@@ -150,6 +150,8 @@ static inline struct gfs2_holder *gfs2_glock_is_locked_by_me(struct gfs2_glock *
 	list_for_each_entry(gh, &gl->gl_holders, gh_list) {
 		if (!test_bit(HIF_HOLDER, &gh->gh_iflags))
 			break;
+		if (test_bit(HIF_MAY_DEMOTE, &gh->gh_iflags))
+			continue;
 		if (gh->gh_owner_pid == pid)
 			goto out;
 	}
@@ -325,6 +327,24 @@ static inline void glock_clear_object(struct gfs2_glock *gl, void *object)
 	spin_unlock(&gl->gl_lockref.lock);
 }
 
+static inline void gfs2_holder_allow_demote(struct gfs2_holder *gh)
+{
+	struct gfs2_glock *gl = gh->gh_gl;
+
+	spin_lock(&gl->gl_lockref.lock);
+	set_bit(HIF_MAY_DEMOTE, &gh->gh_iflags);
+	spin_unlock(&gl->gl_lockref.lock);
+}
+
+static inline void gfs2_holder_disallow_demote(struct gfs2_holder *gh)
+{
+	struct gfs2_glock *gl = gh->gh_gl;
+
+	spin_lock(&gl->gl_lockref.lock);
+	clear_bit(HIF_MAY_DEMOTE, &gh->gh_iflags);
+	spin_unlock(&gl->gl_lockref.lock);
+}
+
 extern void gfs2_inode_remember_delete(struct gfs2_glock *gl, u64 generation);
 extern bool gfs2_inode_already_deleted(struct gfs2_glock *gl, u64 generation);
 
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 0fe49770166e..58b7bac501e4 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -252,6 +252,7 @@ struct gfs2_lkstats {
 
 enum {
 	/* States */
+	HIF_MAY_DEMOTE		= 1,
 	HIF_HOLDER		= 6,  /* Set for gh that "holds" the glock */
 	HIF_WAIT		= 10,
 };
-- 
2.26.3


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Ocfs2-devel] [PATCH v8 09/17] gfs2: Move the inode glock locking to gfs2_file_buffered_write
  2021-10-19 13:41 [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (7 preceding siblings ...)
  2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 08/17] gfs2: Introduce flag for glock holder auto-demotion Andreas Gruenbacher
@ 2021-10-19 13:41 ` Andreas Gruenbacher
  2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 10/17] gfs2: Eliminate ip->i_gh Andreas Gruenbacher
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 47+ messages in thread
From: Andreas Gruenbacher @ 2021-10-19 13:41 UTC (permalink / raw)
  To: Linus Torvalds, Catalin Marinas
  Cc: kvm-ppc, Paul Mackerras, cluster-devel, Jan Kara,
	Andreas Gruenbacher, linux-kernel, Christoph Hellwig,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

So far, for buffered writes, we were taking the inode glock in
gfs2_iomap_begin and dropping it in gfs2_iomap_end with the intention of
not holding the inode glock while iomap_write_actor faults in user
pages.  It turns out that iomap_write_actor is called inside iomap_begin
... iomap_end, so the user pages were still faulted in while holding the
inode glock and the locking code in iomap_begin / iomap_end was
completely pointless.

Move the locking into gfs2_file_buffered_write instead.  We'll take care
of the potential deadlocks due to faulting in user pages while holding a
glock in a subsequent patch.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/bmap.c | 60 +-------------------------------------------------
 fs/gfs2/file.c | 27 +++++++++++++++++++++++
 2 files changed, 28 insertions(+), 59 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 5414c2c33580..7235d539e969 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -961,46 +961,6 @@ static int __gfs2_iomap_get(struct inode *inode, loff_t pos, loff_t length,
 	goto out;
 }
 
-static int gfs2_write_lock(struct inode *inode)
-{
-	struct gfs2_inode *ip = GFS2_I(inode);
-	struct gfs2_sbd *sdp = GFS2_SB(inode);
-	int error;
-
-	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &ip->i_gh);
-	error = gfs2_glock_nq(&ip->i_gh);
-	if (error)
-		goto out_uninit;
-	if (&ip->i_inode == sdp->sd_rindex) {
-		struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
-
-		error = gfs2_glock_nq_init(m_ip->i_gl, LM_ST_EXCLUSIVE,
-					   GL_NOCACHE, &m_ip->i_gh);
-		if (error)
-			goto out_unlock;
-	}
-	return 0;
-
-out_unlock:
-	gfs2_glock_dq(&ip->i_gh);
-out_uninit:
-	gfs2_holder_uninit(&ip->i_gh);
-	return error;
-}
-
-static void gfs2_write_unlock(struct inode *inode)
-{
-	struct gfs2_inode *ip = GFS2_I(inode);
-	struct gfs2_sbd *sdp = GFS2_SB(inode);
-
-	if (&ip->i_inode == sdp->sd_rindex) {
-		struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
-
-		gfs2_glock_dq_uninit(&m_ip->i_gh);
-	}
-	gfs2_glock_dq_uninit(&ip->i_gh);
-}
-
 static int gfs2_iomap_page_prepare(struct inode *inode, loff_t pos,
 				   unsigned len)
 {
@@ -1118,11 +1078,6 @@ static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
 	return ret;
 }
 
-static inline bool gfs2_iomap_need_write_lock(unsigned flags)
-{
-	return (flags & IOMAP_WRITE) && !(flags & IOMAP_DIRECT);
-}
-
 static int gfs2_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
 			    unsigned flags, struct iomap *iomap,
 			    struct iomap *srcmap)
@@ -1135,12 +1090,6 @@ static int gfs2_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
 		iomap->flags |= IOMAP_F_BUFFER_HEAD;
 
 	trace_gfs2_iomap_start(ip, pos, length, flags);
-	if (gfs2_iomap_need_write_lock(flags)) {
-		ret = gfs2_write_lock(inode);
-		if (ret)
-			goto out;
-	}
-
 	ret = __gfs2_iomap_get(inode, pos, length, flags, iomap, &mp);
 	if (ret)
 		goto out_unlock;
@@ -1168,10 +1117,7 @@ static int gfs2_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
 	ret = gfs2_iomap_begin_write(inode, pos, length, flags, iomap, &mp);
 
 out_unlock:
-	if (ret && gfs2_iomap_need_write_lock(flags))
-		gfs2_write_unlock(inode);
 	release_metapath(&mp);
-out:
 	trace_gfs2_iomap_end(ip, iomap, ret);
 	return ret;
 }
@@ -1219,15 +1165,11 @@ static int gfs2_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 	}
 
 	if (unlikely(!written))
-		goto out_unlock;
+		return 0;
 
 	if (iomap->flags & IOMAP_F_SIZE_CHANGED)
 		mark_inode_dirty(inode);
 	set_bit(GLF_DIRTY, &ip->i_gl->gl_flags);
-
-out_unlock:
-	if (gfs2_iomap_need_write_lock(flags))
-		gfs2_write_unlock(inode);
 	return 0;
 }
 
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index da742b470f23..13282f57da37 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -880,13 +880,40 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *fro
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
+	struct gfs2_inode *ip = GFS2_I(inode);
+	struct gfs2_sbd *sdp = GFS2_SB(inode);
 	ssize_t ret;
 
+	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &ip->i_gh);
+	ret = gfs2_glock_nq(&ip->i_gh);
+	if (ret)
+		goto out_uninit;
+
+	if (inode == sdp->sd_rindex) {
+		struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
+
+		ret = gfs2_glock_nq_init(m_ip->i_gl, LM_ST_EXCLUSIVE,
+					 GL_NOCACHE, &m_ip->i_gh);
+		if (ret)
+			goto out_unlock;
+	}
+
 	current->backing_dev_info = inode_to_bdi(inode);
 	ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
 	current->backing_dev_info = NULL;
 	if (ret > 0)
 		iocb->ki_pos += ret;
+
+	if (inode == sdp->sd_rindex) {
+		struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
+
+		gfs2_glock_dq_uninit(&m_ip->i_gh);
+	}
+
+out_unlock:
+	gfs2_glock_dq(&ip->i_gh);
+out_uninit:
+	gfs2_holder_uninit(&ip->i_gh);
 	return ret;
 }
 
-- 
2.26.3


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Ocfs2-devel] [PATCH v8 10/17] gfs2: Eliminate ip->i_gh
  2021-10-19 13:41 [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (8 preceding siblings ...)
  2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 09/17] gfs2: Move the inode glock locking to gfs2_file_buffered_write Andreas Gruenbacher
@ 2021-10-19 13:41 ` Andreas Gruenbacher
  2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 11/17] gfs2: Fix mmap + page fault deadlocks for buffered I/O Andreas Gruenbacher
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 47+ messages in thread
From: Andreas Gruenbacher @ 2021-10-19 13:41 UTC (permalink / raw)
  To: Linus Torvalds, Catalin Marinas
  Cc: kvm-ppc, Paul Mackerras, cluster-devel, Jan Kara,
	Andreas Gruenbacher, linux-kernel, Christoph Hellwig,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

Now that gfs2_file_buffered_write is the only remaining user of
ip->i_gh, we can move the glock holder to the stack (or rather, use the
one we already have on the stack); there is no need for keeping the
holder in the inode anymore.

This is slightly complicated by the fact that we're using ip->i_gh for
the statfs inode in gfs2_file_buffered_write as well.  Writing to the
statfs inode isn't very common, so allocate the statfs holder
dynamically when needed.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/file.c   | 34 +++++++++++++++++++++-------------
 fs/gfs2/incore.h |  3 +--
 2 files changed, 22 insertions(+), 15 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 13282f57da37..8f37e4bab995 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -876,16 +876,25 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return written ? written : ret;
 }
 
-static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *from)
+static ssize_t gfs2_file_buffered_write(struct kiocb *iocb,
+					struct iov_iter *from,
+					struct gfs2_holder *gh)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
 	struct gfs2_inode *ip = GFS2_I(inode);
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
+	struct gfs2_holder *statfs_gh = NULL;
 	ssize_t ret;
 
-	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &ip->i_gh);
-	ret = gfs2_glock_nq(&ip->i_gh);
+	if (inode == sdp->sd_rindex) {
+		statfs_gh = kmalloc(sizeof(*statfs_gh), GFP_NOFS);
+		if (!statfs_gh)
+			return -ENOMEM;
+	}
+
+	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, gh);
+	ret = gfs2_glock_nq(gh);
 	if (ret)
 		goto out_uninit;
 
@@ -893,7 +902,7 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *fro
 		struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
 
 		ret = gfs2_glock_nq_init(m_ip->i_gl, LM_ST_EXCLUSIVE,
-					 GL_NOCACHE, &m_ip->i_gh);
+					 GL_NOCACHE, statfs_gh);
 		if (ret)
 			goto out_unlock;
 	}
@@ -904,16 +913,15 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *fro
 	if (ret > 0)
 		iocb->ki_pos += ret;
 
-	if (inode == sdp->sd_rindex) {
-		struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
-
-		gfs2_glock_dq_uninit(&m_ip->i_gh);
-	}
+	if (inode == sdp->sd_rindex)
+		gfs2_glock_dq_uninit(statfs_gh);
 
 out_unlock:
-	gfs2_glock_dq(&ip->i_gh);
+	gfs2_glock_dq(gh);
 out_uninit:
-	gfs2_holder_uninit(&ip->i_gh);
+	gfs2_holder_uninit(gh);
+	if (statfs_gh)
+		kfree(statfs_gh);
 	return ret;
 }
 
@@ -968,7 +976,7 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			goto out_unlock;
 
 		iocb->ki_flags |= IOCB_DSYNC;
-		buffered = gfs2_file_buffered_write(iocb, from);
+		buffered = gfs2_file_buffered_write(iocb, from, &gh);
 		if (unlikely(buffered <= 0)) {
 			if (!ret)
 				ret = buffered;
@@ -989,7 +997,7 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		if (!ret || ret2 > 0)
 			ret += ret2;
 	} else {
-		ret = gfs2_file_buffered_write(iocb, from);
+		ret = gfs2_file_buffered_write(iocb, from, &gh);
 		if (likely(ret > 0))
 			ret = generic_write_sync(iocb, ret);
 	}
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 58b7bac501e4..ca42d310fd4d 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -387,9 +387,8 @@ struct gfs2_inode {
 	u64 i_generation;
 	u64 i_eattr;
 	unsigned long i_flags;		/* GIF_... */
-	struct gfs2_glock *i_gl; /* Move into i_gh? */
+	struct gfs2_glock *i_gl;
 	struct gfs2_holder i_iopen_gh;
-	struct gfs2_holder i_gh; /* for prepare/commit_write only */
 	struct gfs2_qadata *i_qadata; /* quota allocation data */
 	struct gfs2_holder i_rgd_gh;
 	struct gfs2_blkreserv i_res; /* rgrp multi-block reservation */
-- 
2.26.3


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Ocfs2-devel] [PATCH v8 11/17] gfs2: Fix mmap + page fault deadlocks for buffered I/O
  2021-10-19 13:41 [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (9 preceding siblings ...)
  2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 10/17] gfs2: Eliminate ip->i_gh Andreas Gruenbacher
@ 2021-10-19 13:41 ` Andreas Gruenbacher
  2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 12/17] iomap: Fix iomap_dio_rw return value for user copies Andreas Gruenbacher
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 47+ messages in thread
From: Andreas Gruenbacher @ 2021-10-19 13:41 UTC (permalink / raw)
  To: Linus Torvalds, Catalin Marinas
  Cc: kvm-ppc, Paul Mackerras, cluster-devel, Jan Kara,
	Andreas Gruenbacher, linux-kernel, Christoph Hellwig,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

In the .read_iter and .write_iter file operations, we're accessing
user-space memory while holding the inode glock.  There is a possibility
that the memory is mapped to the same file, in which case we'd recurse
on the same glock.

We could detect and work around this simple case of recursive locking,
but more complex scenarios exist that involve multiple glocks,
processes, and cluster nodes, and working around all of those cases
isn't practical or even possible.

Avoid these kinds of problems by disabling page faults while holding the
inode glock.  If a page fault would occur, we either end up with a
partial read or write or with -EFAULT if nothing could be read or
written.  In either case, we know that we're not done with the
operation, so we indicate that we're willing to give up the inode glock
and then we fault in the missing pages.  If that made us lose the inode
glock, we return a partial read or write.  Otherwise, we resume the
operation.

This locking problem was originally reported by Jan Kara.  Linus came up
with the idea of disabling page faults.  Many thanks to Al Viro and
Matthew Wilcox for their feedback.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/file.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 96 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 8f37e4bab995..b07b9c2d0446 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -776,6 +776,36 @@ static int gfs2_fsync(struct file *file, loff_t start, loff_t end,
 	return ret ? ret : ret1;
 }
 
+static bool should_fault_in_pages(struct iov_iter *i, size_t *prev_count,
+				  size_t *window_size)
+{
+	char __user *p = i->iov[0].iov_base + i->iov_offset;
+	size_t count = iov_iter_count(i);
+	size_t size;
+
+	if (!iter_is_iovec(i))
+		return false;
+
+	if (*prev_count != count || !*window_size) {
+		int pages, nr_dirtied;
+
+		pages = min_t(int, BIO_MAX_VECS,
+			      DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE));
+		nr_dirtied = max(current->nr_dirtied_pause -
+				 current->nr_dirtied, 1);
+		pages = min(pages, nr_dirtied);
+		size = (size_t)PAGE_SIZE * pages - offset_in_page(p);
+	} else {
+		size = (size_t)PAGE_SIZE - offset_in_page(p);
+		if (*window_size <= size)
+			return false;
+	}
+
+	*prev_count = count;
+	*window_size = size;
+	return true;
+}
+
 static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
 				     struct gfs2_holder *gh)
 {
@@ -840,9 +870,17 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	struct gfs2_inode *ip;
 	struct gfs2_holder gh;
+	size_t prev_count = 0, window_size = 0;
 	size_t written = 0;
 	ssize_t ret;
 
+	/*
+	 * In this function, we disable page faults when we're holding the
+	 * inode glock while doing I/O.  If a page fault occurs, we indicate
+	 * that the inode glock may be dropped, fault in the pages manually,
+	 * and retry.
+	 */
+
 	if (iocb->ki_flags & IOCB_DIRECT) {
 		ret = gfs2_file_direct_read(iocb, to, &gh);
 		if (likely(ret != -ENOTBLK))
@@ -864,13 +902,35 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	}
 	ip = GFS2_I(iocb->ki_filp->f_mapping->host);
 	gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
+retry:
 	ret = gfs2_glock_nq(&gh);
 	if (ret)
 		goto out_uninit;
+retry_under_glock:
+	pagefault_disable();
 	ret = generic_file_read_iter(iocb, to);
+	pagefault_enable();
 	if (ret > 0)
 		written += ret;
-	gfs2_glock_dq(&gh);
+
+	if (unlikely(iov_iter_count(to) && (ret > 0 || ret == -EFAULT)) &&
+	    should_fault_in_pages(to, &prev_count, &window_size)) {
+		size_t leftover;
+
+		gfs2_holder_allow_demote(&gh);
+		leftover = fault_in_iov_iter_writeable(to, window_size);
+		gfs2_holder_disallow_demote(&gh);
+		if (leftover != window_size) {
+			if (!gfs2_holder_queued(&gh)) {
+				if (written)
+					goto out_uninit;
+				goto retry;
+			}
+			goto retry_under_glock;
+		}
+	}
+	if (gfs2_holder_queued(&gh))
+		gfs2_glock_dq(&gh);
 out_uninit:
 	gfs2_holder_uninit(&gh);
 	return written ? written : ret;
@@ -885,8 +945,17 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb,
 	struct gfs2_inode *ip = GFS2_I(inode);
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
 	struct gfs2_holder *statfs_gh = NULL;
+	size_t prev_count = 0, window_size = 0;
+	size_t read = 0;
 	ssize_t ret;
 
+	/*
+	 * In this function, we disable page faults when we're holding the
+	 * inode glock while doing I/O.  If a page fault occurs, we indicate
+	 * that the inode glock may be dropped, fault in the pages manually,
+	 * and retry.
+	 */
+
 	if (inode == sdp->sd_rindex) {
 		statfs_gh = kmalloc(sizeof(*statfs_gh), GFP_NOFS);
 		if (!statfs_gh)
@@ -894,10 +963,11 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb,
 	}
 
 	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, gh);
+retry:
 	ret = gfs2_glock_nq(gh);
 	if (ret)
 		goto out_uninit;
-
+retry_under_glock:
 	if (inode == sdp->sd_rindex) {
 		struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
 
@@ -908,21 +978,42 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb,
 	}
 
 	current->backing_dev_info = inode_to_bdi(inode);
+	pagefault_disable();
 	ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
+	pagefault_enable();
 	current->backing_dev_info = NULL;
-	if (ret > 0)
+	if (ret > 0) {
 		iocb->ki_pos += ret;
+		read += ret;
+	}
 
 	if (inode == sdp->sd_rindex)
 		gfs2_glock_dq_uninit(statfs_gh);
 
+	if (unlikely(iov_iter_count(from) && (ret > 0 || ret == -EFAULT)) &&
+	    should_fault_in_pages(from, &prev_count, &window_size)) {
+		size_t leftover;
+
+		gfs2_holder_allow_demote(gh);
+		leftover = fault_in_iov_iter_readable(from, window_size);
+		gfs2_holder_disallow_demote(gh);
+		if (leftover != window_size) {
+			if (!gfs2_holder_queued(gh)) {
+				if (read)
+					goto out_uninit;
+				goto retry;
+			}
+			goto retry_under_glock;
+		}
+	}
 out_unlock:
-	gfs2_glock_dq(gh);
+	if (gfs2_holder_queued(gh))
+		gfs2_glock_dq(gh);
 out_uninit:
 	gfs2_holder_uninit(gh);
 	if (statfs_gh)
 		kfree(statfs_gh);
-	return ret;
+	return read ? read : ret;
 }
 
 /**
-- 
2.26.3


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Ocfs2-devel] [PATCH v8 12/17] iomap: Fix iomap_dio_rw return value for user copies
  2021-10-19 13:41 [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (10 preceding siblings ...)
  2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 11/17] gfs2: Fix mmap + page fault deadlocks for buffered I/O Andreas Gruenbacher
@ 2021-10-19 13:41 ` Andreas Gruenbacher
  2021-10-19 13:42 ` [Ocfs2-devel] [PATCH v8 13/17] iomap: Support partial direct I/O on user copy failures Andreas Gruenbacher
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 47+ messages in thread
From: Andreas Gruenbacher @ 2021-10-19 13:41 UTC (permalink / raw)
  To: Linus Torvalds, Catalin Marinas
  Cc: kvm-ppc, Paul Mackerras, cluster-devel, Jan Kara,
	Andreas Gruenbacher, linux-kernel, Christoph Hellwig,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

When a user copy fails in one of the helpers of iomap_dio_rw, fail with
-EFAULT instead of returning 0.  This matches what iomap_dio_bio_actor
returns when it gets an -EFAULT from bio_iov_iter_get_pages.  With these
changes, iomap_dio_actor now consistently fails with -EFAULT when a user
page cannot be faulted in.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/direct-io.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 4ecd255e0511..a2a368e824c0 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -371,6 +371,8 @@ static loff_t iomap_dio_hole_iter(const struct iomap_iter *iter,
 	loff_t length = iov_iter_zero(iomap_length(iter), dio->submit.iter);
 
 	dio->size += length;
+	if (!length)
+		return -EFAULT;
 	return length;
 }
 
@@ -402,6 +404,8 @@ static loff_t iomap_dio_inline_iter(const struct iomap_iter *iomi,
 		copied = copy_to_iter(inline_data, length, iter);
 	}
 	dio->size += copied;
+	if (!copied)
+		return -EFAULT;
 	return copied;
 }
 
-- 
2.26.3


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Ocfs2-devel] [PATCH v8 13/17] iomap: Support partial direct I/O on user copy failures
  2021-10-19 13:41 [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (11 preceding siblings ...)
  2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 12/17] iomap: Fix iomap_dio_rw return value for user copies Andreas Gruenbacher
@ 2021-10-19 13:42 ` Andreas Gruenbacher
  2021-10-19 13:42 ` [Ocfs2-devel] [PATCH v8 14/17] iomap: Add done_before argument to iomap_dio_rw Andreas Gruenbacher
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 47+ messages in thread
From: Andreas Gruenbacher @ 2021-10-19 13:42 UTC (permalink / raw)
  To: Linus Torvalds, Catalin Marinas
  Cc: kvm-ppc, Paul Mackerras, cluster-devel, Jan Kara,
	Andreas Gruenbacher, linux-kernel, Christoph Hellwig,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

In iomap_dio_rw, when iomap_apply returns an -EFAULT error and the
IOMAP_DIO_PARTIAL flag is set, complete the request synchronously and
return a partial result.  This allows the caller to deal with the page
fault and retry the remainder of the request.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/iomap/direct-io.c  | 6 ++++++
 include/linux/iomap.h | 7 +++++++
 2 files changed, 13 insertions(+)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index a2a368e824c0..a434fb7887b2 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -581,6 +581,12 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (iov_iter_rw(iter) == READ && iomi.pos >= dio->i_size)
 		iov_iter_revert(iter, iomi.pos - dio->i_size);
 
+	if (ret == -EFAULT && dio->size && (dio_flags & IOMAP_DIO_PARTIAL)) {
+		if (!(iocb->ki_flags & IOCB_NOWAIT))
+			wait_for_completion = true;
+		ret = 0;
+	}
+
 	/* magic error code to fall back to buffered I/O */
 	if (ret == -ENOTBLK) {
 		wait_for_completion = true;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 24f8489583ca..2a213b0d1e1f 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -330,6 +330,13 @@ struct iomap_dio_ops {
   */
 #define IOMAP_DIO_OVERWRITE_ONLY	(1 << 1)
 
+/*
+ * When a page fault occurs, return a partial synchronous result and allow
+ * the caller to retry the rest of the operation after dealing with the page
+ * fault.
+ */
+#define IOMAP_DIO_PARTIAL		(1 << 2)
+
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
 		unsigned int dio_flags);
-- 
2.26.3


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Ocfs2-devel] [PATCH v8 14/17] iomap: Add done_before argument to iomap_dio_rw
  2021-10-19 13:41 [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (12 preceding siblings ...)
  2021-10-19 13:42 ` [Ocfs2-devel] [PATCH v8 13/17] iomap: Support partial direct I/O on user copy failures Andreas Gruenbacher
@ 2021-10-19 13:42 ` Andreas Gruenbacher
  2021-10-19 15:51   ` Darrick J. Wong
  2021-10-19 13:42 ` [Ocfs2-devel] [PATCH v8 15/17] gup: Introduce FOLL_NOFAULT flag to disable page faults Andreas Gruenbacher
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 47+ messages in thread
From: Andreas Gruenbacher @ 2021-10-19 13:42 UTC (permalink / raw)
  To: Linus Torvalds, Catalin Marinas
  Cc: kvm-ppc, Paul Mackerras, cluster-devel, Jan Kara,
	Andreas Gruenbacher, linux-kernel, Christoph Hellwig,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

Add a done_before argument to iomap_dio_rw that indicates how much of
the request has already been transferred.  When the request succeeds, we
report that done_before additional bytes were tranferred.  This is
useful for finishing a request asynchronously when part of the request
has already been completed synchronously.

We'll use that to allow iomap_dio_rw to be used with page faults
disabled: when a page fault occurs while submitting a request, we
synchronously complete the part of the request that has already been
submitted.  The caller can then take care of the page fault and call
iomap_dio_rw again for the rest of the request, passing in the number of
bytes already tranferred.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/btrfs/file.c       |  5 +++--
 fs/erofs/data.c       |  2 +-
 fs/ext4/file.c        |  5 +++--
 fs/gfs2/file.c        |  4 ++--
 fs/iomap/direct-io.c  | 11 ++++++++---
 fs/xfs/xfs_file.c     |  6 +++---
 fs/zonefs/super.c     |  4 ++--
 include/linux/iomap.h |  4 ++--
 8 files changed, 24 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index f37211d3bb69..9d41b28c67ba 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1957,7 +1957,7 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	}
 
 	dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
-			     0);
+			     0, 0);
 
 	btrfs_inode_unlock(inode, ilock_flags);
 
@@ -3658,7 +3658,8 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
 		return 0;
 
 	btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
-	ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops, 0);
+	ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
+			   0, 0);
 	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
 	return ret;
 }
diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index 9db829715652..16a41d0db55a 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -287,7 +287,7 @@ static ssize_t erofs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 
 		if (!err)
 			return iomap_dio_rw(iocb, to, &erofs_iomap_ops,
-					    NULL, 0);
+					    NULL, 0, 0);
 		if (err < 0)
 			return err;
 	}
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index ac0e11bbb445..b25c1f8f7c4f 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -74,7 +74,7 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		return generic_file_read_iter(iocb, to);
 	}
 
-	ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0);
+	ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0, 0);
 	inode_unlock_shared(inode);
 
 	file_accessed(iocb->ki_filp);
@@ -566,7 +566,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (ilock_shared)
 		iomap_ops = &ext4_iomap_overwrite_ops;
 	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
-			   (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0);
+			   (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0,
+			   0);
 	if (ret == -ENOTBLK)
 		ret = 0;
 
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index b07b9c2d0446..ae06defcf369 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -822,7 +822,7 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
 	if (ret)
 		goto out_uninit;
 
-	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0);
+	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0, 0);
 	gfs2_glock_dq(gh);
 out_uninit:
 	gfs2_holder_uninit(gh);
@@ -856,7 +856,7 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 	if (offset + len > i_size_read(&ip->i_inode))
 		goto out;
 
-	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0);
+	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0, 0);
 	if (ret == -ENOTBLK)
 		ret = 0;
 out:
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index a434fb7887b2..fdf68339bc8b 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -31,6 +31,7 @@ struct iomap_dio {
 	atomic_t		ref;
 	unsigned		flags;
 	int			error;
+	size_t			done_before;
 	bool			wait_for_completion;
 
 	union {
@@ -124,6 +125,9 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
 	if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
 		ret = generic_write_sync(iocb, ret);
 
+	if (ret > 0)
+		ret += dio->done_before;
+
 	kfree(dio);
 
 	return ret;
@@ -456,7 +460,7 @@ static loff_t iomap_dio_iter(const struct iomap_iter *iter,
 struct iomap_dio *
 __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
-		unsigned int dio_flags)
+		unsigned int dio_flags, size_t done_before)
 {
 	struct address_space *mapping = iocb->ki_filp->f_mapping;
 	struct inode *inode = file_inode(iocb->ki_filp);
@@ -486,6 +490,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	dio->dops = dops;
 	dio->error = 0;
 	dio->flags = 0;
+	dio->done_before = done_before;
 
 	dio->submit.iter = iter;
 	dio->submit.waiter = current;
@@ -652,11 +657,11 @@ EXPORT_SYMBOL_GPL(__iomap_dio_rw);
 ssize_t
 iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
-		unsigned int dio_flags)
+		unsigned int dio_flags, size_t done_before)
 {
 	struct iomap_dio *dio;
 
-	dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags);
+	dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags, done_before);
 	if (IS_ERR_OR_NULL(dio))
 		return PTR_ERR_OR_ZERO(dio);
 	return iomap_dio_complete(dio);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 7aa943edfc02..240eb932c014 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -259,7 +259,7 @@ xfs_file_dio_read(
 	ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
 	if (ret)
 		return ret;
-	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0);
+	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0, 0);
 	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
 
 	return ret;
@@ -569,7 +569,7 @@ xfs_file_dio_write_aligned(
 	}
 	trace_xfs_file_direct_write(iocb, from);
 	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
-			   &xfs_dio_write_ops, 0);
+			   &xfs_dio_write_ops, 0, 0);
 out_unlock:
 	if (iolock)
 		xfs_iunlock(ip, iolock);
@@ -647,7 +647,7 @@ xfs_file_dio_write_unaligned(
 
 	trace_xfs_file_direct_write(iocb, from);
 	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
-			   &xfs_dio_write_ops, flags);
+			   &xfs_dio_write_ops, flags, 0);
 
 	/*
 	 * Retry unaligned I/O with exclusive blocking semantics if the DIO
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index ddc346a9df9b..6122c38ab44d 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -852,7 +852,7 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
 		ret = zonefs_file_dio_append(iocb, from);
 	else
 		ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops,
-				   &zonefs_write_dio_ops, 0);
+				   &zonefs_write_dio_ops, 0, 0);
 	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
 	    (ret > 0 || ret == -EIOCBQUEUED)) {
 		if (ret > 0)
@@ -987,7 +987,7 @@ static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		}
 		file_accessed(iocb->ki_filp);
 		ret = iomap_dio_rw(iocb, to, &zonefs_iomap_ops,
-				   &zonefs_read_dio_ops, 0);
+				   &zonefs_read_dio_ops, 0, 0);
 	} else {
 		ret = generic_file_read_iter(iocb, to);
 		if (ret == -EIO)
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 2a213b0d1e1f..829f2325ecba 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -339,10 +339,10 @@ struct iomap_dio_ops {
 
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
-		unsigned int dio_flags);
+		unsigned int dio_flags, size_t done_before);
 struct iomap_dio *__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
-		unsigned int dio_flags);
+		unsigned int dio_flags, size_t done_before);
 ssize_t iomap_dio_complete(struct iomap_dio *dio);
 int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);
 
-- 
2.26.3


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Ocfs2-devel] [PATCH v8 15/17] gup: Introduce FOLL_NOFAULT flag to disable page faults
  2021-10-19 13:41 [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (13 preceding siblings ...)
  2021-10-19 13:42 ` [Ocfs2-devel] [PATCH v8 14/17] iomap: Add done_before argument to iomap_dio_rw Andreas Gruenbacher
@ 2021-10-19 13:42 ` Andreas Gruenbacher
  2021-10-19 13:42 ` [Ocfs2-devel] [PATCH v8 16/17] iov_iter: Introduce nofault " Andreas Gruenbacher
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 47+ messages in thread
From: Andreas Gruenbacher @ 2021-10-19 13:42 UTC (permalink / raw)
  To: Linus Torvalds, Catalin Marinas
  Cc: kvm-ppc, Paul Mackerras, cluster-devel, Jan Kara,
	Andreas Gruenbacher, linux-kernel, Christoph Hellwig,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

Introduce a new FOLL_NOFAULT flag that causes get_user_pages to return
-EFAULT when it would otherwise trigger a page fault.  This is roughly
similar to FOLL_FAST_ONLY but available on all architectures, and less
fragile.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 include/linux/mm.h | 3 ++-
 mm/gup.c           | 4 +++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 73a52aba448f..2f0e6b9f8f3b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2851,7 +2851,8 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_FORCE	0x10	/* get_user_pages read/write w/o permission */
 #define FOLL_NOWAIT	0x20	/* if a disk transfer is needed, start the IO
 				 * and return without waiting upon it */
-#define FOLL_POPULATE	0x40	/* fault in page */
+#define FOLL_POPULATE	0x40	/* fault in pages (with FOLL_MLOCK) */
+#define FOLL_NOFAULT	0x80	/* do not fault in pages */
 #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
 #define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
 #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
diff --git a/mm/gup.c b/mm/gup.c
index 614b8536b3b6..6ec8f5494424 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -918,6 +918,8 @@ static int faultin_page(struct vm_area_struct *vma,
 	/* mlock all present pages, but do not fault in new pages */
 	if ((*flags & (FOLL_POPULATE | FOLL_MLOCK)) == FOLL_MLOCK)
 		return -ENOENT;
+	if (*flags & FOLL_NOFAULT)
+		return -EFAULT;
 	if (*flags & FOLL_WRITE)
 		fault_flags |= FAULT_FLAG_WRITE;
 	if (*flags & FOLL_REMOTE)
@@ -2843,7 +2845,7 @@ static int internal_get_user_pages_fast(unsigned long start,
 
 	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
 				       FOLL_FORCE | FOLL_PIN | FOLL_GET |
-				       FOLL_FAST_ONLY)))
+				       FOLL_FAST_ONLY | FOLL_NOFAULT)))
 		return -EINVAL;
 
 	if (gup_flags & FOLL_PIN)
-- 
2.26.3


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Ocfs2-devel] [PATCH v8 16/17] iov_iter: Introduce nofault flag to disable page faults
  2021-10-19 13:41 [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (14 preceding siblings ...)
  2021-10-19 13:42 ` [Ocfs2-devel] [PATCH v8 15/17] gup: Introduce FOLL_NOFAULT flag to disable page faults Andreas Gruenbacher
@ 2021-10-19 13:42 ` Andreas Gruenbacher
  2021-10-19 13:42 ` [Ocfs2-devel] [PATCH v8 17/17] gfs2: Fix mmap + page fault deadlocks for direct I/O Andreas Gruenbacher
  2021-10-19 15:40 ` [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks Linus Torvalds
  17 siblings, 0 replies; 47+ messages in thread
From: Andreas Gruenbacher @ 2021-10-19 13:42 UTC (permalink / raw)
  To: Linus Torvalds, Catalin Marinas
  Cc: kvm-ppc, Paul Mackerras, cluster-devel, Jan Kara,
	Andreas Gruenbacher, linux-kernel, Christoph Hellwig,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

Introduce a new nofault flag to indicate to iov_iter_get_pages not to
fault in user pages.

This is implemented by passing the FOLL_NOFAULT flag to get_user_pages,
which causes get_user_pages to fail when it would otherwise fault in a
page. We'll use the ->nofault flag to prevent iomap_dio_rw from faulting
in pages when page faults are not allowed.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 include/linux/uio.h |  1 +
 lib/iov_iter.c      | 20 +++++++++++++++-----
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 25d1c24fd829..6350354f97e9 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -35,6 +35,7 @@ struct iov_iter_state {
 
 struct iov_iter {
 	u8 iter_type;
+	bool nofault;
 	bool data_source;
 	size_t iov_offset;
 	size_t count;
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index ac9a87e727a3..66a740e6e153 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -513,6 +513,7 @@ void iov_iter_init(struct iov_iter *i, unsigned int direction,
 	WARN_ON(direction & ~(READ | WRITE));
 	*i = (struct iov_iter) {
 		.iter_type = ITER_IOVEC,
+		.nofault = false,
 		.data_source = direction,
 		.iov = iov,
 		.nr_segs = nr_segs,
@@ -1527,13 +1528,17 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 		return 0;
 
 	if (likely(iter_is_iovec(i))) {
+		unsigned int gup_flags = 0;
 		unsigned long addr;
 
+		if (iov_iter_rw(i) != WRITE)
+			gup_flags |= FOLL_WRITE;
+		if (i->nofault)
+			gup_flags |= FOLL_NOFAULT;
+
 		addr = first_iovec_segment(i, &len, start, maxsize, maxpages);
 		n = DIV_ROUND_UP(len, PAGE_SIZE);
-		res = get_user_pages_fast(addr, n,
-				iov_iter_rw(i) != WRITE ?  FOLL_WRITE : 0,
-				pages);
+		res = get_user_pages_fast(addr, n, gup_flags, pages);
 		if (unlikely(res <= 0))
 			return res;
 		return (res == n ? len : res * PAGE_SIZE) - *start;
@@ -1649,15 +1654,20 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		return 0;
 
 	if (likely(iter_is_iovec(i))) {
+		unsigned int gup_flags = 0;
 		unsigned long addr;
 
+		if (iov_iter_rw(i) != WRITE)
+			gup_flags |= FOLL_WRITE;
+		if (i->nofault)
+			gup_flags |= FOLL_NOFAULT;
+
 		addr = first_iovec_segment(i, &len, start, maxsize, ~0U);
 		n = DIV_ROUND_UP(len, PAGE_SIZE);
 		p = get_pages_array(n);
 		if (!p)
 			return -ENOMEM;
-		res = get_user_pages_fast(addr, n,
-				iov_iter_rw(i) != WRITE ?  FOLL_WRITE : 0, p);
+		res = get_user_pages_fast(addr, n, gup_flags, p);
 		if (unlikely(res <= 0)) {
 			kvfree(p);
 			*pages = NULL;
-- 
2.26.3


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [Ocfs2-devel] [PATCH v8 17/17] gfs2: Fix mmap + page fault deadlocks for direct I/O
  2021-10-19 13:41 [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (15 preceding siblings ...)
  2021-10-19 13:42 ` [Ocfs2-devel] [PATCH v8 16/17] iov_iter: Introduce nofault " Andreas Gruenbacher
@ 2021-10-19 13:42 ` Andreas Gruenbacher
  2021-10-19 15:40 ` [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks Linus Torvalds
  17 siblings, 0 replies; 47+ messages in thread
From: Andreas Gruenbacher @ 2021-10-19 13:42 UTC (permalink / raw)
  To: Linus Torvalds, Catalin Marinas
  Cc: kvm-ppc, Paul Mackerras, cluster-devel, Jan Kara,
	Andreas Gruenbacher, linux-kernel, Christoph Hellwig,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

Also disable page faults during direct I/O requests and implement a
similar kind of retry logic as in the buffered I/O case.

The retry logic in the direct I/O case differs from the buffered I/O
case in the following way: direct I/O doesn't provide the kinds of
consistency guarantees between concurrent reads and writes that buffered
I/O provides, so once we lose the inode glock while faulting in user
pages, we always resume the operation.  We never need to return a
partial read or write.

This locking problem was originally reported by Jan Kara.  Linus came up
with the idea of disabling page faults.  Many thanks to Al Viro and
Matthew Wilcox for their feedback.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/file.c | 101 +++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 89 insertions(+), 12 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index ae06defcf369..a8e440b4d21c 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -811,22 +811,65 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
 {
 	struct file *file = iocb->ki_filp;
 	struct gfs2_inode *ip = GFS2_I(file->f_mapping->host);
-	size_t count = iov_iter_count(to);
+	size_t prev_count = 0, window_size = 0;
+	size_t written = 0;
 	ssize_t ret;
 
-	if (!count)
+	/*
+	 * In this function, we disable page faults when we're holding the
+	 * inode glock while doing I/O.  If a page fault occurs, we indicate
+	 * that the inode glock may be dropped, fault in the pages manually,
+	 * and retry.
+	 *
+	 * Unlike generic_file_read_iter, for reads, iomap_dio_rw can trigger
+	 * physical as well as manual page faults, and we need to disable both
+	 * kinds.
+	 *
+	 * For direct I/O, gfs2 takes the inode glock in deferred mode.  This
+	 * locking mode is compatible with other deferred holders, so multiple
+	 * processes and nodes can do direct I/O to a file at the same time.
+	 * There's no guarantee that reads or writes will be atomic.  Any
+	 * coordination among readers and writers needs to happen externally.
+	 */
+
+	if (!iov_iter_count(to))
 		return 0; /* skip atime */
 
 	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh);
+retry:
 	ret = gfs2_glock_nq(gh);
 	if (ret)
 		goto out_uninit;
+retry_under_glock:
+	pagefault_disable();
+	to->nofault = true;
+	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL,
+			   IOMAP_DIO_PARTIAL, written);
+	to->nofault = false;
+	pagefault_enable();
+	if (ret > 0)
+		written = ret;
 
-	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0, 0);
-	gfs2_glock_dq(gh);
+	if (unlikely(iov_iter_count(to) && (ret > 0 || ret == -EFAULT)) &&
+	    should_fault_in_pages(to, &prev_count, &window_size)) {
+		size_t leftover;
+
+		gfs2_holder_allow_demote(gh);
+		leftover = fault_in_iov_iter_writeable(to, window_size);
+		gfs2_holder_disallow_demote(gh);
+		if (leftover != window_size) {
+			if (!gfs2_holder_queued(gh))
+				goto retry;
+			goto retry_under_glock;
+		}
+	}
+	if (gfs2_holder_queued(gh))
+		gfs2_glock_dq(gh);
 out_uninit:
 	gfs2_holder_uninit(gh);
-	return ret;
+	if (ret < 0)
+		return ret;
+	return written;
 }
 
 static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
@@ -835,10 +878,20 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
 	struct gfs2_inode *ip = GFS2_I(inode);
-	size_t len = iov_iter_count(from);
-	loff_t offset = iocb->ki_pos;
+	size_t prev_count = 0, window_size = 0;
+	size_t read = 0;
 	ssize_t ret;
 
+	/*
+	 * In this function, we disable page faults when we're holding the
+	 * inode glock while doing I/O.  If a page fault occurs, we indicate
+	 * that the inode glock may be dropped, fault in the pages manually,
+	 * and retry.
+	 *
+	 * For writes, iomap_dio_rw only triggers manual page faults, so we
+	 * don't need to disable physical ones.
+	 */
+
 	/*
 	 * Deferred lock, even if its a write, since we do no allocation on
 	 * this path. All we need to change is the atime, and this lock mode
@@ -848,22 +901,46 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 	 * VFS does.
 	 */
 	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh);
+retry:
 	ret = gfs2_glock_nq(gh);
 	if (ret)
 		goto out_uninit;
-
+retry_under_glock:
 	/* Silently fall back to buffered I/O when writing beyond EOF */
-	if (offset + len > i_size_read(&ip->i_inode))
+	if (iocb->ki_pos + iov_iter_count(from) > i_size_read(&ip->i_inode))
 		goto out;
 
-	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0, 0);
+	from->nofault = true;
+	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL,
+			   IOMAP_DIO_PARTIAL, read);
+	from->nofault = false;
+
 	if (ret == -ENOTBLK)
 		ret = 0;
+	if (ret > 0)
+		read = ret;
+
+	if (unlikely(iov_iter_count(from) && (ret > 0 || ret == -EFAULT)) &&
+	    should_fault_in_pages(from, &prev_count, &window_size)) {
+		size_t leftover;
+
+		gfs2_holder_allow_demote(gh);
+		leftover = fault_in_iov_iter_readable(from, window_size);
+		gfs2_holder_disallow_demote(gh);
+		if (leftover != window_size) {
+			if (!gfs2_holder_queued(gh))
+				goto retry;
+			goto retry_under_glock;
+		}
+	}
 out:
-	gfs2_glock_dq(gh);
+	if (gfs2_holder_queued(gh))
+		gfs2_glock_dq(gh);
 out_uninit:
 	gfs2_holder_uninit(gh);
-	return ret;
+	if (ret < 0)
+		return ret;
+	return read;
 }
 
 static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
-- 
2.26.3


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
  2021-10-19 13:41 [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (16 preceding siblings ...)
  2021-10-19 13:42 ` [Ocfs2-devel] [PATCH v8 17/17] gfs2: Fix mmap + page fault deadlocks for direct I/O Andreas Gruenbacher
@ 2021-10-19 15:40 ` Linus Torvalds
  2021-10-19 16:00   ` Bob Peterson
  2021-10-20 16:36   ` Catalin Marinas
  17 siblings, 2 replies; 47+ messages in thread
From: Linus Torvalds @ 2021-10-19 15:40 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: kvm-ppc, Christoph Hellwig, cluster-devel, Jan Kara,
	Catalin Marinas, Linux Kernel Mailing List, Paul Mackerras,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

On Tue, Oct 19, 2021 at 3:42 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> From my point of view, the following questions remain:
>
>  * I hope these patches will be merged for v5.16, but what process
>    should I follow for that?  The patch queue contains mm and iomap
>    changes, so a pull request from the gfs2 tree would be unusual.

Oh, I'd much rather get these as one pull request from the author and
from the person that actually ended up testing this.

It might be "unusual", but it's certainly not unheard of, and trying
to push different parts of the series through different maintainers
would just cause lots of extra churn.

Yes, normally I'd expect filesystem changes to have a diffstat that
clearly shows that "yes, it's all local to this filesystem", and when
I see anything else it raises red flags.

But it raises red flags not because it would be wrong to have changes
to other parts, but simply because when cross-subsystem development
happens, it needs to be discussed and cleared with people. And you've
done that.

So I'd take this as one pull request from you. You've been doing the
work, you get the questionable glory of being in charge of it all.
You'll get the blame too ;)

>  * Will Catalin Marinas's work for supporting arm64 sub-page faults
>    be queued behind these patches?  We have an overlap in
>    fault_in_[pages_]readable fault_in_[pages_]writeable, so one of
>    the two patch queues will need some adjustments.

I think that on the whole they should be developed separately, I don't
think it's going to be a particularly difficult conflict.

That whole discussion does mean that I suspect that we'll have to
change fault_in_iov_iter_writeable() to do the "every 16 bytes" or
whatever thing, and make it use an actual atomic "add zero" or
whatever rather than walk the page tables. But that's a conceptually
separate discussion from this one, I wouldn't actually want to mix up
the two issues too much.

Sure, they touch the same code, so there is _that_ overlap, but one is
about "the hardware rules are a-changing" and the other is about
filesystem use of - and expansion of - the things we do. Let's keep
them separate until ready, and then fix up the fallout at that point
(either as a merge resolution, or even partly after-the-fact).

                     Linus

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Ocfs2-devel] [PATCH v8 14/17] iomap: Add done_before argument to iomap_dio_rw
  2021-10-19 13:42 ` [Ocfs2-devel] [PATCH v8 14/17] iomap: Add done_before argument to iomap_dio_rw Andreas Gruenbacher
@ 2021-10-19 15:51   ` Darrick J. Wong
  2021-10-19 19:30     ` Andreas Gruenbacher
  0 siblings, 1 reply; 47+ messages in thread
From: Darrick J. Wong @ 2021-10-19 15:51 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: kvm-ppc, Paul Mackerras, cluster-devel, Jan Kara,
	Catalin Marinas, linux-kernel, Christoph Hellwig, Alexander Viro,
	linux-fsdevel, linux-btrfs, Linus Torvalds, ocfs2-devel

On Tue, Oct 19, 2021 at 03:42:01PM +0200, Andreas Gruenbacher wrote:
> Add a done_before argument to iomap_dio_rw that indicates how much of
> the request has already been transferred.  When the request succeeds, we
> report that done_before additional bytes were tranferred.  This is
> useful for finishing a request asynchronously when part of the request
> has already been completed synchronously.
> 
> We'll use that to allow iomap_dio_rw to be used with page faults
> disabled: when a page fault occurs while submitting a request, we
> synchronously complete the part of the request that has already been
> submitted.  The caller can then take care of the page fault and call
> iomap_dio_rw again for the rest of the request, passing in the number of
> bytes already tranferred.
> 
> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
> ---
>  fs/btrfs/file.c       |  5 +++--
>  fs/erofs/data.c       |  2 +-
>  fs/ext4/file.c        |  5 +++--
>  fs/gfs2/file.c        |  4 ++--
>  fs/iomap/direct-io.c  | 11 ++++++++---
>  fs/xfs/xfs_file.c     |  6 +++---
>  fs/zonefs/super.c     |  4 ++--
>  include/linux/iomap.h |  4 ++--
>  8 files changed, 24 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index f37211d3bb69..9d41b28c67ba 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1957,7 +1957,7 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>  	}
>  
>  	dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> -			     0);
> +			     0, 0);
>  
>  	btrfs_inode_unlock(inode, ilock_flags);
>  
> @@ -3658,7 +3658,8 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
>  		return 0;
>  
>  	btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
> -	ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops, 0);
> +	ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> +			   0, 0);
>  	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
>  	return ret;
>  }
> diff --git a/fs/erofs/data.c b/fs/erofs/data.c
> index 9db829715652..16a41d0db55a 100644
> --- a/fs/erofs/data.c
> +++ b/fs/erofs/data.c
> @@ -287,7 +287,7 @@ static ssize_t erofs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  
>  		if (!err)
>  			return iomap_dio_rw(iocb, to, &erofs_iomap_ops,
> -					    NULL, 0);
> +					    NULL, 0, 0);
>  		if (err < 0)
>  			return err;
>  	}
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index ac0e11bbb445..b25c1f8f7c4f 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -74,7 +74,7 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  		return generic_file_read_iter(iocb, to);
>  	}
>  
> -	ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0);
> +	ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0, 0);
>  	inode_unlock_shared(inode);
>  
>  	file_accessed(iocb->ki_filp);
> @@ -566,7 +566,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  	if (ilock_shared)
>  		iomap_ops = &ext4_iomap_overwrite_ops;
>  	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
> -			   (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0);
> +			   (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0,
> +			   0);
>  	if (ret == -ENOTBLK)
>  		ret = 0;
>  
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index b07b9c2d0446..ae06defcf369 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -822,7 +822,7 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
>  	if (ret)
>  		goto out_uninit;
>  
> -	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0);
> +	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0, 0);
>  	gfs2_glock_dq(gh);
>  out_uninit:
>  	gfs2_holder_uninit(gh);
> @@ -856,7 +856,7 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
>  	if (offset + len > i_size_read(&ip->i_inode))
>  		goto out;
>  
> -	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0);
> +	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0, 0);
>  	if (ret == -ENOTBLK)
>  		ret = 0;
>  out:
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index a434fb7887b2..fdf68339bc8b 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -31,6 +31,7 @@ struct iomap_dio {
>  	atomic_t		ref;
>  	unsigned		flags;
>  	int			error;
> +	size_t			done_before;

I have basically the same comment as last time[1]:

"So, now that I actually understand the reason why the count of
previously transferred bytes has to be passed into the iomap_dio, I
would like this field to have a comment so that stupid maintainers like
me don't forget the subtleties again:

        /*
	 * For asynchronous IO, we have one chance to call the iocb
	 * completion method with the results of the directio operation.
	 * If this operation is a resubmission after a previous partial
	 * completion (e.g. page fault), we need to know about that
	 * progress so that we can report both the results of the prior
	 * completion and the result of the resubmission to the iocb
	 * submitter.
         */
        size_t                  done_before;

With that added, I think I can live with this enough to:
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

[1] https://lore.kernel.org/linux-fsdevel/20210903185351.GD9892@magnolia/

>  	bool			wait_for_completion;
>  
>  	union {
> @@ -124,6 +125,9 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
>  	if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
>  		ret = generic_write_sync(iocb, ret);
>  
> +	if (ret > 0)
> +		ret += dio->done_before;
> +
>  	kfree(dio);
>  
>  	return ret;
> @@ -456,7 +460,7 @@ static loff_t iomap_dio_iter(const struct iomap_iter *iter,
>  struct iomap_dio *
>  __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> -		unsigned int dio_flags)
> +		unsigned int dio_flags, size_t done_before)
>  {
>  	struct address_space *mapping = iocb->ki_filp->f_mapping;
>  	struct inode *inode = file_inode(iocb->ki_filp);
> @@ -486,6 +490,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	dio->dops = dops;
>  	dio->error = 0;
>  	dio->flags = 0;
> +	dio->done_before = done_before;
>  
>  	dio->submit.iter = iter;
>  	dio->submit.waiter = current;
> @@ -652,11 +657,11 @@ EXPORT_SYMBOL_GPL(__iomap_dio_rw);
>  ssize_t
>  iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> -		unsigned int dio_flags)
> +		unsigned int dio_flags, size_t done_before)
>  {
>  	struct iomap_dio *dio;
>  
> -	dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags);
> +	dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags, done_before);
>  	if (IS_ERR_OR_NULL(dio))
>  		return PTR_ERR_OR_ZERO(dio);
>  	return iomap_dio_complete(dio);
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 7aa943edfc02..240eb932c014 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -259,7 +259,7 @@ xfs_file_dio_read(
>  	ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
>  	if (ret)
>  		return ret;
> -	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0);
> +	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0, 0);
>  	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
>  
>  	return ret;
> @@ -569,7 +569,7 @@ xfs_file_dio_write_aligned(
>  	}
>  	trace_xfs_file_direct_write(iocb, from);
>  	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
> -			   &xfs_dio_write_ops, 0);
> +			   &xfs_dio_write_ops, 0, 0);
>  out_unlock:
>  	if (iolock)
>  		xfs_iunlock(ip, iolock);
> @@ -647,7 +647,7 @@ xfs_file_dio_write_unaligned(
>  
>  	trace_xfs_file_direct_write(iocb, from);
>  	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
> -			   &xfs_dio_write_ops, flags);
> +			   &xfs_dio_write_ops, flags, 0);
>  
>  	/*
>  	 * Retry unaligned I/O with exclusive blocking semantics if the DIO
> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> index ddc346a9df9b..6122c38ab44d 100644
> --- a/fs/zonefs/super.c
> +++ b/fs/zonefs/super.c
> @@ -852,7 +852,7 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
>  		ret = zonefs_file_dio_append(iocb, from);
>  	else
>  		ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops,
> -				   &zonefs_write_dio_ops, 0);
> +				   &zonefs_write_dio_ops, 0, 0);
>  	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
>  	    (ret > 0 || ret == -EIOCBQUEUED)) {
>  		if (ret > 0)
> @@ -987,7 +987,7 @@ static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  		}
>  		file_accessed(iocb->ki_filp);
>  		ret = iomap_dio_rw(iocb, to, &zonefs_iomap_ops,
> -				   &zonefs_read_dio_ops, 0);
> +				   &zonefs_read_dio_ops, 0, 0);
>  	} else {
>  		ret = generic_file_read_iter(iocb, to);
>  		if (ret == -EIO)
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 2a213b0d1e1f..829f2325ecba 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -339,10 +339,10 @@ struct iomap_dio_ops {
>  
>  ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> -		unsigned int dio_flags);
> +		unsigned int dio_flags, size_t done_before);
>  struct iomap_dio *__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> -		unsigned int dio_flags);
> +		unsigned int dio_flags, size_t done_before);
>  ssize_t iomap_dio_complete(struct iomap_dio *dio);
>  int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);
>  
> -- 
> 2.26.3
> 

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
  2021-10-19 15:40 ` [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks Linus Torvalds
@ 2021-10-19 16:00   ` Bob Peterson
  2021-10-20 16:36   ` Catalin Marinas
  1 sibling, 0 replies; 47+ messages in thread
From: Bob Peterson @ 2021-10-19 16:00 UTC (permalink / raw)
  To: Linus Torvalds, Andreas Gruenbacher
  Cc: kvm-ppc, Christoph Hellwig, cluster-devel, Jan Kara,
	Catalin Marinas, Linux Kernel Mailing List, Paul Mackerras,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

On 10/19/21 10:40 AM, Linus Torvalds wrote:
> On Tue, Oct 19, 2021 at 3:42 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>>
>>  From my point of view, the following questions remain:
>>
>>   * I hope these patches will be merged for v5.16, but what process
>>     should I follow for that?  The patch queue contains mm and iomap
>>     changes, so a pull request from the gfs2 tree would be unusual.
> 
> Oh, I'd much rather get these as one pull request from the author and
> from the person that actually ended up testing this.

Hi Linus,

FWIW, I've been working with Andreas on this and have tested it quite
extensively, although only with gfs2. I've tested it with numerous
scenarios, both stand-alone (xfstests as well as several other test
programs I have in my collection) and in a cluster with some very heavy
duty cluster coherency tests. My testing is nearly complete, but not
quite.

Regards,

Bob Peterson
GFS2 File System


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Ocfs2-devel] [PATCH v8 14/17] iomap: Add done_before argument to iomap_dio_rw
  2021-10-19 15:51   ` Darrick J. Wong
@ 2021-10-19 19:30     ` Andreas Gruenbacher
  2021-10-20  1:57       ` Darrick J. Wong
  0 siblings, 1 reply; 47+ messages in thread
From: Andreas Gruenbacher @ 2021-10-19 19:30 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: kvm-ppc, Paul Mackerras, cluster-devel, Jan Kara,
	Catalin Marinas, LKML, Christoph Hellwig, Alexander Viro,
	linux-fsdevel, linux-btrfs, Linus Torvalds, ocfs2-devel

On Tue, Oct 19, 2021 at 5:51 PM Darrick J. Wong <djwong@kernel.org> wrote:
> On Tue, Oct 19, 2021 at 03:42:01PM +0200, Andreas Gruenbacher wrote:
> > Add a done_before argument to iomap_dio_rw that indicates how much of
> > the request has already been transferred.  When the request succeeds, we
> > report that done_before additional bytes were tranferred.  This is
> > useful for finishing a request asynchronously when part of the request
> > has already been completed synchronously.
> >
> > We'll use that to allow iomap_dio_rw to be used with page faults
> > disabled: when a page fault occurs while submitting a request, we
> > synchronously complete the part of the request that has already been
> > submitted.  The caller can then take care of the page fault and call
> > iomap_dio_rw again for the rest of the request, passing in the number of
> > bytes already tranferred.
> >
> > Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
> > ---
> >  fs/btrfs/file.c       |  5 +++--
> >  fs/erofs/data.c       |  2 +-
> >  fs/ext4/file.c        |  5 +++--
> >  fs/gfs2/file.c        |  4 ++--
> >  fs/iomap/direct-io.c  | 11 ++++++++---
> >  fs/xfs/xfs_file.c     |  6 +++---
> >  fs/zonefs/super.c     |  4 ++--
> >  include/linux/iomap.h |  4 ++--
> >  8 files changed, 24 insertions(+), 17 deletions(-)
> >
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index f37211d3bb69..9d41b28c67ba 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -1957,7 +1957,7 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
> >       }
> >
> >       dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> > -                          0);
> > +                          0, 0);
> >
> >       btrfs_inode_unlock(inode, ilock_flags);
> >
> > @@ -3658,7 +3658,8 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
> >               return 0;
> >
> >       btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
> > -     ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops, 0);
> > +     ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> > +                        0, 0);
> >       btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
> >       return ret;
> >  }
> > diff --git a/fs/erofs/data.c b/fs/erofs/data.c
> > index 9db829715652..16a41d0db55a 100644
> > --- a/fs/erofs/data.c
> > +++ b/fs/erofs/data.c
> > @@ -287,7 +287,7 @@ static ssize_t erofs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> >
> >               if (!err)
> >                       return iomap_dio_rw(iocb, to, &erofs_iomap_ops,
> > -                                         NULL, 0);
> > +                                         NULL, 0, 0);
> >               if (err < 0)
> >                       return err;
> >       }
> > diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> > index ac0e11bbb445..b25c1f8f7c4f 100644
> > --- a/fs/ext4/file.c
> > +++ b/fs/ext4/file.c
> > @@ -74,7 +74,7 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
> >               return generic_file_read_iter(iocb, to);
> >       }
> >
> > -     ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0);
> > +     ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0, 0);
> >       inode_unlock_shared(inode);
> >
> >       file_accessed(iocb->ki_filp);
> > @@ -566,7 +566,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >       if (ilock_shared)
> >               iomap_ops = &ext4_iomap_overwrite_ops;
> >       ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
> > -                        (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0);
> > +                        (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0,
> > +                        0);
> >       if (ret == -ENOTBLK)
> >               ret = 0;
> >
> > diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> > index b07b9c2d0446..ae06defcf369 100644
> > --- a/fs/gfs2/file.c
> > +++ b/fs/gfs2/file.c
> > @@ -822,7 +822,7 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
> >       if (ret)
> >               goto out_uninit;
> >
> > -     ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0);
> > +     ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0, 0);
> >       gfs2_glock_dq(gh);
> >  out_uninit:
> >       gfs2_holder_uninit(gh);
> > @@ -856,7 +856,7 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
> >       if (offset + len > i_size_read(&ip->i_inode))
> >               goto out;
> >
> > -     ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0);
> > +     ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0, 0);
> >       if (ret == -ENOTBLK)
> >               ret = 0;
> >  out:
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index a434fb7887b2..fdf68339bc8b 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -31,6 +31,7 @@ struct iomap_dio {
> >       atomic_t                ref;
> >       unsigned                flags;
> >       int                     error;
> > +     size_t                  done_before;
>
> I have basically the same comment as last time[1]:
>
> "So, now that I actually understand the reason why the count of
> previously transferred bytes has to be passed into the iomap_dio, I
> would like this field to have a comment so that stupid maintainers like
> me don't forget the subtleties again:
>
>         /*
>          * For asynchronous IO, we have one chance to call the iocb
>          * completion method with the results of the directio operation.
>          * If this operation is a resubmission after a previous partial
>          * completion (e.g. page fault), we need to know about that
>          * progress so that we can report both the results of the prior
>          * completion and the result of the resubmission to the iocb
>          * submitter.
>          */
>         size_t                  done_before;
>
> With that added, I think I can live with this enough to:
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Indeed, sorry for missing that. How about the below change instead
though; I think that better sums up what's going on?

--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -454,6 +454,14 @@ static loff_t iomap_dio_iter(const struct iomap_iter *iter,
  * may be pure data writes. In that case, we still need to do a full data sync
  * completion.
  *
+ * When page faults are disabled, __iomap_dio_rw has already prepared some data
+ * to be transferred, @iter references a non-resident page, and @dio_flags
+ * includes IOMAP_DIO_PARTIAL, __iomap_dio_rw will return a partial result.  In
+ * that case, the page or pages can be faulted in and the request resumed with
+ * @done_before set to the number of bytes previously transferred.  The request
+ * will then complete with the correct total number of bytes transferred; this
+ * is essential for completing partial requests asynchronously.
+ *
  * Returns -ENOTBLK In case of a page invalidation invalidation failure for
  * writes.  The callers needs to fall back to buffered I/O in this case.
  */

Thanks,
Andreas

>
> --D
>
> [1] https://lore.kernel.org/linux-fsdevel/20210903185351.GD9892@magnolia/
>
> >       bool                    wait_for_completion;
> >
> >       union {
> > @@ -124,6 +125,9 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
> >       if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
> >               ret = generic_write_sync(iocb, ret);
> >
> > +     if (ret > 0)
> > +             ret += dio->done_before;
> > +
> >       kfree(dio);
> >
> >       return ret;
> > @@ -456,7 +460,7 @@ static loff_t iomap_dio_iter(const struct iomap_iter *iter,
> >  struct iomap_dio *
> >  __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> >               const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> > -             unsigned int dio_flags)
> > +             unsigned int dio_flags, size_t done_before)
> >  {
> >       struct address_space *mapping = iocb->ki_filp->f_mapping;
> >       struct inode *inode = file_inode(iocb->ki_filp);
> > @@ -486,6 +490,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> >       dio->dops = dops;
> >       dio->error = 0;
> >       dio->flags = 0;
> > +     dio->done_before = done_before;
> >
> >       dio->submit.iter = iter;
> >       dio->submit.waiter = current;
> > @@ -652,11 +657,11 @@ EXPORT_SYMBOL_GPL(__iomap_dio_rw);
> >  ssize_t
> >  iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> >               const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> > -             unsigned int dio_flags)
> > +             unsigned int dio_flags, size_t done_before)
> >  {
> >       struct iomap_dio *dio;
> >
> > -     dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags);
> > +     dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags, done_before);
> >       if (IS_ERR_OR_NULL(dio))
> >               return PTR_ERR_OR_ZERO(dio);
> >       return iomap_dio_complete(dio);
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 7aa943edfc02..240eb932c014 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -259,7 +259,7 @@ xfs_file_dio_read(
> >       ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
> >       if (ret)
> >               return ret;
> > -     ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0);
> > +     ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0, 0);
> >       xfs_iunlock(ip, XFS_IOLOCK_SHARED);
> >
> >       return ret;
> > @@ -569,7 +569,7 @@ xfs_file_dio_write_aligned(
> >       }
> >       trace_xfs_file_direct_write(iocb, from);
> >       ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
> > -                        &xfs_dio_write_ops, 0);
> > +                        &xfs_dio_write_ops, 0, 0);
> >  out_unlock:
> >       if (iolock)
> >               xfs_iunlock(ip, iolock);
> > @@ -647,7 +647,7 @@ xfs_file_dio_write_unaligned(
> >
> >       trace_xfs_file_direct_write(iocb, from);
> >       ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
> > -                        &xfs_dio_write_ops, flags);
> > +                        &xfs_dio_write_ops, flags, 0);
> >
> >       /*
> >        * Retry unaligned I/O with exclusive blocking semantics if the DIO
> > diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> > index ddc346a9df9b..6122c38ab44d 100644
> > --- a/fs/zonefs/super.c
> > +++ b/fs/zonefs/super.c
> > @@ -852,7 +852,7 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
> >               ret = zonefs_file_dio_append(iocb, from);
> >       else
> >               ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops,
> > -                                &zonefs_write_dio_ops, 0);
> > +                                &zonefs_write_dio_ops, 0, 0);
> >       if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
> >           (ret > 0 || ret == -EIOCBQUEUED)) {
> >               if (ret > 0)
> > @@ -987,7 +987,7 @@ static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> >               }
> >               file_accessed(iocb->ki_filp);
> >               ret = iomap_dio_rw(iocb, to, &zonefs_iomap_ops,
> > -                                &zonefs_read_dio_ops, 0);
> > +                                &zonefs_read_dio_ops, 0, 0);
> >       } else {
> >               ret = generic_file_read_iter(iocb, to);
> >               if (ret == -EIO)
> > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > index 2a213b0d1e1f..829f2325ecba 100644
> > --- a/include/linux/iomap.h
> > +++ b/include/linux/iomap.h
> > @@ -339,10 +339,10 @@ struct iomap_dio_ops {
> >
> >  ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> >               const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> > -             unsigned int dio_flags);
> > +             unsigned int dio_flags, size_t done_before);
> >  struct iomap_dio *__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> >               const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> > -             unsigned int dio_flags);
> > +             unsigned int dio_flags, size_t done_before);
> >  ssize_t iomap_dio_complete(struct iomap_dio *dio);
> >  int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);
> >
> > --
> > 2.26.3
> >
>


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Ocfs2-devel] [PATCH v8 14/17] iomap: Add done_before argument to iomap_dio_rw
  2021-10-19 19:30     ` Andreas Gruenbacher
@ 2021-10-20  1:57       ` Darrick J. Wong
  0 siblings, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2021-10-20  1:57 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: kvm-ppc, Paul Mackerras, cluster-devel, Jan Kara,
	Catalin Marinas, LKML, Christoph Hellwig, Alexander Viro,
	linux-fsdevel, linux-btrfs, Linus Torvalds, ocfs2-devel

On Tue, Oct 19, 2021 at 09:30:58PM +0200, Andreas Gruenbacher wrote:
> On Tue, Oct 19, 2021 at 5:51 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > On Tue, Oct 19, 2021 at 03:42:01PM +0200, Andreas Gruenbacher wrote:
> > > Add a done_before argument to iomap_dio_rw that indicates how much of
> > > the request has already been transferred.  When the request succeeds, we
> > > report that done_before additional bytes were tranferred.  This is
> > > useful for finishing a request asynchronously when part of the request
> > > has already been completed synchronously.
> > >
> > > We'll use that to allow iomap_dio_rw to be used with page faults
> > > disabled: when a page fault occurs while submitting a request, we
> > > synchronously complete the part of the request that has already been
> > > submitted.  The caller can then take care of the page fault and call
> > > iomap_dio_rw again for the rest of the request, passing in the number of
> > > bytes already tranferred.
> > >
> > > Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
> > > ---
> > >  fs/btrfs/file.c       |  5 +++--
> > >  fs/erofs/data.c       |  2 +-
> > >  fs/ext4/file.c        |  5 +++--
> > >  fs/gfs2/file.c        |  4 ++--
> > >  fs/iomap/direct-io.c  | 11 ++++++++---
> > >  fs/xfs/xfs_file.c     |  6 +++---
> > >  fs/zonefs/super.c     |  4 ++--
> > >  include/linux/iomap.h |  4 ++--
> > >  8 files changed, 24 insertions(+), 17 deletions(-)
> > >
> > > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > > index f37211d3bb69..9d41b28c67ba 100644
> > > --- a/fs/btrfs/file.c
> > > +++ b/fs/btrfs/file.c
> > > @@ -1957,7 +1957,7 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
> > >       }
> > >
> > >       dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> > > -                          0);
> > > +                          0, 0);
> > >
> > >       btrfs_inode_unlock(inode, ilock_flags);
> > >
> > > @@ -3658,7 +3658,8 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
> > >               return 0;
> > >
> > >       btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
> > > -     ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops, 0);
> > > +     ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> > > +                        0, 0);
> > >       btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
> > >       return ret;
> > >  }
> > > diff --git a/fs/erofs/data.c b/fs/erofs/data.c
> > > index 9db829715652..16a41d0db55a 100644
> > > --- a/fs/erofs/data.c
> > > +++ b/fs/erofs/data.c
> > > @@ -287,7 +287,7 @@ static ssize_t erofs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > >
> > >               if (!err)
> > >                       return iomap_dio_rw(iocb, to, &erofs_iomap_ops,
> > > -                                         NULL, 0);
> > > +                                         NULL, 0, 0);
> > >               if (err < 0)
> > >                       return err;
> > >       }
> > > diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> > > index ac0e11bbb445..b25c1f8f7c4f 100644
> > > --- a/fs/ext4/file.c
> > > +++ b/fs/ext4/file.c
> > > @@ -74,7 +74,7 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > >               return generic_file_read_iter(iocb, to);
> > >       }
> > >
> > > -     ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0);
> > > +     ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0, 0);
> > >       inode_unlock_shared(inode);
> > >
> > >       file_accessed(iocb->ki_filp);
> > > @@ -566,7 +566,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > >       if (ilock_shared)
> > >               iomap_ops = &ext4_iomap_overwrite_ops;
> > >       ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
> > > -                        (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0);
> > > +                        (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0,
> > > +                        0);
> > >       if (ret == -ENOTBLK)
> > >               ret = 0;
> > >
> > > diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> > > index b07b9c2d0446..ae06defcf369 100644
> > > --- a/fs/gfs2/file.c
> > > +++ b/fs/gfs2/file.c
> > > @@ -822,7 +822,7 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
> > >       if (ret)
> > >               goto out_uninit;
> > >
> > > -     ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0);
> > > +     ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0, 0);
> > >       gfs2_glock_dq(gh);
> > >  out_uninit:
> > >       gfs2_holder_uninit(gh);
> > > @@ -856,7 +856,7 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
> > >       if (offset + len > i_size_read(&ip->i_inode))
> > >               goto out;
> > >
> > > -     ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0);
> > > +     ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0, 0);
> > >       if (ret == -ENOTBLK)
> > >               ret = 0;
> > >  out:
> > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > index a434fb7887b2..fdf68339bc8b 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -31,6 +31,7 @@ struct iomap_dio {
> > >       atomic_t                ref;
> > >       unsigned                flags;
> > >       int                     error;
> > > +     size_t                  done_before;
> >
> > I have basically the same comment as last time[1]:
> >
> > "So, now that I actually understand the reason why the count of
> > previously transferred bytes has to be passed into the iomap_dio, I
> > would like this field to have a comment so that stupid maintainers like
> > me don't forget the subtleties again:
> >
> >         /*
> >          * For asynchronous IO, we have one chance to call the iocb
> >          * completion method with the results of the directio operation.
> >          * If this operation is a resubmission after a previous partial
> >          * completion (e.g. page fault), we need to know about that
> >          * progress so that we can report both the results of the prior
> >          * completion and the result of the resubmission to the iocb
> >          * submitter.
> >          */
> >         size_t                  done_before;
> >
> > With that added, I think I can live with this enough to:
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> 
> Indeed, sorry for missing that. How about the below change instead
> though; I think that better sums up what's going on?
> 
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -454,6 +454,14 @@ static loff_t iomap_dio_iter(const struct iomap_iter *iter,
>   * may be pure data writes. In that case, we still need to do a full data sync
>   * completion.
>   *

What do you think about rearranging the first sentence to mention the
new flag a bit earlier?  e.g.

"When page faults are disabled and @dio_flags includes IOMAP_DIO_PARTIAL,
__iomap_dio_rw can return a partial result if it encounters a
non-resident page in @iter after preparing a transfer.  In that case,
the non-resident pages can be faulted in and the request resumed with
@done_before set to the number of bytes previously transferred...

--D

> + * When page faults are disabled, __iomap_dio_rw has already prepared some data
> + * to be transferred, @iter references a non-resident page, and @dio_flags
> + * includes IOMAP_DIO_PARTIAL, __iomap_dio_rw will return a partial result.  In
> + * that case, the page or pages can be faulted in and the request resumed with
> + * @done_before set to the number of bytes previously transferred.  The request
> + * will then complete with the correct total number of bytes transferred; this
> + * is essential for completing partial requests asynchronously.
> + *
>   * Returns -ENOTBLK In case of a page invalidation invalidation failure for
>   * writes.  The callers needs to fall back to buffered I/O in this case.
>   */
> 
> Thanks,
> Andreas
> 
> >
> > --D
> >
> > [1] https://lore.kernel.org/linux-fsdevel/20210903185351.GD9892@magnolia/
> >
> > >       bool                    wait_for_completion;
> > >
> > >       union {
> > > @@ -124,6 +125,9 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
> > >       if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
> > >               ret = generic_write_sync(iocb, ret);
> > >
> > > +     if (ret > 0)
> > > +             ret += dio->done_before;
> > > +
> > >       kfree(dio);
> > >
> > >       return ret;
> > > @@ -456,7 +460,7 @@ static loff_t iomap_dio_iter(const struct iomap_iter *iter,
> > >  struct iomap_dio *
> > >  __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > >               const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> > > -             unsigned int dio_flags)
> > > +             unsigned int dio_flags, size_t done_before)
> > >  {
> > >       struct address_space *mapping = iocb->ki_filp->f_mapping;
> > >       struct inode *inode = file_inode(iocb->ki_filp);
> > > @@ -486,6 +490,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > >       dio->dops = dops;
> > >       dio->error = 0;
> > >       dio->flags = 0;
> > > +     dio->done_before = done_before;
> > >
> > >       dio->submit.iter = iter;
> > >       dio->submit.waiter = current;
> > > @@ -652,11 +657,11 @@ EXPORT_SYMBOL_GPL(__iomap_dio_rw);
> > >  ssize_t
> > >  iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > >               const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> > > -             unsigned int dio_flags)
> > > +             unsigned int dio_flags, size_t done_before)
> > >  {
> > >       struct iomap_dio *dio;
> > >
> > > -     dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags);
> > > +     dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags, done_before);
> > >       if (IS_ERR_OR_NULL(dio))
> > >               return PTR_ERR_OR_ZERO(dio);
> > >       return iomap_dio_complete(dio);
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index 7aa943edfc02..240eb932c014 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -259,7 +259,7 @@ xfs_file_dio_read(
> > >       ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
> > >       if (ret)
> > >               return ret;
> > > -     ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0);
> > > +     ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0, 0);
> > >       xfs_iunlock(ip, XFS_IOLOCK_SHARED);
> > >
> > >       return ret;
> > > @@ -569,7 +569,7 @@ xfs_file_dio_write_aligned(
> > >       }
> > >       trace_xfs_file_direct_write(iocb, from);
> > >       ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
> > > -                        &xfs_dio_write_ops, 0);
> > > +                        &xfs_dio_write_ops, 0, 0);
> > >  out_unlock:
> > >       if (iolock)
> > >               xfs_iunlock(ip, iolock);
> > > @@ -647,7 +647,7 @@ xfs_file_dio_write_unaligned(
> > >
> > >       trace_xfs_file_direct_write(iocb, from);
> > >       ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
> > > -                        &xfs_dio_write_ops, flags);
> > > +                        &xfs_dio_write_ops, flags, 0);
> > >
> > >       /*
> > >        * Retry unaligned I/O with exclusive blocking semantics if the DIO
> > > diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> > > index ddc346a9df9b..6122c38ab44d 100644
> > > --- a/fs/zonefs/super.c
> > > +++ b/fs/zonefs/super.c
> > > @@ -852,7 +852,7 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
> > >               ret = zonefs_file_dio_append(iocb, from);
> > >       else
> > >               ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops,
> > > -                                &zonefs_write_dio_ops, 0);
> > > +                                &zonefs_write_dio_ops, 0, 0);
> > >       if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
> > >           (ret > 0 || ret == -EIOCBQUEUED)) {
> > >               if (ret > 0)
> > > @@ -987,7 +987,7 @@ static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > >               }
> > >               file_accessed(iocb->ki_filp);
> > >               ret = iomap_dio_rw(iocb, to, &zonefs_iomap_ops,
> > > -                                &zonefs_read_dio_ops, 0);
> > > +                                &zonefs_read_dio_ops, 0, 0);
> > >       } else {
> > >               ret = generic_file_read_iter(iocb, to);
> > >               if (ret == -EIO)
> > > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > > index 2a213b0d1e1f..829f2325ecba 100644
> > > --- a/include/linux/iomap.h
> > > +++ b/include/linux/iomap.h
> > > @@ -339,10 +339,10 @@ struct iomap_dio_ops {
> > >
> > >  ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > >               const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> > > -             unsigned int dio_flags);
> > > +             unsigned int dio_flags, size_t done_before);
> > >  struct iomap_dio *__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > >               const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> > > -             unsigned int dio_flags);
> > > +             unsigned int dio_flags, size_t done_before);
> > >  ssize_t iomap_dio_complete(struct iomap_dio *dio);
> > >  int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);
> > >
> > > --
> > > 2.26.3
> > >
> >
> 

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Ocfs2-devel] [PATCH v8 05/17] iov_iter: Introduce fault_in_iov_iter_writeable
  2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 05/17] iov_iter: Introduce fault_in_iov_iter_writeable Andreas Gruenbacher
@ 2021-10-20 16:25   ` Catalin Marinas
  0 siblings, 0 replies; 47+ messages in thread
From: Catalin Marinas @ 2021-10-20 16:25 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: kvm-ppc, Paul Mackerras, cluster-devel, Jan Kara, linux-kernel,
	Christoph Hellwig, Alexander Viro, linux-fsdevel, linux-btrfs,
	Linus Torvalds, ocfs2-devel

On Tue, Oct 19, 2021 at 03:41:52PM +0200, Andreas Gruenbacher wrote:
> diff --git a/mm/gup.c b/mm/gup.c
> index a7efb027d6cf..614b8536b3b6 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1691,6 +1691,69 @@ size_t fault_in_writeable(char __user *uaddr, size_t size)
>  }
>  EXPORT_SYMBOL(fault_in_writeable);
>  
> +/*
> + * fault_in_safe_writeable - fault in an address range for writing
> + * @uaddr: start of address range
> + * @size: length of address range
> + *
> + * Faults in an address range using get_user_pages, i.e., without triggering
> + * hardware page faults.  This is primarily useful when we already know that
> + * some or all of the pages in the address range aren't in memory.
> + *
> + * Other than fault_in_writeable(), this function is non-destructive.
> + *
> + * Note that we don't pin or otherwise hold the pages referenced that we fault
> + * in.  There's no guarantee that they'll stay in memory for any duration of
> + * time.
> + *
> + * Returns the number of bytes not faulted in, like copy_to_user() and
> + * copy_from_user().
> + */
> +size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
> +{
> +	unsigned long start = (unsigned long)uaddr;
> +	unsigned long end, nstart, nend;
> +	struct mm_struct *mm = current->mm;
> +	struct vm_area_struct *vma = NULL;

For arm64 tagged addresses we need the diff below, otherwise the
subsequent find_vma() will fail:

diff --git a/mm/gup.c b/mm/gup.c
index f5f362cb4640..2c51e9748a6a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1713,7 +1713,7 @@ EXPORT_SYMBOL(fault_in_writeable);
  */
 size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
 {
-	unsigned long start = (unsigned long)uaddr;
+	unsigned long start = (unsigned long)untagged_addr(uaddr);
 	unsigned long end, nstart, nend;
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma = NULL;

-- 
Catalin

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
  2021-10-19 15:40 ` [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks Linus Torvalds
  2021-10-19 16:00   ` Bob Peterson
@ 2021-10-20 16:36   ` Catalin Marinas
  2021-10-20 20:11     ` Linus Torvalds
  1 sibling, 1 reply; 47+ messages in thread
From: Catalin Marinas @ 2021-10-20 16:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kvm-ppc, Christoph Hellwig, cluster-devel, Jan Kara,
	Andreas Gruenbacher, Linux Kernel Mailing List, Paul Mackerras,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

On Tue, Oct 19, 2021 at 05:40:13AM -1000, Linus Torvalds wrote:
> On Tue, Oct 19, 2021 at 3:42 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >  * Will Catalin Marinas's work for supporting arm64 sub-page faults
> >    be queued behind these patches?  We have an overlap in
> >    fault_in_[pages_]readable fault_in_[pages_]writeable, so one of
> >    the two patch queues will need some adjustments.
> 
> I think that on the whole they should be developed separately, I don't
> think it's going to be a particularly difficult conflict.
> 
> That whole discussion does mean that I suspect that we'll have to
> change fault_in_iov_iter_writeable() to do the "every 16 bytes" or
> whatever thing, and make it use an actual atomic "add zero" or
> whatever rather than walk the page tables. But that's a conceptually
> separate discussion from this one, I wouldn't actually want to mix up
> the two issues too much.

I agree we shouldn't mix the two at the moment. The MTE fix requires
some more thinking and it's not 5.16 material yet.

The atomic "add zero" trick isn't that simple for MTE since the arm64
atomic or exclusive instructions run with kernel privileges and
therefore with the kernel tag checking mode. We could toggle the mode to
match user's just for those atomic ops but it will make this probing
even more expensive (though normally it's done on the slow path).

The quick/backportable fix for MTE is probably to just disable tag
checking on user addresses during pagefault_disabled(). As I mentioned
in the other thread, a more elaborate fix I think is to change the
uaccess routines to update an error code somewhere in a similar way to
the arm64 __put_user_error(). But that would require changing lots of
callers.

-- 
Catalin

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
  2021-10-20 16:36   ` Catalin Marinas
@ 2021-10-20 20:11     ` Linus Torvalds
  2021-10-20 22:44       ` Catalin Marinas
  0 siblings, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2021-10-20 20:11 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: kvm-ppc, Christoph Hellwig, cluster-devel, Jan Kara,
	Andreas Gruenbacher, Linux Kernel Mailing List, Paul Mackerras,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

On Wed, Oct 20, 2021 at 6:37 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> The atomic "add zero" trick isn't that simple for MTE since the arm64
> atomic or exclusive instructions run with kernel privileges and
> therefore with the kernel tag checking mode.

Are there any instructions that are useful for "probe_user_write()"
kind of thing? We could always just add that as an arch function, with
a fallback to using the futex "add zero" if the architecture doesn't
need anything special.

Although at least for MTE, I think the solution was to do a regular
read, and that checks the tag, and then we could use the gup machinery
for the writability checks.

                Linus

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
  2021-10-20 20:11     ` Linus Torvalds
@ 2021-10-20 22:44       ` Catalin Marinas
  2021-10-21  6:19         ` Linus Torvalds
  0 siblings, 1 reply; 47+ messages in thread
From: Catalin Marinas @ 2021-10-20 22:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kvm-ppc, Christoph Hellwig, cluster-devel, Jan Kara,
	Andreas Gruenbacher, Linux Kernel Mailing List, Paul Mackerras,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

On Wed, Oct 20, 2021 at 10:11:19AM -1000, Linus Torvalds wrote:
> On Wed, Oct 20, 2021 at 6:37 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > The atomic "add zero" trick isn't that simple for MTE since the arm64
> > atomic or exclusive instructions run with kernel privileges and
> > therefore with the kernel tag checking mode.
> 
> Are there any instructions that are useful for "probe_user_write()"
> kind of thing?

If it's on a user address, the only single-instruction that works with
MTE is STTR (as in put_user()) but that's destructive. Other "add zero"
constructs require some potentially expensive system register accesses
just to set the tag checking mode of the current task.

A probe_user_write() on the kernel linear address involves reading the
tag from memory and comparing it with the tag in the user pointer. In
addition, it needs to take into account the current task's tag checking
mode and the vma vm_flags. We should have most of the information in the
gup code.

> Although at least for MTE, I think the solution was to do a regular
> read, and that checks the tag, and then we could use the gup machinery
> for the writability checks.

Yes, for MTE this should work. For CHERI I think an "add zero" would
do the trick (it should have atomics that work on capabilities
directly). However, with MTE doing both get_user() every 16 bytes and
gup can get pretty expensive. The problematic code is
fault_in_safe_writable() in this series.

I can give this 16-byte probing in gup a try (on top of -next) but IMHO
we unnecessarily overload the fault_in_*() logic with something the
kernel cannot fix up. The only reason we do it is so that we get an
error code and bail out of a loop but the uaccess routines could be
extended to report the fault type instead. It looks like we pretty much
duplicate the uaccess in the fault_in_*() functions (four accesses per
cache line).

-- 
Catalin

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
  2021-10-20 22:44       ` Catalin Marinas
@ 2021-10-21  6:19         ` Linus Torvalds
  2021-10-22 18:06           ` Catalin Marinas
  0 siblings, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2021-10-21  6:19 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: kvm-ppc, Christoph Hellwig, cluster-devel, Jan Kara,
	Andreas Gruenbacher, Linux Kernel Mailing List, Paul Mackerras,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

On Wed, Oct 20, 2021 at 12:44 PM Catalin Marinas
<catalin.marinas@arm.com> wrote:
>
> However, with MTE doing both get_user() every 16 bytes and
> gup can get pretty expensive.

So I really think that anything that is performance-critical had
better only do the "fault_in_write()" code path in the cold error path
where you took a page fault.

IOW, the loop structure should be


     repeat:
                take_locks();
                pagefault_disable();
                ret = do_things_with_locks();
                pagefault_enable();
                release_locks();

                // Failure path?
                if (unlikely(ret == -EFAULT)) {
                        if (fault_in_writable(..))
                                goto repeat;
                        return -EFAULT;
                }

rather than have it be some kind of "first do fault_in_writable(),
then do the work" model.

So I wouldn't worry too much about the performance concerns. It simply
shouldn't be a common or hot path.

And yes, I've seen code that does that "fault_in_xyz()" before the
critical operation that cannot take page faults, and does it
unconditionally.

But then it isn't the "fault_in_xyz()" that should be blamed if it is
slow, but the caller that does things the wrong way around.

            Linus

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
  2021-10-21  6:19         ` Linus Torvalds
@ 2021-10-22 18:06           ` Catalin Marinas
  2021-10-22 19:22             ` Linus Torvalds
  2021-10-25 18:24             ` Andreas Gruenbacher
  0 siblings, 2 replies; 47+ messages in thread
From: Catalin Marinas @ 2021-10-22 18:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kvm-ppc, Christoph Hellwig, cluster-devel, Jan Kara,
	Andreas Gruenbacher, Linux Kernel Mailing List, Paul Mackerras,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

On Wed, Oct 20, 2021 at 08:19:40PM -1000, Linus Torvalds wrote:
> On Wed, Oct 20, 2021 at 12:44 PM Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> >
> > However, with MTE doing both get_user() every 16 bytes and
> > gup can get pretty expensive.
> 
> So I really think that anything that is performance-critical had
> better only do the "fault_in_write()" code path in the cold error path
> where you took a page fault.
[...]
> So I wouldn't worry too much about the performance concerns. It simply
> shouldn't be a common or hot path.
> 
> And yes, I've seen code that does that "fault_in_xyz()" before the
> critical operation that cannot take page faults, and does it
> unconditionally.
> 
> But then it isn't the "fault_in_xyz()" that should be blamed if it is
> slow, but the caller that does things the wrong way around.

Some more thinking out loud. I did some unscientific benchmarks on a
Raspberry Pi 4 with the filesystem in a RAM block device and a
"dd if=/dev/zero of=/mnt/test" writing 512MB in 1MB blocks. I changed
fault_in_readable() in linux-next to probe every 16 bytes:

- ext4 drops from around 261MB/s to 246MB/s: 5.7% penalty

- btrfs drops from around 360MB/s to 337MB/s: 6.4% penalty

For generic_perform_write() Dave Hansen attempted to move the fault-in
after the uaccess in commit 998ef75ddb57 ("fs: do not prefault
sys_write() user buffer pages"). This was reverted as it was exposing an
ext4 bug. I don't whether it was fixed but re-applying Dave's commit
avoids the performance drop.

btrfs_buffered_write() has a comment about faulting pages in before
locking them in prepare_pages(). I suspect it's a similar problem and
the fault_in() could be moved, though I can't say I understand this code
well enough.

Probing only the first byte(s) in fault_in() would be ideal, no need to
go through all filesystems and try to change the uaccess/probing order.

-- 
Catalin

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
  2021-10-22 18:06           ` Catalin Marinas
@ 2021-10-22 19:22             ` Linus Torvalds
  2021-10-25 19:00               ` Andreas Gruenbacher
  2021-10-25 18:24             ` Andreas Gruenbacher
  1 sibling, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2021-10-22 19:22 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: kvm-ppc, Christoph Hellwig, cluster-devel, Jan Kara,
	Andreas Gruenbacher, Linux Kernel Mailing List, Paul Mackerras,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

On Fri, Oct 22, 2021 at 8:06 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> Probing only the first byte(s) in fault_in() would be ideal, no need to
> go through all filesystems and try to change the uaccess/probing order.

Let's try that. Or rather: probing just the first page - since there
are users like that btrfs ioctl, and the direct-io path.

                  Linus

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
  2021-10-22 18:06           ` Catalin Marinas
  2021-10-22 19:22             ` Linus Torvalds
@ 2021-10-25 18:24             ` Andreas Gruenbacher
  2021-10-26  5:12               ` Theodore Ts'o
                                 ` (2 more replies)
  1 sibling, 3 replies; 47+ messages in thread
From: Andreas Gruenbacher @ 2021-10-25 18:24 UTC (permalink / raw)
  To: Catalin Marinas, Dave Hansen, Ted Ts'o
  Cc: kvm-ppc, Christoph Hellwig, cluster-devel, Jan Kara,
	Linux Kernel Mailing List, Paul Mackerras, Alexander Viro,
	linux-fsdevel, linux-btrfs, Linus Torvalds, ocfs2-devel

commit
On Fri, Oct 22, 2021 at 8:07 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Wed, Oct 20, 2021 at 08:19:40PM -1000, Linus Torvalds wrote:
> > On Wed, Oct 20, 2021 at 12:44 PM Catalin Marinas
> > <catalin.marinas@arm.com> wrote:
> > >
> > > However, with MTE doing both get_user() every 16 bytes and
> > > gup can get pretty expensive.
> >
> > So I really think that anything that is performance-critical had
> > better only do the "fault_in_write()" code path in the cold error path
> > where you took a page fault.
> [...]
> > So I wouldn't worry too much about the performance concerns. It simply
> > shouldn't be a common or hot path.
> >
> > And yes, I've seen code that does that "fault_in_xyz()" before the
> > critical operation that cannot take page faults, and does it
> > unconditionally.
> >
> > But then it isn't the "fault_in_xyz()" that should be blamed if it is
> > slow, but the caller that does things the wrong way around.
>
> Some more thinking out loud. I did some unscientific benchmarks on a
> Raspberry Pi 4 with the filesystem in a RAM block device and a
> "dd if=/dev/zero of=/mnt/test" writing 512MB in 1MB blocks. I changed
> fault_in_readable() in linux-next to probe every 16 bytes:
>
> - ext4 drops from around 261MB/s to 246MB/s: 5.7% penalty
>
> - btrfs drops from around 360MB/s to 337MB/s: 6.4% penalty
>
> For generic_perform_write() Dave Hansen attempted to move the fault-in
> after the uaccess in commit 998ef75ddb57 ("fs: do not prefault
> sys_write() user buffer pages"). This was reverted as it was exposing an
> ext4 bug. I don't [know] whether it was fixed but re-applying Dave's commit
> avoids the performance drop.

Interesting. The revert of commit 998ef75ddb57 is in commit
00a3d660cbac. Maybe Dave and Ted can tell us more about what went
wrong in ext4 and whether it's still an issue.

Commit 998ef75ddb57 looks mostly good except that it should loop
around whenever the fault-in succeeds even partially, so it needs the
semantic change of patch 4 [*] of this series. A copy of the same code
now lives in iomap_write_iter, so the same fix needs to be applied
there. Finally, it may be worthwhile to check for pagefault_disabled()
in generic_perform_write and iomap_write_iter before trying the
fault-in; this would help gfs2 which will always call into
iomap_write_iter with page faults disabled, and additional callers
like that could emerge relatively soon.

[*] https://lore.kernel.org/lkml/20211019134204.3382645-5-agruenba@redhat.com/

> btrfs_buffered_write() has a comment about faulting pages in before
> locking them in prepare_pages(). I suspect it's a similar problem and
> the fault_in() could be moved, though I can't say I understand this code
> well enough.
>
> Probing only the first byte(s) in fault_in() would be ideal, no need to
> go through all filesystems and try to change the uaccess/probing order.

Thanks,
Andreas


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
  2021-10-22 19:22             ` Linus Torvalds
@ 2021-10-25 19:00               ` Andreas Gruenbacher
  2021-10-26 18:24                 ` Catalin Marinas
  0 siblings, 1 reply; 47+ messages in thread
From: Andreas Gruenbacher @ 2021-10-25 19:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kvm-ppc, Christoph Hellwig, cluster-devel, Jan Kara,
	Catalin Marinas, Linux Kernel Mailing List, Paul Mackerras,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

On Fri, Oct 22, 2021 at 9:23 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Oct 22, 2021 at 8:06 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > Probing only the first byte(s) in fault_in() would be ideal, no need to
> > go through all filesystems and try to change the uaccess/probing order.
>
> Let's try that. Or rather: probing just the first page - since there
> are users like that btrfs ioctl, and the direct-io path.

For direct I/O, we actually only want to trigger page fault-in so that
we can grab page references with bio_iov_iter_get_pages. Probing for
sub-page error domains will only slow things down. If we hit -EFAULT
during the actual copy-in or copy-out, we know that the error can't be
page fault related. Similarly, in the buffered I/O case, we only
really care about the next byte, so any probing beyond that is
unnecessary.

So maybe we should split the sub-page error domain probing off from
the fault-in functions. Or at least add an argument to the fault-in
functions that specifies the amount of memory to probe.

Thanks,
Andreas


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
  2021-10-25 18:24             ` Andreas Gruenbacher
@ 2021-10-26  5:12               ` Theodore Ts'o
  2021-10-26  9:44               ` Andreas Gruenbacher
  2021-10-27 21:21               ` Andreas Gruenbacher
  2 siblings, 0 replies; 47+ messages in thread
From: Theodore Ts'o @ 2021-10-26  5:12 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: kvm-ppc, Christoph Hellwig, cluster-devel, Jan Kara,
	Catalin Marinas, Dave Hansen, Linux Kernel Mailing List,
	Paul Mackerras, Alexander Viro, linux-fsdevel, linux-btrfs,
	Linus Torvalds, ocfs2-devel

On Mon, Oct 25, 2021 at 08:24:26PM +0200, Andreas Gruenbacher wrote:
> > For generic_perform_write() Dave Hansen attempted to move the fault-in
> > after the uaccess in commit 998ef75ddb57 ("fs: do not prefault
> > sys_write() user buffer pages"). This was reverted as it was exposing an
> > ext4 bug. I don't [know] whether it was fixed but re-applying Dave's commit
> > avoids the performance drop.
> 
> Interesting. The revert of commit 998ef75ddb57 is in commit
> 00a3d660cbac. Maybe Dave and Ted can tell us more about what went
> wrong in ext4 and whether it's still an issue.

The context for the revert can be found here[1].

[1] https://lore.kernel.org/lkml/20151005152236.GA8140@thunk.org/

And "what went wrong in ext4" was fixed here[2].

[2] https://lore.kernel.org/lkml/20151005152236.GA8140@thunk.org/

which landed upstream as commit b90197b65518 ("ext4: use private
version of page_zero_new_buffers() for data=journal mode").

So it looks like the original issue which triggered the revert in 2015
should be addressed, and we can easily test it by using generic/208
with data=journal mode.

There also seems to be a related discussion about whether we should
unrevert 998ef75ddb57 here[3].  Hmm. there is a mention on that thread
in [3], "Side note: search for "iov_iter_fault_in_writeable()" on lkml
for a gfs2 patch-series that is buggy, exactly because it does *not*
use the atomic user space accesses, and just tries to do the fault-in
to hide the real bug."  I assume that's related to the discussion on
this thread?

[3] https://lore.kernel.org/all/3221175.1624375240@warthog.procyon.org.uk/T/#u

						- Ted

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
  2021-10-25 18:24             ` Andreas Gruenbacher
  2021-10-26  5:12               ` Theodore Ts'o
@ 2021-10-26  9:44               ` Andreas Gruenbacher
  2021-10-27 21:21               ` Andreas Gruenbacher
  2 siblings, 0 replies; 47+ messages in thread
From: Andreas Gruenbacher @ 2021-10-26  9:44 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: kvm-ppc, Christoph Hellwig, cluster-devel, Jan Kara,
	Andreas Gruenbacher, Catalin Marinas, Dave Hansen,
	Linux Kernel Mailing List, Paul Mackerras, Alexander Viro,
	linux-fsdevel, linux-btrfs, Linus Torvalds, ocfs2-devel

Ted,

here's an updated version of Dave Hansen's original commit, but note
that generic/208 won't run on ext4 with data journaling enabled:

  $ MOUNT_OPTIONS='-o data=journal' TEST_DIR=/mnt/test TEST_DEV=/dev/vdb ./tests/generic/208
  QA output created by 208
  208 not run: ext4 data journaling doesn't support O_DIRECT

Thanks,
Andreas

--

Based on commit 998ef75ddb57 ("fs: do not prefault sys_write() user
buffer pages") by Dave Hansen <dave.hansen@linux.intel.com>, but:

* Fix generic_perform_write as well as iomap_write_iter.

* copy_page_from_iter_atomic() doesn't trigger page faults, so there's no need
  to disable page faults around it [see commit 9e8c2af96e0d ("callers of
  iov_copy_from_user_atomic() don't need pagecache_disable()")].

* If fault_in_iov_iter_readable() fails to fault in the entire buffer,
  we still want to read everything up to the fault position.  This depends on
  commit a6294593e8a1 ("iov_iter: Turn iov_iter_fault_in_readable into
  fault_in_iov_iter_readable").

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/iomap/buffered-io.c | 20 +++++++-------------
 mm/filemap.c           | 20 +++++++-------------
 2 files changed, 14 insertions(+), 26 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 1753c26c8e76..d8809cd9ab31 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -744,17 +744,6 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 		if (bytes > length)
 			bytes = length;
 
-		/*
-		 * Bring in the user page that we'll copy from _first_.
-		 * Otherwise there's a nasty deadlock on copying from the
-		 * same page as we're writing to, without it being marked
-		 * up-to-date.
-		 */
-		if (unlikely(fault_in_iov_iter_readable(i, bytes))) {
-			status = -EFAULT;
-			break;
-		}
-
 		status = iomap_write_begin(iter, pos, bytes, &page);
 		if (unlikely(status))
 			break;
@@ -777,9 +766,14 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 			 * halfway through, might be a race with munmap,
 			 * might be severe memory pressure.
 			 */
-			if (copied)
+			if (copied) {
 				bytes = copied;
-			goto again;
+				goto again;
+			}
+			if (fault_in_iov_iter_readable(i, bytes) != bytes)
+				goto again;
+			status = -EFAULT;
+			break;
 		}
 		pos += status;
 		written += status;
diff --git a/mm/filemap.c b/mm/filemap.c
index 4dd5edcd39fd..467cdb7d086d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3751,17 +3751,6 @@ ssize_t generic_perform_write(struct file *file,
 						iov_iter_count(i));
 
 again:
-		/*
-		 * Bring in the user page that we will copy from _first_.
-		 * Otherwise there's a nasty deadlock on copying from the
-		 * same page as we're writing to, without it being marked
-		 * up-to-date.
-		 */
-		if (unlikely(fault_in_iov_iter_readable(i, bytes))) {
-			status = -EFAULT;
-			break;
-		}
-
 		if (fatal_signal_pending(current)) {
 			status = -EINTR;
 			break;
@@ -3794,9 +3783,14 @@ ssize_t generic_perform_write(struct file *file,
 			 * halfway through, might be a race with munmap,
 			 * might be severe memory pressure.
 			 */
-			if (copied)
+			if (copied) {
 				bytes = copied;
-			goto again;
+				goto again;
+			}
+			if (fault_in_iov_iter_readable(i, bytes) != bytes)
+				goto again;
+			status = -EFAULT;
+			break;
 		}
 		pos += status;
 		written += status;
-- 
2.26.3


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
  2021-10-25 19:00               ` Andreas Gruenbacher
@ 2021-10-26 18:24                 ` Catalin Marinas
  2021-10-26 18:50                   ` Linus Torvalds
  0 siblings, 1 reply; 47+ messages in thread
From: Catalin Marinas @ 2021-10-26 18:24 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: kvm-ppc, Christoph Hellwig, cluster-devel, Jan Kara,
	Linux Kernel Mailing List, Paul Mackerras, Alexander Viro,
	linux-fsdevel, linux-btrfs, Linus Torvalds, ocfs2-devel

On Mon, Oct 25, 2021 at 09:00:43PM +0200, Andreas Gruenbacher wrote:
> On Fri, Oct 22, 2021 at 9:23 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Fri, Oct 22, 2021 at 8:06 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > Probing only the first byte(s) in fault_in() would be ideal, no need to
> > > go through all filesystems and try to change the uaccess/probing order.
> >
> > Let's try that. Or rather: probing just the first page - since there
> > are users like that btrfs ioctl, and the direct-io path.
> 
> For direct I/O, we actually only want to trigger page fault-in so that
> we can grab page references with bio_iov_iter_get_pages. Probing for
> sub-page error domains will only slow things down. If we hit -EFAULT
> during the actual copy-in or copy-out, we know that the error can't be
> page fault related. Similarly, in the buffered I/O case, we only
> really care about the next byte, so any probing beyond that is
> unnecessary.
> 
> So maybe we should split the sub-page error domain probing off from
> the fault-in functions. Or at least add an argument to the fault-in
> functions that specifies the amount of memory to probe.

My preferred option is not to touch fault-in for sub-page faults (though
I have some draft patches, they need testing).

All this fault-in and uaccess with pagefaults_disabled() is needed to
avoid a deadlock when the uaccess fault handling would take the same
lock. With sub-page faults, the kernel cannot fix it up anyway, so the
arch code won't even attempt call handle_mm_fault() (it is not an mm
fault). But the problem is the copy_*_user() etc. API that can only
return the number of bytes not copied. That's what I think should be
fixed. fault_in() feels like the wrong place to address this when it's
not an mm fault.

As for fault_in() getting another argument with the amount of sub-page
probing to do, I think the API gets even more confusing. I was also
thinking, with your patches for fault_in() now returning size_t, is the
expectation to be precise in what cannot be copied? We don't have such
requirement for copy_*_user().

While more intrusive, I'd rather change copy_page_from_iter_atomic()
etc. to take a pointer where to write back an error code. If it's
-EFAULT, retry the loop. If it's -EACCES/EPERM just bail out. Or maybe
simply a bool set if there was an mm fault to be retried. Yet another
option to return an -EAGAIN if it could not process the mm fault due to
page faults being disabled.

Happy to give this a try, unless there's a strong preference for the
fault_in() fix-up (well, I can do both options and post them).

-- 
Catalin

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
  2021-10-26 18:24                 ` Catalin Marinas
@ 2021-10-26 18:50                   ` Linus Torvalds
  2021-10-26 19:18                     ` Linus Torvalds
  2021-10-27 19:13                     ` Catalin Marinas
  0 siblings, 2 replies; 47+ messages in thread
From: Linus Torvalds @ 2021-10-26 18:50 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: kvm-ppc, Christoph Hellwig, cluster-devel, Jan Kara,
	Andreas Gruenbacher, Linux Kernel Mailing List, Paul Mackerras,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

On Tue, Oct 26, 2021 at 11:24 AM Catalin Marinas
<catalin.marinas@arm.com> wrote:
>
> While more intrusive, I'd rather change copy_page_from_iter_atomic()
> etc. to take a pointer where to write back an error code.

I absolutely hate this model.

The thing is, going down that rat-hole, you'll find that you'll need
to add it to *all* the "copy_to/from_user()" cases, which isn't
acceptable. So then you start doing some duplicate versions with
different calling conventions, just because of things like this.

So no, I really don't want a "pass down a reference to an extra error
code" kind of horror.

That said, the fact that these sub-page faults are always
non-recoverable might be a hint to a solution to the problem: maybe we
could extend the existing return code with actual negative error
numbers.

Because for _most_ cases of "copy_to/from_user()" and friends by far,
the only thing we look for is "zero for success".

We could extend the "number of bytes _not_ copied" semantics to say
"negative means fatal", and because there are fairly few places that
actually look at non-zero values, we could have a coccinelle script
that actually marks those places.

End result: no change in calling conventions, no change to most users,
and the (relatively few) cases where we look at the "what about
partial results", we just add a

         .. existing code ..
         ret = copy_from_user(..);
+        if (ret < 0)
+                break;  // or whatever "fatal error" situation
         .. existing  code ..

kind of thing that just stops the re-try.

(The coccinelle script couldn't actually do that, but it could add
some comment marker or something so that it's easy to find and then
manually fix up the places it finds).

             Linus

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
  2021-10-26 18:50                   ` Linus Torvalds
@ 2021-10-26 19:18                     ` Linus Torvalds
  2021-10-27 19:13                     ` Catalin Marinas
  1 sibling, 0 replies; 47+ messages in thread
From: Linus Torvalds @ 2021-10-26 19:18 UTC (permalink / raw)
  To: Catalin Marinas, Harry Wentland, Leo Li, Alex Deucher,
	Christian König
  Cc: kvm-ppc, Christoph Hellwig, cluster-devel, Jan Kara,
	Andreas Gruenbacher, Linux Kernel Mailing List, Paul Mackerras,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

On Tue, Oct 26, 2021 at 11:50 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Because for _most_ cases of "copy_to/from_user()" and friends by far,
> the only thing we look for is "zero for success".

Gaah. Looking around, I almost immediately found some really odd
exceptions to this.

Like parse_write_buffer_into_params() in amdgpu_dm_debugfs.c, which does

        r = copy_from_user(wr_buf_ptr, buf, wr_buf_size);

                /* r is bytes not be copied */
        if (r >= wr_buf_size) {
                DRM_DEBUG_DRIVER("user data not be read\n");
                return -EINVAL;
        }

and allows a partial copy to justy silently succeed, because all the
callers have pre-cleared the wr_buf_ptr buffer.

I have no idea why the code does that - it seems to imply that user
space could give an invalid 'size' parameter and mean to write only
the part that didn't succeed.

Adding AMD GPU driver people just to point out that this code not only
has odd whitespace, but that the pattern for "couldn't copy from user
space" should basically always be

        if (copy_from_user(wr_buf_ptr, buf, wr_buf_size))
                return -EFAULT;

because if user-space passes in a partially invalid buffer, you
generally really shouldn't say "ok, I got part of it and will use that
part"

There _are_ exceptions. We've had situations where user-space only
passes in the pointer to the buffer, but not the size. Bad interface,
but it historically happens for the 'mount()' system call 'data'
pointer. So then we'll copy "up to a page size".

Anyway, there are clearly some crazy users, and converting them all to
also check for negative error returns might be more painful than I
thought it would be.

                 Linus

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
  2021-10-26 18:50                   ` Linus Torvalds
  2021-10-26 19:18                     ` Linus Torvalds
@ 2021-10-27 19:13                     ` Catalin Marinas
  2021-10-27 21:14                       ` Linus Torvalds
  1 sibling, 1 reply; 47+ messages in thread
From: Catalin Marinas @ 2021-10-27 19:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kvm-ppc, Christoph Hellwig, cluster-devel, Jan Kara,
	Andreas Gruenbacher, Linux Kernel Mailing List, Paul Mackerras,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

On Tue, Oct 26, 2021 at 11:50:04AM -0700, Linus Torvalds wrote:
> On Tue, Oct 26, 2021 at 11:24 AM Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > While more intrusive, I'd rather change copy_page_from_iter_atomic()
> > etc. to take a pointer where to write back an error code.
[...]
> That said, the fact that these sub-page faults are always
> non-recoverable might be a hint to a solution to the problem: maybe we
> could extend the existing return code with actual negative error
> numbers.
> 
> Because for _most_ cases of "copy_to/from_user()" and friends by far,
> the only thing we look for is "zero for success".
> 
> We could extend the "number of bytes _not_ copied" semantics to say
> "negative means fatal", and because there are fairly few places that
> actually look at non-zero values, we could have a coccinelle script
> that actually marks those places.

As you already replied, there are some odd places where the returned
uncopied of bytes is used. Also for some valid cases like
copy_mount_options(), it's likely that it will fall back to
byte-at-a-time with MTE since it's a good chance it would hit a fault in
a 4K page (not a fast path though). I'd have to go through all the cases
and check whether the return value is meaningful. The iter_iov.c
functions and their callers also seem to make use of the bytes copied in
case they need to call iov_iter_revert() (though I suppose the
iov_iter_iovec_advance() would skip the update in case of an error).

As an alternative, you mentioned earlier that a per-thread fault status
was not feasible on x86 due to races. Was this only for the hw poison
case? I think the uaccess is slightly different.

We can add a current->non_recoverable_uaccess variable cleared on
pagefault_disable(), only set by uaccess faults and checked by the fs
code before re-attempting the fault_in(). An interrupt shouldn't do a
uaccess (well, if it does a _nofault one, we can detect in_interrupt()
in the MTE exception handler). Last time I looked at io_uring it was
running in a separate kernel thread, not sure whether this was changed.
I don't see what else would be racing with such
current->non_recoverable_uaccess variable. If that's doable, I think
it's the least intrusive approach.

-- 
Catalin

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
  2021-10-27 19:13                     ` Catalin Marinas
@ 2021-10-27 21:14                       ` Linus Torvalds
  2021-10-28 21:20                         ` Catalin Marinas
  0 siblings, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2021-10-27 21:14 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: kvm-ppc, Christoph Hellwig, cluster-devel, Jan Kara,
	Andreas Gruenbacher, Linux Kernel Mailing List, Paul Mackerras,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

On Wed, Oct 27, 2021 at 12:13 PM Catalin Marinas
<catalin.marinas@arm.com> wrote:
>
> As an alternative, you mentioned earlier that a per-thread fault status
> was not feasible on x86 due to races. Was this only for the hw poison
> case? I think the uaccess is slightly different.

It's not x86-specific, it's very generic.

If we set some flag in the per-thread status, we'll need to be careful
about not overwriting it if we then have a subsequent NMI that _also_
takes a (completely unrelated) page fault - before we then read the
per-thread flag.

Think 'perf' and fetching backtraces etc.

Note that the NMI page fault can easily also be a pointer coloring
fault on arm64, for exactly the same reason that whatever original
copy_from_user() code was. So this is not a "oh, pointer coloring
faults are different". They have the same re-entrancy issue.

And both the "pagefault_disable" and "fault happens in interrupt
context" cases are also the exact same 'faulthandler_disabled()'
thing. So even at fault time they look very similar.

So we'd have to have some way to separate out only the one we care about.

               Linus

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
  2021-10-25 18:24             ` Andreas Gruenbacher
  2021-10-26  5:12               ` Theodore Ts'o
  2021-10-26  9:44               ` Andreas Gruenbacher
@ 2021-10-27 21:21               ` Andreas Gruenbacher
  2 siblings, 0 replies; 47+ messages in thread
From: Andreas Gruenbacher @ 2021-10-27 21:21 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: kvm-ppc, Christoph Hellwig, cluster-devel, Jan Kara,
	Andreas Gruenbacher, Catalin Marinas, Dave Hansen, linux-kernel,
	Paul Mackerras, Alexander Viro, linux-fsdevel, linux-btrfs,
	Linus Torvalds, ocfs2-devel

One of the arguments against Dave Hansen's patch that eliminates the
pre-faulting was that it doubles the number of page faults in the slow
case.  This can be avoided by using get_user_pages() to do the
"post-faulting", though.  For what it's worth, here's a patch for that
(on top of this series).

Andreas

--

fs: Avoid predictable page faults for sys_write() user buffer pages

Introduce a new fault_in_iov_iter_slow_readable() helper for faulting in
an iterator via get_user_pages() instead of triggering page faults.
This is slower than a simple memory read when the underlying pages are
resident, but avoids the page fault overhead when the underlying pages
need to be faulted in.

Use fault_in_iov_iter_slow_readable() in generic_perform_write and
iomap_write_iter when reading from the user buffer fails.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/iomap/buffered-io.c  |  2 +-
 include/linux/pagemap.h |  3 ++-
 include/linux/uio.h     | 17 ++++++++++++++++-
 lib/iov_iter.c          | 10 ++++++----
 mm/filemap.c            |  2 +-
 mm/gup.c                | 10 ++++++----
 6 files changed, 32 insertions(+), 12 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index d8809cd9ab31..15a0b4bb9528 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -770,7 +770,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 				bytes = copied;
 				goto again;
 			}
-			if (fault_in_iov_iter_readable(i, bytes) != bytes)
+			if (fault_in_iov_iter_slow_readable(i, bytes) != bytes)
 				goto again;
 			status = -EFAULT;
 			break;
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 2f7dd14083d9..43844ed5675f 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -736,8 +736,9 @@ extern void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter);
  * Fault in userspace address range.
  */
 size_t fault_in_writeable(char __user *uaddr, size_t size);
-size_t fault_in_safe_writeable(const char __user *uaddr, size_t size);
 size_t fault_in_readable(const char __user *uaddr, size_t size);
+size_t __fault_in_slow(const char __user *uaddr, size_t size,
+		       unsigned int flags);
 
 int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask);
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 6350354f97e9..b071f4445059 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -8,6 +8,7 @@
 #include <linux/kernel.h>
 #include <linux/thread_info.h>
 #include <uapi/linux/uio.h>
+#include <linux/mm.h>
 
 struct page;
 struct pipe_inode_info;
@@ -135,7 +136,21 @@ size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
 void iov_iter_advance(struct iov_iter *i, size_t bytes);
 void iov_iter_revert(struct iov_iter *i, size_t bytes);
 size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t bytes);
-size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t bytes);
+size_t __fault_in_iov_iter_slow(const struct iov_iter *i, size_t bytes,
+				unsigned int flags);
+
+static inline size_t fault_in_iov_iter_slow_readable(const struct iov_iter *i,
+						     size_t bytes)
+{
+	return __fault_in_iov_iter_slow(i, bytes, 0);
+}
+
+static inline size_t fault_in_iov_iter_writeable(const struct iov_iter *i,
+						 size_t bytes)
+{
+	return __fault_in_iov_iter_slow(i, bytes, FOLL_WRITE);
+}
+
 size_t iov_iter_single_seg_count(const struct iov_iter *i);
 size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 66a740e6e153..73789a5409f6 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -468,9 +468,10 @@ size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t size)
 EXPORT_SYMBOL(fault_in_iov_iter_readable);
 
 /*
- * fault_in_iov_iter_writeable - fault in iov iterator for writing
+ * __fault_in_iov_iter_slow - fault in iov iterator for reading/writing
  * @i: iterator
  * @size: maximum length
+ * @flags: FOLL_* flags (FOLL_WRITE for writing)
  *
  * Faults in the iterator using get_user_pages(), i.e., without triggering
  * hardware page faults.  This is primarily useful when we already know that
@@ -481,7 +482,8 @@ EXPORT_SYMBOL(fault_in_iov_iter_readable);
  *
  * Always returns 0 for non-user-space iterators.
  */
-size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t size)
+size_t __fault_in_iov_iter_slow(const struct iov_iter *i, size_t size,
+				unsigned int flags)
 {
 	if (iter_is_iovec(i)) {
 		size_t count = min(size, iov_iter_count(i));
@@ -495,7 +497,7 @@ size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t size)
 
 			if (unlikely(!len))
 				continue;
-			ret = fault_in_safe_writeable(p->iov_base + skip, len);
+			ret = __fault_in_slow(p->iov_base + skip, len, flags);
 			count -= len - ret;
 			if (ret)
 				break;
@@ -504,7 +506,7 @@ size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t size)
 	}
 	return 0;
 }
-EXPORT_SYMBOL(fault_in_iov_iter_writeable);
+EXPORT_SYMBOL(__fault_in_iov_iter_slow);
 
 void iov_iter_init(struct iov_iter *i, unsigned int direction,
 			const struct iovec *iov, unsigned long nr_segs,
diff --git a/mm/filemap.c b/mm/filemap.c
index 467cdb7d086d..7ca76f4aa974 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3787,7 +3787,7 @@ ssize_t generic_perform_write(struct file *file,
 				bytes = copied;
 				goto again;
 			}
-			if (fault_in_iov_iter_readable(i, bytes) != bytes)
+			if (fault_in_iov_iter_slow_readable(i, bytes) != bytes)
 				goto again;
 			status = -EFAULT;
 			break;
diff --git a/mm/gup.c b/mm/gup.c
index e1c7e4bde11f..def9f478a621 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1694,9 +1694,10 @@ size_t fault_in_writeable(char __user *uaddr, size_t size)
 EXPORT_SYMBOL(fault_in_writeable);
 
 /*
- * fault_in_safe_writeable - fault in an address range for writing
+ * __fault_in_slow - fault in an address range for reading/writing
  * @uaddr: start of address range
  * @size: length of address range
+ * @flags: FOLL_* flags (FOLL_WRITE for writing)
  *
  * Faults in an address range using get_user_pages, i.e., without triggering
  * hardware page faults.  This is primarily useful when we already know that
@@ -1711,7 +1712,8 @@ EXPORT_SYMBOL(fault_in_writeable);
  * Returns the number of bytes not faulted in, like copy_to_user() and
  * copy_from_user().
  */
-size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
+size_t __fault_in_slow(const char __user *uaddr, size_t size,
+		       unsigned int flags)
 {
 	unsigned long start = (unsigned long)untagged_addr(uaddr);
 	unsigned long end, nstart, nend;
@@ -1743,7 +1745,7 @@ size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
 		nr_pages = (nend - nstart) / PAGE_SIZE;
 		ret = __get_user_pages_locked(mm, nstart, nr_pages,
 					      NULL, NULL, &locked,
-					      FOLL_TOUCH | FOLL_WRITE);
+					      FOLL_TOUCH | flags);
 		if (ret <= 0)
 			break;
 		nend = nstart + ret * PAGE_SIZE;
@@ -1754,7 +1756,7 @@ size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
 		return 0;
 	return size - min_t(size_t, nstart - start, size);
 }
-EXPORT_SYMBOL(fault_in_safe_writeable);
+EXPORT_SYMBOL(__fault_in_slow);
 
 /**
  * fault_in_readable - fault in userspace address range for reading
-- 
2.26.3


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
  2021-10-27 21:14                       ` Linus Torvalds
@ 2021-10-28 21:20                         ` Catalin Marinas
  2021-10-28 21:40                           ` Catalin Marinas
                                             ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Catalin Marinas @ 2021-10-28 21:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kvm-ppc, Christoph Hellwig, cluster-devel, Jan Kara,
	Andreas Gruenbacher, Linux Kernel Mailing List, Paul Mackerras,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

One last try on this path before I switch to the other options.

On Wed, Oct 27, 2021 at 02:14:48PM -0700, Linus Torvalds wrote:
> On Wed, Oct 27, 2021 at 12:13 PM Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > As an alternative, you mentioned earlier that a per-thread fault status
> > was not feasible on x86 due to races. Was this only for the hw poison
> > case? I think the uaccess is slightly different.
> 
> It's not x86-specific, it's very generic.
> 
> If we set some flag in the per-thread status, we'll need to be careful
> about not overwriting it if we then have a subsequent NMI that _also_
> takes a (completely unrelated) page fault - before we then read the
> per-thread flag.
> 
> Think 'perf' and fetching backtraces etc.
> 
> Note that the NMI page fault can easily also be a pointer coloring
> fault on arm64, for exactly the same reason that whatever original
> copy_from_user() code was. So this is not a "oh, pointer coloring
> faults are different". They have the same re-entrancy issue.
> 
> And both the "pagefault_disable" and "fault happens in interrupt
> context" cases are also the exact same 'faulthandler_disabled()'
> thing. So even at fault time they look very similar.

They do look fairly similar but we should have the information in the
fault handler to distinguish: not a page fault (pte permission or p*d
translation), in_task(), user address, fixup handler. But I agree the
logic looks fragile.

I think for nested contexts we can save the uaccess fault state on
exception entry, restore it on return. Or (needs some thinking on
atomicity) save it in a local variable. The high-level API would look
something like:

	unsigned long uaccess_flags;	/* we could use TIF_ flags */

	uaccess_flags = begin_retriable_uaccess();
	copied = copy_page_from_iter_atomic(...);
	retry = end_retriable_uaccess(uaccess_flags);
	...

	if (!retry)
		break;

I think we'd need a TIF flag to mark the retriable region and another to
track whether a non-recoverable fault occurred. It needs prototyping.

Anyway, if you don't like this approach, I'll look at error codes being
returned but rather than changing all copy_from_user() etc., introduce a
new API that returns different error codes depending on the fault
(e.g -EFAULT vs -EACCES). We already have copy_from_user_nofault(), we'd
need something for the iov_iter stuff to use in the fs code.

-- 
Catalin

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
  2021-10-28 21:20                         ` Catalin Marinas
@ 2021-10-28 21:40                           ` Catalin Marinas
  2021-10-28 22:15                           ` Andreas Grünbacher
  2021-10-28 22:32                           ` Linus Torvalds
  2 siblings, 0 replies; 47+ messages in thread
From: Catalin Marinas @ 2021-10-28 21:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kvm-ppc, Christoph Hellwig, cluster-devel, Jan Kara,
	Andreas Gruenbacher, Linux Kernel Mailing List, Paul Mackerras,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

On Thu, Oct 28, 2021 at 10:20:52PM +0100, Catalin Marinas wrote:
> On Wed, Oct 27, 2021 at 02:14:48PM -0700, Linus Torvalds wrote:
> > On Wed, Oct 27, 2021 at 12:13 PM Catalin Marinas
> > <catalin.marinas@arm.com> wrote:
> > > As an alternative, you mentioned earlier that a per-thread fault status
> > > was not feasible on x86 due to races. Was this only for the hw poison
> > > case? I think the uaccess is slightly different.
> > 
> > It's not x86-specific, it's very generic.
> > 
> > If we set some flag in the per-thread status, we'll need to be careful
> > about not overwriting it if we then have a subsequent NMI that _also_
> > takes a (completely unrelated) page fault - before we then read the
> > per-thread flag.
> > 
> > Think 'perf' and fetching backtraces etc.
> > 
> > Note that the NMI page fault can easily also be a pointer coloring
> > fault on arm64, for exactly the same reason that whatever original
> > copy_from_user() code was. So this is not a "oh, pointer coloring
> > faults are different". They have the same re-entrancy issue.
> > 
> > And both the "pagefault_disable" and "fault happens in interrupt
> > context" cases are also the exact same 'faulthandler_disabled()'
> > thing. So even at fault time they look very similar.
> 
> They do look fairly similar but we should have the information in the
> fault handler to distinguish: not a page fault (pte permission or p*d
> translation), in_task(), user address, fixup handler. But I agree the
> logic looks fragile.
> 
> I think for nested contexts we can save the uaccess fault state on
> exception entry, restore it on return. Or (needs some thinking on
> atomicity) save it in a local variable. The high-level API would look
> something like:
> 
> 	unsigned long uaccess_flags;	/* we could use TIF_ flags */
> 
> 	uaccess_flags = begin_retriable_uaccess();
> 	copied = copy_page_from_iter_atomic(...);
> 	retry = end_retriable_uaccess(uaccess_flags);

It doesn't work with local flags, so it would need to be saved on
exception entry (interrupt, breakpoint etc.) on the stack, restore on
return. But the API would return pretty close (and probably still more
complicated than copy_*() returning an error code).

-- 
Catalin

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
  2021-10-28 21:20                         ` Catalin Marinas
  2021-10-28 21:40                           ` Catalin Marinas
@ 2021-10-28 22:15                           ` Andreas Grünbacher
  2021-10-29 12:50                             ` Catalin Marinas
  2021-10-28 22:32                           ` Linus Torvalds
  2 siblings, 1 reply; 47+ messages in thread
From: Andreas Grünbacher @ 2021-10-28 22:15 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: kvm-ppc, Christoph Hellwig, cluster-devel, Jan Kara,
	Andreas Gruenbacher, Linux Kernel Mailing List, Paul Mackerras,
	Alexander Viro, linux-fsdevel, linux-btrfs, Linus Torvalds,
	ocfs2-devel

Am Do., 28. Okt. 2021 um 23:21 Uhr schrieb Catalin Marinas
<catalin.marinas@arm.com>:
> One last try on this path before I switch to the other options.
>
> On Wed, Oct 27, 2021 at 02:14:48PM -0700, Linus Torvalds wrote:
> > On Wed, Oct 27, 2021 at 12:13 PM Catalin Marinas
> > <catalin.marinas@arm.com> wrote:
> > > As an alternative, you mentioned earlier that a per-thread fault status
> > > was not feasible on x86 due to races. Was this only for the hw poison
> > > case? I think the uaccess is slightly different.
> >
> > It's not x86-specific, it's very generic.
> >
> > If we set some flag in the per-thread status, we'll need to be careful
> > about not overwriting it if we then have a subsequent NMI that _also_
> > takes a (completely unrelated) page fault - before we then read the
> > per-thread flag.
> >
> > Think 'perf' and fetching backtraces etc.
> >
> > Note that the NMI page fault can easily also be a pointer coloring
> > fault on arm64, for exactly the same reason that whatever original
> > copy_from_user() code was. So this is not a "oh, pointer coloring
> > faults are different". They have the same re-entrancy issue.
> >
> > And both the "pagefault_disable" and "fault happens in interrupt
> > context" cases are also the exact same 'faulthandler_disabled()'
> > thing. So even at fault time they look very similar.
>
> They do look fairly similar but we should have the information in the
> fault handler to distinguish: not a page fault (pte permission or p*d
> translation), in_task(), user address, fixup handler. But I agree the
> logic looks fragile.
>
> I think for nested contexts we can save the uaccess fault state on
> exception entry, restore it on return. Or (needs some thinking on
> atomicity) save it in a local variable. The high-level API would look
> something like:
>
>         unsigned long uaccess_flags;    /* we could use TIF_ flags */
>
>         uaccess_flags = begin_retriable_uaccess();
>         copied = copy_page_from_iter_atomic(...);
>         retry = end_retriable_uaccess(uaccess_flags);
>         ...
>
>         if (!retry)
>                 break;
>
> I think we'd need a TIF flag to mark the retriable region and another to
> track whether a non-recoverable fault occurred. It needs prototyping.
>
> Anyway, if you don't like this approach, I'll look at error codes being
> returned but rather than changing all copy_from_user() etc., introduce a
> new API that returns different error codes depending on the fault
> (e.g -EFAULT vs -EACCES). We already have copy_from_user_nofault(), we'd
> need something for the iov_iter stuff to use in the fs code.

We won't need any of that on the filesystem read and write paths. The
two cases there are buffered and direct I/O:

* In the buffered I/O case, the copying happens with page faults
disabled, at a byte granularity. If that returns a short result, we
need to enable page faults, check if the exact address that failed
still fails (in which case we have a sub-page fault),  fault in the
pages, disable page faults again, and repeat. No probing for sub-page
faults beyond the first byte of the fault-in address is needed.
Functions fault_in_{readable,writeable} implicitly have this behavior;
for fault_in_safe_writeable() the choice we have is to either add
probing of the first byte for sub-page faults to this function or
force callers to do that probing separately. At this point, I'd vote
for the former.

* In the direct I/O case, the copying happens while we're holding page
references, so the only page faults that can occur during copying are
sub-page faults. When iomap_dio_rw or its legacy counterpart is called
with page faults disabled, we need to make sure that the caller can
distinguish between page faults triggered during
bio_iov_iter_get_pages() and during the copying, but that's a separate
problem. (At the moment, when iomap_dio_rw fails with -EFAULT, the
caller *cannot* distinguish between a bio_iov_iter_get_pages failure
and a failure during synchronous copying, but that could be fixed by
returning unique error codes from iomap_dio_rw.)

So as far as I can see, the only problematic case we're left with is
copying bigger than byte-size chunks with page faults disabled when we
don't know whether the underlying pages are resident or not. My guess
would be that in this case, if the copying fails, it would be
perfectly acceptable to explicitly probe the entire chunk for sub-page
faults.

Thanks,
Andreas

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
  2021-10-28 21:20                         ` Catalin Marinas
  2021-10-28 21:40                           ` Catalin Marinas
  2021-10-28 22:15                           ` Andreas Grünbacher
@ 2021-10-28 22:32                           ` Linus Torvalds
  2021-10-29 17:50                             ` Catalin Marinas
  2 siblings, 1 reply; 47+ messages in thread
From: Linus Torvalds @ 2021-10-28 22:32 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: kvm-ppc, Christoph Hellwig, cluster-devel, Jan Kara,
	Andreas Gruenbacher, Linux Kernel Mailing List, Paul Mackerras,
	Alexander Viro, linux-fsdevel, linux-btrfs, ocfs2-devel

On Thu, Oct 28, 2021 at 2:21 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> They do look fairly similar but we should have the information in the
> fault handler to distinguish: not a page fault (pte permission or p*d
> translation), in_task(), user address, fixup handler. But I agree the
> logic looks fragile.

So thinking about this a bit more, I think I have a possible
suggestion for how to handle this..

The pointer color fault (or whatever some other architecture may do to
generate sub-page faults) is not only not recoverable in the sense
that we can't fix it up, it also ends up being a forced SIGSEGV (ie it
can't be blocked - it has to either be caught or cause the process to
be killed).

And the thing is, I think we could just make the rule be that kernel
code that has this kind of retry loop with fault_in_pages() would
force an EFAULT on a pending SIGSEGV.

IOW, the pending SIGSEGV could effectively be exactly that "thread flag".

And that means that fault_in_xyz() wouldn't need to worry about this
situation at all: the regular copy_from_user() (or whatever flavor it
is - to/from/iter/whatever) would take the fault. And if it's a
regular page fault,. it would act exactly like it does now, so no
changes.

If it's a sub-page fault, we'd just make the rule be that we send a
SIGSEGV even if the instruction in question has a user exception
fixup.

Then we just need to add the logic somewhere that does "if active
pending SIGSEGV, return -EFAULT".

Of course, that logic might be in fault_in_xyz(), but it migth also be
a separate function entirely.

So this does effectively end up being a thread flag, but it's also
slightly more than that - it's that a sub-page fault from kernel mode
has semantics that a regular page fault does not.

The whole "kernel access doesn't cause SIGSEGV, but returns -EFAULT
instead" has always been an odd and somewhat wrong-headed thing. Of
course it should cause a SIGSEGV, but that's not how Unix traditionall
worked. We would just say "color faults always raise a signal, even if
the color fault was triggered in a system call".

(And I didn't check: I say "SIGSEGV" above, but maybe the pointer
color faults are actually SIGBUS? Doesn't change the end result).

                 Linus

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
  2021-10-28 22:15                           ` Andreas Grünbacher
@ 2021-10-29 12:50                             ` Catalin Marinas
  0 siblings, 0 replies; 47+ messages in thread
From: Catalin Marinas @ 2021-10-29 12:50 UTC (permalink / raw)
  To: Andreas Grünbacher
  Cc: kvm-ppc, Christoph Hellwig, cluster-devel, Jan Kara,
	Andreas Gruenbacher, Linux Kernel Mailing List, Paul Mackerras,
	Alexander Viro, linux-fsdevel, linux-btrfs, Linus Torvalds,
	ocfs2-devel

On Fri, Oct 29, 2021 at 12:15:55AM +0200, Andreas Grünbacher wrote:
> Am Do., 28. Okt. 2021 um 23:21 Uhr schrieb Catalin Marinas
> <catalin.marinas@arm.com>:
> > I think for nested contexts we can save the uaccess fault state on
> > exception entry, restore it on return. Or (needs some thinking on
> > atomicity) save it in a local variable. The high-level API would look
> > something like:
> >
> >         unsigned long uaccess_flags;    /* we could use TIF_ flags */
> >
> >         uaccess_flags = begin_retriable_uaccess();
> >         copied = copy_page_from_iter_atomic(...);
> >         retry = end_retriable_uaccess(uaccess_flags);
> >         ...
> >
> >         if (!retry)
> >                 break;
> >
> > I think we'd need a TIF flag to mark the retriable region and another to
> > track whether a non-recoverable fault occurred. It needs prototyping.
> >
> > Anyway, if you don't like this approach, I'll look at error codes being
> > returned but rather than changing all copy_from_user() etc., introduce a
> > new API that returns different error codes depending on the fault
> > (e.g -EFAULT vs -EACCES). We already have copy_from_user_nofault(), we'd
> > need something for the iov_iter stuff to use in the fs code.
> 
> We won't need any of that on the filesystem read and write paths. The
> two cases there are buffered and direct I/O:

Thanks for the clarification, very useful.

> * In the buffered I/O case, the copying happens with page faults
> disabled, at a byte granularity. If that returns a short result, we
> need to enable page faults, check if the exact address that failed
> still fails (in which case we have a sub-page fault),  fault in the
> pages, disable page faults again, and repeat. No probing for sub-page
> faults beyond the first byte of the fault-in address is needed.
> Functions fault_in_{readable,writeable} implicitly have this behavior;
> for fault_in_safe_writeable() the choice we have is to either add
> probing of the first byte for sub-page faults to this function or
> force callers to do that probing separately. At this point, I'd vote
> for the former.

This sounds fine to me (and I have some draft patches already on top of
your series).

> * In the direct I/O case, the copying happens while we're holding page
> references, so the only page faults that can occur during copying are
> sub-page faults.

Does holding a page reference guarantee that the user pte pointing to
such page won't change, for example a pte_mkold()? I assume for direct
I/O, the PG_locked is not held. But see below, it may not be relevant.

> When iomap_dio_rw or its legacy counterpart is called
> with page faults disabled, we need to make sure that the caller can
> distinguish between page faults triggered during
> bio_iov_iter_get_pages() and during the copying, but that's a separate
> problem. (At the moment, when iomap_dio_rw fails with -EFAULT, the
> caller *cannot* distinguish between a bio_iov_iter_get_pages failure
> and a failure during synchronous copying, but that could be fixed by
> returning unique error codes from iomap_dio_rw.)

Since the direct I/O pins the pages in memory, does it even need to do a
uaccess? It could copy the data via the kernel mapping (kmap). For arm64
MTE, all such accesses are not checked (they use a match-all pointer
tag) since the kernel is not set up to handle such sub-page faults (no
copy_from/to_user but a direct access).

> So as far as I can see, the only problematic case we're left with is
> copying bigger than byte-size chunks with page faults disabled when we
> don't know whether the underlying pages are resident or not. My guess
> would be that in this case, if the copying fails, it would be
> perfectly acceptable to explicitly probe the entire chunk for sub-page
> faults.

Yeah, if there are only a couple of places left, we can add the explicit
probing (via some probe_user_writable function).

-- 
Catalin

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
  2021-10-28 22:32                           ` Linus Torvalds
@ 2021-10-29 17:50                             ` Catalin Marinas
  2021-10-29 18:47                               ` Linus Torvalds
  0 siblings, 1 reply; 47+ messages in thread
From: Catalin Marinas @ 2021-10-29 17:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kvm-ppc, Christoph Hellwig, cluster-devel, Jan Kara,
	Andreas Gruenbacher, Linux Kernel Mailing List, Paul Mackerras,
	Tony Luck, Alexander Viro, Andy Lutomirski, linux-fsdevel,
	linux-btrfs, ocfs2-devel

On Thu, Oct 28, 2021 at 03:32:23PM -0700, Linus Torvalds wrote:
> The pointer color fault (or whatever some other architecture may do to
> generate sub-page faults) is not only not recoverable in the sense
> that we can't fix it up, it also ends up being a forced SIGSEGV (ie it
> can't be blocked - it has to either be caught or cause the process to
> be killed).
> 
> And the thing is, I think we could just make the rule be that kernel
> code that has this kind of retry loop with fault_in_pages() would
> force an EFAULT on a pending SIGSEGV.
> 
> IOW, the pending SIGSEGV could effectively be exactly that "thread flag".
> 
> And that means that fault_in_xyz() wouldn't need to worry about this
> situation at all: the regular copy_from_user() (or whatever flavor it
> is - to/from/iter/whatever) would take the fault. And if it's a
> regular page fault,. it would act exactly like it does now, so no
> changes.
> 
> If it's a sub-page fault, we'd just make the rule be that we send a
> SIGSEGV even if the instruction in question has a user exception
> fixup.
> 
> Then we just need to add the logic somewhere that does "if active
> pending SIGSEGV, return -EFAULT".
> 
> Of course, that logic might be in fault_in_xyz(), but it migth also be
> a separate function entirely.
> 
> So this does effectively end up being a thread flag, but it's also
> slightly more than that - it's that a sub-page fault from kernel mode
> has semantics that a regular page fault does not.
> 
> The whole "kernel access doesn't cause SIGSEGV, but returns -EFAULT
> instead" has always been an odd and somewhat wrong-headed thing. Of
> course it should cause a SIGSEGV, but that's not how Unix traditionall
> worked. We would just say "color faults always raise a signal, even if
> the color fault was triggered in a system call".

It's doable and, at least for MTE, people have asked for a signal even
when the fault was caused by a kernel uaccess. But there are some
potentially confusing aspects to sort out:

First of all, a uaccess in interrupt should not force such signal as it
had nothing to do with the interrupted context. I guess we can do an
in_task() check in the fault handler.

Second, is there a chance that we enter the fault-in loop with a SIGSEGV
already pending? Maybe it's not a problem, we just bail out of the loop
early and deliver the signal, though unrelated to the actual uaccess in
the loop.

Third is the sigcontext.pc presented to the signal handler. Normally for
SIGSEGV it points to the address of a load/store instruction and a
handler could disable MTE and restart from that point. With a syscall we
don't want it to point to the syscall place as it shouldn't be restarted
in case it copied something. Pointing it to the next instruction after
syscall is backwards-compatible but it may confuse the handler (if it
does some reporting). I think we need add a new si_code that describes a
fault in kernel mode to differentiate from the genuine user access.

There was a discussion back in August on infinite loops with hwpoison
and Tony said that Andy convinced him that the kernel should not send a
SIGBUS for uaccess:

https://lore.kernel.org/linux-edac/20210823152437.GA1637466@agluck-desk2.amr.corp.intel.com/

I personally like the approach of a SIG{SEGV,BUS} on uaccess and I don't
think the ABI change is significant but ideally we should have a unified
approach that's not just for MTE.

Adding Andy and Tony (the background is potentially infinite loops with
faults at sub-page granularity: arm64 MTE, hwpoison, sparc ADI).

Thanks.

-- 
Catalin

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
  2021-10-29 17:50                             ` Catalin Marinas
@ 2021-10-29 18:47                               ` Linus Torvalds
  0 siblings, 0 replies; 47+ messages in thread
From: Linus Torvalds @ 2021-10-29 18:47 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: kvm-ppc, Christoph Hellwig, cluster-devel, Jan Kara,
	Andreas Gruenbacher, Linux Kernel Mailing List, Paul Mackerras,
	Tony Luck, Alexander Viro, Andy Lutomirski, linux-fsdevel,
	linux-btrfs, ocfs2-devel

On Fri, Oct 29, 2021 at 10:50 AM Catalin Marinas
<catalin.marinas@arm.com> wrote:
>
> First of all, a uaccess in interrupt should not force such signal as it
> had nothing to do with the interrupted context. I guess we can do an
> in_task() check in the fault handler.

Yeah. It ends up being similar to the thread flag in that you still
end up having to protect against NMI and other users of asynchronous
page faults.

So the suggestion was more of a "mindset" difference and modified
version of the task flag rather than anything fundamentally different.

> Second, is there a chance that we enter the fault-in loop with a SIGSEGV
> already pending? Maybe it's not a problem, we just bail out of the loop
> early and deliver the signal, though unrelated to the actual uaccess in
> the loop.

If we ever run in user space with a pending per-thread SIGSEGV, that
would already be a fairly bad bug. The intent of "force_sig()" is not
only to make sure you can't block the signal, but also that it targets
the particular thread that caused the problem: unlike other random
"send signal to process", a SIGSEGV caused by a bad memory access is
really local to that _thread_, not the signal thread group.

So somebody else sending a SIGSEGV asynchronsly is actually very
different - it goes to the thread group (although you can specify
individual threads too - but once you do that you're already outside
of POSIX).

That said, the more I look at it, the more I think I was wrong. I
think the "we have a SIGSEGV pending" could act as the per-thread
flag, but the complexity of the signal handling is probably an
argument against it.

Not because a SIGSEGV could already be pending, but because so many
other situations could be pending.

In particular, the signal code won't send new signals to a thread if
that thread group is already exiting. So another thread may have
already started the exit and core dump sequence, and is in the process
of killing the shared signal threads, and if one of those threads is
now in the kernel and goes through the copy_from_user() dance, that
whole "thread group is exiting" will mean that the signal code won't
add a new SIGSEGV to the queue.

So the signal could conceptually be used as the flag to stop looping,
but it ends up being such a complicated flag that I think it's
probably not worth it after all. Even if it semantically would be
fairly nice to use pre-existing machinery.

Could it be worked around? Sure. That kernel loop probably has to
check for fatal_signal_pending() anyway, so it would all work even in
the presense of the above kinds of issues. But just the fact that I
went and looked at just how exciting the signal code is made me think
"ok, conceptually nice, but we take a lot of locks and we do a lot of
special things even in the 'simple' force_sig() case".

> Third is the sigcontext.pc presented to the signal handler. Normally for
> SIGSEGV it points to the address of a load/store instruction and a
> handler could disable MTE and restart from that point. With a syscall we
> don't want it to point to the syscall place as it shouldn't be restarted
> in case it copied something.

I think this is actually independent of the whole "how to return
errors". We'll still need to return an error from the system call,
even if we also have a signal pending.

                  Linus

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2021-10-29 18:47 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-19 13:41 [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 01/17] iov_iter: Fix iov_iter_get_pages{, _alloc} page fault return value Andreas Gruenbacher
2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 02/17] powerpc/kvm: Fix kvm_use_magic_page Andreas Gruenbacher
2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 03/17] gup: Turn fault_in_pages_{readable, writeable} into fault_in_{readable, writeable} Andreas Gruenbacher
2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 04/17] iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable Andreas Gruenbacher
2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 05/17] iov_iter: Introduce fault_in_iov_iter_writeable Andreas Gruenbacher
2021-10-20 16:25   ` Catalin Marinas
2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 06/17] gfs2: Add wrapper for iomap_file_buffered_write Andreas Gruenbacher
2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 07/17] gfs2: Clean up function may_grant Andreas Gruenbacher
2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 08/17] gfs2: Introduce flag for glock holder auto-demotion Andreas Gruenbacher
2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 09/17] gfs2: Move the inode glock locking to gfs2_file_buffered_write Andreas Gruenbacher
2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 10/17] gfs2: Eliminate ip->i_gh Andreas Gruenbacher
2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 11/17] gfs2: Fix mmap + page fault deadlocks for buffered I/O Andreas Gruenbacher
2021-10-19 13:41 ` [Ocfs2-devel] [PATCH v8 12/17] iomap: Fix iomap_dio_rw return value for user copies Andreas Gruenbacher
2021-10-19 13:42 ` [Ocfs2-devel] [PATCH v8 13/17] iomap: Support partial direct I/O on user copy failures Andreas Gruenbacher
2021-10-19 13:42 ` [Ocfs2-devel] [PATCH v8 14/17] iomap: Add done_before argument to iomap_dio_rw Andreas Gruenbacher
2021-10-19 15:51   ` Darrick J. Wong
2021-10-19 19:30     ` Andreas Gruenbacher
2021-10-20  1:57       ` Darrick J. Wong
2021-10-19 13:42 ` [Ocfs2-devel] [PATCH v8 15/17] gup: Introduce FOLL_NOFAULT flag to disable page faults Andreas Gruenbacher
2021-10-19 13:42 ` [Ocfs2-devel] [PATCH v8 16/17] iov_iter: Introduce nofault " Andreas Gruenbacher
2021-10-19 13:42 ` [Ocfs2-devel] [PATCH v8 17/17] gfs2: Fix mmap + page fault deadlocks for direct I/O Andreas Gruenbacher
2021-10-19 15:40 ` [Ocfs2-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks Linus Torvalds
2021-10-19 16:00   ` Bob Peterson
2021-10-20 16:36   ` Catalin Marinas
2021-10-20 20:11     ` Linus Torvalds
2021-10-20 22:44       ` Catalin Marinas
2021-10-21  6:19         ` Linus Torvalds
2021-10-22 18:06           ` Catalin Marinas
2021-10-22 19:22             ` Linus Torvalds
2021-10-25 19:00               ` Andreas Gruenbacher
2021-10-26 18:24                 ` Catalin Marinas
2021-10-26 18:50                   ` Linus Torvalds
2021-10-26 19:18                     ` Linus Torvalds
2021-10-27 19:13                     ` Catalin Marinas
2021-10-27 21:14                       ` Linus Torvalds
2021-10-28 21:20                         ` Catalin Marinas
2021-10-28 21:40                           ` Catalin Marinas
2021-10-28 22:15                           ` Andreas Grünbacher
2021-10-29 12:50                             ` Catalin Marinas
2021-10-28 22:32                           ` Linus Torvalds
2021-10-29 17:50                             ` Catalin Marinas
2021-10-29 18:47                               ` Linus Torvalds
2021-10-25 18:24             ` Andreas Gruenbacher
2021-10-26  5:12               ` Theodore Ts'o
2021-10-26  9:44               ` Andreas Gruenbacher
2021-10-27 21:21               ` Andreas Gruenbacher

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).