* [Ocfs2-devel] [PATCH v6 00/19] gfs2: Fix mmap + page fault deadlocks
@ 2021-08-19 19:40 Andreas Gruenbacher
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 01/19] iov_iter: Fix iov_iter_get_pages{, _alloc} page fault return value Andreas Gruenbacher
                   ` (18 more replies)
  0 siblings, 19 replies; 33+ messages in thread
From: Andreas Gruenbacher @ 2021-08-19 19:40 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig,
	Darrick J. Wong, Paul Mackerras
  Cc: Jan Kara, Andreas Gruenbacher, linux-kernel, cluster-devel,
	kvm-ppc, linux-fsdevel, ocfs2-devel

Hi all,

here's another update on top of v5.14-rc6.  Changes:

 * Per request from Linus, change fault_in_{readable,writeable} to
   return the number of bytes *not* faulted in, like copy_to_user() and
   copy_from_user() do.  Convert fault_in_iov_iter_readable and
   fault_in_iov_iter_writeable to the same semantics.

 * Per suggestion from Linus, introduce a new FOLL_NOFAULT flag to
   prevent get_user_pages from faulting in pages.  This is similar to
   FOLL_FAST_ONLY, but less fragile and available on all architectures.
   Use that for turning off page faults during iov_iter_get_pages() and
   iov_iter_get_pages_alloc().

 * Introduce a new HIF_MAY_DEMOTE flag that allows a glock to be taken
   away from a holder when a conflicting locking request comes in.  This
   allows glock holders to hang on to glocks as long as no conflicting
   locking requests occur.  This avoids returning short reads and writes
   when pages need to be faulted in.

 * Limit the number of pages that are faulted in at once to a more
   sensible size instead of faulting in all pages at once.  When
   faulting in pages doesn't lead to success, fault in a single page
   in the next attempt.  When that still doesn't succeed, give up.
   This should prevent endless loops when fault_in_iov_iter_*() and
   bio_iov_iter_get_pages() disagree.  (A rough sketch of this retry
   pattern follows after this list.)

 * It turns out that taking the inode glock in gfs2_write_lock and
   releasing it in gfs2_write_unlock was entirely pointless, so move
   the locking into gfs2_file_buffered_write instead.  This then also
   allows us to eliminate ip->i_gh.
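
Here is a very rough sketch of the retry pattern the series establishes
(simplified, not actual gfs2 code; do_write_nofault() is a made-up helper
standing in for the no-fault write path).  The write is attempted with
page faults disabled; on -EFAULT, a window of the remaining buffer is
faulted in and the write is retried, falling back to a single page before
giving up:

static ssize_t write_with_fault_in_retry(struct kiocb *iocb,
					 struct iov_iter *from)
{
	size_t window = SZ_1M;	/* arbitrary "sensible" fault-in window */
	ssize_t ret;

	for (;;) {
		ret = do_write_nofault(iocb, from);	/* hypothetical */
		if (ret != -EFAULT)
			break;
		window = min_t(size_t, window, iov_iter_count(from));
		/* fault_in_iov_iter_readable returns the bytes NOT faulted in */
		if (fault_in_iov_iter_readable(from, window) == window)
			break;		/* nothing could be faulted in: give up */
		window = PAGE_SIZE;	/* be less ambitious on the next retry */
	}
	return ret;
}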


This iteration fixes the issues with fstest generic/208.


For immediate consideration by Al Viro:

  iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value


For immediate consideration by Paul Mackerras:

  powerpc/kvm: Fix kvm_use_magic_page


Thanks,
Andreas

Andreas Gruenbacher (16):
  iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value
  powerpc/kvm: Fix kvm_use_magic_page
  Turn fault_in_pages_{readable,writeable} into
    fault_in_{readable,writeable}
  Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable
  iov_iter: Introduce fault_in_iov_iter_writeable
  gfs2: Add wrapper for iomap_file_buffered_write
  gfs2: Clean up function may_grant
  gfs2: Move the inode glock locking to gfs2_file_buffered_write
  gfs2: Fix mmap + page fault deadlocks for buffered I/O
  iomap: Fix iomap_dio_rw return value for user copies
  iomap: Support partial direct I/O on user copy failures
  iomap: Add done_before argument to iomap_dio_rw
  gup: Introduce FOLL_NOFAULT flag to disable page faults
  iov_iter: Introduce nofault flag to disable page faults
  gfs2: Fix mmap + page fault deadlocks for direct I/O
  gfs2: Eliminate ip->i_gh

Bob Peterson (3):
  gfs2: Eliminate vestigial HIF_FIRST
  gfs2: Remove redundant check from gfs2_glock_dq
  gfs2: Introduce flag for glock holder auto-demotion

 arch/powerpc/kernel/kvm.c           |   3 +-
 arch/powerpc/kernel/signal_32.c     |   4 +-
 arch/powerpc/kernel/signal_64.c     |   2 +-
 arch/x86/kernel/fpu/signal.c        |   7 +-
 drivers/gpu/drm/armada/armada_gem.c |   7 +-
 fs/btrfs/file.c                     |   7 +-
 fs/btrfs/ioctl.c                    |   5 +-
 fs/ext4/file.c                      |   5 +-
 fs/f2fs/file.c                      |   2 +-
 fs/fuse/file.c                      |   2 +-
 fs/gfs2/bmap.c                      |  60 +----
 fs/gfs2/file.c                      | 244 ++++++++++++++++++--
 fs/gfs2/glock.c                     | 340 +++++++++++++++++++++-------
 fs/gfs2/glock.h                     |  20 ++
 fs/gfs2/incore.h                    |   5 +-
 fs/iomap/buffered-io.c              |   2 +-
 fs/iomap/direct-io.c                |  21 +-
 fs/ntfs/file.c                      |   2 +-
 fs/xfs/xfs_file.c                   |   6 +-
 fs/zonefs/super.c                   |   4 +-
 include/linux/iomap.h               |  11 +-
 include/linux/mm.h                  |   3 +-
 include/linux/pagemap.h             |  58 +----
 include/linux/uio.h                 |   4 +-
 lib/iov_iter.c                      | 103 +++++++--
 mm/filemap.c                        |   4 +-
 mm/gup.c                            | 139 +++++++++++-
 27 files changed, 784 insertions(+), 286 deletions(-)

-- 
2.26.3



* [Ocfs2-devel] [PATCH v6 01/19] iov_iter: Fix iov_iter_get_pages{, _alloc} page fault return value
  2021-08-19 19:40 [Ocfs2-devel] [PATCH v6 00/19] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
@ 2021-08-19 19:40 ` Andreas Gruenbacher
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 02/19] powerpc/kvm: Fix kvm_use_magic_page Andreas Gruenbacher
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Andreas Gruenbacher @ 2021-08-19 19:40 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Andreas Gruenbacher, linux-kernel, cluster-devel,
	linux-fsdevel, ocfs2-devel

Both iov_iter_get_pages and iov_iter_get_pages_alloc return the number
of bytes of the iovec they could get the pages for.  When they cannot
get any pages, they're supposed to return 0, but when the start of the
iovec isn't page aligned, the calculation goes wrong and they return a
negative value.  Fix both functions.

In addition, change iov_iter_get_pages_alloc to set *pages to NULL in
that case to prevent resource leaks.
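
Callers are then expected to treat the return value along these lines
(illustration only, similar to what __bio_iov_iter_get_pages already does;
grab_pages() is a wrapper made up for this example):

static ssize_t grab_pages(struct iov_iter *i, struct page **pages,
			  size_t maxsize, unsigned int maxpages,
			  size_t *start)
{
	ssize_t n = iov_iter_get_pages(i, pages, maxsize, maxpages, start);

	/* With this fix, 0 reliably means "no pages"; map it to -EFAULT. */
	if (n <= 0)
		return n ? n : -EFAULT;
	return n;
}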

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 lib/iov_iter.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index e23123ae3a13..25dfc48536d7 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1484,7 +1484,7 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 		res = get_user_pages_fast(addr, n,
 				iov_iter_rw(i) != WRITE ?  FOLL_WRITE : 0,
 				pages);
-		if (unlikely(res < 0))
+		if (unlikely(res <= 0))
 			return res;
 		return (res == n ? len : res * PAGE_SIZE) - *start;
 	}
@@ -1608,8 +1608,9 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 			return -ENOMEM;
 		res = get_user_pages_fast(addr, n,
 				iov_iter_rw(i) != WRITE ?  FOLL_WRITE : 0, p);
-		if (unlikely(res < 0)) {
+		if (unlikely(res <= 0)) {
 			kvfree(p);
+			*pages = NULL;
 			return res;
 		}
 		*pages = p;
-- 
2.26.3



* [Ocfs2-devel] [PATCH v6 02/19] powerpc/kvm: Fix kvm_use_magic_page
  2021-08-19 19:40 [Ocfs2-devel] [PATCH v6 00/19] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 01/19] iov_iter: Fix iov_iter_get_pages{, _alloc} page fault return value Andreas Gruenbacher
@ 2021-08-19 19:40 ` Andreas Gruenbacher
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 03/19] Turn fault_in_pages_{readable, writeable} into fault_in_{readable, writeable} Andreas Gruenbacher
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Andreas Gruenbacher @ 2021-08-19 19:40 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig,
	Darrick J. Wong, Paul Mackerras
  Cc: Jan Kara, Andreas Gruenbacher, linux-kernel, cluster-devel,
	kvm-ppc, linux-fsdevel, stable, ocfs2-devel

When switching from __get_user to fault_in_pages_readable, commit
9f9eae5ce717 broke kvm_use_magic_page: fault_in_pages_readable returns 0
on success, so the self-test check ended up inverted.  Fix that.

Fixes: 9f9eae5ce717 ("powerpc/kvm: Prefer fault_in_pages_readable function")
Cc: stable@vger.kernel.org # v4.18+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 arch/powerpc/kernel/kvm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c
index 617eba82531c..d89cf802d9aa 100644
--- a/arch/powerpc/kernel/kvm.c
+++ b/arch/powerpc/kernel/kvm.c
@@ -669,7 +669,7 @@ static void __init kvm_use_magic_page(void)
 	on_each_cpu(kvm_map_magic_page, &features, 1);
 
 	/* Quick self-test to see if the mapping works */
-	if (!fault_in_pages_readable((const char *)KVM_MAGIC_PAGE, sizeof(u32))) {
+	if (fault_in_pages_readable((const char *)KVM_MAGIC_PAGE, sizeof(u32))) {
 		kvm_patching_worked = false;
 		return;
 	}
-- 
2.26.3



* [Ocfs2-devel] [PATCH v6 03/19] Turn fault_in_pages_{readable, writeable} into fault_in_{readable, writeable}
  2021-08-19 19:40 [Ocfs2-devel] [PATCH v6 00/19] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 01/19] iov_iter: Fix iov_iter_get_pages{, _alloc} page fault return value Andreas Gruenbacher
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 02/19] powerpc/kvm: Fix kvm_use_magic_page Andreas Gruenbacher
@ 2021-08-19 19:40 ` Andreas Gruenbacher
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 04/19] Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable Andreas Gruenbacher
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Andreas Gruenbacher @ 2021-08-19 19:40 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Andreas Gruenbacher, linux-kernel, cluster-devel,
	linux-fsdevel, ocfs2-devel

Turn fault_in_pages_{readable,writeable} into versions that return the
number of bytes not faulted in (similar to copy_to_user) instead of
returning a non-zero value when any of the requested pages couldn't be
faulted in.  This supports the existing users that require all pages to
be faulted in as well as new users that are happy if any pages can be
faulted in.
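
For illustration only (not part of the patch), this is how the two kinds
of callers look with the new return convention; copy_all() and
copy_prefix() are hypothetical helpers standing in for the actual work:

static int need_it_all(char __user *ubuf, size_t size)
{
	/* Existing users: any nonzero return still means failure. */
	if (fault_in_writeable(ubuf, size))
		return -EFAULT;
	return copy_all(ubuf, size);
}

static int take_what_we_can(const char __user *ubuf, size_t size)
{
	/* New users: process whatever prefix could be faulted in. */
	size_t not_faulted = fault_in_readable(ubuf, size);

	if (not_faulted == size)
		return -EFAULT;
	return copy_prefix(ubuf, size - not_faulted);
}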

Neither of these functions is entirely trivial and it doesn't seem
useful to inline them, so move them to mm/gup.c.

Rename the functions to fault_in_{readable,writeable} to make sure that
code that uses them can be fixed instead of breaking silently.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 arch/powerpc/kernel/kvm.c           |  3 +-
 arch/powerpc/kernel/signal_32.c     |  4 +-
 arch/powerpc/kernel/signal_64.c     |  2 +-
 arch/x86/kernel/fpu/signal.c        |  7 ++-
 drivers/gpu/drm/armada/armada_gem.c |  7 ++-
 fs/btrfs/ioctl.c                    |  5 +-
 include/linux/pagemap.h             | 57 ++---------------------
 lib/iov_iter.c                      | 10 ++--
 mm/filemap.c                        |  2 +-
 mm/gup.c                            | 72 +++++++++++++++++++++++++++++
 10 files changed, 93 insertions(+), 76 deletions(-)

diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c
index d89cf802d9aa..6568823cf306 100644
--- a/arch/powerpc/kernel/kvm.c
+++ b/arch/powerpc/kernel/kvm.c
@@ -669,7 +669,8 @@ static void __init kvm_use_magic_page(void)
 	on_each_cpu(kvm_map_magic_page, &features, 1);
 
 	/* Quick self-test to see if the mapping works */
-	if (fault_in_pages_readable((const char *)KVM_MAGIC_PAGE, sizeof(u32))) {
+	if (fault_in_readable((const char __user *)KVM_MAGIC_PAGE,
+			      sizeof(u32))) {
 		kvm_patching_worked = false;
 		return;
 	}
diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index 0608581967f0..38c3eae40c14 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -1048,7 +1048,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx,
 	if (new_ctx == NULL)
 		return 0;
 	if (!access_ok(new_ctx, ctx_size) ||
-	    fault_in_pages_readable((u8 __user *)new_ctx, ctx_size))
+	    fault_in_readable((char __user *)new_ctx, ctx_size))
 		return -EFAULT;
 
 	/*
@@ -1237,7 +1237,7 @@ SYSCALL_DEFINE3(debug_setcontext, struct ucontext __user *, ctx,
 #endif
 
 	if (!access_ok(ctx, sizeof(*ctx)) ||
-	    fault_in_pages_readable((u8 __user *)ctx, sizeof(*ctx)))
+	    fault_in_readable((char __user *)ctx, sizeof(*ctx)))
 		return -EFAULT;
 
 	/*
diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 1831bba0582e..9f471b4a11e3 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -688,7 +688,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx,
 	if (new_ctx == NULL)
 		return 0;
 	if (!access_ok(new_ctx, ctx_size) ||
-	    fault_in_pages_readable((u8 __user *)new_ctx, ctx_size))
+	    fault_in_readable((char __user *)new_ctx, ctx_size))
 		return -EFAULT;
 
 	/*
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index 445c57c9c539..ba6bdec81603 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -205,7 +205,7 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size)
 	fpregs_unlock();
 
 	if (ret) {
-		if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
+		if (!fault_in_writeable(buf_fx, fpu_user_xstate_size))
 			goto retry;
 		return -EFAULT;
 	}
@@ -278,10 +278,9 @@ static int restore_fpregs_from_user(void __user *buf, u64 xrestore,
 		if (ret != -EFAULT)
 			return -EINVAL;
 
-		ret = fault_in_pages_readable(buf, size);
-		if (!ret)
+		if (!fault_in_readable(buf, size))
 			goto retry;
-		return ret;
+		return -EFAULT;
 	}
 
 	/*
diff --git a/drivers/gpu/drm/armada/armada_gem.c b/drivers/gpu/drm/armada/armada_gem.c
index 21909642ee4c..8fbb25913327 100644
--- a/drivers/gpu/drm/armada/armada_gem.c
+++ b/drivers/gpu/drm/armada/armada_gem.c
@@ -336,7 +336,7 @@ int armada_gem_pwrite_ioctl(struct drm_device *dev, void *data,
 	struct drm_armada_gem_pwrite *args = data;
 	struct armada_gem_object *dobj;
 	char __user *ptr;
-	int ret;
+	int ret = 0;
 
 	DRM_DEBUG_DRIVER("handle %u off %u size %u ptr 0x%llx\n",
 		args->handle, args->offset, args->size, args->ptr);
@@ -349,9 +349,8 @@ int armada_gem_pwrite_ioctl(struct drm_device *dev, void *data,
 	if (!access_ok(ptr, args->size))
 		return -EFAULT;
 
-	ret = fault_in_pages_readable(ptr, args->size);
-	if (ret)
-		return ret;
+	if (fault_in_readable(ptr, args->size))
+		return -EFAULT;
 
 	dobj = armada_gem_object_lookup(file, args->handle);
 	if (dobj == NULL)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0ba98e08a029..9233ecc31e2e 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2244,9 +2244,8 @@ static noinline int search_ioctl(struct inode *inode,
 	key.offset = sk->min_offset;
 
 	while (1) {
-		ret = fault_in_pages_writeable(ubuf + sk_offset,
-					       *buf_size - sk_offset);
-		if (ret)
+		ret = -EFAULT;
+		if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset))
 			break;
 
 		ret = btrfs_search_forward(root, &key, path, sk->min_transid);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index ed02aa522263..7c9edc9694d9 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -734,61 +734,10 @@ int wait_on_page_private_2_killable(struct page *page);
 extern void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter);
 
 /*
- * Fault everything in given userspace address range in.
+ * Fault in userspace address range.
  */
-static inline int fault_in_pages_writeable(char __user *uaddr, int size)
-{
-	char __user *end = uaddr + size - 1;
-
-	if (unlikely(size == 0))
-		return 0;
-
-	if (unlikely(uaddr > end))
-		return -EFAULT;
-	/*
-	 * Writing zeroes into userspace here is OK, because we know that if
-	 * the zero gets there, we'll be overwriting it.
-	 */
-	do {
-		if (unlikely(__put_user(0, uaddr) != 0))
-			return -EFAULT;
-		uaddr += PAGE_SIZE;
-	} while (uaddr <= end);
-
-	/* Check whether the range spilled into the next page. */
-	if (((unsigned long)uaddr & PAGE_MASK) ==
-			((unsigned long)end & PAGE_MASK))
-		return __put_user(0, end);
-
-	return 0;
-}
-
-static inline int fault_in_pages_readable(const char __user *uaddr, int size)
-{
-	volatile char c;
-	const char __user *end = uaddr + size - 1;
-
-	if (unlikely(size == 0))
-		return 0;
-
-	if (unlikely(uaddr > end))
-		return -EFAULT;
-
-	do {
-		if (unlikely(__get_user(c, uaddr) != 0))
-			return -EFAULT;
-		uaddr += PAGE_SIZE;
-	} while (uaddr <= end);
-
-	/* Check whether the range spilled into the next page. */
-	if (((unsigned long)uaddr & PAGE_MASK) ==
-			((unsigned long)end & PAGE_MASK)) {
-		return __get_user(c, end);
-	}
-
-	(void)c;
-	return 0;
-}
+size_t fault_in_writeable(char __user *uaddr, size_t size);
+size_t fault_in_readable(const char __user *uaddr, size_t size);
 
 int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 25dfc48536d7..069cedd9d7b4 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -191,7 +191,7 @@ static size_t copy_page_to_iter_iovec(struct page *page, size_t offset, size_t b
 	buf = iov->iov_base + skip;
 	copy = min(bytes, iov->iov_len - skip);
 
-	if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_pages_writeable(buf, copy)) {
+	if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_writeable(buf, copy)) {
 		kaddr = kmap_atomic(page);
 		from = kaddr + offset;
 
@@ -275,7 +275,7 @@ static size_t copy_page_from_iter_iovec(struct page *page, size_t offset, size_t
 	buf = iov->iov_base + skip;
 	copy = min(bytes, iov->iov_len - skip);
 
-	if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_pages_readable(buf, copy)) {
+	if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_readable(buf, copy)) {
 		kaddr = kmap_atomic(page);
 		to = kaddr + offset;
 
@@ -446,13 +446,11 @@ int iov_iter_fault_in_readable(const struct iov_iter *i, size_t bytes)
 			bytes = i->count;
 		for (p = i->iov, skip = i->iov_offset; bytes; p++, skip = 0) {
 			size_t len = min(bytes, p->iov_len - skip);
-			int err;
 
 			if (unlikely(!len))
 				continue;
-			err = fault_in_pages_readable(p->iov_base + skip, len);
-			if (unlikely(err))
-				return err;
+			if (fault_in_readable(p->iov_base + skip, len))
+				return -EFAULT;
 			bytes -= len;
 		}
 	}
diff --git a/mm/filemap.c b/mm/filemap.c
index d1458ecf2f51..4dec3bc7752e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -88,7 +88,7 @@
  *    ->lock_page		(access_process_vm)
  *
  *  ->i_mutex			(generic_perform_write)
- *    ->mmap_lock		(fault_in_pages_readable->do_page_fault)
+ *    ->mmap_lock		(fault_in_readable->do_page_fault)
  *
  *  bdi->wb.list_lock
  *    sb_lock			(fs/fs-writeback.c)
diff --git a/mm/gup.c b/mm/gup.c
index b94717977d17..0cf47955e5a1 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1672,6 +1672,78 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
 }
 #endif /* !CONFIG_MMU */
 
+/**
+ * fault_in_writeable - fault in userspace address range for writing
+ * @uaddr: start of address range
+ * @size: size of address range
+ *
+ * Returns the number of bytes not faulted in (like copy_to_user() and
+ * copy_from_user()).
+ */
+size_t fault_in_writeable(char __user *uaddr, size_t size)
+{
+	char __user *start = uaddr, *end;
+
+	if (unlikely(size == 0))
+		return 0;
+	if (!PAGE_ALIGNED(uaddr)) {
+		if (unlikely(__put_user(0, uaddr) != 0))
+			return size;
+		uaddr = (char __user *)PAGE_ALIGN((unsigned long)uaddr);
+	}
+	end = (char __user *)PAGE_ALIGN((unsigned long)start + size);
+	if (unlikely(end < start))
+		end = NULL;
+	while (uaddr != end) {
+		if (unlikely(__put_user(0, uaddr) != 0))
+			goto out;
+		uaddr += PAGE_SIZE;
+	}
+
+out:
+	if (size > uaddr - start)
+		return size - (uaddr - start);
+	return 0;
+}
+EXPORT_SYMBOL(fault_in_writeable);
+
+/**
+ * fault_in_readable - fault in userspace address range for reading
+ * @uaddr: start of user address range
+ * @size: size of user address range
+ *
+ * Returns the number of bytes not faulted in (like copy_to_user() and
+ * copy_from_user()).
+ */
+size_t fault_in_readable(const char __user *uaddr, size_t size)
+{
+	const char __user *start = uaddr, *end;
+	volatile char c;
+
+	if (unlikely(size == 0))
+		return 0;
+	if (!PAGE_ALIGNED(uaddr)) {
+		if (unlikely(__get_user(c, uaddr) != 0))
+			return size;
+		uaddr = (const char __user *)PAGE_ALIGN((unsigned long)uaddr);
+	}
+	end = (const char __user *)PAGE_ALIGN((unsigned long)start + size);
+	if (unlikely(end < start))
+		end = NULL;
+	while (uaddr != end) {
+		if (unlikely(__get_user(c, uaddr) != 0))
+			goto out;
+		uaddr += PAGE_SIZE;
+	}
+
+out:
+	(void)c;
+	if (size > uaddr - start)
+		return size - (uaddr - start);
+	return 0;
+}
+EXPORT_SYMBOL(fault_in_readable);
+
 /**
  * get_dump_page() - pin user page in memory while writing it to core dump
  * @addr: user address
-- 
2.26.3



* [Ocfs2-devel] [PATCH v6 04/19] Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable
  2021-08-19 19:40 [Ocfs2-devel] [PATCH v6 00/19] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (2 preceding siblings ...)
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 03/19] Turn fault_in_pages_{readable, writeable} into fault_in_{readable, writeable} Andreas Gruenbacher
@ 2021-08-19 19:40 ` Andreas Gruenbacher
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 05/19] iov_iter: Introduce fault_in_iov_iter_writeable Andreas Gruenbacher
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Andreas Gruenbacher @ 2021-08-19 19:40 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Andreas Gruenbacher, linux-kernel, cluster-devel,
	linux-fsdevel, ocfs2-devel

Turn iov_iter_fault_in_readable into a function that returns the number
of bytes not faulted in (similar to copy_to_user) instead of returning a
non-zero value when any of the requested pages couldn't be faulted in.
This supports the existing users that require all pages to be faulted
in, but also new users that are happy if any pages can be faulted in.

Rename iov_iter_fault_in_readable to fault_in_iov_iter_readable to make
sure that code that uses it can be fixed instead of breaking silently.
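
As a sketch (clamp_to_resident() is a hypothetical helper, not code from
this patch), a caller that can make partial progress would use the new
return value like this:

static size_t clamp_to_resident(struct iov_iter *i, size_t bytes)
{
	size_t not_faulted = fault_in_iov_iter_readable(i, bytes);

	/* A full-size return means nothing could be faulted in. */
	if (not_faulted == bytes)
		return 0;
	/* Otherwise restrict the copy to the part that is now resident. */
	return bytes - not_faulted;
}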

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/btrfs/file.c        |  2 +-
 fs/f2fs/file.c         |  2 +-
 fs/fuse/file.c         |  2 +-
 fs/iomap/buffered-io.c |  2 +-
 fs/ntfs/file.c         |  2 +-
 include/linux/uio.h    |  2 +-
 lib/iov_iter.c         | 33 +++++++++++++++++++++------------
 mm/filemap.c           |  2 +-
 8 files changed, 28 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index ee34497500e1..281c77cfe91a 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1698,7 +1698,7 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
 		 * Fault pages before locking them in prepare_pages
 		 * to avoid recursive lock
 		 */
-		if (unlikely(iov_iter_fault_in_readable(i, write_bytes))) {
+		if (unlikely(fault_in_iov_iter_readable(i, write_bytes))) {
 			ret = -EFAULT;
 			break;
 		}
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 6afd4562335f..b04b6c909a8b 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -4259,7 +4259,7 @@ static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		size_t target_size = 0;
 		int err;
 
-		if (iov_iter_fault_in_readable(from, iov_iter_count(from)))
+		if (fault_in_iov_iter_readable(from, iov_iter_count(from)))
 			set_inode_flag(inode, FI_NO_PREALLOC);
 
 		if ((iocb->ki_flags & IOCB_NOWAIT)) {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 97f860cfc195..da49ef71dab5 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1160,7 +1160,7 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
 
  again:
 		err = -EFAULT;
-		if (iov_iter_fault_in_readable(ii, bytes))
+		if (fault_in_iov_iter_readable(ii, bytes))
 			break;
 
 		err = -ENOMEM;
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 87ccb3438bec..7dc42dd3a724 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -749,7 +749,7 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
 		 */
-		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
+		if (unlikely(fault_in_iov_iter_readable(i, bytes))) {
 			status = -EFAULT;
 			break;
 		}
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index ab4f3362466d..a43adeacd930 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -1829,7 +1829,7 @@ static ssize_t ntfs_perform_write(struct file *file, struct iov_iter *i,
 		 * pages being swapped out between us bringing them into memory
 		 * and doing the actual copying.
 		 */
-		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
+		if (unlikely(fault_in_iov_iter_readable(i, bytes))) {
 			status = -EFAULT;
 			break;
 		}
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 82c3c3e819e0..12d30246c2e9 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -119,7 +119,7 @@ size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
 				  size_t bytes, struct iov_iter *i);
 void iov_iter_advance(struct iov_iter *i, size_t bytes);
 void iov_iter_revert(struct iov_iter *i, size_t bytes);
-int iov_iter_fault_in_readable(const struct iov_iter *i, size_t bytes);
+size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t bytes);
 size_t iov_iter_single_seg_count(const struct iov_iter *i);
 size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 069cedd9d7b4..082ab155496d 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -430,33 +430,42 @@ static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t by
 }
 
 /*
+ * fault_in_iov_iter_readable - fault in iov iterator for reading
+ * @i: iterator
+ * @size: maximum length
+ *
  * Fault in one or more iovecs of the given iov_iter, to a maximum length of
- * bytes.  For each iovec, fault in each page that constitutes the iovec.
+ * @size.  For each iovec, fault in each page that constitutes the iovec.
+ *
+ * Returns the number of bytes not faulted in (like copy_to_user() and
+ * copy_from_user()).
  *
- * Return 0 on success, or non-zero if the memory could not be accessed (i.e.
- * because it is an invalid address).
+ * Always returns 0 for non-userspace iterators.
  */
-int iov_iter_fault_in_readable(const struct iov_iter *i, size_t bytes)
+size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t size)
 {
 	if (iter_is_iovec(i)) {
+		size_t count = min(size, iov_iter_count(i));
 		const struct iovec *p;
 		size_t skip;
 
-		if (bytes > i->count)
-			bytes = i->count;
-		for (p = i->iov, skip = i->iov_offset; bytes; p++, skip = 0) {
-			size_t len = min(bytes, p->iov_len - skip);
+		size -= count;
+		for (p = i->iov, skip = i->iov_offset; count; p++, skip = 0) {
+			size_t len = min(count, p->iov_len - skip);
+			size_t ret;
 
 			if (unlikely(!len))
 				continue;
-			if (fault_in_readable(p->iov_base + skip, len))
-				return -EFAULT;
-			bytes -= len;
+			ret = fault_in_readable(p->iov_base + skip, len);
+			count -= len - ret;
+			if (ret)
+				break;
 		}
+		return count + size;
 	}
 	return 0;
 }
-EXPORT_SYMBOL(iov_iter_fault_in_readable);
+EXPORT_SYMBOL(fault_in_iov_iter_readable);
 
 void iov_iter_init(struct iov_iter *i, unsigned int direction,
 			const struct iovec *iov, unsigned long nr_segs,
diff --git a/mm/filemap.c b/mm/filemap.c
index 4dec3bc7752e..83af8a534339 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3643,7 +3643,7 @@ ssize_t generic_perform_write(struct file *file,
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
 		 */
-		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
+		if (unlikely(fault_in_iov_iter_readable(i, bytes))) {
 			status = -EFAULT;
 			break;
 		}
-- 
2.26.3



* [Ocfs2-devel] [PATCH v6 05/19] iov_iter: Introduce fault_in_iov_iter_writeable
  2021-08-19 19:40 [Ocfs2-devel] [PATCH v6 00/19] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (3 preceding siblings ...)
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 04/19] Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable Andreas Gruenbacher
@ 2021-08-19 19:40 ` Andreas Gruenbacher
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 06/19] gfs2: Add wrapper for iomap_file_buffered_write Andreas Gruenbacher
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Andreas Gruenbacher @ 2021-08-19 19:40 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Andreas Gruenbacher, linux-kernel, cluster-devel,
	linux-fsdevel, ocfs2-devel

Introduce a new fault_in_iov_iter_writeable helper for safely faulting
in an iterator for writing.  It uses get_user_pages() to fault in the
pages without actually writing to them, as writing would be destructive.

We'll use fault_in_iov_iter_writeable in gfs2 once we've determined that
the buffer the iterator passed to .read_iter points to isn't in memory.
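
As a rough sketch (not actual gfs2 code; do_read_nofault() is a
hypothetical helper for the read path with page faults disabled), the
intended usage looks roughly like this:

static ssize_t read_iter_with_retry(struct kiocb *iocb, struct iov_iter *to)
{
	size_t count;
	ssize_t ret;

	ret = do_read_nofault(iocb, to);	/* hypothetical */
	if (ret != -EFAULT)
		return ret;
	count = iov_iter_count(to);
	/* Returns the number of bytes it could NOT fault in. */
	if (fault_in_iov_iter_writeable(to, count) == count)
		return -EFAULT;			/* no progress possible */
	return do_read_nofault(iocb, to);	/* single retry for the sketch */
}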

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 include/linux/pagemap.h |  1 +
 include/linux/uio.h     |  1 +
 lib/iov_iter.c          | 39 +++++++++++++++++++++++++
 mm/gup.c                | 63 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 104 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 7c9edc9694d9..a629807edb8c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -737,6 +737,7 @@ extern void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter);
  * Fault in userspace address range.
  */
 size_t fault_in_writeable(char __user *uaddr, size_t size);
+size_t fault_in_safe_writeable(const char __user *uaddr, size_t size);
 size_t fault_in_readable(const char __user *uaddr, size_t size);
 
 int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 12d30246c2e9..ffa431aeb067 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -120,6 +120,7 @@ size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
 void iov_iter_advance(struct iov_iter *i, size_t bytes);
 void iov_iter_revert(struct iov_iter *i, size_t bytes);
 size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t bytes);
+size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t bytes);
 size_t iov_iter_single_seg_count(const struct iov_iter *i);
 size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 082ab155496d..968f2d2595cd 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -467,6 +467,45 @@ size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t size)
 }
 EXPORT_SYMBOL(fault_in_iov_iter_readable);
 
+/*
+ * fault_in_iov_iter_writeable - fault in iov iterator for writing
+ * @i: iterator
+ * @size: maximum length
+ *
+ * Faults in the iterator using get_user_pages(), i.e., without triggering
+ * hardware page faults.  This is primarily useful when we know that some or
+ * all of the pages in @i aren't in memory.
+ *
+ * Returns the number of bytes not faulted in (like copy_to_user() and
+ * copy_from_user()).
+ *
+ * Always returns 0 for non-user space iterators.
+ */
+size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t size)
+{
+	if (iter_is_iovec(i)) {
+		size_t count = min(size, iov_iter_count(i));
+		const struct iovec *p;
+		size_t skip;
+
+		size -= count;
+		for (p = i->iov, skip = i->iov_offset; count; p++, skip = 0) {
+			size_t len = min(count, p->iov_len - skip);
+			size_t ret;
+
+			if (unlikely(!len))
+				continue;
+			ret = fault_in_safe_writeable(p->iov_base + skip, len);
+			count -= len - ret;
+			if (ret)
+				break;
+		}
+		return count + size;
+	}
+	return 0;
+}
+EXPORT_SYMBOL(fault_in_iov_iter_writeable);
+
 void iov_iter_init(struct iov_iter *i, unsigned int direction,
 			const struct iovec *iov, unsigned long nr_segs,
 			size_t count)
diff --git a/mm/gup.c b/mm/gup.c
index 0cf47955e5a1..03ab03b68dc7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1707,6 +1707,69 @@ size_t fault_in_writeable(char __user *uaddr, size_t size)
 }
 EXPORT_SYMBOL(fault_in_writeable);
 
+/*
+ * fault_in_safe_writeable - fault in an address range for writing
+ * @uaddr: start of address range
+ * @size: length of address range
+ *
+ * Faults in an address range using get_user_pages, i.e., without triggering
+ * hardware page faults.  This is primarily useful when we know that some or
+ * all of the pages in the address range aren't in memory.
+ *
+ * Unlike fault_in_writeable(), this function is non-destructive.
+ *
+ * Note that we don't pin or otherwise hold the pages referenced that we fault
+ * in.  There's no guarantee that they'll stay in memory for any duration of
+ * time.
+ *
+ * Returns the number of bytes not faulted in (like copy_to_user() and
+ * copy_from_user()).
+ */
+size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
+{
+	unsigned long start = (unsigned long)uaddr;
+	unsigned long end, nstart, nend;
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma = NULL;
+	int locked = 0;
+
+	nstart = start & PAGE_MASK;
+	end = PAGE_ALIGN(start + size);
+	if (end < nstart)
+		end = 0;
+	for (; nstart != end; nstart = nend) {
+		unsigned long nr_pages;
+		long ret;
+
+		if (!locked) {
+			locked = 1;
+			mmap_read_lock(mm);
+			vma = find_vma(mm, nstart);
+		} else if (nstart >= vma->vm_end)
+			vma = vma->vm_next;
+		if (!vma || vma->vm_start >= end)
+			break;
+		nend = end ? min(end, vma->vm_end) : vma->vm_end;
+		if (vma->vm_flags & (VM_IO | VM_PFNMAP))
+			continue;
+		if (nstart < vma->vm_start)
+			nstart = vma->vm_start;
+		nr_pages = (nend - nstart) / PAGE_SIZE;
+		ret = __get_user_pages_locked(mm, nstart, nr_pages,
+					      NULL, NULL, &locked,
+					      FOLL_TOUCH | FOLL_WRITE);
+		if (ret <= 0)
+			break;
+		nend = nstart + ret * PAGE_SIZE;
+	}
+	if (locked)
+		mmap_read_unlock(mm);
+	if (nstart == end)
+		return 0;
+	return size - min_t(size_t, nstart - start, size);
+}
+EXPORT_SYMBOL(fault_in_safe_writeable);
+
 /**
  * fault_in_readable - fault in userspace address range for reading
  * @uaddr: start of user address range
-- 
2.26.3



* [Ocfs2-devel] [PATCH v6 06/19] gfs2: Add wrapper for iomap_file_buffered_write
  2021-08-19 19:40 [Ocfs2-devel] [PATCH v6 00/19] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (4 preceding siblings ...)
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 05/19] iov_iter: Introduce fault_in_iov_iter_writeable Andreas Gruenbacher
@ 2021-08-19 19:40 ` Andreas Gruenbacher
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 07/19] gfs2: Clean up function may_grant Andreas Gruenbacher
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Andreas Gruenbacher @ 2021-08-19 19:40 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Andreas Gruenbacher, linux-kernel, cluster-devel,
	linux-fsdevel, ocfs2-devel

Add a wrapper around iomap_file_buffered_write.  We'll add code here
later for when the operation needs to be retried.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/file.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 84ec053d43b4..55ec1cadc9e6 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -876,6 +876,18 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return written ? written : ret;
 }
 
+static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct file *file = iocb->ki_filp;
+	struct inode *inode = file_inode(file);
+	ssize_t ret;
+
+	current->backing_dev_info = inode_to_bdi(inode);
+	ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
+	current->backing_dev_info = NULL;
+	return ret;
+}
+
 /**
  * gfs2_file_write_iter - Perform a write to a file
  * @iocb: The io context
@@ -927,9 +939,7 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			goto out_unlock;
 
 		iocb->ki_flags |= IOCB_DSYNC;
-		current->backing_dev_info = inode_to_bdi(inode);
-		buffered = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
-		current->backing_dev_info = NULL;
+		buffered = gfs2_file_buffered_write(iocb, from);
 		if (unlikely(buffered <= 0)) {
 			if (!ret)
 				ret = buffered;
@@ -951,9 +961,7 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		if (!ret || ret2 > 0)
 			ret += ret2;
 	} else {
-		current->backing_dev_info = inode_to_bdi(inode);
-		ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
-		current->backing_dev_info = NULL;
+		ret = gfs2_file_buffered_write(iocb, from);
 		if (likely(ret > 0)) {
 			iocb->ki_pos += ret;
 			ret = generic_write_sync(iocb, ret);
-- 
2.26.3



* [Ocfs2-devel] [PATCH v6 07/19] gfs2: Clean up function may_grant
  2021-08-19 19:40 [Ocfs2-devel] [PATCH v6 00/19] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (5 preceding siblings ...)
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 06/19] gfs2: Add wrapper for iomap_file_buffered_write Andreas Gruenbacher
@ 2021-08-19 19:40 ` Andreas Gruenbacher
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 08/19] gfs2: Eliminate vestigial HIF_FIRST Andreas Gruenbacher
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Andreas Gruenbacher @ 2021-08-19 19:40 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Andreas Gruenbacher, linux-kernel, cluster-devel,
	linux-fsdevel, ocfs2-devel

Pass the first current glock holder into function may_grant and
deobfuscate the logic there.

We're now using function find_first_holder in do_promote, so move the
function's definition above do_promote.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/glock.c | 120 ++++++++++++++++++++++++++++--------------------
 1 file changed, 70 insertions(+), 50 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 1f3902ecdded..545b435f55ea 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -301,46 +301,59 @@ void gfs2_glock_put(struct gfs2_glock *gl)
 }
 
 /**
- * may_grant - check if its ok to grant a new lock
+ * may_grant - check if it's ok to grant a new lock
  * @gl: The glock
+ * @current_gh: One of the current holders of @gl
  * @gh: The lock request which we wish to grant
  *
- * Returns: true if its ok to grant the lock
+ * With our current compatibility rules, if a glock has one or more active
+ * holders (HIF_HOLDER flag set), any of those holders can be passed in as
+ * @current_gh; they are all the same as far as compatibility with the new @gh
+ * goes.
+ *
+ * Returns true if it's ok to grant the lock.
  */
 
-static inline int may_grant(const struct gfs2_glock *gl, const struct gfs2_holder *gh)
-{
-	const struct gfs2_holder *gh_head = list_first_entry(&gl->gl_holders, const struct gfs2_holder, gh_list);
+static inline bool may_grant(const struct gfs2_glock *gl,
+			     const struct gfs2_holder *current_gh,
+			     const struct gfs2_holder *gh)
+{
+	if (current_gh) {
+		BUG_ON(!test_bit(HIF_HOLDER, &current_gh->gh_iflags));
+
+		switch(current_gh->gh_state) {
+		case LM_ST_EXCLUSIVE:
+			/*
+			 * Here we make a special exception to grant holders
+			 * who agree to share the EX lock with other holders
+			 * who also have the bit set. If the original holder
+			 * has the LM_FLAG_NODE_SCOPE bit set, we grant more
+			 * holders with the bit set.
+			 */
+			return gh->gh_state == LM_ST_EXCLUSIVE &&
+			       (current_gh->gh_flags & LM_FLAG_NODE_SCOPE) &&
+			       (gh->gh_flags & LM_FLAG_NODE_SCOPE);
 
-	if (gh != gh_head) {
-		/**
-		 * Here we make a special exception to grant holders who agree
-		 * to share the EX lock with other holders who also have the
-		 * bit set. If the original holder has the LM_FLAG_NODE_SCOPE bit
-		 * is set, we grant more holders with the bit set.
-		 */
-		if (gh_head->gh_state == LM_ST_EXCLUSIVE &&
-		    (gh_head->gh_flags & LM_FLAG_NODE_SCOPE) &&
-		    gh->gh_state == LM_ST_EXCLUSIVE &&
-		    (gh->gh_flags & LM_FLAG_NODE_SCOPE))
-			return 1;
-		if ((gh->gh_state == LM_ST_EXCLUSIVE ||
-		     gh_head->gh_state == LM_ST_EXCLUSIVE))
-			return 0;
+		case LM_ST_SHARED:
+		case LM_ST_DEFERRED:
+			return gh->gh_state == current_gh->gh_state;
+
+		default:
+			return false;
+		}
 	}
+
 	if (gl->gl_state == gh->gh_state)
-		return 1;
+		return true;
 	if (gh->gh_flags & GL_EXACT)
-		return 0;
+		return false;
 	if (gl->gl_state == LM_ST_EXCLUSIVE) {
-		if (gh->gh_state == LM_ST_SHARED && gh_head->gh_state == LM_ST_SHARED)
-			return 1;
-		if (gh->gh_state == LM_ST_DEFERRED && gh_head->gh_state == LM_ST_DEFERRED)
-			return 1;
+		return gh->gh_state == LM_ST_SHARED ||
+		       gh->gh_state == LM_ST_DEFERRED;
 	}
-	if (gl->gl_state != LM_ST_UNLOCKED && (gh->gh_flags & LM_FLAG_ANY))
-		return 1;
-	return 0;
+	if (gh->gh_flags & LM_FLAG_ANY)
+		return gl->gl_state != LM_ST_UNLOCKED;
+	return false;
 }
 
 static void gfs2_holder_wake(struct gfs2_holder *gh)
@@ -380,6 +393,24 @@ static void do_error(struct gfs2_glock *gl, const int ret)
 	}
 }
 
+/**
+ * find_first_holder - find the first "holder" gh
+ * @gl: the glock
+ */
+
+static inline struct gfs2_holder *find_first_holder(const struct gfs2_glock *gl)
+{
+	struct gfs2_holder *gh;
+
+	if (!list_empty(&gl->gl_holders)) {
+		gh = list_first_entry(&gl->gl_holders, struct gfs2_holder,
+				      gh_list);
+		if (test_bit(HIF_HOLDER, &gh->gh_iflags))
+			return gh;
+	}
+	return NULL;
+}
+
 /**
  * do_promote - promote as many requests as possible on the current queue
  * @gl: The glock
@@ -393,14 +424,16 @@ __releases(&gl->gl_lockref.lock)
 __acquires(&gl->gl_lockref.lock)
 {
 	const struct gfs2_glock_operations *glops = gl->gl_ops;
-	struct gfs2_holder *gh, *tmp;
+	struct gfs2_holder *gh, *tmp, *first_gh;
 	int ret;
 
+	first_gh = find_first_holder(gl);
+
 restart:
 	list_for_each_entry_safe(gh, tmp, &gl->gl_holders, gh_list) {
 		if (test_bit(HIF_HOLDER, &gh->gh_iflags))
 			continue;
-		if (may_grant(gl, gh)) {
+		if (may_grant(gl, first_gh, gh)) {
 			if (gh->gh_list.prev == &gl->gl_holders &&
 			    glops->go_lock) {
 				spin_unlock(&gl->gl_lockref.lock);
@@ -722,23 +755,6 @@ __acquires(&gl->gl_lockref.lock)
 	spin_lock(&gl->gl_lockref.lock);
 }
 
-/**
- * find_first_holder - find the first "holder" gh
- * @gl: the glock
- */
-
-static inline struct gfs2_holder *find_first_holder(const struct gfs2_glock *gl)
-{
-	struct gfs2_holder *gh;
-
-	if (!list_empty(&gl->gl_holders)) {
-		gh = list_first_entry(&gl->gl_holders, struct gfs2_holder, gh_list);
-		if (test_bit(HIF_HOLDER, &gh->gh_iflags))
-			return gh;
-	}
-	return NULL;
-}
-
 /**
  * run_queue - do all outstanding tasks related to a glock
  * @gl: The glock in question
@@ -1354,8 +1370,12 @@ __acquires(&gl->gl_lockref.lock)
 		GLOCK_BUG_ON(gl, true);
 
 	if (gh->gh_flags & (LM_FLAG_TRY | LM_FLAG_TRY_1CB)) {
-		if (test_bit(GLF_LOCK, &gl->gl_flags))
-			try_futile = !may_grant(gl, gh);
+		if (test_bit(GLF_LOCK, &gl->gl_flags)) {
+			struct gfs2_holder *first_gh;
+
+			first_gh = find_first_holder(gl);
+			try_futile = !may_grant(gl, first_gh, gh);
+		}
 		if (test_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags))
 			goto fail;
 	}
-- 
2.26.3



* [Ocfs2-devel] [PATCH v6 08/19] gfs2: Eliminate vestigial HIF_FIRST
  2021-08-19 19:40 [Ocfs2-devel] [PATCH v6 00/19] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (6 preceding siblings ...)
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 07/19] gfs2: Clean up function may_grant Andreas Gruenbacher
@ 2021-08-19 19:40 ` Andreas Gruenbacher
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 09/19] gfs2: Remove redundant check from gfs2_glock_dq Andreas Gruenbacher
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Andreas Gruenbacher @ 2021-08-19 19:40 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, linux-kernel, cluster-devel, Bob Peterson,
	linux-fsdevel, ocfs2-devel

From: Bob Peterson <rpeterso@redhat.com>

Holder flag HIF_FIRST is no longer used or needed, so remove it.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/glock.c  | 2 --
 fs/gfs2/incore.h | 1 -
 2 files changed, 3 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 545b435f55ea..fd280b6c37ce 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -2097,8 +2097,6 @@ static const char *hflags2str(char *buf, u16 flags, unsigned long iflags)
 		*p++ = 'H';
 	if (test_bit(HIF_WAIT, &iflags))
 		*p++ = 'W';
-	if (test_bit(HIF_FIRST, &iflags))
-		*p++ = 'F';
 	*p = 0;
 	return buf;
 }
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index e6f820f146cb..5c6b985254aa 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -253,7 +253,6 @@ struct gfs2_lkstats {
 enum {
 	/* States */
 	HIF_HOLDER		= 6,  /* Set for gh that "holds" the glock */
-	HIF_FIRST		= 7,
 	HIF_WAIT		= 10,
 };
 
-- 
2.26.3



* [Ocfs2-devel] [PATCH v6 09/19] gfs2: Remove redundant check from gfs2_glock_dq
  2021-08-19 19:40 [Ocfs2-devel] [PATCH v6 00/19] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (7 preceding siblings ...)
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 08/19] gfs2: Eliminate vestigial HIF_FIRST Andreas Gruenbacher
@ 2021-08-19 19:40 ` Andreas Gruenbacher
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 10/19] gfs2: Introduce flag for glock holder auto-demotion Andreas Gruenbacher
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Andreas Gruenbacher @ 2021-08-19 19:40 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, linux-kernel, cluster-devel, Bob Peterson,
	linux-fsdevel, ocfs2-devel

From: Bob Peterson <rpeterso@redhat.com>

Function gfs2_glock_dq checks whether it can take the fast path.  Before
this patch, it checked both "find_first_holder(gl) == NULL" and
list_empty(&gl->gl_holders), which is redundant: if gl_holders is empty,
then find_first_holder must return NULL.  This patch removes the
redundancy.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/glock.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index fd280b6c37ce..f24db2ececfb 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1514,12 +1514,11 @@ void gfs2_glock_dq(struct gfs2_holder *gh)
 
 	list_del_init(&gh->gh_list);
 	clear_bit(HIF_HOLDER, &gh->gh_iflags);
-	if (find_first_holder(gl) == NULL) {
-		if (list_empty(&gl->gl_holders) &&
-		    !test_bit(GLF_PENDING_DEMOTE, &gl->gl_flags) &&
-		    !test_bit(GLF_DEMOTE, &gl->gl_flags))
-			fast_path = 1;
-	}
+	if (list_empty(&gl->gl_holders) &&
+	    !test_bit(GLF_PENDING_DEMOTE, &gl->gl_flags) &&
+	    !test_bit(GLF_DEMOTE, &gl->gl_flags))
+		fast_path = 1;
+
 	if (!test_bit(GLF_LFLUSH, &gl->gl_flags) && demote_ok(gl))
 		gfs2_glock_add_to_lru(gl);
 
-- 
2.26.3



* [Ocfs2-devel] [PATCH v6 10/19] gfs2: Introduce flag for glock holder auto-demotion
  2021-08-19 19:40 [Ocfs2-devel] [PATCH v6 00/19] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (8 preceding siblings ...)
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 09/19] gfs2: Remove redundant check from gfs2_glock_dq Andreas Gruenbacher
@ 2021-08-19 19:40 ` Andreas Gruenbacher
  2021-08-20  9:35   ` [Ocfs2-devel] [Cluster-devel] " Steven Whitehouse
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 11/19] gfs2: Move the inode glock locking to gfs2_file_buffered_write Andreas Gruenbacher
                   ` (8 subsequent siblings)
  18 siblings, 1 reply; 33+ messages in thread
From: Andreas Gruenbacher @ 2021-08-19 19:40 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, linux-kernel, cluster-devel, Bob Peterson,
	linux-fsdevel, ocfs2-devel

From: Bob Peterson <rpeterso@redhat.com>

This patch introduces a new HIF_MAY_DEMOTE flag and infrastructure that
will allow glocks to be demoted automatically on locking conflicts.
When a locking request comes in that isn't compatible with the locking
state of a holder and that holder has the HIF_MAY_DEMOTE flag set, the
holder will be demoted automatically before the incoming locking request
is granted.

Processes that allow a glock holder to be taken away indicate this by
calling gfs2_holder_allow_demote().  When they need the glock again,
they call gfs2_holder_disallow_demote() and then they check if the
holder is still queued: if it is, they're still holding the glock; if it
isn't, they need to re-acquire the glock.

This allows processes to hang on to locks that could become part of a
cyclic locking dependency.  The locks will be given up when a (rare)
conflicting locking request occurs, and don't need to be given up
prematurely.
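
A usage sketch of that pattern (simplified, not code from this patch;
do_copy() is a hypothetical I/O step, and gfs2_holder_queued() stands for
the "is the holder still queued" check described above):

static int demote_aware_section(struct gfs2_holder *gh)
{
	int error;

	gfs2_holder_allow_demote(gh);
	error = do_copy();		/* may fault in user pages */
	gfs2_holder_disallow_demote(gh);

	/* If the holder was auto-demoted, re-acquire the glock. */
	if (!gfs2_holder_queued(gh)) {
		int ret = gfs2_glock_nq(gh);

		if (ret)
			return ret;
	}
	return error;
}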

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
---
 fs/gfs2/glock.c  | 221 +++++++++++++++++++++++++++++++++++++++--------
 fs/gfs2/glock.h  |  20 +++++
 fs/gfs2/incore.h |   1 +
 3 files changed, 206 insertions(+), 36 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index f24db2ececfb..d1b06a09ce2f 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -58,6 +58,7 @@ struct gfs2_glock_iter {
 typedef void (*glock_examiner) (struct gfs2_glock * gl);
 
 static void do_xmote(struct gfs2_glock *gl, struct gfs2_holder *gh, unsigned int target);
+static void __gfs2_glock_dq(struct gfs2_holder *gh);
 
 static struct dentry *gfs2_root;
 static struct workqueue_struct *glock_workqueue;
@@ -197,6 +198,12 @@ static int demote_ok(const struct gfs2_glock *gl)
 
 	if (gl->gl_state == LM_ST_UNLOCKED)
 		return 0;
+	/*
+	 * Note that demote_ok is used for the lru process of disposing of
+	 * glocks. For this purpose, we don't care if the glock's holders
+	 * have the HIF_MAY_DEMOTE flag set or not. If someone is using
+	 * them, don't demote.
+	 */
 	if (!list_empty(&gl->gl_holders))
 		return 0;
 	if (glops->go_demote_ok)
@@ -379,7 +386,7 @@ static void do_error(struct gfs2_glock *gl, const int ret)
 	struct gfs2_holder *gh, *tmp;
 
 	list_for_each_entry_safe(gh, tmp, &gl->gl_holders, gh_list) {
-		if (test_bit(HIF_HOLDER, &gh->gh_iflags))
+		if (!test_bit(HIF_WAIT, &gh->gh_iflags))
 			continue;
 		if (ret & LM_OUT_ERROR)
 			gh->gh_error = -EIO;
@@ -393,6 +400,40 @@ static void do_error(struct gfs2_glock *gl, const int ret)
 	}
 }
 
+/**
+ * demote_incompat_holders - demote incompatible demoteable holders
+ * @gl: the glock we want to promote
+ * @new_gh: the new holder to be promoted
+ */
+static void demote_incompat_holders(struct gfs2_glock *gl,
+				    struct gfs2_holder *new_gh)
+{
+	struct gfs2_holder *gh;
+
+	/*
+	 * Demote incompatible holders before we make ourselves eligible.
+	 * (This holder may or may not allow auto-demoting, but we don't want
+	 * to demote the new holder before it's even granted.)
+	 */
+	list_for_each_entry(gh, &gl->gl_holders, gh_list) {
+		/*
+		 * Since holders are at the front of the list, we stop when we
+		 * find the first non-holder.
+		 */
+		if (!test_bit(HIF_HOLDER, &gh->gh_iflags))
+			return;
+		if (test_bit(HIF_MAY_DEMOTE, &gh->gh_iflags) &&
+		    !may_grant(gl, new_gh, gh)) {
+			/*
+			 * We should not recurse into do_promote because
+			 * __gfs2_glock_dq only calls handle_callback,
+			 * gfs2_glock_add_to_lru and __gfs2_glock_queue_work.
+			 */
+			__gfs2_glock_dq(gh);
+		}
+	}
+}
+
 /**
  * find_first_holder - find the first "holder" gh
  * @gl: the glock
@@ -411,6 +452,26 @@ static inline struct gfs2_holder *find_first_holder(const struct gfs2_glock *gl)
 	return NULL;
 }
 
+/**
+ * find_first_strong_holder - find the first non-demoteable holder
+ * @gl: the glock
+ *
+ * Find the first holder that doesn't have the HIF_MAY_DEMOTE flag set.
+ */
+static inline struct gfs2_holder
+*find_first_strong_holder(struct gfs2_glock *gl)
+{
+	struct gfs2_holder *gh;
+
+	list_for_each_entry(gh, &gl->gl_holders, gh_list) {
+		if (!test_bit(HIF_HOLDER, &gh->gh_iflags))
+			return NULL;
+		if (!test_bit(HIF_MAY_DEMOTE, &gh->gh_iflags))
+			return gh;
+	}
+	return NULL;
+}
+
 /**
  * do_promote - promote as many requests as possible on the current queue
  * @gl: The glock
@@ -425,15 +486,27 @@ __acquires(&gl->gl_lockref.lock)
 {
 	const struct gfs2_glock_operations *glops = gl->gl_ops;
 	struct gfs2_holder *gh, *tmp, *first_gh;
+	bool incompat_holders_demoted = false;
 	int ret;
 
-	first_gh = find_first_holder(gl);
+	first_gh = find_first_strong_holder(gl);
 
 restart:
 	list_for_each_entry_safe(gh, tmp, &gl->gl_holders, gh_list) {
-		if (test_bit(HIF_HOLDER, &gh->gh_iflags))
+		if (!test_bit(HIF_WAIT, &gh->gh_iflags))
 			continue;
 		if (may_grant(gl, first_gh, gh)) {
+			if (!incompat_holders_demoted) {
+				demote_incompat_holders(gl, first_gh);
+				incompat_holders_demoted = true;
+				first_gh = gh;
+			}
+			/*
+			 * The first holder (and only the first holder) on the
+			 * list to be promoted needs to call the go_lock
+			 * function. This does things like inode_refresh
+			 * to read an inode from disk.
+			 */
 			if (gh->gh_list.prev == &gl->gl_holders &&
 			    glops->go_lock) {
 				spin_unlock(&gl->gl_lockref.lock);
@@ -459,6 +532,11 @@ __acquires(&gl->gl_lockref.lock)
 			gfs2_holder_wake(gh);
 			continue;
 		}
+		/*
+		 * If we get here, it means we may not grant this holder for
+		 * some reason. If this holder is the head of the list, it
+		 * means we have a blocked holder at the head, so return 1.
+		 */
 		if (gh->gh_list.prev == &gl->gl_holders)
 			return 1;
 		do_error(gl, 0);
@@ -1373,7 +1451,7 @@ __acquires(&gl->gl_lockref.lock)
 		if (test_bit(GLF_LOCK, &gl->gl_flags)) {
 			struct gfs2_holder *first_gh;
 
-			first_gh = find_first_holder(gl);
+			first_gh = find_first_strong_holder(gl);
 			try_futile = !may_grant(gl, first_gh, gh);
 		}
 		if (test_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags))
@@ -1382,7 +1460,8 @@ __acquires(&gl->gl_lockref.lock)
 
 	list_for_each_entry(gh2, &gl->gl_holders, gh_list) {
 		if (unlikely(gh2->gh_owner_pid == gh->gh_owner_pid &&
-		    (gh->gh_gl->gl_ops->go_type != LM_TYPE_FLOCK)))
+		    (gh->gh_gl->gl_ops->go_type != LM_TYPE_FLOCK) &&
+		    !test_bit(HIF_MAY_DEMOTE, &gh2->gh_iflags)))
 			goto trap_recursive;
 		if (try_futile &&
 		    !(gh2->gh_flags & (LM_FLAG_TRY | LM_FLAG_TRY_1CB))) {
@@ -1478,51 +1557,83 @@ int gfs2_glock_poll(struct gfs2_holder *gh)
 	return test_bit(HIF_WAIT, &gh->gh_iflags) ? 0 : 1;
 }
 
-/**
- * gfs2_glock_dq - dequeue a struct gfs2_holder from a glock (release a glock)
- * @gh: the glock holder
- *
- */
+static inline bool needs_demote(struct gfs2_glock *gl)
+{
+	return (test_bit(GLF_DEMOTE, &gl->gl_flags) ||
+		test_bit(GLF_PENDING_DEMOTE, &gl->gl_flags));
+}
 
-void gfs2_glock_dq(struct gfs2_holder *gh)
+static void __gfs2_glock_dq(struct gfs2_holder *gh)
 {
 	struct gfs2_glock *gl = gh->gh_gl;
 	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
 	unsigned delay = 0;
 	int fast_path = 0;
 
-	spin_lock(&gl->gl_lockref.lock);
 	/*
-	 * If we're in the process of file system withdraw, we cannot just
-	 * dequeue any glocks until our journal is recovered, lest we
-	 * introduce file system corruption. We need two exceptions to this
-	 * rule: We need to allow unlocking of nondisk glocks and the glock
-	 * for our own journal that needs recovery.
+	 * This while loop is similar to function demote_incompat_holders:
+	 * If the glock is due to be demoted (which may be from another node
+	 * or even if this holder is GL_NOCACHE), the weak holders are
+	 * demoted as well, allowing the glock to be demoted.
 	 */
-	if (test_bit(SDF_WITHDRAW_RECOVERY, &sdp->sd_flags) &&
-	    glock_blocked_by_withdraw(gl) &&
-	    gh->gh_gl != sdp->sd_jinode_gl) {
-		sdp->sd_glock_dqs_held++;
-		spin_unlock(&gl->gl_lockref.lock);
-		might_sleep();
-		wait_on_bit(&sdp->sd_flags, SDF_WITHDRAW_RECOVERY,
-			    TASK_UNINTERRUPTIBLE);
-		spin_lock(&gl->gl_lockref.lock);
-	}
-	if (gh->gh_flags & GL_NOCACHE)
-		handle_callback(gl, LM_ST_UNLOCKED, 0, false);
+	while (gh) {
+		/*
+		 * If we're in the process of file system withdraw, we cannot
+		 * just dequeue any glocks until our journal is recovered, lest
+		 * we introduce file system corruption. We need two exceptions
+		 * to this rule: We need to allow unlocking of nondisk glocks
+		 * and the glock for our own journal that needs recovery.
+		 */
+		if (test_bit(SDF_WITHDRAW_RECOVERY, &sdp->sd_flags) &&
+		    glock_blocked_by_withdraw(gl) &&
+		    gh->gh_gl != sdp->sd_jinode_gl) {
+			sdp->sd_glock_dqs_held++;
+			spin_unlock(&gl->gl_lockref.lock);
+			might_sleep();
+			wait_on_bit(&sdp->sd_flags, SDF_WITHDRAW_RECOVERY,
+				    TASK_UNINTERRUPTIBLE);
+			spin_lock(&gl->gl_lockref.lock);
+		}
+
+		/*
+		 * This holder should not be cached, so mark it for demote.
+		 * Note: this should be done before the check for needs_demote
+		 * below.
+		 */
+		if (gh->gh_flags & GL_NOCACHE)
+			handle_callback(gl, LM_ST_UNLOCKED, 0, false);
 
-	list_del_init(&gh->gh_list);
-	clear_bit(HIF_HOLDER, &gh->gh_iflags);
-	if (list_empty(&gl->gl_holders) &&
-	    !test_bit(GLF_PENDING_DEMOTE, &gl->gl_flags) &&
-	    !test_bit(GLF_DEMOTE, &gl->gl_flags))
-		fast_path = 1;
+		list_del_init(&gh->gh_list);
+		clear_bit(HIF_HOLDER, &gh->gh_iflags);
+		trace_gfs2_glock_queue(gh, 0);
+
+		/*
+		 * If there hasn't been a demote request we are done.
+		 * (Let the remaining holders, if any, keep holding it.)
+		 */
+		if (!needs_demote(gl)) {
+			if (list_empty(&gl->gl_holders))
+				fast_path = 1;
+			break;
+		}
+		/*
+		 * If we have another strong holder (we cannot auto-demote)
+		 * we are done. It keeps holding it until it is done.
+		 */
+		if (find_first_strong_holder(gl))
+			break;
+
+		/*
+		 * If we have a weak holder at the head of the list, it
+		 * (and all others like it) must be auto-demoted. If there
+		 * are no more weak holders, we exit the while loop.
+		 */
+		gh = find_first_holder(gl);
+	}
 
 	if (!test_bit(GLF_LFLUSH, &gl->gl_flags) && demote_ok(gl))
 		gfs2_glock_add_to_lru(gl);
 
-	trace_gfs2_glock_queue(gh, 0);
 	if (unlikely(!fast_path)) {
 		gl->gl_lockref.count++;
 		if (test_bit(GLF_PENDING_DEMOTE, &gl->gl_flags) &&
@@ -1531,6 +1642,19 @@ void gfs2_glock_dq(struct gfs2_holder *gh)
 			delay = gl->gl_hold_time;
 		__gfs2_glock_queue_work(gl, delay);
 	}
+}
+
+/**
+ * gfs2_glock_dq - dequeue a struct gfs2_holder from a glock (release a glock)
+ * @gh: the glock holder
+ *
+ */
+void gfs2_glock_dq(struct gfs2_holder *gh)
+{
+	struct gfs2_glock *gl = gh->gh_gl;
+
+	spin_lock(&gl->gl_lockref.lock);
+	__gfs2_glock_dq(gh);
 	spin_unlock(&gl->gl_lockref.lock);
 }
 
@@ -1693,6 +1817,7 @@ void gfs2_glock_dq_m(unsigned int num_gh, struct gfs2_holder *ghs)
 
 void gfs2_glock_cb(struct gfs2_glock *gl, unsigned int state)
 {
+	struct gfs2_holder mock_gh = { .gh_gl = gl, .gh_state = state, };
 	unsigned long delay = 0;
 	unsigned long holdtime;
 	unsigned long now = jiffies;
@@ -1707,6 +1832,28 @@ void gfs2_glock_cb(struct gfs2_glock *gl, unsigned int state)
 		if (test_bit(GLF_REPLY_PENDING, &gl->gl_flags))
 			delay = gl->gl_hold_time;
 	}
+	/*
+	 * Note 1: We cannot call demote_incompat_holders from handle_callback
+	 * or gfs2_set_demote due to recursion problems like: gfs2_glock_dq ->
+	 * handle_callback -> demote_incompat_holders -> gfs2_glock_dq
+	 * Plus, we only want to demote the holders if the request comes from
+	 * a remote cluster node because local holder conflicts are resolved
+	 * elsewhere.
+	 *
+	 * Note 2: if a remote node wants this glock in EX mode, lock_dlm will
+	 * request that we set our state to UNLOCKED. Here we mock up a holder
+	 * to make it look like someone wants the lock EX locally. Any SH
+	 * and DF requests should be able to share the lock without demoting.
+	 *
+	 * Note 3: We only want to demote the demoteable holders when there
+	 * are no more strong holders. The demoteable holders might as well
+	 * keep the glock until the last strong holder is done with it.
+	 */
+	if (!find_first_strong_holder(gl)) {
+		if (state == LM_ST_UNLOCKED)
+			mock_gh.gh_state = LM_ST_EXCLUSIVE;
+		demote_incompat_holders(gl, &mock_gh);
+	}
 	handle_callback(gl, state, delay, true);
 	__gfs2_glock_queue_work(gl, delay);
 	spin_unlock(&gl->gl_lockref.lock);
@@ -2096,6 +2243,8 @@ static const char *hflags2str(char *buf, u16 flags, unsigned long iflags)
 		*p++ = 'H';
 	if (test_bit(HIF_WAIT, &iflags))
 		*p++ = 'W';
+	if (test_bit(HIF_MAY_DEMOTE, &iflags))
+		*p++ = 'D';
 	*p = 0;
 	return buf;
 }
diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
index 31a8f2f649b5..9012487da4c6 100644
--- a/fs/gfs2/glock.h
+++ b/fs/gfs2/glock.h
@@ -150,6 +150,8 @@ static inline struct gfs2_holder *gfs2_glock_is_locked_by_me(struct gfs2_glock *
 	list_for_each_entry(gh, &gl->gl_holders, gh_list) {
 		if (!test_bit(HIF_HOLDER, &gh->gh_iflags))
 			break;
+		if (test_bit(HIF_MAY_DEMOTE, &gh->gh_iflags))
+			continue;
 		if (gh->gh_owner_pid == pid)
 			goto out;
 	}
@@ -325,6 +327,24 @@ static inline void glock_clear_object(struct gfs2_glock *gl, void *object)
 	spin_unlock(&gl->gl_lockref.lock);
 }
 
+static inline void gfs2_holder_allow_demote(struct gfs2_holder *gh)
+{
+	struct gfs2_glock *gl = gh->gh_gl;
+
+	spin_lock(&gl->gl_lockref.lock);
+	set_bit(HIF_MAY_DEMOTE, &gh->gh_iflags);
+	spin_unlock(&gl->gl_lockref.lock);
+}
+
+static inline void gfs2_holder_disallow_demote(struct gfs2_holder *gh)
+{
+	struct gfs2_glock *gl = gh->gh_gl;
+
+	spin_lock(&gl->gl_lockref.lock);
+	clear_bit(HIF_MAY_DEMOTE, &gh->gh_iflags);
+	spin_unlock(&gl->gl_lockref.lock);
+}
+
 extern void gfs2_inode_remember_delete(struct gfs2_glock *gl, u64 generation);
 extern bool gfs2_inode_already_deleted(struct gfs2_glock *gl, u64 generation);
 
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 5c6b985254aa..e73a81db0714 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -252,6 +252,7 @@ struct gfs2_lkstats {
 
 enum {
 	/* States */
+	HIF_MAY_DEMOTE		= 1,
 	HIF_HOLDER		= 6,  /* Set for gh that "holds" the glock */
 	HIF_WAIT		= 10,
 };
-- 
2.26.3



* [Ocfs2-devel] [PATCH v6 11/19] gfs2: Move the inode glock locking to gfs2_file_buffered_write
  2021-08-19 19:40 [Ocfs2-devel] [PATCH v6 00/19] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (9 preceding siblings ...)
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 10/19] gfs2: Introduce flag for glock holder auto-demotion Andreas Gruenbacher
@ 2021-08-19 19:40 ` Andreas Gruenbacher
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 12/19] gfs2: Fix mmap + page fault deadlocks for buffered I/O Andreas Gruenbacher
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Andreas Gruenbacher @ 2021-08-19 19:40 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Andreas Gruenbacher, linux-kernel, cluster-devel,
	linux-fsdevel, ocfs2-devel

So far, for buffered writes, we were taking the inode glock in
gfs2_iomap_begin and dropping it in gfs2_iomap_end with the intention of
not holding the inode glock while iomap_write_actor faults in user
pages.  It turns out that iomap_write_actor is called inside iomap_begin
... iomap_end, so the user pages were still faulted in while holding the
inode glock and the locking code in iomap_begin / iomap_end was
completely pointless.

Move the locking into gfs2_file_buffered_write instead.  We'll take care
of the potential deadlocks due to faulting user pages while holding a
glock in a subsequent patch.
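
To make the nesting explicit, the call chain for a buffered write is
roughly (a simplified sketch, not exact source):

    gfs2_file_buffered_write()
      iomap_file_buffered_write()
        iomap_apply()
          ->iomap_begin()          /* gfs2 used to take the glock here */
          iomap_write_actor()      /* copies from / faults in user pages */
          ->iomap_end()            /* ... and dropped it again here */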

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/bmap.c | 60 +-------------------------------------------------
 fs/gfs2/file.c | 27 +++++++++++++++++++++++
 2 files changed, 28 insertions(+), 59 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index ed8b67b21718..0d90f1809efb 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -961,46 +961,6 @@ static int __gfs2_iomap_get(struct inode *inode, loff_t pos, loff_t length,
 	goto out;
 }
 
-static int gfs2_write_lock(struct inode *inode)
-{
-	struct gfs2_inode *ip = GFS2_I(inode);
-	struct gfs2_sbd *sdp = GFS2_SB(inode);
-	int error;
-
-	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &ip->i_gh);
-	error = gfs2_glock_nq(&ip->i_gh);
-	if (error)
-		goto out_uninit;
-	if (&ip->i_inode == sdp->sd_rindex) {
-		struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
-
-		error = gfs2_glock_nq_init(m_ip->i_gl, LM_ST_EXCLUSIVE,
-					   GL_NOCACHE, &m_ip->i_gh);
-		if (error)
-			goto out_unlock;
-	}
-	return 0;
-
-out_unlock:
-	gfs2_glock_dq(&ip->i_gh);
-out_uninit:
-	gfs2_holder_uninit(&ip->i_gh);
-	return error;
-}
-
-static void gfs2_write_unlock(struct inode *inode)
-{
-	struct gfs2_inode *ip = GFS2_I(inode);
-	struct gfs2_sbd *sdp = GFS2_SB(inode);
-
-	if (&ip->i_inode == sdp->sd_rindex) {
-		struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
-
-		gfs2_glock_dq_uninit(&m_ip->i_gh);
-	}
-	gfs2_glock_dq_uninit(&ip->i_gh);
-}
-
 static int gfs2_iomap_page_prepare(struct inode *inode, loff_t pos,
 				   unsigned len, struct iomap *iomap)
 {
@@ -1119,11 +1079,6 @@ static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
 	return ret;
 }
 
-static inline bool gfs2_iomap_need_write_lock(unsigned flags)
-{
-	return (flags & IOMAP_WRITE) && !(flags & IOMAP_DIRECT);
-}
-
 static int gfs2_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
 			    unsigned flags, struct iomap *iomap,
 			    struct iomap *srcmap)
@@ -1136,12 +1091,6 @@ static int gfs2_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
 		iomap->flags |= IOMAP_F_BUFFER_HEAD;
 
 	trace_gfs2_iomap_start(ip, pos, length, flags);
-	if (gfs2_iomap_need_write_lock(flags)) {
-		ret = gfs2_write_lock(inode);
-		if (ret)
-			goto out;
-	}
-
 	ret = __gfs2_iomap_get(inode, pos, length, flags, iomap, &mp);
 	if (ret)
 		goto out_unlock;
@@ -1169,10 +1118,7 @@ static int gfs2_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
 	ret = gfs2_iomap_begin_write(inode, pos, length, flags, iomap, &mp);
 
 out_unlock:
-	if (ret && gfs2_iomap_need_write_lock(flags))
-		gfs2_write_unlock(inode);
 	release_metapath(&mp);
-out:
 	trace_gfs2_iomap_end(ip, iomap, ret);
 	return ret;
 }
@@ -1220,15 +1166,11 @@ static int gfs2_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 	}
 
 	if (unlikely(!written))
-		goto out_unlock;
+		return 0;
 
 	if (iomap->flags & IOMAP_F_SIZE_CHANGED)
 		mark_inode_dirty(inode);
 	set_bit(GLF_DIRTY, &ip->i_gl->gl_flags);
-
-out_unlock:
-	if (gfs2_iomap_need_write_lock(flags))
-		gfs2_write_unlock(inode);
 	return 0;
 }
 
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 55ec1cadc9e6..813154d60834 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -880,11 +880,38 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *fro
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
+	struct gfs2_inode *ip = GFS2_I(inode);
+	struct gfs2_sbd *sdp = GFS2_SB(inode);
 	ssize_t ret;
 
+	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &ip->i_gh);
+	ret = gfs2_glock_nq(&ip->i_gh);
+	if (ret)
+		goto out_uninit;
+
+	if (inode == sdp->sd_rindex) {
+		struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
+
+		ret = gfs2_glock_nq_init(m_ip->i_gl, LM_ST_EXCLUSIVE,
+					 GL_NOCACHE, &m_ip->i_gh);
+		if (ret)
+			goto out_unlock;
+	}
+
 	current->backing_dev_info = inode_to_bdi(inode);
 	ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
 	current->backing_dev_info = NULL;
+
+	if (inode == sdp->sd_rindex) {
+		struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
+
+		gfs2_glock_dq_uninit(&m_ip->i_gh);
+	}
+
+out_unlock:
+	gfs2_glock_dq(&ip->i_gh);
+out_uninit:
+	gfs2_holder_uninit(&ip->i_gh);
 	return ret;
 }
 
-- 
2.26.3



* [Ocfs2-devel] [PATCH v6 12/19] gfs2: Fix mmap + page fault deadlocks for buffered I/O
  2021-08-19 19:40 [Ocfs2-devel] [PATCH v6 00/19] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (10 preceding siblings ...)
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 11/19] gfs2: Move the inode glock locking to gfs2_file_buffered_write Andreas Gruenbacher
@ 2021-08-19 19:40 ` Andreas Gruenbacher
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 13/19] iomap: Fix iomap_dio_rw return value for user copies Andreas Gruenbacher
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Andreas Gruenbacher @ 2021-08-19 19:40 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Andreas Gruenbacher, linux-kernel, cluster-devel,
	linux-fsdevel, ocfs2-devel

In the .read_iter and .write_iter file operations, we're accessing
user-space memory while holding the inode's glock.  There's a possibility
that the memory is mapped to the same file, in which case we'd recurse
on the same glock.

More complex scenarios can involve multiple glocks, processes, and even
cluster nodes.

Avoid these kinds of problems by disabling page faults while holding a
glock.  If a page fault occurs, we either end up with a partial read or
write, or with -EFAULT if nothing could be read or written.  In that
case, we indicate that we're willing to give up the glock, fault in the
requested pages manually, and repeat the operation.
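
Condensed from the read-side code in the diff below (the retry /
retry_under_glock labels sit around gfs2_glock_nq and the actual I/O,
respectively; window sizing and the bookkeeping of how much has already
been read are left out):

	pagefault_disable();
	ret = generic_file_read_iter(iocb, to);    /* may be short or -EFAULT */
	pagefault_enable();
	if (iov_iter_count(to) && (ret > 0 || ret == -EFAULT)) {
		gfs2_holder_allow_demote(&gh);     /* glock may now be taken away */
		leftover = fault_in_iov_iter_writeable(to, window_size);
		gfs2_holder_disallow_demote(&gh);
		if (leftover != window_size) {     /* faulting in made progress */
			if (!gfs2_holder_queued(&gh))
				goto retry;        /* lost the glock; re-acquire */
			goto retry_under_glock;    /* still holding the glock */
		}
	}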

This kind of locking problem in gfs2 was originally reported by Jan
Kara.  Linus came up with the proposal to disable page faults.  Many
thanks to Al Viro and Matthew Wilcox for their feedback.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/file.c | 98 +++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 93 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 813154d60834..c4262d6ba5e4 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -776,6 +776,36 @@ static int gfs2_fsync(struct file *file, loff_t start, loff_t end,
 	return ret ? ret : ret1;
 }
 
+static bool should_fault_in_pages(struct iov_iter *i, size_t *prev_count,
+				  size_t *window_size)
+{
+	char __user *p = i->iov[0].iov_base + i->iov_offset;
+	size_t count = iov_iter_count(i);
+	size_t size;
+
+	if (!iter_is_iovec(i))
+		return false;
+
+	if (*prev_count != count || !*window_size) {
+		int pages, nr_dirtied;
+
+		pages = min_t(int, BIO_MAX_VECS,
+			      DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE));
+		nr_dirtied = max(current->nr_dirtied_pause -
+				 current->nr_dirtied, 1);
+		pages = min(pages, nr_dirtied);
+		size = (size_t)PAGE_SIZE * pages - offset_in_page(p);
+	} else {
+		size = (size_t)PAGE_SIZE - offset_in_page(p);
+		if (*window_size <= size)
+			return false;
+	}
+
+	*prev_count = count;
+	*window_size = size;
+	return true;
+}
+
 static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
 				     struct gfs2_holder *gh)
 {
@@ -840,9 +870,16 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	struct gfs2_inode *ip;
 	struct gfs2_holder gh;
+	size_t prev_count = 0, window_size = 0;
 	size_t written = 0;
 	ssize_t ret;
 
+	/*
+	 * In this function, we disable page faults when we're holding the
+	 * inode glock while doing I/O.  If a page fault occurs, we drop the
+	 * inode glock, fault in the pages manually, and retry.
+	 */
+
 	if (iocb->ki_flags & IOCB_DIRECT) {
 		ret = gfs2_file_direct_read(iocb, to, &gh);
 		if (likely(ret != -ENOTBLK))
@@ -864,13 +901,35 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	}
 	ip = GFS2_I(iocb->ki_filp->f_mapping->host);
 	gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
+retry:
 	ret = gfs2_glock_nq(&gh);
 	if (ret)
 		goto out_uninit;
+retry_under_glock:
+	pagefault_disable();
 	ret = generic_file_read_iter(iocb, to);
+	pagefault_enable();
 	if (ret > 0)
 		written += ret;
-	gfs2_glock_dq(&gh);
+
+	if (unlikely(iov_iter_count(to) && (ret > 0 || ret == -EFAULT)) &&
+	    should_fault_in_pages(to, &prev_count, &window_size)) {
+		size_t leftover;
+
+		gfs2_holder_allow_demote(&gh);
+		leftover = fault_in_iov_iter_writeable(to, window_size);
+		gfs2_holder_disallow_demote(&gh);
+		if (leftover != window_size) {
+			if (!gfs2_holder_queued(&gh)) {
+				if (written)
+					goto out_uninit;
+				goto retry;
+			}
+			goto retry_under_glock;
+		}
+	}
+	if (gfs2_holder_queued(&gh))
+		gfs2_glock_dq(&gh);
 out_uninit:
 	gfs2_holder_uninit(&gh);
 	return written ? written : ret;
@@ -882,13 +941,22 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *fro
 	struct inode *inode = file_inode(file);
 	struct gfs2_inode *ip = GFS2_I(inode);
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
+	size_t prev_count = 0, window_size = 0;
+	size_t read = 0;
 	ssize_t ret;
 
+	/*
+	 * In this function, we disable page faults when we're holding the
+	 * inode glock while doing I/O.  If a page fault occurs, we drop the
+	 * inode glock, fault in the pages manually, and retry.
+	 */
+
 	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &ip->i_gh);
+retry:
 	ret = gfs2_glock_nq(&ip->i_gh);
 	if (ret)
 		goto out_uninit;
-
+retry_under_glock:
 	if (inode == sdp->sd_rindex) {
 		struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
 
@@ -899,20 +967,40 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *fro
 	}
 
 	current->backing_dev_info = inode_to_bdi(inode);
+	pagefault_disable();
 	ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
+	pagefault_enable();
 	current->backing_dev_info = NULL;
+	if (ret > 0)
+		read += ret;
 
 	if (inode == sdp->sd_rindex) {
 		struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
 
 		gfs2_glock_dq_uninit(&m_ip->i_gh);
 	}
-
+	if (unlikely(iov_iter_count(from) && (ret > 0 || ret == -EFAULT)) &&
+	    should_fault_in_pages(from, &prev_count, &window_size)) {
+		size_t leftover;
+
+		gfs2_holder_allow_demote(&ip->i_gh);
+		leftover = fault_in_iov_iter_readable(from, window_size);
+		gfs2_holder_disallow_demote(&ip->i_gh);
+		if (leftover != window_size) {
+			if (!gfs2_holder_queued(&ip->i_gh)) {
+				if (read)
+					goto out_uninit;
+				goto retry;
+			}
+			goto retry_under_glock;
+		}
+	}
 out_unlock:
-	gfs2_glock_dq(&ip->i_gh);
+	if (gfs2_holder_queued(&ip->i_gh))
+		gfs2_glock_dq(&ip->i_gh);
 out_uninit:
 	gfs2_holder_uninit(&ip->i_gh);
-	return ret;
+	return read ? read : ret;
 }
 
 /**
-- 
2.26.3



* [Ocfs2-devel] [PATCH v6 13/19] iomap: Fix iomap_dio_rw return value for user copies
  2021-08-19 19:40 [Ocfs2-devel] [PATCH v6 00/19] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (11 preceding siblings ...)
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 12/19] gfs2: Fix mmap + page fault deadlocks for buffered I/O Andreas Gruenbacher
@ 2021-08-19 19:40 ` Andreas Gruenbacher
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 14/19] iomap: Support partial direct I/O on user copy failures Andreas Gruenbacher
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Andreas Gruenbacher @ 2021-08-19 19:40 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Andreas Gruenbacher, linux-kernel, cluster-devel,
	linux-fsdevel, ocfs2-devel

When a user copy fails in one of the helpers of iomap_dio_rw, fail with
-EFAULT instead of returning 0.  This matches what iomap_dio_bio_actor
returns when it gets an -EFAULT from bio_iov_iter_get_pages.  With these
changes, iomap_dio_actor now consistently fails with -EFAULT when a user
page cannot be faulted in.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/iomap/direct-io.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 9398b8c31323..8054f5d6c273 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -370,7 +370,7 @@ iomap_dio_hole_actor(loff_t length, struct iomap_dio *dio)
 {
 	length = iov_iter_zero(length, dio->submit.iter);
 	dio->size += length;
-	return length;
+	return length ? length : -EFAULT;
 }
 
 static loff_t
@@ -397,7 +397,7 @@ iomap_dio_inline_actor(struct inode *inode, loff_t pos, loff_t length,
 		copied = copy_to_iter(iomap->inline_data + pos, length, iter);
 	}
 	dio->size += copied;
-	return copied;
+	return copied ? copied : -EFAULT;
 }
 
 static loff_t
-- 
2.26.3



* [Ocfs2-devel] [PATCH v6 14/19] iomap: Support partial direct I/O on user copy failures
  2021-08-19 19:40 [Ocfs2-devel] [PATCH v6 00/19] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (12 preceding siblings ...)
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 13/19] iomap: Fix iomap_dio_rw return value for user copies Andreas Gruenbacher
@ 2021-08-19 19:40 ` Andreas Gruenbacher
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 15/19] iomap: Add done_before argument to iomap_dio_rw Andreas Gruenbacher
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Andreas Gruenbacher @ 2021-08-19 19:40 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Andreas Gruenbacher, linux-kernel, cluster-devel,
	linux-fsdevel, ocfs2-devel

In iomap_dio_rw, when iomap_apply returns an -EFAULT error and the
IOMAP_DIO_PARTIAL flag is set, complete the request synchronously and
return a partial result.  This allows the caller to deal with the page
fault and retry the remainder of the request.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/iomap/direct-io.c  | 6 ++++++
 include/linux/iomap.h | 7 +++++++
 2 files changed, 13 insertions(+)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 8054f5d6c273..ba88fe51b77a 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -561,6 +561,12 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		ret = iomap_apply(inode, pos, count, iomap_flags, ops, dio,
 				iomap_dio_actor);
 		if (ret <= 0) {
+			if (ret == -EFAULT && dio->size &&
+			    (dio_flags & IOMAP_DIO_PARTIAL)) {
+				wait_for_completion = true;
+				ret = 0;
+			}
+
 			/* magic error code to fall back to buffered I/O */
 			if (ret == -ENOTBLK) {
 				wait_for_completion = true;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 479c1da3e221..bcae4814b8e3 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -267,6 +267,13 @@ struct iomap_dio_ops {
   */
 #define IOMAP_DIO_OVERWRITE_ONLY	(1 << 1)
 
+/*
+ * When a page fault occurs, return a partial synchronous result and allow
+ * the caller to retry the rest of the operation after dealing with the page
+ * fault.
+ */
+#define IOMAP_DIO_PARTIAL		(1 << 2)
+
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
 		unsigned int dio_flags);
-- 
2.26.3



* [Ocfs2-devel] [PATCH v6 15/19] iomap: Add done_before argument to iomap_dio_rw
  2021-08-19 19:40 [Ocfs2-devel] [PATCH v6 00/19] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (13 preceding siblings ...)
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 14/19] iomap: Support partial direct I/O on user copy failures Andreas Gruenbacher
@ 2021-08-19 19:40 ` Andreas Gruenbacher
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 16/19] gup: Introduce FOLL_NOFAULT flag to disable page faults Andreas Gruenbacher
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Andreas Gruenbacher @ 2021-08-19 19:40 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Andreas Gruenbacher, linux-kernel, cluster-devel,
	linux-fsdevel, ocfs2-devel

Add a done_before argument to iomap_dio_rw that indicates how much of
the request has already been transferred.  When the request succeeds, we
report that done_before additional bytes were transferred.  This is
useful for finishing a request asynchronously when part of the request
has already been completed synchronously.

We'll use that to allow iomap_dio_rw to be used with page faults
disabled: when a page fault occurs while submitting a request, we
synchronously complete the part of the request that has already been
submitted.  The caller can then take care of the page fault and call
iomap_dio_rw again for the rest of the request, passing in the number of
bytes already transferred.
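
To illustrate the resulting calling convention (condensed from how the
gfs2 direct I/O patches later in this series use it; error handling and
the check that faulting in pages actually made progress are omitted):

	size_t written = 0;	/* bytes completed by earlier calls */
	ssize_t ret;
retry:
	ret = iomap_dio_rw(iocb, iter, ops, NULL, IOMAP_DIO_PARTIAL, written);
	if (ret > 0)
		written = ret;	/* the return value includes done_before */
	if (iov_iter_count(iter) && (ret > 0 || ret == -EFAULT)) {
		/* deal with the page fault, then: */
		goto retry;
	}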

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/btrfs/file.c       |  5 +++--
 fs/ext4/file.c        |  5 +++--
 fs/gfs2/file.c        |  4 ++--
 fs/iomap/direct-io.c  | 11 ++++++++---
 fs/xfs/xfs_file.c     |  6 +++---
 fs/zonefs/super.c     |  4 ++--
 include/linux/iomap.h |  4 ++--
 7 files changed, 23 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 281c77cfe91a..8817fe6b5fc0 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1945,7 +1945,7 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	}
 
 	dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
-			     0);
+			     0, 0);
 
 	btrfs_inode_unlock(inode, ilock_flags);
 
@@ -3637,7 +3637,8 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
 		return 0;
 
 	btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
-	ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops, 0);
+	ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
+			   0, 0);
 	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
 	return ret;
 }
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 816dedcbd541..4a5e7fd31fb5 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -74,7 +74,7 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		return generic_file_read_iter(iocb, to);
 	}
 
-	ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0);
+	ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0, 0);
 	inode_unlock_shared(inode);
 
 	file_accessed(iocb->ki_filp);
@@ -566,7 +566,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (ilock_shared)
 		iomap_ops = &ext4_iomap_overwrite_ops;
 	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
-			   (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0);
+			   (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0,
+			   0);
 	if (ret == -ENOTBLK)
 		ret = 0;
 
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index c4262d6ba5e4..8ca6bdba907c 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -822,7 +822,7 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
 	if (ret)
 		goto out_uninit;
 
-	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0);
+	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0, 0);
 	gfs2_glock_dq(gh);
 out_uninit:
 	gfs2_holder_uninit(gh);
@@ -856,7 +856,7 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 	if (offset + len > i_size_read(&ip->i_inode))
 		goto out;
 
-	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0);
+	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0, 0);
 	if (ret == -ENOTBLK)
 		ret = 0;
 out:
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index ba88fe51b77a..dcf9a2b4381f 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -31,6 +31,7 @@ struct iomap_dio {
 	atomic_t		ref;
 	unsigned		flags;
 	int			error;
+	size_t			done_before;
 	bool			wait_for_completion;
 
 	union {
@@ -126,6 +127,9 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
 	if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
 		ret = generic_write_sync(iocb, ret);
 
+	if (ret > 0)
+		ret += dio->done_before;
+
 	kfree(dio);
 
 	return ret;
@@ -450,7 +454,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 struct iomap_dio *
 __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
-		unsigned int dio_flags)
+		unsigned int dio_flags, size_t done_before)
 {
 	struct address_space *mapping = iocb->ki_filp->f_mapping;
 	struct inode *inode = file_inode(iocb->ki_filp);
@@ -477,6 +481,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	dio->dops = dops;
 	dio->error = 0;
 	dio->flags = 0;
+	dio->done_before = done_before;
 
 	dio->submit.iter = iter;
 	dio->submit.waiter = current;
@@ -648,11 +653,11 @@ EXPORT_SYMBOL_GPL(__iomap_dio_rw);
 ssize_t
 iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
-		unsigned int dio_flags)
+		unsigned int dio_flags, size_t done_before)
 {
 	struct iomap_dio *dio;
 
-	dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags);
+	dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags, done_before);
 	if (IS_ERR_OR_NULL(dio))
 		return PTR_ERR_OR_ZERO(dio);
 	return iomap_dio_complete(dio);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index cc3cfb12df53..3103d9bda466 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -259,7 +259,7 @@ xfs_file_dio_read(
 	ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
 	if (ret)
 		return ret;
-	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0);
+	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0, 0);
 	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
 
 	return ret;
@@ -569,7 +569,7 @@ xfs_file_dio_write_aligned(
 	}
 	trace_xfs_file_direct_write(iocb, from);
 	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
-			   &xfs_dio_write_ops, 0);
+			   &xfs_dio_write_ops, 0, 0);
 out_unlock:
 	if (iolock)
 		xfs_iunlock(ip, iolock);
@@ -647,7 +647,7 @@ xfs_file_dio_write_unaligned(
 
 	trace_xfs_file_direct_write(iocb, from);
 	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
-			   &xfs_dio_write_ops, flags);
+			   &xfs_dio_write_ops, flags, 0);
 
 	/*
 	 * Retry unaligned I/O with exclusive blocking semantics if the DIO
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 70055d486bf7..85ca2f5fe06e 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -864,7 +864,7 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
 		ret = zonefs_file_dio_append(iocb, from);
 	else
 		ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops,
-				   &zonefs_write_dio_ops, 0);
+				   &zonefs_write_dio_ops, 0, 0);
 	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
 	    (ret > 0 || ret == -EIOCBQUEUED)) {
 		if (ret > 0)
@@ -999,7 +999,7 @@ static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		}
 		file_accessed(iocb->ki_filp);
 		ret = iomap_dio_rw(iocb, to, &zonefs_iomap_ops,
-				   &zonefs_read_dio_ops, 0);
+				   &zonefs_read_dio_ops, 0, 0);
 	} else {
 		ret = generic_file_read_iter(iocb, to);
 		if (ret == -EIO)
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index bcae4814b8e3..908bda10024c 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -276,10 +276,10 @@ struct iomap_dio_ops {
 
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
-		unsigned int dio_flags);
+		unsigned int dio_flags, size_t done_before);
 struct iomap_dio *__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
-		unsigned int dio_flags);
+		unsigned int dio_flags, size_t done_before);
 ssize_t iomap_dio_complete(struct iomap_dio *dio);
 int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);
 
-- 
2.26.3



* [Ocfs2-devel] [PATCH v6 16/19] gup: Introduce FOLL_NOFAULT flag to disable page faults
  2021-08-19 19:40 [Ocfs2-devel] [PATCH v6 00/19] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (14 preceding siblings ...)
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 15/19] iomap: Add done_before argument to iomap_dio_rw Andreas Gruenbacher
@ 2021-08-19 19:40 ` Andreas Gruenbacher
  2021-08-19 19:41 ` [Ocfs2-devel] [PATCH v6 17/19] iov_iter: Introduce nofault " Andreas Gruenbacher
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Andreas Gruenbacher @ 2021-08-19 19:40 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Andreas Gruenbacher, linux-kernel, cluster-devel,
	linux-fsdevel, ocfs2-devel

Introduce a new FOLL_NOFAULT flag that causes get_user_pages to return
-EFAULT when it would otherwise trigger a page fault.  This is roughly
similar to FOLL_FAST_ONLY but available on all architectures, and less
fragile.
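
As a minimal sketch of the semantics (an illustrative call, not taken
from this series; the flag is consumed through the iov_iter nofault
flag introduced in the next patch):

	res = get_user_pages_fast(addr, nr_pages,
				  FOLL_WRITE | FOLL_NOFAULT, pages);
	/*
	 * res may be nr_pages, a shorter count (a later page would have
	 * required a fault), or -EFAULT (the first page already would
	 * have); in no case is a page fault actually taken.
	 */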

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 include/linux/mm.h | 3 ++-
 mm/gup.c           | 4 +++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7ca22e6e694a..958246aa343f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2850,7 +2850,8 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_FORCE	0x10	/* get_user_pages read/write w/o permission */
 #define FOLL_NOWAIT	0x20	/* if a disk transfer is needed, start the IO
 				 * and return without waiting upon it */
-#define FOLL_POPULATE	0x40	/* fault in page */
+#define FOLL_POPULATE	0x40	/* fault in pages (with FOLL_MLOCK) */
+#define FOLL_NOFAULT	0x80	/* do not fault in pages */
 #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
 #define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
 #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
diff --git a/mm/gup.c b/mm/gup.c
index 03ab03b68dc7..69056adcc8c9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -932,6 +932,8 @@ static int faultin_page(struct vm_area_struct *vma,
 	/* mlock all present pages, but do not fault in new pages */
 	if ((*flags & (FOLL_POPULATE | FOLL_MLOCK)) == FOLL_MLOCK)
 		return -ENOENT;
+	if (*flags & FOLL_NOFAULT)
+		return -EFAULT;
 	if (*flags & FOLL_WRITE)
 		fault_flags |= FAULT_FLAG_WRITE;
 	if (*flags & FOLL_REMOTE)
@@ -2857,7 +2859,7 @@ static int internal_get_user_pages_fast(unsigned long start,
 
 	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
 				       FOLL_FORCE | FOLL_PIN | FOLL_GET |
-				       FOLL_FAST_ONLY)))
+				       FOLL_FAST_ONLY | FOLL_NOFAULT)))
 		return -EINVAL;
 
 	if (gup_flags & FOLL_PIN)
-- 
2.26.3



* [Ocfs2-devel] [PATCH v6 17/19] iov_iter: Introduce nofault flag to disable page faults
  2021-08-19 19:40 [Ocfs2-devel] [PATCH v6 00/19] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (15 preceding siblings ...)
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 16/19] gup: Introduce FOLL_NOFAULT flag to disable page faults Andreas Gruenbacher
@ 2021-08-19 19:41 ` Andreas Gruenbacher
  2021-08-19 19:41 ` [Ocfs2-devel] [PATCH v6 18/19] gfs2: Fix mmap + page fault deadlocks for direct I/O Andreas Gruenbacher
  2021-08-19 19:41 ` [Ocfs2-devel] [PATCH v6 19/19] gfs2: Eliminate ip->i_gh Andreas Gruenbacher
  18 siblings, 0 replies; 33+ messages in thread
From: Andreas Gruenbacher @ 2021-08-19 19:41 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Andreas Gruenbacher, linux-kernel, cluster-devel,
	linux-fsdevel, ocfs2-devel

Introduce a new nofault flag to indicate to get_user_pages to use the
FOLL_NOFAULT flag.  This will cause get_user_pages to fail when it
would otherwise fault in a page.

Currently, the nofault flag is only checked in iov_iter_get_pages and
iov_iter_get_pages_alloc.  This is enough for iomap_dio_rw, but it
may make sense to check in other contexts as well.
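
A rough sketch of the intended use, mirroring the read side of the gfs2
direct I/O patch later in this series (simplified, not a complete
function):

	pagefault_disable();	/* no physical page faults ... */
	iter->nofault = true;	/* ... and no manual ones via get_user_pages */
	ret = iomap_dio_rw(iocb, iter, ops, NULL, IOMAP_DIO_PARTIAL, done_before);
	iter->nofault = false;
	pagefault_enable();
	/* a short result or -EFAULT now means: fault the pages in and retry */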

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 include/linux/uio.h |  1 +
 lib/iov_iter.c      | 20 +++++++++++++++-----
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index ffa431aeb067..ea35e511268f 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -29,6 +29,7 @@ enum iter_type {
 
 struct iov_iter {
 	u8 iter_type;
+	bool nofault;
 	bool data_source;
 	size_t iov_offset;
 	size_t count;
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 968f2d2595cd..22a82f272754 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -513,6 +513,7 @@ void iov_iter_init(struct iov_iter *i, unsigned int direction,
 	WARN_ON(direction & ~(READ | WRITE));
 	*i = (struct iov_iter) {
 		.iter_type = ITER_IOVEC,
+		.nofault = false,
 		.data_source = direction,
 		.iov = iov,
 		.nr_segs = nr_segs,
@@ -1523,13 +1524,17 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 		return 0;
 
 	if (likely(iter_is_iovec(i))) {
+		unsigned int gup_flags = 0;
 		unsigned long addr;
 
+		if (iov_iter_rw(i) != WRITE)
+			gup_flags |= FOLL_WRITE;
+		if (i->nofault)
+			gup_flags |= FOLL_NOFAULT;
+
 		addr = first_iovec_segment(i, &len, start, maxsize, maxpages);
 		n = DIV_ROUND_UP(len, PAGE_SIZE);
-		res = get_user_pages_fast(addr, n,
-				iov_iter_rw(i) != WRITE ?  FOLL_WRITE : 0,
-				pages);
+		res = get_user_pages_fast(addr, n, gup_flags, pages);
 		if (unlikely(res <= 0))
 			return res;
 		return (res == n ? len : res * PAGE_SIZE) - *start;
@@ -1645,15 +1650,20 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		return 0;
 
 	if (likely(iter_is_iovec(i))) {
+		unsigned int gup_flags = 0;
 		unsigned long addr;
 
+		if (iov_iter_rw(i) != WRITE)
+			gup_flags |= FOLL_WRITE;
+		if (i->nofault)
+			gup_flags |= FOLL_NOFAULT;
+
 		addr = first_iovec_segment(i, &len, start, maxsize, ~0U);
 		n = DIV_ROUND_UP(len, PAGE_SIZE);
 		p = get_pages_array(n);
 		if (!p)
 			return -ENOMEM;
-		res = get_user_pages_fast(addr, n,
-				iov_iter_rw(i) != WRITE ?  FOLL_WRITE : 0, p);
+		res = get_user_pages_fast(addr, n, gup_flags, p);
 		if (unlikely(res <= 0)) {
 			kvfree(p);
 			*pages = NULL;
-- 
2.26.3



* [Ocfs2-devel] [PATCH v6 18/19] gfs2: Fix mmap + page fault deadlocks for direct I/O
  2021-08-19 19:40 [Ocfs2-devel] [PATCH v6 00/19] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (16 preceding siblings ...)
  2021-08-19 19:41 ` [Ocfs2-devel] [PATCH v6 17/19] iov_iter: Introduce nofault " Andreas Gruenbacher
@ 2021-08-19 19:41 ` Andreas Gruenbacher
  2021-08-19 19:41 ` [Ocfs2-devel] [PATCH v6 19/19] gfs2: Eliminate ip->i_gh Andreas Gruenbacher
  18 siblings, 0 replies; 33+ messages in thread
From: Andreas Gruenbacher @ 2021-08-19 19:41 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Andreas Gruenbacher, linux-kernel, cluster-devel,
	linux-fsdevel, ocfs2-devel

Also disable page faults during direct I/O requests and implement the
same kind of retry logic as in the buffered I/O case.

Direct I/O requests differ from buffered I/O requests in that they use
bio_iov_iter_get_pages for grabbing page references and faulting in
pages instead of triggering physical page faults.  Those manual page
faults can be disabled with the new iov_iter->nofault flag.

This kind of locking problem in gfs2 was originally reported by Jan
Kara.  Linus came up with the proposal to disable page faults.  Many
thanks to Al Viro and Matthew Wilcox for their feedback.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/file.c | 99 ++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 87 insertions(+), 12 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 8ca6bdba907c..2cf3466b9dde 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -811,22 +811,64 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
 {
 	struct file *file = iocb->ki_filp;
 	struct gfs2_inode *ip = GFS2_I(file->f_mapping->host);
-	size_t count = iov_iter_count(to);
+	size_t prev_count = 0, window_size = 0;
+	size_t written = 0;
 	ssize_t ret;
 
-	if (!count)
+	/*
+	 * In this function, we disable page faults when we're holding the
+	 * inode glock while doing I/O.  If a page fault occurs, we drop the
+	 * inode glock, fault in the pages manually, and retry.
+	 *
+	 * Unlike generic_file_read_iter, for reads, iomap_dio_rw can trigger
+	 * physical as well as manual page faults, and we need to disable both
+	 * kinds.
+	 *
+	 * For direct I/O, gfs2 takes the inode glock in deferred mode.  This
+	 * locking mode is compatible with other deferred holders, so multiple
+	 * processes and nodes can do direct I/O to a file at the same time.
+	 * There's no guarantee that reads or writes will be atomic.  Any
+	 * coordination among readers and writers needs to happen externally.
+	 */
+
+	if (!iov_iter_count(to))
 		return 0; /* skip atime */
 
 	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh);
+retry:
 	ret = gfs2_glock_nq(gh);
 	if (ret)
 		goto out_uninit;
+retry_under_glock:
+	pagefault_disable();
+	to->nofault = true;
+	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL,
+			   IOMAP_DIO_PARTIAL, written);
+	to->nofault = false;
+	pagefault_enable();
+	if (ret > 0)
+		written = ret;
+
+	if (unlikely(iov_iter_count(to) && (ret > 0 || ret == -EFAULT)) &&
+	    should_fault_in_pages(to, &prev_count, &window_size)) {
+		size_t leftover;
 
-	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0, 0);
-	gfs2_glock_dq(gh);
+		gfs2_holder_allow_demote(gh);
+		leftover = fault_in_iov_iter_writeable(to, window_size);
+		gfs2_holder_disallow_demote(gh);
+		if (leftover != window_size) {
+			if (!gfs2_holder_queued(gh))
+				goto retry;
+			goto retry_under_glock;
+		}
+	}
+	if (gfs2_holder_queued(gh))
+		gfs2_glock_dq(gh);
 out_uninit:
 	gfs2_holder_uninit(gh);
-	return ret;
+	if (ret < 0)
+		return ret;
+	return written;
 }
 
 static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
@@ -835,10 +877,19 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
 	struct gfs2_inode *ip = GFS2_I(inode);
-	size_t len = iov_iter_count(from);
-	loff_t offset = iocb->ki_pos;
+	size_t prev_count = 0, window_size = 0;
+	size_t read = 0;
 	ssize_t ret;
 
+	/*
+	 * In this function, we disable page faults when we're holding the
+	 * inode glock while doing I/O.  If a page fault occurs, we drop the
+	 * inode glock, fault in the pages manually, and retry.
+	 *
+	 * For writes, iomap_dio_rw only triggers manual page faults, so we
+	 * don't need to disable physical ones.
+	 */
+
 	/*
 	 * Deferred lock, even if its a write, since we do no allocation on
 	 * this path. All we need to change is the atime, and this lock mode
@@ -848,22 +899,46 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 	 * VFS does.
 	 */
 	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh);
+retry:
 	ret = gfs2_glock_nq(gh);
 	if (ret)
 		goto out_uninit;
-
+retry_under_glock:
 	/* Silently fall back to buffered I/O when writing beyond EOF */
-	if (offset + len > i_size_read(&ip->i_inode))
+	if (iocb->ki_pos + iov_iter_count(from) > i_size_read(&ip->i_inode))
 		goto out;
 
-	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0, 0);
+	from->nofault = true;
+	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL,
+			   IOMAP_DIO_PARTIAL, read);
+	from->nofault = false;
+
 	if (ret == -ENOTBLK)
 		ret = 0;
+	if (ret > 0)
+		read = ret;
+
+	if (unlikely(iov_iter_count(from) && (ret > 0 || ret == -EFAULT)) &&
+	    should_fault_in_pages(from, &prev_count, &window_size)) {
+		size_t leftover;
+
+		gfs2_holder_allow_demote(gh);
+		leftover = fault_in_iov_iter_readable(from, window_size);
+		gfs2_holder_disallow_demote(gh);
+		if (leftover != window_size) {
+			if (!gfs2_holder_queued(gh))
+				goto retry;
+			goto retry_under_glock;
+		}
+	}
 out:
-	gfs2_glock_dq(gh);
+	if (gfs2_holder_queued(gh))
+		gfs2_glock_dq(gh);
 out_uninit:
 	gfs2_holder_uninit(gh);
-	return ret;
+	if (ret < 0)
+		return ret;
+	return read;
 }
 
 static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
-- 
2.26.3



* [Ocfs2-devel] [PATCH v6 19/19] gfs2: Eliminate ip->i_gh
  2021-08-19 19:40 [Ocfs2-devel] [PATCH v6 00/19] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (17 preceding siblings ...)
  2021-08-19 19:41 ` [Ocfs2-devel] [PATCH v6 18/19] gfs2: Fix mmap + page fault deadlocks for direct I/O Andreas Gruenbacher
@ 2021-08-19 19:41 ` Andreas Gruenbacher
  18 siblings, 0 replies; 33+ messages in thread
From: Andreas Gruenbacher @ 2021-08-19 19:41 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Andreas Gruenbacher, linux-kernel, cluster-devel,
	linux-fsdevel, ocfs2-devel

Now that gfs2_file_buffered_write is the only remaining user of
ip->i_gh, we can allocate a glock holder on the stack and use that
instead.

This is slightly complicated by the fact that we're using ip->i_gh for
the statfs inode in gfs2_file_buffered_write as well.  Since writing to
the statfs inode isn't very common, allocate the statfs holder
dynamically when needed.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/file.c   | 42 +++++++++++++++++++++++++-----------------
 fs/gfs2/incore.h |  3 +--
 2 files changed, 26 insertions(+), 19 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 2cf3466b9dde..9729cb3483c0 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -1010,12 +1010,15 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return written ? written : ret;
 }
 
-static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *from)
+static ssize_t gfs2_file_buffered_write(struct kiocb *iocb,
+					struct iov_iter *from,
+					struct gfs2_holder *gh)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
 	struct gfs2_inode *ip = GFS2_I(inode);
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
+	struct gfs2_holder *statfs_gh = NULL;
 	size_t prev_count = 0, window_size = 0;
 	size_t read = 0;
 	ssize_t ret;
@@ -1026,9 +1029,15 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *fro
 	 * inode glock, fault in the pages manually, and retry.
 	 */
 
-	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &ip->i_gh);
+	if (inode == sdp->sd_rindex) {
+		statfs_gh = kmalloc(sizeof(*statfs_gh), GFP_NOFS);
+		if (!statfs_gh)
+			return -ENOMEM;
+	}
+
+	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, gh);
 retry:
-	ret = gfs2_glock_nq(&ip->i_gh);
+	ret = gfs2_glock_nq(gh);
 	if (ret)
 		goto out_uninit;
 retry_under_glock:
@@ -1036,7 +1045,7 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *fro
 		struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
 
 		ret = gfs2_glock_nq_init(m_ip->i_gl, LM_ST_EXCLUSIVE,
-					 GL_NOCACHE, &m_ip->i_gh);
+					 GL_NOCACHE, statfs_gh);
 		if (ret)
 			goto out_unlock;
 	}
@@ -1049,20 +1058,17 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *fro
 	if (ret > 0)
 		read += ret;
 
-	if (inode == sdp->sd_rindex) {
-		struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
-
-		gfs2_glock_dq_uninit(&m_ip->i_gh);
-	}
+	if (inode == sdp->sd_rindex)
+		gfs2_glock_dq_uninit(statfs_gh);
 	if (unlikely(iov_iter_count(from) && (ret > 0 || ret == -EFAULT)) &&
 	    should_fault_in_pages(from, &prev_count, &window_size)) {
 		size_t leftover;
 
-		gfs2_holder_allow_demote(&ip->i_gh);
+		gfs2_holder_allow_demote(gh);
 		leftover = fault_in_iov_iter_readable(from, window_size);
-		gfs2_holder_disallow_demote(&ip->i_gh);
+		gfs2_holder_disallow_demote(gh);
 		if (leftover != window_size) {
-			if (!gfs2_holder_queued(&ip->i_gh)) {
+			if (!gfs2_holder_queued(gh)) {
 				if (read)
 					goto out_uninit;
 				goto retry;
@@ -1071,10 +1077,12 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *fro
 		}
 	}
 out_unlock:
-	if (gfs2_holder_queued(&ip->i_gh))
-		gfs2_glock_dq(&ip->i_gh);
+	if (gfs2_holder_queued(gh))
+		gfs2_glock_dq(gh);
 out_uninit:
-	gfs2_holder_uninit(&ip->i_gh);
+	gfs2_holder_uninit(gh);
+	if (statfs_gh)
+		kfree(statfs_gh);
 	return read ? read : ret;
 }
 
@@ -1129,7 +1137,7 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			goto out_unlock;
 
 		iocb->ki_flags |= IOCB_DSYNC;
-		buffered = gfs2_file_buffered_write(iocb, from);
+		buffered = gfs2_file_buffered_write(iocb, from, &gh);
 		if (unlikely(buffered <= 0)) {
 			if (!ret)
 				ret = buffered;
@@ -1151,7 +1159,7 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		if (!ret || ret2 > 0)
 			ret += ret2;
 	} else {
-		ret = gfs2_file_buffered_write(iocb, from);
+		ret = gfs2_file_buffered_write(iocb, from, &gh);
 		if (likely(ret > 0)) {
 			iocb->ki_pos += ret;
 			ret = generic_write_sync(iocb, ret);
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index e73a81db0714..87abdcc1de0c 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -387,9 +387,8 @@ struct gfs2_inode {
 	u64 i_generation;
 	u64 i_eattr;
 	unsigned long i_flags;		/* GIF_... */
-	struct gfs2_glock *i_gl; /* Move into i_gh? */
+	struct gfs2_glock *i_gl;
 	struct gfs2_holder i_iopen_gh;
-	struct gfs2_holder i_gh; /* for prepare/commit_write only */
 	struct gfs2_qadata *i_qadata; /* quota allocation data */
 	struct gfs2_holder i_rgd_gh;
 	struct gfs2_blkreserv i_res; /* rgrp multi-block reservation */
-- 
2.26.3



* Re: [Ocfs2-devel] [Cluster-devel] [PATCH v6 10/19] gfs2: Introduce flag for glock holder auto-demotion
  2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 10/19] gfs2: Introduce flag for glock holder auto-demotion Andreas Gruenbacher
@ 2021-08-20  9:35   ` Steven Whitehouse
  2021-08-20 13:11     ` Bob Peterson
  2021-08-20 13:17     ` Andreas Gruenbacher
  0 siblings, 2 replies; 33+ messages in thread
From: Steven Whitehouse @ 2021-08-20  9:35 UTC (permalink / raw)
  To: Andreas Gruenbacher, Linus Torvalds, Alexander Viro,
	Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, linux-kernel, cluster-devel, linux-fsdevel, ocfs2-devel

Hi,

On Thu, 2021-08-19 at 21:40 +0200, Andreas Gruenbacher wrote:
> From: Bob Peterson <rpeterso@redhat.com>
> 
> This patch introduces a new HIF_MAY_DEMOTE flag and infrastructure
> that
> will allow glocks to be demoted automatically on locking conflicts.
> When a locking request comes in that isn't compatible with the
> locking
> state of a holder and that holder has the HIF_MAY_DEMOTE flag set,
> the
> holder will be demoted automatically before the incoming locking
> request
> is granted.
> 
I'm not sure I understand what is going on here. When there are locking
conflicts we generate callbacks and those result in glock demotion.
There is no need for a flag to indicate that, I think, since it is the
default behaviour anyway. Or perhaps the explanation is just a bit
confusing...

> Processes that allow a glock holder to be taken away indicate this by
> calling gfs2_holder_allow_demote().  When they need the glock again,
> they call gfs2_holder_disallow_demote() and then they check if the
> holder is still queued: if it is, they're still holding the glock; if
> it
> isn't, they need to re-acquire the glock.
> 
> This allows processes to hang on to locks that could become part of a
> cyclic locking dependency.  The locks will be given up when a (rare)
> conflicting locking request occurs, and don't need to be given up
> prematurely.
This seems backwards to me. We already have the glock layer cache the
locks until they are required by another node. We also have the min
hold time to make sure that we don't bounce locks too much. So what is
the problem that you are trying to solve here I wonder?

> 
> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
> ---
>  fs/gfs2/glock.c  | 221 +++++++++++++++++++++++++++++++++++++++----
> ----
>  fs/gfs2/glock.h  |  20 +++++
>  fs/gfs2/incore.h |   1 +
>  3 files changed, 206 insertions(+), 36 deletions(-)
> 
> diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
> index f24db2ececfb..d1b06a09ce2f 100644
> --- a/fs/gfs2/glock.c
> +++ b/fs/gfs2/glock.c
> @@ -58,6 +58,7 @@ struct gfs2_glock_iter {
>  typedef void (*glock_examiner) (struct gfs2_glock * gl);
>  
>  static void do_xmote(struct gfs2_glock *gl, struct gfs2_holder *gh,
> unsigned int target);
> +static void __gfs2_glock_dq(struct gfs2_holder *gh);
>  
>  static struct dentry *gfs2_root;
>  static struct workqueue_struct *glock_workqueue;
> @@ -197,6 +198,12 @@ static int demote_ok(const struct gfs2_glock
> *gl)
>  
>  	if (gl->gl_state == LM_ST_UNLOCKED)
>  		return 0;
> +	/*
> +	 * Note that demote_ok is used for the lru process of disposing
> of
> +	 * glocks. For this purpose, we don't care if the glock's
> holders
> +	 * have the HIF_MAY_DEMOTE flag set or not. If someone is using
> +	 * them, don't demote.
> +	 */
>  	if (!list_empty(&gl->gl_holders))
>  		return 0;
>  	if (glops->go_demote_ok)
> @@ -379,7 +386,7 @@ static void do_error(struct gfs2_glock *gl, const
> int ret)
>  	struct gfs2_holder *gh, *tmp;
>  
>  	list_for_each_entry_safe(gh, tmp, &gl->gl_holders, gh_list) {
> -		if (test_bit(HIF_HOLDER, &gh->gh_iflags))
> +		if (!test_bit(HIF_WAIT, &gh->gh_iflags))
>  			continue;
>  		if (ret & LM_OUT_ERROR)
>  			gh->gh_error = -EIO;
> @@ -393,6 +400,40 @@ static void do_error(struct gfs2_glock *gl,
> const int ret)
>  	}
>  }
>  
> +/**
> + * demote_incompat_holders - demote incompatible demoteable holders
> + * @gl: the glock we want to promote
> + * @new_gh: the new holder to be promoted
> + */
> +static void demote_incompat_holders(struct gfs2_glock *gl,
> +				    struct gfs2_holder *new_gh)
> +{
> +	struct gfs2_holder *gh;
> +
> +	/*
> +	 * Demote incompatible holders before we make ourselves
> eligible.
> +	 * (This holder may or may not allow auto-demoting, but we
> don't want
> +	 * to demote the new holder before it's even granted.)
> +	 */
> +	list_for_each_entry(gh, &gl->gl_holders, gh_list) {
> +		/*
> +		 * Since holders are at the front of the list, we stop
> when we
> +		 * find the first non-holder.
> +		 */
> +		if (!test_bit(HIF_HOLDER, &gh->gh_iflags))
> +			return;
> +		if (test_bit(HIF_MAY_DEMOTE, &gh->gh_iflags) &&
> +		    !may_grant(gl, new_gh, gh)) {
> +			/*
> +			 * We should not recurse into do_promote
> because
> +			 * __gfs2_glock_dq only calls handle_callback,
> +			 * gfs2_glock_add_to_lru and
> __gfs2_glock_queue_work.
> +			 */
> +			__gfs2_glock_dq(gh);
> +		}
> +	}
> +}
> +
>  /**
>   * find_first_holder - find the first "holder" gh
>   * @gl: the glock
> @@ -411,6 +452,26 @@ static inline struct gfs2_holder
> *find_first_holder(const struct gfs2_glock *gl)
>  	return NULL;
>  }
>  
> +/**
> + * find_first_strong_holder - find the first non-demoteable holder
> + * @gl: the glock
> + *
> + * Find the first holder that doesn't have the HIF_MAY_DEMOTE flag
> set.
> + */
> +static inline struct gfs2_holder
> +*find_first_strong_holder(struct gfs2_glock *gl)
> +{
> +	struct gfs2_holder *gh;
> +
> +	list_for_each_entry(gh, &gl->gl_holders, gh_list) {
> +		if (!test_bit(HIF_HOLDER, &gh->gh_iflags))
> +			return NULL;
> +		if (!test_bit(HIF_MAY_DEMOTE, &gh->gh_iflags))
> +			return gh;
> +	}
> +	return NULL;
> +}
> +
>  /**
>   * do_promote - promote as many requests as possible on the current
> queue
>   * @gl: The glock
> @@ -425,15 +486,27 @@ __acquires(&gl->gl_lockref.lock)
>  {
>  	const struct gfs2_glock_operations *glops = gl->gl_ops;
>  	struct gfs2_holder *gh, *tmp, *first_gh;
> +	bool incompat_holders_demoted = false;
>  	int ret;
>  
> -	first_gh = find_first_holder(gl);
> +	first_gh = find_first_strong_holder(gl);
>  
>  restart:
>  	list_for_each_entry_safe(gh, tmp, &gl->gl_holders, gh_list) {
> -		if (test_bit(HIF_HOLDER, &gh->gh_iflags))
> +		if (!test_bit(HIF_WAIT, &gh->gh_iflags))
>  			continue;
>  		if (may_grant(gl, first_gh, gh)) {
> +			if (!incompat_holders_demoted) {
> +				demote_incompat_holders(gl, first_gh);
> +				incompat_holders_demoted = true;
> +				first_gh = gh;
> +			}
> +			/*
> +			 * The first holder (and only the first holder)
> on the
> +			 * list to be promoted needs to call the
> go_lock
> +			 * function. This does things like
> inode_refresh
> +			 * to read an inode from disk.
> +			 */
>  			if (gh->gh_list.prev == &gl->gl_holders &&
>  			    glops->go_lock) {
>  				spin_unlock(&gl->gl_lockref.lock);
> @@ -459,6 +532,11 @@ __acquires(&gl->gl_lockref.lock)
>  			gfs2_holder_wake(gh);
>  			continue;
>  		}
> +		/*
> +		 * If we get here, it means we may not grant this
> holder for
> +		 * some reason. If this holder is the head of the list,
> it
> +		 * means we have a blocked holder at the head, so
> return 1.
> +		 */
>  		if (gh->gh_list.prev == &gl->gl_holders)
>  			return 1;
>  		do_error(gl, 0);
> @@ -1373,7 +1451,7 @@ __acquires(&gl->gl_lockref.lock)
>  		if (test_bit(GLF_LOCK, &gl->gl_flags)) {
>  			struct gfs2_holder *first_gh;
>  
> -			first_gh = find_first_holder(gl);
> +			first_gh = find_first_strong_holder(gl);
>  			try_futile = !may_grant(gl, first_gh, gh);
>  		}
>  		if (test_bit(GLF_INVALIDATE_IN_PROGRESS, &gl-
> >gl_flags))
> @@ -1382,7 +1460,8 @@ __acquires(&gl->gl_lockref.lock)
>  
>  	list_for_each_entry(gh2, &gl->gl_holders, gh_list) {
>  		if (unlikely(gh2->gh_owner_pid == gh->gh_owner_pid &&
> -		    (gh->gh_gl->gl_ops->go_type != LM_TYPE_FLOCK)))
> +		    (gh->gh_gl->gl_ops->go_type != LM_TYPE_FLOCK) &&
> +		    !test_bit(HIF_MAY_DEMOTE, &gh2->gh_iflags)))
>  			goto trap_recursive;
>  		if (try_futile &&
>  		    !(gh2->gh_flags & (LM_FLAG_TRY | LM_FLAG_TRY_1CB)))
> {
> @@ -1478,51 +1557,83 @@ int gfs2_glock_poll(struct gfs2_holder *gh)
>  	return test_bit(HIF_WAIT, &gh->gh_iflags) ? 0 : 1;
>  }
>  
> -/**
> - * gfs2_glock_dq - dequeue a struct gfs2_holder from a glock
> (release a glock)
> - * @gh: the glock holder
> - *
> - */
> +static inline bool needs_demote(struct gfs2_glock *gl)
> +{
> +	return (test_bit(GLF_DEMOTE, &gl->gl_flags) ||
> +		test_bit(GLF_PENDING_DEMOTE, &gl->gl_flags));
> +}
>  
> -void gfs2_glock_dq(struct gfs2_holder *gh)
> +static void __gfs2_glock_dq(struct gfs2_holder *gh)
>  {
>  	struct gfs2_glock *gl = gh->gh_gl;
>  	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
>  	unsigned delay = 0;
>  	int fast_path = 0;
>  
> -	spin_lock(&gl->gl_lockref.lock);
>  	/*
> -	 * If we're in the process of file system withdraw, we cannot
> just
> -	 * dequeue any glocks until our journal is recovered, lest we
> -	 * introduce file system corruption. We need two exceptions to
> this
> -	 * rule: We need to allow unlocking of nondisk glocks and the
> glock
> -	 * for our own journal that needs recovery.
> +	 * This while loop is similar to function
> demote_incompat_holders:
> +	 * If the glock is due to be demoted (which may be from another
> node
> +	 * or even if this holder is GL_NOCACHE), the weak holders are
> +	 * demoted as well, allowing the glock to be demoted.
>  	 */
> -	if (test_bit(SDF_WITHDRAW_RECOVERY, &sdp->sd_flags) &&
> -	    glock_blocked_by_withdraw(gl) &&
> -	    gh->gh_gl != sdp->sd_jinode_gl) {
> -		sdp->sd_glock_dqs_held++;
> -		spin_unlock(&gl->gl_lockref.lock);
> -		might_sleep();
> -		wait_on_bit(&sdp->sd_flags, SDF_WITHDRAW_RECOVERY,
> -			    TASK_UNINTERRUPTIBLE);
> -		spin_lock(&gl->gl_lockref.lock);
> -	}
> -	if (gh->gh_flags & GL_NOCACHE)
> -		handle_callback(gl, LM_ST_UNLOCKED, 0, false);
> +	while (gh) {
> +		/*
> +		 * If we're in the process of file system withdraw, we
> cannot
> +		 * just dequeue any glocks until our journal is
> recovered, lest
> +		 * we introduce file system corruption. We need two
> exceptions
> +		 * to this rule: We need to allow unlocking of nondisk
> glocks
> +		 * and the glock for our own journal that needs
> recovery.
> +		 */
> +		if (test_bit(SDF_WITHDRAW_RECOVERY, &sdp->sd_flags) &&
> +		    glock_blocked_by_withdraw(gl) &&
> +		    gh->gh_gl != sdp->sd_jinode_gl) {
> +			sdp->sd_glock_dqs_held++;
> +			spin_unlock(&gl->gl_lockref.lock);
> +			might_sleep();
> +			wait_on_bit(&sdp->sd_flags,
> SDF_WITHDRAW_RECOVERY,
> +				    TASK_UNINTERRUPTIBLE);
> +			spin_lock(&gl->gl_lockref.lock);
> +		}
> +
> +		/*
> +		 * This holder should not be cached, so mark it for
> demote.
> +		 * Note: this should be done before the check for
> needs_demote
> +		 * below.
> +		 */
> +		if (gh->gh_flags & GL_NOCACHE)
> +			handle_callback(gl, LM_ST_UNLOCKED, 0, false);
>  
> -	list_del_init(&gh->gh_list);
> -	clear_bit(HIF_HOLDER, &gh->gh_iflags);
> -	if (list_empty(&gl->gl_holders) &&
> -	    !test_bit(GLF_PENDING_DEMOTE, &gl->gl_flags) &&
> -	    !test_bit(GLF_DEMOTE, &gl->gl_flags))
> -		fast_path = 1;
> +		list_del_init(&gh->gh_list);
> +		clear_bit(HIF_HOLDER, &gh->gh_iflags);
> +		trace_gfs2_glock_queue(gh, 0);
> +
> +		/*
> +		 * If there hasn't been a demote request we are done.
> +		 * (Let the remaining holders, if any, keep holding
> it.)
> +		 */
> +		if (!needs_demote(gl)) {
> +			if (list_empty(&gl->gl_holders))
> +				fast_path = 1;
> +			break;
> +		}
> +		/*
> +		 * If we have another strong holder (we cannot auto-
> demote)
> +		 * we are done. It keeps holding it until it is done.
> +		 */
> +		if (find_first_strong_holder(gl))
> +			break;
> +
> +		/*
> +		 * If we have a weak holder at the head of the list, it
> +		 * (and all others like it) must be auto-demoted. If
> there
> +		 * are no more weak holders, we exit the while loop.
> +		 */
> +		gh = find_first_holder(gl);
> +	}
>  
>  	if (!test_bit(GLF_LFLUSH, &gl->gl_flags) && demote_ok(gl))
>  		gfs2_glock_add_to_lru(gl);
>  
> -	trace_gfs2_glock_queue(gh, 0);
>  	if (unlikely(!fast_path)) {
>  		gl->gl_lockref.count++;
>  		if (test_bit(GLF_PENDING_DEMOTE, &gl->gl_flags) &&
> @@ -1531,6 +1642,19 @@ void gfs2_glock_dq(struct gfs2_holder *gh)
>  			delay = gl->gl_hold_time;
>  		__gfs2_glock_queue_work(gl, delay);
>  	}
> +}
> +
> +/**
> + * gfs2_glock_dq - dequeue a struct gfs2_holder from a glock
> (release a glock)
> + * @gh: the glock holder
> + *
> + */
> +void gfs2_glock_dq(struct gfs2_holder *gh)
> +{
> +	struct gfs2_glock *gl = gh->gh_gl;
> +
> +	spin_lock(&gl->gl_lockref.lock);
> +	__gfs2_glock_dq(gh);
>  	spin_unlock(&gl->gl_lockref.lock);
>  }
>  
> @@ -1693,6 +1817,7 @@ void gfs2_glock_dq_m(unsigned int num_gh,
> struct gfs2_holder *ghs)
>  
>  void gfs2_glock_cb(struct gfs2_glock *gl, unsigned int state)
>  {
> +	struct gfs2_holder mock_gh = { .gh_gl = gl, .gh_state = state,
> };
>  	unsigned long delay = 0;
>  	unsigned long holdtime;
>  	unsigned long now = jiffies;
> @@ -1707,6 +1832,28 @@ void gfs2_glock_cb(struct gfs2_glock *gl,
> unsigned int state)
>  		if (test_bit(GLF_REPLY_PENDING, &gl->gl_flags))
>  			delay = gl->gl_hold_time;
>  	}
> +	/*
> +	 * Note 1: We cannot call demote_incompat_holders from
> handle_callback
> +	 * or gfs2_set_demote due to recursion problems like:
> gfs2_glock_dq ->
> +	 * handle_callback -> demote_incompat_holders -> gfs2_glock_dq
> +	 * Plus, we only want to demote the holders if the request
> comes from
> +	 * a remote cluster node because local holder conflicts are
> resolved
> +	 * elsewhere.
> +	 *
> +	 * Note 2: if a remote node wants this glock in EX mode,
> lock_dlm will
> +	 * request that we set our state to UNLOCKED. Here we mock up a
> holder
> +	 * to make it look like someone wants the lock EX locally. Any
> SH
> +	 * and DF requests should be able to share the lock without
> demoting.
> +	 *
> +	 * Note 3: We only want to demote the demoteable holders when
> there
> +	 * are no more strong holders. The demoteable holders might as
> well
> +	 * keep the glock until the last strong holder is done with it.
> +	 */
> +	if (!find_first_strong_holder(gl)) {
> +		if (state == LM_ST_UNLOCKED)
> +			mock_gh.gh_state = LM_ST_EXCLUSIVE;
> +		demote_incompat_holders(gl, &mock_gh);
> +	}
>  	handle_callback(gl, state, delay, true);
>  	__gfs2_glock_queue_work(gl, delay);
>  	spin_unlock(&gl->gl_lockref.lock);
> @@ -2096,6 +2243,8 @@ static const char *hflags2str(char *buf, u16
> flags, unsigned long iflags)
>  		*p++ = 'H';
>  	if (test_bit(HIF_WAIT, &iflags))
>  		*p++ = 'W';
> +	if (test_bit(HIF_MAY_DEMOTE, &iflags))
> +		*p++ = 'D';
>  	*p = 0;
>  	return buf;
>  }
> diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
> index 31a8f2f649b5..9012487da4c6 100644
> --- a/fs/gfs2/glock.h
> +++ b/fs/gfs2/glock.h
> @@ -150,6 +150,8 @@ static inline struct gfs2_holder
> *gfs2_glock_is_locked_by_me(struct gfs2_glock *
>  	list_for_each_entry(gh, &gl->gl_holders, gh_list) {
>  		if (!test_bit(HIF_HOLDER, &gh->gh_iflags))
>  			break;
> +		if (test_bit(HIF_MAY_DEMOTE, &gh->gh_iflags))
> +			continue;
>  		if (gh->gh_owner_pid == pid)
>  			goto out;
>  	}
> @@ -325,6 +327,24 @@ static inline void glock_clear_object(struct
> gfs2_glock *gl, void *object)
>  	spin_unlock(&gl->gl_lockref.lock);
>  }
>  
> +static inline void gfs2_holder_allow_demote(struct gfs2_holder *gh)
> +{
> +	struct gfs2_glock *gl = gh->gh_gl;
> +
> +	spin_lock(&gl->gl_lockref.lock);
> +	set_bit(HIF_MAY_DEMOTE, &gh->gh_iflags);
> +	spin_unlock(&gl->gl_lockref.lock);
> +}
> +
> +static inline void gfs2_holder_disallow_demote(struct gfs2_holder
> *gh)
> +{
> +	struct gfs2_glock *gl = gh->gh_gl;
> +
> +	spin_lock(&gl->gl_lockref.lock);
> +	clear_bit(HIF_MAY_DEMOTE, &gh->gh_iflags);
> +	spin_unlock(&gl->gl_lockref.lock);
> +}
> +
This looks a bit strange... bit operations are atomic anyway, so why do
we need that spinlock here?

Steve.

>  extern void gfs2_inode_remember_delete(struct gfs2_glock *gl, u64
> generation);
>  extern bool gfs2_inode_already_deleted(struct gfs2_glock *gl, u64
> generation);
>  
> diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
> index 5c6b985254aa..e73a81db0714 100644
> --- a/fs/gfs2/incore.h
> +++ b/fs/gfs2/incore.h
> @@ -252,6 +252,7 @@ struct gfs2_lkstats {
>  
>  enum {
>  	/* States */
> +	HIF_MAY_DEMOTE		= 1,
>  	HIF_HOLDER		= 6,  /* Set for gh that "holds" the
> glock */
>  	HIF_WAIT		= 10,
>  };



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Ocfs2-devel] [Cluster-devel] [PATCH v6 10/19] gfs2: Introduce flag for glock holder auto-demotion
  2021-08-20  9:35   ` [Ocfs2-devel] [Cluster-devel] " Steven Whitehouse
@ 2021-08-20 13:11     ` Bob Peterson
  2021-08-20 13:41       ` Steven Whitehouse
  2021-08-20 15:22       ` Andreas Gruenbacher
  2021-08-20 13:17     ` Andreas Gruenbacher
  1 sibling, 2 replies; 33+ messages in thread
From: Bob Peterson @ 2021-08-20 13:11 UTC (permalink / raw)
  To: Steven Whitehouse, Andreas Gruenbacher, Linus Torvalds,
	Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, linux-kernel, cluster-devel, linux-fsdevel, ocfs2-devel

On 8/20/21 4:35 AM, Steven Whitehouse wrote:
> Hi,
> 
> On Thu, 2021-08-19 at 21:40 +0200, Andreas Gruenbacher wrote:
>> From: Bob Peterson <rpeterso@redhat.com>
>>
>> This patch introduces a new HIF_MAY_DEMOTE flag and infrastructure
>> that
>> will allow glocks to be demoted automatically on locking conflicts.
>> When a locking request comes in that isn't compatible with the
>> locking
>> state of a holder and that holder has the HIF_MAY_DEMOTE flag set,
>> the
>> holder will be demoted automatically before the incoming locking
>> request
>> is granted.
>>
> I'm not sure I understand what is going on here. When there are locking
> conflicts we generate call backs and those result in glock demotion.
> There is no need for a flag to indicate that I think, since it is the
> default behaviour anyway. Or perhaps the explanation is just a bit
> confusing...

I agree that the whole concept and explanation are confusing. Andreas
and I went through several heated arguments about the semantics,
comments, patch descriptions, etc. We played around with many different
flag name ideas, etc. We did not agree on the best way to describe the
whole concept. He didn't like my explanation and I didn't like his. So
yes, it is confusing.

My preferred terminology was "DOD" or "Dequeue On Demand", which makes
the concept more understandable to me. So basically a process can say
"I need to hold this glock for an unknown and possibly lengthy period
of time, but please feel free to dequeue it if it's in your way."
And bear in mind that several processes may do the same, simultaneously.

You can almost think of this as a performance enhancement. This concept 
allows a process to hold a glock for much longer periods of time, at a 
lower priority, for example, when gfs2_file_read_iter needs to hold the 
glock for very long-running iterative reads.

The process requesting a holder with "Dequeue On Demand" must then
determine, after its lengthy operation, whether its holder has been
stolen away (dequeued on demand), and if so, pick up where it left off.

Meanwhile, another process may need to hold the glock. If its requested 
mode is compatible, say SH and SH, the lock is simply granted with no 
further delay. If the mode is incompatible, regardless of whether it's 
on the local node or a different node in the cluster, these 
longer-term/lower-priority holders may be dequeued or preempted by 
another request to hold the glock. Note that although these holders are 
dequeued-on-demand, they are never "uninitted" as part of the process. 
Nor must they ever be, since they may be on another process's heap.

This differs from the normal glock demote process, in which the demote
bit is set (i.e. the glock is "requested" to be demoted) but the demote
still has to wait until the holder does its actual dequeue.
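
For illustration, the caller-side pattern being described might look
roughly like this (a simplified sketch only, not the actual patch code;
gfs2_holder_queued() is assumed here as the "is the holder still queued"
check that the patch description refers to):

static int example_lengthy_operation(struct gfs2_glock *gl)
{
	struct gfs2_holder gh;
	int error;

	gfs2_holder_init(gl, LM_ST_SHARED, 0, &gh);
	error = gfs2_glock_nq(&gh);
	if (error)
		goto out_uninit;

	/* The glock core may now dequeue this holder if it gets in the way. */
	gfs2_holder_allow_demote(&gh);

	/* ... lengthy, lower-priority work goes here ... */

	/* We need the glock again; stop it from being taken away under us. */
	gfs2_holder_disallow_demote(&gh);

	if (!gfs2_holder_queued(&gh)) {
		/*
		 * The holder was dequeued on demand.  Re-acquire the glock
		 * (or return a short result) and pick up where we left off.
		 */
		error = gfs2_glock_nq(&gh);
		if (error)
			goto out_uninit;
	}

	gfs2_glock_dq(&gh);
out_uninit:
	gfs2_holder_uninit(&gh);
	return error;
}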

>> Processes that allow a glock holder to be taken away indicate this by
>> calling gfs2_holder_allow_demote().  When they need the glock again,
>> they call gfs2_holder_disallow_demote() and then they check if the
>> holder is still queued: if it is, they're still holding the glock; if
>> it
>> isn't, they need to re-acquire the glock.
>>
>> This allows processes to hang on to locks that could become part of a
>> cyclic locking dependency.  The locks will be given up when a (rare)
>> conflicting locking request occurs, and don't need to be given up
>> prematurely.
> This seems backwards to me. We already have the glock layer cache the
> locks until they are required by another node. We also have the min
> hold time to make sure that we don't bounce locks too much. So what is
> the problem that you are trying to solve here I wonder?

Again, this simply allows preemption of lengthy/low-priority holders,
whereas the normal demote process will only demote once the glock is
dequeued after this potentially very long period of time.

The minimum hold time solves a different problem, and Andreas and I 
talked just yesterday about possibly revisiting how that all works. The 
problem with minimum hold time is that in many cases the glock state 
machine does not want to grant new holders if the demote bit is on, so 
it often ends up wasting time rather than solving the actual problem.
But that's another problem for another day.

Regards,

Bob Peterson



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Ocfs2-devel] [Cluster-devel] [PATCH v6 10/19] gfs2: Introduce flag for glock holder auto-demotion
  2021-08-20  9:35   ` [Ocfs2-devel] [Cluster-devel] " Steven Whitehouse
  2021-08-20 13:11     ` Bob Peterson
@ 2021-08-20 13:17     ` Andreas Gruenbacher
  2021-08-20 13:47       ` Steven Whitehouse
  1 sibling, 1 reply; 33+ messages in thread
From: Andreas Gruenbacher @ 2021-08-20 13:17 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: cluster-devel, Jan Kara, LKML, Christoph Hellwig, Alexander Viro,
	linux-fsdevel, Linus Torvalds, ocfs2-devel

On Fri, Aug 20, 2021 at 11:35 AM Steven Whitehouse <swhiteho@redhat.com> wrote:
> On Thu, 2021-08-19 at 21:40 +0200, Andreas Gruenbacher wrote:
> > From: Bob Peterson <rpeterso@redhat.com>
> >
> > This patch introduces a new HIF_MAY_DEMOTE flag and infrastructure
> > that will allow glocks to be demoted automatically on locking conflicts.
> > When a locking request comes in that isn't compatible with the locking
> > state of a holder and that holder has the HIF_MAY_DEMOTE flag set, the
> > holder will be demoted automatically before the incoming locking request
> > is granted.
>
> I'm not sure I understand what is going on here. When there are locking
> conflicts we generate call backs and those result in glock demotion.
> There is no need for a flag to indicate that I think, since it is the
> default behaviour anyway. Or perhaps the explanation is just a bit
> confusing...

When a glock has active holders (with the HIF_HOLDER flag set), the
glock won't be demoted to a state incompatible with any of those
holders.

> > Processes that allow a glock holder to be taken away indicate this by
> > calling gfs2_holder_allow_demote().  When they need the glock again,
> > they call gfs2_holder_disallow_demote() and then they check if the
> > holder is still queued: if it is, they're still holding the glock; if
> > it isn't, they need to re-acquire the glock.
> >
> > This allows processes to hang on to locks that could become part of a
> > cyclic locking dependency.  The locks will be given up when a (rare)
> > conflicting locking request occurs, and don't need to be given up
> > prematurely.
>
> This seems backwards to me. We already have the glock layer cache the
> locks until they are required by another node. We also have the min
> hold time to make sure that we don't bounce locks too much. So what is
> the problem that you are trying to solve here I wonder?

This solves the problem of faulting in pages during read and write
operations: on the one hand, we want to hold the inode glock across
those operations. On the other hand, those operations may fault in
pages, which may require taking the same or other inode glocks,
directly or indirectly, which can deadlock.

So before we fault in pages, we indicate with
gfs2_holder_allow_demote(gh) that we can cope if the glock is taken
away from us. After faulting in the pages, we indicate with
gfs2_holder_disallow_demote(gh) that we now actually need the glock
again. At that point, we either still have the glock (i.e., the holder
is still queued and it has the HIF_HOLDER flag set), or we don't.

The different kinds of read and write operations differ in how they
handle the latter case:

 * When a buffered read or write loses the inode glock, it returns a
   short result. This prevents torn writes and reading things that have
   never existed on disk in that form.

 * When a direct read or write loses the inode glock, it re-acquires it
   before resuming the operation. Direct I/O is not expected to return
   partial results and doesn't provide any kind of synchronization among
   processes.
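
To make the buffered case a bit more concrete, the fault-in window
roughly looks like this (a simplified sketch, not the actual patch
code; fault_in_iov_iter_writeable() is the helper introduced earlier in
this series and returns the number of bytes that could not be faulted
in; "to", "window", and "copied" stand in for the iov_iter, the number
of bytes we attempt to fault in, and the progress made so far):

	gfs2_holder_allow_demote(gh);
	leftover = fault_in_iov_iter_writeable(to, window);
	gfs2_holder_disallow_demote(gh);

	if (leftover == window) {
		/* Nothing could be faulted in: give up. */
	} else if (gfs2_holder_queued(gh)) {
		/* We still hold the inode glock: retry the copy. */
	} else {
		/* The glock was taken away: return the short result,
		 * i.e. whatever was copied so far. */
	}

In the direct I/O case, the last branch would instead re-acquire the
glock with gfs2_glock_nq() and resume the operation.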

We could solve this kind of problem in other ways, for example, by
keeping a glock generation number, dropping the glock before faulting
in pages, re-acquiring it afterwards, and checking if the generation
number has changed. This would still be an additional piece of glock
infrastructure, but more heavyweight than the HIF_MAY_DEMOTE flag
which uses the existing glock holder infrastructure.
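
For comparison, that alternative might look something like the sketch
below; it is purely hypothetical: there is no gl_generation field
today, it would have to be added and bumped whenever the glock is
demoted:

	gen = gl->gl_generation;	/* hypothetical per-glock counter */
	gfs2_glock_dq(gh);
	fault_in_iov_iter_writeable(to, window);
	error = gfs2_glock_nq(gh);
	if (!error && gl->gl_generation != gen) {
		/*
		 * Another holder got in between: anything derived from
		 * the glock (cached pages, file size, ...) may be stale
		 * and would have to be revalidated.
		 */
	}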

> > Signed-off-by: Bob Peterson <rpeterso@redhat.com>
> > ---
> >  fs/gfs2/glock.c  | 221 +++++++++++++++++++++++++++++++++++++++----
> > ----
> >  fs/gfs2/glock.h  |  20 +++++
> >  fs/gfs2/incore.h |   1 +
> >  3 files changed, 206 insertions(+), 36 deletions(-)
> >
> > diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
> > index f24db2ececfb..d1b06a09ce2f 100644
> > --- a/fs/gfs2/glock.c
> > +++ b/fs/gfs2/glock.c
> > @@ -58,6 +58,7 @@ struct gfs2_glock_iter {
> >  typedef void (*glock_examiner) (struct gfs2_glock * gl);
> >
> >  static void do_xmote(struct gfs2_glock *gl, struct gfs2_holder *gh,
> > unsigned int target);
> > +static void __gfs2_glock_dq(struct gfs2_holder *gh);
> >
> >  static struct dentry *gfs2_root;
> >  static struct workqueue_struct *glock_workqueue;
> > @@ -197,6 +198,12 @@ static int demote_ok(const struct gfs2_glock
> > *gl)
> >
> >       if (gl->gl_state == LM_ST_UNLOCKED)
> >               return 0;
> > +     /*
> > +      * Note that demote_ok is used for the lru process of disposing
> > of
> > +      * glocks. For this purpose, we don't care if the glock's
> > holders
> > +      * have the HIF_MAY_DEMOTE flag set or not. If someone is using
> > +      * them, don't demote.
> > +      */
> >       if (!list_empty(&gl->gl_holders))
> >               return 0;
> >       if (glops->go_demote_ok)
> > @@ -379,7 +386,7 @@ static void do_error(struct gfs2_glock *gl, const
> > int ret)
> >       struct gfs2_holder *gh, *tmp;
> >
> >       list_for_each_entry_safe(gh, tmp, &gl->gl_holders, gh_list) {
> > -             if (test_bit(HIF_HOLDER, &gh->gh_iflags))
> > +             if (!test_bit(HIF_WAIT, &gh->gh_iflags))
> >                       continue;
> >               if (ret & LM_OUT_ERROR)
> >                       gh->gh_error = -EIO;
> > @@ -393,6 +400,40 @@ static void do_error(struct gfs2_glock *gl,
> > const int ret)
> >       }
> >  }
> >
> > +/**
> > + * demote_incompat_holders - demote incompatible demoteable holders
> > + * @gl: the glock we want to promote
> > + * @new_gh: the new holder to be promoted
> > + */
> > +static void demote_incompat_holders(struct gfs2_glock *gl,
> > +                                 struct gfs2_holder *new_gh)
> > +{
> > +     struct gfs2_holder *gh;
> > +
> > +     /*
> > +      * Demote incompatible holders before we make ourselves
> > eligible.
> > +      * (This holder may or may not allow auto-demoting, but we
> > don't want
> > +      * to demote the new holder before it's even granted.)
> > +      */
> > +     list_for_each_entry(gh, &gl->gl_holders, gh_list) {
> > +             /*
> > +              * Since holders are at the front of the list, we stop
> > when we
> > +              * find the first non-holder.
> > +              */
> > +             if (!test_bit(HIF_HOLDER, &gh->gh_iflags))
> > +                     return;
> > +             if (test_bit(HIF_MAY_DEMOTE, &gh->gh_iflags) &&
> > +                 !may_grant(gl, new_gh, gh)) {
> > +                     /*
> > +                      * We should not recurse into do_promote
> > because
> > +                      * __gfs2_glock_dq only calls handle_callback,
> > +                      * gfs2_glock_add_to_lru and
> > __gfs2_glock_queue_work.
> > +                      */
> > +                     __gfs2_glock_dq(gh);
> > +             }
> > +     }
> > +}
> > +
> >  /**
> >   * find_first_holder - find the first "holder" gh
> >   * @gl: the glock
> > @@ -411,6 +452,26 @@ static inline struct gfs2_holder
> > *find_first_holder(const struct gfs2_glock *gl)
> >       return NULL;
> >  }
> >
> > +/**
> > + * find_first_strong_holder - find the first non-demoteable holder
> > + * @gl: the glock
> > + *
> > + * Find the first holder that doesn't have the HIF_MAY_DEMOTE flag
> > set.
> > + */
> > +static inline struct gfs2_holder
> > +*find_first_strong_holder(struct gfs2_glock *gl)
> > +{
> > +     struct gfs2_holder *gh;
> > +
> > +     list_for_each_entry(gh, &gl->gl_holders, gh_list) {
> > +             if (!test_bit(HIF_HOLDER, &gh->gh_iflags))
> > +                     return NULL;
> > +             if (!test_bit(HIF_MAY_DEMOTE, &gh->gh_iflags))
> > +                     return gh;
> > +     }
> > +     return NULL;
> > +}
> > +
> >  /**
> >   * do_promote - promote as many requests as possible on the current
> > queue
> >   * @gl: The glock
> > @@ -425,15 +486,27 @@ __acquires(&gl->gl_lockref.lock)
> >  {
> >       const struct gfs2_glock_operations *glops = gl->gl_ops;
> >       struct gfs2_holder *gh, *tmp, *first_gh;
> > +     bool incompat_holders_demoted = false;
> >       int ret;
> >
> > -     first_gh = find_first_holder(gl);
> > +     first_gh = find_first_strong_holder(gl);
> >
> >  restart:
> >       list_for_each_entry_safe(gh, tmp, &gl->gl_holders, gh_list) {
> > -             if (test_bit(HIF_HOLDER, &gh->gh_iflags))
> > +             if (!test_bit(HIF_WAIT, &gh->gh_iflags))
> >                       continue;
> >               if (may_grant(gl, first_gh, gh)) {
> > +                     if (!incompat_holders_demoted) {
> > +                             demote_incompat_holders(gl, first_gh);
> > +                             incompat_holders_demoted = true;
> > +                             first_gh = gh;
> > +                     }
> > +                     /*
> > +                      * The first holder (and only the first holder)
> > on the
> > +                      * list to be promoted needs to call the
> > go_lock
> > +                      * function. This does things like
> > inode_refresh
> > +                      * to read an inode from disk.
> > +                      */
> >                       if (gh->gh_list.prev == &gl->gl_holders &&
> >                           glops->go_lock) {
> >                               spin_unlock(&gl->gl_lockref.lock);
> > @@ -459,6 +532,11 @@ __acquires(&gl->gl_lockref.lock)
> >                       gfs2_holder_wake(gh);
> >                       continue;
> >               }
> > +             /*
> > +              * If we get here, it means we may not grant this
> > holder for
> > +              * some reason. If this holder is the head of the list,
> > it
> > +              * means we have a blocked holder at the head, so
> > return 1.
> > +              */
> >               if (gh->gh_list.prev == &gl->gl_holders)
> >                       return 1;
> >               do_error(gl, 0);
> > @@ -1373,7 +1451,7 @@ __acquires(&gl->gl_lockref.lock)
> >               if (test_bit(GLF_LOCK, &gl->gl_flags)) {
> >                       struct gfs2_holder *first_gh;
> >
> > -                     first_gh = find_first_holder(gl);
> > +                     first_gh = find_first_strong_holder(gl);
> >                       try_futile = !may_grant(gl, first_gh, gh);
> >               }
> >               if (test_bit(GLF_INVALIDATE_IN_PROGRESS, &gl-
> > >gl_flags))
> > @@ -1382,7 +1460,8 @@ __acquires(&gl->gl_lockref.lock)
> >
> >       list_for_each_entry(gh2, &gl->gl_holders, gh_list) {
> >               if (unlikely(gh2->gh_owner_pid == gh->gh_owner_pid &&
> > -                 (gh->gh_gl->gl_ops->go_type != LM_TYPE_FLOCK)))
> > +                 (gh->gh_gl->gl_ops->go_type != LM_TYPE_FLOCK) &&
> > +                 !test_bit(HIF_MAY_DEMOTE, &gh2->gh_iflags)))
> >                       goto trap_recursive;
> >               if (try_futile &&
> >                   !(gh2->gh_flags & (LM_FLAG_TRY | LM_FLAG_TRY_1CB)))
> > {
> > @@ -1478,51 +1557,83 @@ int gfs2_glock_poll(struct gfs2_holder *gh)
> >       return test_bit(HIF_WAIT, &gh->gh_iflags) ? 0 : 1;
> >  }
> >
> > -/**
> > - * gfs2_glock_dq - dequeue a struct gfs2_holder from a glock
> > (release a glock)
> > - * @gh: the glock holder
> > - *
> > - */
> > +static inline bool needs_demote(struct gfs2_glock *gl)
> > +{
> > +     return (test_bit(GLF_DEMOTE, &gl->gl_flags) ||
> > +             test_bit(GLF_PENDING_DEMOTE, &gl->gl_flags));
> > +}
> >
> > -void gfs2_glock_dq(struct gfs2_holder *gh)
> > +static void __gfs2_glock_dq(struct gfs2_holder *gh)
> >  {
> >       struct gfs2_glock *gl = gh->gh_gl;
> >       struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
> >       unsigned delay = 0;
> >       int fast_path = 0;
> >
> > -     spin_lock(&gl->gl_lockref.lock);
> >       /*
> > -      * If we're in the process of file system withdraw, we cannot
> > just
> > -      * dequeue any glocks until our journal is recovered, lest we
> > -      * introduce file system corruption. We need two exceptions to
> > this
> > -      * rule: We need to allow unlocking of nondisk glocks and the
> > glock
> > -      * for our own journal that needs recovery.
> > +      * This while loop is similar to function
> > demote_incompat_holders:
> > +      * If the glock is due to be demoted (which may be from another
> > node
> > +      * or even if this holder is GL_NOCACHE), the weak holders are
> > +      * demoted as well, allowing the glock to be demoted.
> >        */
> > -     if (test_bit(SDF_WITHDRAW_RECOVERY, &sdp->sd_flags) &&
> > -         glock_blocked_by_withdraw(gl) &&
> > -         gh->gh_gl != sdp->sd_jinode_gl) {
> > -             sdp->sd_glock_dqs_held++;
> > -             spin_unlock(&gl->gl_lockref.lock);
> > -             might_sleep();
> > -             wait_on_bit(&sdp->sd_flags, SDF_WITHDRAW_RECOVERY,
> > -                         TASK_UNINTERRUPTIBLE);
> > -             spin_lock(&gl->gl_lockref.lock);
> > -     }
> > -     if (gh->gh_flags & GL_NOCACHE)
> > -             handle_callback(gl, LM_ST_UNLOCKED, 0, false);
> > +     while (gh) {
> > +             /*
> > +              * If we're in the process of file system withdraw, we
> > cannot
> > +              * just dequeue any glocks until our journal is
> > recovered, lest
> > +              * we introduce file system corruption. We need two
> > exceptions
> > +              * to this rule: We need to allow unlocking of nondisk
> > glocks
> > +              * and the glock for our own journal that needs
> > recovery.
> > +              */
> > +             if (test_bit(SDF_WITHDRAW_RECOVERY, &sdp->sd_flags) &&
> > +                 glock_blocked_by_withdraw(gl) &&
> > +                 gh->gh_gl != sdp->sd_jinode_gl) {
> > +                     sdp->sd_glock_dqs_held++;
> > +                     spin_unlock(&gl->gl_lockref.lock);
> > +                     might_sleep();
> > +                     wait_on_bit(&sdp->sd_flags,
> > SDF_WITHDRAW_RECOVERY,
> > +                                 TASK_UNINTERRUPTIBLE);
> > +                     spin_lock(&gl->gl_lockref.lock);
> > +             }
> > +
> > +             /*
> > +              * This holder should not be cached, so mark it for
> > demote.
> > +              * Note: this should be done before the check for
> > needs_demote
> > +              * below.
> > +              */
> > +             if (gh->gh_flags & GL_NOCACHE)
> > +                     handle_callback(gl, LM_ST_UNLOCKED, 0, false);
> >
> > -     list_del_init(&gh->gh_list);
> > -     clear_bit(HIF_HOLDER, &gh->gh_iflags);
> > -     if (list_empty(&gl->gl_holders) &&
> > -         !test_bit(GLF_PENDING_DEMOTE, &gl->gl_flags) &&
> > -         !test_bit(GLF_DEMOTE, &gl->gl_flags))
> > -             fast_path = 1;
> > +             list_del_init(&gh->gh_list);
> > +             clear_bit(HIF_HOLDER, &gh->gh_iflags);
> > +             trace_gfs2_glock_queue(gh, 0);
> > +
> > +             /*
> > +              * If there hasn't been a demote request we are done.
> > +              * (Let the remaining holders, if any, keep holding
> > it.)
> > +              */
> > +             if (!needs_demote(gl)) {
> > +                     if (list_empty(&gl->gl_holders))
> > +                             fast_path = 1;
> > +                     break;
> > +             }
> > +             /*
> > +              * If we have another strong holder (we cannot auto-
> > demote)
> > +              * we are done. It keeps holding it until it is done.
> > +              */
> > +             if (find_first_strong_holder(gl))
> > +                     break;
> > +
> > +             /*
> > +              * If we have a weak holder at the head of the list, it
> > +              * (and all others like it) must be auto-demoted. If
> > there
> > +              * are no more weak holders, we exit the while loop.
> > +              */
> > +             gh = find_first_holder(gl);
> > +     }
> >
> >       if (!test_bit(GLF_LFLUSH, &gl->gl_flags) && demote_ok(gl))
> >               gfs2_glock_add_to_lru(gl);
> >
> > -     trace_gfs2_glock_queue(gh, 0);
> >       if (unlikely(!fast_path)) {
> >               gl->gl_lockref.count++;
> >               if (test_bit(GLF_PENDING_DEMOTE, &gl->gl_flags) &&
> > @@ -1531,6 +1642,19 @@ void gfs2_glock_dq(struct gfs2_holder *gh)
> >                       delay = gl->gl_hold_time;
> >               __gfs2_glock_queue_work(gl, delay);
> >       }
> > +}
> > +
> > +/**
> > + * gfs2_glock_dq - dequeue a struct gfs2_holder from a glock
> > (release a glock)
> > + * @gh: the glock holder
> > + *
> > + */
> > +void gfs2_glock_dq(struct gfs2_holder *gh)
> > +{
> > +     struct gfs2_glock *gl = gh->gh_gl;
> > +
> > +     spin_lock(&gl->gl_lockref.lock);
> > +     __gfs2_glock_dq(gh);
> >       spin_unlock(&gl->gl_lockref.lock);
> >  }
> >
> > @@ -1693,6 +1817,7 @@ void gfs2_glock_dq_m(unsigned int num_gh,
> > struct gfs2_holder *ghs)
> >
> >  void gfs2_glock_cb(struct gfs2_glock *gl, unsigned int state)
> >  {
> > +     struct gfs2_holder mock_gh = { .gh_gl = gl, .gh_state = state,
> > };
> >       unsigned long delay = 0;
> >       unsigned long holdtime;
> >       unsigned long now = jiffies;
> > @@ -1707,6 +1832,28 @@ void gfs2_glock_cb(struct gfs2_glock *gl,
> > unsigned int state)
> >               if (test_bit(GLF_REPLY_PENDING, &gl->gl_flags))
> >                       delay = gl->gl_hold_time;
> >       }
> > +     /*
> > +      * Note 1: We cannot call demote_incompat_holders from
> > handle_callback
> > +      * or gfs2_set_demote due to recursion problems like:
> > gfs2_glock_dq ->
> > +      * handle_callback -> demote_incompat_holders -> gfs2_glock_dq
> > +      * Plus, we only want to demote the holders if the request
> > comes from
> > +      * a remote cluster node because local holder conflicts are
> > resolved
> > +      * elsewhere.
> > +      *
> > +      * Note 2: if a remote node wants this glock in EX mode,
> > lock_dlm will
> > +      * request that we set our state to UNLOCKED. Here we mock up a
> > holder
> > +      * to make it look like someone wants the lock EX locally. Any
> > SH
> > +      * and DF requests should be able to share the lock without
> > demoting.
> > +      *
> > +      * Note 3: We only want to demote the demoteable holders when
> > there
> > +      * are no more strong holders. The demoteable holders might as
> > well
> > +      * keep the glock until the last strong holder is done with it.
> > +      */
> > +     if (!find_first_strong_holder(gl)) {
> > +             if (state == LM_ST_UNLOCKED)
> > +                     mock_gh.gh_state = LM_ST_EXCLUSIVE;
> > +             demote_incompat_holders(gl, &mock_gh);
> > +     }
> >       handle_callback(gl, state, delay, true);
> >       __gfs2_glock_queue_work(gl, delay);
> >       spin_unlock(&gl->gl_lockref.lock);
> > @@ -2096,6 +2243,8 @@ static const char *hflags2str(char *buf, u16
> > flags, unsigned long iflags)
> >               *p++ = 'H';
> >       if (test_bit(HIF_WAIT, &iflags))
> >               *p++ = 'W';
> > +     if (test_bit(HIF_MAY_DEMOTE, &iflags))
> > +             *p++ = 'D';
> >       *p = 0;
> >       return buf;
> >  }
> > diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
> > index 31a8f2f649b5..9012487da4c6 100644
> > --- a/fs/gfs2/glock.h
> > +++ b/fs/gfs2/glock.h
> > @@ -150,6 +150,8 @@ static inline struct gfs2_holder
> > *gfs2_glock_is_locked_by_me(struct gfs2_glock *
> >       list_for_each_entry(gh, &gl->gl_holders, gh_list) {
> >               if (!test_bit(HIF_HOLDER, &gh->gh_iflags))
> >                       break;
> > +             if (test_bit(HIF_MAY_DEMOTE, &gh->gh_iflags))
> > +                     continue;
> >               if (gh->gh_owner_pid == pid)
> >                       goto out;
> >       }
> > @@ -325,6 +327,24 @@ static inline void glock_clear_object(struct
> > gfs2_glock *gl, void *object)
> >       spin_unlock(&gl->gl_lockref.lock);
> >  }
> >
> > +static inline void gfs2_holder_allow_demote(struct gfs2_holder *gh)
> > +{
> > +     struct gfs2_glock *gl = gh->gh_gl;
> > +
> > +     spin_lock(&gl->gl_lockref.lock);
> > +     set_bit(HIF_MAY_DEMOTE, &gh->gh_iflags);
> > +     spin_unlock(&gl->gl_lockref.lock);
> > +}
> > +
> > +static inline void gfs2_holder_disallow_demote(struct gfs2_holder
> > *gh)
> > +{
> > +     struct gfs2_glock *gl = gh->gh_gl;
> > +
> > +     spin_lock(&gl->gl_lockref.lock);
> > +     clear_bit(HIF_MAY_DEMOTE, &gh->gh_iflags);
> > +     spin_unlock(&gl->gl_lockref.lock);
> > +}
> > +
>
> This looks a bit strange... bit operations are atomic anyway, so why do
> we need that spinlock here?

This is about making sure that the glock state engine will make
consistent decisions. Currently, those decisions are made under that
spin lock. We could set the HIF_MAY_DEMOTE flag followed by a memory
barrier and the glock state engine would *probably* still make the
right decisions most of the time, but that's not easy to ensure
anymore.

We surely want to prevent the glock state engine from making changes
while clearing the flag, though.

> Steve.
>
> >  extern void gfs2_inode_remember_delete(struct gfs2_glock *gl, u64
> > generation);
> >  extern bool gfs2_inode_already_deleted(struct gfs2_glock *gl, u64
> > generation);
> >
> > diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
> > index 5c6b985254aa..e73a81db0714 100644
> > --- a/fs/gfs2/incore.h
> > +++ b/fs/gfs2/incore.h
> > @@ -252,6 +252,7 @@ struct gfs2_lkstats {
> >
> >  enum {
> >       /* States */
> > +     HIF_MAY_DEMOTE          = 1,
> >       HIF_HOLDER              = 6,  /* Set for gh that "holds" the
> > glock */
> >       HIF_WAIT                = 10,
> >  };
>

Thanks,
Andreas



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Ocfs2-devel] [Cluster-devel] [PATCH v6 10/19] gfs2: Introduce flag for glock holder auto-demotion
  2021-08-20 13:11     ` Bob Peterson
@ 2021-08-20 13:41       ` Steven Whitehouse
  2021-08-20 15:22       ` Andreas Gruenbacher
  1 sibling, 0 replies; 33+ messages in thread
From: Steven Whitehouse @ 2021-08-20 13:41 UTC (permalink / raw)
  To: Bob Peterson, Andreas Gruenbacher, Linus Torvalds,
	Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, linux-kernel, cluster-devel, linux-fsdevel, ocfs2-devel

Hi,

On Fri, 2021-08-20 at 08:11 -0500, Bob Peterson wrote:
> On 8/20/21 4:35 AM, Steven Whitehouse wrote:
> > Hi,
> > 
> > On Thu, 2021-08-19 at 21:40 +0200, Andreas Gruenbacher wrote:
> > > From: Bob Peterson <rpeterso@redhat.com>
> > > 
> > > This patch introduces a new HIF_MAY_DEMOTE flag and
> > > infrastructure
> > > that
> > > will allow glocks to be demoted automatically on locking
> > > conflicts.
> > > When a locking request comes in that isn't compatible with the
> > > locking
> > > state of a holder and that holder has the HIF_MAY_DEMOTE flag
> > > set,
> > > the
> > > holder will be demoted automatically before the incoming locking
> > > request
> > > is granted.
> > > 
> > I'm not sure I understand what is going on here. When there are
> > locking
> > conflicts we generate call backs and those result in glock
> > demotion.
> > There is no need for a flag to indicate that I think, since it is
> > the
> > default behaviour anyway. Or perhaps the explanation is just a bit
> > confusing...
> 
> I agree that the whole concept and explanation are confusing.
> Andreas 
> and I went through several heated arguments about the symantics, 
> comments, patch descriptions, etc. We played around with many
> different 
> flag name ideas, etc. We did not agree on the best way to describe
> the 
> whole concept. He didn't like my explanation and I didn't like his.
> So 
> yes, it is confusing.
> 
That seems to be a good reason to take a step back and look at this a
bit closer. If we are finding this confusing, then someone else looking
at it at a future date, who may not be steeped in GFS2 knowledge, is
likely to find it almost impossible.

So at least the description needs some work here, I think, to make it
much clearer what the overall aim is. It would be good to start with a
statement of the problem that it is trying to solve, which Andreas has
hinted at in his reply just now,

Steve.




^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Ocfs2-devel] [Cluster-devel] [PATCH v6 10/19] gfs2: Introduce flag for glock holder auto-demotion
  2021-08-20 13:17     ` Andreas Gruenbacher
@ 2021-08-20 13:47       ` Steven Whitehouse
  2021-08-20 14:43         ` Andreas Grünbacher
  0 siblings, 1 reply; 33+ messages in thread
From: Steven Whitehouse @ 2021-08-20 13:47 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: cluster-devel, Jan Kara, LKML, Christoph Hellwig, Alexander Viro,
	linux-fsdevel, Linus Torvalds, ocfs2-devel

Hi,

On Fri, 2021-08-20 at 15:17 +0200, Andreas Gruenbacher wrote:
> On Fri, Aug 20, 2021 at 11:35 AM Steven Whitehouse <
> swhiteho@redhat.com> wrote:
> > On Thu, 2021-08-19 at 21:40 +0200, Andreas Gruenbacher wrote:
> > > From: Bob Peterson <rpeterso@redhat.com>
> > > 
> > > This patch introduces a new HIF_MAY_DEMOTE flag and
> > > infrastructure
> > > that will allow glocks to be demoted automatically on locking
> > > conflicts.
> > > When a locking request comes in that isn't compatible with the
> > > locking
> > > state of a holder and that holder has the HIF_MAY_DEMOTE flag
> > > set, the
> > > holder will be demoted automatically before the incoming locking
> > > request
> > > is granted.
> > 
> > I'm not sure I understand what is going on here. When there are
> > locking
> > conflicts we generate call backs and those result in glock
> > demotion.
> > There is no need for a flag to indicate that I think, since it is
> > the
> > default behaviour anyway. Or perhaps the explanation is just a bit
> > confusing...
> 
> When a glock has active holders (with the HIF_HOLDER flag set), the
> glock won't be demoted to a state incompatible with any of those
> holders.
> 
Ok, that is a much clearer explanation of what the patch does. Active
holders have always prevented demotions previously.

> > > Processes that allow a glock holder to be taken away indicate
> > > this by
> > > calling gfs2_holder_allow_demote().  When they need the glock
> > > again,
> > > they call gfs2_holder_disallow_demote() and then they check if
> > > the
> > > holder is still queued: if it is, they're still holding the
> > > glock; if
> > > it isn't, they need to re-acquire the glock.
> > > 
> > > This allows processes to hang on to locks that could become part
> > > of a
> > > cyclic locking dependency.  The locks will be given up when a
> > > (rare)
> > > conflicting locking request occurs, and don't need to be given up
> > > prematurely.
> > 
> > This seems backwards to me. We already have the glock layer cache
> > the
> > locks until they are required by another node. We also have the min
> > hold time to make sure that we don't bounce locks too much. So what
> > is
> > the problem that you are trying to solve here I wonder?
> 
> This solves the problem of faulting in pages during read and write
> operations: on the one hand, we want to hold the inode glock across
> those operations. On the other hand, those operations may fault in
> pages, which may require taking the same or other inode glocks,
> directly or indirectly, which can deadlock.
> 
> So before we fault in pages, we indicate with
> gfs2_holder_allow_demote(gh) that we can cope if the glock is taken
> away from us. After faulting in the pages, we indicate with
> gfs2_holder_disallow_demote(gh) that we now actually need the glock
> again. At that point, we either still have the glock (i.e., the
> holder
> is still queued and it has the HIF_HOLDER flag set), or we don't.
> 
> The different kinds of read and write operations differ in how they
> handle the latter case:
> 
>  * When a buffered read or write loses the inode glock, it returns a
> short result. This
>    prevents torn writes and reading things that have never existed on
> disk in that form.
> 
>  * When a direct read or write loses the inode glock, it re-acquires
> it before resuming
>    the operation. Direct I/O is not expected to return partial
> results
> and doesn't provide
>    any kind of synchronization among processes.
> 
> We could solve this kind of problem in other ways, for example, by
> keeping a glock generation number, dropping the glock before faulting
> in pages, re-acquiring it afterwards, and checking if the generation
> number has changed. This would still be an additional piece of glock
> infrastructure, but more heavyweight than the HIF_MAY_DEMOTE flag
> which uses the existing glock holder infrastructure.

This is working towards the "why" but could probably be summarised a
bit more. We always used to manage to avoid holding fs locks when
copying to/from userspace to avoid these complications. If that is no
longer possible then it would be good to document what the new
expectations are somewhere suitable in Documentation/filesystems/...
just so we make sure it is clear what the new system is, and everyone
will be on the same page,

Steve.





^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Ocfs2-devel] [Cluster-devel] [PATCH v6 10/19] gfs2: Introduce flag for glock holder auto-demotion
  2021-08-20 13:47       ` Steven Whitehouse
@ 2021-08-20 14:43         ` Andreas Grünbacher
  0 siblings, 0 replies; 33+ messages in thread
From: Andreas Grünbacher @ 2021-08-20 14:43 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: cluster-devel, Jan Kara, Andreas Gruenbacher, LKML,
	Christoph Hellwig, Alexander Viro, linux-fsdevel, Linus Torvalds,
	ocfs2-devel

On Fri, Aug 20, 2021 at 3:49 PM Steven Whitehouse
<swhiteho@redhat.com> wrote:
> We always used to manage to avoid holding fs locks when copying to/from userspace
> to avoid these complications.

I realize the intent, but that goal has never actually been achieved.
Direct I/O has *always* been calling get_user_pages() while holding
the inode glock in deferred mode.

The situation is slightly different for buffered I/O, but we have the
same problem there at least since switching to iomap. (And it's a
fundamental property of iomap that we want to hold the inode glock
across multi-page mappings, not an implementation deficit.)

Here's a slightly outdated version of a test case that makes all
versions of gfs2 prior to this patch queue unhappy:
https://lore.kernel.org/fstests/20210531152604.240462-1-agruenba@redhat.com/

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Ocfs2-devel] [Cluster-devel] [PATCH v6 10/19] gfs2: Introduce flag for glock holder auto-demotion
  2021-08-20 13:11     ` Bob Peterson
  2021-08-20 13:41       ` Steven Whitehouse
@ 2021-08-20 15:22       ` Andreas Gruenbacher
  2021-08-23  8:14         ` Steven Whitehouse
  1 sibling, 1 reply; 33+ messages in thread
From: Andreas Gruenbacher @ 2021-08-20 15:22 UTC (permalink / raw)
  To: Bob Peterson
  Cc: cluster-devel, Jan Kara, LKML, Christoph Hellwig, Alexander Viro,
	linux-fsdevel, Linus Torvalds, Steven Whitehouse, ocfs2-devel

On Fri, Aug 20, 2021 at 3:11 PM Bob Peterson <rpeterso@redhat.com> wrote:
> On 8/20/21 4:35 AM, Steven Whitehouse wrote:
> > Hi,
> >
> > On Thu, 2021-08-19 at 21:40 +0200, Andreas Gruenbacher wrote:
> >> From: Bob Peterson <rpeterso@redhat.com>
> >>
> >> This patch introduces a new HIF_MAY_DEMOTE flag and infrastructure
> >> that
> >> will allow glocks to be demoted automatically on locking conflicts.
> >> When a locking request comes in that isn't compatible with the
> >> locking
> >> state of a holder and that holder has the HIF_MAY_DEMOTE flag set,
> >> the
> >> holder will be demoted automatically before the incoming locking
> >> request
> >> is granted.
> >>
> > I'm not sure I understand what is going on here. When there are locking
> > conflicts we generate call backs and those result in glock demotion.
> > There is no need for a flag to indicate that I think, since it is the
> > default behaviour anyway. Or perhaps the explanation is just a bit
> > confusing...
>
> I agree that the whole concept and explanation are confusing. Andreas
> and I went through several heated arguments about the semantics,
> comments, patch descriptions, etc. We played around with many different
> flag name ideas, etc. We did not agree on the best way to describe the
> whole concept. He didn't like my explanation and I didn't like his. So
> yes, it is confusing.
>
> My preferred terminology was "DOD" or "Dequeue On Demand"

... which is useless because it adds no clarity as to whose demand
we're talking about.

> which makes
> the concept more understandable to me. So basically a process can say
> "I need to hold this glock, but for an unknown and possibly lengthy
> period of time, but please feel free to dequeue it if it's in your way."
> And bear in mind that several processes may do the same, simultaneously.
>
> You can almost think of this as a performance enhancement. This concept
> allows a process to hold a glock for much longer periods of time, at a
> lower priority, for example, when gfs2_file_read_iter needs to hold the
> glock for very long-running iterative reads.

Consider a process that allocates a somewhat large buffer and reads
into it in chunks that are not page aligned. The buffer initially
won't be faulted in, so we fault in the first chunk and write into it.
Then, when reading the second chunk, we find that the first page of
the second chunk is already present. We fill it, set the
HIF_MAY_DEMOTE flag, fault in more pages, and clear the HIF_MAY_DEMOTE
flag. If we then still have the glock (which is very likely), we
resume the read. Otherwise, we return a short result.

Thanks,
Andreas



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Ocfs2-devel] [Cluster-devel] [PATCH v6 10/19] gfs2: Introduce flag for glock holder auto-demotion
  2021-08-20 15:22       ` Andreas Gruenbacher
@ 2021-08-23  8:14         ` Steven Whitehouse
  2021-08-23 15:18           ` Andreas Gruenbacher
  0 siblings, 1 reply; 33+ messages in thread
From: Steven Whitehouse @ 2021-08-23  8:14 UTC (permalink / raw)
  To: Andreas Gruenbacher, Bob Peterson
  Cc: cluster-devel, Jan Kara, LKML, Christoph Hellwig, Alexander Viro,
	linux-fsdevel, Linus Torvalds, ocfs2-devel

On Fri, 2021-08-20 at 17:22 +0200, Andreas Gruenbacher wrote:
> On Fri, Aug 20, 2021 at 3:11 PM Bob Peterson <rpeterso@redhat.com>
> wrote:
> > 
[snip]
> > 
> > You can almost think of this as a performance enhancement. This
> > concept
> > allows a process to hold a glock for much longer periods of time,
> > at a
> > lower priority, for example, when gfs2_file_read_iter needs to hold
> > the
> > glock for very long-running iterative reads.
> 
> Consider a process that allocates a somewhat large buffer and reads
> into it in chunks that are not page aligned. The buffer initially
> won't be faulted in, so we fault in the first chunk and write into
> it.
> Then, when reading the second chunk, we find that the first page of
> the second chunk is already present. We fill it, set the
> HIF_MAY_DEMOTE flag, fault in more pages, and clear the
> HIF_MAY_DEMOTE
> flag. If we then still have the glock (which is very likely), we
> resume the read. Otherwise, we return a short result.
> 
> Thanks,
> Andreas
> 

If the goal here is just to allow the glock to be held for a longer
period of time, but with occasional interruptions to prevent
starvation, then we have a potential model for this. There is
cond_resched_lock() which does this for spin locks. So perhaps we might
do something similar:

/**
 * gfs2_glock_cond_regain - Conditionally drop and regain glock
 * @gl: The glock
 * @gh: A granted holder for the glock
 *
 * If there is a pending demote request for this glock, drop and 
 * requeue a lock request for this glock. If there is no pending
 * demote request, this is a no-op. In either case the glock is
 * held on both entry and exit.
 *
 * Returns: 0 if no pending demote, 1 if lock dropped and regained
 */
int gfs2_glock_cond_regain(struct gfs2_glock *gl, struct gfs2_holder *gh);

That seems more easily understood, and clearly documents places where
the lock may be dropped and regained. I think that the implementation
should be simpler and cleaner, compared with the current proposed
patch. There are only two bit flags related to pending demotes, for
example, so the check should be trivial.
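
For illustration, a long-running operation might then be structured
roughly as follows (just a sketch of the proposal; gfs2_glock_cond_regain()
is the helper suggested above, not an existing function, and the loop
body is a placeholder):

	while (!done) {
		done = do_one_chunk_of_work();
		if (gfs2_glock_cond_regain(gl, gh)) {
			/*
			 * The glock was dropped and re-acquired; any state
			 * derived from it may be stale and must be
			 * revalidated before continuing.
			 */
		}
	}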

It may need a few changes depending on the exact circumstances, but
hopefully that illustrates the concept,

Steve.





^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Ocfs2-devel] [Cluster-devel] [PATCH v6 10/19] gfs2: Introduce flag for glock holder auto-demotion
  2021-08-23  8:14         ` Steven Whitehouse
@ 2021-08-23 15:18           ` Andreas Gruenbacher
  2021-08-23 16:05             ` Matthew Wilcox
  2021-08-24  7:59             ` Steven Whitehouse
  0 siblings, 2 replies; 33+ messages in thread
From: Andreas Gruenbacher @ 2021-08-23 15:18 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: cluster-devel, Jan Kara, LKML, Christoph Hellwig, linux-fsdevel,
	Alexander Viro, Bob Peterson, Linus Torvalds, ocfs2-devel

On Mon, Aug 23, 2021 at 10:14 AM Steven Whitehouse <swhiteho@redhat.com> wrote:
> On Fri, 2021-08-20 at 17:22 +0200, Andreas Gruenbacher wrote:
> > On Fri, Aug 20, 2021 at 3:11 PM Bob Peterson <rpeterso@redhat.com>
> > wrote:
> > >
> [snip]
> > >
> > > You can almost think of this as a performance enhancement. This
> > > concept
> > > allows a process to hold a glock for much longer periods of time,
> > > at a
> > > lower priority, for example, when gfs2_file_read_iter needs to hold
> > > the
> > > glock for very long-running iterative reads.
> >
> > Consider a process that allocates a somewhat large buffer and reads
> > into it in chunks that are not page aligned. The buffer initially
> > won't be faulted in, so we fault in the first chunk and write into
> > it.
> > Then, when reading the second chunk, we find that the first page of
> > the second chunk is already present. We fill it, set the
> > HIF_MAY_DEMOTE flag, fault in more pages, and clear the
> > HIF_MAY_DEMOTE
> > flag. If we then still have the glock (which is very likely), we
> > resume the read. Otherwise, we return a short result.
> >
> > Thanks,
> > Andreas
> >
>
> If the goal here is just to allow the glock to be held for a longer
> period of time, but with occasional interruptions to prevent
> starvation, then we have a potential model for this. There is
> cond_resched_lock() which does this for spin locks.

This isn't an appropriate model for what I'm trying to achieve here.
In the cond_resched case, we know at the time of the cond_resched call
whether or not we want to schedule. If we do, we want to drop the spin
lock, schedule, and then re-acquire the spin lock. In the case we're
looking at here, we want to fault in user pages. There is no way of
knowing beforehand if the glock we're currently holding will have to
be dropped to achieve that. In fact, it will almost never have to be
dropped. But if it does, we need to drop it straight away to allow the
conflicting locking request to succeed.

Have a look at how the patch queue uses gfs2_holder_allow_demote() and
gfs2_holder_disallow_demote():

https://listman.redhat.com/archives/cluster-devel/2021-August/msg00128.html
https://listman.redhat.com/archives/cluster-devel/2021-August/msg00134.html
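
The pattern there is roughly the following (a simplified sketch, not
the exact patch code; fault_in_iov_iter_readable() is from this series,
and the gfs2_holder_queued() check stands in for however the caller
verifies that it still holds the glock):

gfs2_holder_allow_demote(gh);
leftover = fault_in_iov_iter_readable(from, window_size);
gfs2_holder_disallow_demote(gh);

if (!gfs2_holder_queued(gh)) {
	/*
	 * A conflicting request came in while we were faulting in
	 * pages and the glock was taken away from us: re-acquire it
	 * before retrying, or return a short result.
	 */
	goto retry;
}
/* We still hold the glock; resume the interrupted read or write. */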

Thanks,
Andreas



* Re: [Ocfs2-devel] [Cluster-devel] [PATCH v6 10/19] gfs2: Introduce flag for glock holder auto-demotion
  2021-08-23 15:18           ` Andreas Gruenbacher
@ 2021-08-23 16:05             ` Matthew Wilcox
  2021-08-23 16:36               ` Bob Peterson
  2021-08-23 19:12               ` Andreas Gruenbacher
  2021-08-24  7:59             ` Steven Whitehouse
  1 sibling, 2 replies; 33+ messages in thread
From: Matthew Wilcox @ 2021-08-23 16:05 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: cluster-devel, Jan Kara, LKML, Christoph Hellwig, linux-fsdevel,
	Alexander Viro, Bob Peterson, Linus Torvalds, Steven Whitehouse,
	ocfs2-devel

On Mon, Aug 23, 2021 at 05:18:12PM +0200, Andreas Gruenbacher wrote:
> On Mon, Aug 23, 2021 at 10:14 AM Steven Whitehouse <swhiteho@redhat.com> wrote:
> > If the goal here is just to allow the glock to be held for a longer
> > period of time, but with occasional interruptions to prevent
> > starvation, then we have a potential model for this. There is
> > cond_resched_lock() which does this for spin locks.
> 
> This isn't an appropriate model for what I'm trying to achieve here.
> In the cond_resched case, we know at the time of the cond_resched call
> whether or not we want to schedule. If we do, we want to drop the spin
> lock, schedule, and then re-acquire the spin lock. In the case we're
> looking at here, we want to fault in user pages. There is no way of
> knowing beforehand if the glock we're currently holding will have to
> be dropped to achieve that. In fact, it will almost never have to be
> dropped. But if it does, we need to drop it straight away to allow the
> conflicting locking request to succeed.

It occurs to me that this is similar to the wound/wait mutexes
(include/linux/ww_mutex.h & Documentation/locking/ww-mutex-design.rst).
You want to mark the glock as woundable before faulting, and then discover
if it was wounded after faulting.  Maybe sharing this terminology will
aid in understanding?
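
For reference, the basic wound/wait pattern looks roughly like this
(generic ww_mutex usage, nothing gfs2-specific; the class and lock
names are made up and the retry loop is abbreviated):

static DEFINE_WW_CLASS(example_ww_class);
/* lock_a and lock_b are ww_mutexes initialised with this class */

struct ww_acquire_ctx ctx;
int err;

ww_acquire_init(&ctx, &example_ww_class);
ww_mutex_lock(&lock_a, &ctx);
err = ww_mutex_lock(&lock_b, &ctx);
if (err == -EDEADLK) {
	/*
	 * This transaction has been wounded: an older transaction
	 * needs a lock we hold.  Back off, wait for the contended
	 * lock, then retry the whole sequence.
	 */
	ww_mutex_unlock(&lock_a);
	ww_mutex_lock_slow(&lock_b, &ctx);
}
/* ... locked section ... */
ww_mutex_unlock(&lock_b);
ww_mutex_unlock(&lock_a);
ww_acquire_fini(&ctx);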


* Re: [Ocfs2-devel] [Cluster-devel] [PATCH v6 10/19] gfs2: Introduce flag for glock holder auto-demotion
  2021-08-23 16:05             ` Matthew Wilcox
@ 2021-08-23 16:36               ` Bob Peterson
  2021-08-23 19:12               ` Andreas Gruenbacher
  1 sibling, 0 replies; 33+ messages in thread
From: Bob Peterson @ 2021-08-23 16:36 UTC (permalink / raw)
  To: Matthew Wilcox, Andreas Gruenbacher
  Cc: cluster-devel, Jan Kara, LKML, Christoph Hellwig, Alexander Viro,
	linux-fsdevel, Linus Torvalds, Steven Whitehouse, ocfs2-devel

On 8/23/21 11:05 AM, Matthew Wilcox wrote:
> On Mon, Aug 23, 2021 at 05:18:12PM +0200, Andreas Gruenbacher wrote:
>> On Mon, Aug 23, 2021 at 10:14 AM Steven Whitehouse <swhiteho@redhat.com> wrote:
>>> If the goal here is just to allow the glock to be held for a longer
>>> period of time, but with occasional interruptions to prevent
>>> starvation, then we have a potential model for this. There is
>>> cond_resched_lock() which does this for spin locks.
>>
>> This isn't an appropriate model for what I'm trying to achieve here.
>> In the cond_resched case, we know at the time of the cond_resched call
>> whether or not we want to schedule. If we do, we want to drop the spin
>> lock, schedule, and then re-acquire the spin lock. In the case we're
>> looking at here, we want to fault in user pages. There is no way of
>> knowing beforehand if the glock we're currently holding will have to
>> be dropped to achieve that. In fact, it will almost never have to be
>> dropped. But if it does, we need to drop it straight away to allow the
>> conflicting locking request to succeed.
> 
> It occurs to me that this is similar to the wound/wait mutexes
> (include/linux/ww_mutex.h & Documentation/locking/ww-mutex-design.rst).
> You want to mark the glock as woundable before faulting, and then discover
> if it was wounded after faulting.  Maybe sharing this terminology will
> aid in understanding?
> 
Hmm. Woundable. I like it.
Andreas and I argued about the terminology but we never found a
middle ground. Perhaps this is it. Thanks, Matthew.

Regards,

Bob Peterson



* Re: [Ocfs2-devel] [Cluster-devel] [PATCH v6 10/19] gfs2: Introduce flag for glock holder auto-demotion
  2021-08-23 16:05             ` Matthew Wilcox
  2021-08-23 16:36               ` Bob Peterson
@ 2021-08-23 19:12               ` Andreas Gruenbacher
  1 sibling, 0 replies; 33+ messages in thread
From: Andreas Gruenbacher @ 2021-08-23 19:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: cluster-devel, Jan Kara, LKML, Christoph Hellwig, linux-fsdevel,
	Alexander Viro, Bob Peterson, Linus Torvalds, Steven Whitehouse,
	ocfs2-devel

On Mon, Aug 23, 2021 at 6:06 PM Matthew Wilcox <willy@infradead.org> wrote:
> On Mon, Aug 23, 2021 at 05:18:12PM +0200, Andreas Gruenbacher wrote:
> > On Mon, Aug 23, 2021 at 10:14 AM Steven Whitehouse <swhiteho@redhat.com> wrote:
> > > If the goal here is just to allow the glock to be held for a longer
> > > period of time, but with occasional interruptions to prevent
> > > starvation, then we have a potential model for this. There is
> > > cond_resched_lock() which does this for spin locks.
> >
> > This isn't an appropriate model for what I'm trying to achieve here.
> > In the cond_resched case, we know at the time of the cond_resched call
> > whether or not we want to schedule. If we do, we want to drop the spin
> > lock, schedule, and then re-acquire the spin lock. In the case we're
> > looking at here, we want to fault in user pages. There is no way of
> > knowing beforehand if the glock we're currently holding will have to
> > be dropped to achieve that. In fact, it will almost never have to be
> > dropped. But if it does, we need to drop it straight away to allow the
> > conflicting locking request to succeed.
>
> It occurs to me that this is similar to the wound/wait mutexes
> (include/linux/ww_mutex.h & Documentation/locking/ww-mutex-design.rst).
> You want to mark the glock as woundable before faulting, and then discover
> if it was wounded after faulting.  Maybe sharing this terminology will
> aid in understanding?

I've looked at the ww_mutex documentation. A "transaction" wounds
another "transaction" and that other transaction then "dies", or it
"heals" and restarts. In the glock case, a process sets and clears the
HIF_MAY_DEMOTE flag on one of its own glock holder contexts. After
clearing the flag, it either still holds the glock or it doesn't;
nothing needs to be done to "die" or to "heal". So I'm not sure we
want to conflate two concepts.

One of the earlier terms we've used was "stealing", with a
HIF_MAY_STEAL flag. That works, but it's slightly less obvious what
happens to a glock holder when the glock is stolen from it. (The
holder gets dequeued, __gfs2_glock_dq.) The glock code already uses
the terms promote/demote, acquire/release, enqueue/dequeue, and
_nq/_dq for various forms of acquiring and releasing a glock, so we're
not short of names right now, apparently.

Thanks,
Andreas



* Re: [Ocfs2-devel] [Cluster-devel] [PATCH v6 10/19] gfs2: Introduce flag for glock holder auto-demotion
  2021-08-23 15:18           ` Andreas Gruenbacher
  2021-08-23 16:05             ` Matthew Wilcox
@ 2021-08-24  7:59             ` Steven Whitehouse
  1 sibling, 0 replies; 33+ messages in thread
From: Steven Whitehouse @ 2021-08-24  7:59 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: cluster-devel, Jan Kara, LKML, Christoph Hellwig, linux-fsdevel,
	Alexander Viro, Bob Peterson, Linus Torvalds, ocfs2-devel

Hi,

On Mon, 2021-08-23 at 17:18 +0200, Andreas Gruenbacher wrote:
> On Mon, Aug 23, 2021 at 10:14 AM Steven Whitehouse <
> swhiteho@redhat.com> wrote:
> > On Fri, 2021-08-20 at 17:22 +0200, Andreas Gruenbacher wrote:
> > > On Fri, Aug 20, 2021 at 3:11 PM Bob Peterson <rpeterso@redhat.com
> > > >
> > > wrote:
> > [snip]
> > > > You can almost think of this as a performance enhancement. This
> > > > concept
> > > > allows a process to hold a glock for much longer periods of
> > > > time,
> > > > at a
> > > > lower priority, for example, when gfs2_file_read_iter needs to
> > > > hold
> > > > the
> > > > glock for very long-running iterative reads.
> > > 
> > > Consider a process that allocates a somewhat large buffer and
> > > reads
> > > into it in chunks that are not page aligned. The buffer initially
> > > won't be faulted in, so we fault in the first chunk and write
> > > into
> > > it.
> > > Then, when reading the second chunk, we find that the first page
> > > of
> > > the second chunk is already present. We fill it, set the
> > > HIF_MAY_DEMOTE flag, fault in more pages, and clear the
> > > HIF_MAY_DEMOTE
> > > flag. If we then still have the glock (which is very likely), we
> > > resume the read. Otherwise, we return a short result.
> > > 
> > > Thanks,
> > > Andreas
> > > 
> > 
> > If the goal here is just to allow the glock to be held for a longer
> > period of time, but with occasional interruptions to prevent
> > starvation, then we have a potential model for this. There is
> > cond_resched_lock() which does this for spin locks.
> 
> This isn't an appropriate model for what I'm trying to achieve here.
> In the cond_resched case, we know at the time of the cond_resched
> call
> whether or not we want to schedule. If we do, we want to drop the
> spin
> lock, schedule, and then re-acquire the spin lock. In the case we're
> looking at here, we want to fault in user pages. There is no way of
> knowing beforehand if the glock we're currently holding will have to
> be dropped to achieve that. In fact, it will almost never have to be
> dropped. But if it does, we need to drop it straight away to allow
> the
> conflicting locking request to succeed.
> 
> Have a look at how the patch queue uses gfs2_holder_allow_demote()
> and
> gfs2_holder_disallow_demote():
> 
> https://listman.redhat.com/archives/cluster-devel/2021-August/msg00128.html
> https://listman.redhat.com/archives/cluster-devel/2021-August/msg00134.html
> 
> Thanks,
> Andreas
> 

Ah, now I see! Apologies if I misunderstood the intent here initially.
It is complicated and not so obvious - at least to me!

You've added a lot of context to this patch in your various replies on
this thread. The original patch description explains how the feature is
implemented, but doesn't really touch on why - that is left to the
other patches you pointed to above. Adding a short paragraph or two on
the "why" side of things to the patch description would be helpful, I
think.

Your message on Friday (20 Aug 2021 15:17:41 +0200 (20/08/21 14:17:41))
has a good explanation in its second part, which, together with what
you've said above (or something similar), would be a good basis.

Apologies again for not understanding what is going on,

Steve.




end of thread, other threads:[~2021-08-24  8:03 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-19 19:40 [Ocfs2-devel] [PATCH v6 00/19] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 01/19] iov_iter: Fix iov_iter_get_pages{, _alloc} page fault return value Andreas Gruenbacher
2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 02/19] powerpc/kvm: Fix kvm_use_magic_page Andreas Gruenbacher
2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 03/19] Turn fault_in_pages_{readable, writeable} into fault_in_{readable, writeable} Andreas Gruenbacher
2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 04/19] Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable Andreas Gruenbacher
2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 05/19] iov_iter: Introduce fault_in_iov_iter_writeable Andreas Gruenbacher
2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 06/19] gfs2: Add wrapper for iomap_file_buffered_write Andreas Gruenbacher
2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 07/19] gfs2: Clean up function may_grant Andreas Gruenbacher
2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 08/19] gfs2: Eliminate vestigial HIF_FIRST Andreas Gruenbacher
2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 09/19] gfs2: Remove redundant check from gfs2_glock_dq Andreas Gruenbacher
2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 10/19] gfs2: Introduce flag for glock holder auto-demotion Andreas Gruenbacher
2021-08-20  9:35   ` [Ocfs2-devel] [Cluster-devel] " Steven Whitehouse
2021-08-20 13:11     ` Bob Peterson
2021-08-20 13:41       ` Steven Whitehouse
2021-08-20 15:22       ` Andreas Gruenbacher
2021-08-23  8:14         ` Steven Whitehouse
2021-08-23 15:18           ` Andreas Gruenbacher
2021-08-23 16:05             ` Matthew Wilcox
2021-08-23 16:36               ` Bob Peterson
2021-08-23 19:12               ` Andreas Gruenbacher
2021-08-24  7:59             ` Steven Whitehouse
2021-08-20 13:17     ` Andreas Gruenbacher
2021-08-20 13:47       ` Steven Whitehouse
2021-08-20 14:43         ` Andreas Grünbacher
2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 11/19] gfs2: Move the inode glock locking to gfs2_file_buffered_write Andreas Gruenbacher
2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 12/19] gfs2: Fix mmap + page fault deadlocks for buffered I/O Andreas Gruenbacher
2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 13/19] iomap: Fix iomap_dio_rw return value for user copies Andreas Gruenbacher
2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 14/19] iomap: Support partial direct I/O on user copy failures Andreas Gruenbacher
2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 15/19] iomap: Add done_before argument to iomap_dio_rw Andreas Gruenbacher
2021-08-19 19:40 ` [Ocfs2-devel] [PATCH v6 16/19] gup: Introduce FOLL_NOFAULT flag to disable page faults Andreas Gruenbacher
2021-08-19 19:41 ` [Ocfs2-devel] [PATCH v6 17/19] iov_iter: Introduce nofault " Andreas Gruenbacher
2021-08-19 19:41 ` [Ocfs2-devel] [PATCH v6 18/19] gfs2: Fix mmap + page fault deadlocks for direct I/O Andreas Gruenbacher
2021-08-19 19:41 ` [Ocfs2-devel] [PATCH v6 19/19] gfs2: Eliminate ip->i_gh Andreas Gruenbacher

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).