ocfs2-devel.oss.oracle.com archive mirror
 help / color / mirror / Atom feed
* [Ocfs2-devel] [PATCH v2 0/6] gfs2: Fix mmap + page fault deadlocks
@ 2021-07-18 22:39 Andreas Gruenbacher
  2021-07-18 22:39 ` [Ocfs2-devel] [PATCH v2 1/6] iov_iter: Introduce fault_in_iov_iter helper Andreas Gruenbacher
                   ` (7 more replies)
  0 siblings, 8 replies; 12+ messages in thread
From: Andreas Gruenbacher @ 2021-07-18 22:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: cluster-devel, Jan Kara, Andreas Gruenbacher, linux-kernel,
	Christoph Hellwig, Alexander Viro, linux-fsdevel, ocfs2-devel

Hi Linus et al.,

here's an update to the gfs2 mmap + page faults fix that implements
Linus's suggestion of disabling page faults while we're holding the
inode glock.

This passes fstests except for test generic/095, which fails due to an
independent bug in the iov_iter code.  I'm currently trying to get
initial feedback from Al and Christoph on that.

Any feedback would be welcome.

Thanks a lot,
Andreas

Andreas Gruenbacher (6):
  iov_iter: Introduce fault_in_iov_iter helper
  iomap: Fix iomap_dio_rw return value for page faults
  gfs2: Add wrapper for iomap_file_buffered_write
  gfs2: Fix mmap + page fault deadlocks for buffered I/O
  iov_iter: Introduce ITER_FLAG_FAST_ONLY flag
  gfs2: Fix mmap + page fault deadlocks for direct I/O

 fs/gfs2/file.c       | 79 ++++++++++++++++++++++++++++++++++++++++----
 fs/iomap/direct-io.c |  2 ++
 include/linux/mm.h   |  3 ++
 include/linux/uio.h  | 16 +++++++--
 lib/iov_iter.c       | 62 +++++++++++++++++++++++++++++++---
 mm/gup.c             | 68 ++++++++++++++++++++++++++++++++++++++
 6 files changed, 215 insertions(+), 15 deletions(-)

-- 
2.26.3


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Ocfs2-devel] [PATCH v2 1/6] iov_iter: Introduce fault_in_iov_iter helper
  2021-07-18 22:39 [Ocfs2-devel] [PATCH v2 0/6] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
@ 2021-07-18 22:39 ` Andreas Gruenbacher
  2021-07-19 19:26   ` Linus Torvalds
  2021-07-18 22:39 ` [Ocfs2-devel] [PATCH v2 2/6] iomap: Fix iomap_dio_rw return value for page faults Andreas Gruenbacher
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 12+ messages in thread
From: Andreas Gruenbacher @ 2021-07-18 22:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: cluster-devel, Jan Kara, Andreas Gruenbacher, linux-kernel,
	Christoph Hellwig, Alexander Viro, linux-fsdevel, ocfs2-devel

Introduce a new fault_in_iov_iter helper for manually faulting in an iterator.
Other than fault_in_pages_writeable(), this function is non-destructive.

We'll use fault_in_iov_iter in gfs2 once we've determined that the iterator
passed to .read_iter or .write_iter isn't in memory.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 include/linux/mm.h  |  3 ++
 include/linux/uio.h |  1 +
 lib/iov_iter.c      | 42 ++++++++++++++++++++++++++++
 mm/gup.c            | 68 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 114 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8ae31622deef..dc7d64632a60 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1836,6 +1836,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
 int pin_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages);
 
+unsigned long fault_in_user_pages(unsigned long start, unsigned long len,
+				  bool write);
+
 int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
 int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
 			struct task_struct *task, bool bypass_rlim);
diff --git a/include/linux/uio.h b/include/linux/uio.h
index d3ec87706d75..74f819c41735 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -124,6 +124,7 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
 void iov_iter_advance(struct iov_iter *i, size_t bytes);
 void iov_iter_revert(struct iov_iter *i, size_t bytes);
 int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes);
+size_t fault_in_iov_iter(const struct iov_iter *i);
 size_t iov_iter_single_seg_count(const struct iov_iter *i);
 size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index c701b7a187f2..3beecf8f77de 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -487,6 +487,48 @@ int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
 }
 EXPORT_SYMBOL(iov_iter_fault_in_readable);
 
+/**
+ * fault_in_iov_iter - fault in iov iterator for reading / writing
+ * @i: iterator
+ *
+ * Faults in the iterator using get_user_pages, i.e., without triggering
+ * hardware page faults.
+ *
+ * This is primarily useful when we know that some or all of the pages in @i
+ * aren't in memory.  For iterators that are likely to be in memory,
+ * fault_in_pages_readable() may be more appropriate.
+ *
+ * Other than fault_in_pages_writeable(), this function is non-destructive even
+ * when faulting in pages for writing.
+ *
+ * Returns the number of bytes faulted in, or the size of @i if @i doesn't need
+ * faulting in.
+ */
+size_t fault_in_iov_iter(const struct iov_iter *i)
+{
+	size_t count = i->count;
+	const struct iovec *p;
+	size_t ret = 0, skip;
+
+	if (iter_is_iovec(i)) {
+		for (p = i->iov, skip = i->iov_offset; count; p++, skip = 0) {
+			unsigned long len = min(count, p->iov_len - skip);
+			unsigned long start, l;
+
+			if (unlikely(!len))
+				continue;
+			start = (unsigned long)p->iov_base + skip;
+			l = fault_in_user_pages(start, len, iov_iter_rw(i) != WRITE);
+			ret += l;
+			if (unlikely(l != len))
+				break;
+			count -= l;
+		}
+	}
+	return ret;
+}
+EXPORT_SYMBOL(fault_in_iov_iter);
+
 void iov_iter_init(struct iov_iter *i, unsigned int direction,
 			const struct iovec *iov, unsigned long nr_segs,
 			size_t count)
diff --git a/mm/gup.c b/mm/gup.c
index 3ded6a5f26b2..f87fb24d59e1 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1565,6 +1565,74 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
 }
 #endif /* !CONFIG_MMU */
 
+/**
+ * fault_in_user_pages - fault in an address range for reading / writing
+ * @start: start of address range
+ * @len: length of address range
+ * @write: fault in for writing
+ *
+ * Note that we don't pin or otherwise hold the pages referenced that we fault
+ * in.  There's no guarantee that they'll stay in memory for any duration of
+ * time.
+ *
+ * Returns the number of bytes faulted in from @start.
+ */
+unsigned long fault_in_user_pages(unsigned long start, unsigned long len,
+				  bool write)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma = NULL;
+	unsigned long end, nstart, nend;
+	int locked = 0;
+	int gup_flags;
+
+	/*
+	 * FIXME: Make sure this function doesn't succeed for pages that cannot
+	 * be accessed; otherwise we could end up in a loop trying to fault in
+	 * and then access the pages.  (It's okay if a page gets evicted and we
+	 * need more than one retry.)
+	 */
+
+	/*
+	 * FIXME: Are these the right FOLL_* flags?
+	 */
+
+	gup_flags = FOLL_TOUCH | FOLL_POPULATE;
+	if (write)
+		gup_flags |= FOLL_WRITE;
+
+	end = PAGE_ALIGN(start + len);
+	for (nstart = start & PAGE_MASK; nstart < end; nstart = nend) {
+		unsigned long nr_pages;
+		long ret;
+
+		if (!locked) {
+			locked = 1;
+			mmap_read_lock(mm);
+			vma = find_vma(mm, nstart);
+		} else if (nstart >= vma->vm_end)
+			vma = vma->vm_next;
+		if (!vma || vma->vm_start >= end)
+			break;
+		nend = min(end, vma->vm_end);
+		if (vma->vm_flags & (VM_IO | VM_PFNMAP))
+			continue;
+		if (nstart < vma->vm_start)
+			nstart = vma->vm_start;
+		nr_pages = (nend - nstart) / PAGE_SIZE;
+		ret = __get_user_pages_locked(mm, nstart, nr_pages,
+					      NULL, NULL, &locked, gup_flags);
+		if (ret <= 0)
+			break;
+		nend = nstart + ret * PAGE_SIZE;
+	}
+	if (locked)
+		mmap_read_unlock(mm);
+	if (nstart > start)
+		return min(nstart - start, len);
+	return 0;
+}
+
 /**
  * get_dump_page() - pin user page in memory while writing it to core dump
  * @addr: user address
-- 
2.26.3


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Ocfs2-devel] [PATCH v2 2/6] iomap: Fix iomap_dio_rw return value for page faults
  2021-07-18 22:39 [Ocfs2-devel] [PATCH v2 0/6] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
  2021-07-18 22:39 ` [Ocfs2-devel] [PATCH v2 1/6] iov_iter: Introduce fault_in_iov_iter helper Andreas Gruenbacher
@ 2021-07-18 22:39 ` Andreas Gruenbacher
  2021-07-18 22:39 ` [Ocfs2-devel] [PATCH v2 3/6] gfs2: Add wrapper for iomap_file_buffered_write Andreas Gruenbacher
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Andreas Gruenbacher @ 2021-07-18 22:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: cluster-devel, Jan Kara, Andreas Gruenbacher, linux-kernel,
	Christoph Hellwig, Alexander Viro, linux-fsdevel, ocfs2-devel

When a page fault occurs during a direct I/O, iomap_dio_rw can currently return
0 when a page cannot be accessed.  In that case, -EFAULT should be returned
instead.  (For reads, a return value of 0 indicates the end of file.)  Fix that
by casting the return value of iomap_apply from 0 to -EFAULT: in that position,
we know that we should have been able to read something.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/iomap/direct-io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 9398b8c31323..a87a43ee8278 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -561,6 +561,8 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		ret = iomap_apply(inode, pos, count, iomap_flags, ops, dio,
 				iomap_dio_actor);
 		if (ret <= 0) {
+			if (ret == 0)
+				ret = -EFAULT;
 			/* magic error code to fall back to buffered I/O */
 			if (ret == -ENOTBLK) {
 				wait_for_completion = true;
-- 
2.26.3


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Ocfs2-devel] [PATCH v2 3/6] gfs2: Add wrapper for iomap_file_buffered_write
  2021-07-18 22:39 [Ocfs2-devel] [PATCH v2 0/6] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
  2021-07-18 22:39 ` [Ocfs2-devel] [PATCH v2 1/6] iov_iter: Introduce fault_in_iov_iter helper Andreas Gruenbacher
  2021-07-18 22:39 ` [Ocfs2-devel] [PATCH v2 2/6] iomap: Fix iomap_dio_rw return value for page faults Andreas Gruenbacher
@ 2021-07-18 22:39 ` Andreas Gruenbacher
  2021-07-18 22:39 ` [Ocfs2-devel] [PATCH v2 4/6] gfs2: Fix mmap + page fault deadlocks for buffered I/O Andreas Gruenbacher
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Andreas Gruenbacher @ 2021-07-18 22:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: cluster-devel, Jan Kara, Andreas Gruenbacher, linux-kernel,
	Christoph Hellwig, Alexander Viro, linux-fsdevel, ocfs2-devel

Add a wrapper around iomap_file_buffered_write.  We'll add code for when
the operation needs to be retried here later.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/file.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 493a83e3f590..13f701493c3c 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -857,6 +857,18 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return written ? written : ret;
 }
 
+static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct file *file = iocb->ki_filp;
+	struct inode *inode = file_inode(file);
+	ssize_t ret;
+
+	current->backing_dev_info = inode_to_bdi(inode);
+	ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
+	current->backing_dev_info = NULL;
+	return ret;
+}
+
 /**
  * gfs2_file_write_iter - Perform a write to a file
  * @iocb: The io context
@@ -908,9 +920,7 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			goto out_unlock;
 
 		iocb->ki_flags |= IOCB_DSYNC;
-		current->backing_dev_info = inode_to_bdi(inode);
-		buffered = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
-		current->backing_dev_info = NULL;
+		buffered = gfs2_file_buffered_write(iocb, from);
 		if (unlikely(buffered <= 0)) {
 			if (!ret)
 				ret = buffered;
@@ -932,9 +942,7 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		if (!ret || ret2 > 0)
 			ret += ret2;
 	} else {
-		current->backing_dev_info = inode_to_bdi(inode);
-		ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
-		current->backing_dev_info = NULL;
+		ret = gfs2_file_buffered_write(iocb, from);
 		if (likely(ret > 0)) {
 			iocb->ki_pos += ret;
 			ret = generic_write_sync(iocb, ret);
-- 
2.26.3


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Ocfs2-devel] [PATCH v2 4/6] gfs2: Fix mmap + page fault deadlocks for buffered I/O
  2021-07-18 22:39 [Ocfs2-devel] [PATCH v2 0/6] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (2 preceding siblings ...)
  2021-07-18 22:39 ` [Ocfs2-devel] [PATCH v2 3/6] gfs2: Add wrapper for iomap_file_buffered_write Andreas Gruenbacher
@ 2021-07-18 22:39 ` Andreas Gruenbacher
  2021-07-18 22:39 ` [Ocfs2-devel] [PATCH v2 5/6] iov_iter: Introduce ITER_FLAG_FAST_ONLY flag Andreas Gruenbacher
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Andreas Gruenbacher @ 2021-07-18 22:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: cluster-devel, Jan Kara, Andreas Gruenbacher, linux-kernel,
	Christoph Hellwig, Alexander Viro, linux-fsdevel, ocfs2-devel

In the .read_iter and .write_iter file operations, we're accessing
user-space memory while holding the inodes glock.  There's a possibility
that the memory is mapped to the same file, in which case we'd recurse on
the same glock.

More complex scenarios can involve multiple glocks, processes, and even cluster
nodes.

Avoids these kinds of problems by disabling page faults while holding a glock.
If a page fault occurs, we either end up with a partial read or write, or with
-EFAULT if nothing could be read or written.  In that case, we drop the glock,
fault in the requested pages manually, and repeat the operation.

This locking problem in gfs2 was originally reported by Jan Kara.  Linus came
up with the proposal to disable page faults.  Many thanks to Al Viro and
Matthew Wilcox for their feedback as well.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/file.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 13f701493c3c..99df7934b4d8 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -824,6 +824,12 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	size_t written = 0;
 	ssize_t ret;
 
+	/*
+	 * In this function, we disable page faults when whe're holding the
+	 * inode glock while doing I/O.  If a page fault occurs, we drop the
+	 * inode glock, fault in the pages manually, and then we retry.
+	 */
+
 	if (iocb->ki_flags & IOCB_DIRECT) {
 		ret = gfs2_file_direct_read(iocb, to, &gh);
 		if (likely(ret != -ENOTBLK))
@@ -831,6 +837,7 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		iocb->ki_flags &= ~IOCB_DIRECT;
 	}
 	iocb->ki_flags |= IOCB_NOIO;
+	/* Leave page faults enabled while we're not holding any locks. */
 	ret = generic_file_read_iter(iocb, to);
 	iocb->ki_flags &= ~IOCB_NOIO;
 	if (ret >= 0) {
@@ -845,13 +852,19 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	}
 	ip = GFS2_I(iocb->ki_filp->f_mapping->host);
 	gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
+retry:
 	ret = gfs2_glock_nq(&gh);
 	if (ret)
 		goto out_uninit;
+	pagefault_disable();
 	ret = generic_file_read_iter(iocb, to);
+	pagefault_enable();
 	if (ret > 0)
 		written += ret;
 	gfs2_glock_dq(&gh);
+	if (unlikely(iov_iter_count(to) && (ret > 0 || ret == -EFAULT)) &&
+	    fault_in_iov_iter(to))
+		goto retry;
 out_uninit:
 	gfs2_holder_uninit(&gh);
 	return written ? written : ret;
@@ -863,9 +876,20 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *fro
 	struct inode *inode = file_inode(file);
 	ssize_t ret;
 
+	/*
+	 * In this function, we disable page faults when whe're holding the
+	 * inode glock while doing I/O.  If a page fault occurs, we drop the
+	 * inode glock, fault in the pages manually, and then we retry.
+	 */
+
+retry:
 	current->backing_dev_info = inode_to_bdi(inode);
+	pagefault_disable();
 	ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
+	pagefault_enable();
 	current->backing_dev_info = NULL;
+	if (unlikely(ret == -EFAULT) && fault_in_iov_iter(from))
+		goto retry;
 	return ret;
 }
 
-- 
2.26.3


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Ocfs2-devel] [PATCH v2 5/6] iov_iter: Introduce ITER_FLAG_FAST_ONLY flag
  2021-07-18 22:39 [Ocfs2-devel] [PATCH v2 0/6] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (3 preceding siblings ...)
  2021-07-18 22:39 ` [Ocfs2-devel] [PATCH v2 4/6] gfs2: Fix mmap + page fault deadlocks for buffered I/O Andreas Gruenbacher
@ 2021-07-18 22:39 ` Andreas Gruenbacher
  2021-07-19 19:29   ` Linus Torvalds
  2021-07-18 22:39 ` [Ocfs2-devel] [PATCH v2 6/6] gfs2: Fix mmap + page fault deadlocks for direct I/O Andreas Gruenbacher
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 12+ messages in thread
From: Andreas Gruenbacher @ 2021-07-18 22:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: cluster-devel, Jan Kara, Andreas Gruenbacher, linux-kernel,
	Christoph Hellwig, Alexander Viro, linux-fsdevel, ocfs2-devel

Introduce a new ITER_FLAG_FAST_ONLY flag to indicate to get_user_pages to use
the FOLL_FAST_ONLY flag.  This will cause get_user_pages to fail when it would
otherwise fault in a page.

Currently, the ITER_FLAG_FAST_ONLY flag is only checked in iov_iter_get_pages
and iov_iter_get_pages_alloc.  This is enough for iomaop_dio_rw(), but it may
make sense to check for this flag in other contexts as well.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 include/linux/uio.h | 15 ++++++++++++---
 lib/iov_iter.c      | 20 +++++++++++++++-----
 2 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 74f819c41735..d3d629c2153a 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -18,6 +18,9 @@ struct kvec {
 };
 
 enum iter_type {
+	/* set if get_user_pages should use FOLL_FAST_ONLY */
+	ITER_FLAG_FAST_ONLY = 2,
+
 	/* iter types */
 	ITER_IOVEC = 4,
 	ITER_KVEC = 8,
@@ -30,8 +33,9 @@ enum iter_type {
 struct iov_iter {
 	/*
 	 * Bit 0 is the read/write bit, set if we're writing.
-	 * Bit 1 is the BVEC_FLAG_NO_REF bit, set if type is a bvec and
-	 * the caller isn't expecting to drop a page reference when done.
+	 * Bit 1 is the ITER_FLAG_FAST_ONLY bit, set if get_user_pages
+	 * should use the FOLL_FAST_ONLY flag when trying to fault in pages
+	 * (only useful for type ITER_IOVEC).
 	 */
 	unsigned int type;
 	size_t iov_offset;
@@ -55,7 +59,7 @@ struct iov_iter {
 
 static inline enum iter_type iov_iter_type(const struct iov_iter *i)
 {
-	return i->type & ~(READ | WRITE);
+	return i->type & ~(READ | WRITE | ITER_FLAG_FAST_ONLY);
 }
 
 static inline bool iter_is_iovec(const struct iov_iter *i)
@@ -93,6 +97,11 @@ static inline unsigned char iov_iter_rw(const struct iov_iter *i)
 	return i->type & (READ | WRITE);
 }
 
+static inline bool iov_iter_is_fast_only(const struct iov_iter *i)
+{
+	return i->type & ITER_FLAG_FAST_ONLY;
+}
+
 /*
  * Total number of bytes covered by an iovec.
  *
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 3beecf8f77de..182ff2afed19 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1538,6 +1538,8 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 		   struct page **pages, size_t maxsize, unsigned maxpages,
 		   size_t *start)
 {
+	unsigned int gup_flags = 0;
+
 	if (maxsize > i->count)
 		maxsize = i->count;
 
@@ -1548,6 +1550,11 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 	if (unlikely(iov_iter_is_discard(i)))
 		return -EFAULT;
 
+	if (iov_iter_rw(i) != WRITE)
+		gup_flags |= FOLL_WRITE;
+	if (iov_iter_is_fast_only(i))
+		gup_flags |= FOLL_FAST_ONLY;
+
 	iterate_all_kinds(i, maxsize, v, ({
 		unsigned long addr = (unsigned long)v.iov_base;
 		size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
@@ -1558,9 +1565,7 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 			len = maxpages * PAGE_SIZE;
 		addr &= ~(PAGE_SIZE - 1);
 		n = DIV_ROUND_UP(len, PAGE_SIZE);
-		res = get_user_pages_fast(addr, n,
-				iov_iter_rw(i) != WRITE ?  FOLL_WRITE : 0,
-				pages);
+		res = get_user_pages_fast(addr, n, gup_flags, pages);
 		if (unlikely(res < 0))
 			return res;
 		return (res == n ? len : res * PAGE_SIZE) - *start;
@@ -1665,6 +1670,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
 		   size_t *start)
 {
+	unsigned int gup_flags = 0;
 	struct page **p;
 
 	if (maxsize > i->count)
@@ -1677,6 +1683,11 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 	if (unlikely(iov_iter_is_discard(i)))
 		return -EFAULT;
 
+	if (iov_iter_rw(i) != WRITE)
+		gup_flags |= FOLL_WRITE;
+	if (iov_iter_is_fast_only(i))
+		gup_flags |= FOLL_FAST_ONLY;
+
 	iterate_all_kinds(i, maxsize, v, ({
 		unsigned long addr = (unsigned long)v.iov_base;
 		size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
@@ -1688,8 +1699,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		p = get_pages_array(n);
 		if (!p)
 			return -ENOMEM;
-		res = get_user_pages_fast(addr, n,
-				iov_iter_rw(i) != WRITE ?  FOLL_WRITE : 0, p);
+		res = get_user_pages_fast(addr, n, gup_flags, p);
 		if (unlikely(res < 0)) {
 			kvfree(p);
 			return res;
-- 
2.26.3


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Ocfs2-devel] [PATCH v2 6/6] gfs2: Fix mmap + page fault deadlocks for direct I/O
  2021-07-18 22:39 [Ocfs2-devel] [PATCH v2 0/6] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (4 preceding siblings ...)
  2021-07-18 22:39 ` [Ocfs2-devel] [PATCH v2 5/6] iov_iter: Introduce ITER_FLAG_FAST_ONLY flag Andreas Gruenbacher
@ 2021-07-18 22:39 ` Andreas Gruenbacher
  2021-07-19 19:31 ` [Ocfs2-devel] [PATCH v2 0/6] gfs2: Fix mmap + page fault deadlocks Linus Torvalds
  2021-07-19 20:14 ` Al Viro
  7 siblings, 0 replies; 12+ messages in thread
From: Andreas Gruenbacher @ 2021-07-18 22:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: cluster-devel, Jan Kara, Andreas Gruenbacher, linux-kernel,
	Christoph Hellwig, Alexander Viro, linux-fsdevel, ocfs2-devel

Direct I/O differs from buffered I/O in that it uses bio_iov_iter_get_pages for
grabbing page references and for manually faulting in pages instead of
triggering actual page faults.  For disabling these manual page faults, it's
not enough to call pagefault_disable(); instead, we use the new
ITER_FLAG_FAST_ONLY flag for telling iomap_dio_rw to stop faulting pages in for
us.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/file.c | 35 ++++++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 99df7934b4d8..6feb857a8a1c 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -763,21 +763,42 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
 	struct file *file = iocb->ki_filp;
 	struct gfs2_inode *ip = GFS2_I(file->f_mapping->host);
 	size_t count = iov_iter_count(to);
+	size_t written = 0;
 	ssize_t ret;
 
+	/*
+	 * In this function, we disable page faults when whe're holding the
+	 * inode glock while doing I/O.  If a page fault occurs, we drop the
+	 * inode glock, fault in the pages manually, and then we retry.  Other
+	 * than in gfs2_file_read_iter, iomap_dio_rw can trigger implicit as
+	 * well as manual page faults, and we need to disable both kinds
+	 * separately.
+	 */
+
 	if (!count)
 		return 0; /* skip atime */
 
 	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh);
+retry:
 	ret = gfs2_glock_nq(gh);
 	if (ret)
 		goto out_uninit;
 
+	pagefault_disable();
+	to->type |= ITER_FLAG_FAST_ONLY;
 	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0);
+	to->type &= ~ITER_FLAG_FAST_ONLY;
+	pagefault_enable();
+
 	gfs2_glock_dq(gh);
+	if (ret > 0)
+		written += ret;
+	if (unlikely(iov_iter_count(to) && (ret > 0 || ret == -EFAULT)) &&
+	    fault_in_iov_iter(to))
+		goto retry;
 out_uninit:
 	gfs2_holder_uninit(gh);
-	return ret;
+	return written ? written : ret;
 }
 
 static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
@@ -790,6 +811,12 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 	loff_t offset = iocb->ki_pos;
 	ssize_t ret;
 
+	/*
+	 * In this function, we disable page faults when whe're holding the
+	 * inode glock while doing I/O.  If a page fault occurs, we drop the
+	 * inode glock, fault in the pages manually, and then we retry.
+	 */
+
 	/*
 	 * Deferred lock, even if its a write, since we do no allocation on
 	 * this path. All we need to change is the atime, and this lock mode
@@ -799,6 +826,7 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 	 * VFS does.
 	 */
 	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh);
+retry:
 	ret = gfs2_glock_nq(gh);
 	if (ret)
 		goto out_uninit;
@@ -807,11 +835,16 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 	if (offset + len > i_size_read(&ip->i_inode))
 		goto out;
 
+	from->type |= ITER_FLAG_FAST_ONLY;
 	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0);
+	from->type &= ~ITER_FLAG_FAST_ONLY;
+
 	if (ret == -ENOTBLK)
 		ret = 0;
 out:
 	gfs2_glock_dq(gh);
+	if (unlikely(ret == -EFAULT) && fault_in_iov_iter(from))
+		goto retry;
 out_uninit:
 	gfs2_holder_uninit(gh);
 	return ret;
-- 
2.26.3


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Ocfs2-devel] [PATCH v2 1/6] iov_iter: Introduce fault_in_iov_iter helper
  2021-07-18 22:39 ` [Ocfs2-devel] [PATCH v2 1/6] iov_iter: Introduce fault_in_iov_iter helper Andreas Gruenbacher
@ 2021-07-19 19:26   ` Linus Torvalds
  0 siblings, 0 replies; 12+ messages in thread
From: Linus Torvalds @ 2021-07-19 19:26 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: cluster-devel, Jan Kara, Linux Kernel Mailing List,
	Christoph Hellwig, Alexander Viro, linux-fsdevel, ocfs2-devel

On Sun, Jul 18, 2021 at 3:39 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> Introduce a new fault_in_iov_iter helper for manually faulting in an iterator.
> Other than fault_in_pages_writeable(), this function is non-destructive.

You mean "Unlike" rather than "Other than" (also in the comment of the patch).

This is fairly inefficient, but as long as it's the exceptional case,
that's fine. It might be worth making that very explicit, so that
people don't try to use it normally.

                Linus

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Ocfs2-devel] [PATCH v2 5/6] iov_iter: Introduce ITER_FLAG_FAST_ONLY flag
  2021-07-18 22:39 ` [Ocfs2-devel] [PATCH v2 5/6] iov_iter: Introduce ITER_FLAG_FAST_ONLY flag Andreas Gruenbacher
@ 2021-07-19 19:29   ` Linus Torvalds
  2021-07-19 20:17     ` Al Viro
  0 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2021-07-19 19:29 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: cluster-devel, Jan Kara, Linux Kernel Mailing List,
	Christoph Hellwig, Alexander Viro, linux-fsdevel, ocfs2-devel

On Sun, Jul 18, 2021 at 3:40 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> Introduce a new ITER_FLAG_FAST_ONLY flag

I think the code is fine, but I think it might be best to call this
"ITER_FLAG_NOIO" or something like that.

The "FAST_ONLY" name makes sense in the context of
"get_user_pages_fast()" where we have that "fast" naming (and the long
history too). But I don't think it makes much sense as a name in the
context of iov_iter.

                   Linus

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Ocfs2-devel] [PATCH v2 0/6] gfs2: Fix mmap + page fault deadlocks
  2021-07-18 22:39 [Ocfs2-devel] [PATCH v2 0/6] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (5 preceding siblings ...)
  2021-07-18 22:39 ` [Ocfs2-devel] [PATCH v2 6/6] gfs2: Fix mmap + page fault deadlocks for direct I/O Andreas Gruenbacher
@ 2021-07-19 19:31 ` Linus Torvalds
  2021-07-19 20:14 ` Al Viro
  7 siblings, 0 replies; 12+ messages in thread
From: Linus Torvalds @ 2021-07-19 19:31 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: cluster-devel, Jan Kara, Linux Kernel Mailing List,
	Christoph Hellwig, Alexander Viro, linux-fsdevel, ocfs2-devel

On Sun, Jul 18, 2021 at 3:39 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> here's an update to the gfs2 mmap + page faults fix that implements
> Linus's suggestion of disabling page faults while we're holding the
> inode glock.

Apart from some wording/naming issues, I think this looks a _lot_
better, and should fix the fundamental and underlying deadlock
properly.

So Ack from me with the trivial suggestions I sent to the individual patches.

            Linus

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Ocfs2-devel] [PATCH v2 0/6] gfs2: Fix mmap + page fault deadlocks
  2021-07-18 22:39 [Ocfs2-devel] [PATCH v2 0/6] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (6 preceding siblings ...)
  2021-07-19 19:31 ` [Ocfs2-devel] [PATCH v2 0/6] gfs2: Fix mmap + page fault deadlocks Linus Torvalds
@ 2021-07-19 20:14 ` Al Viro
  7 siblings, 0 replies; 12+ messages in thread
From: Al Viro @ 2021-07-19 20:14 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: cluster-devel, Jan Kara, linux-kernel, Christoph Hellwig,
	linux-fsdevel, Linus Torvalds, ocfs2-devel

On Mon, Jul 19, 2021 at 12:39:26AM +0200, Andreas Gruenbacher wrote:
> Hi Linus et al.,
> 
> here's an update to the gfs2 mmap + page faults fix that implements
> Linus's suggestion of disabling page faults while we're holding the
> inode glock.
> 
> This passes fstests except for test generic/095, which fails due to an
> independent bug in the iov_iter code.  I'm currently trying to get
> initial feedback from Al and Christoph on that.
> 
> Any feedback would be welcome.

What tree is that against?  Because in mainline your #5 sure as hell
won't apply...

There uio.h has
enum iter_type {
        /* iter types */
        ITER_IOVEC,
        ITER_KVEC,
        ITER_BVEC,
        ITER_PIPE,
        ITER_XARRAY,
        ITER_DISCARD,
};

and your series assumes

enum iter_type {
        /* iter types */
        ITER_IOVEC = 4,
        ITER_KVEC = 8,
        ITER_BVEC = 16,
        ITER_PIPE = 32,
        ITER_DISCARD = 64,
};

What had that been tested with?

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Ocfs2-devel] [PATCH v2 5/6] iov_iter: Introduce ITER_FLAG_FAST_ONLY flag
  2021-07-19 19:29   ` Linus Torvalds
@ 2021-07-19 20:17     ` Al Viro
  0 siblings, 0 replies; 12+ messages in thread
From: Al Viro @ 2021-07-19 20:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: cluster-devel, Jan Kara, Andreas Gruenbacher,
	Linux Kernel Mailing List, Christoph Hellwig, linux-fsdevel,
	ocfs2-devel

On Mon, Jul 19, 2021 at 12:29:35PM -0700, Linus Torvalds wrote:
> On Sun, Jul 18, 2021 at 3:40 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > Introduce a new ITER_FLAG_FAST_ONLY flag
> 
> I think the code is fine, but I think it might be best to call this
> "ITER_FLAG_NOIO" or something like that.
> 
> The "FAST_ONLY" name makes sense in the context of
> "get_user_pages_fast()" where we have that "fast" naming (and the long
> history too). But I don't think it makes much sense as a name in the
> context of iov_iter.

This code has never been tested with current lib/iov_iter.c as it is
in mainline.  Or had been in -next during the last cycle.  It won't
apply at all.

Sure, I can try to port that series over to the current mainline, but
I'd rather see that done (and tested) by the series author...

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2021-07-19 20:23 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-18 22:39 [Ocfs2-devel] [PATCH v2 0/6] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
2021-07-18 22:39 ` [Ocfs2-devel] [PATCH v2 1/6] iov_iter: Introduce fault_in_iov_iter helper Andreas Gruenbacher
2021-07-19 19:26   ` Linus Torvalds
2021-07-18 22:39 ` [Ocfs2-devel] [PATCH v2 2/6] iomap: Fix iomap_dio_rw return value for page faults Andreas Gruenbacher
2021-07-18 22:39 ` [Ocfs2-devel] [PATCH v2 3/6] gfs2: Add wrapper for iomap_file_buffered_write Andreas Gruenbacher
2021-07-18 22:39 ` [Ocfs2-devel] [PATCH v2 4/6] gfs2: Fix mmap + page fault deadlocks for buffered I/O Andreas Gruenbacher
2021-07-18 22:39 ` [Ocfs2-devel] [PATCH v2 5/6] iov_iter: Introduce ITER_FLAG_FAST_ONLY flag Andreas Gruenbacher
2021-07-19 19:29   ` Linus Torvalds
2021-07-19 20:17     ` Al Viro
2021-07-18 22:39 ` [Ocfs2-devel] [PATCH v2 6/6] gfs2: Fix mmap + page fault deadlocks for direct I/O Andreas Gruenbacher
2021-07-19 19:31 ` [Ocfs2-devel] [PATCH v2 0/6] gfs2: Fix mmap + page fault deadlocks Linus Torvalds
2021-07-19 20:14 ` Al Viro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).