linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/7] gfs2: Fix mmap + page fault deadlocks
@ 2021-07-23 20:58 Andreas Gruenbacher
  2021-07-23 20:58 ` [PATCH v3 1/7] iov_iter: Introduce fault_in_iov_iter helper Andreas Gruenbacher
                   ` (6 more replies)
  0 siblings, 7 replies; 18+ messages in thread
From: Andreas Gruenbacher @ 2021-07-23 20:58 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher

Hi Linus et al.,

here's an update of my gfs2 mmap + page fault fixes (against -rc2).
From my point of view,

  * these two are ready and need to be looked at by Al:

    iov_iter: Introduce fault_in_iov_iter helper
    iov_iter: Introduce noio flag to disable page faults

  * these two need to be reviewed by Christoph at least:

    iomap: Fix iomap_dio_rw return value for user copies
    iomap: Support restarting direct I/O requests after user copy failures

Thanks a lot,
Andreas

Andreas Gruenbacher (7):
  iov_iter: Introduce fault_in_iov_iter helper
  gfs2: Add wrapper for iomap_file_buffered_write
  gfs2: Fix mmap + page fault deadlocks for buffered I/O
  iomap: Fix iomap_dio_rw return value for user copies
  iomap: Support restarting direct I/O requests after user copy failures
  iov_iter: Introduce noio flag to disable page faults
  gfs2: Fix mmap + page fault deadlocks for direct I/O

 fs/gfs2/file.c       | 77 ++++++++++++++++++++++++++++++++++++++++----
 fs/iomap/direct-io.c | 13 ++++++--
 include/linux/mm.h   |  3 ++
 include/linux/uio.h  |  2 ++
 lib/iov_iter.c       | 62 ++++++++++++++++++++++++++++++++---
 mm/gup.c             | 68 ++++++++++++++++++++++++++++++++++++++
 6 files changed, 211 insertions(+), 14 deletions(-)

-- 
2.26.3


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v3 1/7] iov_iter: Introduce fault_in_iov_iter helper
  2021-07-23 20:58 [PATCH v3 0/7] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
@ 2021-07-23 20:58 ` Andreas Gruenbacher
  2021-07-23 23:40   ` Linus Torvalds
                     ` (2 more replies)
  2021-07-23 20:58 ` [PATCH v3 2/7] gfs2: Add wrapper for iomap_file_buffered_write Andreas Gruenbacher
                   ` (5 subsequent siblings)
  6 siblings, 3 replies; 18+ messages in thread
From: Andreas Gruenbacher @ 2021-07-23 20:58 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher

Introduce a new fault_in_iov_iter helper for manually faulting in an iterator.
Other than fault_in_pages_writeable(), this function is non-destructive.

We'll use fault_in_iov_iter in gfs2 once we've determined that the iterator
passed to .read_iter or .write_iter isn't in memory.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 include/linux/mm.h  |  3 ++
 include/linux/uio.h |  1 +
 lib/iov_iter.c      | 42 ++++++++++++++++++++++++++++
 mm/gup.c            | 68 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 114 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7ca22e6e694a..14b1353995e2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1840,6 +1840,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
 int pin_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages);
 
+unsigned long fault_in_user_pages(unsigned long start, unsigned long len,
+				  bool write);
+
 int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
 int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
 			struct task_struct *task, bool bypass_rlim);
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 82c3c3e819e0..152b3605e86c 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -120,6 +120,7 @@ size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
 void iov_iter_advance(struct iov_iter *i, size_t bytes);
 void iov_iter_revert(struct iov_iter *i, size_t bytes);
 int iov_iter_fault_in_readable(const struct iov_iter *i, size_t bytes);
+size_t fault_in_iov_iter(const struct iov_iter *i);
 size_t iov_iter_single_seg_count(const struct iov_iter *i);
 size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 20dc3d800573..7221665f7ac4 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -460,6 +460,48 @@ int iov_iter_fault_in_readable(const struct iov_iter *i, size_t bytes)
 }
 EXPORT_SYMBOL(iov_iter_fault_in_readable);
 
+/**
+ * fault_in_iov_iter - fault in iov iterator for reading / writing
+ * @i: iterator
+ *
+ * Faults in the iterator using get_user_pages, i.e., without triggering
+ * hardware page faults.
+ *
+ * This is primarily useful when we know that some or all of the pages in @i
+ * aren't in memory.  For iterators that are likely to be in memory,
+ * fault_in_pages_readable() may be more appropriate.
+ *
+ * Other than fault_in_pages_writeable(), this function is non-destructive even
+ * when faulting in pages for writing.
+ *
+ * Returns the number of bytes faulted in, or the size of @i if @i doesn't need
+ * faulting in.
+ */
+size_t fault_in_iov_iter(const struct iov_iter *i)
+{
+	size_t count = i->count;
+	const struct iovec *p;
+	size_t ret = 0, skip;
+
+	if (iter_is_iovec(i)) {
+		for (p = i->iov, skip = i->iov_offset; count; p++, skip = 0) {
+			unsigned long len = min(count, p->iov_len - skip);
+			unsigned long start, l;
+
+			if (unlikely(!len))
+				continue;
+			start = (unsigned long)p->iov_base + skip;
+			l = fault_in_user_pages(start, len, iov_iter_rw(i) != WRITE);
+			ret += l;
+			if (unlikely(l != len))
+				break;
+			count -= l;
+		}
+	}
+	return ret;
+}
+EXPORT_SYMBOL(fault_in_iov_iter);
+
 void iov_iter_init(struct iov_iter *i, unsigned int direction,
 			const struct iovec *iov, unsigned long nr_segs,
 			size_t count)
diff --git a/mm/gup.c b/mm/gup.c
index 42b8b1fa6521..033d66586c62 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1669,6 +1669,74 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
 }
 #endif /* !CONFIG_MMU */
 
+/**
+ * fault_in_user_pages - fault in an address range for reading / writing
+ * @start: start of address range
+ * @len: length of address range
+ * @write: fault in for writing
+ *
+ * Note that we don't pin or otherwise hold the pages referenced that we fault
+ * in.  There's no guarantee that they'll stay in memory for any duration of
+ * time.
+ *
+ * Returns the number of bytes faulted in from @start.
+ */
+unsigned long fault_in_user_pages(unsigned long start, unsigned long len,
+				  bool write)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma = NULL;
+	unsigned long end, nstart, nend;
+	int locked = 0;
+	int gup_flags;
+
+	/*
+	 * FIXME: Make sure this function doesn't succeed for pages that cannot
+	 * be accessed; otherwise we could end up in a loop trying to fault in
+	 * and then access the pages.  (It's okay if a page gets evicted and we
+	 * need more than one retry.)
+	 */
+
+	/*
+	 * FIXME: Are these the right FOLL_* flags?
+	 */
+
+	gup_flags = FOLL_TOUCH | FOLL_POPULATE;
+	if (write)
+		gup_flags |= FOLL_WRITE;
+
+	end = PAGE_ALIGN(start + len);
+	for (nstart = start & PAGE_MASK; nstart < end; nstart = nend) {
+		unsigned long nr_pages;
+		long ret;
+
+		if (!locked) {
+			locked = 1;
+			mmap_read_lock(mm);
+			vma = find_vma(mm, nstart);
+		} else if (nstart >= vma->vm_end)
+			vma = vma->vm_next;
+		if (!vma || vma->vm_start >= end)
+			break;
+		nend = min(end, vma->vm_end);
+		if (vma->vm_flags & (VM_IO | VM_PFNMAP))
+			continue;
+		if (nstart < vma->vm_start)
+			nstart = vma->vm_start;
+		nr_pages = (nend - nstart) / PAGE_SIZE;
+		ret = __get_user_pages_locked(mm, nstart, nr_pages,
+					      NULL, NULL, &locked, gup_flags);
+		if (ret <= 0)
+			break;
+		nend = nstart + ret * PAGE_SIZE;
+	}
+	if (locked)
+		mmap_read_unlock(mm);
+	if (nstart > start)
+		return min(nstart - start, len);
+	return 0;
+}
+
 /**
  * get_dump_page() - pin user page in memory while writing it to core dump
  * @addr: user address
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v3 2/7] gfs2: Add wrapper for iomap_file_buffered_write
  2021-07-23 20:58 [PATCH v3 0/7] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
  2021-07-23 20:58 ` [PATCH v3 1/7] iov_iter: Introduce fault_in_iov_iter helper Andreas Gruenbacher
@ 2021-07-23 20:58 ` Andreas Gruenbacher
  2021-07-23 20:58 ` [PATCH v3 3/7] gfs2: Fix mmap + page fault deadlocks for buffered I/O Andreas Gruenbacher
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Andreas Gruenbacher @ 2021-07-23 20:58 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher

Add a wrapper around iomap_file_buffered_write.  We'll add code for when
the operation needs to be retried here later.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/file.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 84ec053d43b4..55ec1cadc9e6 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -876,6 +876,18 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return written ? written : ret;
 }
 
+static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct file *file = iocb->ki_filp;
+	struct inode *inode = file_inode(file);
+	ssize_t ret;
+
+	current->backing_dev_info = inode_to_bdi(inode);
+	ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
+	current->backing_dev_info = NULL;
+	return ret;
+}
+
 /**
  * gfs2_file_write_iter - Perform a write to a file
  * @iocb: The io context
@@ -927,9 +939,7 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			goto out_unlock;
 
 		iocb->ki_flags |= IOCB_DSYNC;
-		current->backing_dev_info = inode_to_bdi(inode);
-		buffered = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
-		current->backing_dev_info = NULL;
+		buffered = gfs2_file_buffered_write(iocb, from);
 		if (unlikely(buffered <= 0)) {
 			if (!ret)
 				ret = buffered;
@@ -951,9 +961,7 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		if (!ret || ret2 > 0)
 			ret += ret2;
 	} else {
-		current->backing_dev_info = inode_to_bdi(inode);
-		ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
-		current->backing_dev_info = NULL;
+		ret = gfs2_file_buffered_write(iocb, from);
 		if (likely(ret > 0)) {
 			iocb->ki_pos += ret;
 			ret = generic_write_sync(iocb, ret);
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v3 3/7] gfs2: Fix mmap + page fault deadlocks for buffered I/O
  2021-07-23 20:58 [PATCH v3 0/7] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
  2021-07-23 20:58 ` [PATCH v3 1/7] iov_iter: Introduce fault_in_iov_iter helper Andreas Gruenbacher
  2021-07-23 20:58 ` [PATCH v3 2/7] gfs2: Add wrapper for iomap_file_buffered_write Andreas Gruenbacher
@ 2021-07-23 20:58 ` Andreas Gruenbacher
  2021-07-23 20:58 ` [PATCH v3 4/7] iomap: Fix iomap_dio_rw return value for user copies Andreas Gruenbacher
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Andreas Gruenbacher @ 2021-07-23 20:58 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher

In the .read_iter and .write_iter file operations, we're accessing
user-space memory while holding the inodes glock.  There's a possibility
that the memory is mapped to the same file, in which case we'd recurse on
the same glock.

More complex scenarios can involve multiple glocks, processes, and even cluster
nodes.

Avoids these kinds of problems by disabling page faults while holding a glock.
If a page fault occurs, we either end up with a partial read or write, or with
-EFAULT if nothing could be read or written.  In that case, we drop the glock,
fault in the requested pages manually, and repeat the operation.

This locking problem in gfs2 was originally reported by Jan Kara.  Linus came
up with the proposal to disable page faults.  Many thanks to Al Viro and
Matthew Wilcox for their feedback as well.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/file.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 55ec1cadc9e6..f66ac7f56f6d 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -843,6 +843,12 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	size_t written = 0;
 	ssize_t ret;
 
+	/*
+	 * In this function, we disable page faults when we're holding the
+	 * inode glock while doing I/O.  If a page fault occurs, we drop the
+	 * inode glock, fault in the pages manually, and then we retry.
+	 */
+
 	if (iocb->ki_flags & IOCB_DIRECT) {
 		ret = gfs2_file_direct_read(iocb, to, &gh);
 		if (likely(ret != -ENOTBLK))
@@ -864,13 +870,19 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	}
 	ip = GFS2_I(iocb->ki_filp->f_mapping->host);
 	gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
+retry:
 	ret = gfs2_glock_nq(&gh);
 	if (ret)
 		goto out_uninit;
+	pagefault_disable();
 	ret = generic_file_read_iter(iocb, to);
+	pagefault_enable();
 	if (ret > 0)
 		written += ret;
 	gfs2_glock_dq(&gh);
+	if (unlikely(iov_iter_count(to) && (ret > 0 || ret == -EFAULT)) &&
+	    fault_in_iov_iter(to))
+		goto retry;
 out_uninit:
 	gfs2_holder_uninit(&gh);
 	return written ? written : ret;
@@ -882,9 +894,20 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *fro
 	struct inode *inode = file_inode(file);
 	ssize_t ret;
 
+	/*
+	 * In this function, we disable page faults when we're holding the
+	 * inode glock while doing I/O.  If a page fault occurs, we drop the
+	 * inode glock, fault in the pages manually, and then we retry.
+	 */
+
+retry:
 	current->backing_dev_info = inode_to_bdi(inode);
+	pagefault_disable();
 	ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
+	pagefault_enable();
 	current->backing_dev_info = NULL;
+	if (unlikely(ret == -EFAULT) && fault_in_iov_iter(from))
+		goto retry;
 	return ret;
 }
 
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v3 4/7] iomap: Fix iomap_dio_rw return value for user copies
  2021-07-23 20:58 [PATCH v3 0/7] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (2 preceding siblings ...)
  2021-07-23 20:58 ` [PATCH v3 3/7] gfs2: Fix mmap + page fault deadlocks for buffered I/O Andreas Gruenbacher
@ 2021-07-23 20:58 ` Andreas Gruenbacher
  2021-07-23 20:58 ` [PATCH v3 5/7] iomap: Support restarting direct I/O requests after user copy failures Andreas Gruenbacher
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Andreas Gruenbacher @ 2021-07-23 20:58 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher

When a user copy fails in one of the helpers of iomap_dio_rw, fail with -EFAULT
instead of returning 0.  This matches what iomap_dio_bio_actor already returns
when it gets an -EFAULT from bio_iov_iter_get_pages.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/iomap/direct-io.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 9398b8c31323..cc0b4bc8861b 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -370,7 +370,7 @@ iomap_dio_hole_actor(loff_t length, struct iomap_dio *dio)
 {
 	length = iov_iter_zero(length, dio->submit.iter);
 	dio->size += length;
-	return length;
+	return length ?: -EFAULT;
 }
 
 static loff_t
@@ -397,7 +397,7 @@ iomap_dio_inline_actor(struct inode *inode, loff_t pos, loff_t length,
 		copied = copy_to_iter(iomap->inline_data + pos, length, iter);
 	}
 	dio->size += copied;
-	return copied;
+	return copied ?: -EFAULT;
 }
 
 static loff_t
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v3 5/7] iomap: Support restarting direct I/O requests after user copy failures
  2021-07-23 20:58 [PATCH v3 0/7] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (3 preceding siblings ...)
  2021-07-23 20:58 ` [PATCH v3 4/7] iomap: Fix iomap_dio_rw return value for user copies Andreas Gruenbacher
@ 2021-07-23 20:58 ` Andreas Gruenbacher
  2021-07-26 17:19   ` Jan Kara
  2021-07-23 20:58 ` [PATCH v3 6/7] iov_iter: Introduce noio flag to disable page faults Andreas Gruenbacher
  2021-07-23 20:58 ` [PATCH v3 7/7] gfs2: Fix mmap + page fault deadlocks for direct I/O Andreas Gruenbacher
  6 siblings, 1 reply; 18+ messages in thread
From: Andreas Gruenbacher @ 2021-07-23 20:58 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher

In __iomap_dio_rw, when iomap_apply returns an -EFAULT error, complete the
request synchronously and reset the iterator to the start position.  This
allows callers to deal with the failure and retry the operation.

In gfs2, we need to disable page faults while we're holding glocks to prevent
deadlocks.  This patch is the minimum solution I could find to make
iomap_dio_rw work with page faults disabled.  It's still expensive because any
I/O that was carried out before hitting -EFAULT needs to be retried.

A possible improvement would be to add an IOMAP_DIO_FAULT_RETRY or similar flag
that would allow iomap_dio_rw to return a short result when hitting -EFAULT.
Callers could then retry only the rest of the request after dealing with the
page fault.

Asynchronous requests turn into synchronous requests up to the point of the
page fault in any case, but they could be retried asynchronously after dealing
with the page fault.  To make that work, the completion notification would have
to include the bytes read or written before the page fault(s) as well, and we'd
need an additional iomap_dio_rw argument for that.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/iomap/direct-io.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index cc0b4bc8861b..b0a494211bb4 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -561,6 +561,15 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		ret = iomap_apply(inode, pos, count, iomap_flags, ops, dio,
 				iomap_dio_actor);
 		if (ret <= 0) {
+			if (ret == -EFAULT) {
+				/*
+				 * To allow retrying the request, fail
+				 * synchronously and reset the iterator.
+				 */
+				wait_for_completion = true;
+				iov_iter_revert(dio->submit.iter, dio->size);
+			}
+
 			/* magic error code to fall back to buffered I/O */
 			if (ret == -ENOTBLK) {
 				wait_for_completion = true;
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v3 6/7] iov_iter: Introduce noio flag to disable page faults
  2021-07-23 20:58 [PATCH v3 0/7] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (4 preceding siblings ...)
  2021-07-23 20:58 ` [PATCH v3 5/7] iomap: Support restarting direct I/O requests after user copy failures Andreas Gruenbacher
@ 2021-07-23 20:58 ` Andreas Gruenbacher
  2021-07-23 20:58 ` [PATCH v3 7/7] gfs2: Fix mmap + page fault deadlocks for direct I/O Andreas Gruenbacher
  6 siblings, 0 replies; 18+ messages in thread
From: Andreas Gruenbacher @ 2021-07-23 20:58 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher

Introduce a new noio flag to indicate to get_user_pages to use the
FOLL_FAST_ONLY flag.  This will cause get_user_pages to fail when it
would otherwise fault in a page.

Currently, the noio flag is only checked in iov_iter_get_pages and
iov_iter_get_pages_alloc.  This is enough for iomaop_dio_rw, but it
may make sense to check for this flag in other contexts as well.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 include/linux/uio.h |  1 +
 lib/iov_iter.c      | 20 +++++++++++++++-----
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 152b3605e86c..8de6354ade14 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -29,6 +29,7 @@ enum iter_type {
 
 struct iov_iter {
 	u8 iter_type;
+	bool noio;
 	bool data_source;
 	size_t iov_offset;
 	size_t count;
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 7221665f7ac4..a20426cedf60 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -509,6 +509,7 @@ void iov_iter_init(struct iov_iter *i, unsigned int direction,
 	WARN_ON(direction & ~(READ | WRITE));
 	*i = (struct iov_iter) {
 		.iter_type = ITER_IOVEC,
+		.noio = false,
 		.data_source = direction,
 		.iov = iov,
 		.nr_segs = nr_segs,
@@ -1519,13 +1520,17 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 		return 0;
 
 	if (likely(iter_is_iovec(i))) {
+		unsigned int gup_flags = 0;
 		unsigned long addr;
 
+		if (iov_iter_rw(i) != WRITE)
+			gup_flags |= FOLL_WRITE;
+		if (i->noio)
+			gup_flags |= FOLL_FAST_ONLY;
+
 		addr = first_iovec_segment(i, &len, start, maxsize, maxpages);
 		n = DIV_ROUND_UP(len, PAGE_SIZE);
-		res = get_user_pages_fast(addr, n,
-				iov_iter_rw(i) != WRITE ?  FOLL_WRITE : 0,
-				pages);
+		res = get_user_pages_fast(addr, n, gup_flags, pages);
 		if (unlikely(res <= 0))
 			return res;
 		return (res == n ? len : res * PAGE_SIZE) - *start;
@@ -1641,15 +1646,20 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		return 0;
 
 	if (likely(iter_is_iovec(i))) {
+		unsigned int gup_flags = 0;
 		unsigned long addr;
 
+		if (iov_iter_rw(i) != WRITE)
+			gup_flags |= FOLL_WRITE;
+		if (i->noio)
+			gup_flags |= FOLL_FAST_ONLY;
+
 		addr = first_iovec_segment(i, &len, start, maxsize, ~0U);
 		n = DIV_ROUND_UP(len, PAGE_SIZE);
 		p = get_pages_array(n);
 		if (!p)
 			return -ENOMEM;
-		res = get_user_pages_fast(addr, n,
-				iov_iter_rw(i) != WRITE ?  FOLL_WRITE : 0, p);
+		res = get_user_pages_fast(addr, n, gup_flags, p);
 		if (unlikely(res <= 0)) {
 			kvfree(p);
 			*pages = ZERO_SIZE_PTR;
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v3 7/7] gfs2: Fix mmap + page fault deadlocks for direct I/O
  2021-07-23 20:58 [PATCH v3 0/7] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (5 preceding siblings ...)
  2021-07-23 20:58 ` [PATCH v3 6/7] iov_iter: Introduce noio flag to disable page faults Andreas Gruenbacher
@ 2021-07-23 20:58 ` Andreas Gruenbacher
  2021-07-26 17:02   ` Jan Kara
  6 siblings, 1 reply; 18+ messages in thread
From: Andreas Gruenbacher @ 2021-07-23 20:58 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher

Also disable page faults during direct I/O requests and implement the same kind
of retry logic as in the buffered I/O case.

Direct I/O requests differ from buffered I/O requests in that they use
bio_iov_iter_get_pages for grabbing page references and faulting in pages
instead of triggering real page faults.  Those manual page faults can be
disabled with the iocb->noio flag.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/file.c | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index f66ac7f56f6d..7986f3be69d2 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -782,21 +782,41 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
 	struct file *file = iocb->ki_filp;
 	struct gfs2_inode *ip = GFS2_I(file->f_mapping->host);
 	size_t count = iov_iter_count(to);
+	size_t written = 0;
 	ssize_t ret;
 
+	/*
+	 * In this function, we disable page faults when we're holding the
+	 * inode glock while doing I/O.  If a page fault occurs, we drop the
+	 * inode glock, fault in the pages manually, and then we retry.  Other
+	 * than in gfs2_file_read_iter, iomap_dio_rw can trigger implicit as
+	 * well as manual page faults, and we need to disable both kinds
+	 * separately.
+	 */
+
 	if (!count)
 		return 0; /* skip atime */
 
 	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh);
+retry:
 	ret = gfs2_glock_nq(gh);
 	if (ret)
 		goto out_uninit;
 
+	pagefault_disable();
+	to->noio = true;
 	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0);
+	to->noio = false;
+	pagefault_enable();
+
 	gfs2_glock_dq(gh);
+	if (ret > 0)
+		written += ret;
+	if (unlikely(ret == -EFAULT) && fault_in_iov_iter(to))
+		goto retry;
 out_uninit:
 	gfs2_holder_uninit(gh);
-	return ret;
+	return written ? written : ret;
 }
 
 static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
@@ -809,6 +829,12 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 	loff_t offset = iocb->ki_pos;
 	ssize_t ret;
 
+	/*
+	 * In this function, we disable page faults when we're holding the
+	 * inode glock while doing I/O.  If a page fault occurs, we drop the
+	 * inode glock, fault in the pages manually, and then we retry.
+	 */
+
 	/*
 	 * Deferred lock, even if its a write, since we do no allocation on
 	 * this path. All we need to change is the atime, and this lock mode
@@ -818,6 +844,7 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 	 * VFS does.
 	 */
 	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh);
+retry:
 	ret = gfs2_glock_nq(gh);
 	if (ret)
 		goto out_uninit;
@@ -826,11 +853,16 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 	if (offset + len > i_size_read(&ip->i_inode))
 		goto out;
 
+	from->noio = true;
 	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0);
+	from->noio = false;
+
 	if (ret == -ENOTBLK)
 		ret = 0;
 out:
 	gfs2_glock_dq(gh);
+	if (unlikely(ret == -EFAULT) && fault_in_iov_iter(from))
+		goto retry;
 out_uninit:
 	gfs2_holder_uninit(gh);
 	return ret;
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 1/7] iov_iter: Introduce fault_in_iov_iter helper
  2021-07-23 20:58 ` [PATCH v3 1/7] iov_iter: Introduce fault_in_iov_iter helper Andreas Gruenbacher
@ 2021-07-23 23:40   ` Linus Torvalds
  2021-07-24  7:51     ` Andreas Grünbacher
  2021-07-24  1:52   ` Al Viro
  2021-07-26 16:33   ` Jan Kara
  2 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2021-07-23 23:40 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Alexander Viro, Christoph Hellwig, Darrick J. Wong, Jan Kara,
	Matthew Wilcox, cluster-devel, linux-fsdevel,
	Linux Kernel Mailing List, ocfs2-devel

On Fri, Jul 23, 2021 at 1:58 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> Introduce a new fault_in_iov_iter helper for manually faulting in an iterator.
> Other than fault_in_pages_writeable(), this function is non-destructive.

Again, as I pointed out in the previous version, "Other than" is not
sensible language.

You mean "Unlike".

Same issue in the comment:

> + * Other than fault_in_pages_writeable(), this function is non-destructive even
> + * when faulting in pages for writing.

It really should be

  "Unlike fault_in_pages_writeable(), this function .."

to parse correctly.

I understand what you mean, but only because I know what
fault_in_pages_writeable() does and what the issue was.

And in a year or two, I might have forgotten, and wonder what you meant.

             Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 1/7] iov_iter: Introduce fault_in_iov_iter helper
  2021-07-23 20:58 ` [PATCH v3 1/7] iov_iter: Introduce fault_in_iov_iter helper Andreas Gruenbacher
  2021-07-23 23:40   ` Linus Torvalds
@ 2021-07-24  1:52   ` Al Viro
  2021-07-24  8:05     ` Andreas Grünbacher
  2021-07-26 16:33   ` Jan Kara
  2 siblings, 1 reply; 18+ messages in thread
From: Al Viro @ 2021-07-24  1:52 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Linus Torvalds, Christoph Hellwig, Darrick J. Wong, Jan Kara,
	Matthew Wilcox, cluster-devel, linux-fsdevel, linux-kernel,
	ocfs2-devel

On Fri, Jul 23, 2021 at 10:58:34PM +0200, Andreas Gruenbacher wrote:
> Introduce a new fault_in_iov_iter helper for manually faulting in an iterator.
> Other than fault_in_pages_writeable(), this function is non-destructive.
> 
> We'll use fault_in_iov_iter in gfs2 once we've determined that the iterator
> passed to .read_iter or .write_iter isn't in memory.

Hmm...  I suspect that this is going to be much heavier for read access
than the existing variant.  Do we ever want it for anything other than
writes?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 1/7] iov_iter: Introduce fault_in_iov_iter helper
  2021-07-23 23:40   ` Linus Torvalds
@ 2021-07-24  7:51     ` Andreas Grünbacher
  0 siblings, 0 replies; 18+ messages in thread
From: Andreas Grünbacher @ 2021-07-24  7:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Gruenbacher, Alexander Viro, Christoph Hellwig,
	Darrick J. Wong, Jan Kara, Matthew Wilcox, cluster-devel,
	linux-fsdevel, Linux Kernel Mailing List, ocfs2-devel

Am Sa., 24. Juli 2021 um 01:41 Uhr schrieb Linus Torvalds
<torvalds@linux-foundation.org>:
> On Fri, Jul 23, 2021 at 1:58 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > Introduce a new fault_in_iov_iter helper for manually faulting in an iterator.
> > Other than fault_in_pages_writeable(), this function is non-destructive.
>
> Again, as I pointed out in the previous version, "Other than" is not
> sensible language.

Thanks for pointing this out a second time. I have no idea how I
screwed up fixing that.

Andreas

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 1/7] iov_iter: Introduce fault_in_iov_iter helper
  2021-07-24  1:52   ` Al Viro
@ 2021-07-24  8:05     ` Andreas Grünbacher
  0 siblings, 0 replies; 18+ messages in thread
From: Andreas Grünbacher @ 2021-07-24  8:05 UTC (permalink / raw)
  To: Al Viro
  Cc: Andreas Gruenbacher, Linus Torvalds, Christoph Hellwig,
	Darrick J. Wong, Jan Kara, Matthew Wilcox, cluster-devel,
	Linux FS-devel Mailing List, Linux Kernel Mailing List,
	ocfs2-devel

Am Sa., 24. Juli 2021 um 03:53 Uhr schrieb Al Viro <viro@zeniv.linux.org.uk>:
> On Fri, Jul 23, 2021 at 10:58:34PM +0200, Andreas Gruenbacher wrote:
> > Introduce a new fault_in_iov_iter helper for manually faulting in an iterator.
> > Other than fault_in_pages_writeable(), this function is non-destructive.
> >
> > We'll use fault_in_iov_iter in gfs2 once we've determined that the iterator
> > passed to .read_iter or .write_iter isn't in memory.
>
> Hmm...  I suspect that this is going to be much heavier for read access
> than the existing variant.  Do we ever want it for anything other than
> writes?

I don't know if it actually is slower when pages need to be faulted
in, but I'm fine turning it into a write-only function.

Thanks,
Andreas

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 1/7] iov_iter: Introduce fault_in_iov_iter helper
  2021-07-23 20:58 ` [PATCH v3 1/7] iov_iter: Introduce fault_in_iov_iter helper Andreas Gruenbacher
  2021-07-23 23:40   ` Linus Torvalds
  2021-07-24  1:52   ` Al Viro
@ 2021-07-26 16:33   ` Jan Kara
  2021-07-26 17:15     ` Linus Torvalds
  2 siblings, 1 reply; 18+ messages in thread
From: Jan Kara @ 2021-07-26 16:33 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Linus Torvalds, Alexander Viro, Christoph Hellwig,
	Darrick J. Wong, Jan Kara, Matthew Wilcox, cluster-devel,
	linux-fsdevel, linux-kernel, ocfs2-devel

On Fri 23-07-21 22:58:34, Andreas Gruenbacher wrote:
> Introduce a new fault_in_iov_iter helper for manually faulting in an iterator.
> Other than fault_in_pages_writeable(), this function is non-destructive.
> 
> We'll use fault_in_iov_iter in gfs2 once we've determined that the iterator
> passed to .read_iter or .write_iter isn't in memory.
> 
> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
...
> +unsigned long fault_in_user_pages(unsigned long start, unsigned long len,
> +				  bool write)
> +{
> +	struct mm_struct *mm = current->mm;
> +	struct vm_area_struct *vma = NULL;
> +	unsigned long end, nstart, nend;
> +	int locked = 0;
> +	int gup_flags;
> +
> +	/*
> +	 * FIXME: Make sure this function doesn't succeed for pages that cannot
> +	 * be accessed; otherwise we could end up in a loop trying to fault in
> +	 * and then access the pages.  (It's okay if a page gets evicted and we
> +	 * need more than one retry.)
> +	 */
> +
> +	/*
> +	 * FIXME: Are these the right FOLL_* flags?
> +	 */

How about the FIXMEs here? I guess we should answer these questions before
merging and remove the comments. Regarding the first FIXME I tend to agree
that if we cannot fault in the first page, we should return the error
rather than returning 0 as you do now. OTOH the caller can check for 0 and
understand there's something wrong going on as well. But the error would be
probably a bit clearer.

> +
> +	gup_flags = FOLL_TOUCH | FOLL_POPULATE;

I don't think FOLL_POPULATE makes sense here. It makes sense only with
FOLL_MLOCK and determines whether mlock(2) should fault in missing pages or
not.

								Honza

> +	if (write)
> +		gup_flags |= FOLL_WRITE;
> +
> +	end = PAGE_ALIGN(start + len);
> +	for (nstart = start & PAGE_MASK; nstart < end; nstart = nend) {
> +		unsigned long nr_pages;
> +		long ret;
> +
> +		if (!locked) {
> +			locked = 1;
> +			mmap_read_lock(mm);
> +			vma = find_vma(mm, nstart);
> +		} else if (nstart >= vma->vm_end)
> +			vma = vma->vm_next;
> +		if (!vma || vma->vm_start >= end)
> +			break;
> +		nend = min(end, vma->vm_end);
> +		if (vma->vm_flags & (VM_IO | VM_PFNMAP))
> +			continue;
> +		if (nstart < vma->vm_start)
> +			nstart = vma->vm_start;
> +		nr_pages = (nend - nstart) / PAGE_SIZE;
> +		ret = __get_user_pages_locked(mm, nstart, nr_pages,
> +					      NULL, NULL, &locked, gup_flags);
> +		if (ret <= 0)
> +			break;
> +		nend = nstart + ret * PAGE_SIZE;
> +	}
> +	if (locked)
> +		mmap_read_unlock(mm);
> +	if (nstart > start)
> +		return min(nstart - start, len);
> +	return 0;
> +}
> +
>  /**
>   * get_dump_page() - pin user page in memory while writing it to core dump
>   * @addr: user address
> -- 
> 2.26.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 7/7] gfs2: Fix mmap + page fault deadlocks for direct I/O
  2021-07-23 20:58 ` [PATCH v3 7/7] gfs2: Fix mmap + page fault deadlocks for direct I/O Andreas Gruenbacher
@ 2021-07-26 17:02   ` Jan Kara
       [not found]     ` <CAHpGcMLOZhZ7tGrY7rcYWUwx12sY884T=eC-Ckna63PBmF=zwA@mail.gmail.com>
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Kara @ 2021-07-26 17:02 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Linus Torvalds, Alexander Viro, Christoph Hellwig,
	Darrick J. Wong, Jan Kara, Matthew Wilcox, cluster-devel,
	linux-fsdevel, linux-kernel, ocfs2-devel

On Fri 23-07-21 22:58:40, Andreas Gruenbacher wrote:
> Also disable page faults during direct I/O requests and implement the same kind
> of retry logic as in the buffered I/O case.
> 
> Direct I/O requests differ from buffered I/O requests in that they use
> bio_iov_iter_get_pages for grabbing page references and faulting in pages
> instead of triggering real page faults.  Those manual page faults can be
> disabled with the iocb->noio flag.
> 
> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
> ---
>  fs/gfs2/file.c | 34 +++++++++++++++++++++++++++++++++-
>  1 file changed, 33 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index f66ac7f56f6d..7986f3be69d2 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -782,21 +782,41 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
>  	struct file *file = iocb->ki_filp;
>  	struct gfs2_inode *ip = GFS2_I(file->f_mapping->host);
>  	size_t count = iov_iter_count(to);
> +	size_t written = 0;
>  	ssize_t ret;
>  
> +	/*
> +	 * In this function, we disable page faults when we're holding the
> +	 * inode glock while doing I/O.  If a page fault occurs, we drop the
> +	 * inode glock, fault in the pages manually, and then we retry.  Other
> +	 * than in gfs2_file_read_iter, iomap_dio_rw can trigger implicit as
> +	 * well as manual page faults, and we need to disable both kinds
> +	 * separately.
> +	 */
> +
>  	if (!count)
>  		return 0; /* skip atime */
>  
>  	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh);
> +retry:
>  	ret = gfs2_glock_nq(gh);
>  	if (ret)
>  		goto out_uninit;
>  
> +	pagefault_disable();

Is there any use in pagefault_disable() here? iomap_dio_rw() should not
trigger any page faults anyway, should it?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 1/7] iov_iter: Introduce fault_in_iov_iter helper
  2021-07-26 16:33   ` Jan Kara
@ 2021-07-26 17:15     ` Linus Torvalds
  0 siblings, 0 replies; 18+ messages in thread
From: Linus Torvalds @ 2021-07-26 17:15 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andreas Gruenbacher, Alexander Viro, Christoph Hellwig,
	Darrick J. Wong, Matthew Wilcox, cluster-devel, linux-fsdevel,
	Linux Kernel Mailing List, ocfs2-devel

On Mon, Jul 26, 2021 at 9:33 AM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 23-07-21 22:58:34, Andreas Gruenbacher wrote:
> > +     gup_flags = FOLL_TOUCH | FOLL_POPULATE;
>
> I don't think FOLL_POPULATE makes sense here. It makes sense only with
> FOLL_MLOCK and determines whether mlock(2) should fault in missing pages or
> not.

Yeah, it won't hurt, but FOLL_POPULATE doesn't actually do anything
unless FOLL_MLOCK is set. It is, as you say, a magic flag just for
mlock.

The only ones that should matter are FOLL_WRITE (for obvious reasons)
and FOLL_TOUCH (to set the accessed and dirty bits, rather than just
th protection bits)

                   Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 5/7] iomap: Support restarting direct I/O requests after user copy failures
  2021-07-23 20:58 ` [PATCH v3 5/7] iomap: Support restarting direct I/O requests after user copy failures Andreas Gruenbacher
@ 2021-07-26 17:19   ` Jan Kara
       [not found]     ` <CAHpGcMLtQ1=WOT1mTUS4=iWBwHLQ-EBzY=+XuSGJfu4gVPYTLw@mail.gmail.com>
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Kara @ 2021-07-26 17:19 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Linus Torvalds, Alexander Viro, Christoph Hellwig,
	Darrick J. Wong, Jan Kara, Matthew Wilcox, cluster-devel,
	linux-fsdevel, linux-kernel, ocfs2-devel

On Fri 23-07-21 22:58:38, Andreas Gruenbacher wrote:
> In __iomap_dio_rw, when iomap_apply returns an -EFAULT error, complete the
> request synchronously and reset the iterator to the start position.  This
> allows callers to deal with the failure and retry the operation.
> 
> In gfs2, we need to disable page faults while we're holding glocks to prevent
> deadlocks.  This patch is the minimum solution I could find to make
> iomap_dio_rw work with page faults disabled.  It's still expensive because any
> I/O that was carried out before hitting -EFAULT needs to be retried.
> 
> A possible improvement would be to add an IOMAP_DIO_FAULT_RETRY or similar flag
> that would allow iomap_dio_rw to return a short result when hitting -EFAULT.
> Callers could then retry only the rest of the request after dealing with the
> page fault.
> 
> Asynchronous requests turn into synchronous requests up to the point of the
> page fault in any case, but they could be retried asynchronously after dealing
> with the page fault.  To make that work, the completion notification would have
> to include the bytes read or written before the page fault(s) as well, and we'd
> need an additional iomap_dio_rw argument for that.
> 
> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
> ---
>  fs/iomap/direct-io.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index cc0b4bc8861b..b0a494211bb4 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -561,6 +561,15 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		ret = iomap_apply(inode, pos, count, iomap_flags, ops, dio,
>  				iomap_dio_actor);
>  		if (ret <= 0) {
> +			if (ret == -EFAULT) {
> +				/*
> +				 * To allow retrying the request, fail
> +				 * synchronously and reset the iterator.
> +				 */
> +				wait_for_completion = true;
> +				iov_iter_revert(dio->submit.iter, dio->size);
> +			}
> +

Hum, OK, but this means that if userspace submits large enough write, GFS2
will livelock trying to complete it? While other filesystems can just
submit multiple smaller bios constructed in iomap_apply() (paging in
different parts of the buffer) and thus complete the write?

								Honza

>  			/* magic error code to fall back to buffered I/O */
>  			if (ret == -ENOTBLK) {
>  				wait_for_completion = true;
> -- 
> 2.26.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 7/7] gfs2: Fix mmap + page fault deadlocks for direct I/O
       [not found]     ` <CAHpGcMLOZhZ7tGrY7rcYWUwx12sY884T=eC-Ckna63PBmF=zwA@mail.gmail.com>
@ 2021-07-26 18:00       ` Jan Kara
  0 siblings, 0 replies; 18+ messages in thread
From: Jan Kara @ 2021-07-26 18:00 UTC (permalink / raw)
  To: Andreas Grünbacher
  Cc: Jan Kara, Andreas Gruenbacher, Linus Torvalds, Alexander Viro,
	Christoph Hellwig, Darrick J. Wong, Matthew Wilcox,
	cluster-devel, Linux FS-devel Mailing List,
	Linux Kernel Mailing List, ocfs2-devel

On Mon 26-07-21 19:50:23, Andreas Grünbacher wrote:
> Jan Kara <jack@suse.cz> schrieb am Mo., 26. Juli 2021, 19:10:
> 
> > On Fri 23-07-21 22:58:40, Andreas Gruenbacher wrote:
> > > Also disable page faults during direct I/O requests and implement the
> > same kind
> > > of retry logic as in the buffered I/O case.
> > >
> > > Direct I/O requests differ from buffered I/O requests in that they use
> > > bio_iov_iter_get_pages for grabbing page references and faulting in pages
> > > instead of triggering real page faults.  Those manual page faults can be
> > > disabled with the iocb->noio flag.
> > >
> > > Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
> > > ---
> > >  fs/gfs2/file.c | 34 +++++++++++++++++++++++++++++++++-
> > >  1 file changed, 33 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> > > index f66ac7f56f6d..7986f3be69d2 100644
> > > --- a/fs/gfs2/file.c
> > > +++ b/fs/gfs2/file.c
> > > @@ -782,21 +782,41 @@ static ssize_t gfs2_file_direct_read(struct kiocb
> > *iocb, struct iov_iter *to,
> > >       struct file *file = iocb->ki_filp;
> > >       struct gfs2_inode *ip = GFS2_I(file->f_mapping->host);
> > >       size_t count = iov_iter_count(to);
> > > +     size_t written = 0;
> > >       ssize_t ret;
> > >
> > > +     /*
> > > +      * In this function, we disable page faults when we're holding the
> > > +      * inode glock while doing I/O.  If a page fault occurs, we drop
> > the
> > > +      * inode glock, fault in the pages manually, and then we retry.
> > Other
> > > +      * than in gfs2_file_read_iter, iomap_dio_rw can trigger implicit
> > as
> > > +      * well as manual page faults, and we need to disable both kinds
> > > +      * separately.
> > > +      */
> > > +
> > >       if (!count)
> > >               return 0; /* skip atime */
> > >
> > >       gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh);
> > > +retry:
> > >       ret = gfs2_glock_nq(gh);
> > >       if (ret)
> > >               goto out_uninit;
> > >
> > > +     pagefault_disable();
> >
> > Is there any use in pagefault_disable() here? iomap_dio_rw() should not
> > trigger any page faults anyway, should it?
> >
> 
> It can trigger physical page faults when reading from holes.

Aha, good point. Maybe even worth a comment at this site? Thanks for
explanation!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 5/7] iomap: Support restarting direct I/O requests after user copy failures
       [not found]     ` <CAHpGcMLtQ1=WOT1mTUS4=iWBwHLQ-EBzY=+XuSGJfu4gVPYTLw@mail.gmail.com>
@ 2021-07-26 18:08       ` Jan Kara
  0 siblings, 0 replies; 18+ messages in thread
From: Jan Kara @ 2021-07-26 18:08 UTC (permalink / raw)
  To: Andreas Grünbacher
  Cc: Jan Kara, Andreas Gruenbacher, Linus Torvalds, Alexander Viro,
	Christoph Hellwig, Darrick J. Wong, Matthew Wilcox,
	cluster-devel, Linux FS-devel Mailing List,
	Linux Kernel Mailing List, ocfs2-devel

On Mon 26-07-21 19:45:22, Andreas Grünbacher wrote:
> Jan Kara <jack@suse.cz> schrieb am Mo., 26. Juli 2021, 19:21:
> 
> > On Fri 23-07-21 22:58:38, Andreas Gruenbacher wrote:
> > > In __iomap_dio_rw, when iomap_apply returns an -EFAULT error, complete
> > the
> > > request synchronously and reset the iterator to the start position.  This
> > > allows callers to deal with the failure and retry the operation.
> > >
> > > In gfs2, we need to disable page faults while we're holding glocks to
> > prevent
> > > deadlocks.  This patch is the minimum solution I could find to make
> > > iomap_dio_rw work with page faults disabled.  It's still expensive
> > because any
> > > I/O that was carried out before hitting -EFAULT needs to be retried.
> > >
> > > A possible improvement would be to add an IOMAP_DIO_FAULT_RETRY or
> > similar flag
> > > that would allow iomap_dio_rw to return a short result when hitting
> > -EFAULT.
> > > Callers could then retry only the rest of the request after dealing with
> > the
> > > page fault.
> > >
> > > Asynchronous requests turn into synchronous requests up to the point of
> > the
> > > page fault in any case, but they could be retried asynchronously after
> > dealing
> > > with the page fault.  To make that work, the completion notification
> > would have
> > > to include the bytes read or written before the page fault(s) as well,
> > and we'd
> > > need an additional iomap_dio_rw argument for that.
> > >
> > > Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
> > > ---
> > >  fs/iomap/direct-io.c | 9 +++++++++
> > >  1 file changed, 9 insertions(+)
> > >
> > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > index cc0b4bc8861b..b0a494211bb4 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -561,6 +561,15 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter
> > *iter,
> > >               ret = iomap_apply(inode, pos, count, iomap_flags, ops, dio,
> > >                               iomap_dio_actor);
> > >               if (ret <= 0) {
> > > +                     if (ret == -EFAULT) {
> > > +                             /*
> > > +                              * To allow retrying the request, fail
> > > +                              * synchronously and reset the iterator.
> > > +                              */
> > > +                             wait_for_completion = true;
> > > +                             iov_iter_revert(dio->submit.iter,
> > dio->size);
> > > +                     }
> > > +
> >
> > Hum, OK, but this means that if userspace submits large enough write, GFS2
> > will livelock trying to complete it? While other filesystems can just
> > submit multiple smaller bios constructed in iomap_apply() (paging in
> > different parts of the buffer) and thus complete the write?
> >
> 
> No. First, this affects reads but not writes. We cannot just blindly repeat
> writes; when a page fault occurs in the middle of a write, the result will
> be a short write. For reads, the plan is to ads a flag to allow
> iomap_dio_rw to return a partial result when a page fault occurs.
> (Currently, it fails the entire request.) Then we can handle the page fault
> and complete the rest of the request.
> 
> The changes needed for that are simple on the iomap side, but we need to go
> through some gymnastics for handling the page fault without giving up the
> glock in the non-contended case. There will still be the potential for
> losing the lock and having to re-acquire it, in which case we'll actually
> have to repeat the entire read.

I've missed you've already sent out v4 (I'm catching up after vacation so
my mailbox is a bit of a mess). What's in there addresses my objection.
I'm sorry for the noise.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2021-07-26 18:08 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-23 20:58 [PATCH v3 0/7] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
2021-07-23 20:58 ` [PATCH v3 1/7] iov_iter: Introduce fault_in_iov_iter helper Andreas Gruenbacher
2021-07-23 23:40   ` Linus Torvalds
2021-07-24  7:51     ` Andreas Grünbacher
2021-07-24  1:52   ` Al Viro
2021-07-24  8:05     ` Andreas Grünbacher
2021-07-26 16:33   ` Jan Kara
2021-07-26 17:15     ` Linus Torvalds
2021-07-23 20:58 ` [PATCH v3 2/7] gfs2: Add wrapper for iomap_file_buffered_write Andreas Gruenbacher
2021-07-23 20:58 ` [PATCH v3 3/7] gfs2: Fix mmap + page fault deadlocks for buffered I/O Andreas Gruenbacher
2021-07-23 20:58 ` [PATCH v3 4/7] iomap: Fix iomap_dio_rw return value for user copies Andreas Gruenbacher
2021-07-23 20:58 ` [PATCH v3 5/7] iomap: Support restarting direct I/O requests after user copy failures Andreas Gruenbacher
2021-07-26 17:19   ` Jan Kara
     [not found]     ` <CAHpGcMLtQ1=WOT1mTUS4=iWBwHLQ-EBzY=+XuSGJfu4gVPYTLw@mail.gmail.com>
2021-07-26 18:08       ` Jan Kara
2021-07-23 20:58 ` [PATCH v3 6/7] iov_iter: Introduce noio flag to disable page faults Andreas Gruenbacher
2021-07-23 20:58 ` [PATCH v3 7/7] gfs2: Fix mmap + page fault deadlocks for direct I/O Andreas Gruenbacher
2021-07-26 17:02   ` Jan Kara
     [not found]     ` <CAHpGcMLOZhZ7tGrY7rcYWUwx12sY884T=eC-Ckna63PBmF=zwA@mail.gmail.com>
2021-07-26 18:00       ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).