linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 0/8] gfs2: Fix mmap + page fault deadlocks
@ 2021-07-24 19:34 Andreas Gruenbacher
  2021-07-24 19:34 ` [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper Andreas Gruenbacher
                   ` (7 more replies)
  0 siblings, 8 replies; 19+ messages in thread
From: Andreas Gruenbacher @ 2021-07-24 19:34 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher

Hi Linus et al.,

here's another update of this patch queue:

 * Finally fix the typos Linus has pointed out twice already.

 * Turn the previous fault_in_iov_iter helper that was used for reads
   and writes into iov_iter_fault_in_writeable per Al's suggestion.
   Use the existing iov_iter_fault_in_readable for writes.

 * Add a done_before argument and an IOMAP_DIO_FAULT_RETRY flag to
   iomap_dio_rw to allow iomap_dio_rw to return partial results and
   resume with the rest of a request.  This allows iomap_dio_rw to be
   used with page faults disabled without having to repeat any I/O.

 * Adjust the gfs2 patches accordingly.

With that, the two iov_ter patches and the three iomap patches should
hopefully be ready for mainline.

There's one remaining issue on the gfs2 side: during read requests, when
a writer now comes in in the middle of a read request, the read request
can currently return a result that never existed on disk.  So we need
to ensure that we only resume read requests when we know that no writer
got in the way, and retry the entire request otherwise.  It should be
relatively easy to add a mechanism to detect when a glock is "lost";
this won't affect the vfs or iomap patches.

Thanks a lot,
Andreas

Andreas Gruenbacher (8):
  iov_iter: Introduce iov_iter_fault_in_writeable helper
  gfs2: Add wrapper for iomap_file_buffered_write
  gfs2: Fix mmap + page fault deadlocks for buffered I/O
  iomap: Fix iomap_dio_rw return value for user copies
  iomap: Add done_before argument to iomap_dio_rw
  iomap: Support restarting direct I/O requests after user copy failures
  iov_iter: Introduce noio flag to disable page faults
  gfs2: Fix mmap + page fault deadlocks for direct I/O

 fs/btrfs/file.c       |  5 ++-
 fs/ext4/file.c        |  5 ++-
 fs/gfs2/file.c        | 95 +++++++++++++++++++++++++++++++++++++++----
 fs/iomap/direct-io.c  | 29 ++++++++++---
 fs/xfs/xfs_file.c     |  6 +--
 fs/zonefs/super.c     |  4 +-
 include/linux/iomap.h | 11 ++++-
 include/linux/mm.h    |  3 ++
 include/linux/uio.h   |  2 +
 lib/iov_iter.c        | 60 ++++++++++++++++++++++++---
 mm/gup.c              | 57 ++++++++++++++++++++++++++
 11 files changed, 246 insertions(+), 31 deletions(-)

-- 
2.26.3


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper
  2021-07-24 19:34 [PATCH v4 0/8] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
@ 2021-07-24 19:34 ` Andreas Gruenbacher
  2021-07-24 19:52   ` Linus Torvalds
  2021-07-24 19:34 ` [PATCH v4 2/8] gfs2: Add wrapper for iomap_file_buffered_write Andreas Gruenbacher
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 19+ messages in thread
From: Andreas Gruenbacher @ 2021-07-24 19:34 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher

Introduce a new iov_iter_fault_in_writeable helper for faulting in an iterator
for writing.  The pages are faulted in manually, without writing to them, so
this function is non-destructive.

We'll use iov_iter_fault_in_writeable in gfs2 once we've determined that part
of the iterator isn't in memory.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 include/linux/mm.h  |  3 +++
 include/linux/uio.h |  1 +
 lib/iov_iter.c      | 40 +++++++++++++++++++++++++++++++
 mm/gup.c            | 57 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 101 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7ca22e6e694a..14b1353995e2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1840,6 +1840,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
 int pin_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages);
 
+unsigned long fault_in_user_pages(unsigned long start, unsigned long len,
+				  bool write);
+
 int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
 int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
 			struct task_struct *task, bool bypass_rlim);
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 82c3c3e819e0..8e469b8b862f 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -120,6 +120,7 @@ size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
 void iov_iter_advance(struct iov_iter *i, size_t bytes);
 void iov_iter_revert(struct iov_iter *i, size_t bytes);
 int iov_iter_fault_in_readable(const struct iov_iter *i, size_t bytes);
+int iov_iter_fault_in_writeable(const struct iov_iter *i, size_t bytes);
 size_t iov_iter_single_seg_count(const struct iov_iter *i);
 size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 20dc3d800573..ccf1ee8d4edf 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -460,6 +460,46 @@ int iov_iter_fault_in_readable(const struct iov_iter *i, size_t bytes)
 }
 EXPORT_SYMBOL(iov_iter_fault_in_readable);
 
+/**
+ * iov_iter_fault_in_writeable - fault in an iov iterator for writing
+ * @i: iterator
+ * @bytes: maximum length
+ *
+ * Faults in part or all of the iterator.  This is primarily useful when we
+ * already know that some or all of the pages in @i aren't in memory.
+ *
+ * This function uses fault_in_user_pages() to fault in the pages, which
+ * internally uses get_user_pages(), so it doesn't trigger hardware page
+ * faults.  Unlike fault_in_pages_writeable() which writes to the memory to
+ * fault it in, this function is non-destructive.
+ *
+ * Returns 0 on success, and a non-zero error code if the memory could not be
+ * accessed (i.e. because it is an invalid address).
+ */
+int iov_iter_fault_in_writeable(const struct iov_iter *i, size_t bytes)
+{
+	if (iter_is_iovec(i)) {
+		const struct iovec *p;
+		size_t skip;
+
+		if (bytes > i->count)
+			bytes = i->count;
+		for (p = i->iov, skip = i->iov_offset; bytes; p++, skip = 0) {
+			unsigned long len = min(bytes, p->iov_len - skip);
+			unsigned long start;
+
+			if (unlikely(!len))
+				continue;
+			start = (unsigned long)p->iov_base + skip;
+			if (fault_in_user_pages(start, len, true) != len)
+				return -EFAULT;
+			bytes -= len;
+		}
+	}
+	return 0;
+}
+EXPORT_SYMBOL(iov_iter_fault_in_writeable);
+
 void iov_iter_init(struct iov_iter *i, unsigned int direction,
 			const struct iovec *iov, unsigned long nr_segs,
 			size_t count)
diff --git a/mm/gup.c b/mm/gup.c
index 42b8b1fa6521..784809c232f1 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1669,6 +1669,63 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
 }
 #endif /* !CONFIG_MMU */
 
+/**
+ * fault_in_user_pages - fault in an address range for reading / writing
+ * @start: start of address range
+ * @len: length of address range
+ * @write: fault in for writing
+ *
+ * Note that we don't pin or otherwise hold the pages referenced that we fault
+ * in.  There's no guarantee that they'll stay in memory for any duration of
+ * time.
+ *
+ * Returns the number of bytes faulted in from @start.
+ */
+unsigned long fault_in_user_pages(unsigned long start, unsigned long len,
+				  bool write)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma = NULL;
+	unsigned long end, nstart, nend;
+	int locked = 0;
+	int gup_flags;
+
+	gup_flags = FOLL_TOUCH | FOLL_POPULATE;
+	if (write)
+		gup_flags |= FOLL_WRITE;
+
+	end = PAGE_ALIGN(start + len);
+	for (nstart = start & PAGE_MASK; nstart < end; nstart = nend) {
+		unsigned long nr_pages;
+		long ret;
+
+		if (!locked) {
+			locked = 1;
+			mmap_read_lock(mm);
+			vma = find_vma(mm, nstart);
+		} else if (nstart >= vma->vm_end)
+			vma = vma->vm_next;
+		if (!vma || vma->vm_start >= end)
+			break;
+		nend = min(end, vma->vm_end);
+		if (vma->vm_flags & (VM_IO | VM_PFNMAP))
+			continue;
+		if (nstart < vma->vm_start)
+			nstart = vma->vm_start;
+		nr_pages = (nend - nstart) / PAGE_SIZE;
+		ret = __get_user_pages_locked(mm, nstart, nr_pages,
+					      NULL, NULL, &locked, gup_flags);
+		if (ret <= 0)
+			break;
+		nend = nstart + ret * PAGE_SIZE;
+	}
+	if (locked)
+		mmap_read_unlock(mm);
+	if (nstart > start)
+		return min(nstart - start, len);
+	return 0;
+}
+
 /**
  * get_dump_page() - pin user page in memory while writing it to core dump
  * @addr: user address
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v4 2/8] gfs2: Add wrapper for iomap_file_buffered_write
  2021-07-24 19:34 [PATCH v4 0/8] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
  2021-07-24 19:34 ` [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper Andreas Gruenbacher
@ 2021-07-24 19:34 ` Andreas Gruenbacher
  2021-07-24 19:34 ` [PATCH v4 3/8] gfs2: Fix mmap + page fault deadlocks for buffered I/O Andreas Gruenbacher
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Andreas Gruenbacher @ 2021-07-24 19:34 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher

Add a wrapper around iomap_file_buffered_write.  We'll add code for when
the operation needs to be retried here later.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/file.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 84ec053d43b4..55ec1cadc9e6 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -876,6 +876,18 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return written ? written : ret;
 }
 
+static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct file *file = iocb->ki_filp;
+	struct inode *inode = file_inode(file);
+	ssize_t ret;
+
+	current->backing_dev_info = inode_to_bdi(inode);
+	ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
+	current->backing_dev_info = NULL;
+	return ret;
+}
+
 /**
  * gfs2_file_write_iter - Perform a write to a file
  * @iocb: The io context
@@ -927,9 +939,7 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			goto out_unlock;
 
 		iocb->ki_flags |= IOCB_DSYNC;
-		current->backing_dev_info = inode_to_bdi(inode);
-		buffered = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
-		current->backing_dev_info = NULL;
+		buffered = gfs2_file_buffered_write(iocb, from);
 		if (unlikely(buffered <= 0)) {
 			if (!ret)
 				ret = buffered;
@@ -951,9 +961,7 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		if (!ret || ret2 > 0)
 			ret += ret2;
 	} else {
-		current->backing_dev_info = inode_to_bdi(inode);
-		ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
-		current->backing_dev_info = NULL;
+		ret = gfs2_file_buffered_write(iocb, from);
 		if (likely(ret > 0)) {
 			iocb->ki_pos += ret;
 			ret = generic_write_sync(iocb, ret);
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v4 3/8] gfs2: Fix mmap + page fault deadlocks for buffered I/O
  2021-07-24 19:34 [PATCH v4 0/8] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
  2021-07-24 19:34 ` [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper Andreas Gruenbacher
  2021-07-24 19:34 ` [PATCH v4 2/8] gfs2: Add wrapper for iomap_file_buffered_write Andreas Gruenbacher
@ 2021-07-24 19:34 ` Andreas Gruenbacher
  2021-07-24 19:34 ` [PATCH v4 4/8] iomap: Fix iomap_dio_rw return value for user copies Andreas Gruenbacher
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Andreas Gruenbacher @ 2021-07-24 19:34 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher

In the .read_iter and .write_iter file operations, we're accessing
user-space memory while holding the inodes glock.  There's a possibility
that the memory is mapped to the same file, in which case we'd recurse on
the same glock.

More complex scenarios can involve multiple glocks, processes, and even cluster
nodes.

Avoids these kinds of problems by disabling page faults while holding a glock.
If a page fault occurs, we either end up with a partial read or write, or with
-EFAULT if nothing could be read or written.  In that case, we drop the glock,
fault in the requested pages manually, and repeat the operation.

This locking problem in gfs2 was originally reported by Jan Kara.  Linus came
up with the proposal to disable page faults.  Many thanks to Al Viro and
Matthew Wilcox for their feedback as well.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/file.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 55ec1cadc9e6..3aa66d4de383 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -843,6 +843,12 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	size_t written = 0;
 	ssize_t ret;
 
+	/*
+	 * In this function, we disable page faults when we're holding the
+	 * inode glock while doing I/O.  If a page fault occurs, we drop the
+	 * inode glock, fault in the pages manually, and then we retry.
+	 */
+
 	if (iocb->ki_flags & IOCB_DIRECT) {
 		ret = gfs2_file_direct_read(iocb, to, &gh);
 		if (likely(ret != -ENOTBLK))
@@ -864,13 +870,20 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	}
 	ip = GFS2_I(iocb->ki_filp->f_mapping->host);
 	gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
+retry:
 	ret = gfs2_glock_nq(&gh);
 	if (ret)
 		goto out_uninit;
+	pagefault_disable();
 	ret = generic_file_read_iter(iocb, to);
+	pagefault_enable();
 	if (ret > 0)
 		written += ret;
 	gfs2_glock_dq(&gh);
+	if (unlikely(iov_iter_count(to) && (ret > 0 || ret == -EFAULT)) &&
+	    iter_is_iovec(to) &&
+	    iov_iter_fault_in_writeable(to, SIZE_MAX) == 0)
+		goto retry;
 out_uninit:
 	gfs2_holder_uninit(&gh);
 	return written ? written : ret;
@@ -882,9 +895,22 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *fro
 	struct inode *inode = file_inode(file);
 	ssize_t ret;
 
+	/*
+	 * In this function, we disable page faults when we're holding the
+	 * inode glock while doing I/O.  If a page fault occurs, we drop the
+	 * inode glock, fault in the pages manually, and then we retry.
+	 */
+
+retry:
 	current->backing_dev_info = inode_to_bdi(inode);
+	pagefault_disable();
 	ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
+	pagefault_enable();
 	current->backing_dev_info = NULL;
+	if (unlikely(ret == -EFAULT) &&
+	    iter_is_iovec(from) &&
+	    iov_iter_fault_in_readable(from, SIZE_MAX) == 0)
+		goto retry;
 	return ret;
 }
 
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v4 4/8] iomap: Fix iomap_dio_rw return value for user copies
  2021-07-24 19:34 [PATCH v4 0/8] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (2 preceding siblings ...)
  2021-07-24 19:34 ` [PATCH v4 3/8] gfs2: Fix mmap + page fault deadlocks for buffered I/O Andreas Gruenbacher
@ 2021-07-24 19:34 ` Andreas Gruenbacher
  2021-07-24 19:34 ` [PATCH v4 5/8] iomap: Add done_before argument to iomap_dio_rw Andreas Gruenbacher
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Andreas Gruenbacher @ 2021-07-24 19:34 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher

When a user copy fails in one of the helpers of iomap_dio_rw, fail with -EFAULT
instead of returning 0.  This matches what iomap_dio_bio_actor already returns
when it gets an -EFAULT from bio_iov_iter_get_pages.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/iomap/direct-io.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 9398b8c31323..cc0b4bc8861b 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -370,7 +370,7 @@ iomap_dio_hole_actor(loff_t length, struct iomap_dio *dio)
 {
 	length = iov_iter_zero(length, dio->submit.iter);
 	dio->size += length;
-	return length;
+	return length ?: -EFAULT;
 }
 
 static loff_t
@@ -397,7 +397,7 @@ iomap_dio_inline_actor(struct inode *inode, loff_t pos, loff_t length,
 		copied = copy_to_iter(iomap->inline_data + pos, length, iter);
 	}
 	dio->size += copied;
-	return copied;
+	return copied ?: -EFAULT;
 }
 
 static loff_t
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v4 5/8] iomap: Add done_before argument to iomap_dio_rw
  2021-07-24 19:34 [PATCH v4 0/8] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (3 preceding siblings ...)
  2021-07-24 19:34 ` [PATCH v4 4/8] iomap: Fix iomap_dio_rw return value for user copies Andreas Gruenbacher
@ 2021-07-24 19:34 ` Andreas Gruenbacher
  2021-07-24 19:34 ` [PATCH v4 6/8] iomap: Support restarting direct I/O requests after user copy failures Andreas Gruenbacher
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Andreas Gruenbacher @ 2021-07-24 19:34 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher

Add a done_before argument to iomap_dio_rw that indicates how much of the
request has already been transferred.  When the request succeeds, we report
that done_before additional bytes were tranferred.  This is useful for
finishing a request asynchronously when part of the request has already been
completed synchronously.

We'll use that to allow iomap_dio_rw to be used with page faults disabled: when
a page fault occurs while submitting a request, we synchronously complete the
part of the request that has already been submitted.  The caller can then take
care of the page fault and call iomap_dio_rw again for the rest of the request,
passing in the number of bytes already tranferred.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/btrfs/file.c       |  5 +++--
 fs/ext4/file.c        |  5 +++--
 fs/gfs2/file.c        |  4 ++--
 fs/iomap/direct-io.c  | 12 ++++++++----
 fs/xfs/xfs_file.c     |  6 +++---
 fs/zonefs/super.c     |  4 ++--
 include/linux/iomap.h |  4 ++--
 7 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index ee34497500e1..ea13d88e8c14 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1945,7 +1945,7 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	}
 
 	dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
-			     0);
+			     0, 0);
 
 	btrfs_inode_unlock(inode, ilock_flags);
 
@@ -3637,7 +3637,8 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
 		return 0;
 
 	btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
-	ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops, 0);
+	ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
+			   0, 0);
 	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
 	return ret;
 }
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 816dedcbd541..4a5e7fd31fb5 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -74,7 +74,7 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		return generic_file_read_iter(iocb, to);
 	}
 
-	ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0);
+	ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0, 0);
 	inode_unlock_shared(inode);
 
 	file_accessed(iocb->ki_filp);
@@ -566,7 +566,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (ilock_shared)
 		iomap_ops = &ext4_iomap_overwrite_ops;
 	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
-			   (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0);
+			   (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0,
+			   0);
 	if (ret == -ENOTBLK)
 		ret = 0;
 
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 3aa66d4de383..eea42cc94585 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -792,7 +792,7 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
 	if (ret)
 		goto out_uninit;
 
-	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0);
+	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0, 0);
 	gfs2_glock_dq(gh);
 out_uninit:
 	gfs2_holder_uninit(gh);
@@ -826,7 +826,7 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 	if (offset + len > i_size_read(&ip->i_inode))
 		goto out;
 
-	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0);
+	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0, 0);
 	if (ret == -ENOTBLK)
 		ret = 0;
 out:
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index cc0b4bc8861b..51831ce93f6e 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -31,6 +31,7 @@ struct iomap_dio {
 	atomic_t		ref;
 	unsigned		flags;
 	int			error;
+	size_t			done_before;
 	bool			wait_for_completion;
 
 	union {
@@ -128,7 +129,9 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
 
 	kfree(dio);
 
-	return ret;
+	if (ret < 0)
+		return ret;
+	return dio->done_before + ret;
 }
 EXPORT_SYMBOL_GPL(iomap_dio_complete);
 
@@ -450,7 +453,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 struct iomap_dio *
 __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
-		unsigned int dio_flags)
+		unsigned int dio_flags, size_t done_before)
 {
 	struct address_space *mapping = iocb->ki_filp->f_mapping;
 	struct inode *inode = file_inode(iocb->ki_filp);
@@ -477,6 +480,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	dio->dops = dops;
 	dio->error = 0;
 	dio->flags = 0;
+	dio->done_before = done_before;
 
 	dio->submit.iter = iter;
 	dio->submit.waiter = current;
@@ -642,11 +646,11 @@ EXPORT_SYMBOL_GPL(__iomap_dio_rw);
 ssize_t
 iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
-		unsigned int dio_flags)
+		unsigned int dio_flags, size_t done_before)
 {
 	struct iomap_dio *dio;
 
-	dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags);
+	dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags, done_before);
 	if (IS_ERR_OR_NULL(dio))
 		return PTR_ERR_OR_ZERO(dio);
 	return iomap_dio_complete(dio);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index cc3cfb12df53..3103d9bda466 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -259,7 +259,7 @@ xfs_file_dio_read(
 	ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
 	if (ret)
 		return ret;
-	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0);
+	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0, 0);
 	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
 
 	return ret;
@@ -569,7 +569,7 @@ xfs_file_dio_write_aligned(
 	}
 	trace_xfs_file_direct_write(iocb, from);
 	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
-			   &xfs_dio_write_ops, 0);
+			   &xfs_dio_write_ops, 0, 0);
 out_unlock:
 	if (iolock)
 		xfs_iunlock(ip, iolock);
@@ -647,7 +647,7 @@ xfs_file_dio_write_unaligned(
 
 	trace_xfs_file_direct_write(iocb, from);
 	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
-			   &xfs_dio_write_ops, flags);
+			   &xfs_dio_write_ops, flags, 0);
 
 	/*
 	 * Retry unaligned I/O with exclusive blocking semantics if the DIO
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 70055d486bf7..85ca2f5fe06e 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -864,7 +864,7 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
 		ret = zonefs_file_dio_append(iocb, from);
 	else
 		ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops,
-				   &zonefs_write_dio_ops, 0);
+				   &zonefs_write_dio_ops, 0, 0);
 	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
 	    (ret > 0 || ret == -EIOCBQUEUED)) {
 		if (ret > 0)
@@ -999,7 +999,7 @@ static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		}
 		file_accessed(iocb->ki_filp);
 		ret = iomap_dio_rw(iocb, to, &zonefs_iomap_ops,
-				   &zonefs_read_dio_ops, 0);
+				   &zonefs_read_dio_ops, 0, 0);
 	} else {
 		ret = generic_file_read_iter(iocb, to);
 		if (ret == -EIO)
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 479c1da3e221..8aea35f1a003 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -269,10 +269,10 @@ struct iomap_dio_ops {
 
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
-		unsigned int dio_flags);
+		unsigned int dio_flags, size_t done_before);
 struct iomap_dio *__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
-		unsigned int dio_flags);
+		unsigned int dio_flags, size_t done_before);
 ssize_t iomap_dio_complete(struct iomap_dio *dio);
 int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);
 
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v4 6/8] iomap: Support restarting direct I/O requests after user copy failures
  2021-07-24 19:34 [PATCH v4 0/8] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (4 preceding siblings ...)
  2021-07-24 19:34 ` [PATCH v4 5/8] iomap: Add done_before argument to iomap_dio_rw Andreas Gruenbacher
@ 2021-07-24 19:34 ` Andreas Gruenbacher
  2021-07-24 19:34 ` [PATCH v4 7/8] iov_iter: Introduce noio flag to disable page faults Andreas Gruenbacher
  2021-07-24 19:34 ` [PATCH v4 8/8] gfs2: Fix mmap + page fault deadlocks for direct I/O Andreas Gruenbacher
  7 siblings, 0 replies; 19+ messages in thread
From: Andreas Gruenbacher @ 2021-07-24 19:34 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher

In iomap_dio_rw, when iomap_apply returns an -EFAULT error, complete the
request synchronously.  Either return a partial result (when the
IOMAP_DIO_FAULT_RETRY flag is set and the caller is thus prepared to handle
partial results), or reset the iterator to the start to allow a restart.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/iomap/direct-io.c  | 13 +++++++++++++
 include/linux/iomap.h |  7 +++++++
 2 files changed, 20 insertions(+)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 51831ce93f6e..1ba825bab08b 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -565,6 +565,19 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		ret = iomap_apply(inode, pos, count, iomap_flags, ops, dio,
 				iomap_dio_actor);
 		if (ret <= 0) {
+			if (ret == -EFAULT) {
+				/*
+				 * Finish synchronously and revert the iterator
+				 * when failing the request to allow a retry.
+				 */
+				wait_for_completion = true;
+				if (dio->size &&
+				    (dio_flags & IOMAP_DIO_FAULT_RETRY))
+					ret = 0;
+				else
+					iov_iter_revert(iter, dio->size);
+			}
+
 			/* magic error code to fall back to buffered I/O */
 			if (ret == -ENOTBLK) {
 				wait_for_completion = true;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 8aea35f1a003..9dbba9f7c945 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -267,6 +267,13 @@ struct iomap_dio_ops {
   */
 #define IOMAP_DIO_OVERWRITE_ONLY	(1 << 1)
 
+/*
+ * When a page fault occurs, return a partial synchronous result and allow
+ * the caller to retry the rest of the operation after dealing with the page
+ * fault.
+ */
+#define IOMAP_DIO_FAULT_RETRY		(1 << 2)
+
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
 		unsigned int dio_flags, size_t done_before);
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v4 7/8] iov_iter: Introduce noio flag to disable page faults
  2021-07-24 19:34 [PATCH v4 0/8] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (5 preceding siblings ...)
  2021-07-24 19:34 ` [PATCH v4 6/8] iomap: Support restarting direct I/O requests after user copy failures Andreas Gruenbacher
@ 2021-07-24 19:34 ` Andreas Gruenbacher
  2021-07-24 19:34 ` [PATCH v4 8/8] gfs2: Fix mmap + page fault deadlocks for direct I/O Andreas Gruenbacher
  7 siblings, 0 replies; 19+ messages in thread
From: Andreas Gruenbacher @ 2021-07-24 19:34 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher

Introduce a new noio flag to indicate to get_user_pages to use the
FOLL_FAST_ONLY flag.  This will cause get_user_pages to fail when it
would otherwise fault in a page.

Currently, the noio flag is only checked in iov_iter_get_pages and
iov_iter_get_pages_alloc.  This is enough for iomaop_dio_rw, but it
may make sense to check for this flag in other contexts as well.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 include/linux/uio.h |  1 +
 lib/iov_iter.c      | 20 +++++++++++++++-----
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 8e469b8b862f..83cc33465295 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -29,6 +29,7 @@ enum iter_type {
 
 struct iov_iter {
 	u8 iter_type;
+	bool noio;
 	bool data_source;
 	size_t iov_offset;
 	size_t count;
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index ccf1ee8d4edf..f0fa1b89fe9b 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -507,6 +507,7 @@ void iov_iter_init(struct iov_iter *i, unsigned int direction,
 	WARN_ON(direction & ~(READ | WRITE));
 	*i = (struct iov_iter) {
 		.iter_type = ITER_IOVEC,
+		.noio = false,
 		.data_source = direction,
 		.iov = iov,
 		.nr_segs = nr_segs,
@@ -1517,13 +1518,17 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 		return 0;
 
 	if (likely(iter_is_iovec(i))) {
+		unsigned int gup_flags = 0;
 		unsigned long addr;
 
+		if (iov_iter_rw(i) != WRITE)
+			gup_flags |= FOLL_WRITE;
+		if (i->noio)
+			gup_flags |= FOLL_FAST_ONLY;
+
 		addr = first_iovec_segment(i, &len, start, maxsize, maxpages);
 		n = DIV_ROUND_UP(len, PAGE_SIZE);
-		res = get_user_pages_fast(addr, n,
-				iov_iter_rw(i) != WRITE ?  FOLL_WRITE : 0,
-				pages);
+		res = get_user_pages_fast(addr, n, gup_flags, pages);
 		if (unlikely(res <= 0))
 			return res;
 		return (res == n ? len : res * PAGE_SIZE) - *start;
@@ -1639,15 +1644,20 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		return 0;
 
 	if (likely(iter_is_iovec(i))) {
+		unsigned int gup_flags = 0;
 		unsigned long addr;
 
+		if (iov_iter_rw(i) != WRITE)
+			gup_flags |= FOLL_WRITE;
+		if (i->noio)
+			gup_flags |= FOLL_FAST_ONLY;
+
 		addr = first_iovec_segment(i, &len, start, maxsize, ~0U);
 		n = DIV_ROUND_UP(len, PAGE_SIZE);
 		p = get_pages_array(n);
 		if (!p)
 			return -ENOMEM;
-		res = get_user_pages_fast(addr, n,
-				iov_iter_rw(i) != WRITE ?  FOLL_WRITE : 0, p);
+		res = get_user_pages_fast(addr, n, gup_flags, p);
 		if (unlikely(res <= 0)) {
 			kvfree(p);
 			*pages = ZERO_SIZE_PTR;
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v4 8/8] gfs2: Fix mmap + page fault deadlocks for direct I/O
  2021-07-24 19:34 [PATCH v4 0/8] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
                   ` (6 preceding siblings ...)
  2021-07-24 19:34 ` [PATCH v4 7/8] iov_iter: Introduce noio flag to disable page faults Andreas Gruenbacher
@ 2021-07-24 19:34 ` Andreas Gruenbacher
  7 siblings, 0 replies; 19+ messages in thread
From: Andreas Gruenbacher @ 2021-07-24 19:34 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro, Christoph Hellwig, Darrick J. Wong
  Cc: Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	linux-kernel, ocfs2-devel, Andreas Gruenbacher

Also disable page faults during direct I/O requests and implement the same
kind of retry logic as in the buffered I/O case.

Direct I/O requests differ from buffered I/O requests in that they use
bio_iov_iter_get_pages for grabbing page references and faulting in pages
instead of triggering physical page faults.  Those manual page faults can be
disabled with the new iocb->noio flag.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/file.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index eea42cc94585..fbdee282185f 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -782,21 +782,47 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
 	struct file *file = iocb->ki_filp;
 	struct gfs2_inode *ip = GFS2_I(file->f_mapping->host);
 	size_t count = iov_iter_count(to);
+	size_t written = 0;
 	ssize_t ret;
 
+	/*
+	 * In this function, we disable page faults when we're holding the
+	 * inode glock while doing I/O.  If a page fault occurs, we drop the
+	 * inode glock, fault in the pages manually, and then we retry.
+	 *
+	 * Unlike generic_file_read_iter, for reads, iomap_dio_rw can trigger
+	 * physical as well as manual page faults, and we need to disable both
+	 * kinds.
+	 */
+
 	if (!count)
 		return 0; /* skip atime */
 
 	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh);
+retry:
 	ret = gfs2_glock_nq(gh);
 	if (ret)
 		goto out_uninit;
 
-	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0, 0);
+	pagefault_disable();
+	to->noio = true;
+	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL,
+			   IOMAP_DIO_FAULT_RETRY, written);
+	to->noio = false;
+	pagefault_enable();
+
 	gfs2_glock_dq(gh);
+	if (ret > 0)
+		written = ret;
+	if (unlikely(iov_iter_count(to) && (ret > 0 || ret == -EFAULT)) &&
+	    iter_is_iovec(to) &&
+	    iov_iter_fault_in_writeable(to, SIZE_MAX) == 0)
+		goto retry;
 out_uninit:
 	gfs2_holder_uninit(gh);
-	return ret;
+	if (ret < 0)
+		return ret;
+	return written;
 }
 
 static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
@@ -809,6 +835,15 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 	loff_t offset = iocb->ki_pos;
 	ssize_t ret;
 
+	/*
+	 * In this function, we disable page faults when we're holding the
+	 * inode glock while doing I/O.  If a page fault occurs, we drop the
+	 * inode glock, fault in the pages manually, and then we retry.
+	 *
+	 * For writes, iomap_dio_rw only triggers manual page faults, so we
+	 * don't need to disable physical ones.
+	 */
+
 	/*
 	 * Deferred lock, even if its a write, since we do no allocation on
 	 * this path. All we need to change is the atime, and this lock mode
@@ -818,6 +853,7 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 	 * VFS does.
 	 */
 	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh);
+retry:
 	ret = gfs2_glock_nq(gh);
 	if (ret)
 		goto out_uninit;
@@ -826,11 +862,18 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 	if (offset + len > i_size_read(&ip->i_inode))
 		goto out;
 
+	from->noio = true;
 	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0, 0);
+	from->noio = false;
+
 	if (ret == -ENOTBLK)
 		ret = 0;
 out:
 	gfs2_glock_dq(gh);
+	if (unlikely(ret == -EFAULT) &&
+	    iter_is_iovec(from) &&
+	    iov_iter_fault_in_readable(from, SIZE_MAX) == 0)
+		goto retry;
 out_uninit:
 	gfs2_holder_uninit(gh);
 	return ret;
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper
  2021-07-24 19:34 ` [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper Andreas Gruenbacher
@ 2021-07-24 19:52   ` Linus Torvalds
  2021-07-24 20:24     ` Al Viro
  2021-07-27  9:30     ` David Laight
  0 siblings, 2 replies; 19+ messages in thread
From: Linus Torvalds @ 2021-07-24 19:52 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Alexander Viro, Christoph Hellwig, Darrick J. Wong, Jan Kara,
	Matthew Wilcox, cluster-devel, linux-fsdevel,
	Linux Kernel Mailing List, ocfs2-devel

On Sat, Jul 24, 2021 at 12:35 PM Andreas Gruenbacher
<agruenba@redhat.com> wrote:
>
> +int iov_iter_fault_in_writeable(const struct iov_iter *i, size_t bytes)
> +{
...
> +                       if (fault_in_user_pages(start, len, true) != len)
> +                               return -EFAULT;

Looking at this once more, I think this is likely wrong.

Why?

Because any user can/should only care about at least *part* of the
area being writable.

Imagine that you're doing a large read. If the *first* page is
writable, you should still return the partial read, not -EFAULT.

So I think the code needs to return 0 if _any_ fault was successful.
Or perhaps return how much it was able to fault in. Because returning
-EFAULT if any of it failed seems wrong, and doesn't allow for partial
success being reported.

The other reaction I have is that you now only do the
iov_iter_fault_in_writeable, but then you make fault_in_user_pages()
still have that "bool write" argument.

We already have 'fault_in_pages_readable()', and that one is more
efficient (well, at least if the fault isn't needed it is). So it
would make more sense to just implement fault_in_pages_writable()
instead of that "fault_in_user_pages(, bool write)".

                 Linus

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper
  2021-07-24 19:52   ` Linus Torvalds
@ 2021-07-24 20:24     ` Al Viro
  2021-07-24 20:37       ` Linus Torvalds
  2021-07-24 21:38       ` Andreas Gruenbacher
  2021-07-27  9:30     ` David Laight
  1 sibling, 2 replies; 19+ messages in thread
From: Al Viro @ 2021-07-24 20:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Gruenbacher, Christoph Hellwig, Darrick J. Wong,
	Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	Linux Kernel Mailing List, ocfs2-devel

On Sat, Jul 24, 2021 at 12:52:34PM -0700, Linus Torvalds wrote:
> ...
> > +                       if (fault_in_user_pages(start, len, true) != len)
> > +                               return -EFAULT;
> 
> Looking at this once more, I think this is likely wrong.
> 
> Why?
> 
> Because any user can/should only care about at least *part* of the
> area being writable.
> 
> Imagine that you're doing a large read. If the *first* page is
> writable, you should still return the partial read, not -EFAULT.

Agreed.

> So I think the code needs to return 0 if _any_ fault was successful.

s/any/the first/...

The same goes for fault-in for read, of course; I've a half-baked conversion
to such semantics (-EFAULT vs. 0; precise length is unreliable anyway,
especially if you have sub-page failure areas), need to finish and post
it...

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper
  2021-07-24 20:24     ` Al Viro
@ 2021-07-24 20:37       ` Linus Torvalds
  2021-07-24 21:38       ` Andreas Gruenbacher
  1 sibling, 0 replies; 19+ messages in thread
From: Linus Torvalds @ 2021-07-24 20:37 UTC (permalink / raw)
  To: Al Viro
  Cc: Andreas Gruenbacher, Christoph Hellwig, Darrick J. Wong,
	Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	Linux Kernel Mailing List, ocfs2-devel

On Sat, Jul 24, 2021 at 1:26 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Sat, Jul 24, 2021 at 12:52:34PM -0700, Linus Torvalds wrote:
>
> > So I think the code needs to return 0 if _any_ fault was successful.
>
> s/any/the first/...

Yes, but as long as we do them in order, and stop when it fails, "any"
and "first" end up being the same thing ;)

> The same goes for fault-in for read

Yeah. That said, a partial write() (ie "read from user space") might
be something that a filesystem is not willing to touch for other
reasons, so I think returning -EFAULT in that case is, I think,
slightly more reasonable.

Things like "I have to prepare buffers to be filled" etc.

The partial read() case is at least something that a filesystem should
be able to handle fairly easily.

And I don't think returning -EFAULT early is necessarily wrong - we
obviously do it anyway if you give really invalid addresses.

But we have generally strived to allow partial IO for missing pages,
because people sometimes play games with unmapping things dynamically
or using mprotect() to catch modifications (ie doing things like catch
SIGSEGV/SIGBUS, and remap it).

But those kinds of uses tend to have to catch -EFAULT anyway, so I
guess it's not a big deal either way.

           Linus

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper
  2021-07-24 20:24     ` Al Viro
  2021-07-24 20:37       ` Linus Torvalds
@ 2021-07-24 21:38       ` Andreas Gruenbacher
  2021-07-24 21:57         ` Al Viro
  1 sibling, 1 reply; 19+ messages in thread
From: Andreas Gruenbacher @ 2021-07-24 21:38 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Christoph Hellwig, Darrick J. Wong, Jan Kara,
	Matthew Wilcox, cluster-devel, linux-fsdevel,
	Linux Kernel Mailing List, ocfs2-devel

On Sat, Jul 24, 2021 at 10:24 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sat, Jul 24, 2021 at 12:52:34PM -0700, Linus Torvalds wrote:
> > ...
> > > +                       if (fault_in_user_pages(start, len, true) != len)
> > > +                               return -EFAULT;
> >
> > Looking at this once more, I think this is likely wrong.
> >
> > Why?
> >
> > Because any user can/should only care about at least *part* of the
> > area being writable.
> >
> > Imagine that you're doing a large read. If the *first* page is
> > writable, you should still return the partial read, not -EFAULT.
>
> Agreed.
>
> > So I think the code needs to return 0 if _any_ fault was successful.
>
> s/any/the first/...
>
> The same goes for fault-in for read, of course; I've a half-baked conversion
> to such semantics (-EFAULT vs. 0; precise length is unreliable anyway,
> especially if you have sub-page failure areas), need to finish and post
> it...

Hmm, how could we have sub-page failure areas when this is about if
and how pages are mapped? If we return the number of bytes that are
accessible, then users will know if they got nothing, something, or
everything, and they can act accordingly.

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper
  2021-07-24 21:38       ` Andreas Gruenbacher
@ 2021-07-24 21:57         ` Al Viro
  2021-07-24 22:06           ` Andreas Gruenbacher
  0 siblings, 1 reply; 19+ messages in thread
From: Al Viro @ 2021-07-24 21:57 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Linus Torvalds, Christoph Hellwig, Darrick J. Wong, Jan Kara,
	Matthew Wilcox, cluster-devel, linux-fsdevel,
	Linux Kernel Mailing List, ocfs2-devel

On Sat, Jul 24, 2021 at 11:38:20PM +0200, Andreas Gruenbacher wrote:

> Hmm, how could we have sub-page failure areas when this is about if
> and how pages are mapped? If we return the number of bytes that are
> accessible, then users will know if they got nothing, something, or
> everything, and they can act accordingly.

What I'm saying is that in situation when you have cacheline-sized
poisoned areas, there's no way to get an accurate count of readable
area other than try and copy it out.

What's more, "something" is essentially useless information - the
pages might get unmapped right as your function returns; the caller
still needs to deal with partial copies.  And that's a slow path
by definition, so informing them of a partial fault-in is not
going to be useful.

As far as callers are concerned, it's "nothing suitable in the
beginning of the area" vs. "something might be accessible".

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper
  2021-07-24 21:57         ` Al Viro
@ 2021-07-24 22:06           ` Andreas Gruenbacher
  2021-07-24 23:39             ` Al Viro
  0 siblings, 1 reply; 19+ messages in thread
From: Andreas Gruenbacher @ 2021-07-24 22:06 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Christoph Hellwig, Darrick J. Wong, Jan Kara,
	Matthew Wilcox, cluster-devel, linux-fsdevel,
	Linux Kernel Mailing List, ocfs2-devel

On Sat, Jul 24, 2021 at 11:57 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sat, Jul 24, 2021 at 11:38:20PM +0200, Andreas Gruenbacher wrote:
>
> > Hmm, how could we have sub-page failure areas when this is about if
> > and how pages are mapped? If we return the number of bytes that are
> > accessible, then users will know if they got nothing, something, or
> > everything, and they can act accordingly.
>
> What I'm saying is that in situation when you have cacheline-sized
> poisoned areas, there's no way to get an accurate count of readable
> area other than try and copy it out.
>
> What's more, "something" is essentially useless information - the
> pages might get unmapped right as your function returns; the caller
> still needs to deal with partial copies.  And that's a slow path
> by definition, so informing them of a partial fault-in is not
> going to be useful.
>
> As far as callers are concerned, it's "nothing suitable in the
> beginning of the area" vs. "something might be accessible".

Yes, and the third case would be "something might be accessible, but
not all of it". There probably are callers that give up when they
don't have it all.

Andreas


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper
  2021-07-24 22:06           ` Andreas Gruenbacher
@ 2021-07-24 23:39             ` Al Viro
  0 siblings, 0 replies; 19+ messages in thread
From: Al Viro @ 2021-07-24 23:39 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Linus Torvalds, Christoph Hellwig, Darrick J. Wong, Jan Kara,
	Matthew Wilcox, cluster-devel, linux-fsdevel,
	Linux Kernel Mailing List, ocfs2-devel

On Sun, Jul 25, 2021 at 12:06:41AM +0200, Andreas Gruenbacher wrote:
> On Sat, Jul 24, 2021 at 11:57 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Sat, Jul 24, 2021 at 11:38:20PM +0200, Andreas Gruenbacher wrote:
> >
> > > Hmm, how could we have sub-page failure areas when this is about if
> > > and how pages are mapped? If we return the number of bytes that are
> > > accessible, then users will know if they got nothing, something, or
> > > everything, and they can act accordingly.
> >
> > What I'm saying is that in situation when you have cacheline-sized
> > poisoned areas, there's no way to get an accurate count of readable
> > area other than try and copy it out.
> >
> > What's more, "something" is essentially useless information - the
> > pages might get unmapped right as your function returns; the caller
> > still needs to deal with partial copies.  And that's a slow path
> > by definition, so informing them of a partial fault-in is not
> > going to be useful.
> >
> > As far as callers are concerned, it's "nothing suitable in the
> > beginning of the area" vs. "something might be accessible".
> 
> Yes, and the third case would be "something might be accessible, but
> not all of it". There probably are callers that give up when they
> don't have it all.

Who cares?  Again,
	1) those callers *still* have to cope with copyin/copyout failures
halfway through.  Fully successful fault-in does not guarantee anything
whatsoever.  IOW, you won't get rid of any complexity that way.
	2) earlier bailout in rare error case is not worth bothering with.
If you'd been given an iov_iter spanning an unmapped/unreadable/unwritable
area of user memory, it's either a fucking rare race with truncate() of
an mmapped file or a pilot error.  Neither case is worth optimizing for.

	The difference between partially accessible and completely accessible
at the fault-in time is useless for callers.  Really.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper
  2021-07-24 19:52   ` Linus Torvalds
  2021-07-24 20:24     ` Al Viro
@ 2021-07-27  9:30     ` David Laight
  2021-07-27 11:13       ` Andreas Gruenbacher
  1 sibling, 1 reply; 19+ messages in thread
From: David Laight @ 2021-07-27  9:30 UTC (permalink / raw)
  To: 'Linus Torvalds', Andreas Gruenbacher
  Cc: Alexander Viro, Christoph Hellwig, Darrick J. Wong, Jan Kara,
	Matthew Wilcox, cluster-devel, linux-fsdevel,
	Linux Kernel Mailing List, ocfs2-devel

From: Linus Torvalds
> Sent: 24 July 2021 20:53
> 
> On Sat, Jul 24, 2021 at 12:35 PM Andreas Gruenbacher
> <agruenba@redhat.com> wrote:
> >
> > +int iov_iter_fault_in_writeable(const struct iov_iter *i, size_t bytes)
> > +{
> ...
> > +                       if (fault_in_user_pages(start, len, true) != len)
> > +                               return -EFAULT;
> 
> Looking at this once more, I think this is likely wrong.
> 
> Why?
> 
> Because any user can/should only care about at least *part* of the
> area being writable.
> 
> Imagine that you're doing a large read. If the *first* page is
> writable, you should still return the partial read, not -EFAULT.

My 2c...

Is it actually worth doing any more than ensuring the first byte
of the buffer is paged in before entering the block that has
to disable page faults?

Most of the all the pages are present so the IO completes.

The pages can always get unmapped (due to page pressure or
another application thread unmapping them) so there needs
to be a retry loop.
Given the cost of actually faulting in a page going around
the outer loop may not matter.
Indeed, if an application has just mmap()ed in a very large
file and is then doing a write() from it then it is quite
likely that the pages got unmapped!

Clearly there needs to be extra code to ensure progress is made.
This might actually require the use of 'bounce buffers'
for really problematic user requests.

I also wonder what actually happens for pipes and fifos.
IIRC reads and write of up to PIPE_MAX (typically 4096)
are expected to be atomic.
This should be true even if there are page faults part way
through the copy_to/from_user().

It has to be said I can't see any reference to PIPE_MAX
in the linux man pages, but I'm sure it is in the POSIX/TOG
spec.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper
  2021-07-27  9:30     ` David Laight
@ 2021-07-27 11:13       ` Andreas Gruenbacher
  2021-07-27 17:51         ` Linus Torvalds
  0 siblings, 1 reply; 19+ messages in thread
From: Andreas Gruenbacher @ 2021-07-27 11:13 UTC (permalink / raw)
  To: David Laight
  Cc: Linus Torvalds, Alexander Viro, Christoph Hellwig,
	Darrick J. Wong, Jan Kara, Matthew Wilcox, cluster-devel,
	linux-fsdevel, Linux Kernel Mailing List, ocfs2-devel

On Tue, Jul 27, 2021 at 11:30 AM David Laight <David.Laight@aculab.com> wrote:
> From: Linus Torvalds
> > Sent: 24 July 2021 20:53
> >
> > On Sat, Jul 24, 2021 at 12:35 PM Andreas Gruenbacher
> > <agruenba@redhat.com> wrote:
> > >
> > > +int iov_iter_fault_in_writeable(const struct iov_iter *i, size_t bytes)
> > > +{
> > ...
> > > +                       if (fault_in_user_pages(start, len, true) != len)
> > > +                               return -EFAULT;
> >
> > Looking at this once more, I think this is likely wrong.
> >
> > Why?
> >
> > Because any user can/should only care about at least *part* of the
> > area being writable.
> >
> > Imagine that you're doing a large read. If the *first* page is
> > writable, you should still return the partial read, not -EFAULT.
>
> My 2c...
>
> Is it actually worth doing any more than ensuring the first byte
> of the buffer is paged in before entering the block that has
> to disable page faults?

We definitely do want to process as many pages as we can, especially
if allocations are involved during a write.

> Most of the all the pages are present so the IO completes.

That's not guaranteed. There are cases in which none of the pages are
present, and then there are cases in which only the first page is
present (for example, because of a previous access that wasn't page
aligned).

> The pages can always get unmapped (due to page pressure or
> another application thread unmapping them) so there needs
> to be a retry loop.
> Given the cost of actually faulting in a page going around
> the outer loop may not matter.
> Indeed, if an application has just mmap()ed in a very large
> file and is then doing a write() from it then it is quite
> likely that the pages got unmapped!
>
> Clearly there needs to be extra code to ensure progress is made.
> This might actually require the use of 'bounce buffers'
> for really problematic user requests.

I'm not sure if repeated unmapping of the pages that we've just
faulted in is going to be a problem (in terms of preventing progress).
But a suitable heuristic might be to shrink the fault-in "window" on
each retry until it's only one page.

> I also wonder what actually happens for pipes and fifos.
> IIRC reads and write of up to PIPE_MAX (typically 4096)
> are expected to be atomic.
> This should be true even if there are page faults part way
> through the copy_to/from_user().
>
> It has to be said I can't see any reference to PIPE_MAX
> in the linux man pages, but I'm sure it is in the POSIX/TOG
> spec.
>
>         David

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper
  2021-07-27 11:13       ` Andreas Gruenbacher
@ 2021-07-27 17:51         ` Linus Torvalds
  0 siblings, 0 replies; 19+ messages in thread
From: Linus Torvalds @ 2021-07-27 17:51 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: David Laight, Alexander Viro, Christoph Hellwig, Darrick J. Wong,
	Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	Linux Kernel Mailing List, ocfs2-devel

On Tue, Jul 27, 2021 at 4:14 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> On Tue, Jul 27, 2021 at 11:30 AM David Laight <David.Laight@aculab.com> wrote:
> >
> > Is it actually worth doing any more than ensuring the first byte
> > of the buffer is paged in before entering the block that has
> > to disable page faults?
>
> We definitely do want to process as many pages as we can, especially
> if allocations are involved during a write.

Yeah, from an efficiency standpoint, once you start walking page
tables, it's probably best to just handle as much as you can.

But once you get an error, I don't think it should be "everything is bad".

This is a bit annoying, because while *most* users really just want
that "everything is good", *some* users might just want to handle the
partial success case.

It's why "copy_to/from_user()" returns the number of bytes *not*
written, rather than -EFAULT like get/put_user(). 99% of all users
just want to know "did I write all bytes" (and then checking for a
zero return is a simple and cheap verification of "everything was
ok").

But then very occasionally, you hit a case where you actually want to
know how much of a copy worked. It's rare, but it happens, and the
read/write system calls tend to be the main user of it.

And yes, the fact that "copy_to/from_user()" doesn't return an error
(like get/put_user() does) has confused people many times over the
years. It's annoying, but it's required by those (few) users that
really do want to handle that partial case.

I think this iov_iter_fault_in_readable/writeable() case should do the same.

And no, it's not new to Andreas' patch. iov_iter_fault_in_readable()
is doing the "everything has to be good" thing already.

Which maybe implies that nobody cares about partial reads/writes. Or
it's very very rare - I've seen code that handles page faults in user
space, but it's admittedly been some very special CPU
simulator/emulator checkpointing stuff.

               Linus

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2021-07-27 17:52 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-24 19:34 [PATCH v4 0/8] gfs2: Fix mmap + page fault deadlocks Andreas Gruenbacher
2021-07-24 19:34 ` [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper Andreas Gruenbacher
2021-07-24 19:52   ` Linus Torvalds
2021-07-24 20:24     ` Al Viro
2021-07-24 20:37       ` Linus Torvalds
2021-07-24 21:38       ` Andreas Gruenbacher
2021-07-24 21:57         ` Al Viro
2021-07-24 22:06           ` Andreas Gruenbacher
2021-07-24 23:39             ` Al Viro
2021-07-27  9:30     ` David Laight
2021-07-27 11:13       ` Andreas Gruenbacher
2021-07-27 17:51         ` Linus Torvalds
2021-07-24 19:34 ` [PATCH v4 2/8] gfs2: Add wrapper for iomap_file_buffered_write Andreas Gruenbacher
2021-07-24 19:34 ` [PATCH v4 3/8] gfs2: Fix mmap + page fault deadlocks for buffered I/O Andreas Gruenbacher
2021-07-24 19:34 ` [PATCH v4 4/8] iomap: Fix iomap_dio_rw return value for user copies Andreas Gruenbacher
2021-07-24 19:34 ` [PATCH v4 5/8] iomap: Add done_before argument to iomap_dio_rw Andreas Gruenbacher
2021-07-24 19:34 ` [PATCH v4 6/8] iomap: Support restarting direct I/O requests after user copy failures Andreas Gruenbacher
2021-07-24 19:34 ` [PATCH v4 7/8] iov_iter: Introduce noio flag to disable page faults Andreas Gruenbacher
2021-07-24 19:34 ` [PATCH v4 8/8] gfs2: Fix mmap + page fault deadlocks for direct I/O Andreas Gruenbacher

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).