linux-fsdevel.vger.kernel.org archive mirror
* [PATCHSET v4 0/5] Support for RWF_UNCACHED
@ 2019-12-12 19:01 Jens Axboe
  2019-12-12 19:01 ` [PATCH 1/5] fs: add read support " Jens Axboe
                   ` (5 more replies)
  0 siblings, 6 replies; 27+ messages in thread
From: Jens Axboe @ 2019-12-12 19:01 UTC
  To: linux-mm, linux-fsdevel, linux-block; +Cc: willy, clm, torvalds, david

Recently someone asked me how io_uring buffered IO compares to mmap'ed
IO in terms of performance. So I ran some tests with buffered IO, and
found the experience to be somewhat painful. The test case is pretty
basic: random reads over a dataset that's 10x the size of RAM.
Performance starts out fine, and then the page cache fills up and we
hit a throughput cliff. CPU usage of the IO threads goes up, and we have
kswapd spending 100% of a core trying to keep up. Seeing that, I was
reminded of the many complaints I hear about buffered IO, and the fact
that most of the folks complaining will ultimately bite the bullet and
move to O_DIRECT to just get the kernel out of the way.

But I don't think it needs to be like that. Switching to O_DIRECT isn't
always easily doable. The buffers have different lifetimes, sizes, and
alignment constraints, etc. On top of that, mixing buffered and O_DIRECT
can be painful.

Seems to me that we have an opportunity to provide something that sits
somewhere in between buffered and O_DIRECT, and this is where
RWF_UNCACHED enters the picture. If this flag is set on IO, we get the
following behavior:

- If the data is in cache, it remains in cache and the copy (in or out)
  is served to/from that. This is true for both reads and writes.

- For writes, if the data is NOT in cache, we add it while performing the
  IO. When the IO is done, we remove it again.

- For reads, if the data is NOT in the cache, we allocate a private page
  and use that for IO. When the IO is done, we free this page. The page
  never sees the page cache.
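
As a quick illustration of the API from an application's point of view,
here is a minimal sketch using preadv2(2). It assumes the RWF_UNCACHED
value from this series (0x00000040), which is of course not in any
released kernel headers yet:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED	0x00000040	/* value from this series */
#endif

int main(int argc, char **argv)
{
	char buf[4096];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	int fd;
	ssize_t ret;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * Read 4k from offset 0. If the range is already cached, it is
	 * served from the page cache; if not, a private page is used
	 * for the IO and dropped again after the copy.
	 */
	ret = preadv2(fd, &iov, 1, 0, RWF_UNCACHED);
	if (ret < 0)
		perror("preadv2");
	else
		printf("read %zd bytes uncached\n", ret);

	close(fd);
	return 0;
}

With io_uring, the same flag goes in sqe->rw_flags for the read/write
opcodes.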

With this, I can do 100% smooth buffered reads or writes without pushing
the kernel to the state where kswapd is sweating bullets. In fact it
doesn't even register.

Comments appreciated! This should work on any standard file system,
using either the generic helpers or iomap. I have tested ext4 and xfs
for the right read/write behavior, but no further validation has been
done yet. This version contains the bigger prep patch of switching
iomap_apply() and the actors to struct iomap_data, and I hope I didn't
mess that up too badly. I'll try to exercise it further; I've done XFS
mounts and reads+writes, and it seems happy from that POV at least.

The core of the changes is actually really small, the majority of
the diff is just prep work to get there.

Patches are against current git, and can also be found here:

https://git.kernel.dk/cgit/linux-block/log/?h=buffered-uncached

 fs/ceph/file.c          |   2 +-
 fs/dax.c                |  25 +++--
 fs/ext4/file.c          |   2 +-
 fs/iomap/apply.c        |  50 ++++++---
 fs/iomap/buffered-io.c  | 225 +++++++++++++++++++++++++---------------
 fs/iomap/direct-io.c    |  57 +++++-----
 fs/iomap/fiemap.c       |  48 +++++----
 fs/iomap/seek.c         |  64 +++++++-----
 fs/iomap/swapfile.c     |  27 ++---
 fs/nfs/file.c           |   2 +-
 include/linux/fs.h      |   7 +-
 include/linux/iomap.h   |  20 +++-
 include/uapi/linux/fs.h |   5 +-
 mm/filemap.c            |  89 +++++++++++++---
 14 files changed, 416 insertions(+), 207 deletions(-)

Changes since v3:
- Add iomap_actor_data to cut down on arguments
- Fix bad flag drop in iomap_write_begin()
- Remove unused IOMAP_WRITE_F_UNCACHED flag
- Don't use the page cache at all for reads

Changes since v2:
- Rework the write side according to Chinner's suggestions. Much cleaner
  this way. It does mean that we invalidate the full write region if just
  ONE page (or more) had to be created, where before it was more granular.
  I don't think that's a concern, and on the plus side, we now no longer
  have to chunk invalidations into 15/16 pages at a time.
- Cleanups

Changes since v1:
- Switch to pagevecs for write_drop_cached_pages()
- Use page_offset() instead of manual shift
- Ensure we hold a reference on the page between calling ->write_end()
  and checking the mapping on the locked page
- Fix XFS multi-page streamed writes, we'd drop the UNCACHED flag after
  the first page

-- 
Jens Axboe




* [PATCH 1/5] fs: add read support for RWF_UNCACHED
  2019-12-12 19:01 [PATCHSET v4 0/5] Support for RWF_UNCACHED Jens Axboe
@ 2019-12-12 19:01 ` Jens Axboe
  2019-12-12 21:21   ` Matthew Wilcox
  2019-12-12 19:01 ` [PATCH 2/5] mm: make generic_perform_write() take a struct kiocb Jens Axboe
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2019-12-12 19:01 UTC
  To: linux-mm, linux-fsdevel, linux-block
  Cc: willy, clm, torvalds, david, Jens Axboe

If RWF_UNCACHED is set for io_uring (or preadv2(2)), we'll use private
pages for the buffered reads. These pages will never be inserted into
the page cache, and they are simply dropped when we have done the copy
at the end of IO.

If pages in the read range are already in the page cache, then use those
for just copying the data instead of starting IO on private pages.

A previous solution used the page cache even for non-cached ranges, but
the cost of doing so was too high. Removing nodes at the end is
expensive, even with LRU bypass. On top of that, repeatedly
instantiating new xarray nodes is very costly, as it needs to memset 576
bytes of data, and freeing said nodes involves an RCU call per node as
well. All that adds up, making uncached somewhat slower than O_DIRECT.

With the current solution, we're basically at O_DIRECT levels of
performance for RWF_UNCACHED IO.

Protect against truncate the same way O_DIRECT does, by calling
inode_dio_begin() to elevate the inode->i_dio_count.
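
In sketch form, that pairing is just the below, with do_uncached_read()
being a made-up stand-in for the private page read loop added in this
patch:

	inode_dio_begin(inode);	/* elevates i_dio_count, holds off truncate */
	ret = do_uncached_read(iocb, iter);
	inode_dio_end(inode);	/* drops it again, truncate can proceed */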

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/fs.h      |  3 +++
 include/uapi/linux/fs.h |  5 ++++-
 mm/filemap.c            | 40 +++++++++++++++++++++++++++++++++-------
 3 files changed, 40 insertions(+), 8 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 98e0349adb52..092ea2a4319b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -314,6 +314,7 @@ enum rw_hint {
 #define IOCB_SYNC		(1 << 5)
 #define IOCB_WRITE		(1 << 6)
 #define IOCB_NOWAIT		(1 << 7)
+#define IOCB_UNCACHED		(1 << 8)
 
 struct kiocb {
 	struct file		*ki_filp;
@@ -3418,6 +3419,8 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
 		ki->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
 	if (flags & RWF_APPEND)
 		ki->ki_flags |= IOCB_APPEND;
+	if (flags & RWF_UNCACHED)
+		ki->ki_flags |= IOCB_UNCACHED;
 	return 0;
 }
 
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 379a612f8f1d..357ebb0e0c5d 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -299,8 +299,11 @@ typedef int __bitwise __kernel_rwf_t;
 /* per-IO O_APPEND */
 #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
 
+/* drop cache after reading or writing data */
+#define RWF_UNCACHED	((__force __kernel_rwf_t)0x00000040)
+
 /* mask of flags supported by the kernel */
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
-			 RWF_APPEND)
+			 RWF_APPEND | RWF_UNCACHED)
 
 #endif /* _UAPI_LINUX_FS_H */
diff --git a/mm/filemap.c b/mm/filemap.c
index bf6aa30be58d..5d299d69b185 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1990,6 +1990,13 @@ static void shrink_readahead_size_eio(struct file *filp,
 	ra->ra_pages /= 4;
 }
 
+static void buffered_put_page(struct page *page, bool clear_mapping)
+{
+	if (clear_mapping)
+		page->mapping = NULL;
+	put_page(page);
+}
+
 /**
  * generic_file_buffered_read - generic file read routine
  * @iocb:	the iocb to read
@@ -2013,6 +2020,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 	struct address_space *mapping = filp->f_mapping;
 	struct inode *inode = mapping->host;
 	struct file_ra_state *ra = &filp->f_ra;
+	bool did_dio_begin = false;
 	loff_t *ppos = &iocb->ki_pos;
 	pgoff_t index;
 	pgoff_t last_index;
@@ -2032,6 +2040,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 	offset = *ppos & ~PAGE_MASK;
 
 	for (;;) {
+		bool clear_mapping = false;
 		struct page *page;
 		pgoff_t end_index;
 		loff_t isize;
@@ -2048,6 +2057,13 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		if (!page) {
 			if (iocb->ki_flags & IOCB_NOWAIT)
 				goto would_block;
+			/* UNCACHED implies no read-ahead */
+			if (iocb->ki_flags & IOCB_UNCACHED) {
+				did_dio_begin = true;
+				/* block truncate for UNCACHED reads */
+				inode_dio_begin(inode);
+				goto no_cached_page;
+			}
 			page_cache_sync_readahead(mapping,
 					ra, filp,
 					index, last_index - index);
@@ -2106,7 +2122,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		isize = i_size_read(inode);
 		end_index = (isize - 1) >> PAGE_SHIFT;
 		if (unlikely(!isize || index > end_index)) {
-			put_page(page);
+			buffered_put_page(page, clear_mapping);
 			goto out;
 		}
 
@@ -2115,7 +2131,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		if (index == end_index) {
 			nr = ((isize - 1) & ~PAGE_MASK) + 1;
 			if (nr <= offset) {
-				put_page(page);
+				buffered_put_page(page, clear_mapping);
 				goto out;
 			}
 		}
@@ -2147,7 +2163,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		offset &= ~PAGE_MASK;
 		prev_offset = offset;
 
-		put_page(page);
+		buffered_put_page(page, clear_mapping);
 		written += ret;
 		if (!iov_iter_count(iter))
 			goto out;
@@ -2189,7 +2205,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 
 		if (unlikely(error)) {
 			if (error == AOP_TRUNCATED_PAGE) {
-				put_page(page);
+				buffered_put_page(page, clear_mapping);
 				error = 0;
 				goto find_page;
 			}
@@ -2206,7 +2222,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 					 * invalidate_mapping_pages got it
 					 */
 					unlock_page(page);
-					put_page(page);
+					buffered_put_page(page, clear_mapping);
 					goto find_page;
 				}
 				unlock_page(page);
@@ -2221,7 +2237,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 
 readpage_error:
 		/* UHHUH! A synchronous read error occurred. Report it */
-		put_page(page);
+		buffered_put_page(page, clear_mapping);
 		goto out;
 
 no_cached_page:
@@ -2234,7 +2250,15 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 			error = -ENOMEM;
 			goto out;
 		}
-		error = add_to_page_cache_lru(page, mapping, index,
+		if (iocb->ki_flags & IOCB_UNCACHED) {
+			__SetPageLocked(page);
+			page->mapping = mapping;
+			page->index = index;
+			clear_mapping = true;
+			goto readpage;
+		}
+
+		error = add_to_page_cache(page, mapping, index,
 				mapping_gfp_constraint(mapping, GFP_KERNEL));
 		if (error) {
 			put_page(page);
@@ -2250,6 +2274,8 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 would_block:
 	error = -EAGAIN;
 out:
+	if (did_dio_begin)
+		inode_dio_end(inode);
 	ra->prev_pos = prev_index;
 	ra->prev_pos <<= PAGE_SHIFT;
 	ra->prev_pos |= prev_offset;
-- 
2.24.1



* [PATCH 2/5] mm: make generic_perform_write() take a struct kiocb
  2019-12-12 19:01 [PATCHSET v4 0/5] Support for RWF_UNCACHED Jens Axboe
  2019-12-12 19:01 ` [PATCH 1/5] fs: add read support " Jens Axboe
@ 2019-12-12 19:01 ` Jens Axboe
  2019-12-12 19:01 ` [PATCH 3/5] mm: make buffered writes work with RWF_UNCACHED Jens Axboe
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2019-12-12 19:01 UTC
  To: linux-mm, linux-fsdevel, linux-block
  Cc: willy, clm, torvalds, david, Jens Axboe

Right now all callers pass in iocb->ki_pos; just pass in the iocb
instead. This is in preparation for using the iocb flags in
generic_perform_write().
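
For a caller, the change is mechanical:

	/* before */
	written = generic_perform_write(file, from, iocb->ki_pos);

	/* after: the helper reads the position from iocb->ki_pos itself */
	written = generic_perform_write(file, from, iocb);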

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/ceph/file.c     | 2 +-
 fs/ext4/file.c     | 2 +-
 fs/nfs/file.c      | 2 +-
 include/linux/fs.h | 3 ++-
 mm/filemap.c       | 8 +++++---
 5 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 11929d2bb594..096c009f188f 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1538,7 +1538,7 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		 * are pending vmtruncate. So write and vmtruncate
 		 * can not run at the same time
 		 */
-		written = generic_perform_write(file, from, pos);
+		written = generic_perform_write(file, from, iocb);
 		if (likely(written >= 0))
 			iocb->ki_pos = pos + written;
 		ceph_end_io_write(inode);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 6a7293a5cda2..9ffb857765d5 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -249,7 +249,7 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
 		goto out;
 
 	current->backing_dev_info = inode_to_bdi(inode);
-	ret = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
+	ret = generic_perform_write(iocb->ki_filp, from, iocb);
 	current->backing_dev_info = NULL;
 
 out:
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 8eb731d9be3e..d8f51a702a4e 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -624,7 +624,7 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
 	result = generic_write_checks(iocb, from);
 	if (result > 0) {
 		current->backing_dev_info = inode_to_bdi(inode);
-		result = generic_perform_write(file, from, iocb->ki_pos);
+		result = generic_perform_write(file, from, iocb);
 		current->backing_dev_info = NULL;
 	}
 	nfs_end_io_write(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 092ea2a4319b..bf58db1bc032 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3103,7 +3103,8 @@ extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t generic_file_direct_write(struct kiocb *, struct iov_iter *);
-extern ssize_t generic_perform_write(struct file *, struct iov_iter *, loff_t);
+extern ssize_t generic_perform_write(struct file *, struct iov_iter *,
+				     struct kiocb *);
 
 ssize_t vfs_iter_read(struct file *file, struct iov_iter *iter, loff_t *ppos,
 		rwf_t flags);
diff --git a/mm/filemap.c b/mm/filemap.c
index 5d299d69b185..00b8e87fb9cf 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3292,10 +3292,11 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
 EXPORT_SYMBOL(grab_cache_page_write_begin);
 
 ssize_t generic_perform_write(struct file *file,
-				struct iov_iter *i, loff_t pos)
+				struct iov_iter *i, struct kiocb *iocb)
 {
 	struct address_space *mapping = file->f_mapping;
 	const struct address_space_operations *a_ops = mapping->a_ops;
+	loff_t pos = iocb->ki_pos;
 	long status = 0;
 	ssize_t written = 0;
 	unsigned int flags = 0;
@@ -3429,7 +3430,8 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
 			goto out;
 
-		status = generic_perform_write(file, from, pos = iocb->ki_pos);
+		pos = iocb->ki_pos;
+		status = generic_perform_write(file, from, iocb);
 		/*
 		 * If generic_perform_write() returned a synchronous error
 		 * then we want to return the number of bytes which were
@@ -3461,7 +3463,7 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			 */
 		}
 	} else {
-		written = generic_perform_write(file, from, iocb->ki_pos);
+		written = generic_perform_write(file, from, iocb);
 		if (likely(written > 0))
 			iocb->ki_pos += written;
 	}
-- 
2.24.1



* [PATCH 3/5] mm: make buffered writes work with RWF_UNCACHED
  2019-12-12 19:01 [PATCHSET v4 0/5] Support for RWF_UNCACHED Jens Axboe
  2019-12-12 19:01 ` [PATCH 1/5] fs: add read support " Jens Axboe
  2019-12-12 19:01 ` [PATCH 2/5] mm: make generic_perform_write() take a struct kiocb Jens Axboe
@ 2019-12-12 19:01 ` Jens Axboe
  2019-12-12 19:01 ` [PATCH 4/5] iomap: add struct iomap_data Jens Axboe
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2019-12-12 19:01 UTC
  To: linux-mm, linux-fsdevel, linux-block
  Cc: willy, clm, torvalds, david, Jens Axboe

If RWF_UNCACHED is set for io_uring (or pwritev2(2)), we'll drop the
page cache instantiated by the buffered write once it has completed. If
no new pages were instantiated, the existing cached pages are left
alone. This provides similar semantics to reads with RWF_UNCACHED set.
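
From userspace, an uncached write is a minimal sketch along these lines,
again assuming the RWF_UNCACHED value from this series:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED	0x00000040	/* value from this series */
#endif

int main(void)
{
	char buf[4096];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	ssize_t ret;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(buf, 0x5a, sizeof(buf));

	/*
	 * Buffered write; any pages instantiated to service it are
	 * written back and invalidated before the call returns.
	 */
	ret = pwritev2(fd, &iov, 1, 0, RWF_UNCACHED);
	if (ret < 0)
		perror("pwritev2");

	close(fd);
	return 0;
}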

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/fs.h |  1 +
 mm/filemap.c       | 41 +++++++++++++++++++++++++++++++++++++++--
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index bf58db1bc032..5ea5fc167524 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -285,6 +285,7 @@ enum positive_aop_returns {
 #define AOP_FLAG_NOFS			0x0002 /* used by filesystem to direct
 						* helper code (eg buffer layer)
 						* to clear GFP_FS from alloc */
+#define AOP_FLAG_UNCACHED		0x0004
 
 /*
  * oh the beauties of C type declarations.
diff --git a/mm/filemap.c b/mm/filemap.c
index 00b8e87fb9cf..fbcd4537979d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3277,10 +3277,12 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
 					pgoff_t index, unsigned flags)
 {
 	struct page *page;
-	int fgp_flags = FGP_LOCK|FGP_WRITE|FGP_CREAT;
+	int fgp_flags = FGP_LOCK|FGP_WRITE;
 
 	if (flags & AOP_FLAG_NOFS)
 		fgp_flags |= FGP_NOFS;
+	if (!(flags & AOP_FLAG_UNCACHED))
+		fgp_flags |= FGP_CREAT;
 
 	page = pagecache_get_page(mapping, index, fgp_flags,
 			mapping_gfp_mask(mapping));
@@ -3301,6 +3303,9 @@ ssize_t generic_perform_write(struct file *file,
 	ssize_t written = 0;
 	unsigned int flags = 0;
 
+	if (iocb->ki_flags & IOCB_UNCACHED)
+		flags |= AOP_FLAG_UNCACHED;
+
 	do {
 		struct page *page;
 		unsigned long offset;	/* Offset into pagecache page */
@@ -3333,10 +3338,16 @@ ssize_t generic_perform_write(struct file *file,
 			break;
 		}
 
+retry:
 		status = a_ops->write_begin(file, mapping, pos, bytes, flags,
 						&page, &fsdata);
-		if (unlikely(status < 0))
+		if (unlikely(status < 0)) {
+			if (status == -ENOMEM && (flags & AOP_FLAG_UNCACHED)) {
+				flags &= ~AOP_FLAG_UNCACHED;
+				goto retry;
+			}
 			break;
+		}
 
 		if (mapping_writably_mapped(mapping))
 			flush_dcache_page(page);
@@ -3372,6 +3383,32 @@ ssize_t generic_perform_write(struct file *file,
 		balance_dirty_pages_ratelimited(mapping);
 	} while (iov_iter_count(i));
 
+	if (written && (iocb->ki_flags & IOCB_UNCACHED)) {
+		loff_t end;
+
+		pos = iocb->ki_pos;
+		end = pos + written;
+
+		status = filemap_write_and_wait_range(mapping, pos, end);
+		if (status)
+			goto out;
+
+		/*
+		 * No pages were created for this range, we're done
+		 */
+		if (flags & AOP_FLAG_UNCACHED)
+			goto out;
+
+		/*
+		 * Try to invalidate cache pages for the range we just wrote.
+		 * We don't care if invalidation fails as the write has still
+		 * worked and leaving clean uptodate pages in the page cache
+		 * isn't a corruption vector for uncached IO.
+		 */
+		invalidate_inode_pages2_range(mapping,
+					pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
+	}
+out:
 	return written ? written : status;
 }
 EXPORT_SYMBOL(generic_perform_write);
-- 
2.24.1



* [PATCH 4/5] iomap: add struct iomap_data
  2019-12-12 19:01 [PATCHSET v4 0/5] Support for RWF_UNCACHED Jens Axboe
                   ` (2 preceding siblings ...)
  2019-12-12 19:01 ` [PATCH 3/5] mm: make buffered writes work with RWF_UNCACHED Jens Axboe
@ 2019-12-12 19:01 ` Jens Axboe
  2019-12-13 20:32   ` Darrick J. Wong
  2019-12-12 19:01 ` [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes Jens Axboe
  2019-12-13  0:53 ` [PATCHSET v4 0/5] Support for RWF_UNCACHED Jens Axboe
  5 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2019-12-12 19:01 UTC
  To: linux-mm, linux-fsdevel, linux-block
  Cc: willy, clm, torvalds, david, Jens Axboe

We pass a lot of arguments to iomap_apply(), and subsequently to the
actors that it calls. In preparation for adding one more argument,
switch them to using a struct iomap_data instead. The actor gets a
const version of that struct, as it is not supposed to change anything
in it.
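
For a converted caller, the shape is the below, with my_actor standing
in for any of the actors in the diff:

	struct iomap_data data = {
		.inode	= inode,
		.pos	= pos,
		.len	= len,
		.priv	= iter,
		.flags	= IOMAP_WRITE,
	};
	loff_t ret;

	while (data.len > 0) {
		ret = iomap_apply(&data, ops, my_actor);
		if (ret <= 0)
			break;
		data.pos += ret;
		data.len -= ret;
	}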

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/dax.c               |  25 +++--
 fs/iomap/apply.c       |  26 +++---
 fs/iomap/buffered-io.c | 202 +++++++++++++++++++++++++----------------
 fs/iomap/direct-io.c   |  57 +++++++-----
 fs/iomap/fiemap.c      |  48 ++++++----
 fs/iomap/seek.c        |  64 ++++++++-----
 fs/iomap/swapfile.c    |  27 +++---
 include/linux/iomap.h  |  15 ++-
 8 files changed, 278 insertions(+), 186 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 1f1f0201cad1..d1c32dbbdf24 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1090,13 +1090,16 @@ int __dax_zero_page_range(struct block_device *bdev,
 EXPORT_SYMBOL_GPL(__dax_zero_page_range);
 
 static loff_t
-dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap, struct iomap *srcmap)
+dax_iomap_actor(const struct iomap_data *data, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	struct block_device *bdev = iomap->bdev;
 	struct dax_device *dax_dev = iomap->dax_dev;
-	struct iov_iter *iter = data;
+	struct iov_iter *iter = data->priv;
+	loff_t pos = data->pos;
+	loff_t length = data->len;
 	loff_t end = pos + length, done = 0;
+	struct inode *inode = data->inode;
 	ssize_t ret = 0;
 	size_t xfer;
 	int id;
@@ -1197,22 +1200,26 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 {
 	struct address_space *mapping = iocb->ki_filp->f_mapping;
 	struct inode *inode = mapping->host;
-	loff_t pos = iocb->ki_pos, ret = 0, done = 0;
-	unsigned flags = 0;
+	loff_t ret = 0, done = 0;
+	struct iomap_data data = {
+		.inode	= inode,
+		.pos	= iocb->ki_pos,
+		.priv	= iter,
+	};
 
 	if (iov_iter_rw(iter) == WRITE) {
 		lockdep_assert_held_write(&inode->i_rwsem);
-		flags |= IOMAP_WRITE;
+		data.flags |= IOMAP_WRITE;
 	} else {
 		lockdep_assert_held(&inode->i_rwsem);
 	}
 
 	while (iov_iter_count(iter)) {
-		ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
-				iter, dax_iomap_actor);
+		data.len = iov_iter_count(iter);
+		ret = iomap_apply(&data, ops, dax_iomap_actor);
 		if (ret <= 0)
 			break;
-		pos += ret;
+		data.pos += ret;
 		done += ret;
 	}
 
diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
index 76925b40b5fd..e76148db03b8 100644
--- a/fs/iomap/apply.c
+++ b/fs/iomap/apply.c
@@ -21,15 +21,16 @@
  * iomap_end call.
  */
 loff_t
-iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
-		const struct iomap_ops *ops, void *data, iomap_actor_t actor)
+iomap_apply(struct iomap_data *data, const struct iomap_ops *ops,
+	    iomap_actor_t actor)
 {
 	struct iomap iomap = { .type = IOMAP_HOLE };
 	struct iomap srcmap = { .type = IOMAP_HOLE };
 	loff_t written = 0, ret;
 	u64 end;
 
-	trace_iomap_apply(inode, pos, length, flags, ops, actor, _RET_IP_);
+	trace_iomap_apply(data->inode, data->pos, data->len, data->flags, ops,
+				actor, _RET_IP_);
 
 	/*
 	 * Need to map a range from start position for length bytes. This can
@@ -43,17 +44,18 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
 	 * expose transient stale data. If the reserve fails, we can safely
 	 * back out at this point as there is nothing to undo.
 	 */
-	ret = ops->iomap_begin(inode, pos, length, flags, &iomap, &srcmap);
+	ret = ops->iomap_begin(data->inode, data->pos, data->len, data->flags,
+				&iomap, &srcmap);
 	if (ret)
 		return ret;
-	if (WARN_ON(iomap.offset > pos))
+	if (WARN_ON(iomap.offset > data->pos))
 		return -EIO;
 	if (WARN_ON(iomap.length == 0))
 		return -EIO;
 
-	trace_iomap_apply_dstmap(inode, &iomap);
+	trace_iomap_apply_dstmap(data->inode, &iomap);
 	if (srcmap.type != IOMAP_HOLE)
-		trace_iomap_apply_srcmap(inode, &srcmap);
+		trace_iomap_apply_srcmap(data->inode, &srcmap);
 
 	/*
 	 * Cut down the length to the one actually provided by the filesystem,
@@ -62,8 +64,8 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
 	end = iomap.offset + iomap.length;
 	if (srcmap.type != IOMAP_HOLE)
 		end = min(end, srcmap.offset + srcmap.length);
-	if (pos + length > end)
-		length = end - pos;
+	if (data->pos + data->len > end)
+		data->len = end - data->pos;
 
 	/*
 	 * Now that we have guaranteed that the space allocation will succeed,
@@ -77,7 +79,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
 	 * iomap into the actors so that they don't need to have special
 	 * handling for the two cases.
 	 */
-	written = actor(inode, pos, length, data, &iomap,
+	written = actor(data, &iomap,
 			srcmap.type != IOMAP_HOLE ? &srcmap : &iomap);
 
 	/*
@@ -85,9 +87,9 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
 	 * should not fail unless the filesystem has had a fatal error.
 	 */
 	if (ops->iomap_end) {
-		ret = ops->iomap_end(inode, pos, length,
+		ret = ops->iomap_end(data->inode, data->pos, data->len,
 				     written > 0 ? written : 0,
-				     flags, &iomap);
+				     data->flags, &iomap);
 	}
 
 	return written ? written : ret;
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 828444e14d09..0a1a195ed1cc 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -248,14 +248,15 @@ static inline bool iomap_block_needs_zeroing(struct inode *inode,
 }
 
 static loff_t
-iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap, struct iomap *srcmap)
+iomap_readpage_actor(const struct iomap_data *data, struct iomap *iomap,
+		     struct iomap *srcmap)
 {
-	struct iomap_readpage_ctx *ctx = data;
+	struct iomap_readpage_ctx *ctx = data->priv;
+	struct inode *inode = data->inode;
 	struct page *page = ctx->cur_page;
 	struct iomap_page *iop = iomap_page_create(inode, page);
 	bool same_page = false, is_contig = false;
-	loff_t orig_pos = pos;
+	loff_t pos = data->pos, orig_pos = data->pos;
 	unsigned poff, plen;
 	sector_t sector;
 
@@ -266,7 +267,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	}
 
 	/* zero post-eof blocks as the page may be mapped */
-	iomap_adjust_read_range(inode, iop, &pos, length, &poff, &plen);
+	iomap_adjust_read_range(inode, iop, &pos, data->len, &poff, &plen);
 	if (plen == 0)
 		goto done;
 
@@ -302,7 +303,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 
 	if (!ctx->bio || !is_contig || bio_full(ctx->bio, plen)) {
 		gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL);
-		int nr_vecs = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
+		int nr_vecs = (data->len + PAGE_SIZE - 1) >> PAGE_SHIFT;
 
 		if (ctx->bio)
 			submit_bio(ctx->bio);
@@ -333,16 +334,20 @@ int
 iomap_readpage(struct page *page, const struct iomap_ops *ops)
 {
 	struct iomap_readpage_ctx ctx = { .cur_page = page };
-	struct inode *inode = page->mapping->host;
+	struct iomap_data data = {
+		.inode	= page->mapping->host,
+		.priv	= &ctx,
+		.flags	= 0
+	};
 	unsigned poff;
 	loff_t ret;
 
 	trace_iomap_readpage(page->mapping->host, 1);
 
 	for (poff = 0; poff < PAGE_SIZE; poff += ret) {
-		ret = iomap_apply(inode, page_offset(page) + poff,
-				PAGE_SIZE - poff, 0, ops, &ctx,
-				iomap_readpage_actor);
+		data.pos = page_offset(page) + poff;
+		data.len = PAGE_SIZE - poff;
+		ret = iomap_apply(&data, ops, iomap_readpage_actor);
 		if (ret <= 0) {
 			WARN_ON_ONCE(ret == 0);
 			SetPageError(page);
@@ -396,28 +401,34 @@ iomap_next_page(struct inode *inode, struct list_head *pages, loff_t pos,
 }
 
 static loff_t
-iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap, struct iomap *srcmap)
+iomap_readpages_actor(const struct iomap_data *data, struct iomap *iomap,
+		      struct iomap *srcmap)
 {
-	struct iomap_readpage_ctx *ctx = data;
+	struct iomap_readpage_ctx *ctx = data->priv;
 	loff_t done, ret;
 
-	for (done = 0; done < length; done += ret) {
-		if (ctx->cur_page && offset_in_page(pos + done) == 0) {
+	for (done = 0; done < data->len; done += ret) {
+		struct iomap_data rp_data = {
+			.inode	= data->inode,
+			.pos	= data->pos + done,
+			.len	= data->len - done,
+			.priv	= ctx,
+		};
+
+		if (ctx->cur_page && offset_in_page(rp_data.pos) == 0) {
 			if (!ctx->cur_page_in_bio)
 				unlock_page(ctx->cur_page);
 			put_page(ctx->cur_page);
 			ctx->cur_page = NULL;
 		}
 		if (!ctx->cur_page) {
-			ctx->cur_page = iomap_next_page(inode, ctx->pages,
-					pos, length, &done);
+			ctx->cur_page = iomap_next_page(data->inode, ctx->pages,
+					data->pos, data->len, &done);
 			if (!ctx->cur_page)
 				break;
 			ctx->cur_page_in_bio = false;
 		}
-		ret = iomap_readpage_actor(inode, pos + done, length - done,
-				ctx, iomap, srcmap);
+		ret = iomap_readpage_actor(&rp_data, iomap, srcmap);
 	}
 
 	return done;
@@ -431,21 +442,27 @@ iomap_readpages(struct address_space *mapping, struct list_head *pages,
 		.pages		= pages,
 		.is_readahead	= true,
 	};
-	loff_t pos = page_offset(list_entry(pages->prev, struct page, lru));
+	struct iomap_data data = {
+		.inode	= mapping->host,
+		.priv	= &ctx,
+		.flags	= 0
+	};
 	loff_t last = page_offset(list_entry(pages->next, struct page, lru));
-	loff_t length = last - pos + PAGE_SIZE, ret = 0;
+	loff_t ret = 0;
+
+	data.pos = page_offset(list_entry(pages->prev, struct page, lru));
+	data.len = last - data.pos + PAGE_SIZE;
 
-	trace_iomap_readpages(mapping->host, nr_pages);
+	trace_iomap_readpages(data.inode, nr_pages);
 
-	while (length > 0) {
-		ret = iomap_apply(mapping->host, pos, length, 0, ops,
-				&ctx, iomap_readpages_actor);
+	while (data.len > 0) {
+		ret = iomap_apply(&data, ops, iomap_readpages_actor);
 		if (ret <= 0) {
 			WARN_ON_ONCE(ret == 0);
 			goto done;
 		}
-		pos += ret;
-		length -= ret;
+		data.pos += ret;
+		data.len -= ret;
 	}
 	ret = 0;
 done:
@@ -796,10 +813,13 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned len, unsigned copied,
 }
 
 static loff_t
-iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap, struct iomap *srcmap)
+iomap_write_actor(const struct iomap_data *data, struct iomap *iomap,
+		  struct iomap *srcmap)
 {
-	struct iov_iter *i = data;
+	struct inode *inode = data->inode;
+	struct iov_iter *i = data->priv;
+	loff_t length = data->len;
+	loff_t pos = data->pos;
 	long status = 0;
 	ssize_t written = 0;
 
@@ -879,15 +899,20 @@ ssize_t
 iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops)
 {
-	struct inode *inode = iocb->ki_filp->f_mapping->host;
-	loff_t pos = iocb->ki_pos, ret = 0, written = 0;
+	struct iomap_data data = {
+		.inode	= iocb->ki_filp->f_mapping->host,
+		.pos	= iocb->ki_pos,
+		.priv	= iter,
+		.flags	= IOMAP_WRITE
+	};
+	loff_t ret = 0, written = 0;
 
 	while (iov_iter_count(iter)) {
-		ret = iomap_apply(inode, pos, iov_iter_count(iter),
-				IOMAP_WRITE, ops, iter, iomap_write_actor);
+		data.len = iov_iter_count(iter);
+		ret = iomap_apply(&data, ops, iomap_write_actor);
 		if (ret <= 0)
 			break;
-		pos += ret;
+		data.pos += ret;
 		written += ret;
 	}
 
@@ -896,9 +921,11 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *iter,
 EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
 
 static loff_t
-iomap_unshare_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap, struct iomap *srcmap)
+iomap_unshare_actor(const struct iomap_data *data, struct iomap *iomap,
+		    struct iomap *srcmap)
 {
+	loff_t pos = data->pos;
+	loff_t length = data->len;
 	long status = 0;
 	ssize_t written = 0;
 
@@ -914,13 +941,13 @@ iomap_unshare_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		unsigned long bytes = min_t(loff_t, PAGE_SIZE - offset, length);
 		struct page *page;
 
-		status = iomap_write_begin(inode, pos, bytes,
+		status = iomap_write_begin(data->inode, pos, bytes,
 				IOMAP_WRITE_F_UNSHARE, &page, iomap, srcmap);
 		if (unlikely(status))
 			return status;
 
-		status = iomap_write_end(inode, pos, bytes, bytes, page, iomap,
-				srcmap);
+		status = iomap_write_end(data->inode, pos, bytes, bytes, page,
+						iomap, srcmap);
 		if (unlikely(status <= 0)) {
 			if (WARN_ON_ONCE(status == 0))
 				return -EIO;
@@ -933,7 +960,7 @@ iomap_unshare_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		written += status;
 		length -= status;
 
-		balance_dirty_pages_ratelimited(inode->i_mapping);
+		balance_dirty_pages_ratelimited(data->inode->i_mapping);
 	} while (length);
 
 	return written;
@@ -943,15 +970,20 @@ int
 iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 		const struct iomap_ops *ops)
 {
+	struct iomap_data data = {
+		.inode	= inode,
+		.pos	= pos,
+		.len	= len,
+		.flags	= IOMAP_WRITE,
+	};
 	loff_t ret;
 
-	while (len) {
-		ret = iomap_apply(inode, pos, len, IOMAP_WRITE, ops, NULL,
-				iomap_unshare_actor);
+	while (data.len) {
+		ret = iomap_apply(&data, ops, iomap_unshare_actor);
 		if (ret <= 0)
 			return ret;
-		pos += ret;
-		len -= ret;
+		data.pos += ret;
+		data.len -= ret;
 	}
 
 	return 0;
@@ -982,16 +1014,18 @@ static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
 }
 
 static loff_t
-iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
-		void *data, struct iomap *iomap, struct iomap *srcmap)
+iomap_zero_range_actor(const struct iomap_data *data, struct iomap *iomap,
+		       struct iomap *srcmap)
 {
-	bool *did_zero = data;
+	bool *did_zero = data->priv;
+	loff_t count = data->len;
+	loff_t pos = data->pos;
 	loff_t written = 0;
 	int status;
 
 	/* already zeroed?  we're done. */
 	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
-		return count;
+		return data->len;
 
 	do {
 		unsigned offset, bytes;
@@ -999,11 +1033,11 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
 		offset = offset_in_page(pos);
 		bytes = min_t(loff_t, PAGE_SIZE - offset, count);
 
-		if (IS_DAX(inode))
+		if (IS_DAX(data->inode))
 			status = iomap_dax_zero(pos, offset, bytes, iomap);
 		else
-			status = iomap_zero(inode, pos, offset, bytes, iomap,
-					srcmap);
+			status = iomap_zero(data->inode, pos, offset, bytes,
+						iomap, srcmap);
 		if (status < 0)
 			return status;
 
@@ -1021,16 +1055,22 @@ int
 iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 		const struct iomap_ops *ops)
 {
+	struct iomap_data data = {
+		.inode	= inode,
+		.pos	= pos,
+		.len	= len,
+		.priv	= did_zero,
+		.flags	= IOMAP_ZERO
+	};
 	loff_t ret;
 
-	while (len > 0) {
-		ret = iomap_apply(inode, pos, len, IOMAP_ZERO,
-				ops, did_zero, iomap_zero_range_actor);
+	while (data.len > 0) {
+		ret = iomap_apply(&data, ops, iomap_zero_range_actor);
 		if (ret <= 0)
 			return ret;
 
-		pos += ret;
-		len -= ret;
+		data.pos += ret;
+		data.len -= ret;
 	}
 
 	return 0;
@@ -1052,57 +1092,59 @@ iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
 EXPORT_SYMBOL_GPL(iomap_truncate_page);
 
 static loff_t
-iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap, struct iomap *srcmap)
+iomap_page_mkwrite_actor(const struct iomap_data *data,
+			 struct iomap *iomap, struct iomap *srcmap)
 {
-	struct page *page = data;
+	struct page *page = data->priv;
 	int ret;
 
 	if (iomap->flags & IOMAP_F_BUFFER_HEAD) {
-		ret = __block_write_begin_int(page, pos, length, NULL, iomap);
+		ret = __block_write_begin_int(page, data->pos, data->len, NULL,
+						iomap);
 		if (ret)
 			return ret;
-		block_commit_write(page, 0, length);
+		block_commit_write(page, 0, data->len);
 	} else {
 		WARN_ON_ONCE(!PageUptodate(page));
-		iomap_page_create(inode, page);
+		iomap_page_create(data->inode, page);
 		set_page_dirty(page);
 	}
 
-	return length;
+	return data->len;
 }
 
 vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops)
 {
 	struct page *page = vmf->page;
-	struct inode *inode = file_inode(vmf->vma->vm_file);
-	unsigned long length;
-	loff_t offset, size;
+	struct iomap_data data = {
+		.inode	= file_inode(vmf->vma->vm_file),
+		.pos	= page_offset(page),
+		.flags	= IOMAP_WRITE | IOMAP_FAULT,
+		.priv	= page,
+	};
 	ssize_t ret;
+	loff_t size;
 
 	lock_page(page);
-	size = i_size_read(inode);
-	offset = page_offset(page);
-	if (page->mapping != inode->i_mapping || offset > size) {
+	size = i_size_read(data.inode);
+	if (page->mapping != data.inode->i_mapping || data.pos > size) {
 		/* We overload EFAULT to mean page got truncated */
 		ret = -EFAULT;
 		goto out_unlock;
 	}
 
 	/* page is wholly or partially inside EOF */
-	if (offset > size - PAGE_SIZE)
-		length = offset_in_page(size);
+	if (data.pos > size - PAGE_SIZE)
+		data.len = offset_in_page(size);
 	else
-		length = PAGE_SIZE;
+		data.len = PAGE_SIZE;
 
-	while (length > 0) {
-		ret = iomap_apply(inode, offset, length,
-				IOMAP_WRITE | IOMAP_FAULT, ops, page,
-				iomap_page_mkwrite_actor);
+	while (data.len > 0) {
+		ret = iomap_apply(&data, ops, iomap_page_mkwrite_actor);
 		if (unlikely(ret <= 0))
 			goto out_unlock;
-		offset += ret;
-		length -= ret;
+		data.pos += ret;
+		data.len -= ret;
 	}
 
 	wait_for_stable_page(page);
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 23837926c0c5..e561ca9329ac 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -364,24 +364,27 @@ iomap_dio_inline_actor(struct inode *inode, loff_t pos, loff_t length,
 }
 
 static loff_t
-iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap, struct iomap *srcmap)
+iomap_dio_actor(const struct iomap_data *data, struct iomap *iomap,
+		struct iomap *srcmap)
 {
-	struct iomap_dio *dio = data;
+	struct iomap_dio *dio = data->priv;
 
 	switch (iomap->type) {
 	case IOMAP_HOLE:
 		if (WARN_ON_ONCE(dio->flags & IOMAP_DIO_WRITE))
 			return -EIO;
-		return iomap_dio_hole_actor(length, dio);
+		return iomap_dio_hole_actor(data->len, dio);
 	case IOMAP_UNWRITTEN:
 		if (!(dio->flags & IOMAP_DIO_WRITE))
-			return iomap_dio_hole_actor(length, dio);
-		return iomap_dio_bio_actor(inode, pos, length, dio, iomap);
+			return iomap_dio_hole_actor(data->len, dio);
+		return iomap_dio_bio_actor(data->inode, data->pos, data->len,
+						dio, iomap);
 	case IOMAP_MAPPED:
-		return iomap_dio_bio_actor(inode, pos, length, dio, iomap);
+		return iomap_dio_bio_actor(data->inode, data->pos, data->len,
+						dio, iomap);
 	case IOMAP_INLINE:
-		return iomap_dio_inline_actor(inode, pos, length, dio, iomap);
+		return iomap_dio_inline_actor(data->inode, data->pos, data->len,
+						dio, iomap);
 	default:
 		WARN_ON_ONCE(1);
 		return -EIO;
@@ -404,16 +407,19 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 {
 	struct address_space *mapping = iocb->ki_filp->f_mapping;
 	struct inode *inode = file_inode(iocb->ki_filp);
-	size_t count = iov_iter_count(iter);
-	loff_t pos = iocb->ki_pos;
-	loff_t end = iocb->ki_pos + count - 1, ret = 0;
-	unsigned int flags = IOMAP_DIRECT;
+	struct iomap_data data = {
+		.inode	= inode,
+		.pos	= iocb->ki_pos,
+		.len	= iov_iter_count(iter),
+		.flags	= IOMAP_DIRECT
+	};
+	loff_t end = data.pos + data.len - 1, ret = 0;
 	struct blk_plug plug;
 	struct iomap_dio *dio;
 
 	lockdep_assert_held(&inode->i_rwsem);
 
-	if (!count)
+	if (!data.len)
 		return 0;
 
 	if (WARN_ON(is_sync_kiocb(iocb) && !wait_for_completion))
@@ -436,14 +442,16 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	dio->submit.cookie = BLK_QC_T_NONE;
 	dio->submit.last_queue = NULL;
 
+	data.priv = dio;
+
 	if (iov_iter_rw(iter) == READ) {
-		if (pos >= dio->i_size)
+		if (data.pos >= dio->i_size)
 			goto out_free_dio;
 
 		if (iter_is_iovec(iter))
 			dio->flags |= IOMAP_DIO_DIRTY;
 	} else {
-		flags |= IOMAP_WRITE;
+		data.flags |= IOMAP_WRITE;
 		dio->flags |= IOMAP_DIO_WRITE;
 
 		/* for data sync or sync, we need sync completion processing */
@@ -461,14 +469,14 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	}
 
 	if (iocb->ki_flags & IOCB_NOWAIT) {
-		if (filemap_range_has_page(mapping, pos, end)) {
+		if (filemap_range_has_page(mapping, data.pos, end)) {
 			ret = -EAGAIN;
 			goto out_free_dio;
 		}
-		flags |= IOMAP_NOWAIT;
+		data.flags |= IOMAP_NOWAIT;
 	}
 
-	ret = filemap_write_and_wait_range(mapping, pos, end);
+	ret = filemap_write_and_wait_range(mapping, data.pos, end);
 	if (ret)
 		goto out_free_dio;
 
@@ -479,7 +487,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	 * pretty crazy thing to do, so we don't support it 100%.
 	 */
 	ret = invalidate_inode_pages2_range(mapping,
-			pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
+			data.pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
 	if (ret)
 		dio_warn_stale_pagecache(iocb->ki_filp);
 	ret = 0;
@@ -495,8 +503,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 
 	blk_start_plug(&plug);
 	do {
-		ret = iomap_apply(inode, pos, count, flags, ops, dio,
-				iomap_dio_actor);
+		ret = iomap_apply(&data, ops, iomap_dio_actor);
 		if (ret <= 0) {
 			/* magic error code to fall back to buffered I/O */
 			if (ret == -ENOTBLK) {
@@ -505,18 +512,18 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 			}
 			break;
 		}
-		pos += ret;
+		data.pos += ret;
 
-		if (iov_iter_rw(iter) == READ && pos >= dio->i_size) {
+		if (iov_iter_rw(iter) == READ && data.pos >= dio->i_size) {
 			/*
 			 * We only report that we've read data up to i_size.
 			 * Revert iter to a state corresponding to that as
 			 * some callers (such as splice code) rely on it.
 			 */
-			iov_iter_revert(iter, pos - dio->i_size);
+			iov_iter_revert(iter, data.pos - dio->i_size);
 			break;
 		}
-	} while ((count = iov_iter_count(iter)) > 0);
+	} while ((data.len = iov_iter_count(iter)) > 0);
 	blk_finish_plug(&plug);
 
 	if (ret < 0)
diff --git a/fs/iomap/fiemap.c b/fs/iomap/fiemap.c
index bccf305ea9ce..4075fbe0e3f5 100644
--- a/fs/iomap/fiemap.c
+++ b/fs/iomap/fiemap.c
@@ -43,20 +43,20 @@ static int iomap_to_fiemap(struct fiemap_extent_info *fi,
 }
 
 static loff_t
-iomap_fiemap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap, struct iomap *srcmap)
+iomap_fiemap_actor(const struct iomap_data *data, struct iomap *iomap,
+		   struct iomap *srcmap)
 {
-	struct fiemap_ctx *ctx = data;
-	loff_t ret = length;
+	struct fiemap_ctx *ctx = data->priv;
+	loff_t ret = data->len;
 
 	if (iomap->type == IOMAP_HOLE)
-		return length;
+		return data->len;
 
 	ret = iomap_to_fiemap(ctx->fi, &ctx->prev, 0);
 	ctx->prev = *iomap;
 	switch (ret) {
 	case 0:		/* success */
-		return length;
+		return data->len;
 	case 1:		/* extent array full */
 		return 0;
 	default:
@@ -68,6 +68,13 @@ int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fi,
 		loff_t start, loff_t len, const struct iomap_ops *ops)
 {
 	struct fiemap_ctx ctx;
+	struct iomap_data data = {
+		.inode	= inode,
+		.pos	= start,
+		.len	= len,
+		.flags	= IOMAP_REPORT,
+		.priv	= &ctx
+	};
 	loff_t ret;
 
 	memset(&ctx, 0, sizeof(ctx));
@@ -84,9 +91,8 @@ int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fi,
 			return ret;
 	}
 
-	while (len > 0) {
-		ret = iomap_apply(inode, start, len, IOMAP_REPORT, ops, &ctx,
-				iomap_fiemap_actor);
+	while (data.len > 0) {
+		ret = iomap_apply(&data, ops, iomap_fiemap_actor);
 		/* inode with no (attribute) mapping will give ENOENT */
 		if (ret == -ENOENT)
 			break;
@@ -95,8 +101,8 @@ int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fi,
 		if (ret == 0)
 			break;
 
-		start += ret;
-		len -= ret;
+		data.pos += ret;
+		data.len -= ret;
 	}
 
 	if (ctx.prev.type != IOMAP_HOLE) {
@@ -110,13 +116,14 @@ int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fi,
 EXPORT_SYMBOL_GPL(iomap_fiemap);
 
 static loff_t
-iomap_bmap_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap, struct iomap *srcmap)
+iomap_bmap_actor(const struct iomap_data *data, struct iomap *iomap,
+		 struct iomap *srcmap)
 {
-	sector_t *bno = data, addr;
+	sector_t *bno = data->priv, addr;
 
 	if (iomap->type == IOMAP_MAPPED) {
-		addr = (pos - iomap->offset + iomap->addr) >> inode->i_blkbits;
+		addr = (data->pos - iomap->offset + iomap->addr) >>
+				data->inode->i_blkbits;
 		if (addr > INT_MAX)
 			WARN(1, "would truncate bmap result\n");
 		else
@@ -131,16 +138,19 @@ iomap_bmap(struct address_space *mapping, sector_t bno,
 		const struct iomap_ops *ops)
 {
 	struct inode *inode = mapping->host;
-	loff_t pos = bno << inode->i_blkbits;
-	unsigned blocksize = i_blocksize(inode);
+	struct iomap_data data = {
+		.inode	= inode,
+		.pos	= bno << inode->i_blkbits,
+		.len	= i_blocksize(inode),
+		.priv	= &bno
+	};
 	int ret;
 
 	if (filemap_write_and_wait(mapping))
 		return 0;
 
 	bno = 0;
-	ret = iomap_apply(inode, pos, blocksize, 0, ops, &bno,
-			  iomap_bmap_actor);
+	ret = iomap_apply(&data, ops, iomap_bmap_actor);
 	if (ret)
 		return 0;
 	return bno;
diff --git a/fs/iomap/seek.c b/fs/iomap/seek.c
index 89f61d93c0bc..288bee0b5d9b 100644
--- a/fs/iomap/seek.c
+++ b/fs/iomap/seek.c
@@ -118,21 +118,23 @@ page_cache_seek_hole_data(struct inode *inode, loff_t offset, loff_t length,
 
 
 static loff_t
-iomap_seek_hole_actor(struct inode *inode, loff_t offset, loff_t length,
-		      void *data, struct iomap *iomap, struct iomap *srcmap)
+iomap_seek_hole_actor(const struct iomap_data *data, struct iomap *iomap,
+		      struct iomap *srcmap)
 {
+	loff_t offset = data->pos;
+
 	switch (iomap->type) {
 	case IOMAP_UNWRITTEN:
-		offset = page_cache_seek_hole_data(inode, offset, length,
-						   SEEK_HOLE);
+		offset = page_cache_seek_hole_data(data->inode, offset,
+						   data->len, SEEK_HOLE);
 		if (offset < 0)
-			return length;
+			return data->len;
 		/* fall through */
 	case IOMAP_HOLE:
-		*(loff_t *)data = offset;
+		*(loff_t *)data->priv = offset;
 		return 0;
 	default:
-		return length;
+		return data->len;
 	}
 }
 
@@ -140,23 +142,28 @@ loff_t
 iomap_seek_hole(struct inode *inode, loff_t offset, const struct iomap_ops *ops)
 {
 	loff_t size = i_size_read(inode);
-	loff_t length = size - offset;
+	struct iomap_data data = {
+		.inode	= inode,
+		.len	= size - offset,
+		.priv	= &offset,
+		.flags	= IOMAP_REPORT
+	};
 	loff_t ret;
 
 	/* Nothing to be found before or beyond the end of the file. */
 	if (offset < 0 || offset >= size)
 		return -ENXIO;
 
-	while (length > 0) {
-		ret = iomap_apply(inode, offset, length, IOMAP_REPORT, ops,
-				  &offset, iomap_seek_hole_actor);
+	while (data.len > 0) {
+		data.pos = offset;
+		ret = iomap_apply(&data, ops, iomap_seek_hole_actor);
 		if (ret < 0)
 			return ret;
 		if (ret == 0)
 			break;
 
 		offset += ret;
-		length -= ret;
+		data.len -= ret;
 	}
 
 	return offset;
@@ -164,20 +171,22 @@ iomap_seek_hole(struct inode *inode, loff_t offset, const struct iomap_ops *ops)
 EXPORT_SYMBOL_GPL(iomap_seek_hole);
 
 static loff_t
-iomap_seek_data_actor(struct inode *inode, loff_t offset, loff_t length,
-		      void *data, struct iomap *iomap, struct iomap *srcmap)
+iomap_seek_data_actor(const struct iomap_data *data, struct iomap *iomap,
+		      struct iomap *srcmap)
 {
+	loff_t offset = data->pos;
+
 	switch (iomap->type) {
 	case IOMAP_HOLE:
-		return length;
+		return data->len;
 	case IOMAP_UNWRITTEN:
-		offset = page_cache_seek_hole_data(inode, offset, length,
-						   SEEK_DATA);
+		offset = page_cache_seek_hole_data(data->inode, offset,
+						   data->len, SEEK_DATA);
 		if (offset < 0)
-			return length;
+			return data->len;
 		/*FALLTHRU*/
 	default:
-		*(loff_t *)data = offset;
+		*(loff_t *)data->priv = offset;
 		return 0;
 	}
 }
@@ -186,26 +195,31 @@ loff_t
 iomap_seek_data(struct inode *inode, loff_t offset, const struct iomap_ops *ops)
 {
 	loff_t size = i_size_read(inode);
-	loff_t length = size - offset;
+	struct iomap_data data = {
+		.inode  = inode,
+		.len    = size - offset,
+		.priv   = &offset,
+		.flags  = IOMAP_REPORT
+	};
 	loff_t ret;
 
 	/* Nothing to be found before or beyond the end of the file. */
 	if (offset < 0 || offset >= size)
 		return -ENXIO;
 
-	while (length > 0) {
-		ret = iomap_apply(inode, offset, length, IOMAP_REPORT, ops,
-				  &offset, iomap_seek_data_actor);
+	while (data.len > 0) {
+		data.pos = offset;
+		ret = iomap_apply(&data, ops, iomap_seek_data_actor);
 		if (ret < 0)
 			return ret;
 		if (ret == 0)
 			break;
 
 		offset += ret;
-		length -= ret;
+		data.len -= ret;
 	}
 
-	if (length <= 0)
+	if (data.len <= 0)
 		return -ENXIO;
 	return offset;
 }
diff --git a/fs/iomap/swapfile.c b/fs/iomap/swapfile.c
index a648dbf6991e..d911ab4b69ea 100644
--- a/fs/iomap/swapfile.c
+++ b/fs/iomap/swapfile.c
@@ -75,11 +75,10 @@ static int iomap_swapfile_add_extent(struct iomap_swapfile_info *isi)
  * swap only cares about contiguous page-aligned physical extents and makes no
  * distinction between written and unwritten extents.
  */
-static loff_t iomap_swapfile_activate_actor(struct inode *inode, loff_t pos,
-		loff_t count, void *data, struct iomap *iomap,
-		struct iomap *srcmap)
+static loff_t iomap_swapfile_activate_actor(const struct iomap_data *data,
+		struct iomap *iomap, struct iomap *srcmap)
 {
-	struct iomap_swapfile_info *isi = data;
+	struct iomap_swapfile_info *isi = data->priv;
 	int error;
 
 	switch (iomap->type) {
@@ -125,7 +124,7 @@ static loff_t iomap_swapfile_activate_actor(struct inode *inode, loff_t pos,
 			return error;
 		memcpy(&isi->iomap, iomap, sizeof(isi->iomap));
 	}
-	return count;
+	return data->len;
 }
 
 /*
@@ -142,8 +141,13 @@ int iomap_swapfile_activate(struct swap_info_struct *sis,
 	};
 	struct address_space *mapping = swap_file->f_mapping;
 	struct inode *inode = mapping->host;
-	loff_t pos = 0;
-	loff_t len = ALIGN_DOWN(i_size_read(inode), PAGE_SIZE);
+	struct iomap_data data = {
+		.inode	= inode,
+		.pos	= 0,
+		.len	= ALIGN_DOWN(i_size_read(inode), PAGE_SIZE),
+		.priv	= &isi,
+		.flags	= IOMAP_REPORT
+	};
 	loff_t ret;
 
 	/*
@@ -154,14 +158,13 @@ int iomap_swapfile_activate(struct swap_info_struct *sis,
 	if (ret)
 		return ret;
 
-	while (len > 0) {
-		ret = iomap_apply(inode, pos, len, IOMAP_REPORT,
-				ops, &isi, iomap_swapfile_activate_actor);
+	while (data.len > 0) {
+		ret = iomap_apply(&data, ops, iomap_swapfile_activate_actor);
 		if (ret <= 0)
 			return ret;
 
-		pos += ret;
-		len -= ret;
+		data.pos += ret;
+		data.len -= ret;
 	}
 
 	if (isi.iomap.length) {
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 8b09463dae0d..30f40145a9e9 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -145,11 +145,18 @@ struct iomap_ops {
 /*
  * Main iomap iterator function.
  */
-typedef loff_t (*iomap_actor_t)(struct inode *inode, loff_t pos, loff_t len,
-		void *data, struct iomap *iomap, struct iomap *srcmap);
+struct iomap_data {
+	struct inode	*inode;
+	loff_t		pos;
+	loff_t		len;
+	void		*priv;
+	unsigned	flags;
+};
+
+typedef loff_t (*iomap_actor_t)(const struct iomap_data *data,
+		struct iomap *iomap, struct iomap *srcmap);
 
-loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
-		unsigned flags, const struct iomap_ops *ops, void *data,
+loff_t iomap_apply(struct iomap_data *data, const struct iomap_ops *ops,
 		iomap_actor_t actor);
 
 ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
-- 
2.24.1



* [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes
  2019-12-12 19:01 [PATCHSET v4 0/5] Support for RWF_UNCACHED Jens Axboe
                   ` (3 preceding siblings ...)
  2019-12-12 19:01 ` [PATCH 4/5] iomap: add struct iomap_data Jens Axboe
@ 2019-12-12 19:01 ` Jens Axboe
  2019-12-13  2:26   ` Darrick J. Wong
  2019-12-13  0:53 ` [PATCHSET v4 0/5] Support for RWF_UNCACHED Jens Axboe
  5 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2019-12-12 19:01 UTC
  To: linux-mm, linux-fsdevel, linux-block
  Cc: willy, clm, torvalds, david, Jens Axboe

This adds support for RWF_UNCACHED for file systems using iomap to
perform buffered writes. We use the generic infrastructure for this,
tracking whether the write had to instantiate new page cache pages and,
if so, issuing writeback and invalidating the written range once the
write completes.
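
Condensed, the write_begin side of it works as below; this is the shape
of the change, not the verbatim diff:

	status = iomap_write_begin(inode, pos, bytes, flags, &page,
					iomap, srcmap);
	if (status == -ENOMEM && (flags & IOMAP_UNCACHED)) {
		/*
		 * No page was cached for this range. Create one, and
		 * flag it via IOMAP_F_PAGE_CREATE so iomap_apply() can
		 * write back and invalidate the range afterwards.
		 */
		iomap->flags |= IOMAP_F_PAGE_CREATE;
		flags &= ~IOMAP_UNCACHED;
		status = iomap_write_begin(inode, pos, bytes, flags,
						&page, iomap, srcmap);
	}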

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/iomap/apply.c       | 24 ++++++++++++++++++++++++
 fs/iomap/buffered-io.c | 23 +++++++++++++++++++----
 include/linux/iomap.h  |  5 +++++
 3 files changed, 48 insertions(+), 4 deletions(-)

diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
index e76148db03b8..11b6812f7b37 100644
--- a/fs/iomap/apply.c
+++ b/fs/iomap/apply.c
@@ -92,5 +92,29 @@ iomap_apply(struct iomap_data *data, const struct iomap_ops *ops,
 				     data->flags, &iomap);
 	}
 
+	if (written && (data->flags & IOMAP_UNCACHED)) {
+		struct address_space *mapping = data->inode->i_mapping;
+
+		end = data->pos + written;
+		ret = filemap_write_and_wait_range(mapping, data->pos, end);
+		if (ret)
+			goto out;
+
+		/*
+		 * No pages were created for this range, we're done
+		 */
+		if (!(iomap.flags & IOMAP_F_PAGE_CREATE))
+			goto out;
+
+		/*
+		 * Try to invalidate cache pages for the range we just wrote.
+		 * We don't care if invalidation fails as the write has still
+		 * worked and leaving clean uptodate pages in the page cache
+		 * isn't a corruption vector for uncached IO.
+		 */
+		invalidate_inode_pages2_range(mapping,
+				data->pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
+	}
+out:
 	return written ? written : ret;
 }
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 0a1a195ed1cc..df9d6002858e 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -659,6 +659,7 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 		struct page **pagep, struct iomap *iomap, struct iomap *srcmap)
 {
 	const struct iomap_page_ops *page_ops = iomap->page_ops;
+	unsigned aop_flags;
 	struct page *page;
 	int status = 0;
 
@@ -675,8 +676,11 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 			return status;
 	}
 
+	aop_flags = AOP_FLAG_NOFS;
+	if (flags & IOMAP_UNCACHED)
+		aop_flags |= AOP_FLAG_UNCACHED;
 	page = grab_cache_page_write_begin(inode->i_mapping, pos >> PAGE_SHIFT,
-			AOP_FLAG_NOFS);
+						aop_flags);
 	if (!page) {
 		status = -ENOMEM;
 		goto out_no_page;
@@ -818,6 +822,7 @@ iomap_write_actor(const struct iomap_data *data, struct iomap *iomap,
 {
 	struct inode *inode = data->inode;
 	struct iov_iter *i = data->priv;
+	unsigned flags = data->flags;
 	loff_t length = data->len;
 	loff_t pos = data->pos;
 	long status = 0;
@@ -851,10 +856,17 @@ iomap_write_actor(const struct iomap_data *data, struct iomap *iomap,
 			break;
 		}
 
-		status = iomap_write_begin(inode, pos, bytes, 0, &page, iomap,
-				srcmap);
-		if (unlikely(status))
+retry:
+		status = iomap_write_begin(inode, pos, bytes, flags,
+						&page, iomap, srcmap);
+		if (unlikely(status)) {
+			if (status == -ENOMEM && (flags & IOMAP_UNCACHED)) {
+				iomap->flags |= IOMAP_F_PAGE_CREATE;
+				flags &= ~IOMAP_UNCACHED;
+				goto retry;
+			}
 			break;
+		}
 
 		if (mapping_writably_mapped(inode->i_mapping))
 			flush_dcache_page(page);
@@ -907,6 +919,9 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *iter,
 	};
 	loff_t ret = 0, written = 0;
 
+	if (iocb->ki_flags & IOCB_UNCACHED)
+		data.flags |= IOMAP_UNCACHED;
+
 	while (iov_iter_count(iter)) {
 		data.len = iov_iter_count(iter);
 		ret = iomap_apply(&data, ops, iomap_write_actor);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 30f40145a9e9..30bb248e1d0d 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -48,12 +48,16 @@ struct vm_fault;
  *
  * IOMAP_F_BUFFER_HEAD indicates that the file system requires the use of
  * buffer heads for this mapping.
+ *
+ * IOMAP_F_PAGE_CREATE indicates that pages had to be allocated to satisfy
+ * this operation.
  */
 #define IOMAP_F_NEW		0x01
 #define IOMAP_F_DIRTY		0x02
 #define IOMAP_F_SHARED		0x04
 #define IOMAP_F_MERGED		0x08
 #define IOMAP_F_BUFFER_HEAD	0x10
+#define IOMAP_F_PAGE_CREATE	0x20
 
 /*
  * Flags set by the core iomap code during operations:
@@ -121,6 +125,7 @@ struct iomap_page_ops {
 #define IOMAP_FAULT		(1 << 3) /* mapping for page fault */
 #define IOMAP_DIRECT		(1 << 4) /* direct I/O */
 #define IOMAP_NOWAIT		(1 << 5) /* do not block */
+#define IOMAP_UNCACHED		(1 << 6)
 
 struct iomap_ops {
 	/*
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/5] fs: add read support for RWF_UNCACHED
  2019-12-12 19:01 ` [PATCH 1/5] fs: add read support " Jens Axboe
@ 2019-12-12 21:21   ` Matthew Wilcox
  2019-12-12 21:27     ` Jens Axboe
  0 siblings, 1 reply; 27+ messages in thread
From: Matthew Wilcox @ 2019-12-12 21:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-mm, linux-fsdevel, linux-block, clm, torvalds, david

On Thu, Dec 12, 2019 at 12:01:29PM -0700, Jens Axboe wrote:
> @@ -2234,7 +2250,15 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
>  			error = -ENOMEM;
>  			goto out;
>  		}
> -		error = add_to_page_cache_lru(page, mapping, index,
[...]
> +		error = add_to_page_cache(page, mapping, index,
>  				mapping_gfp_constraint(mapping, GFP_KERNEL));

Surely a mistake?  (and does this mistake invalidate the testing you
did earlier?)


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/5] fs: add read support for RWF_UNCACHED
  2019-12-12 21:21   ` Matthew Wilcox
@ 2019-12-12 21:27     ` Jens Axboe
  0 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2019-12-12 21:27 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, linux-fsdevel, linux-block, clm, torvalds, david

On 12/12/19 2:21 PM, Matthew Wilcox wrote:
> On Thu, Dec 12, 2019 at 12:01:29PM -0700, Jens Axboe wrote:
>> @@ -2234,7 +2250,15 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
>>  			error = -ENOMEM;
>>  			goto out;
>>  		}
>> -		error = add_to_page_cache_lru(page, mapping, index,
> [...]
>> +		error = add_to_page_cache(page, mapping, index,
>>  				mapping_gfp_constraint(mapping, GFP_KERNEL));
> 
> Surely a mistake?  (and does this mistake invalidate the testing you
> did earlier?)

Yeah, I already caught that one too - this is new in the v4 patchset, so
it doesn't invalidate any of the earlier buffered testing.
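
For the record: uncached reads allocate a private page that never enters
the page cache at all, so any page we do insert here should keep going
through the LRU as before. The fix is just restoring the _lru variant
for the normal path, i.e. (sketch against the hunk above):

		error = add_to_page_cache_lru(page, mapping, index,
				mapping_gfp_constraint(mapping, GFP_KERNEL));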

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHSET v4 0/5] Support for RWF_UNCACHED
  2019-12-12 19:01 [PATCHSET v4 0/5] Support for RWF_UNCACHED Jens Axboe
                   ` (4 preceding siblings ...)
  2019-12-12 19:01 ` [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes Jens Axboe
@ 2019-12-13  0:53 ` Jens Axboe
  5 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2019-12-13  0:53 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-block; +Cc: willy, clm, torvalds, david

On 12/12/19 12:01 PM, Jens Axboe wrote:
> Recently someone asked me how io_uring buffered IO compares to mmaped
> IO in terms of performance. So I ran some tests with buffered IO, and
> found the experience to be somewhat painful. The test case is pretty
> basic, random reads over a dataset that's 10x the size of RAM.
> Performance starts out fine, and then the page cache fills up and we
> hit a throughput cliff. CPU usage of the IO threads go up, and we have
> kswapd spending 100% of a core trying to keep up. Seeing that, I was
> reminded of the many complaints I here about buffered IO, and the fact
> that most of the folks complaining will ultimately bite the bullet and
> move to O_DIRECT to just get the kernel out of the way.
> 
> But I don't think it needs to be like that. Switching to O_DIRECT isn't
> always easily doable. The buffers have different life times, size and
> alignment constraints, etc. On top of that, mixing buffered and O_DIRECT
> can be painful.
> 
> Seems to me that we have an opportunity to provide something that sits
> somewhere in between buffered and O_DIRECT, and this is where
> RWF_UNCACHED enters the picture. If this flag is set on IO, we get the
> following behavior:
> 
> - If the data is in cache, it remains in cache and the copy (in or out)
>   is served to/from that. This is true for both reads and writes.
> 
> - For writes, if the data is NOT in cache, we add it while performing the
>   IO. When the IO is done, we remove it again.
> 
> - For reads, if the data is NOT in the cache, we allocate a private page
>   and use that for IO. When the IO is done, we free this page. The page
>   never sees the page cache.
> 
> With this, I can do 100% smooth buffered reads or writes without pushing
> the kernel to the state where kswapd is sweating bullets. In fact it
> doesn't even register.
> 
> Comments appreciated! This should work on any standard file system,
> using either the generic helpers or iomap. I have tested ext4 and xfs
> for the right read/write behavior, but no further validation has been
> done yet. This version contains the bigger prep patch of switching
> iomap_apply() and actors to struct iomap_data, and I hope I didn't
> mess that up too badly. I'll try and exercise it all, I've done XFS
> mounts and reads+writes and it seems happy from that POV at least.
> 
> The core of the changes are actually really small, the majority of
> the diff is just prep work to get there.
> 
> Patches are against current git, and can also be found here:
> 
> https://git.kernel.dk/cgit/linux-block/log/?h=buffered-uncached
> 
>  fs/ceph/file.c          |   2 +-
>  fs/dax.c                |  25 +++--
>  fs/ext4/file.c          |   2 +-
>  fs/iomap/apply.c        |  50 ++++++---
>  fs/iomap/buffered-io.c  | 225 +++++++++++++++++++++++++---------------
>  fs/iomap/direct-io.c    |  57 +++++-----
>  fs/iomap/fiemap.c       |  48 +++++----
>  fs/iomap/seek.c         |  64 +++++++-----
>  fs/iomap/swapfile.c     |  27 ++---
>  fs/nfs/file.c           |   2 +-
>  include/linux/fs.h      |   7 +-
>  include/linux/iomap.h   |  20 +++-
>  include/uapi/linux/fs.h |   5 +-
>  mm/filemap.c            |  89 +++++++++++++---
>  14 files changed, 416 insertions(+), 207 deletions(-)
> 
> Changes since v3:
> - Add iomap_actor_data to cut down on arguments
> - Fix bad flag drop in iomap_write_begin()
> - Remove unused IOMAP_WRITE_F_UNCACHED flag
> - Don't use the page cache at all for reads

Had the silly lru error in v4, and also an XFS flags error. I'm not
going to re-post right away just for those, but please use:

https://git.kernel.dk/cgit/linux-block/log/?h=buffered-uncached

if you're going to test this. You can pull it at:

git://git.kernel.dk/linux-block buffered-uncached

Those are the only two changes since v4. I'll throw a v5 out there a bit
later.
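
If anyone wants a quick smoke test, an uncached write from userspace is
just pwritev2() with the new flag. Minimal sketch below - note the 0x20
value is an assumption (the next free RWF bit at the time of writing),
double check include/uapi/linux/fs.h in the branch for the real one:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED	0x00000020	/* assumed value, check uapi header */
#endif

int main(void)
{
	char buf[4096];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	int fd;

	memset(buf, 0xa5, sizeof(buf));

	fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* buffered write that won't leave the data in the page cache */
	if (pwritev2(fd, &iov, 1, 0, RWF_UNCACHED) < 0)
		perror("pwritev2(RWF_UNCACHED)");

	close(fd);
	return 0;
}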

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes
  2019-12-12 19:01 ` [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes Jens Axboe
@ 2019-12-13  2:26   ` Darrick J. Wong
  2019-12-13  2:38     ` Jens Axboe
  0 siblings, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2019-12-13  2:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-mm, linux-fsdevel, linux-block, willy, clm, torvalds, david

On Thu, Dec 12, 2019 at 12:01:33PM -0700, Jens Axboe wrote:
> This adds support for RWF_UNCACHED for file systems using iomap to
> perform buffered writes. We use the generic infrastructure for this,
> by tracking pages we created and calling write_drop_cached_pages()
> to issue writeback and prune those pages.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  fs/iomap/apply.c       | 24 ++++++++++++++++++++++++
>  fs/iomap/buffered-io.c | 23 +++++++++++++++++++----
>  include/linux/iomap.h  |  5 +++++
>  3 files changed, 48 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
> index e76148db03b8..11b6812f7b37 100644
> --- a/fs/iomap/apply.c
> +++ b/fs/iomap/apply.c
> @@ -92,5 +92,29 @@ iomap_apply(struct iomap_data *data, const struct iomap_ops *ops,
>  				     data->flags, &iomap);
>  	}
>  
> +	if (written && (data->flags & IOMAP_UNCACHED)) {

Hmmm... why is a chunk of buffered write(?) code landing in the iomap
apply function?

The #define for IOMAP_UNCACHED doesn't have a comment, so I don't know
what this is supposed to mean.  Judging from the one place it gets set
in the buffered write function I gather that this is how you implement
the "write through page cache and immediately unmap the page if it
wasn't there before" behavior?

So based on that, I think you want ...

if IOMAP_WRITE && _UNCACHED && !_DIRECT && written > 0:
	flush and invalidate

Since direct writes are never going to create page cache, right?

And in that case, why not put this at the end of iomap_write_actor?

(Sorry if this came up in the earlier discussions, I've been busy this
week and still have a long way to go for catching up...)

> +		struct address_space *mapping = data->inode->i_mapping;
> +
> +		end = data->pos + written;
> +		ret = filemap_write_and_wait_range(mapping, data->pos, end);
> +		if (ret)
> +			goto out;
> +
> +		/*
> +		 * No pages were created for this range, we're done
> +		 */
> +		if (!(iomap.flags & IOMAP_F_PAGE_CREATE))
> +			goto out;
> +
> +		/*
> +		 * Try to invalidate cache pages for the range we just wrote.
> +		 * We don't care if invalidation fails as the write has still
> +		 * worked and leaving clean uptodate pages in the page cache
> +		 * isn't a corruption vector for uncached IO.
> +		 */
> +		invalidate_inode_pages2_range(mapping,
> +				data->pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> +	}
> +out:
>  	return written ? written : ret;
>  }
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 0a1a195ed1cc..df9d6002858e 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -659,6 +659,7 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>  		struct page **pagep, struct iomap *iomap, struct iomap *srcmap)
>  {
>  	const struct iomap_page_ops *page_ops = iomap->page_ops;
> +	unsigned aop_flags;
>  	struct page *page;
>  	int status = 0;
>  
> @@ -675,8 +676,11 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>  			return status;
>  	}
>  
> +	aop_flags = AOP_FLAG_NOFS;
> +	if (flags & IOMAP_UNCACHED)
> +		aop_flags |= AOP_FLAG_UNCACHED;
>  	page = grab_cache_page_write_begin(inode->i_mapping, pos >> PAGE_SHIFT,
> -			AOP_FLAG_NOFS);
> +						aop_flags);
>  	if (!page) {
>  		status = -ENOMEM;
>  		goto out_no_page;
> @@ -818,6 +822,7 @@ iomap_write_actor(const struct iomap_data *data, struct iomap *iomap,
>  {
>  	struct inode *inode = data->inode;
>  	struct iov_iter *i = data->priv;
> +	unsigned flags = data->flags;
>  	loff_t length = data->len;
>  	loff_t pos = data->pos;
>  	long status = 0;
> @@ -851,10 +856,17 @@ iomap_write_actor(const struct iomap_data *data, struct iomap *iomap,
>  			break;
>  		}
>  
> -		status = iomap_write_begin(inode, pos, bytes, 0, &page, iomap,
> -				srcmap);
> -		if (unlikely(status))
> +retry:
> +		status = iomap_write_begin(inode, pos, bytes, flags,
> +						&page, iomap, srcmap);
> +		if (unlikely(status)) {
> +			if (status == -ENOMEM && (flags & IOMAP_UNCACHED)) {
> +				iomap->flags |= IOMAP_F_PAGE_CREATE;
> +				flags &= ~IOMAP_UNCACHED;
> +				goto retry;
> +			}
>  			break;
> +		}
>  
>  		if (mapping_writably_mapped(inode->i_mapping))
>  			flush_dcache_page(page);
> @@ -907,6 +919,9 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *iter,
>  	};
>  	loff_t ret = 0, written = 0;
>  
> +	if (iocb->ki_flags & IOCB_UNCACHED)
> +		data.flags |= IOMAP_UNCACHED;
> +
>  	while (iov_iter_count(iter)) {
>  		data.len = iov_iter_count(iter);
>  		ret = iomap_apply(&data, ops, iomap_write_actor);
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 30f40145a9e9..30bb248e1d0d 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -48,12 +48,16 @@ struct vm_fault;
>   *
>   * IOMAP_F_BUFFER_HEAD indicates that the file system requires the use of
>   * buffer heads for this mapping.
> + *
> + * IOMAP_F_PAGE_CREATE indicates that pages had to be allocated to satisfy
> + * this operation.
>   */
>  #define IOMAP_F_NEW		0x01
>  #define IOMAP_F_DIRTY		0x02
>  #define IOMAP_F_SHARED		0x04
>  #define IOMAP_F_MERGED		0x08
>  #define IOMAP_F_BUFFER_HEAD	0x10
> +#define IOMAP_F_PAGE_CREATE	0x20

I think these new flags need an update to the _STRINGS arrays in
fs/iomap/trace.h.
>  
>  /*
>   * Flags set by the core iomap code during operations:
> @@ -121,6 +125,7 @@ struct iomap_page_ops {
>  #define IOMAP_FAULT		(1 << 3) /* mapping for page fault */
>  #define IOMAP_DIRECT		(1 << 4) /* direct I/O */
>  #define IOMAP_NOWAIT		(1 << 5) /* do not block */
> +#define IOMAP_UNCACHED		(1 << 6)

No comment?

--D

>  
>  struct iomap_ops {
>  	/*
> -- 
> 2.24.1
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes
  2019-12-13  2:26   ` Darrick J. Wong
@ 2019-12-13  2:38     ` Jens Axboe
  0 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2019-12-13  2:38 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-mm, linux-fsdevel, linux-block, willy, clm, torvalds, david

On 12/12/19 7:26 PM, Darrick J. Wong wrote:
> On Thu, Dec 12, 2019 at 12:01:33PM -0700, Jens Axboe wrote:
>> This adds support for RWF_UNCACHED for file systems using iomap to
>> perform buffered writes. We use the generic infrastructure for this,
>> by tracking pages we created and calling write_drop_cached_pages()
>> to issue writeback and prune those pages.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  fs/iomap/apply.c       | 24 ++++++++++++++++++++++++
>>  fs/iomap/buffered-io.c | 23 +++++++++++++++++++----
>>  include/linux/iomap.h  |  5 +++++
>>  3 files changed, 48 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
>> index e76148db03b8..11b6812f7b37 100644
>> --- a/fs/iomap/apply.c
>> +++ b/fs/iomap/apply.c
>> @@ -92,5 +92,29 @@ iomap_apply(struct iomap_data *data, const struct iomap_ops *ops,
>>  				     data->flags, &iomap);
>>  	}
>>  
>> +	if (written && (data->flags & IOMAP_UNCACHED)) {
> 
> Hmmm... why is a chunk of buffered write(?) code landing in the iomap
> apply function?

I'm going to say that Dave suggested it ;-)

> The #define for IOMAP_UNCACHED doesn't have a comment, so I don't know
> what this is supposed to mean.  Judging from the one place it gets set
> in the buffered write function I gather that this is how you implement
> the "write through page cache and immediately unmap the page if it
> wasn't there before" behavior?
> 
> So based on that, I think you want ...
> 
> if IOMAP_WRITE && _UNCACHED && !_DIRECT && written > 0:
> 	flush and invalidate

Looking at the comments, I did think it was just for writes, but it
looks generic. I'll take the blame for that; we should only call into
that sync-and-invalidate code for buffered writes. I'll make that
change.

> Since direct writes are never going to create page cache, right?

If they do, it's handled separately.

> And in that case, why not put this at the end of iomap_write_actor?
> 
> (Sorry if this came up in the earlier discussions, I've been busy this
> week and still have a long way to go for catching up...)

It did come up; the idea is to do the flush and invalidation once for
the full range, not per chunk. Does that help?
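
Something like this in iomap_apply(), then - untested sketch, just to
make sure we're reading each other right:

	if (written > 0 && (data->flags & IOMAP_UNCACHED) &&
	    (data->flags & IOMAP_WRITE) && !(data->flags & IOMAP_DIRECT)) {
		struct address_space *mapping = data->inode->i_mapping;

		end = data->pos + written;
		ret = filemap_write_and_wait_range(mapping, data->pos, end);
		if (ret)
			goto out;

		/* only invalidate if we instantiated pages for this write */
		if (iomap.flags & IOMAP_F_PAGE_CREATE)
			invalidate_inode_pages2_range(mapping,
					data->pos >> PAGE_SHIFT,
					end >> PAGE_SHIFT);
	}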

>> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
>> index 30f40145a9e9..30bb248e1d0d 100644
>> --- a/include/linux/iomap.h
>> +++ b/include/linux/iomap.h
>> @@ -48,12 +48,16 @@ struct vm_fault;
>>   *
>>   * IOMAP_F_BUFFER_HEAD indicates that the file system requires the use of
>>   * buffer heads for this mapping.
>> + *
>> + * IOMAP_F_PAGE_CREATE indicates that pages had to be allocated to satisfy
>> + * this operation.
>>   */
>>  #define IOMAP_F_NEW		0x01
>>  #define IOMAP_F_DIRTY		0x02
>>  #define IOMAP_F_SHARED		0x04
>>  #define IOMAP_F_MERGED		0x08
>>  #define IOMAP_F_BUFFER_HEAD	0x10
>> +#define IOMAP_F_PAGE_CREATE	0x20
> 
> I think these new flags need an update to the _STRINGS arrays in
> fs/iomap/trace.h.

I'll add that.
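
Assuming the string tables in fs/iomap/trace.h still look like they do
in 5.5-rc, that should just be one new entry in each of the existing
IOMAP_FLAGS_STRINGS and IOMAP_F_FLAGS_STRINGS macros, along the lines
of:

	{ IOMAP_UNCACHED,	"UNCACHED" }

	{ IOMAP_F_PAGE_CREATE,	"PAGE_CREATE" }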

>>  /*
>>   * Flags set by the core iomap code during operations:
>> @@ -121,6 +125,7 @@ struct iomap_page_ops {
>>  #define IOMAP_FAULT		(1 << 3) /* mapping for page fault */
>>  #define IOMAP_DIRECT		(1 << 4) /* direct I/O */
>>  #define IOMAP_NOWAIT		(1 << 5) /* do not block */
>> +#define IOMAP_UNCACHED		(1 << 6)
> 
> No comment?

Definitely, I'll add a comment.
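
Probably something as simple as:

#define IOMAP_UNCACHED		(1 << 6) /* uncached read/write, drop pages after IO */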

Thanks for taking a look! I'll incorporate your suggestions.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/5] iomap: add struct iomap_data
  2019-12-12 19:01 ` [PATCH 4/5] iomap: add struct iomap_data Jens Axboe
@ 2019-12-13 20:32   ` Darrick J. Wong
  2019-12-13 20:47     ` Jens Axboe
  0 siblings, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2019-12-13 20:32 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-mm, linux-fsdevel, linux-block, willy, clm, torvalds, david

On Thu, Dec 12, 2019 at 12:01:32PM -0700, Jens Axboe wrote:
> We pass a lot of arguments to iomap_apply(), and subsequently to the
> actors that it calls. In preparation for adding one more argument,
> switch them to using a struct iomap_data instead. The actor gets a const
> version of that, they are not supposed to change anything in it.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

Looks good, only a couple of questions...

> ---
>  fs/dax.c               |  25 +++--
>  fs/iomap/apply.c       |  26 +++---
>  fs/iomap/buffered-io.c | 202 +++++++++++++++++++++++++----------------
>  fs/iomap/direct-io.c   |  57 +++++++-----
>  fs/iomap/fiemap.c      |  48 ++++++----
>  fs/iomap/seek.c        |  64 ++++++++-----
>  fs/iomap/swapfile.c    |  27 +++---
>  include/linux/iomap.h  |  15 ++-
>  8 files changed, 278 insertions(+), 186 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 1f1f0201cad1..d1c32dbbdf24 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1090,13 +1090,16 @@ int __dax_zero_page_range(struct block_device *bdev,
>  EXPORT_SYMBOL_GPL(__dax_zero_page_range);
>  
>  static loff_t
> -dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> -		struct iomap *iomap, struct iomap *srcmap)
> +dax_iomap_actor(const struct iomap_data *data, struct iomap *iomap,

I wonder, is 'struct iomap_ctx' a better name for the context structure?

> +		struct iomap *srcmap)
>  {
>  	struct block_device *bdev = iomap->bdev;
>  	struct dax_device *dax_dev = iomap->dax_dev;
> -	struct iov_iter *iter = data;
> +	struct iov_iter *iter = data->priv;
> +	loff_t pos = data->pos;
> +	loff_t length = data->len;
>  	loff_t end = pos + length, done = 0;
> +	struct inode *inode = data->inode;
>  	ssize_t ret = 0;
>  	size_t xfer;
>  	int id;
> @@ -1197,22 +1200,26 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
>  {
>  	struct address_space *mapping = iocb->ki_filp->f_mapping;
>  	struct inode *inode = mapping->host;
> -	loff_t pos = iocb->ki_pos, ret = 0, done = 0;
> -	unsigned flags = 0;
> +	loff_t ret = 0, done = 0;
> +	struct iomap_data data = {
> +		.inode	= inode,
> +		.pos	= iocb->ki_pos,
> +		.priv	= iter,
> +	};
>  
>  	if (iov_iter_rw(iter) == WRITE) {
>  		lockdep_assert_held_write(&inode->i_rwsem);
> -		flags |= IOMAP_WRITE;
> +		data.flags |= IOMAP_WRITE;
>  	} else {
>  		lockdep_assert_held(&inode->i_rwsem);
>  	}
>  
>  	while (iov_iter_count(iter)) {
> -		ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
> -				iter, dax_iomap_actor);
> +		data.len = iov_iter_count(iter);
> +		ret = iomap_apply(&data, ops, dax_iomap_actor);
>  		if (ret <= 0)
>  			break;
> -		pos += ret;
> +		data.pos += ret;
>  		done += ret;
>  	}
>  
> diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
> index 76925b40b5fd..e76148db03b8 100644
> --- a/fs/iomap/apply.c
> +++ b/fs/iomap/apply.c
> @@ -21,15 +21,16 @@
>   * iomap_end call.
>   */
>  loff_t
> -iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
> -		const struct iomap_ops *ops, void *data, iomap_actor_t actor)
> +iomap_apply(struct iomap_data *data, const struct iomap_ops *ops,
> +	    iomap_actor_t actor)
>  {
>  	struct iomap iomap = { .type = IOMAP_HOLE };
>  	struct iomap srcmap = { .type = IOMAP_HOLE };
>  	loff_t written = 0, ret;
>  	u64 end;
>  
> -	trace_iomap_apply(inode, pos, length, flags, ops, actor, _RET_IP_);
> +	trace_iomap_apply(data->inode, data->pos, data->len, data->flags, ops,
> +				actor, _RET_IP_);
>  
>  	/*
>  	 * Need to map a range from start position for length bytes. This can
> @@ -43,17 +44,18 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
>  	 * expose transient stale data. If the reserve fails, we can safely
>  	 * back out at this point as there is nothing to undo.
>  	 */
> -	ret = ops->iomap_begin(inode, pos, length, flags, &iomap, &srcmap);
> +	ret = ops->iomap_begin(data->inode, data->pos, data->len, data->flags,
> +				&iomap, &srcmap);

...and second, what do people think about passing "const struct
iomap_ctx *ctx" to iomap_begin and iomap_end to reduce the argument
counts there too?

(That's definitely a separate patch though, and I might just do that on
my own if nobody beats me to it...)

--D

>  	if (ret)
>  		return ret;
> -	if (WARN_ON(iomap.offset > pos))
> +	if (WARN_ON(iomap.offset > data->pos))
>  		return -EIO;
>  	if (WARN_ON(iomap.length == 0))
>  		return -EIO;
>  
> -	trace_iomap_apply_dstmap(inode, &iomap);
> +	trace_iomap_apply_dstmap(data->inode, &iomap);
>  	if (srcmap.type != IOMAP_HOLE)
> -		trace_iomap_apply_srcmap(inode, &srcmap);
> +		trace_iomap_apply_srcmap(data->inode, &srcmap);
>  
>  	/*
>  	 * Cut down the length to the one actually provided by the filesystem,
> @@ -62,8 +64,8 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
>  	end = iomap.offset + iomap.length;
>  	if (srcmap.type != IOMAP_HOLE)
>  		end = min(end, srcmap.offset + srcmap.length);
> -	if (pos + length > end)
> -		length = end - pos;
> +	if (data->pos + data->len > end)
> +		data->len = end - data->pos;
>  
>  	/*
>  	 * Now that we have guaranteed that the space allocation will succeed,
> @@ -77,7 +79,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
>  	 * iomap into the actors so that they don't need to have special
>  	 * handling for the two cases.
>  	 */
> -	written = actor(inode, pos, length, data, &iomap,
> +	written = actor(data, &iomap,
>  			srcmap.type != IOMAP_HOLE ? &srcmap : &iomap);
>  
>  	/*
> @@ -85,9 +87,9 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
>  	 * should not fail unless the filesystem has had a fatal error.
>  	 */
>  	if (ops->iomap_end) {
> -		ret = ops->iomap_end(inode, pos, length,
> +		ret = ops->iomap_end(data->inode, data->pos, data->len,
>  				     written > 0 ? written : 0,
> -				     flags, &iomap);
> +				     data->flags, &iomap);
>  	}
>  
>  	return written ? written : ret;
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 828444e14d09..0a1a195ed1cc 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -248,14 +248,15 @@ static inline bool iomap_block_needs_zeroing(struct inode *inode,
>  }
>  
>  static loff_t
> -iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> -		struct iomap *iomap, struct iomap *srcmap)
> +iomap_readpage_actor(const struct iomap_data *data, struct iomap *iomap,
> +		     struct iomap *srcmap)
>  {
> -	struct iomap_readpage_ctx *ctx = data;
> +	struct iomap_readpage_ctx *ctx = data->priv;
> +	struct inode *inode = data->inode;
>  	struct page *page = ctx->cur_page;
>  	struct iomap_page *iop = iomap_page_create(inode, page);
>  	bool same_page = false, is_contig = false;
> -	loff_t orig_pos = pos;
> +	loff_t pos = data->pos, orig_pos = data->pos;
>  	unsigned poff, plen;
>  	sector_t sector;
>  
> @@ -266,7 +267,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  	}
>  
>  	/* zero post-eof blocks as the page may be mapped */
> -	iomap_adjust_read_range(inode, iop, &pos, length, &poff, &plen);
> +	iomap_adjust_read_range(inode, iop, &pos, data->len, &poff, &plen);
>  	if (plen == 0)
>  		goto done;
>  
> @@ -302,7 +303,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  
>  	if (!ctx->bio || !is_contig || bio_full(ctx->bio, plen)) {
>  		gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL);
> -		int nr_vecs = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +		int nr_vecs = (data->len + PAGE_SIZE - 1) >> PAGE_SHIFT;
>  
>  		if (ctx->bio)
>  			submit_bio(ctx->bio);
> @@ -333,16 +334,20 @@ int
>  iomap_readpage(struct page *page, const struct iomap_ops *ops)
>  {
>  	struct iomap_readpage_ctx ctx = { .cur_page = page };
> -	struct inode *inode = page->mapping->host;
> +	struct iomap_data data = {
> +		.inode	= page->mapping->host,
> +		.priv	= &ctx,
> +		.flags	= 0
> +	};
>  	unsigned poff;
>  	loff_t ret;
>  
>  	trace_iomap_readpage(page->mapping->host, 1);
>  
>  	for (poff = 0; poff < PAGE_SIZE; poff += ret) {
> -		ret = iomap_apply(inode, page_offset(page) + poff,
> -				PAGE_SIZE - poff, 0, ops, &ctx,
> -				iomap_readpage_actor);
> +		data.pos = page_offset(page) + poff;
> +		data.len = PAGE_SIZE - poff;
> +		ret = iomap_apply(&data, ops, iomap_readpage_actor);
>  		if (ret <= 0) {
>  			WARN_ON_ONCE(ret == 0);
>  			SetPageError(page);
> @@ -396,28 +401,34 @@ iomap_next_page(struct inode *inode, struct list_head *pages, loff_t pos,
>  }
>  
>  static loff_t
> -iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
> -		void *data, struct iomap *iomap, struct iomap *srcmap)
> +iomap_readpages_actor(const struct iomap_data *data, struct iomap *iomap,
> +		      struct iomap *srcmap)
>  {
> -	struct iomap_readpage_ctx *ctx = data;
> +	struct iomap_readpage_ctx *ctx = data->priv;
>  	loff_t done, ret;
>  
> -	for (done = 0; done < length; done += ret) {
> -		if (ctx->cur_page && offset_in_page(pos + done) == 0) {
> +	for (done = 0; done < data->len; done += ret) {
> +		struct iomap_data rp_data = {
> +			.inode	= data->inode,
> +			.pos	= data->pos + done,
> +			.len	= data->len - done,
> +			.priv	= ctx,
> +		};
> +
> +		if (ctx->cur_page && offset_in_page(rp_data.pos) == 0) {
>  			if (!ctx->cur_page_in_bio)
>  				unlock_page(ctx->cur_page);
>  			put_page(ctx->cur_page);
>  			ctx->cur_page = NULL;
>  		}
>  		if (!ctx->cur_page) {
> -			ctx->cur_page = iomap_next_page(inode, ctx->pages,
> -					pos, length, &done);
> +			ctx->cur_page = iomap_next_page(data->inode, ctx->pages,
> +					data->pos, data->len, &done);
>  			if (!ctx->cur_page)
>  				break;
>  			ctx->cur_page_in_bio = false;
>  		}
> -		ret = iomap_readpage_actor(inode, pos + done, length - done,
> -				ctx, iomap, srcmap);
> +		ret = iomap_readpage_actor(&rp_data, iomap, srcmap);
>  	}
>  
>  	return done;
> @@ -431,21 +442,27 @@ iomap_readpages(struct address_space *mapping, struct list_head *pages,
>  		.pages		= pages,
>  		.is_readahead	= true,
>  	};
> -	loff_t pos = page_offset(list_entry(pages->prev, struct page, lru));
> +	struct iomap_data data = {
> +		.inode	= mapping->host,
> +		.priv	= &ctx,
> +		.flags	= 0
> +	};
>  	loff_t last = page_offset(list_entry(pages->next, struct page, lru));
> -	loff_t length = last - pos + PAGE_SIZE, ret = 0;
> +	loff_t ret = 0;
> +
> +	data.pos = page_offset(list_entry(pages->prev, struct page, lru));
> +	data.len = last - data.pos + PAGE_SIZE;
>  
> -	trace_iomap_readpages(mapping->host, nr_pages);
> +	trace_iomap_readpages(data.inode, nr_pages);
>  
> -	while (length > 0) {
> -		ret = iomap_apply(mapping->host, pos, length, 0, ops,
> -				&ctx, iomap_readpages_actor);
> +	while (data.len > 0) {
> +		ret = iomap_apply(&data, ops, iomap_readpages_actor);
>  		if (ret <= 0) {
>  			WARN_ON_ONCE(ret == 0);
>  			goto done;
>  		}
> -		pos += ret;
> -		length -= ret;
> +		data.pos += ret;
> +		data.len -= ret;
>  	}
>  	ret = 0;
>  done:
> @@ -796,10 +813,13 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned len, unsigned copied,
>  }
>  
>  static loff_t
> -iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> -		struct iomap *iomap, struct iomap *srcmap)
> +iomap_write_actor(const struct iomap_data *data, struct iomap *iomap,
> +		  struct iomap *srcmap)
>  {
> -	struct iov_iter *i = data;
> +	struct inode *inode = data->inode;
> +	struct iov_iter *i = data->priv;
> +	loff_t length = data->len;
> +	loff_t pos = data->pos;
>  	long status = 0;
>  	ssize_t written = 0;
>  
> @@ -879,15 +899,20 @@ ssize_t
>  iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *iter,
>  		const struct iomap_ops *ops)
>  {
> -	struct inode *inode = iocb->ki_filp->f_mapping->host;
> -	loff_t pos = iocb->ki_pos, ret = 0, written = 0;
> +	struct iomap_data data = {
> +		.inode	= iocb->ki_filp->f_mapping->host,
> +		.pos	= iocb->ki_pos,
> +		.priv	= iter,
> +		.flags	= IOMAP_WRITE
> +	};
> +	loff_t ret = 0, written = 0;
>  
>  	while (iov_iter_count(iter)) {
> -		ret = iomap_apply(inode, pos, iov_iter_count(iter),
> -				IOMAP_WRITE, ops, iter, iomap_write_actor);
> +		data.len = iov_iter_count(iter);
> +		ret = iomap_apply(&data, ops, iomap_write_actor);
>  		if (ret <= 0)
>  			break;
> -		pos += ret;
> +		data.pos += ret;
>  		written += ret;
>  	}
>  
> @@ -896,9 +921,11 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *iter,
>  EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
>  
>  static loff_t
> -iomap_unshare_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> -		struct iomap *iomap, struct iomap *srcmap)
> +iomap_unshare_actor(const struct iomap_data *data, struct iomap *iomap,
> +		    struct iomap *srcmap)
>  {
> +	loff_t pos = data->pos;
> +	loff_t length = data->len;
>  	long status = 0;
>  	ssize_t written = 0;
>  
> @@ -914,13 +941,13 @@ iomap_unshare_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		unsigned long bytes = min_t(loff_t, PAGE_SIZE - offset, length);
>  		struct page *page;
>  
> -		status = iomap_write_begin(inode, pos, bytes,
> +		status = iomap_write_begin(data->inode, pos, bytes,
>  				IOMAP_WRITE_F_UNSHARE, &page, iomap, srcmap);
>  		if (unlikely(status))
>  			return status;
>  
> -		status = iomap_write_end(inode, pos, bytes, bytes, page, iomap,
> -				srcmap);
> +		status = iomap_write_end(data->inode, pos, bytes, bytes, page,
> +						iomap, srcmap);
>  		if (unlikely(status <= 0)) {
>  			if (WARN_ON_ONCE(status == 0))
>  				return -EIO;
> @@ -933,7 +960,7 @@ iomap_unshare_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		written += status;
>  		length -= status;
>  
> -		balance_dirty_pages_ratelimited(inode->i_mapping);
> +		balance_dirty_pages_ratelimited(data->inode->i_mapping);
>  	} while (length);
>  
>  	return written;
> @@ -943,15 +970,20 @@ int
>  iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
>  		const struct iomap_ops *ops)
>  {
> +	struct iomap_data data = {
> +		.inode	= inode,
> +		.pos	= pos,
> +		.len	= len,
> +		.flags	= IOMAP_WRITE,
> +	};
>  	loff_t ret;
>  
> -	while (len) {
> -		ret = iomap_apply(inode, pos, len, IOMAP_WRITE, ops, NULL,
> -				iomap_unshare_actor);
> +	while (data.len) {
> +		ret = iomap_apply(&data, ops, iomap_unshare_actor);
>  		if (ret <= 0)
>  			return ret;
> -		pos += ret;
> -		len -= ret;
> +		data.pos += ret;
> +		data.len -= ret;
>  	}
>  
>  	return 0;
> @@ -982,16 +1014,18 @@ static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
>  }
>  
>  static loff_t
> -iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
> -		void *data, struct iomap *iomap, struct iomap *srcmap)
> +iomap_zero_range_actor(const struct iomap_data *data, struct iomap *iomap,
> +		       struct iomap *srcmap)
>  {
> -	bool *did_zero = data;
> +	bool *did_zero = data->priv;
> +	loff_t count = data->len;
> +	loff_t pos = data->pos;
>  	loff_t written = 0;
>  	int status;
>  
>  	/* already zeroed?  we're done. */
>  	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> -		return count;
> +		return data->len;
>  
>  	do {
>  		unsigned offset, bytes;
> @@ -999,11 +1033,11 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
>  		offset = offset_in_page(pos);
>  		bytes = min_t(loff_t, PAGE_SIZE - offset, count);
>  
> -		if (IS_DAX(inode))
> +		if (IS_DAX(data->inode))
>  			status = iomap_dax_zero(pos, offset, bytes, iomap);
>  		else
> -			status = iomap_zero(inode, pos, offset, bytes, iomap,
> -					srcmap);
> +			status = iomap_zero(data->inode, pos, offset, bytes,
> +						iomap, srcmap);
>  		if (status < 0)
>  			return status;
>  
> @@ -1021,16 +1055,22 @@ int
>  iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
>  		const struct iomap_ops *ops)
>  {
> +	struct iomap_data data = {
> +		.inode	= inode,
> +		.pos	= pos,
> +		.len	= len,
> +		.priv	= did_zero,
> +		.flags	= IOMAP_ZERO
> +	};
>  	loff_t ret;
>  
> -	while (len > 0) {
> -		ret = iomap_apply(inode, pos, len, IOMAP_ZERO,
> -				ops, did_zero, iomap_zero_range_actor);
> +	while (data.len > 0) {
> +		ret = iomap_apply(&data, ops, iomap_zero_range_actor);
>  		if (ret <= 0)
>  			return ret;
>  
> -		pos += ret;
> -		len -= ret;
> +		data.pos += ret;
> +		data.len -= ret;
>  	}
>  
>  	return 0;
> @@ -1052,57 +1092,59 @@ iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
>  EXPORT_SYMBOL_GPL(iomap_truncate_page);
>  
>  static loff_t
> -iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
> -		void *data, struct iomap *iomap, struct iomap *srcmap)
> +iomap_page_mkwrite_actor(const struct iomap_data *data,
> +			 struct iomap *iomap, struct iomap *srcmap)
>  {
> -	struct page *page = data;
> +	struct page *page = data->priv;
>  	int ret;
>  
>  	if (iomap->flags & IOMAP_F_BUFFER_HEAD) {
> -		ret = __block_write_begin_int(page, pos, length, NULL, iomap);
> +		ret = __block_write_begin_int(page, data->pos, data->len, NULL,
> +						iomap);
>  		if (ret)
>  			return ret;
> -		block_commit_write(page, 0, length);
> +		block_commit_write(page, 0, data->len);
>  	} else {
>  		WARN_ON_ONCE(!PageUptodate(page));
> -		iomap_page_create(inode, page);
> +		iomap_page_create(data->inode, page);
>  		set_page_dirty(page);
>  	}
>  
> -	return length;
> +	return data->len;
>  }
>  
>  vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops)
>  {
>  	struct page *page = vmf->page;
> -	struct inode *inode = file_inode(vmf->vma->vm_file);
> -	unsigned long length;
> -	loff_t offset, size;
> +	struct iomap_data data = {
> +		.inode	= file_inode(vmf->vma->vm_file),
> +		.pos	= page_offset(page),
> +		.flags	= IOMAP_WRITE | IOMAP_FAULT,
> +		.priv	= page,
> +	};
>  	ssize_t ret;
> +	loff_t size;
>  
>  	lock_page(page);
> -	size = i_size_read(inode);
> -	offset = page_offset(page);
> -	if (page->mapping != inode->i_mapping || offset > size) {
> +	size = i_size_read(data.inode);
> +	if (page->mapping != data.inode->i_mapping || data.pos > size) {
>  		/* We overload EFAULT to mean page got truncated */
>  		ret = -EFAULT;
>  		goto out_unlock;
>  	}
>  
>  	/* page is wholly or partially inside EOF */
> -	if (offset > size - PAGE_SIZE)
> -		length = offset_in_page(size);
> +	if (data.pos > size - PAGE_SIZE)
> +		data.len = offset_in_page(size);
>  	else
> -		length = PAGE_SIZE;
> +		data.len = PAGE_SIZE;
>  
> -	while (length > 0) {
> -		ret = iomap_apply(inode, offset, length,
> -				IOMAP_WRITE | IOMAP_FAULT, ops, page,
> -				iomap_page_mkwrite_actor);
> +	while (data.len > 0) {
> +		ret = iomap_apply(&data, ops, iomap_page_mkwrite_actor);
>  		if (unlikely(ret <= 0))
>  			goto out_unlock;
> -		offset += ret;
> -		length -= ret;
> +		data.pos += ret;
> +		data.len -= ret;
>  	}
>  
>  	wait_for_stable_page(page);
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 23837926c0c5..e561ca9329ac 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -364,24 +364,27 @@ iomap_dio_inline_actor(struct inode *inode, loff_t pos, loff_t length,
>  }
>  
>  static loff_t
> -iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
> -		void *data, struct iomap *iomap, struct iomap *srcmap)
> +iomap_dio_actor(const struct iomap_data *data, struct iomap *iomap,
> +		struct iomap *srcmap)
>  {
> -	struct iomap_dio *dio = data;
> +	struct iomap_dio *dio = data->priv;
>  
>  	switch (iomap->type) {
>  	case IOMAP_HOLE:
>  		if (WARN_ON_ONCE(dio->flags & IOMAP_DIO_WRITE))
>  			return -EIO;
> -		return iomap_dio_hole_actor(length, dio);
> +		return iomap_dio_hole_actor(data->len, dio);
>  	case IOMAP_UNWRITTEN:
>  		if (!(dio->flags & IOMAP_DIO_WRITE))
> -			return iomap_dio_hole_actor(length, dio);
> -		return iomap_dio_bio_actor(inode, pos, length, dio, iomap);
> +			return iomap_dio_hole_actor(data->len, dio);
> +		return iomap_dio_bio_actor(data->inode, data->pos, data->len,
> +						dio, iomap);
>  	case IOMAP_MAPPED:
> -		return iomap_dio_bio_actor(inode, pos, length, dio, iomap);
> +		return iomap_dio_bio_actor(data->inode, data->pos, data->len,
> +						dio, iomap);
>  	case IOMAP_INLINE:
> -		return iomap_dio_inline_actor(inode, pos, length, dio, iomap);
> +		return iomap_dio_inline_actor(data->inode, data->pos, data->len,
> +						dio, iomap);
>  	default:
>  		WARN_ON_ONCE(1);
>  		return -EIO;
> @@ -404,16 +407,19 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  {
>  	struct address_space *mapping = iocb->ki_filp->f_mapping;
>  	struct inode *inode = file_inode(iocb->ki_filp);
> -	size_t count = iov_iter_count(iter);
> -	loff_t pos = iocb->ki_pos;
> -	loff_t end = iocb->ki_pos + count - 1, ret = 0;
> -	unsigned int flags = IOMAP_DIRECT;
> +	struct iomap_data data = {
> +		.inode	= inode,
> +		.pos	= iocb->ki_pos,
> +		.len	= iov_iter_count(iter),
> +		.flags	= IOMAP_DIRECT
> +	};
> +	loff_t end = data.pos + data.len - 1, ret = 0;
>  	struct blk_plug plug;
>  	struct iomap_dio *dio;
>  
>  	lockdep_assert_held(&inode->i_rwsem);
>  
> -	if (!count)
> +	if (!data.len)
>  		return 0;
>  
>  	if (WARN_ON(is_sync_kiocb(iocb) && !wait_for_completion))
> @@ -436,14 +442,16 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	dio->submit.cookie = BLK_QC_T_NONE;
>  	dio->submit.last_queue = NULL;
>  
> +	data.priv = dio;
> +
>  	if (iov_iter_rw(iter) == READ) {
> -		if (pos >= dio->i_size)
> +		if (data.pos >= dio->i_size)
>  			goto out_free_dio;
>  
>  		if (iter_is_iovec(iter))
>  			dio->flags |= IOMAP_DIO_DIRTY;
>  	} else {
> -		flags |= IOMAP_WRITE;
> +		data.flags |= IOMAP_WRITE;
>  		dio->flags |= IOMAP_DIO_WRITE;
>  
>  		/* for data sync or sync, we need sync completion processing */
> @@ -461,14 +469,14 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	}
>  
>  	if (iocb->ki_flags & IOCB_NOWAIT) {
> -		if (filemap_range_has_page(mapping, pos, end)) {
> +		if (filemap_range_has_page(mapping, data.pos, end)) {
>  			ret = -EAGAIN;
>  			goto out_free_dio;
>  		}
> -		flags |= IOMAP_NOWAIT;
> +		data.flags |= IOMAP_NOWAIT;
>  	}
>  
> -	ret = filemap_write_and_wait_range(mapping, pos, end);
> +	ret = filemap_write_and_wait_range(mapping, data.pos, end);
>  	if (ret)
>  		goto out_free_dio;
>  
> @@ -479,7 +487,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	 * pretty crazy thing to do, so we don't support it 100%.
>  	 */
>  	ret = invalidate_inode_pages2_range(mapping,
> -			pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> +			data.pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
>  	if (ret)
>  		dio_warn_stale_pagecache(iocb->ki_filp);
>  	ret = 0;
> @@ -495,8 +503,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  
>  	blk_start_plug(&plug);
>  	do {
> -		ret = iomap_apply(inode, pos, count, flags, ops, dio,
> -				iomap_dio_actor);
> +		ret = iomap_apply(&data, ops, iomap_dio_actor);
>  		if (ret <= 0) {
>  			/* magic error code to fall back to buffered I/O */
>  			if (ret == -ENOTBLK) {
> @@ -505,18 +512,18 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  			}
>  			break;
>  		}
> -		pos += ret;
> +		data.pos += ret;
>  
> -		if (iov_iter_rw(iter) == READ && pos >= dio->i_size) {
> +		if (iov_iter_rw(iter) == READ && data.pos >= dio->i_size) {
>  			/*
>  			 * We only report that we've read data up to i_size.
>  			 * Revert iter to a state corresponding to that as
>  			 * some callers (such as splice code) rely on it.
>  			 */
> -			iov_iter_revert(iter, pos - dio->i_size);
> +			iov_iter_revert(iter, data.pos - dio->i_size);
>  			break;
>  		}
> -	} while ((count = iov_iter_count(iter)) > 0);
> +	} while ((data.len = iov_iter_count(iter)) > 0);
>  	blk_finish_plug(&plug);
>  
>  	if (ret < 0)
> diff --git a/fs/iomap/fiemap.c b/fs/iomap/fiemap.c
> index bccf305ea9ce..4075fbe0e3f5 100644
> --- a/fs/iomap/fiemap.c
> +++ b/fs/iomap/fiemap.c
> @@ -43,20 +43,20 @@ static int iomap_to_fiemap(struct fiemap_extent_info *fi,
>  }
>  
>  static loff_t
> -iomap_fiemap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> -		struct iomap *iomap, struct iomap *srcmap)
> +iomap_fiemap_actor(const struct iomap_data *data, struct iomap *iomap,
> +		   struct iomap *srcmap)
>  {
> -	struct fiemap_ctx *ctx = data;
> -	loff_t ret = length;
> +	struct fiemap_ctx *ctx = data->priv;
> +	loff_t ret = data->len;
>  
>  	if (iomap->type == IOMAP_HOLE)
> -		return length;
> +		return data->len;
>  
>  	ret = iomap_to_fiemap(ctx->fi, &ctx->prev, 0);
>  	ctx->prev = *iomap;
>  	switch (ret) {
>  	case 0:		/* success */
> -		return length;
> +		return data->len;
>  	case 1:		/* extent array full */
>  		return 0;
>  	default:
> @@ -68,6 +68,13 @@ int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fi,
>  		loff_t start, loff_t len, const struct iomap_ops *ops)
>  {
>  	struct fiemap_ctx ctx;
> +	struct iomap_data data = {
> +		.inode	= inode,
> +		.pos	= start,
> +		.len	= len,
> +		.flags	= IOMAP_REPORT,
> +		.priv	= &ctx
> +	};
>  	loff_t ret;
>  
>  	memset(&ctx, 0, sizeof(ctx));
> @@ -84,9 +91,8 @@ int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fi,
>  			return ret;
>  	}
>  
> -	while (len > 0) {
> -		ret = iomap_apply(inode, start, len, IOMAP_REPORT, ops, &ctx,
> -				iomap_fiemap_actor);
> +	while (data.len > 0) {
> +		ret = iomap_apply(&data, ops, iomap_fiemap_actor);
>  		/* inode with no (attribute) mapping will give ENOENT */
>  		if (ret == -ENOENT)
>  			break;
> @@ -95,8 +101,8 @@ int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fi,
>  		if (ret == 0)
>  			break;
>  
> -		start += ret;
> -		len -= ret;
> +		data.pos += ret;
> +		data.len -= ret;
>  	}
>  
>  	if (ctx.prev.type != IOMAP_HOLE) {
> @@ -110,13 +116,14 @@ int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fi,
>  EXPORT_SYMBOL_GPL(iomap_fiemap);
>  
>  static loff_t
> -iomap_bmap_actor(struct inode *inode, loff_t pos, loff_t length,
> -		void *data, struct iomap *iomap, struct iomap *srcmap)
> +iomap_bmap_actor(const struct iomap_data *data, struct iomap *iomap,
> +		 struct iomap *srcmap)
>  {
> -	sector_t *bno = data, addr;
> +	sector_t *bno = data->priv, addr;
>  
>  	if (iomap->type == IOMAP_MAPPED) {
> -		addr = (pos - iomap->offset + iomap->addr) >> inode->i_blkbits;
> +		addr = (data->pos - iomap->offset + iomap->addr) >>
> +				data->inode->i_blkbits;
>  		if (addr > INT_MAX)
>  			WARN(1, "would truncate bmap result\n");
>  		else
> @@ -131,16 +138,19 @@ iomap_bmap(struct address_space *mapping, sector_t bno,
>  		const struct iomap_ops *ops)
>  {
>  	struct inode *inode = mapping->host;
> -	loff_t pos = bno << inode->i_blkbits;
> -	unsigned blocksize = i_blocksize(inode);
> +	struct iomap_data data = {
> +		.inode	= inode,
> +		.pos	= bno << inode->i_blkbits,
> +		.len	= i_blocksize(inode),
> +		.priv	= &bno
> +	};
>  	int ret;
>  
>  	if (filemap_write_and_wait(mapping))
>  		return 0;
>  
>  	bno = 0;
> -	ret = iomap_apply(inode, pos, blocksize, 0, ops, &bno,
> -			  iomap_bmap_actor);
> +	ret = iomap_apply(&data, ops, iomap_bmap_actor);
>  	if (ret)
>  		return 0;
>  	return bno;
> diff --git a/fs/iomap/seek.c b/fs/iomap/seek.c
> index 89f61d93c0bc..288bee0b5d9b 100644
> --- a/fs/iomap/seek.c
> +++ b/fs/iomap/seek.c
> @@ -118,21 +118,23 @@ page_cache_seek_hole_data(struct inode *inode, loff_t offset, loff_t length,
>  
>  
>  static loff_t
> -iomap_seek_hole_actor(struct inode *inode, loff_t offset, loff_t length,
> -		      void *data, struct iomap *iomap, struct iomap *srcmap)
> +iomap_seek_hole_actor(const struct iomap_data *data, struct iomap *iomap,
> +		      struct iomap *srcmap)
>  {
> +	loff_t offset = data->pos;
> +
>  	switch (iomap->type) {
>  	case IOMAP_UNWRITTEN:
> -		offset = page_cache_seek_hole_data(inode, offset, length,
> -						   SEEK_HOLE);
> +		offset = page_cache_seek_hole_data(data->inode, offset,
> +						   data->len, SEEK_HOLE);
>  		if (offset < 0)
> -			return length;
> +			return data->len;
>  		/* fall through */
>  	case IOMAP_HOLE:
> -		*(loff_t *)data = offset;
> +		*(loff_t *)data->priv = offset;
>  		return 0;
>  	default:
> -		return length;
> +		return data->len;
>  	}
>  }
>  
> @@ -140,23 +142,28 @@ loff_t
>  iomap_seek_hole(struct inode *inode, loff_t offset, const struct iomap_ops *ops)
>  {
>  	loff_t size = i_size_read(inode);
> -	loff_t length = size - offset;
> +	struct iomap_data data = {
> +		.inode	= inode,
> +		.len	= size - offset,
> +		.priv	= &offset,
> +		.flags	= IOMAP_REPORT
> +	};
>  	loff_t ret;
>  
>  	/* Nothing to be found before or beyond the end of the file. */
>  	if (offset < 0 || offset >= size)
>  		return -ENXIO;
>  
> -	while (length > 0) {
> -		ret = iomap_apply(inode, offset, length, IOMAP_REPORT, ops,
> -				  &offset, iomap_seek_hole_actor);
> +	while (data.len > 0) {
> +		data.pos = offset;
> +		ret = iomap_apply(&data, ops, iomap_seek_hole_actor);
>  		if (ret < 0)
>  			return ret;
>  		if (ret == 0)
>  			break;
>  
>  		offset += ret;
> -		length -= ret;
> +		data.len -= ret;
>  	}
>  
>  	return offset;
> @@ -164,20 +171,22 @@ iomap_seek_hole(struct inode *inode, loff_t offset, const struct iomap_ops *ops)
>  EXPORT_SYMBOL_GPL(iomap_seek_hole);
>  
>  static loff_t
> -iomap_seek_data_actor(struct inode *inode, loff_t offset, loff_t length,
> -		      void *data, struct iomap *iomap, struct iomap *srcmap)
> +iomap_seek_data_actor(const struct iomap_data *data, struct iomap *iomap,
> +		      struct iomap *srcmap)
>  {
> +	loff_t offset = data->pos;
> +
>  	switch (iomap->type) {
>  	case IOMAP_HOLE:
> -		return length;
> +		return data->len;
>  	case IOMAP_UNWRITTEN:
> -		offset = page_cache_seek_hole_data(inode, offset, length,
> -						   SEEK_DATA);
> +		offset = page_cache_seek_hole_data(data->inode, offset,
> +						   data->len, SEEK_DATA);
>  		if (offset < 0)
> -			return length;
> +			return data->len;
>  		/*FALLTHRU*/
>  	default:
> -		*(loff_t *)data = offset;
> +		*(loff_t *)data->priv = offset;
>  		return 0;
>  	}
>  }
> @@ -186,26 +195,31 @@ loff_t
>  iomap_seek_data(struct inode *inode, loff_t offset, const struct iomap_ops *ops)
>  {
>  	loff_t size = i_size_read(inode);
> -	loff_t length = size - offset;
> +	struct iomap_data data = {
> +		.inode  = inode,
> +		.len    = size - offset,
> +		.priv   = &offset,
> +		.flags  = IOMAP_REPORT
> +	};
>  	loff_t ret;
>  
>  	/* Nothing to be found before or beyond the end of the file. */
>  	if (offset < 0 || offset >= size)
>  		return -ENXIO;
>  
> -	while (length > 0) {
> -		ret = iomap_apply(inode, offset, length, IOMAP_REPORT, ops,
> -				  &offset, iomap_seek_data_actor);
> +	while (data.len > 0) {
> +		data.pos = offset;
> +		ret = iomap_apply(&data, ops, iomap_seek_data_actor);
>  		if (ret < 0)
>  			return ret;
>  		if (ret == 0)
>  			break;
>  
>  		offset += ret;
> -		length -= ret;
> +		data.len -= ret;
>  	}
>  
> -	if (length <= 0)
> +	if (data.len <= 0)
>  		return -ENXIO;
>  	return offset;
>  }
> diff --git a/fs/iomap/swapfile.c b/fs/iomap/swapfile.c
> index a648dbf6991e..d911ab4b69ea 100644
> --- a/fs/iomap/swapfile.c
> +++ b/fs/iomap/swapfile.c
> @@ -75,11 +75,10 @@ static int iomap_swapfile_add_extent(struct iomap_swapfile_info *isi)
>   * swap only cares about contiguous page-aligned physical extents and makes no
>   * distinction between written and unwritten extents.
>   */
> -static loff_t iomap_swapfile_activate_actor(struct inode *inode, loff_t pos,
> -		loff_t count, void *data, struct iomap *iomap,
> -		struct iomap *srcmap)
> +static loff_t iomap_swapfile_activate_actor(const struct iomap_data *data,
> +		struct iomap *iomap, struct iomap *srcmap)
>  {
> -	struct iomap_swapfile_info *isi = data;
> +	struct iomap_swapfile_info *isi = data->priv;
>  	int error;
>  
>  	switch (iomap->type) {
> @@ -125,7 +124,7 @@ static loff_t iomap_swapfile_activate_actor(struct inode *inode, loff_t pos,
>  			return error;
>  		memcpy(&isi->iomap, iomap, sizeof(isi->iomap));
>  	}
> -	return count;
> +	return data->len;
>  }
>  
>  /*
> @@ -142,8 +141,13 @@ int iomap_swapfile_activate(struct swap_info_struct *sis,
>  	};
>  	struct address_space *mapping = swap_file->f_mapping;
>  	struct inode *inode = mapping->host;
> -	loff_t pos = 0;
> -	loff_t len = ALIGN_DOWN(i_size_read(inode), PAGE_SIZE);
> +	struct iomap_data data = {
> +		.inode	= inode,
> +		.pos	= 0,
> +		.len 	= ALIGN_DOWN(i_size_read(inode), PAGE_SIZE),
> +		.priv	= &isi,
> +		.flags	= IOMAP_REPORT
> +	};
>  	loff_t ret;
>  
>  	/*
> @@ -154,14 +158,13 @@ int iomap_swapfile_activate(struct swap_info_struct *sis,
>  	if (ret)
>  		return ret;
>  
> -	while (len > 0) {
> -		ret = iomap_apply(inode, pos, len, IOMAP_REPORT,
> -				ops, &isi, iomap_swapfile_activate_actor);
> +	while (data.len > 0) {
> +		ret = iomap_apply(&data, ops, iomap_swapfile_activate_actor);
>  		if (ret <= 0)
>  			return ret;
>  
> -		pos += ret;
> -		len -= ret;
> +		data.pos += ret;
> +		data.len -= ret;
>  	}
>  
>  	if (isi.iomap.length) {
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 8b09463dae0d..30f40145a9e9 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -145,11 +145,18 @@ struct iomap_ops {
>  /*
>   * Main iomap iterator function.
>   */
> -typedef loff_t (*iomap_actor_t)(struct inode *inode, loff_t pos, loff_t len,
> -		void *data, struct iomap *iomap, struct iomap *srcmap);
> +struct iomap_data {
> +	struct inode	*inode;
> +	loff_t		pos;
> +	loff_t		len;
> +	void		*priv;
> +	unsigned	flags;
> +};
> +
> +typedef loff_t (*iomap_actor_t)(const struct iomap_data *data,
> +		struct iomap *iomap, struct iomap *srcmap);
>  
> -loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
> -		unsigned flags, const struct iomap_ops *ops, void *data,
> +loff_t iomap_apply(struct iomap_data *data, const struct iomap_ops *ops,
>  		iomap_actor_t actor);
>  
>  ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
> -- 
> 2.24.1
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/5] iomap: add struct iomap_data
  2019-12-13 20:32   ` Darrick J. Wong
@ 2019-12-13 20:47     ` Jens Axboe
  0 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2019-12-13 20:47 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-mm, linux-fsdevel, linux-block, willy, clm, torvalds, david

On 12/13/19 1:32 PM, Darrick J. Wong wrote:
> On Thu, Dec 12, 2019 at 12:01:32PM -0700, Jens Axboe wrote:
>> We pass a lot of arguments to iomap_apply(), and subsequently to the
>> actors that it calls. In preparation for adding one more argument,
>> switch them to using a struct iomap_data instead. The actor gets a const
>> version of that, they are not supposed to change anything in it.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> Looks good, only a couple of questions...

Thanks!

>>  fs/dax.c               |  25 +++--
>>  fs/iomap/apply.c       |  26 +++---
>>  fs/iomap/buffered-io.c | 202 +++++++++++++++++++++++++----------------
>>  fs/iomap/direct-io.c   |  57 +++++++-----
>>  fs/iomap/fiemap.c      |  48 ++++++----
>>  fs/iomap/seek.c        |  64 ++++++++-----
>>  fs/iomap/swapfile.c    |  27 +++---
>>  include/linux/iomap.h  |  15 ++-
>>  8 files changed, 278 insertions(+), 186 deletions(-)
>>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index 1f1f0201cad1..d1c32dbbdf24 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -1090,13 +1090,16 @@ int __dax_zero_page_range(struct block_device *bdev,
>>  EXPORT_SYMBOL_GPL(__dax_zero_page_range);
>>  
>>  static loff_t
>> -dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>> -		struct iomap *iomap, struct iomap *srcmap)
>> +dax_iomap_actor(const struct iomap_data *data, struct iomap *iomap,
> 
> I wonder, is 'struct iomap_ctx' a better name for the context structure?

Yeah, I think you are right - that does seem like a better fit. I'll rename it
for the next version.
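
i.e. same fields, just the new name:

struct iomap_ctx {
	struct inode	*inode;
	loff_t		pos;
	loff_t		len;
	void		*priv;
	unsigned	flags;
};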

>> @@ -43,17 +44,18 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
>>  	 * expose transient stale data. If the reserve fails, we can safely
>>  	 * back out at this point as there is nothing to undo.
>>  	 */
>> -	ret = ops->iomap_begin(inode, pos, length, flags, &iomap, &srcmap);
>> +	ret = ops->iomap_begin(data->inode, data->pos, data->len, data->flags,
>> +				&iomap, &srcmap);
> 
> ...and second, what do people think about passing "const struct
> iomap_ctx *ctx" to iomap_begin and iomap_end to reduce the argument
> counts there too?
> 
> (That's definitely a separate patch though, and I might just do that on
> my own if nobody beats me to it...)

To be honest, I just did what I needed, but I do think it's worth
pursuing in general. The argument clutter is real.
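
For the ops side, I'd imagine it ends up looking something like this
(sketch only, using the iomap_ctx name from above):

	struct iomap_ops {
		int (*iomap_begin)(const struct iomap_ctx *ctx,
				   struct iomap *iomap, struct iomap *srcmap);
		int (*iomap_end)(const struct iomap_ctx *ctx, ssize_t written,
				 struct iomap *iomap);
	};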

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes
  2019-12-18  0:49             ` Dave Chinner
@ 2019-12-18  1:01               ` Jens Axboe
  0 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2019-12-18  1:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-mm, linux-fsdevel, linux-block, willy, clm, torvalds

On 12/17/19 5:49 PM, Dave Chinner wrote:
> On Tue, Dec 17, 2019 at 07:31:51AM -0700, Jens Axboe wrote:
>> On 12/15/19 9:17 PM, Dave Chinner wrote:
>>> On Thu, Dec 12, 2019 at 05:57:57PM -0700, Jens Axboe wrote:
>>>> On 12/12/19 5:54 PM, Jens Axboe wrote:
>>>>> On 12/12/19 3:34 PM, Dave Chinner wrote:
>>>>>> Just a thought on further optimisation for this for XFS.
>>>>>> IOMAP_UNCACHED is being passed into the filesystem ->iomap_begin
>>>>>> methods by iomap_apply().  Hence the filesystems know that it is
>>>>>> an uncached IO that is being done, and we can tailor allocation
>>>>>> strategies to suit the fact that the data is going to be written
>>>>>> immediately.
>>>>>>
>>>>>> In this case, XFS needs to treat it the same way it treats direct
>>>>>> IO. That is, we do immediate unwritten extent allocation rather than
>>>>>> delayed allocation. This will reduce the allocation overhead and
>>>>>> will optimise for immediate IO locality rather than optimise for
>>>>>> delayed allocation.
>>>>>>
>>>>>> This should just be a relatively simple change to
>>>>>> xfs_file_iomap_begin() along the lines of:
>>>>>>
>>>>>> -	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && !(flags & IOMAP_DIRECT) &&
>>>>>> -			!IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
>>>>>> +	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) &&
>>>>>> +	    !(flags & (IOMAP_DIRECT | IOMAP_UNCACHED)) &&
>>>>>> +	    !IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
>>>>>> 		/* Reserve delalloc blocks for regular writeback. */
>>>>>> 		return xfs_file_iomap_begin_delay(inode, offset, length, flags,
>>>>>> 				iomap);
>>>>>> 	}
>>>>>>
>>>>>> so that it avoids delayed allocation for uncached IO...
>>>>>
>>>>> That's very handy! Thanks, I'll add that to the next version. Just out
>>>>> of curiosity, would you prefer this as a separate patch, or just bundle
>>>>> it with the iomap buffered RWF_UNCACHED patch? I'm assuming the latter,
>>>>> and I'll just mention it in the changelog.
>>>>
>>>> OK, since it's in XFS, it'd be a separate patch.
>>>
>>> *nod*
>>>
>>>> The code you quote seems
>>>> to be something out-of-tree?
>>>
>>> Ah, I quoted the code in the 5.4 release branch, not the 5.5-rc1
>>> tree. I'd forgotten that the xfs_file_iomap_begin() got massively
>>> refactored in the 5.5 merge and I hadn't updated my cscope trees. SO
>>> I'm guessing you want to go looking for the
>>> xfs_buffered_write_iomap_begin() and add another case to this
>>> initial branch:
>>>
>>>         /* we can't use delayed allocations when using extent size hints */
>>>         if (xfs_get_extsz_hint(ip))
>>>                 return xfs_direct_write_iomap_begin(inode, offset, count,
>>>                                 flags, iomap, srcmap);
>>>
>>> To make the buffered write IO go down the direct IO allocation path...
>>
>> Makes it even simpler! Something like this:
>>
>>
>> commit 1783722cd4b7088a3c004462c7ae610b8e42b720
>> Author: Jens Axboe <axboe@kernel.dk>
>> Date:   Tue Dec 17 07:30:04 2019 -0700
>>
>>     xfs: don't do delayed allocations for uncached buffered writes
>>     
>>     This data is going to be written immediately, so don't bother trying
>>     to do delayed allocation for it.
>>     
>>     Suggested-by: Dave Chinner <david@fromorbit.com>
>>     Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 28e2d1f37267..d0cd4a05d59f 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -847,8 +847,11 @@ xfs_buffered_write_iomap_begin(
>>  	int			allocfork = XFS_DATA_FORK;
>>  	int			error = 0;
>>  
>> -	/* we can't use delayed allocations when using extent size hints */
>> -	if (xfs_get_extsz_hint(ip))
>> +	/*
>> +	 * Don't do delayed allocations when using extent size hints, or
>> +	 * if we were asked to do uncached buffered writes.
>> +	 */
>> +	if (xfs_get_extsz_hint(ip) || (flags & IOMAP_UNCACHED))
>>  		return xfs_direct_write_iomap_begin(inode, offset, count,
>>  				flags, iomap, srcmap);
>>  
> 
> Yup, that's pretty much what I was thinking. :)

Perfect, thanks for checking!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes
  2019-12-17 14:31           ` Jens Axboe
@ 2019-12-18  0:49             ` Dave Chinner
  2019-12-18  1:01               ` Jens Axboe
  0 siblings, 1 reply; 27+ messages in thread
From: Dave Chinner @ 2019-12-18  0:49 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-mm, linux-fsdevel, linux-block, willy, clm, torvalds

On Tue, Dec 17, 2019 at 07:31:51AM -0700, Jens Axboe wrote:
> On 12/15/19 9:17 PM, Dave Chinner wrote:
> > On Thu, Dec 12, 2019 at 05:57:57PM -0700, Jens Axboe wrote:
> >> On 12/12/19 5:54 PM, Jens Axboe wrote:
> >>> On 12/12/19 3:34 PM, Dave Chinner wrote:
> >>>> Just a thought on further optimisation for this for XFS.
> >>>> IOMAP_UNCACHED is being passed into the filesystem ->iomap_begin
> >>>> methods by iomap_apply().  Hence the filesystems know that it is
> >>>> an uncached IO that is being done, and we can tailor allocation
> >>>> strategies to suit the fact that the data is going to be written
> >>>> immediately.
> >>>>
> >>>> In this case, XFS needs to treat it the same way it treats direct
> >>>> IO. That is, we do immediate unwritten extent allocation rather than
> >>>> delayed allocation. This will reduce the allocation overhead and
> >>>> will optimise for immediate IO locality rather than optimise for
> >>>> delayed allocation.
> >>>>
> >>>> This should just be a relatively simple change to
> >>>> xfs_file_iomap_begin() along the lines of:
> >>>>
> >>>> -	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && !(flags & IOMAP_DIRECT) &&
> >>>> -			!IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
> >>>> +	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) &&
> >>>> +	    !(flags & (IOMAP_DIRECT | IOMAP_UNCACHED)) &&
> >>>> +	    !IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
> >>>> 		/* Reserve delalloc blocks for regular writeback. */
> >>>> 		return xfs_file_iomap_begin_delay(inode, offset, length, flags,
> >>>> 				iomap);
> >>>> 	}
> >>>>
> >>>> so that it avoids delayed allocation for uncached IO...
> >>>
> >>> That's very handy! Thanks, I'll add that to the next version. Just out
> >>> of curiosity, would you prefer this as a separate patch, or just bundle
> >>> it with the iomap buffered RWF_UNCACHED patch? I'm assuming the latter,
> >>> and I'll just mention it in the changelog.
> >>
> >> OK, since it's in XFS, it'd be a separate patch.
> > 
> > *nod*
> > 
> >> The code you quote seems
> >> to be something out-of-tree?
> > 
> > Ah, I quoted the code in the 5.4 release branch, not the 5.5-rc1
> > tree. I'd forgotten that the xfs_file_iomap_begin() got massively
> > refactored in the 5.5 merge and I hadn't updated my cscope trees. So
> > I'm guessing you want to go looking for the
> > xfs_buffered_write_iomap_begin() and add another case to this
> > initial branch:
> > 
> >         /* we can't use delayed allocations when using extent size hints */
> >         if (xfs_get_extsz_hint(ip))
> >                 return xfs_direct_write_iomap_begin(inode, offset, count,
> >                                 flags, iomap, srcmap);
> > 
> > To make the buffered write IO go down the direct IO allocation path...
> 
> Makes it even simpler! Something like this:
> 
> 
> commit 1783722cd4b7088a3c004462c7ae610b8e42b720
> Author: Jens Axboe <axboe@kernel.dk>
> Date:   Tue Dec 17 07:30:04 2019 -0700
> 
>     xfs: don't do delayed allocations for uncached buffered writes
>     
>     This data is going to be written immediately, so don't bother trying
>     to do delayed allocation for it.
>     
>     Suggested-by: Dave Chinner <david@fromorbit.com>
>     Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 28e2d1f37267..d0cd4a05d59f 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -847,8 +847,11 @@ xfs_buffered_write_iomap_begin(
>  	int			allocfork = XFS_DATA_FORK;
>  	int			error = 0;
>  
> -	/* we can't use delayed allocations when using extent size hints */
> -	if (xfs_get_extsz_hint(ip))
> +	/*
> +	 * Don't do delayed allocations when using extent size hints, or
> +	 * if we were asked to do uncached buffered writes.
> +	 */
> +	if (xfs_get_extsz_hint(ip) || (flags & IOMAP_UNCACHED))
>  		return xfs_direct_write_iomap_begin(inode, offset, count,
>  				flags, iomap, srcmap);
>  

Yup, that's pretty much what I was thinking. :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes
  2019-12-16  4:17         ` Dave Chinner
@ 2019-12-17 14:31           ` Jens Axboe
  2019-12-18  0:49             ` Dave Chinner
  0 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2019-12-17 14:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-mm, linux-fsdevel, linux-block, willy, clm, torvalds

On 12/15/19 9:17 PM, Dave Chinner wrote:
> On Thu, Dec 12, 2019 at 05:57:57PM -0700, Jens Axboe wrote:
>> On 12/12/19 5:54 PM, Jens Axboe wrote:
>>> On 12/12/19 3:34 PM, Dave Chinner wrote:
>>>> On Wed, Dec 11, 2019 at 08:29:43AM -0700, Jens Axboe wrote:
>>>>> This adds support for RWF_UNCACHED for file systems using iomap to
>>>>> perform buffered writes. We use the generic infrastructure for this,
>>>>> by tracking pages we created and calling write_drop_cached_pages()
>>>>> to issue writeback and prune those pages.
>>>>>
>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>> ---
>>>>>  fs/iomap/apply.c       | 24 ++++++++++++++++++++++++
>>>>>  fs/iomap/buffered-io.c | 37 +++++++++++++++++++++++++++++--------
>>>>>  include/linux/iomap.h  |  5 +++++
>>>>>  3 files changed, 58 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
>>>>> index 562536da8a13..966826ad4bb9 100644
>>>>> --- a/fs/iomap/apply.c
>>>>> +++ b/fs/iomap/apply.c
>>>>> @@ -90,5 +90,29 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
>>>>>  				     flags, &iomap);
>>>>>  	}
>>>>>  
>>>>> +	if (written && (flags & IOMAP_UNCACHED)) {
>>>>> +		struct address_space *mapping = inode->i_mapping;
>>>>> +
>>>>> +		end = pos + written;
>>>>> +		ret = filemap_write_and_wait_range(mapping, pos, end);
>>>>> +		if (ret)
>>>>> +			goto out;
>>>>> +
>>>>> +		/*
>>>>> +		 * No pages were created for this range, we're done
>>>>> +		 */
>>>>> +		if (!(iomap.flags & IOMAP_F_PAGE_CREATE))
>>>>> +			goto out;
>>>>> +
>>>>> +		/*
>>>>> +		 * Try to invalidate cache pages for the range we just wrote.
>>>>> +		 * We don't care if invalidation fails as the write has still
>>>>> +		 * worked and leaving clean uptodate pages in the page cache
>>>>> +		 * isn't a corruption vector for uncached IO.
>>>>> +		 */
>>>>> +		invalidate_inode_pages2_range(mapping,
>>>>> +				pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
>>>>> +	}
>>>>> +out:
>>>>>  	return written ? written : ret;
>>>>>  }
>>>>
>>>> Just a thought on further optimisation for this for XFS.
>>>> IOMAP_UNCACHED is being passed into the filesystem ->iomap_begin
>>>> methods by iomap_apply().  Hence the filesystems know that it is
>>>> an uncached IO that is being done, and we can tailor allocation
>>>> strategies to suit the fact that the data is going to be written
>>>> immediately.
>>>>
>>>> In this case, XFS needs to treat it the same way it treats direct
>>>> IO. That is, we do immediate unwritten extent allocation rather than
>>>> delayed allocation. This will reduce the allocation overhead and
>>>> will optimise for immediate IO locality rather than optimise for
>>>> delayed allocation.
>>>>
>>>> This should just be a relatively simple change to
>>>> xfs_file_iomap_begin() along the lines of:
>>>>
>>>> -	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && !(flags & IOMAP_DIRECT) &&
>>>> -			!IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
>>>> +	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) &&
>>>> +	    !(flags & (IOMAP_DIRECT | IOMAP_UNCACHED)) &&
>>>> +	    !IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
>>>> 		/* Reserve delalloc blocks for regular writeback. */
>>>> 		return xfs_file_iomap_begin_delay(inode, offset, length, flags,
>>>> 				iomap);
>>>> 	}
>>>>
>>>> so that it avoids delayed allocation for uncached IO...
>>>
>>> That's very handy! Thanks, I'll add that to the next version. Just out
>>> of curiosity, would you prefer this as a separate patch, or just bundle
>>> it with the iomap buffered RWF_UNCACHED patch? I'm assuming the latter,
>>> and I'll just mention it in the changelog.
>>
>> OK, since it's in XFS, it'd be a separate patch.
> 
> *nod*
> 
>> The code you quote seems
>> to be something out-of-tree?
> 
> Ah, I quoted the code in the 5.4 release branch, not the 5.5-rc1
> tree. I'd forgotten that the xfs_file_iomap_begin() got massively
> refactored in the 5.5 merge and I hadn't updated my cscope trees. So
> I'm guessing you want to go looking for the
> xfs_buffered_write_iomap_begin() and add another case to this
> initial branch:
> 
>         /* we can't use delayed allocations when using extent size hints */
>         if (xfs_get_extsz_hint(ip))
>                 return xfs_direct_write_iomap_begin(inode, offset, count,
>                                 flags, iomap, srcmap);
> 
> To make the buffered write IO go down the direct IO allocation path...

Makes it even simpler! Something like this:


commit 1783722cd4b7088a3c004462c7ae610b8e42b720
Author: Jens Axboe <axboe@kernel.dk>
Date:   Tue Dec 17 07:30:04 2019 -0700

    xfs: don't do delayed allocations for uncached buffered writes
    
    This data is going to be written immediately, so don't bother trying
    to do delayed allocation for it.
    
    Suggested-by: Dave Chinner <david@fromorbit.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 28e2d1f37267..d0cd4a05d59f 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -847,8 +847,11 @@ xfs_buffered_write_iomap_begin(
 	int			allocfork = XFS_DATA_FORK;
 	int			error = 0;
 
-	/* we can't use delayed allocations when using extent size hints */
-	if (xfs_get_extsz_hint(ip))
+	/*
+	 * Don't do delayed allocations when using extent size hints, or
+	 * if we were asked to do uncached buffered writes.
+	 */
+	if (xfs_get_extsz_hint(ip) || (flags & IOMAP_UNCACHED))
 		return xfs_direct_write_iomap_begin(inode, offset, count,
 				flags, iomap, srcmap);
 

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes
  2019-12-13  0:57       ` Jens Axboe
@ 2019-12-16  4:17         ` Dave Chinner
  2019-12-17 14:31           ` Jens Axboe
  0 siblings, 1 reply; 27+ messages in thread
From: Dave Chinner @ 2019-12-16  4:17 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-mm, linux-fsdevel, linux-block, willy, clm, torvalds

On Thu, Dec 12, 2019 at 05:57:57PM -0700, Jens Axboe wrote:
> On 12/12/19 5:54 PM, Jens Axboe wrote:
> > On 12/12/19 3:34 PM, Dave Chinner wrote:
> >> On Wed, Dec 11, 2019 at 08:29:43AM -0700, Jens Axboe wrote:
> >>> This adds support for RWF_UNCACHED for file systems using iomap to
> >>> perform buffered writes. We use the generic infrastructure for this,
> >>> by tracking pages we created and calling write_drop_cached_pages()
> >>> to issue writeback and prune those pages.
> >>>
> >>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> >>> ---
> >>>  fs/iomap/apply.c       | 24 ++++++++++++++++++++++++
> >>>  fs/iomap/buffered-io.c | 37 +++++++++++++++++++++++++++++--------
> >>>  include/linux/iomap.h  |  5 +++++
> >>>  3 files changed, 58 insertions(+), 8 deletions(-)
> >>>
> >>> diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
> >>> index 562536da8a13..966826ad4bb9 100644
> >>> --- a/fs/iomap/apply.c
> >>> +++ b/fs/iomap/apply.c
> >>> @@ -90,5 +90,29 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
> >>>  				     flags, &iomap);
> >>>  	}
> >>>  
> >>> +	if (written && (flags & IOMAP_UNCACHED)) {
> >>> +		struct address_space *mapping = inode->i_mapping;
> >>> +
> >>> +		end = pos + written;
> >>> +		ret = filemap_write_and_wait_range(mapping, pos, end);
> >>> +		if (ret)
> >>> +			goto out;
> >>> +
> >>> +		/*
> >>> +		 * No pages were created for this range, we're done
> >>> +		 */
> >>> +		if (!(iomap.flags & IOMAP_F_PAGE_CREATE))
> >>> +			goto out;
> >>> +
> >>> +		/*
> >>> +		 * Try to invalidate cache pages for the range we just wrote.
> >>> +		 * We don't care if invalidation fails as the write has still
> >>> +		 * worked and leaving clean uptodate pages in the page cache
> >>> +		 * isn't a corruption vector for uncached IO.
> >>> +		 */
> >>> +		invalidate_inode_pages2_range(mapping,
> >>> +				pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> >>> +	}
> >>> +out:
> >>>  	return written ? written : ret;
> >>>  }
> >>
> >> Just a thought on further optimisation for this for XFS.
> >> IOMAP_UNCACHED is being passed into the filesystem ->iomap_begin
> >> methods by iomap_apply().  Hence the filesystems know that it is
> >> an uncached IO that is being done, and we can tailor allocation
> >> strategies to suit the fact that the data is going to be written
> >> immediately.
> >>
> >> In this case, XFS needs to treat it the same way it treats direct
> >> IO. That is, we do immediate unwritten extent allocation rather than
> >> delayed allocation. This will reduce the allocation overhead and
> >> will optimise for immediate IO locality rather than optimise for
> >> delayed allocation.
> >>
> >> This should just be a relatively simple change to
> >> xfs_file_iomap_begin() along the lines of:
> >>
> >> -	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && !(flags & IOMAP_DIRECT) &&
> >> -			!IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
> >> +	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) &&
> >> +	    !(flags & (IOMAP_DIRECT | IOMAP_UNCACHED)) &&
> >> +	    !IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
> >> 		/* Reserve delalloc blocks for regular writeback. */
> >> 		return xfs_file_iomap_begin_delay(inode, offset, length, flags,
> >> 				iomap);
> >> 	}
> >>
> >> so that it avoids delayed allocation for uncached IO...
> > 
> > That's very handy! Thanks, I'll add that to the next version. Just out
> > of curiosity, would you prefer this as a separate patch, or just bundle
> > it with the iomap buffered RWF_UNCACHED patch? I'm assuming the latter,
> > and I'll just mention it in the changelog.
> 
> OK, since it's in XFS, it'd be a separate patch.

*nod*

> The code you quote seems
> to be something out-of-tree?

Ah, I quoted the code in the 5.4 release branch, not the 5.5-rc1
tree. I'd forgotten that the xfs_file_iomap_begin() got massively
refactored in the 5.5 merge and I hadn't updated my cscope trees. So
I'm guessing you want to go looking for the
xfs_buffered_write_iomap_begin() and add another case to this
initial branch:

        /* we can't use delayed allocations when using extent size hints */
        if (xfs_get_extsz_hint(ip))
                return xfs_direct_write_iomap_begin(inode, offset, count,
                                flags, iomap, srcmap);

To make the buffered write IO go down the direct IO allocation path...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes
  2019-12-13  0:54     ` Jens Axboe
@ 2019-12-13  0:57       ` Jens Axboe
  2019-12-16  4:17         ` Dave Chinner
  0 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2019-12-13  0:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-mm, linux-fsdevel, linux-block, willy, clm, torvalds

On 12/12/19 5:54 PM, Jens Axboe wrote:
> On 12/12/19 3:34 PM, Dave Chinner wrote:
>> On Wed, Dec 11, 2019 at 08:29:43AM -0700, Jens Axboe wrote:
>>> This adds support for RWF_UNCACHED for file systems using iomap to
>>> perform buffered writes. We use the generic infrastructure for this,
>>> by tracking pages we created and calling write_drop_cached_pages()
>>> to issue writeback and prune those pages.
>>>
>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>> ---
>>>  fs/iomap/apply.c       | 24 ++++++++++++++++++++++++
>>>  fs/iomap/buffered-io.c | 37 +++++++++++++++++++++++++++++--------
>>>  include/linux/iomap.h  |  5 +++++
>>>  3 files changed, 58 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
>>> index 562536da8a13..966826ad4bb9 100644
>>> --- a/fs/iomap/apply.c
>>> +++ b/fs/iomap/apply.c
>>> @@ -90,5 +90,29 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
>>>  				     flags, &iomap);
>>>  	}
>>>  
>>> +	if (written && (flags & IOMAP_UNCACHED)) {
>>> +		struct address_space *mapping = inode->i_mapping;
>>> +
>>> +		end = pos + written;
>>> +		ret = filemap_write_and_wait_range(mapping, pos, end);
>>> +		if (ret)
>>> +			goto out;
>>> +
>>> +		/*
>>> +		 * No pages were created for this range, we're done
>>> +		 */
>>> +		if (!(iomap.flags & IOMAP_F_PAGE_CREATE))
>>> +			goto out;
>>> +
>>> +		/*
>>> +		 * Try to invalidate cache pages for the range we just wrote.
>>> +		 * We don't care if invalidation fails as the write has still
>>> +		 * worked and leaving clean uptodate pages in the page cache
>>> +		 * isn't a corruption vector for uncached IO.
>>> +		 */
>>> +		invalidate_inode_pages2_range(mapping,
>>> +				pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
>>> +	}
>>> +out:
>>>  	return written ? written : ret;
>>>  }
>>
>> Just a thought on further optimisation for this for XFS.
>> IOMAP_UNCACHED is being passed into the filesystem ->iomap_begin
>> methods by iomap_apply().  Hence the filesystems know that it is
>> an uncached IO that is being done, and we can tailor allocation
>> strategies to suit the fact that the data is going to be written
>> immediately.
>>
>> In this case, XFS needs to treat it the same way it treats direct
>> IO. That is, we do immediate unwritten extent allocation rather than
>> delayed allocation. This will reduce the allocation overhead and
>> will optimise for immediate IO locality rather than optimise for
>> delayed allocation.
>>
>> This should just be a relatively simple change to
>> xfs_file_iomap_begin() along the lines of:
>>
>> -	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && !(flags & IOMAP_DIRECT) &&
>> -			!IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
>> +	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) &&
>> +	    !(flags & (IOMAP_DIRECT | IOMAP_UNCACHED)) &&
>> +	    !IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
>> 		/* Reserve delalloc blocks for regular writeback. */
>> 		return xfs_file_iomap_begin_delay(inode, offset, length, flags,
>> 				iomap);
>> 	}
>>
>> so that it avoids delayed allocation for uncached IO...
> 
> That's very handy! Thanks, I'll add that to the next version. Just out
> of curiosity, would you prefer this as a separate patch, or just bundle
> it with the iomap buffered RWF_UNCACHED patch? I'm assuming the latter,
> and I'll just mention it in the changelog.

OK, since it's in XFS, it'd be a separate patch. The code you quote seems
to be something out-of-tree?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes
  2019-12-12 22:34   ` Dave Chinner
@ 2019-12-13  0:54     ` Jens Axboe
  2019-12-13  0:57       ` Jens Axboe
  0 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2019-12-13  0:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-mm, linux-fsdevel, linux-block, willy, clm, torvalds

On 12/12/19 3:34 PM, Dave Chinner wrote:
> On Wed, Dec 11, 2019 at 08:29:43AM -0700, Jens Axboe wrote:
>> This adds support for RWF_UNCACHED for file systems using iomap to
>> perform buffered writes. We use the generic infrastructure for this,
>> by tracking pages we created and calling write_drop_cached_pages()
>> to issue writeback and prune those pages.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  fs/iomap/apply.c       | 24 ++++++++++++++++++++++++
>>  fs/iomap/buffered-io.c | 37 +++++++++++++++++++++++++++++--------
>>  include/linux/iomap.h  |  5 +++++
>>  3 files changed, 58 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
>> index 562536da8a13..966826ad4bb9 100644
>> --- a/fs/iomap/apply.c
>> +++ b/fs/iomap/apply.c
>> @@ -90,5 +90,29 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
>>  				     flags, &iomap);
>>  	}
>>  
>> +	if (written && (flags & IOMAP_UNCACHED)) {
>> +		struct address_space *mapping = inode->i_mapping;
>> +
>> +		end = pos + written;
>> +		ret = filemap_write_and_wait_range(mapping, pos, end);
>> +		if (ret)
>> +			goto out;
>> +
>> +		/*
>> +		 * No pages were created for this range, we're done
>> +		 */
>> +		if (!(iomap.flags & IOMAP_F_PAGE_CREATE))
>> +			goto out;
>> +
>> +		/*
>> +		 * Try to invalidate cache pages for the range we just wrote.
>> +		 * We don't care if invalidation fails as the write has still
>> +		 * worked and leaving clean uptodate pages in the page cache
>> +		 * isn't a corruption vector for uncached IO.
>> +		 */
>> +		invalidate_inode_pages2_range(mapping,
>> +				pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
>> +	}
>> +out:
>>  	return written ? written : ret;
>>  }
> 
> Just a thought on further optimisation for this for XFS.
> IOMAP_UNCACHED is being passed into the filesystem ->iomap_begin
> methods by iomap_apply().  Hence the filesystems know that it is
> an uncached IO that is being done, and we can tailor allocation
> strategies to suit the fact that the data is going to be written
> immediately.
> 
> In this case, XFS needs to treat it the same way it treats direct
> IO. That is, we do immediate unwritten extent allocation rather than
> delayed allocation. This will reduce the allocation overhead and
> will optimise for immediate IO locality rather than optimise for
> delayed allocation.
> 
> This should just be a relatively simple change to
> xfs_file_iomap_begin() along the lines of:
> 
> -	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && !(flags & IOMAP_DIRECT) &&
> -			!IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
> +	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) &&
> +	    !(flags & (IOMAP_DIRECT | IOMAP_UNCACHED)) &&
> +	    !IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
> 		/* Reserve delalloc blocks for regular writeback. */
> 		return xfs_file_iomap_begin_delay(inode, offset, length, flags,
> 				iomap);
> 	}
> 
> so that it avoids delayed allocation for uncached IO...

That's very handy! Thanks, I'll add that to the next version. Just out
of curiosity, would you prefer this as a separate patch, or just bundle
it with the iomap buffered RWF_UNCACHED patch? I'm assuming the latter,
and I'll just mention it in the changelog.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes
  2019-12-11 15:29 ` [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes Jens Axboe
  2019-12-11 17:19   ` Matthew Wilcox
@ 2019-12-12 22:34   ` Dave Chinner
  2019-12-13  0:54     ` Jens Axboe
  1 sibling, 1 reply; 27+ messages in thread
From: Dave Chinner @ 2019-12-12 22:34 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-mm, linux-fsdevel, linux-block, willy, clm, torvalds

On Wed, Dec 11, 2019 at 08:29:43AM -0700, Jens Axboe wrote:
> This adds support for RWF_UNCACHED for file systems using iomap to
> perform buffered writes. We use the generic infrastructure for this,
> by tracking pages we created and calling write_drop_cached_pages()
> to issue writeback and prune those pages.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  fs/iomap/apply.c       | 24 ++++++++++++++++++++++++
>  fs/iomap/buffered-io.c | 37 +++++++++++++++++++++++++++++--------
>  include/linux/iomap.h  |  5 +++++
>  3 files changed, 58 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
> index 562536da8a13..966826ad4bb9 100644
> --- a/fs/iomap/apply.c
> +++ b/fs/iomap/apply.c
> @@ -90,5 +90,29 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
>  				     flags, &iomap);
>  	}
>  
> +	if (written && (flags & IOMAP_UNCACHED)) {
> +		struct address_space *mapping = inode->i_mapping;
> +
> +		end = pos + written;
> +		ret = filemap_write_and_wait_range(mapping, pos, end);
> +		if (ret)
> +			goto out;
> +
> +		/*
> +		 * No pages were created for this range, we're done
> +		 */
> +		if (!(iomap.flags & IOMAP_F_PAGE_CREATE))
> +			goto out;
> +
> +		/*
> +		 * Try to invalidate cache pages for the range we just wrote.
> +		 * We don't care if invalidation fails as the write has still
> +		 * worked and leaving clean uptodate pages in the page cache
> +		 * isn't a corruption vector for uncached IO.
> +		 */
> +		invalidate_inode_pages2_range(mapping,
> +				pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> +	}
> +out:
>  	return written ? written : ret;
>  }

Just a thought on further optimisation for this for XFS.
IOMAP_UNCACHED is being passed into the filesystem ->iomap_begin
methods by iomap_apply().  Hence the filesystems know that it is
an uncached IO that is being done, and we can tailor allocation
strategies to suit the fact that the data is going to be written
immediately.

In this case, XFS needs to treat it the same way it treats direct
IO. That is, we do immediate unwritten extent allocation rather than
delayed allocation. This will reduce the allocation overhead and
will optimise for immediate IO locality rather than optimise for
delayed allocation.

This should just be a relatively simple change to
xfs_file_iomap_begin() along the lines of:

-	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && !(flags & IOMAP_DIRECT) &&
-			!IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
+	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) &&
+	    !(flags & (IOMAP_DIRECT | IOMAP_UNCACHED)) &&
+	    !IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
		/* Reserve delalloc blocks for regular writeback. */
		return xfs_file_iomap_begin_delay(inode, offset, length, flags,
				iomap);
	}

so that it avoids delayed allocation for uncached IO...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes
  2019-12-11 17:19   ` Matthew Wilcox
@ 2019-12-11 18:05     ` Jens Axboe
  0 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2019-12-11 18:05 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, linux-fsdevel, linux-block, clm, torvalds, david

On 12/11/19 10:19 AM, Matthew Wilcox wrote:
> On Wed, Dec 11, 2019 at 08:29:43AM -0700, Jens Axboe wrote:
>> @@ -670,9 +675,14 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>>  		iomap_read_inline_data(inode, page, srcmap);
>>  	else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
>>  		status = __block_write_begin_int(page, pos, len, NULL, srcmap);
>> -	else
>> -		status = __iomap_write_begin(inode, pos, len, flags, page,
>> +	else {
>> +		unsigned wb_flags = 0;
>> +
>> +		if (flags & IOMAP_UNCACHED)
>> +			wb_flags = IOMAP_WRITE_F_UNCACHED;
>> +		status = __iomap_write_begin(inode, pos, len, wb_flags, page,
>>  				srcmap);
> 
> I think if you do an uncached write to a currently shared extent, you
> just lost the IOMAP_WRITE_F_UNSHARE flag?

Oops good catch, I'll fix that up.
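
One possible shape (untested sketch, not from any posted version --
note that IOMAP_WRITE and IOMAP_WRITE_F_UNSHARE are both bit 0, so
the two flag namespaces can't simply be OR'ed together):

	static int
	iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
			unsigned flags, unsigned wb_flags, struct page **pagep,
			struct iomap *iomap, struct iomap *srcmap)
	{
		...
		if (flags & IOMAP_UNCACHED)
			wb_flags |= IOMAP_WRITE_F_UNCACHED;
		status = __iomap_write_begin(inode, pos, len, wb_flags, page,
				srcmap);
		...
	}

with iomap_unshare_actor() passing wb_flags == IOMAP_WRITE_F_UNSHARE
and iomap_write_actor() passing wb_flags == 0, so the unshare bit
survives on the uncached path.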

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes
  2019-12-11 15:29 ` [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes Jens Axboe
@ 2019-12-11 17:19   ` Matthew Wilcox
  2019-12-11 18:05     ` Jens Axboe
  2019-12-12 22:34   ` Dave Chinner
  1 sibling, 1 reply; 27+ messages in thread
From: Matthew Wilcox @ 2019-12-11 17:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-mm, linux-fsdevel, linux-block, clm, torvalds, david

On Wed, Dec 11, 2019 at 08:29:43AM -0700, Jens Axboe wrote:
> @@ -670,9 +675,14 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>  		iomap_read_inline_data(inode, page, srcmap);
>  	else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
>  		status = __block_write_begin_int(page, pos, len, NULL, srcmap);
> -	else
> -		status = __iomap_write_begin(inode, pos, len, flags, page,
> +	else {
> +		unsigned wb_flags = 0;
> +
> +		if (flags & IOMAP_UNCACHED)
> +			wb_flags = IOMAP_WRITE_F_UNCACHED;
> +		status = __iomap_write_begin(inode, pos, len, wb_flags, page,
>  				srcmap);

I think if you do an uncached write to a currently shared extent, you
just lost the IOMAP_WRITE_F_UNSHARE flag?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes
  2019-12-11 15:29 [PATCHSET v3 " Jens Axboe
@ 2019-12-11 15:29 ` Jens Axboe
  2019-12-11 17:19   ` Matthew Wilcox
  2019-12-12 22:34   ` Dave Chinner
  0 siblings, 2 replies; 27+ messages in thread
From: Jens Axboe @ 2019-12-11 15:29 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-block
  Cc: willy, clm, torvalds, david, Jens Axboe

This adds support for RWF_UNCACHED for file systems using iomap to
perform buffered writes. We use the generic infrastructure for this,
by tracking pages we created and calling write_drop_cached_pages()
to issue writeback and prune those pages.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/iomap/apply.c       | 24 ++++++++++++++++++++++++
 fs/iomap/buffered-io.c | 37 +++++++++++++++++++++++++++++--------
 include/linux/iomap.h  |  5 +++++
 3 files changed, 58 insertions(+), 8 deletions(-)

diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
index 562536da8a13..966826ad4bb9 100644
--- a/fs/iomap/apply.c
+++ b/fs/iomap/apply.c
@@ -90,5 +90,29 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
 				     flags, &iomap);
 	}
 
+	if (written && (flags & IOMAP_UNCACHED)) {
+		struct address_space *mapping = inode->i_mapping;
+
+		end = pos + written;
+		ret = filemap_write_and_wait_range(mapping, pos, end);
+		if (ret)
+			goto out;
+
+		/*
+		 * No pages were created for this range, we're done
+		 */
+		if (!(iomap.flags & IOMAP_F_PAGE_CREATE))
+			goto out;
+
+		/*
+		 * Try to invalidate cache pages for the range we just wrote.
+		 * We don't care if invalidation fails as the write has still
+		 * worked and leaving clean uptodate pages in the page cache
+		 * isn't a corruption vector for uncached IO.
+		 */
+		invalidate_inode_pages2_range(mapping,
+				pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
+	}
+out:
 	return written ? written : ret;
 }
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 9b5b770ca4c7..09440f114506 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -566,6 +566,7 @@ EXPORT_SYMBOL_GPL(iomap_migrate_page);
 
 enum {
 	IOMAP_WRITE_F_UNSHARE		= (1 << 0),
+	IOMAP_WRITE_F_UNCACHED		= (1 << 1),
 };
 
 static void
@@ -643,6 +644,7 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 		struct page **pagep, struct iomap *iomap, struct iomap *srcmap)
 {
 	const struct iomap_page_ops *page_ops = iomap->page_ops;
+	unsigned aop_flags;
 	struct page *page;
 	int status = 0;
 
@@ -659,8 +661,11 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 			return status;
 	}
 
+	aop_flags = AOP_FLAG_NOFS;
+	if (flags & IOMAP_UNCACHED)
+		aop_flags |= AOP_FLAG_UNCACHED;
 	page = grab_cache_page_write_begin(inode->i_mapping, pos >> PAGE_SHIFT,
-			AOP_FLAG_NOFS);
+						aop_flags);
 	if (!page) {
 		status = -ENOMEM;
 		goto out_no_page;
@@ -670,9 +675,14 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 		iomap_read_inline_data(inode, page, srcmap);
 	else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
 		status = __block_write_begin_int(page, pos, len, NULL, srcmap);
-	else
-		status = __iomap_write_begin(inode, pos, len, flags, page,
+	else {
+		unsigned wb_flags = 0;
+
+		if (flags & IOMAP_UNCACHED)
+			wb_flags = IOMAP_WRITE_F_UNCACHED;
+		status = __iomap_write_begin(inode, pos, len, wb_flags, page,
 				srcmap);
+	}
 
 	if (unlikely(status))
 		goto out_unlock;
@@ -832,10 +842,17 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 			break;
 		}
 
-		status = iomap_write_begin(inode, pos, bytes, 0, &page, iomap,
-				srcmap);
-		if (unlikely(status))
+retry:
+		status = iomap_write_begin(inode, pos, bytes, flags, &page,
+						iomap, srcmap);
+		if (unlikely(status)) {
+			if (status == -ENOMEM && (flags & IOMAP_UNCACHED)) {
+				iomap->flags |= IOMAP_F_PAGE_CREATE;
+				flags &= ~IOMAP_UNCACHED;
+				goto retry;
+			}
 			break;
+		}
 
 		if (mapping_writably_mapped(inode->i_mapping))
 			flush_dcache_page(page);
@@ -882,10 +899,14 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *iter,
 {
 	struct inode *inode = iocb->ki_filp->f_mapping->host;
 	loff_t pos = iocb->ki_pos, ret = 0, written = 0;
+	unsigned flags = IOMAP_WRITE;
+
+	if (iocb->ki_flags & IOCB_UNCACHED)
+		flags |= IOMAP_UNCACHED;
 
 	while (iov_iter_count(iter)) {
-		ret = iomap_apply(inode, pos, iov_iter_count(iter),
-				IOMAP_WRITE, ops, iter, iomap_write_actor);
+		ret = iomap_apply(inode, pos, iov_iter_count(iter), flags,
+					ops, iter, iomap_write_actor);
 		if (ret <= 0)
 			break;
 		pos += ret;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 61fcaa3904d4..b5b5cf781eea 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -48,12 +48,16 @@ struct vm_fault;
  *
  * IOMAP_F_BUFFER_HEAD indicates that the file system requires the use of
  * buffer heads for this mapping.
+ *
+ * IOMAP_F_PAGE_CREATE indicates that pages had to be allocated to satisfy
+ * this operation.
  */
 #define IOMAP_F_NEW		0x01
 #define IOMAP_F_DIRTY		0x02
 #define IOMAP_F_SHARED		0x04
 #define IOMAP_F_MERGED		0x08
 #define IOMAP_F_BUFFER_HEAD	0x10
+#define IOMAP_F_PAGE_CREATE	0x20
 
 /*
  * Flags set by the core iomap code during operations:
@@ -121,6 +125,7 @@ struct iomap_page_ops {
 #define IOMAP_FAULT		(1 << 3) /* mapping for page fault */
 #define IOMAP_DIRECT		(1 << 4) /* direct I/O */
 #define IOMAP_NOWAIT		(1 << 5) /* do not block */
+#define IOMAP_UNCACHED		(1 << 6)
 
 struct iomap_ops {
 	/*
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes
  2019-12-11  1:14   ` Dave Chinner
@ 2019-12-11 14:44     ` Jens Axboe
  0 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2019-12-11 14:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-mm, linux-fsdevel, linux-block, willy, clm

On 12/10/19 6:14 PM, Dave Chinner wrote:
>> @@ -864,15 +896,29 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>>  			 */
>>  			bytes = min_t(unsigned long, PAGE_SIZE - offset,
>>  						iov_iter_single_seg_count(i));
>> +			if (drop_page)
>> +				put_page(page);
>>  			goto again;
>>  		}
>> +
>> +		if (drop_page &&
>> +		    ((pos >> PAGE_SHIFT) != ((pos + copied) >> PAGE_SHIFT))) {
>> +			if (!pagevec_add(&pvec, page))
>> +				write_drop_cached_pages(&pvec, mapping);
>> +		} else {
>> +			if (drop_page)
>> +				put_page(page);
>> +			balance_dirty_pages_ratelimited(inode->i_mapping);
>> +		}
> 
> This looks like it's a problem: this is going to write the
> data, which can cause the extent mapping of the file to change
> beyond the range that was written (e.g. due to speculative delayed
> allocation) and so the iomap we have already cached to direct the
> write behaviour may now be stale.
> 
> IOWs, to be safe we need to terminate the write loop at this point,
> return to iomap_apply() and remap the range we are writing into so
> that we don't end up using a stale iomap. That kinda defeats the
> purpose of iomap - we are trying to do a single extent mapping per
> IO instead of per-page, and this pulls it back to an iomap per 16
> pages for large user IOs. And it has the issues with breaking
> delayed allocation optimisations, too.
> 
> Hence, IMO, this is the wrong layer in iomap to be dealing with
> writeback and cache residency for uncached IO. We should be caching
> residency/invalidation at a per-IO level, not a per-page level.
> 
> Sure, have the write actor return a flag (e.g. in the iomap) to say
> that it encountered cached pages so that we can decide whether or
> not to invalidate the entire range we just wrote in iomap_apply, but
> doing it between mappings in iomap_apply means that the writeback is
> done once per user IO, and cache invalidation only occurs if no
> cached pages were encountered during that IO. i.e. add this to
> iomap_apply() after ops->iomap_end() has been called:
> 
> 
> 	if (flags & RWF_UNCACHED) {
> 		ret = filemap_write_and_wait_range(mapping, start, end);
> 		if (ret)
> 			goto out;
> 
> 		if (!drop_cache)
> 			goto out;
> 
> 		/*
> 		 * Try to invalidate cache pages for the range we
> 		 * just wrote. We don't care if invalidation fails
> 		 * as the write has still worked and leaving clean
> 		 * uptodate pages in the page cache isn't a
> 		 * corruption vector for uncached IO.
> 		 */
> 		invalidate_inode_pages2_range(mapping,
> 				start >> PAGE_SHIFT, end >> PAGE_SHIFT);
> 	}
> out:
> 	return written ? written : ret;
> }

I really like this, and some variant of that solution can also be
applied to the non-iomap case. It'll make for a cleaner implementation
as well.
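
For the non-iomap side, the same shape would be roughly (untested
sketch; variable names as in the generic write path, and it assumes
generic_perform_write() grows access to the kiocb so the
IOCB_UNCACHED flag is visible at this point):

	written = generic_perform_write(file, from, iocb);
	if (written > 0 && (iocb->ki_flags & IOCB_UNCACHED)) {
		loff_t end = pos + written;

		filemap_write_and_wait_range(mapping, pos, end);
		invalidate_inode_pages2_range(mapping,
				pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
	}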

Once we do it per-write we'll have solved the performance and iomap
weirdness, but it does mean that we need to make a decision on when to
invalidate. For the per-page case it's clear - if the page is already
there, leave it. If not, (attempt to) kill it. For the full write range,
what if just one page was not there? Do we invalidate then? What if one
page was there?

I'm going to move forward with the notion that if we had to allocate any
page in the range, we invalidate the range. Only ranges that were fully
cached already will remain in the cache. I think those semantics make
the most sense.
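
Concretely, the write actor would raise a flag (say IOMAP_F_PAGE_CREATE)
whenever it had to create a page, and iomap_apply() would check it
after the writeback -- sketch:

	/* only invalidate if this write had to create pages */
	if (iomap.flags & IOMAP_F_PAGE_CREATE)
		invalidate_inode_pages2_range(mapping,
				pos >> PAGE_SHIFT, end >> PAGE_SHIFT);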

> Note that this doesn't solve the write error return issue. i.e.
> if filemap_write_and_wait_range() fails, should that error be
> returned or ignored?

I think we should handle this exactly like we do the O_DIRECT case on
buffered IO write-and-wait.
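
i.e. surface the flush error, the same way the direct IO write path
does after its write-and-wait -- in iomap_apply() terms:

	ret = filemap_write_and_wait_range(mapping, pos, end);
	if (ret)
		goto out;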

> And that leads to my next question: what data integrity guarantees
> does RWF_UNCACHED give? What if the underlying device has a volatile
> write cache or we dirtied metadata during block allocation? i.e.  to
> a user, "UNCACHED" kinda implies that the write has ended up on
> stable storage because they are saying "do not cache this data". To
> me, none of this implementation guarantees data integrity, and users
> would still need O_DSYNC or fsync() with RWF_UNCACHED IO. That seems
> sane to me (same as direct io requirements) but whatever is decided
> here, it will need to be spelled out clearly in the man page so that 
> users don't get it wrong.

Fully agree, this isn't about data integrity guarantees. You will still
need the appropriate sync magic to make it safe. This is exactly like
O_DIRECT, and I think we should retain those exact same semantics for
the RWF_UNCACHED case.
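
From the application side that's the usual pattern -- uncached write,
then an explicit sync for durability. A userspace sketch (assumes
headers that define RWF_UNCACHED from this series):

	#define _GNU_SOURCE
	#include <sys/uio.h>
	#include <unistd.h>

	static ssize_t uncached_write_sync(int fd, const void *buf,
					   size_t len, off_t off)
	{
		struct iovec iov = {
			.iov_base = (void *) buf,
			.iov_len = len,
		};
		ssize_t ret = pwritev2(fd, &iov, 1, off, RWF_UNCACHED);

		if (ret < 0)
			return ret;
		/* RWF_UNCACHED does not imply durability */
		if (fdatasync(fd) < 0)
			return -1;
		return ret;
	}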

Thanks for taking a look at this! I'll respin (and re-test) with the
write side changed.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes
  2019-12-10 20:43 ` [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes Jens Axboe
@ 2019-12-11  1:14   ` Dave Chinner
  2019-12-11 14:44     ` Jens Axboe
  0 siblings, 1 reply; 27+ messages in thread
From: Dave Chinner @ 2019-12-11  1:14 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-mm, linux-fsdevel, linux-block, willy, clm

On Tue, Dec 10, 2019 at 01:43:04PM -0700, Jens Axboe wrote:
> This adds support for RWF_UNCACHED for file systems using iomap to
> perform buffered writes. We use the generic infrastructure for this,
> by tracking pages we created and calling write_drop_cached_pages()
> to issue writeback and prune those pages.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
.....
>  static loff_t
>  iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
>  {
> +	struct address_space *mapping = inode->i_mapping;
>  	struct iov_iter *i = data;
> +	struct pagevec pvec;
>  	long status = 0;
>  	ssize_t written = 0;
>  
> +	pagevec_init(&pvec);
> +

Ok, so the actor is called after we've already mapped and allocated
an extent of arbitrary length. It may be a delalloc extent, it might
be unwritten, it could be a COW mapping, etc.

>  	do {
>  		struct page *page;
>  		unsigned long offset;	/* Offset into pagecache page */
>  		unsigned long bytes;	/* Bytes to write to page */
>  		size_t copied;		/* Bytes copied from user */
> +		bool drop_page = false;	/* drop page after IO */
> +		unsigned lflags = flags;
>  
>  		offset = offset_in_page(pos);
>  		bytes = min_t(unsigned long, PAGE_SIZE - offset,
> @@ -832,10 +851,17 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  			break;
>  		}
>  
> -		status = iomap_write_begin(inode, pos, bytes, 0, &page, iomap,
> -				srcmap);
> -		if (unlikely(status))
> +retry:
> +		status = iomap_write_begin(inode, pos, bytes, lflags, &page,
> +						iomap, srcmap);
> +		if (unlikely(status)) {
> +			if (status == -ENOMEM && (lflags & IOMAP_UNCACHED)) {
> +				drop_page = true;
> +				lflags &= ~IOMAP_UNCACHED;
> +				goto retry;
> +			}
>  			break;
> +		}
>  
>  		if (mapping_writably_mapped(inode->i_mapping))
>  			flush_dcache_page(page);
> @@ -844,10 +870,16 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  
>  		flush_dcache_page(page);
>  
> +		if (drop_page)
> +			get_page(page);
> +
>  		status = iomap_write_end(inode, pos, bytes, copied, page, iomap,
>  				srcmap);
> -		if (unlikely(status < 0))
> +		if (unlikely(status < 0)) {
> +			if (drop_page)
> +				put_page(page);
>  			break;
> +		}
>  		copied = status;
>  
>  		cond_resched();
> @@ -864,15 +896,29 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  			 */
>  			bytes = min_t(unsigned long, PAGE_SIZE - offset,
>  						iov_iter_single_seg_count(i));
> +			if (drop_page)
> +				put_page(page);
>  			goto again;
>  		}
> +
> +		if (drop_page &&
> +		    ((pos >> PAGE_SHIFT) != ((pos + copied) >> PAGE_SHIFT))) {
> +			if (!pagevec_add(&pvec, page))
> +				write_drop_cached_pages(&pvec, mapping);
> +		} else {
> +			if (drop_page)
> +				put_page(page);
> +			balance_dirty_pages_ratelimited(inode->i_mapping);
> +		}

This looks like it's a problem: this is going to write the
data, which can cause the extent mapping of the file to change
beyond the range that was written (e.g. due to speculative delayed
allocation) and so the iomap we have already cached to direct the
write behaviour may now be stale.

IOWs, to be safe we need to terminate the write loop at this point,
return to iomap_apply() and remap the range we are writing into so
that we don't end up using a stale iomap. That kinda defeats the
purpose of iomap - we are trying to do a single extent mapping per
IO instead of per-page, and this pulls it back to an iomap per 16
pages for large user IOs. And it has the issues with breaking
delayed allocation optimisations, too.

Hence, IMO, this is the wrong layer in iomap to be dealing with
writeback and cache residency for uncached IO. We should be caching
residency/invalidation at a per-IO level, not a per-page level.

Sure, have the write actor return a flag (e.g. in the iomap) to say
that it encountered cached pages so that we can decide whether or
not to invalidate the entire range we just wrote in iomap_apply, but
doing it between mappings in iomap_apply means that the writeback is
done once per user IO, and cache invalidation only occurs if no
cached pages were encountered during that IO. i.e. add this to
iomap_apply() after ops->iomap_end() has been called:


	if (flags & RWF_UNCACHED) {
		ret = filemap_write_and_wait_range(mapping, start, end);
		if (ret)
			goto out;

		if (!drop_cache)
			goto out;

		/*
		 * Try to invalidate cache pages for the range we
		 * just wrote. We don't care if invalidation fails
		 * as the write has still worked and leaving clean
		 * uptodate pages in the page cache isn't a
		 * corruption vector for uncached IO.
		 */
		invalidate_inode_pages2_range(mapping,
				start >> PAGE_SHIFT, end >> PAGE_SHIFT);
	}
out:
	return written ? written : ret;
}

Note that this doesn't solve the write error return issue. i.e.
if filemap_write_and_wait_range() fails, should that error be
returned or ignored?

And that leads to my next question: what data integrity guarantees
does RWF_UNCACHED give? What if the underlying device has a volatile
write cache or we dirtied metadata during block allocation? i.e.  to
a user, "UNCACHED" kinda implies that the write has ended up on
stable storage because they are saying "do not cache this data". To
me, none of this implementation guarantees data integrity, and users
would still need O_DSYNC or fsync() with RWF_UNCACHED IO. That seems
sane to me (same as direct io requirements) but whatever is decided
here, it will need to be spelled out clearly in the man page so that 
users don't get it wrong.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes
  2019-12-10 20:42 [PATCHSET v2 0/5] Support for RWF_UNCACHED Jens Axboe
@ 2019-12-10 20:43 ` Jens Axboe
  2019-12-11  1:14   ` Dave Chinner
  0 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2019-12-10 20:43 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-block; +Cc: willy, clm, Jens Axboe

This adds support for RWF_UNCACHED for file systems using iomap to
perform buffered writes. We use the generic infrastructure for this,
by tracking pages we created and calling write_drop_cached_pages()
to issue writeback and prune those pages.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/iomap/buffered-io.c | 72 +++++++++++++++++++++++++++++++++++-------
 include/linux/iomap.h  |  1 +
 2 files changed, 62 insertions(+), 11 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 9b5b770ca4c7..3a18a6af8cb3 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -17,6 +17,7 @@
 #include <linux/bio.h>
 #include <linux/sched/signal.h>
 #include <linux/migrate.h>
+#include <linux/pagevec.h>
 #include "trace.h"
 
 #include "../internal.h"
@@ -566,6 +567,7 @@ EXPORT_SYMBOL_GPL(iomap_migrate_page);
 
 enum {
 	IOMAP_WRITE_F_UNSHARE		= (1 << 0),
+	IOMAP_WRITE_F_UNCACHED		= (1 << 1),
 };
 
 static void
@@ -643,6 +645,7 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 		struct page **pagep, struct iomap *iomap, struct iomap *srcmap)
 {
 	const struct iomap_page_ops *page_ops = iomap->page_ops;
+	unsigned aop_flags;
 	struct page *page;
 	int status = 0;
 
@@ -659,8 +662,11 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 			return status;
 	}
 
+	aop_flags = AOP_FLAG_NOFS;
+	if (flags & IOMAP_UNCACHED)
+		aop_flags |= AOP_FLAG_UNCACHED;
 	page = grab_cache_page_write_begin(inode->i_mapping, pos >> PAGE_SHIFT,
-			AOP_FLAG_NOFS);
+						aop_flags);
 	if (!page) {
 		status = -ENOMEM;
 		goto out_no_page;
@@ -670,9 +676,14 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 		iomap_read_inline_data(inode, page, srcmap);
 	else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
 		status = __block_write_begin_int(page, pos, len, NULL, srcmap);
-	else
-		status = __iomap_write_begin(inode, pos, len, flags, page,
+	else {
+		unsigned wb_flags = 0;
+
+		if (flags & IOMAP_UNCACHED)
+			wb_flags = IOMAP_WRITE_F_UNCACHED;
+		status = __iomap_write_begin(inode, pos, len, wb_flags, page,
 				srcmap);
+	}
 
 	if (unlikely(status))
 		goto out_unlock;
@@ -796,19 +807,27 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned len, unsigned copied,
 	return ret;
 }
 
+#define GPW_PAGE_BATCH		16
+
 static loff_t
 iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
 {
+	struct address_space *mapping = inode->i_mapping;
 	struct iov_iter *i = data;
+	struct pagevec pvec;
 	long status = 0;
 	ssize_t written = 0;
 
+	pagevec_init(&pvec);
+
 	do {
 		struct page *page;
 		unsigned long offset;	/* Offset into pagecache page */
 		unsigned long bytes;	/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
+		bool drop_page = false;	/* drop page after IO */
+		unsigned lflags = flags;
 
 		offset = offset_in_page(pos);
 		bytes = min_t(unsigned long, PAGE_SIZE - offset,
@@ -832,10 +851,17 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 			break;
 		}
 
-		status = iomap_write_begin(inode, pos, bytes, 0, &page, iomap,
-				srcmap);
-		if (unlikely(status))
+retry:
+		status = iomap_write_begin(inode, pos, bytes, lflags, &page,
+						iomap, srcmap);
+		if (unlikely(status)) {
+			if (status == -ENOMEM && (lflags & IOMAP_UNCACHED)) {
+				drop_page = true;
+				lflags &= ~IOMAP_UNCACHED;
+				goto retry;
+			}
 			break;
+		}
 
 		if (mapping_writably_mapped(inode->i_mapping))
 			flush_dcache_page(page);
@@ -844,10 +870,16 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 
 		flush_dcache_page(page);
 
+		if (drop_page)
+			get_page(page);
+
 		status = iomap_write_end(inode, pos, bytes, copied, page, iomap,
 				srcmap);
-		if (unlikely(status < 0))
+		if (unlikely(status < 0)) {
+			if (drop_page)
+				put_page(page);
 			break;
+		}
 		copied = status;
 
 		cond_resched();
@@ -864,15 +896,29 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 			 */
 			bytes = min_t(unsigned long, PAGE_SIZE - offset,
 						iov_iter_single_seg_count(i));
+			if (drop_page)
+				put_page(page);
 			goto again;
 		}
+
+		if (drop_page &&
+		    ((pos >> PAGE_SHIFT) != ((pos + copied) >> PAGE_SHIFT))) {
+			if (!pagevec_add(&pvec, page))
+				write_drop_cached_pages(&pvec, mapping);
+		} else {
+			if (drop_page)
+				put_page(page);
+			balance_dirty_pages_ratelimited(inode->i_mapping);
+		}
+
 		pos += copied;
 		written += copied;
 		length -= copied;
-
-		balance_dirty_pages_ratelimited(inode->i_mapping);
 	} while (iov_iter_count(i) && length);
 
+	if (pagevec_count(&pvec))
+		write_drop_cached_pages(&pvec, mapping);
+
 	return written ? written : status;
 }
 
@@ -882,10 +928,14 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *iter,
 {
 	struct inode *inode = iocb->ki_filp->f_mapping->host;
 	loff_t pos = iocb->ki_pos, ret = 0, written = 0;
+	unsigned flags = IOMAP_WRITE;
+
+	if (iocb->ki_flags & IOCB_UNCACHED)
+		flags |= IOMAP_UNCACHED;
 
 	while (iov_iter_count(iter)) {
-		ret = iomap_apply(inode, pos, iov_iter_count(iter),
-				IOMAP_WRITE, ops, iter, iomap_write_actor);
+		ret = iomap_apply(inode, pos, iov_iter_count(iter), flags,
+					ops, iter, iomap_write_actor);
 		if (ret <= 0)
 			break;
 		pos += ret;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 61fcaa3904d4..833dd43507ac 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -121,6 +121,7 @@ struct iomap_page_ops {
 #define IOMAP_FAULT		(1 << 3) /* mapping for page fault */
 #define IOMAP_DIRECT		(1 << 4) /* direct I/O */
 #define IOMAP_NOWAIT		(1 << 5) /* do not block */
+#define IOMAP_UNCACHED		(1 << 6)
 
 struct iomap_ops {
 	/*
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes
  2019-12-10 16:24 [PATCHSET 0/5] Support for RWF_UNCACHED Jens Axboe
@ 2019-12-10 16:24 ` Jens Axboe
  0 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2019-12-10 16:24 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-block; +Cc: Jens Axboe

This adds support for RWF_UNCACHED for file systems using iomap to
perform buffered writes. We use the generic infrastructure for this,
by tracking pages we created and calling write_drop_cached_pages()
to issue writeback and prune those pages.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/iomap/buffered-io.c | 58 ++++++++++++++++++++++++++++++++++--------
 include/linux/iomap.h  |  1 +
 2 files changed, 49 insertions(+), 10 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 9b5b770ca4c7..c8d36b280ff2 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -566,6 +566,7 @@ EXPORT_SYMBOL_GPL(iomap_migrate_page);
 
 enum {
 	IOMAP_WRITE_F_UNSHARE		= (1 << 0),
+	IOMAP_WRITE_F_UNCACHED		= (1 << 1),
 };
 
 static void
@@ -643,6 +644,7 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 		struct page **pagep, struct iomap *iomap, struct iomap *srcmap)
 {
 	const struct iomap_page_ops *page_ops = iomap->page_ops;
+	unsigned aop_flags;
 	struct page *page;
 	int status = 0;
 
@@ -659,8 +661,11 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 			return status;
 	}
 
+	aop_flags = AOP_FLAG_NOFS;
+	if (flags & IOMAP_UNCACHED)
+		aop_flags |= AOP_FLAG_UNCACHED;
 	page = grab_cache_page_write_begin(inode->i_mapping, pos >> PAGE_SHIFT,
-			AOP_FLAG_NOFS);
+						aop_flags);
 	if (!page) {
 		status = -ENOMEM;
 		goto out_no_page;
@@ -670,9 +675,14 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 		iomap_read_inline_data(inode, page, srcmap);
 	else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
 		status = __block_write_begin_int(page, pos, len, NULL, srcmap);
-	else
-		status = __iomap_write_begin(inode, pos, len, flags, page,
+	else {
+		unsigned wb_flags = 0;
+
+		if (flags & IOMAP_UNCACHED)
+			wb_flags = IOMAP_WRITE_F_UNCACHED;
+		status = __iomap_write_begin(inode, pos, len, wb_flags, page,
 				srcmap);
+	}
 
 	if (unlikely(status))
 		goto out_unlock;
@@ -796,19 +806,25 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned len, unsigned copied,
 	return ret;
 }
 
+#define GPW_PAGE_BATCH		16
+
 static loff_t
 iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
 {
+	struct address_space *mapping = inode->i_mapping;
+	struct page *drop_pages[GPW_PAGE_BATCH];
 	struct iov_iter *i = data;
 	long status = 0;
 	ssize_t written = 0;
+	unsigned drop_nr = 0;
 
 	do {
 		struct page *page;
 		unsigned long offset;	/* Offset into pagecache page */
 		unsigned long bytes;	/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
+		bool drop_page = false;	/* drop page after IO */
 
 		offset = offset_in_page(pos);
 		bytes = min_t(unsigned long, PAGE_SIZE - offset,
@@ -832,10 +848,17 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 			break;
 		}
 
-		status = iomap_write_begin(inode, pos, bytes, 0, &page, iomap,
-				srcmap);
-		if (unlikely(status))
+retry:
+		status = iomap_write_begin(inode, pos, bytes, flags, &page,
+						iomap, srcmap);
+		if (unlikely(status)) {
+			if (status == -ENOMEM && (flags & IOMAP_UNCACHED)) {
+				drop_page = true;	/* prune after IO */
+				flags &= ~IOMAP_UNCACHED; /* retry cached */
+				goto retry;
+			}
 			break;
+		}
 
 		if (mapping_writably_mapped(inode->i_mapping))
 			flush_dcache_page(page);
@@ -866,13 +889,24 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 						iov_iter_single_seg_count(i));
 			goto again;
 		}
+
+		if (drop_page &&
+		    ((pos >> PAGE_SHIFT) != ((pos + copied) >> PAGE_SHIFT))) {
+			drop_pages[drop_nr] = page;
+			if (++drop_nr == GPW_PAGE_BATCH)
+				write_drop_cached_pages(drop_pages, mapping,
+								&drop_nr);
+		} else
+			balance_dirty_pages_ratelimited(inode->i_mapping);
+
 		pos += copied;
 		written += copied;
 		length -= copied;
-
-		balance_dirty_pages_ratelimited(inode->i_mapping);
 	} while (iov_iter_count(i) && length);
 
+	if (drop_nr)
+		write_drop_cached_pages(drop_pages, mapping, &drop_nr);
+
 	return written ? written : status;
 }
 
@@ -882,10 +916,14 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *iter,
 {
 	struct inode *inode = iocb->ki_filp->f_mapping->host;
 	loff_t pos = iocb->ki_pos, ret = 0, written = 0;
+	unsigned flags = IOMAP_WRITE;
+
+	if (iocb->ki_flags & IOCB_UNCACHED)
+		flags |= IOMAP_UNCACHED;
 
 	while (iov_iter_count(iter)) {
-		ret = iomap_apply(inode, pos, iov_iter_count(iter),
-				IOMAP_WRITE, ops, iter, iomap_write_actor);
+		ret = iomap_apply(inode, pos, iov_iter_count(iter), flags,
+					ops, iter, iomap_write_actor);
 		if (ret <= 0)
 			break;
 		pos += ret;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 61fcaa3904d4..833dd43507ac 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -121,6 +121,7 @@ struct iomap_page_ops {
 #define IOMAP_FAULT		(1 << 3) /* mapping for page fault */
 #define IOMAP_DIRECT		(1 << 4) /* direct I/O */
 #define IOMAP_NOWAIT		(1 << 5) /* do not block */
+#define IOMAP_UNCACHED		(1 << 6) /* uncached IO */
 
 struct iomap_ops {
 	/*
-- 
2.24.0
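
For reference, a minimal userspace sketch of opting in to this
behavior via pwritev2(). RWF_UNCACHED is added by patch 1/5 of this
series; the value in the #ifndef fallback below is only a guess so the
example builds against unpatched headers -- the authoritative
definition is the one in include/uapi/linux/fs.h from that patch:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/uio.h>
	#include <unistd.h>

	#ifndef RWF_UNCACHED
	#define RWF_UNCACHED	0x00000020	/* assumed value, see patch 1/5 */
	#endif

	int main(int argc, char **argv)
	{
		char buf[4096];
		struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
		int fd;

		if (argc < 2) {
			fprintf(stderr, "usage: %s <file>\n", argv[0]);
			return 1;
		}
		fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		memset(buf, 0xaa, sizeof(buf));

		/* buffered write whose data should not linger in cache */
		if (pwritev2(fd, &iov, 1, 0, RWF_UNCACHED) < 0)
			perror("pwritev2");
		close(fd);
		return 0;
	}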



Thread overview: 27+ messages
2019-12-12 19:01 [PATCHSET v4 0/5] Support for RWF_UNCACHED Jens Axboe
2019-12-12 19:01 ` [PATCH 1/5] fs: add read support " Jens Axboe
2019-12-12 21:21   ` Matthew Wilcox
2019-12-12 21:27     ` Jens Axboe
2019-12-12 19:01 ` [PATCH 2/5] mm: make generic_perform_write() take a struct kiocb Jens Axboe
2019-12-12 19:01 ` [PATCH 3/5] mm: make buffered writes work with RWF_UNCACHED Jens Axboe
2019-12-12 19:01 ` [PATCH 4/5] iomap: add struct iomap_data Jens Axboe
2019-12-13 20:32   ` Darrick J. Wong
2019-12-13 20:47     ` Jens Axboe
2019-12-12 19:01 ` [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes Jens Axboe
2019-12-13  2:26   ` Darrick J. Wong
2019-12-13  2:38     ` Jens Axboe
2019-12-13  0:53 ` [PATCHSET v4 0/5] Support for RWF_UNCACHED Jens Axboe
  -- strict thread matches above, loose matches on Subject: below --
2019-12-11 15:29 [PATCHSET v3 " Jens Axboe
2019-12-11 15:29 ` [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes Jens Axboe
2019-12-11 17:19   ` Matthew Wilcox
2019-12-11 18:05     ` Jens Axboe
2019-12-12 22:34   ` Dave Chinner
2019-12-13  0:54     ` Jens Axboe
2019-12-13  0:57       ` Jens Axboe
2019-12-16  4:17         ` Dave Chinner
2019-12-17 14:31           ` Jens Axboe
2019-12-18  0:49             ` Dave Chinner
2019-12-18  1:01               ` Jens Axboe
2019-12-10 20:42 [PATCHSET v2 0/5] Support for RWF_UNCACHED Jens Axboe
2019-12-10 20:43 ` [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes Jens Axboe
2019-12-11  1:14   ` Dave Chinner
2019-12-11 14:44     ` Jens Axboe
2019-12-10 16:24 [PATCHSET 0/5] Support for RWF_UNCACHED Jens Axboe
2019-12-10 16:24 ` [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes Jens Axboe
